
Add support for reading SAM/BAM files with contigs larger than max int #1788

Closed
2 tasks
gbggrant opened this issue Mar 16, 2022 · 9 comments

@gbggrant
Contributor

Instructions

In #1783 a user described a situation in which MarkDuplicates failed because the size of the contig in their BAM was larger than the maximum value of an int (the value in question was 2364278061).

This ticket is to investigate modifications to htsjdk and picard to add support for increasing the allowable size for a contig to be greater than an int.
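The failure mode is plain integer narrowing: the contig length reported in #1783 does not fit in Java's signed 32-bit int, so it wraps to a negative value. A minimal sketch (the class and method names are illustrative, not from htsjdk):

```java
public class ContigLengthOverflow {
    // Contig length reported in issue #1783; larger than Integer.MAX_VALUE (2^31 - 1).
    static final long CONTIG_LENGTH = 2364278061L;

    // Narrowing the 64-bit value to int wraps around and yields a negative number.
    static int narrowed() {
        return (int) CONTIG_LENGTH;
    }

    public static void main(String[] args) {
        System.out.println(CONTIG_LENGTH > Integer.MAX_VALUE); // true
        System.out.println(narrowed());                        // -1930689235 after wrap-around
    }
}
```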


Bug Report

Affected tool(s)

Tool name(s), special parameters?

Affected version(s)

  • Latest public release version [version?]
  • Latest development/master branch as of [date of test?]

Description

Describe the problem below. Provide screenshots, stack traces, and logs where appropriate.

Steps to reproduce

Tell us how to reproduce this issue. If possible, include command lines that reproduce the problem and provide a minimal test case.

Expected behavior

Tell us what should happen

Actual behavior

Tell us what happens instead


Feature request

Tool(s) involved

Tool name(s), special parameters?

Description

Specify whether you want a modification of an existing behavior or addition of a new capability.
Provide examples, screenshots, where appropriate.


Documentation request

Tool(s) involved

Tool name(s), parameters?

Description

Describe what needs to be added or modified.


@yfarjoun
Contributor

I think that this is a big lift with very little upside.
From the SAM spec:
[image: SAM spec excerpt]

@yfarjoun
Contributor

[images: further SAM spec excerpts]

This has come up every now and again on the hts-specs call, but there are so few cases where this is desired, and there is a reasonable workaround (split your large contig into smaller parts), so it has been decided repeatedly not to pursue it.

@yfarjoun
Contributor

For the record, bam indexing seems to break at 2^29 (samtools/hts-specs#35).

But the length requested in the original post is very, very large (it's almost the size of the entire human reference... just one contig!)

@gbggrant
Contributor Author

Thank you Yossi.

@jmarshall

jmarshall commented Mar 16, 2022

The BAI format is limited to 2^29 due to its unchangeable minshift=14 / depth=5 geometry. This limitation is not inherent in the indexing scheme, and CSI can use geometries that allow maximum coordinates larger than 2^32 to be indexed.
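The geometry limit above can be computed directly: a binning index with parameters min_shift and depth can address positions up to 1 << (min_shift + 3 * depth). A small sketch (the class and method names are illustrative):

```java
public class IndexGeometry {
    // Maximum coordinate addressable by a binning index: 1 << (min_shift + 3 * depth).
    // BAI's geometry is fixed at min_shift=14, depth=5; CSI lets you choose both.
    static long maxCoordinate(int minShift, int depth) {
        return 1L << (minShift + 3 * depth);
    }

    public static void main(String[] args) {
        System.out.println(maxCoordinate(14, 5)); // BAI: 2^29 = 536870912
        System.out.println(maxCoordinate(14, 6)); // one CSI geometry: 2^32 = 4294967296
    }
}
```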

IMHO the 32-bit ranges for LN / POS / PNEXT / TLEN that Yossi quoted from the SAM spec are a bit misleading. In BAM these fields are signed 32 bits, so the values are indeed limited to 2^31 in BAM, but in SAM these fields are just text strings of digits, so there is no inherent reason for them to be limited in range at all. I'm not sure whether these fields are limited to 32 bits in CRAM. So really these 32-bit ranges only necessarily apply to BAM\1 (see samtools/hts-specs#240).

So an implementer could choose to represent chromosome positions internally as e.g. 64 bits, read and write them with their full range in SAM and (perhaps) CRAM, and raise an error if an attempt is made to write a position greater than 2^31 as BAM.
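That strategy can be sketched as a write-time guard: hold positions as 64-bit longs internally and fail loudly when a value cannot be encoded in BAM's signed 32-bit fields. This is a minimal illustration; checkBamPosition is a hypothetical helper, not an htsjdk API:

```java
public class BamPositionGuard {
    // Positions live as 64-bit longs internally; narrowing to int is only allowed
    // when the value actually fits in BAM's signed 32-bit field.
    static int checkBamPosition(long pos) {
        if (pos > Integer.MAX_VALUE) {
            throw new IllegalArgumentException(
                    "position " + pos + " does not fit in BAM's signed 32-bit field");
        }
        return (int) pos;
    }

    public static void main(String[] args) {
        System.out.println(checkBamPosition(1_000_000L)); // fits, prints 1000000
    }
}
```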

HTSlib does this since samtools/htslib#709, which landed in 1.10.

(On reading further, I think current CRAM is also limited to 2^31, but CRAM v4 will in future lift this to 64 bits.)

@JuiTse

JuiTse commented Mar 17, 2022

Hi, @yfarjoun
Could you give me some hints on how to split the large contig into smaller parts?
Thanks for your help.
Best,
Jui-Tse

@yfarjoun
Contributor

Firstly: this is completely experimental!!!! I have not done this!!! Someone else may have, but not I.

I do know that something like this was done for the circular mitochondrial genome, for very different reasons.

There's no tool (that I'm aware of) that will do this; here's what I would do if I had to:

  1. Find a "boring" spot (i.e. far from genes, perhaps an area where you wouldn't trust variants anyway) and cut the reference sequence there.
  2. Give the parts new names (xxx-part1, xxx-part2?); note that you'll disagree with the rest of the world about the reference you used for the following steps.
  3. Index the new genome and remap.
  4. Run the pipeline on the "small" references.
  5. Create a liftOver chain and bring your variants/annotations back to the "normal" genome.
  6. Let us know how it worked!
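The cutting and renaming in the first two steps can be sketched in a few lines. This is completely hypothetical, in keeping with the caveat above: the record names and cut coordinate are illustrative, and no existing tool is being reproduced here.

```java
import java.util.Arrays;
import java.util.List;

public class SplitContig {
    // Cut one sequence at a chosen coordinate and emit two renamed FASTA records
    // (name-part1 / name-part2), mirroring the manual recipe described above.
    static List<String> split(String name, String sequence, int cutAt) {
        return Arrays.asList(
                ">" + name + "-part1\n" + sequence.substring(0, cutAt),
                ">" + name + "-part2\n" + sequence.substring(cutAt));
    }

    public static void main(String[] args) {
        for (String record : split("chr1", "ACGTACGTAC", 4)) {
            System.out.println(record);
        }
    }
}
```

A real run would read the reference FASTA, split only the oversized contig, and rebuild indexes afterwards; remember that downstream coordinates will refer to the renamed parts until lifted back over.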

@JuiTse

JuiTse commented Mar 18, 2022

Hi, @yfarjoun
I will try it, and if there is anything new, I will post here.
Thanks again for the kind response and suggestions.

@TAKosch

TAKosch commented Jun 6, 2023

Hello,

Could you please tell me if contig size is still limited for running Picard MarkDuplicates? I work mainly with amphibian genomes, which tend to have large chromosomes.

Recently, when I tried to run Picard MarkDuplicates, I got the following error:

[Tue Jun 06 11:46:58 AEST 2023] picard.sam.markduplicates.MarkDuplicates done. Elapsed time: 0.02 minutes.
Runtime.totalMemory()=2172649472
To get help, see http://broadinstitute.github.io/picard/index.html#GettingHelp
Exception in thread "main" htsjdk.samtools.SAMFormatException: SAM validation error: ERROR: Record 334, Read name A00119:894:H5LMLDSX7:4:2668:13684:32800, Insert size out of range
    at htsjdk.samtools.SAMUtils.processValidationErrors(SAMUtils.java:441)
    at htsjdk.samtools.BAMFileReader$BAMFileIterator.advance(BAMFileReader.java:665)
    at htsjdk.samtools.BAMFileReader$BAMFileIterator.next(BAMFileReader.java:650)
    at htsjdk.samtools.BAMFileReader$BAMFileIterator.next(BAMFileReader.java:620)
    at htsjdk.samtools.SamReader$AssertingIterator.next(SamReader.java:545)
    at htsjdk.samtools.SamReader$AssertingIterator.next(SamReader.java:519)
    at picard.sam.markduplicates.MarkDuplicates.buildSortedReadEndLists(MarkDuplicates.java:436)
    at picard.sam.markduplicates.MarkDuplicates.doWork(MarkDuplicates.java:220)
    at picard.cmdline.CommandLineProgram.instanceMain(CommandLineProgram.java:208)
    at picard.cmdline.PicardCommandLine.instanceMain(PicardCommandLine.java:95)
    at picard.cmdline.PicardCommandLine.main(PicardCommandLine.java:105)

I've read several older threads that discuss contig size limits. One said that contigs must be <512 MB, and another that the SAM contig size limit is 2^31. The largest contig in my dataset exceeds the 512 MB limit but not the SAM contig size limit: it is 1,251,518,389 bp (about 1,272 MB).

Thank you,
Tiffany
