
Add support for reading SAM/BAM files with contigs larger than max int #1788

Closed
2 tasks
gbggrant opened this issue Mar 16, 2022 · 9 comments

@gbggrant
Contributor

Instructions

In #1783 a user described a situation in which MarkDuplicates failed because the size of the contig in their BAM was larger than the maximum value of an int (the value in question was 2364278061).

This ticket is to investigate modifications to htsjdk and picard to add support for increasing the allowable size for a contig to be greater than an int.
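The failure mode is plain integer narrowing: the contig length reported in #1783 does not fit in Java's signed 32-bit int, so it wraps to a negative value. A minimal sketch (the class and method names are illustrative, not from htsjdk):

```java
public class ContigLengthOverflow {
    // Contig length reported in issue #1783; larger than Integer.MAX_VALUE (2^31 - 1).
    static final long CONTIG_LENGTH = 2364278061L;

    // Narrowing the 64-bit value to int wraps around and yields a negative number.
    static int narrowed() {
        return (int) CONTIG_LENGTH;
    }

    public static void main(String[] args) {
        System.out.println(CONTIG_LENGTH > Integer.MAX_VALUE); // true
        System.out.println(narrowed());                        // -1930689235 after wrap-around
    }
}
```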


Bug Report

Affected tool(s)

Tool name(s), special parameters?

Affected version(s)

  • Latest public release version [version?]
  • Latest development/master branch as of [date of test?]

Description

Describe the problem below. Provide screenshots, stack traces, and logs where appropriate.

Steps to reproduce

Tell us how to reproduce this issue. If possible, include command lines that reproduce the problem and provide a minimal test case.

Expected behavior

Tell us what should happen

Actual behavior

Tell us what happens instead


Feature request

Tool(s) involved

Tool name(s), special parameters?

Description

Specify whether you want a modification of an existing behavior or addition of a new capability.
Provide examples, screenshots, where appropriate.


Documentation request

Tool(s) involved

Tool name(s), parameters?

Description

Describe what needs to be added or modified.


@yfarjoun
Contributor

I think that this is a big lift with very little upside.
From the SAM spec:
[image: SAM spec excerpt]

@yfarjoun
Contributor

[images: further SAM spec excerpts]

This has come up every now and again on the hts-specs call, but there are so few cases where this is desired, and there is a reasonable workaround (split your large contig into smaller parts), so it has been decided repeatedly not to pursue it.

@yfarjoun
Contributor

For the record, bam indexing seems to break at 2^29 (samtools/hts-specs#35).

But the length requested in the original post is very, very large (it's almost the size of the entire human reference... just one contig!)

@gbggrant
Contributor Author

Thank you Yossi.

@jmarshall

jmarshall commented Mar 16, 2022

The BAI format is limited to 2^29 due to its unchangeable minshift=14 / depth=5 geometry. This limitation is not inherent in the indexing scheme, and CSI can use geometries that allow maximum coordinates larger than 2^32 to be indexed.
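The geometry limit above can be computed directly: a binning index with parameters min_shift and depth can address positions up to 1 << (min_shift + 3 * depth). A small sketch (the class and method names are illustrative):

```java
public class IndexGeometry {
    // Maximum coordinate addressable by a binning index: 1 << (min_shift + 3 * depth).
    // BAI's geometry is fixed at min_shift=14, depth=5; CSI lets you choose both.
    static long maxCoordinate(int minShift, int depth) {
        return 1L << (minShift + 3 * depth);
    }

    public static void main(String[] args) {
        System.out.println(maxCoordinate(14, 5)); // BAI: 2^29 = 536870912
        System.out.println(maxCoordinate(14, 6)); // one CSI geometry: 2^32 = 4294967296
    }
}
```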

IMHO the 32-bit ranges for LN / POS / PNEXT / TLEN that Yossi quoted from the SAM spec are a bit misleading. In BAM these fields are signed 32 bits, so the values are indeed limited to 2^31 in BAM, but in SAM these fields are just text strings of digits, so there is no inherent reason for them to be limited in range at all. I'm not sure whether these fields are limited to 32 bits in CRAM. So really these 32-bit ranges only necessarily apply to BAM\1 (see samtools/hts-specs#240).

So an implementer could choose to represent chromosome positions internally as e.g. 64 bits, read and write them with their full range in SAM and (perhaps) CRAM, and raise an error if an attempt is made to write a position greater than 2^31 as BAM.
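That strategy can be sketched as a write-time guard: hold positions as 64-bit longs internally and fail loudly when a value cannot be encoded in BAM's signed 32-bit fields. This is a minimal illustration; checkBamPosition is a hypothetical helper, not an htsjdk API:

```java
public class BamPositionGuard {
    // Positions live as 64-bit longs internally; narrowing to int is only allowed
    // when the value actually fits in BAM's signed 32-bit field.
    static int checkBamPosition(long pos) {
        if (pos > Integer.MAX_VALUE) {
            throw new IllegalArgumentException(
                    "position " + pos + " does not fit in BAM's signed 32-bit field");
        }
        return (int) pos;
    }

    public static void main(String[] args) {
        System.out.println(checkBamPosition(1_000_000L)); // fits, prints 1000000
    }
}
```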

HTSlib does this since samtools/htslib#709, which landed in 1.10.

(On reading further, I think current CRAM is also limited to 2^31, but CRAM v4 will in future lift this to 64 bits.)

@JuiTse

JuiTse commented Mar 17, 2022

Hi, @yfarjoun
Could you give me some hints on how to split the large contig into smaller parts?
Thanks for your help.
Best,
Jui-Tse

@yfarjoun
Contributor

Firstly: this is completely experimental!!!! I have not done this!!! Someone else may have, but not I.

I do know that something like this was done for the circular mitochondrial genome, for very different reasons.

There's no tool (that I'm aware of) that will do this; here's what I would do if I had to:

  1. Find a "boring" spot (i.e. far from genes, perhaps an area where you wouldn't trust variants anyway) and cut the reference sequence there.
  2. Give the parts new names (xxx-part1, xxx-part2?); note that you'll disagree with the rest of the world about the reference you used for the following steps.
  3. Index the new genome and remap.
  4. Run the pipeline on the "small" references.
  5. Create a liftOver chain and bring your variants/annotations back to the "normal" genome.
  6. Let us know how it worked!
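The cutting and renaming in the first two steps can be sketched in a few lines. This is completely hypothetical, in keeping with the caveat above: the record names and cut coordinate are illustrative, and no existing tool is being reproduced here.

```java
import java.util.Arrays;
import java.util.List;

public class SplitContig {
    // Cut one sequence at a chosen coordinate and emit two renamed FASTA records
    // (name-part1 / name-part2), mirroring the manual recipe described above.
    static List<String> split(String name, String sequence, int cutAt) {
        return Arrays.asList(
                ">" + name + "-part1\n" + sequence.substring(0, cutAt),
                ">" + name + "-part2\n" + sequence.substring(cutAt));
    }

    public static void main(String[] args) {
        for (String record : split("chr1", "ACGTACGTAC", 4)) {
            System.out.println(record);
        }
    }
}
```

A real run would read the reference FASTA, split only the oversized contig, and rebuild indexes afterwards; remember that downstream coordinates will refer to the renamed parts until lifted back over.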

@JuiTse

JuiTse commented Mar 18, 2022

Hi, @yfarjoun
I will try it, and if there is anything new, I will post here.
Thanks again for the kind response and suggestions.

@TAKosch

TAKosch commented Jun 6, 2023

Hello,

Could you please tell me if contig size is still limited for running Picard MarkDuplicates? I work mainly with amphibian genomes, which tend to have large chromosomes.

Recently, when I tried to run Picard MarkDuplicates, I got the following error:

[Tue Jun 06 11:46:58 AEST 2023] picard.sam.markduplicates.MarkDuplicates done. Elapsed time: 0.02 minutes.
Runtime.totalMemory()=2172649472
To get help, see http://broadinstitute.github.io/picard/index.html#GettingHelp
Exception in thread "main" htsjdk.samtools.SAMFormatException: SAM validation error: ERROR: Record 334, Read name A00119:894:H5LMLDSX7:4:2668:13684:32800, Insert size out of range
    at htsjdk.samtools.SAMUtils.processValidationErrors(SAMUtils.java:441)
    at htsjdk.samtools.BAMFileReader$BAMFileIterator.advance(BAMFileReader.java:665)
    at htsjdk.samtools.BAMFileReader$BAMFileIterator.next(BAMFileReader.java:650)
    at htsjdk.samtools.BAMFileReader$BAMFileIterator.next(BAMFileReader.java:620)
    at htsjdk.samtools.SamReader$AssertingIterator.next(SamReader.java:545)
    at htsjdk.samtools.SamReader$AssertingIterator.next(SamReader.java:519)
    at picard.sam.markduplicates.MarkDuplicates.buildSortedReadEndLists(MarkDuplicates.java:436)
    at picard.sam.markduplicates.MarkDuplicates.doWork(MarkDuplicates.java:220)
    at picard.cmdline.CommandLineProgram.instanceMain(CommandLineProgram.java:208)
    at picard.cmdline.PicardCommandLine.instanceMain(PicardCommandLine.java:95)
    at picard.cmdline.PicardCommandLine.main(PicardCommandLine.java:105)

I've read several older threads that discuss contig size limits. One said that contigs must be <512 MB, and another that the SAM contig size limit is 2^31. The largest contig in my dataset exceeds the 512 MB limit but not the SAM contig size limit: it is 1,251,518,389 bp (about 1,272 MB).

Thank you,
Tiffany
