Immediately stop with input string error (NumberFormatException: For input string) #1783

Open
JuiTse opened this issue Mar 10, 2022 · 8 comments


@JuiTse

JuiTse commented Mar 10, 2022

Hi,
I have used STAR to map reads to a large genome (26 GB), but it fails when I try to run MarkDuplicates.

Bug Report

I am using Picard version 2.26.10, and the following is my command:
/home/liao/software/java/jre1.8.0_321/bin/java -jar picard.jar MarkDuplicates I=T733-02-T89Aligned_sortedByCoord_out.bam O=T733-02-T89Aligned.sortedByCoord.out_marked_dup.bam M=T733-02-T89Aligned.sortedByCoord.out_marked_dup_metrics.txt
INFO 2022-03-10 17:41:24 MarkDuplicates

********** NOTE: Picard's command line syntax is changing.


********** For more information, please see:
********** https://github.com/broadinstitute/picard/wiki/Command-Line-Syntax-Transition-For-Users-(Pre-Transition)


********** The command line looks like this in the new syntax:


********** MarkDuplicates -I T733-02-T89Aligned_sortedByCoord_out.bam -O T733-02-T89Aligned.sortedByCoord.out_marked_dup.bam -M T733-02-T89Aligned.sortedByCoord.out_marked_dup_metrics.txt


17:41:25.005 INFO NativeLibraryLoader - Loading libgkl_compression.so from jar:file:/data/personal2/lego/lego_Pinus/2_mapping/mapped_ind/unable%20to%20process/picard.jar!/com/intel/gkl/native/libgkl_compression.so
[Thu Mar 10 17:41:25 CST 2022] MarkDuplicates INPUT=[T733-02-T89Aligned_sortedByCoord_out.bam] OUTPUT=T733-02-T89Aligned.sortedByCoord.out_marked_dup.bam METRICS_FILE=T733-02-T89Aligned.sortedByCoord.out_marked_dup_metrics.txt MAX_SEQUENCES_FOR_DISK_READ_ENDS_MAP=50000 MAX_FILE_HANDLES_FOR_READ_ENDS_MAP=8000 SORTING_COLLECTION_SIZE_RATIO=0.25 TAG_DUPLICATE_SET_MEMBERS=false REMOVE_SEQUENCING_DUPLICATES=false TAGGING_POLICY=DontTag CLEAR_DT=true DUPLEX_UMI=false ADD_PG_TAG_TO_READS=true REMOVE_DUPLICATES=false ASSUME_SORTED=false DUPLICATE_SCORING_STRATEGY=SUM_OF_BASE_QUALITIES PROGRAM_RECORD_ID=MarkDuplicates PROGRAM_GROUP_NAME=MarkDuplicates READ_NAME_REGEX=<optimized capture of last three ':' separated fields as numeric values> OPTICAL_DUPLICATE_PIXEL_DISTANCE=100 MAX_OPTICAL_DUPLICATE_SET_SIZE=300000 VERBOSITY=INFO QUIET=false VALIDATION_STRINGENCY=STRICT COMPRESSION_LEVEL=5 MAX_RECORDS_IN_RAM=500000 CREATE_INDEX=false CREATE_MD5_FILE=false GA4GH_CLIENT_SECRETS=client_secrets.json USE_JDK_DEFLATER=false USE_JDK_INFLATER=false
[Thu Mar 10 17:41:25 CST 2022] Executing as liao@LiaoPC on Linux 5.4.0-96-generic amd64; Java HotSpot(TM) 64-Bit Server VM 1.8.0_321-b07; Deflater: Intel; Inflater: Intel; Provider GCS is not available; Picard version: 2.26.10
INFO 2022-03-10 17:41:25 MarkDuplicates Start of doWork freeMemory: 1750185592; totalMemory: 1771044864; maxMemory: 26277314560
INFO 2022-03-10 17:41:25 MarkDuplicates Reading input file and constructing read end information.
INFO 2022-03-10 17:41:25 MarkDuplicates Will retain up to 95207661 data points before spilling to disk.
[Thu Mar 10 17:41:25 CST 2022] picard.sam.markduplicates.MarkDuplicates done. Elapsed time: 0.01 minutes.
Runtime.totalMemory()=1771044864
To get help, see http://broadinstitute.github.io/picard/index.html#GettingHelp
Exception in thread "main" java.lang.NumberFormatException: For input string: "2364278061"
at java.lang.NumberFormatException.forInputString(NumberFormatException.java:65)
at java.lang.Integer.parseInt(Integer.java:583)
at java.lang.Integer.parseInt(Integer.java:615)
at htsjdk.samtools.SAMTextHeaderCodec.parseSQLine(SAMTextHeaderCodec.java:214)
at htsjdk.samtools.SAMTextHeaderCodec.decode(SAMTextHeaderCodec.java:113)
at htsjdk.samtools.BAMFileReader.readHeader(BAMFileReader.java:704)
at htsjdk.samtools.BAMFileReader.<init>(BAMFileReader.java:298)
at htsjdk.samtools.BAMFileReader.<init>(BAMFileReader.java:176)
at htsjdk.samtools.SamReaderFactory$SamReaderFactoryImpl.open(SamReaderFactory.java:406)
at picard.sam.markduplicates.util.AbstractMarkDuplicatesCommandLineProgram.openInputs(AbstractMarkDuplicatesCommandLineProgram.java:265)
at picard.sam.markduplicates.MarkDuplicates.buildSortedReadEndLists(MarkDuplicates.java:507)
at picard.sam.markduplicates.MarkDuplicates.doWork(MarkDuplicates.java:258)
at picard.cmdline.CommandLineProgram.instanceMain(CommandLineProgram.java:308)
at picard.cmdline.PicardCommandLine.instanceMain(PicardCommandLine.java:103)
at picard.cmdline.PicardCommandLine.main(PicardCommandLine.java:113)

The string "2364278061" is the first chromosome length in the chrLength.txt file output by STAR.
It seems the program treats it as a non-numeric value. Why does this happen, and how can I fix it?
Best,
Jui-Tse

The attached files are output from STAR (the BAM file could not be uploaded, probably because of its size).
T733-02-T89.zip

@cmnbroad
Contributor

@JuiTse The error message isn't very specific, but the length of your reference contig (2364278061) exceeds the maximum size that Picard can handle. Unfortunately, I'm not aware of any workaround for this.
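For reference, the value that fails to parse comes from the LN: field of the first @SQ line in the BAM header, and 2364278061 is larger than 2147483647 (2^31-1), the largest value a signed 32-bit integer (and therefore Java's Integer.parseInt, which htsjdk's header codec uses) can hold. A minimal way to confirm this, assuming samtools is available and using the input file from the command above:

# Print the first @SQ header line; the LN: field is the contig length.
samtools view -H T733-02-T89Aligned_sortedByCoord_out.bam | grep '^@SQ' | head -n 1
# Any LN value above 2147483647 (2^31-1) cannot fit in a 32-bit signed integer,
# which is what triggers the NumberFormatException in the header parser.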

@JuiTse
Author

JuiTse commented Mar 14, 2022

Hi, @cmnbroad
Thanks for the answer.
So it seems this large genome exceeds the limit of what Picard can handle?
Do you have any suggestion for another program to remove duplicates that could handle this case?
(I have also tried samtools but got an error as well.)
Best regards,
Jui-Tse

@cmnbroad
Contributor

@JuiTse Sorry, I don't have any suggestions for alternatives.

@jmarshall

(I have also tried samtools but got an error as well.)

What samtools command did you try?

I think I would expect samtools markdup (in samtools 1.10 or later) to work on large chromosomes.
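To check which samtools version is actually in use (1.10 is roughly where 64-bit position support landed), something like the following should be enough; the exact output wording may vary between builds:

# Report the samtools and htslib versions in use.
samtools --version | head -n 2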

@yfarjoun
Contributor

@JuiTse
I don't understand how you managed to get a BAM file with reads at positions > 2^31-1... perhaps STAR has an integer overflow?

The BAM spec doesn't support positions > 2^31-1, so I think you are better off cutting your reference in two than looking for a tool that supports such large positions.
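As a rough sketch of what cutting up the over-long contig could look like, assuming a samtools/htslib build recent enough to handle positions beyond 2^31 and using a placeholder contig name (the real names are in the BAM header):

# Hypothetical sketch only: split the >2^31 bp contig into two halves before
# rebuilding the aligner index; "Chr01" and "reference.fa" are placeholders.
samtools faidx reference.fa
samtools faidx reference.fa "Chr01:1-1182139030"          > Chr01_part1.fa
samtools faidx reference.fa "Chr01:1182139031-2364278061" > Chr01_part2.fa
# The extracted records are named after the region strings, so the FASTA headers
# would need renaming, the GTF coordinates for the second half would need shifting,
# and the reads would have to be remapped before running MarkDuplicates again.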

what is the organism that has this giant reference sequence?

@yfarjoun reopened this Mar 16, 2022
@yfarjoun
Contributor

yfarjoun commented Mar 16, 2022

(reopening since I'm hoping to gain some feedback from the OP)

@JuiTse
Author

JuiTse commented Mar 17, 2022

Hi, @jmarshall
I just followed the commands on the manual page: http://www.htslib.org/doc/samtools-markdup.html
(collate > fixmate > sort > markdup)
and got the following error from markdup: [markdup] error: bad coordinate order.
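For reference, the pipeline from that manual page is roughly the following (filenames here are placeholders, not the exact ones used):

# samtools duplicate-marking workflow, as in the samtools-markdup manual page.
samtools collate -o namecollate.bam input.bam      # group reads by name
samtools fixmate -m namecollate.bam fixmate.bam    # add mate-score tags needed by markdup
samtools sort -o positionsort.bam fixmate.bam      # return to coordinate order
samtools markdup positionsort.bam markdup.bam      # step reporting "bad coordinate order" here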

I have consulted a similar issue on the STAR GitHub, and the author Alex Dobin thought it may be an exon with a negative coordinate in the GTF file that causes this series of problems (alexdobin/STAR#1492).
Best,
Jui-Tse

@JuiTse
Author

JuiTse commented Mar 17, 2022

Hi, @yfarjoun
I am mapping the transcriptome of a pine species to the Pinus tabuliformis genome, which is 25.4 GB
(from this reference: https://www.sciencedirect.com/science/article/pii/S0092867421014288).

There seem to be some errors in the GTF file (as I mentioned in my response to jmarshall), though I am not sure whether they are the cause of this Picard issue.
I am not sure whether the GTF error comes from the original GFF3 file or from the format conversion process. If it is the former, I may need to contact the authors of this genome for more information on how to fix it (i.e. the negative exon coordinates).
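A quick way to check whether the converted GTF really contains negative (or zero) coordinates, with "annotation.gtf" as a placeholder for the converted annotation file:

# Print any non-comment GTF records whose start (column 4) or end (column 5) is below 1.
awk -F'\t' '!/^#/ && ($4 < 1 || $5 < 1)' annotation.gtf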
Best,
Jui-Tse
