Error ValidateVariants -no Support for VCFv4.4 #1978

Open
desmodus1984 opened this issue Aug 30, 2024 · 2 comments


desmodus1984 commented Aug 30, 2024

bug report and feature request

Feature request

Tool(s) involved

GATK

Description

Hi,
I genotyped bisulfite-sequenced samples, and I am now trying to build a dataset for population analysis.
I tried combining the per-sample VCFs with bcftools, but it reported that the samples have the same name. I don't understand this, because the files have different headers:

##FORMAT=<ID=GQ,Number=1,Type=Integer,Description="Genotype Quality.In some cases of single-stranded coverge, we are sure there is a SNV, but we can not determine the alternative variant. So, we express the GQ as the Phred score (-10log10 (p-value)) of posterior probability of homozygote/heterozygote, namely, Prob(heterozygote) for homozygous sites and Prob(homozygote) for heterozygous sites. This is somewhat different with SNV calling from WGS data.">
##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Read Depth">
##FORMAT=<ID=DPW,Number=1,Type=Integer,Description="Read Depth of Wastson Strand">
##FORMAT=<ID=DPC,Number=1,Type=Integer,Description="Read Depth of Crick Strand">
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT V00809.bsg

##FORMAT=<ID=GQ,Number=1,Type=Integer,Description="Genotype Quality.In some cases of single-stranded coverge, we are sure there is a SNV, but we can not determine the alternative variant. So, we express the GQ as the Phred score (-10log10 (p-value)) of posterior probability of homozygote/heterozygote, namely, Prob(heterozygote) for homozygous sites and Prob(homozygote) for heterozygous sites. This is somewhat different with SNV calling from WGS data.">
##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Read Depth">
##FORMAT=<ID=DPW,Number=1,Type=Integer,Description="Read Depth of Wastson Strand">
##FORMAT=<ID=DPC,Number=1,Type=Integer,Description="Read Depth of Crick Strand">
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT V00753.bsg

##FORMAT=<ID=GQ,Number=1,Type=Integer,Description="Genotype Quality.In some cases of single-stranded coverge, we are sure there is a SNV, but we can not determine the alternative variant. So, we express the GQ as the Phred score (-10log10 (p-value)) of posterior probability of homozygote/heterozygote, namely, Prob(heterozygote) for homozygous sites and Prob(homozygote) for heterozygous sites. This is somewhat different with SNV calling from WGS data.">
##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Read Depth">
##FORMAT=<ID=DPW,Number=1,Type=Integer,Description="Read Depth of Wastson Strand">
##FORMAT=<ID=DPC,Number=1,Type=Integer,Description="Read Depth of Crick Strand">
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT V00001.bsg

I thought the sample name was the last word on the #CHROM line, but it seems that I am wrong.
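For reference, the VCF specification places the sample names in the columns that follow FORMAT on the #CHROM line, so the three headers above do appear to declare different samples. A minimal Python sketch for checking this on a file is given below. The function names are hypothetical, and it assumes an uncompressed, tab-delimited VCF (the header lines quoted above show spaces only because of how they were pasted).

```python
from typing import List


def sample_names(header_line: str) -> List[str]:
    """Return the sample names from a VCF '#CHROM' header line."""
    fields = header_line.rstrip("\r\n").split("\t")
    if fields[0] != "#CHROM":
        raise ValueError("not a #CHROM header line")
    # Columns 1-8 are fixed, column 9 is FORMAT, samples start at column 10.
    return fields[9:]


def read_sample_names(path: str) -> List[str]:
    """Scan an uncompressed VCF and return the samples named in its header."""
    with open(path, "rt") as handle:
        for line in handle:
            if line.startswith("#CHROM"):
                return sample_names(line)
    raise ValueError("no #CHROM line found in " + path)
```

Running `read_sample_names` on each input file and comparing the results would show whether any two files really share a sample name.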

So I am trying to rename the samples in the VCF files with picard.vcf.RenameSampleInVcf. GATK 4.5 is installed on the HPC, and I also downloaded the precompiled 4.6 release from the website. With both versions I get this error:

--GATK 4.5
gatk ValidateVariants -V V00001.bsg.vcf -R /home/juaguila/BombusMethylSeq/Rec-5/Bvos.fasta
Using GATK jar /home/applications/spack_new_install/opt/spack/linux-centos7-haswell/gcc-13.2.0/gatk-4.5.0.0-2yk33gnxpllakywmil6fx5ydci37fp7n/bin/gatk-package-4.5.0.0-local.jar
Running:
java -Dsamjdk.use_async_io_read_samtools=false -Dsamjdk.use_async_io_write_samtools=true -Dsamjdk.use_async_io_write_tribble=false -Dsamjdk.compression_level=2 -jar /home/applications/spack_new_install/opt/spack/linux-centos7-haswell/gcc-13.2.0/gatk-4.5.0.0-2yk33gnxpllakywmil6fx5ydci37fp7n/bin/gatk-package-4.5.0.0-local.jar ValidateVariants -V V00001.bsg.vcf -R /home/juaguila/BombusMethylSeq/Rec-5/Bvos.fasta
12:54:23.684 INFO NativeLibraryLoader - Loading libgkl_compression.so from jar:file:/home/applications/spack_new_install/opt/spack/linux-centos7-haswell/gcc-13.2.0/gatk-4.5.0.0-2yk33gnxpllakywmil6fx5ydci37fp7n/bin/gatk-package-4.5.0.0-local.jar!/com/intel/gkl/native/libgkl_compression.so
12:54:24.094 INFO ValidateVariants - ------------------------------------------------------------
12:54:24.102 INFO ValidateVariants - The Genome Analysis Toolkit (GATK) v4.5.0.0
12:54:24.103 INFO ValidateVariants - For support and documentation go to https://software.broadinstitute.org/gatk/
12:54:24.103 INFO ValidateVariants - Executing as [email protected] on Linux v3.10.0-1160.105.1.el7.x86_64 amd64
12:54:24.103 INFO ValidateVariants - Java runtime: Java HotSpot(TM) 64-Bit Server VM v20.0.1+9-29
12:54:24.104 INFO ValidateVariants - Start Date/Time: August 30, 2024, 12:54:23 PM EDT
12:54:24.104 INFO ValidateVariants - ------------------------------------------------------------
12:54:24.104 INFO ValidateVariants - ------------------------------------------------------------
12:54:24.105 INFO ValidateVariants - HTSJDK Version: 4.1.0
12:54:24.105 INFO ValidateVariants - Picard Version: 3.1.1
12:54:24.105 INFO ValidateVariants - Built for Spark Version: 3.5.0
12:54:24.106 INFO ValidateVariants - HTSJDK Defaults.COMPRESSION_LEVEL : 2
12:54:24.106 INFO ValidateVariants - HTSJDK Defaults.USE_ASYNC_IO_READ_FOR_SAMTOOLS : false
12:54:24.106 INFO ValidateVariants - HTSJDK Defaults.USE_ASYNC_IO_WRITE_FOR_SAMTOOLS : true
12:54:24.107 INFO ValidateVariants - HTSJDK Defaults.USE_ASYNC_IO_WRITE_FOR_TRIBBLE : false
12:54:24.107 INFO ValidateVariants - Deflater: IntelDeflater
12:54:24.107 INFO ValidateVariants - Inflater: IntelInflater
12:54:24.107 INFO ValidateVariants - GCS max retries/reopens: 20
12:54:24.108 INFO ValidateVariants - Requester pays: disabled
12:54:24.108 INFO ValidateVariants - Initializing engine
12:54:24.414 INFO FeatureManager - Using codec VCFCodec to read file file:///home/juaguila/BombusMethylSeq/Rec-5/mrkdup/gatk-rename/V00001.bsg.vcf
12:54:24.550 INFO ValidateVariants - Shutting down engine
[August 30, 2024, 12:54:24 PM EDT] org.broadinstitute.hellbender.tools.walkers.variantutils.ValidateVariants done. Elapsed time: 0.02 minutes.
Runtime.totalMemory()=285212672
org.broadinstitute.hellbender.exceptions.GATKException: Error initializing feature reader for path V00001.bsg.vcf
at org.broadinstitute.hellbender.engine.FeatureDataSource.getTribbleFeatureReader(FeatureDataSource.java:436)
at org.broadinstitute.hellbender.engine.FeatureDataSource.getFeatureReader(FeatureDataSource.java:377)
at org.broadinstitute.hellbender.engine.FeatureDataSource.<init>(FeatureDataSource.java:319)
at org.broadinstitute.hellbender.engine.FeatureDataSource.<init>(FeatureDataSource.java:291)
at org.broadinstitute.hellbender.engine.VariantWalker.initializeDrivingVariants(VariantWalker.java:58)
at org.broadinstitute.hellbender.engine.VariantWalkerBase.initializeFeatures(VariantWalkerBase.java:67)
at org.broadinstitute.hellbender.engine.GATKTool.onStartup(GATKTool.java:726)
at org.broadinstitute.hellbender.engine.VariantWalker.onStartup(VariantWalker.java:45)
at org.broadinstitute.hellbender.cmdline.CommandLineProgram.runTool(CommandLineProgram.java:147)
at org.broadinstitute.hellbender.cmdline.CommandLineProgram.instanceMainPostParseArgs(CommandLineProgram.java:198)
at org.broadinstitute.hellbender.cmdline.CommandLineProgram.instanceMain(CommandLineProgram.java:217)
at org.broadinstitute.hellbender.Main.runCommandLineProgram(Main.java:166)
at org.broadinstitute.hellbender.Main.mainEntry(Main.java:209)
at org.broadinstitute.hellbender.Main.main(Main.java:306)
Caused by: htsjdk.tribble.TribbleException$MalformedFeatureFile: Unable to parse header with error: Your input file has a malformed header: This codec is strictly for VCFv4 and does not support VCFv4.4, for input source: V00001.bsg.vcf
at htsjdk.tribble.TribbleIndexedFeatureReader.readHeader(TribbleIndexedFeatureReader.java:265)
at htsjdk.tribble.TribbleIndexedFeatureReader.<init>(TribbleIndexedFeatureReader.java:104)
at htsjdk.tribble.TribbleIndexedFeatureReader.<init>(TribbleIndexedFeatureReader.java:129)
at htsjdk.tribble.AbstractFeatureReader.getFeatureReader(AbstractFeatureReader.java:121)
at org.broadinstitute.hellbender.engine.FeatureDataSource.getTribbleFeatureReader(FeatureDataSource.java:433)
... 13 more
Caused by: htsjdk.tribble.TribbleException$InvalidHeader: Your input file has a malformed header: This codec is strictly for VCFv4 and does not support VCFv4.4
at htsjdk.variant.vcf.VCFCodec.readActualHeader(VCFCodec.java:108)
at htsjdk.tribble.AsciiFeatureCodec.readHeader(AsciiFeatureCodec.java:79)
at htsjdk.tribble.AsciiFeatureCodec.readHeader(AsciiFeatureCodec.java:37)
at htsjdk.tribble.TribbleIndexedFeatureReader.readHeader(TribbleIndexedFeatureReader.java:263)

--GATK 4.6
Using GATK jar /home/juaguila/appz/gatk-4.6.0.0/gatk-package-4.6.0.0-local.jar
Running:
java -Dsamjdk.use_async_io_read_samtools=false -Dsamjdk.use_async_io_write_samtools=true -Dsamjdk.use_async_io_write_tribble=false -Dsamjdk.compression_level=2 -jar /home/juaguila/appz/gatk-4.6.0.0/gatk-package-4.6.0.0-local.jar RenameSampleInVcf -INPUT V00001.bsg.vcf -OUTPUT V00001.rnm.bsg.vcf -NEW_SAMPLE_NAME V00001
14:14:44.248 INFO NativeLibraryLoader - Loading libgkl_compression.so from jar:file:/home/juaguila/appz/gatk-4.6.0.0/gatk-package-4.6.0.0-local.jar!/com/intel/gkl/native/libgkl_compression.so
[Fri Aug 30 14:14:44 EDT 2024] RenameSampleInVcf --INPUT V00001.bsg.vcf --OUTPUT V00001.rnm.bsg.vcf --NEW_SAMPLE_NAME V00001 --VERBOSITY INFO --QUIET false --VALIDATION_STRINGENCY STRICT --COMPRESSION_LEVEL 2 --MAX_RECORDS_IN_RAM 500000 --CREATE_INDEX false --CREATE_MD5_FILE false --help false --version false --showHidden false --USE_JDK_DEFLATER false --USE_JDK_INFLATER false
[Fri Aug 30 14:14:45 EDT 2024] Executing as [email protected] on Linux 3.10.0-1160.105.1.el7.x86_64 amd64; Java HotSpot(TM) 64-Bit Server VM 20.0.1+9-29; Deflater: Intel; Inflater: Intel; Provider GCS is available; Picard version: Version:4.6.0.0
[Fri Aug 30 14:14:45 EDT 2024] picard.vcf.RenameSampleInVcf done. Elapsed time: 0.02 minutes.
Runtime.totalMemory()=285212672
To get help, see http://broadinstitute.github.io/picard/index.html#GettingHelp
htsjdk.tribble.TribbleException$MalformedFeatureFile: Unable to parse header with error: Your input file has a malformed header: This codec is strictly for VCFv4 and does not support VCFv4.4, for input source: file:///home/juaguila/BombusMethylSeq/Rec-5/mrkdup/gatk-rename/V00001.bsg.vcf
at htsjdk.tribble.TribbleIndexedFeatureReader.readHeader(TribbleIndexedFeatureReader.java:265)
at htsjdk.tribble.TribbleIndexedFeatureReader.<init>(TribbleIndexedFeatureReader.java:104)
at htsjdk.tribble.TribbleIndexedFeatureReader.<init>(TribbleIndexedFeatureReader.java:129)
at htsjdk.tribble.AbstractFeatureReader.getFeatureReader(AbstractFeatureReader.java:121)
at htsjdk.tribble.AbstractFeatureReader.getFeatureReader(AbstractFeatureReader.java:81)
at htsjdk.variant.vcf.VCFFileReader.<init>(VCFFileReader.java:145)
at htsjdk.variant.vcf.VCFFileReader.<init>(VCFFileReader.java:95)
at picard.vcf.RenameSampleInVcf.doWork(RenameSampleInVcf.java:113)
at picard.cmdline.CommandLineProgram.instanceMain(CommandLineProgram.java:281)
at org.broadinstitute.hellbender.cmdline.PicardCommandLineProgramExecutor.instanceMain(PicardCommandLineProgramExecutor.java:37)
at org.broadinstitute.hellbender.Main.runCommandLineProgram(Main.java:166)
at org.broadinstitute.hellbender.Main.mainEntry(Main.java:209)
at org.broadinstitute.hellbender.Main.main(Main.java:306)
Caused by: htsjdk.tribble.TribbleException$InvalidHeader: Your input file has a malformed header: This codec is strictly for VCFv4 and does not support VCFv4.4
at htsjdk.variant.vcf.VCFCodec.readActualHeader(VCFCodec.java:108)
at htsjdk.tribble.AsciiFeatureCodec.readHeader(AsciiFeatureCodec.java:79)
at htsjdk.tribble.AsciiFeatureCodec.readHeader(AsciiFeatureCodec.java:37)
at htsjdk.tribble.TribbleIndexedFeatureReader.readHeader(TribbleIndexedFeatureReader.java:263)
... 12 more

It seems that there is no support for VCFv4.4. Are there any plans to add it to GATK?
I need the genotypes from the bisulfite-like (Enzymatic Methyl-seq Kit) data. With cgmaptools I get a lot of degenerate bases (IUPAC codes that GATK does not accept), and with the other method, bsgenova (@hippo-yf), the output format is not compatible with GATK, so I cannot use it for renaming.
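The version that the codec rejects is declared on the first line of the file, in the `##fileformat=` header line. A minimal Python sketch for checking which version a file declares is given below. The function name `vcf_version` is hypothetical, and treating any path ending in `.gz` as gzip/bgzip-compressed is an assumption.

```python
import gzip


def vcf_version(path: str) -> str:
    """Return the version declared on the '##fileformat=' line, e.g. 'VCFv4.4'."""
    # Assumption: a '.gz' suffix means the file is gzip- or bgzip-compressed.
    opener = gzip.open if path.endswith(".gz") else open
    with opener(path, "rt") as handle:
        first = handle.readline().rstrip("\r\n")
    prefix = "##fileformat="
    if not first.startswith(prefix):
        raise ValueError(path + ": first line is not a ##fileformat line")
    return first[len(prefix):]
```

Going by the error messages above, a file for which this returns `VCFv4.4` would be expected to fail in the same way with GATK 4.5 and 4.6.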

I am attaching the first 30 lines of three files so you can check whether there are any other problems.

Thanks,

bsg.header.txt

kockan (Contributor) commented Sep 4, 2024

I believe there is no support for VCF 4.4 yet. Is that right, @lbergelson @cmnbroad?

droazen (Contributor) commented Sep 10, 2024

Correct, we do not yet have support for VCF 4.4.

It may be that bcftools does not like the . character in the sample names. You could try editing the sample names directly in a text editor, and then reindexing the VCFs using bcftools index before trying again.
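The manual edit suggested above could also be scripted. The following is only a sketch, assuming an uncompressed, tab-delimited VCF with exactly one sample column. The function name `rename_single_sample` and the output file name in the usage note are hypothetical.

```python
def rename_single_sample(in_path: str, out_path: str, new_name: str) -> str:
    """Copy a single-sample VCF, replacing the sample name on the #CHROM line.

    Returns the old sample name. All other lines are copied unchanged.
    """
    old_name = None
    with open(in_path, "rt") as src, open(out_path, "wt") as dst:
        for line in src:
            if old_name is None and line.startswith("#CHROM"):
                fields = line.rstrip("\r\n").split("\t")
                # 8 fixed columns + FORMAT + one sample = 10 columns.
                if len(fields) != 10:
                    raise ValueError("expected exactly one sample column")
                old_name = fields[9]
                fields[9] = new_name
                line = "\t".join(fields) + "\n"
            dst.write(line)
    if old_name is None:
        raise ValueError("no #CHROM line found in " + in_path)
    return old_name
```

For example, `rename_single_sample("V00001.bsg.vcf", "V00001.renamed.vcf", "V00001")` would write a copy whose sample column is named `V00001`, without the `.` character. The `##fileformat` line is left untouched, so this only addresses the sample name for bcftools and does not make the file readable by GATK. The bcftools `reheader` command with a sample list may achieve the same result; check its manual before relying on it.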
