Pipeline for calling canine mitochondiral variation from aligned Illumina data
This pipeline calls mitochondrial variation from aligned Illumina data. The input is a BAM/CRAM of paired-end Illumina reads aligned to the UU_Cfam_GSD_1.0 assembly (as in dogmap). The output is a reconstructed fasta of the sample mitochondrial genome and a VCF file of potential variants.
Variant calling uses Mutect2 as described in Laricchia et al. and utilizes filters and assignment suggested by Fregel et al.. Parameters were optimized based on comparison with mitocondrial genomes reported in recently published genomes with PacBio and Illumina sequence data.
Read pairs with at least one read that aligns to chrM or to Nuclear Mitochondiral Segments (NUMTs) that are >300bp with > 95% identity are extracted. NuMTs were identified by Fabian Ramos-Almodovar.
NC_002008.4 is used as the reference mitochondiral genome. To account for the cirular nature of the mitochondria, extracted reads are aligned NC_002008.4 as well as to a version of NC_002008.4 that has been rotated by 8 kbp. Resulting BAM files are sorted and duplicates marked.
Coverage along the mitochondrial genome is assessed using CollectHsMetrics from GATK. Depth is merged from the reference and rotated alignments, with the first and last 4kbp taken from the rotated alignment.
If the mean coverage is greater than 5,000, the reference and rotated BAMs are downsampled to a depth of 5,000 using DownsampleSam from GATK.
Variants are called from the standard and rotated BAM using the mitochondrial mode of Mutect2. The following options are used: --mitochondria-mode --max-reads-per-alignment-start 75 --max-mnp-distance 0 --annotation StrandBiasBySample
Filters are applied to each of the resulting VCF files using GATK FilterMutectCalls --mitochondria-mode
The reference and rotated VCF files are merged, with the first and last 4kbp taken from the rotated alignment. The resulting variants are filtered to remove sites where the most frequent alternative allele fails the strand_bias filter or where the most frequenct alternative allele has a allele fraction less than 0.5.
A representation of the mitochondria sequence is created using bcftools consensus. Regions with a coverage less than 100, or that overlap with NC_002008.4 positions 15990-15990 or 15512-15535 are masked to 'N.'
An attempt at haplogroup assignment based on the diagnostic changes reported in Fregel et al. is reported. Note that these assignments should not be considered definitive due to ambiguities and limited resolution. Assessment relative to clade defining haplogroup sequences is recommended.
The following software and versions are used
bwa-mem
gatk version 4.2.5.0
samtools version >= 1.9
liftOver
bgzip
tabix
bcftools version >= 1.9.
python callmito/process-sample.py \
--ref GENEOME-FASTA-USED-IN-ALIGNMENT \
--name SAMPLE-NAME \
--finaldir OUTPUT-DIRECTORY \
--cram SAMPLE-BAM-or-CRAM-FILE \
--coords COORDINTATES-TO-EXTRACT (callmito/numt-coords-to-extract.95_300.bed) \
--mitoFa MITO-REFERENCE (callmito/refs/NC_002008.4.fa) \
--mitoFaRotated ROTATED-MITO-REFERENCE (callmito/refs/NC_002008.4.rotate8k.fa) \
--chainfile LIFTOVER-CHAIN-FILE-FOR-ROTATED-MITO (callmito/refs/rotatedToOriginal.liftOver) \
--diagnosticTable DIAGNOSITIC-SNPS-TABLE (callmito/fregel-haplogroups.txt)