Update CombineBatches workflow #732

Draft
wants to merge 36 commits into
base: main

mwalker174
Collaborator

Replaces most methods in the CombineBatches workflow with a much simpler set of tasks built on GATK SVCluster and the new GroupedSVCluster tool (see PR). SVCluster takes over most of the current functionality, including VCF joining and clustering, while GroupedSVCluster adds refined clustering (a.k.a. "reclustering"), which has become a best practice for larger call sets.

Clustering refinement is critical for consolidating redundant variants in repetitive sequence contexts such as simple repeats and segmental duplications. This also addresses an issue with duplicate insertions that share coordinates but have slightly different split read signatures (i.e. different END positions).

In addition, this PR makes some minor improvements to VCF formatting and parsing:

  • In GenotypeBatch, CNVs are now formatted the same way as in CleanVcf: no genotypes, a <CNV> ALT allele, and SVTYPE=CNV, rather than ALT alleles <CN0>,<CN1>,… and SVTYPE=DUP. For users who need to run on a VCF in the old format, CombineBatches has a legacy_vcfs flag that converts it to the new format prior to processing.
  • Within CombineBatches, records are annotated with HIGH_SR_BACKGROUND and BOTHSIDE_PASS INFO field flags instead of passing around separate lists, which was cumbersome.
  • Some downstream scripts now use .get() rather than brackets to access FORMAT fields. This was required in some cases because GATK omits a FORMAT field from a record if it is null for all samples. Pysam then throws an error on bracket access because the requested key does not exist, whereas .get() returns None.
  • The VCF is converted back to the "old" format at the end of CombineBatches to minimize risk of bugs in downstream workflows.
  • Minor change to the breakpoint overlap filter: variants are now prioritized by BOTHSIDE_PASS status (binary) rather than by the fraction of supporting batches.
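
For reference, the .get() access pattern from the list above can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: the helper name get_format_value is hypothetical, plain dicts stand in for pysam sample objects (assumed here to behave like mappings with a .get() method, as described above), and SR_GQ is used only as an example FORMAT key.

```python
from typing import Any, Mapping, Optional


def get_format_value(sample: Mapping[str, Any], key: str,
                     default: Optional[Any] = None) -> Any:
    """Return a FORMAT value, or `default` if the key is absent or null.

    Hypothetical helper: `sample` is any mapping-like object with .get().
    Bracket access (sample[key]) would raise when the writer omitted the
    field from the record, so .get() is used instead.
    """
    value = sample.get(key)
    return default if value is None else value


# Stand-in for a sample in a record where the FORMAT field was omitted
sample_without_field = {"GT": (0, 1)}
# Stand-in for a sample in a record where the field is present
sample_with_field = {"GT": (0, 1), "SR_GQ": 42}

missing = get_format_value(sample_without_field, "SR_GQ")
present = get_format_value(sample_with_field, "SR_GQ")
```

Treating an absent key and an explicit null the same way keeps callers from needing two separate checks; whether a fallback such as 0 is appropriate depends on the field and the script.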

Add reformatting to GenotypeBatch

Expose reformat_script

Start ripping stuff out

Finish rewriting wdl and template

Add TODO and delete unused task