Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Sharded AnnotateVcf workflow #566

Merged
merged 26 commits into from
Aug 4, 2023
Merged

Conversation

epiercehoffman
Copy link
Collaborator

@epiercehoffman epiercehoffman commented Jul 10, 2023

Updates

This PR incorporates @xuefzhao's changes to AnnotateVcf to shard the workflow for additional scaling capabilities and add the PAR BED file for AF annotation, as well as additional changes to streamline and clean up the AnnotateVcf workflow. Changes include:

  • Shard input VCF(s) for AnnotateVcf for scalability
  • Pass PAR BED file to AF annotation task
  • UNDID AFTER REVIEW: Allow user to provide either a single full input VCF or a list of contig VCFs, and the output will match the input in this regard. INSTEAD: Input is a single VCF, and to run separately on subsets of contigs the user should submit multiple workflows with different contig_list inputs.
  • UNDID AFTER REVIEW: Make index file an optional input. INSTEAD: Require presence of index but not input; infer path based on VCF path
  • Remove some scatter-gather operations in AnnotateVcf subworkflows that are no longer necessary (there is one scatter at the beginning & one gather at the end of the workflow now)
  • Remove some intermediary WDLs in AnnotateVcf subworkflows that are no longer necessary
  • Use latest utility tasks for sharding, concatenating, subsetting samples, etc.
  • Add set -euo pipefail in some tasks that were missing it
  • Change runtime_override to runtime_attr in some tasks to match the rest of the code base

Testing

  • Validated all WDLs and JSONs with womtool & terra validation script
  • Tested AnnotateVcf.wdl with ref panel data, with both a full VCF input & contig VCFs
    • The output truncates the floats (to 6 places) for the external AF values compared to the latest main branch, so it is not identical, but that appears to be the only difference
    • The contig VCFs and full VCF output were not comparable because the inputs were from slightly different versions of the reference panel VCF
    • Both the latest main branch & this branch took about 85 mins to run on a single full VCF input, but performance gains are expected for larger input VCFs

Copy link
Collaborator

@mwalker174 mwalker174 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for this. Unless I'm missing something, I think things can be simplified a bit though - see comments below.

String gatk_docker

File? NONE_FILE_

RuntimeAttr? runtime_attr_svannotate
RuntimeAttr? runtime_attr_concat_vcfs
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unused

RuntimeAttr? runtime_attr_svannotate
RuntimeAttr? runtime_attr_concat_vcfs
RuntimeAttr? runtime_attr_prune_vcf
RuntimeAttr? runtime_attr_shard_vcf
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unused

Comment on lines 52 to 56
RuntimeAttr? runtime_attr_fix_ends_rescale_GQ
RuntimeAttr? runtime_attr_concat_sharded_cluster
RuntimeAttr? runtime_attr_preconcat_sharded_cluster
RuntimeAttr? runtime_attr_hail_merge_sharded_cluster
RuntimeAttr? runtime_attr_fix_header_sharded_cluster
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

All unused

@@ -990,6 +990,7 @@ task ScatterVcf {

command <<<
set -euo pipefail
~{if !defined(vcf_index) then "tabix ~{vcf}" else ""}
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is problematic if the vcf is unsorted. Can you make an explicit input like Boolean generate_index_if_unavailable that defaults to false but is also required to be true here?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Removing this line and requiring index in AnnotateVcf as discussed

@@ -4,151 +4,108 @@
version 1.0

import "TasksMakeCohortVcf.wdl" as MiniTasks
import "ChromosomeAlleleFrequencies.wdl" as calcAF
import "Utils.wdl" as util

# Prune off samples in annotated VCF, add VAF annotation
workflow PruneAndAddVafs {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let delete this workflow and just move the tasks into ShardedAnnotateVcf.wdl

@epiercehoffman
Copy link
Collaborator Author

Addressed review comments and switched to a single input VCF as discussed to simplify the WDL structure. If the user wants to run a subset of contigs they should provide a subsetted contig_list, and they can submit multiple workflows for multiple subsets of contigs. Re-tested: womtool validation & successful run of AnnotateVcf.wdl on the reference panel.

@epiercehoffman epiercehoffman merged commit fa2ac80 into main Aug 4, 2023
2 checks passed
@epiercehoffman epiercehoffman deleted the eph_xf_shard_annotation_add_par branch August 4, 2023 17:09
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants