Tweak extract values based on Echo Runs [VS-1432] #8979

Merged (3 commits) on Sep 13, 2024
1 change: 1 addition & 0 deletions scripts/variantstore/docs/aou/AOU_DELIVERABLES.md
@@ -172,6 +172,7 @@ You can take advantage of our existing sub-cohort WDL, `GvsExtractCohortFromSamp
- Specify the same `call_set_identifier`, `dataset_name`, `project_id`, `extract_table_prefix`, and `interval_list` that were used in the `GvsPrepareRangesCallset` run documented above.
- Specify the `interval_weights_bed` appropriate for the PGEN extraction run you are performing. `gs://gvs_quickstart_storage/weights/gvs_full_vet_weights_1kb_padded_orig.bed` is the interval weights BED used for Quickstart.
- Select the workflow option "Retry with more memory" and choose a "Memory retry factor" of 1.5
- Set the `extract_maxretries_override` input to 5, `split_intervals_disk_size_override` to 1000, `scatter_count` to 25000, and `y_bed_weight_scaling` to 8 to start; you will likely have to adjust one or more of these values in subsequent attempts.

Contributor:
This all looks good, esp. since you are using real-world experience from Echo.
Do we have any record of the thought process behind this? I know at one point there was a doc.
Anyway, LGTM, but it def eludes me why these numbers are ideal.

Collaborator:

Off the top of my head:

  • `split_intervals_disk_size_override` to 1000 because otherwise we run out of disk space 😭
  • `scatter_count`: I think this one may vary depending on the interval list. Pretty sure the default of 34K is almost guaranteed to give us Cromwell problems (and maybe that should be changed).
  • `y_bed_weight_scaling`: we were doing both X and Y at a scale factor of 4 and the Y shards were still the laggards. We might even go higher than 8, but we didn't actually try that during Echo.

- `GvsExtractCallsetPgen` currently defaults to 100 alt alleles maximum, which means that any sites having more than that number of alt alleles will be dropped.
- Be sure to set the `output_gcs_dir` to the proper path in the AoU delivery bucket so you don't need to copy the output files there yourself once the workflow has finished.
- For `GvsExtractCallsetPgen` (which is called by `GvsExtractCallsetPgenMerged`), if one (or several) of the `PgenExtractTask` shards fail because of angry cloud, you can re-run the workflow with the exact same inputs with call caching turned on; the successful shards will cache and only the failed ones will re-run.
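
As a rough illustration, the starting values from the bullet list above could be collected into a Cromwell-style inputs JSON. This is a minimal sketch under stated assumptions: the `GvsExtractCallsetPgenMerged` prefix on the keys is an assumption (use the name of the workflow you actually launch), `build_starting_inputs` is a hypothetical helper, and the memory retry factor is a workflow option rather than an input, so it is not included.

```python
import json

# Starting values quoted in the deliverables doc; expect to tune them on later attempts.
STARTING_OVERRIDES = {
    "extract_maxretries_override": 5,
    "split_intervals_disk_size_override": 1000,
    "scatter_count": 25000,
    "y_bed_weight_scaling": 8,
}


def build_starting_inputs(workflow_name: str) -> dict:
    """Prefix each override with the workflow name, as a WDL inputs JSON expects.

    Hypothetical helper: `workflow_name` is supplied by the caller.
    """
    return {f"{workflow_name}.{key}": value for key, value in STARTING_OVERRIDES.items()}


if __name__ == "__main__":
    # Assumed workflow name, for illustration only.
    print(json.dumps(build_starting_inputs("GvsExtractCallsetPgenMerged"), indent=2))
```

Keeping the overrides in one place makes it easier to record what changed between attempts when a value has to be adjusted.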
8 changes: 4 additions & 4 deletions scripts/variantstore/wdl/GvsExtractCallsetPgen.wdl
@@ -140,12 +140,12 @@ workflow GvsExtractCallsetPgen {

Int effective_split_intervals_disk_size_override = select_first([split_intervals_disk_size_override,
if GetNumSamplesLoaded.num_samples < 100 then 50 # Quickstart
else 500])
else 200])

Int effective_extract_memory_gib = if defined(extract_memory_override_gib) then select_first([extract_memory_override_gib])
else if effective_scatter_count <= 100 then 37 + extract_overhead_memory_override_gib
else if effective_scatter_count <= 500 then 17 + extract_overhead_memory_override_gib
else 9 + extract_overhead_memory_override_gib
else if effective_scatter_count <= 100 then 35 + extract_overhead_memory_override_gib
else if effective_scatter_count <= 500 then 15 + extract_overhead_memory_override_gib
else 5 + extract_overhead_memory_override_gib
# WDL 1.0 trick to set a variable ('none') to be undefined.
if (false) {
File? none = ""
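
The selection logic in this hunk can be modeled outside WDL to sanity-check what a given run will request. The sketch below is a Python approximation, not the workflow itself: the function names are hypothetical, and the tier constants, large-cohort disk default, and overhead are passed in rather than hard-coded, since the hunk shows both 37/17/9 and 35/15/5 for memory and both 500 and 200 for disk.

```python
from typing import Optional, Tuple

# Hypothetical helpers; every constant comes from the caller, not from the workflow.


def effective_split_intervals_disk_size(
    num_samples: int, large_default: int, override: Optional[int] = None
) -> int:
    """Mirror select_first([override, if num_samples < 100 then 50 else large_default])."""
    if override is not None:
        return override
    return 50 if num_samples < 100 else large_default


def effective_extract_memory_gib(
    scatter_count: int,
    overhead_gib: int,
    tiers: Tuple[int, int, int],
    override_gib: Optional[int] = None,
) -> int:
    """An explicit override wins outright; otherwise base memory shrinks as scatter grows."""
    if override_gib is not None:
        return override_gib
    small, medium, large = tiers
    if scatter_count <= 100:
        base = small
    elif scatter_count <= 500:
        base = medium
    else:
        base = large
    return base + overhead_gib
```

For example, with an assumed overhead of 3 GiB, `effective_extract_memory_gib(25000, 3, (37, 17, 9))` gives 12 and `effective_extract_memory_gib(25000, 3, (35, 15, 5))` gives 8. Passing `override=1000` to the disk helper returns 1000 regardless of sample count.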