From 702a804da5b2d1407af6ddd0c94150cdb350ba86 Mon Sep 17 00:00:00 2001 From: Bec Asch Date: Tue, 10 Sep 2024 20:29:16 -0400 Subject: [PATCH 1/5] Update Docs Around Callset Cleanup and Cost [VS-1107] (#8976) --- .../variantstore/docs/aou/AOU_DELIVERABLES.md | 21 ++- scripts/variantstore/docs/aou/cleanup/Cost.md | 53 +++++++ .../variantstore/docs/aou/cleanup/cleanup.md | 130 ++++-------------- 3 files changed, 96 insertions(+), 108 deletions(-) create mode 100644 scripts/variantstore/docs/aou/cleanup/Cost.md diff --git a/scripts/variantstore/docs/aou/AOU_DELIVERABLES.md b/scripts/variantstore/docs/aou/AOU_DELIVERABLES.md index ff4e54c6fdd..2eb7bd56fad 100644 --- a/scripts/variantstore/docs/aou/AOU_DELIVERABLES.md +++ b/scripts/variantstore/docs/aou/AOU_DELIVERABLES.md @@ -88,13 +88,26 @@ 1. `GvsCallsetCost` workflow - This workflow calculates the total BigQuery cost of generating this callset (which is not represented in the Terra UI total workflow cost) using the above GVS workflows; it's used to calculate the cost as a whole and by sample. + +## Internal sign-off protocol + +The Variants team currently has the following VDS internal sign-off protocol: + +1. Generate a VDS for the candidate callset into the "delivery" bucket. +1. Open up the VDS in a [beefy](vds/cluster/AoU%20VDS%20Cluster%20Configuration.md) notebook and confirm the "shape" looks right. +1. Run `GvsPrepareRangesCallset.wdl` to generate a prepare table of VET data. +1. Run `GvsCallsetStatistics.wdl` to generate callset statistics for the candidate callset using the prepare VET table created in the preceding step. +1. Copy the output of `GvsCallsetStatistics.wdl` into the "delivery" bucket. +1. Email the paths to the VDS and callset statistics to Lee/Wail for QA / approval. + + ## Main Deliverables (via email to stakeholders once the above steps are complete) The Callset Stats and S&P files can be simply `gsutil cp`ed to the AoU delivery bucket since they are so much smaller. 1. GCS location of the VDS in the AoU delivery bucket -2. Fully qualified name of the BigQuery dataset (composed of the `project_id` and `dataset_name` inputs from the workflows) -3. GCS location of the CSV output from `GvsCallsetStatistics` workflow in the AoU delivery bucket -4. GCS location of the TSV output from `GvsCalculatePrecisionAndSensitivity` in the AoU delivery bucket +1. Fully qualified name of the BigQuery dataset (composed of the `project_id` and `dataset_name` inputs from the workflows) +1. GCS location of the CSV output from `GvsCallsetStatistics` workflow in the AoU delivery bucket +1. GCS location of the TSV output from `GvsCalculatePrecisionAndSensitivity` in the AoU delivery bucket ## Running the VAT pipeline To create a BigQuery table of variant annotations, you may follow the instructions here: @@ -213,4 +226,4 @@ Once the VAT has been created, you will need to create a database table mapping ``` select distinct vid from `.` where vid not in (select vid from `.`) ; - ``` \ No newline at end of file + ``` diff --git a/scripts/variantstore/docs/aou/cleanup/Cost.md b/scripts/variantstore/docs/aou/cleanup/Cost.md new file mode 100644 index 00000000000..ce3157ac5c7 --- /dev/null +++ b/scripts/variantstore/docs/aou/cleanup/Cost.md @@ -0,0 +1,53 @@ +# Storage vs Regeneration Costs + +## Prepare tables + +These numbers assume the `GvsPrepareRangesCallset` workflow is invoked with the `only_output_vet_tables` input set +to `true`. 
If this is not the case, meaning the prepare version of the ref ranges table was also generated, all costs +below should be multiplied by about 4: + +* Running + GvsPrepareRangesCallset: [$429.18](https://docs.google.com/spreadsheets/d/1fcmEVWvjsx4XFLT9ZUsruUznnlB94xKgDIIyCGu6ryQ/edit#gid=0) +``` +-- Look in the cost observability table for the bytes scanned for the appropriate run of `GvsPrepareRanges`. +SELECT + ROUND(event_bytes * (5 / POW(1024, 4)), 2) AS cost, -- $5 / TiB on demand https://cloud.google.com/bigquery/pricing#on_demand_pricing + call_start_timestamp +FROM + `aou-genomics-curation-prod.aou_wgs_fullref_v2.cost_observability` +WHERE + step = 'GvsPrepareRanges' +ORDER BY call_start_timestamp DESC +``` +* Storing prepare data: $878.39 / month + * Assuming compressed pricing, multiply the number of physical bytes by $0.026 / GiB. + +## Avro files + +The Avro files generated from the Delta callset onward are very large, several times the size of the final Hail VDS. +For the ~250K sample Delta callset the Avro files consumed nearly 80 TiB of GCS storage while the delivered VDS was +"only" about 26 TiB. + +Approximate figures for the ~250K sample Delta callset: + +* Avro storage cost: $1568 / month (might be lower if we can get a colder bucket to copy them into) + * `76.61 TiB gs://fc-secure-fb908548-fe3c-41d6-adaf-7ac20d541375/submissions/c86a6e8f-71a1-4c38-9e6a-f5229520641e/GvsExtractAvroFilesForHail/efb3dbe8-13e9-4542-8b69-02237ec77ca5/call-OutputPath/avro` +* [Avro generation cost](https://docs.google.com/spreadsheets/d/1fcmEVWvjsx4XFLT9ZUsruUznnlB94xKgDIIyCGu6ryQ/edit#gid=0): + $3000, 12 hours runtime. + +## Hail VariantDataset (VDS) + +The Hail VDS generated for the Delta callset consumes about 26 TiB of space in GCS at a cost of approximately $500 / +month. Recreating the VDS from Avro files would take around 10 hours at about $100 / hour in cluster time for a total of +about $1000. Note that re-creating the VDS requires Avro files; if we have not retained the Avro files per the step +above, we would need to regenerate those as well which would add significantly to the cost. + +Approximate figures for the ~250K samples Delta callset: + +* VDS storage cost: ~$500 / month. Note AoU should have exact copies of the VDSes we have delivered for Delta, though + it's not certain that these copies will remain accessible to the Variants team in the long term. The delivered VDSes are put here `gs://prod-drc-broad/` and we have noted that we need them to remain there for hot-fixes. The Variants team has + generated five versions of the Delta VDS so far, one of which (the original) still exist: + * First version of the callset, includes many samples that were later + removed `gs://fc-secure-fb908548-fe3c-41d6-adaf-7ac20d541375/vds/2022-10-19/dead_alleles_removed_vs_667_249047_samples/gvs_export.vds` +* VDS regeneration cost: $1000 (~10 hours @ ~$100 / hour cluster cost) + $3000 to regenerate Avro files if necessary. + diff --git a/scripts/variantstore/docs/aou/cleanup/cleanup.md b/scripts/variantstore/docs/aou/cleanup/cleanup.md index af4c2da7013..2967d28a1cb 100644 --- a/scripts/variantstore/docs/aou/cleanup/cleanup.md +++ b/scripts/variantstore/docs/aou/cleanup/cleanup.md @@ -1,117 +1,39 @@ -# AoU callset cleanup +# AoU Callset Cleanup ## Overview -The current Variants policy for AoU callsets is effectively to retain all versions of all artifacts forever. 
As the -storage costs for these artifacts can be significant (particularly from Delta onward), the Variants team would like to -make the cost of retaining artifacts more clear so conscious choices can be made about what to keep and what to delete. +The current Variants policy for AoU callsets is effectively to retain all versions of all artifacts forever. As the storage costs for these artifacts can be significant (particularly from Delta onward), the Variants team would like to make the cost of retaining artifacts more clear so conscious choices can be made about what to keep and what to delete. -As a general rule, any artifacts that have clearly become obsolete (e.g. VDSes with known issues that have been -superseded by corrected versions, obsolete sets of prepare tables, etc.) should be deleted ASAP. If it's not clear to -the Variants team whether an artifact should be cleaned up or not, we should calculate the monthly cost to preserve the -artifact (e.g. the sum of all relevant GCS or BigQuery storage costs) as well as the cost to regenerate the artifact. -Reach out to Lee with these numbers for his verdict on whether to keep or delete. +As a general rule, any artifacts that have clearly become obsolete (e.g. VDSes with known issues that have been superseded by corrected versions, obsolete sets of prepare tables, etc.) should be deleted ASAP. If it's not clear to the Variants team whether an artifact should be cleaned up or not, [we should calculate the monthly cost to preserve the artifact (e.g. the sum of all relevant GCS or BigQuery storage costs) as well as the cost to regenerate the artifact](Cost.md). + +Reach out to leadership with these numbers for their verdict on whether to keep or delete. ## Specific AoU GVS Artifacts During the course of creating AoU callsets several large and expensive artifacts are created: * Pilot workspace / dataset - * For the AoU Delta callset the Variants team created an AoU 10K workspace and dataset to pilot the Hail-related - processes we were using for the first time. At some point these processes will mature to the point where some or - all of the contents of this workspace and dataset should be deleted. This is likely not an issue for discussion - with Lee as it is internal to the Variants team's development process, but we should be mindful to clean up the - parts of this that we are done using promptly. + * for the Delta callset the Variants team created an AoU 10K workspace and dataset to pilot the Hail/VDS creation + * these processes will mature to the point where some or all of the contents of this workspace and dataset should be deleted * Production BigQuery dataset - * The plan from Delta forward is to use the same production BigQuery dataset for all future callsets. This was also - the plan historically as well, but in practice that didn't work out for various reasons. For Delta in particular - the Variants team was forced to create a new dataset due to the use of drop state NONE for Hail compatibility. If - we are forced to create another BigQuery dataset in the future, discuss with Lee to determine what to do with the - previous dataset(s). In the AoU Delta timeframe, the Variants team additionally reached out to a contact at VUMC - to determine if various datasets from the Alpha / Beta eras were still in use or could possibly be deleted. 
+ * for each previous callset, there was (at least) one new dataset created + * the dream is to keep the same dataset for multiple callsets and just add new samples, regenerate the filter and create new deliverables, but that has yet to happen because of new features requested for each callset (e.g. update to Dragen version, addition of ploidy data, different requirements to use Hail...etc.) + * if there are datasets from previous callsets that aren't needed anymore (check with AoU and Lee/Wail), they should be deleted * Prepare tables - * These tables are used for callset statistics only starting with Delta but were previously used for extract as well - in pre-Delta callsets. Per 2023-01-25 meeting with Lee it appears this VET prepare table and associated callset - statistics tables can be deleted once the callset statistics have been generated and handed off. -* Terra workspace - * It seems that VAT workflows may generate large amounts of data under the submissions "folder". e.g. ~10 TiB of - data under this folder in the AoU 10K workspace (!). At the time of this writing the VAT process is not fully - defined so this item may especially benefit from updates. + * needed for VCF or PGEN extract + * only variant tables are used by `GvsCallsetStatistics.wdl` for callset statistics deliverable + * will be given a TTL by default, but if they are not needed anymore for any of the above, delete +* Sites-only VCFs + * the VAT is created from a sites-only VCF and its creation is the most resource-intensive part of the VAT pipeline + * for Echo, Wail requested that it get copied over to the "delivery" bucket; ask about this before deleting + * clean up all runs (check for failures) once the VAT has been delivered and accepted * Avro files (Delta onward) - * These are huge, several times larger than the corresponding Hail VDS. It's not clear that there's any point to - keeping these files around unless there was a bug in the Hail GVS import code that would require a patch and - re-import. Per 2023-01-25 meeting with Lee we have now deleted the Delta versions of these Avro files. Per the - preceding comments, going forward these files can be deleted once the Variants team feels reasonably confident - that they won't be needed for the current callset any longer. -* Hail VariantDataset (VDS) (Delta onward) - * The Variants team creates a copy of the VDS and then delivers a copy to the AoU preprod datasets bucket. That copy - of the VDS seems to stay in the delivery location for at least a few days, but it's not clear if that copy gets - cleaned up after AoU later copies the VDS to a production bucket. The Variants team should not rely on this copy - of the VDS being available long-term. Per 2023-01-25 meeting with Lee, we have retained the oldest (with AI/AN + - controls) and most recent versions (without AI/AN or controls, corrected phasing and GT) of the Delta VDS. This - can serve as our team's guidance for how to handle multiple VDS versions going forward, though of course we can - always ask Lee for explicit guidance. 
- -## Internal sign-off protocol - -The Variants team currently has the following VDS internal sign-off protocol: - -* Generate a VDS for the candidate callset -* Run validation on this VDS -* Run `GvsPrepareRangesCallset` to generate a prepare table of VET data -* Generate callset statistics for the candidate callset using the prepare VET created in the preceding step -* Forward VDS and callset statistics to Lee for QA / approval - -## Storage versus regeneration costs - -### Prepare tables - -These numbers assume the `GvsPrepareRangesCallset` workflow is invoked with the `only_output_vet_tables` input set -to `true`. If this is not the case, meaning the prepare version of the ref ranges table was also generated, all costs -below should be multiplied by about 4: - -* Running - GvsPrepareRangesCallset: [$429.18](https://docs.google.com/spreadsheets/d/1fcmEVWvjsx4XFLT9ZUsruUznnlB94xKgDIIyCGu6ryQ/edit#gid=0) -``` --- Look in the cost observability table for the bytes scanned for the appropriate run of `GvsPrepareRanges`. -SELECT - ROUND(event_bytes * (5 / POW(1024, 4)), 2) AS cost, -- $5 / TiB on demand https://cloud.google.com/bigquery/pricing#on_demand_pricing - call_start_timestamp -FROM - `aou-genomics-curation-prod.aou_wgs_fullref_v2.cost_observability` -WHERE - step = 'GvsPrepareRanges' -ORDER BY call_start_timestamp DESC -``` -* Storing prepare data: $878.39 / month - * Assuming compressed pricing, multiply the number of physical bytes by $0.026 / GiB. - -### Avro files - -The Avro files generated from the Delta callset onward are very large, several times the size of the final Hail VDS. -For the ~250K sample Delta callset the Avro files consumed nearly 80 TiB of GCS storage while the delivered VDS was -"only" about 26 TiB. - -Approximate figures for the ~250K sample Delta callset: - -* Avro storage cost: $1568 / month (might be lower if we can get a colder bucket to copy them into) - * `76.61 TiB gs://fc-secure-fb908548-fe3c-41d6-adaf-7ac20d541375/submissions/c86a6e8f-71a1-4c38-9e6a-f5229520641e/GvsExtractAvroFilesForHail/efb3dbe8-13e9-4542-8b69-02237ec77ca5/call-OutputPath/avro` -* [Avro generation cost](https://docs.google.com/spreadsheets/d/1fcmEVWvjsx4XFLT9ZUsruUznnlB94xKgDIIyCGu6ryQ/edit#gid=0): - $3000, 12 hours runtime. - -### Hail VariantDataset (VDS) - -The Hail VDS generated for the Delta callset consumes about 26 TiB of space in GCS at a cost of approximately $500 / -month. Recreating the VDS from Avro files would take around 10 hours at about $100 / hour in cluster time for a total of -about $1000. Note that re-creating the VDS requires Avro files; if we have not retained the Avro files per the step -above, we would need to regenerate those as well which would add significantly to the cost. - -Approximate figures for the ~250K samples Delta callset: - -* VDS storage cost: ~$500 / month. Note AoU should have exact copies of the VDSes we have delivered for Delta, though - it's not certain that these copies will remain accessible to the Variants team in the long term. The delivered VDSes are put here `gs://prod-drc-broad/` and we have noted that we need them to remain there for hot-fixes. 
The Variants team has - generated five versions of the Delta VDS so far, one of which (the original) still exist: - * First version of the callset, includes many samples that were later - removed `gs://fc-secure-fb908548-fe3c-41d6-adaf-7ac20d541375/vds/2022-10-19/dead_alleles_removed_vs_667_249047_samples/gvs_export.vds` -* VDS regeneration cost: $1000 (~10 hours @ ~$100 / hour cluster cost) + $3000 to regenerate Avro files if necessary. - + * huge, several times larger than the corresponding Hail VDS + * as long as the VDS has been delivered and accepted, they can be deleted +* Hail VariantDataset (VDS) + * we used to create it in the Terra workspace and then copy it + * WDL was updated to create in the "deliverables" GCS bucket so there is only one copy of each one + * clean up any failed runs once the VDS has been delivered and accepted +* PGEN/VCF Intermediate Files + * PGEN: multiple versions of the PGEN files are created by GvsExtractCallsetPgenMerged.wdl because it delivers files split by chromosome + * VCF: only one version of the VCF files and indices are created, but check for failed runs From cab8d5284058706851abc2dacf7e4062e39220af Mon Sep 17 00:00:00 2001 From: Bec Asch Date: Fri, 13 Sep 2024 14:05:58 -0400 Subject: [PATCH 2/5] Tweak extract values based on Echo Runs [VS-1432] (#8979) --- scripts/variantstore/docs/aou/AOU_DELIVERABLES.md | 1 + scripts/variantstore/wdl/GvsExtractCallsetPgen.wdl | 8 ++++---- 2 files changed, 5 insertions(+), 4 deletions(-) diff --git a/scripts/variantstore/docs/aou/AOU_DELIVERABLES.md b/scripts/variantstore/docs/aou/AOU_DELIVERABLES.md index 2eb7bd56fad..b54de7ce43a 100644 --- a/scripts/variantstore/docs/aou/AOU_DELIVERABLES.md +++ b/scripts/variantstore/docs/aou/AOU_DELIVERABLES.md @@ -172,6 +172,7 @@ You can take advantage of our existing sub-cohort WDL, `GvsExtractCohortFromSamp - Specify the same `call_set_identifier`, `dataset_name`, `project_id`, `extract_table_prefix`, and `interval_list` that were used in the `GvsPrepareRangesCallset` run documented above. - Specify the `interval_weights_bed` appropriate for the PGEN extraction run you are performing. `gs://gvs_quickstart_storage/weights/gvs_full_vet_weights_1kb_padded_orig.bed` is the interval weights BED used for Quickstart. - Select the workflow option "Retry with more memory" and choose a "Memory retry factor" of 1.5 + - Set the `extract_maxretries_override` input to 5, `split_intervals_disk_size_override` to 1000, `scatter_count` to 25000, and `y_bed_weight_scaling` to 8 to start; you will likely have to adjust one or more of these values in subsequent attempts. - `GvsExtractCallsetPgen` currently defaults to 100 alt alleles maximum, which means that any sites having more than that number of alt alleles will be dropped. - Be sure to set the `output_gcs_dir` to the proper path in the AoU delivery bucket so you don't need to copy the output files there yourself once the workflow has finished. - For `GvsExtractCallsetPgen` (which is called by `GvsExtractCallsetPgenMerged`), if one (or several) of the `PgenExtractTask` shards fail because of angry cloud, you can re-run the workflow with the exact same inputs with call caching turned on; the successful shards will cache and only the failed ones will re-run. 
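The `Cost.md` section added above estimates prepare-table storage by multiplying physical bytes by $0.026 / GiB, but does not show where those byte counts come from. A minimal sketch of one way to pull them from BigQuery's `INFORMATION_SCHEMA.TABLE_STORAGE` view follows; the region qualifier, dataset name, and table-name pattern are assumptions and would need to be adjusted to match the actual prepare tables.

```sql
-- Sketch: estimate monthly physical (compressed) storage cost for prepare tables at ~$0.026 / GiB.
-- The region qualifier, dataset, and LIKE pattern below are placeholders, not confirmed naming conventions.
SELECT
  table_schema AS dataset_name,
  table_name,
  ROUND(total_physical_bytes / POW(1024, 3), 2) AS physical_gib,
  ROUND(total_physical_bytes / POW(1024, 3) * 0.026, 2) AS approx_monthly_cost_usd
FROM
  `aou-genomics-curation-prod.region-us.INFORMATION_SCHEMA.TABLE_STORAGE`
WHERE
  table_schema = 'aou_wgs_fullref_v2'   -- dataset assumed to hold the prepare tables
  AND table_name LIKE '%__VET_DATA%'    -- hypothetical prepare VET table naming pattern
ORDER BY
  total_physical_bytes DESC;
```

Summing `approx_monthly_cost_usd` over the matching tables gives a figure comparable to the $878.39 / month quoted in `Cost.md`.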
diff --git a/scripts/variantstore/wdl/GvsExtractCallsetPgen.wdl b/scripts/variantstore/wdl/GvsExtractCallsetPgen.wdl index cdbfb69f669..ebb73f38935 100644 --- a/scripts/variantstore/wdl/GvsExtractCallsetPgen.wdl +++ b/scripts/variantstore/wdl/GvsExtractCallsetPgen.wdl @@ -140,12 +140,12 @@ workflow GvsExtractCallsetPgen { Int effective_split_intervals_disk_size_override = select_first([split_intervals_disk_size_override, if GetNumSamplesLoaded.num_samples < 100 then 50 # Quickstart - else 500]) + else 200]) Int effective_extract_memory_gib = if defined(extract_memory_override_gib) then select_first([extract_memory_override_gib]) - else if effective_scatter_count <= 100 then 37 + extract_overhead_memory_override_gib - else if effective_scatter_count <= 500 then 17 + extract_overhead_memory_override_gib - else 9 + extract_overhead_memory_override_gib + else if effective_scatter_count <= 100 then 35 + extract_overhead_memory_override_gib + else if effective_scatter_count <= 500 then 15 + extract_overhead_memory_override_gib + else 5 + extract_overhead_memory_override_gib # WDL 1.0 trick to set a variable ('none') to be undefined. if (false) { File? none = "" From 6c68384dd2798471a22c09cc5541db6d9dd6eab1 Mon Sep 17 00:00:00 2001 From: Bec Asch Date: Mon, 16 Sep 2024 18:43:20 -0400 Subject: [PATCH 3/5] update cleanup doc (#8981) --- scripts/variantstore/docs/aou/cleanup/cleanup.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/scripts/variantstore/docs/aou/cleanup/cleanup.md b/scripts/variantstore/docs/aou/cleanup/cleanup.md index 2967d28a1cb..47c2e4c5f20 100644 --- a/scripts/variantstore/docs/aou/cleanup/cleanup.md +++ b/scripts/variantstore/docs/aou/cleanup/cleanup.md @@ -14,7 +14,7 @@ During the course of creating AoU callsets several large and expensive artifacts * Pilot workspace / dataset * for the Delta callset the Variants team created an AoU 10K workspace and dataset to pilot the Hail/VDS creation - * these processes will mature to the point where some or all of the contents of this workspace and dataset should be deleted + * the dataset, workspace data tables, and submission files have been deleted to save money, but the workspace has been kept around for future testing * Production BigQuery dataset * for each previous callset, there was (at least) one new dataset created * the dream is to keep the same dataset for multiple callsets and just add new samples, regenerate the filter and create new deliverables, but that has yet to happen because of new features requested for each callset (e.g. update to Dragen version, addition of ploidy data, different requirements to use Hail...etc.) 
From 213366dc28a0b5ffec0c22d5608ed8ab7d52c5b4 Mon Sep 17 00:00:00 2001 From: Bec Asch Date: Tue, 24 Sep 2024 10:01:51 -0400 Subject: [PATCH 4/5] Change Batch Size Parameter to Scatter Width For Ingest [VS-1218] (#8985) --- .dockstore.yml | 2 +- scripts/variantstore/docs/aou/AOU_DELIVERABLES.md | 2 +- scripts/variantstore/wdl/GvsBulkIngestGenomes.wdl | 10 ++-------- scripts/variantstore/wdl/GvsImportGenomes.wdl | 8 ++++---- 4 files changed, 8 insertions(+), 14 deletions(-) diff --git a/.dockstore.yml b/.dockstore.yml index 91a9e10cf6d..f7e7800447f 100644 --- a/.dockstore.yml +++ b/.dockstore.yml @@ -184,7 +184,7 @@ workflows: branches: - master - ah_var_store - - vs_1456_status_writes_bug + - rsa_vs_1218 tags: - /.*/ - name: GvsPrepareRangesCallset diff --git a/scripts/variantstore/docs/aou/AOU_DELIVERABLES.md b/scripts/variantstore/docs/aou/AOU_DELIVERABLES.md index b54de7ce43a..a1cb46ae9f6 100644 --- a/scripts/variantstore/docs/aou/AOU_DELIVERABLES.md +++ b/scripts/variantstore/docs/aou/AOU_DELIVERABLES.md @@ -6,7 +6,7 @@ - As described in the "Getting Started" of [Operational concerns for running Hail in Terra Cromwell/WDL](https://docs.google.com/document/d/1_OY2rKwZ-qKCDldSZrte4jRIZf4eAw2d7Jd-Asi50KE/edit?usp=sharing), this workspace will need permission in Terra to run Hail dataproc clusters within WDL. Contact Emily to request this access as part of setting up the new workspace. - There is a quota that needs to be upgraded for the process of Bulk Ingest. When we ingest data, we use the Write API, which is part of BQ’s Storage API. Since we are hitting this API with so much data all at once, we want to increase our CreateWriteStream quota. Follow the [Quota Request Template](workspace/CreateWriteStreamRequestIncreasedQuota.md). - Once that quota has been increased, the `load_data_batch` value needs to be updated based on calculations in the [Quota Request Template](workspace/CreateWriteStreamRequestIncreasedQuota.md) doc. Even if no increased quota is granted, this doc goes over how to choose the value for this param. + Once that quota has been increased, the `load_data_scatter_width` value needs to be updated based on that new quota (for information on what we did for Echo, see the "Calculate Quota To be Requested" section in the [Quota Request Template](workspace/CreateWriteStreamRequestIncreasedQuota.md) doc). - Create and push a feature branch (e.g. `EchoCallset`) based off the `ah_var_store` branch to the GATK GitHub repo. - Update the .dockstore.yml file on that feature branch to add the feature branch for all the WDLs that will be loaded into the workspace in the next step. - Once the requested workspace has been created and permissioned, populate with the following WDLs: diff --git a/scripts/variantstore/wdl/GvsBulkIngestGenomes.wdl b/scripts/variantstore/wdl/GvsBulkIngestGenomes.wdl index 566efafc077..cb65d93d77b 100644 --- a/scripts/variantstore/wdl/GvsBulkIngestGenomes.wdl +++ b/scripts/variantstore/wdl/GvsBulkIngestGenomes.wdl @@ -39,9 +39,7 @@ workflow GvsBulkIngestGenomes { # set to "NONE" to ingest all the reference data into GVS for VDS (instead of VCF) output String drop_state = "NONE" - # The larger the `load_data_batch_size` the greater the probability of preemptions and non-retryable BigQuery errors, - # so if specifying `load_data_batch_size`, adjust preemptible and maxretries accordingly. Or just take the defaults, as those should work fine in most cases. - Int? load_data_batch_size + Int? load_data_scatter_width Int? load_data_preemptible_override Int? 
load_data_maxretries_override String? billing_project_id @@ -131,11 +129,7 @@ workflow GvsBulkIngestGenomes { input_vcfs = SplitBulkImportFofn.vcf_file_name_fofn, input_vcf_indexes = SplitBulkImportFofn.vcf_index_file_name_fofn, interval_list = interval_list, - - # The larger the `load_data_batch_size` the greater the probability of preemptions and non-retryable - # BigQuery errors so if specifying this adjust preemptible and maxretries accordingly. Or just take the defaults, - # those should work fine in most cases. - load_data_batch_size = load_data_batch_size, + load_data_scatter_width = load_data_scatter_width, load_data_maxretries_override = load_data_maxretries_override, load_data_preemptible_override = load_data_preemptible_override, basic_docker = effective_basic_docker, diff --git a/scripts/variantstore/wdl/GvsImportGenomes.wdl b/scripts/variantstore/wdl/GvsImportGenomes.wdl index 7af8683244a..4a7b433135b 100644 --- a/scripts/variantstore/wdl/GvsImportGenomes.wdl +++ b/scripts/variantstore/wdl/GvsImportGenomes.wdl @@ -29,7 +29,7 @@ workflow GvsImportGenomes { # without going over Int beta_customer_max_scatter = 200 File interval_list = "gs://gcp-public-data--broad-references/hg38/v0/wgs_calling_regions.hg38.noCentromeres.noTelomeres.interval_list" - Int? load_data_batch_size + Int? load_data_scatter_width Int? load_data_preemptible_override Int? load_data_maxretries_override # At least one of these "load" inputs must be true @@ -76,17 +76,17 @@ workflow GvsImportGenomes { } } - if ((num_samples > max_auto_batch_size) && !(defined(load_data_batch_size))) { + if ((num_samples > max_auto_batch_size) && !(defined(load_data_scatter_width))) { call Utils.TerminateWorkflow as DieDueToTooManySamplesWithoutExplicitLoadDataBatchSize { input: - message = "Importing " + num_samples + " samples but 'load_data_batch_size' is not explicitly specified; the limit for auto batch-sizing is " + max_auto_batch_size + " for " + genome_type + " samples.", + message = "Importing " + num_samples + " samples but 'load_data_scatter_width' is not explicitly specified; the limit for auto batch-sizing is " + max_auto_batch_size + " for " + genome_type + " samples.", basic_docker = effective_basic_docker, } } # At least 1, per limits above not more than 20. # But if it's a beta customer, use the number computed above - Int effective_load_data_batch_size = if (defined(load_data_batch_size)) then select_first([load_data_batch_size]) + Int effective_load_data_batch_size = if (defined(load_data_scatter_width)) then select_first([num_samples / load_data_scatter_width]) else if num_samples < max_scatter_for_user then 1 else if is_wgs then num_samples / max_scatter_for_user else if num_samples < 5001 then (num_samples / (max_scatter_for_user * 2)) From 4966cff82ca9ccbd7a7aa08eeb392c72a8e0b66f Mon Sep 17 00:00:00 2001 From: George Grant Date: Fri, 27 Sep 2024 13:12:54 -0400 Subject: [PATCH 5/5] Gg vs 1990 a widdle cleanup (#8970) * Moving some wdls around. 
* Remove HailFromWdl.wdl --- .dockstore.yml | 30 +-- .../variantstore/docs/aou/AOU_DELIVERABLES.md | 2 +- .../Dockerfile | 0 .../GvsCreateVATFilesFromBigQuery.wdl | 0 .../GvsCreateVATfromVDS.wdl | 4 +- .../GvsValidateVAT.example.inputs.json | 0 .../GvsValidateVAT.wdl | 0 .../README.md | 4 +- .../Reference Disk Terra Opt In.png | Bin .../build_docker.sh | 0 .../custom_annotations_template.tsv | 0 scripts/variantstore/wdl/HailFromWdl.wdl | 249 ------------------ .../wdl/{ => old}/ImportArrayManifest.wdl | 2 +- .../wdl/{ => old}/ImportArrays.wdl | 2 +- .../{ => test}/GvsBenchmarkExtractTask.wdl | 2 +- .../wdl/{ => test}/GvsIngestTieout.wdl | 6 +- .../GvsQuickstartHailIntegration.wdl | 6 +- .../{ => test}/GvsQuickstartIntegration.wdl | 6 +- .../GvsQuickstartVcfIntegration.wdl | 4 +- .../wdl/{ => test}/GvsTieoutPgenToVcf.wdl | 0 .../{ => test}/GvsTieoutVcfMaxAltAlleles.wdl | 0 21 files changed, 31 insertions(+), 286 deletions(-) rename scripts/variantstore/{variant_annotations_table => variant-annotations-table}/Dockerfile (100%) rename scripts/variantstore/{variant_annotations_table => variant-annotations-table}/GvsCreateVATFilesFromBigQuery.wdl (100%) rename scripts/variantstore/{wdl => variant-annotations-table}/GvsCreateVATfromVDS.wdl (99%) rename scripts/variantstore/{variant_annotations_table => variant-annotations-table}/GvsValidateVAT.example.inputs.json (100%) rename scripts/variantstore/{variant_annotations_table => variant-annotations-table}/GvsValidateVAT.wdl (100%) rename scripts/variantstore/{variant_annotations_table => variant-annotations-table}/README.md (93%) rename scripts/variantstore/{variant_annotations_table => variant-annotations-table}/Reference Disk Terra Opt In.png (100%) rename scripts/variantstore/{variant_annotations_table => variant-annotations-table}/build_docker.sh (100%) rename scripts/variantstore/{variant_annotations_table => variant-annotations-table}/custom_annotations_template.tsv (100%) delete mode 100644 scripts/variantstore/wdl/HailFromWdl.wdl rename scripts/variantstore/wdl/{ => old}/ImportArrayManifest.wdl (99%) rename scripts/variantstore/wdl/{ => old}/ImportArrays.wdl (99%) rename scripts/variantstore/wdl/{ => test}/GvsBenchmarkExtractTask.wdl (99%) rename scripts/variantstore/wdl/{ => test}/GvsIngestTieout.wdl (97%) rename scripts/variantstore/wdl/{ => test}/GvsQuickstartHailIntegration.wdl (98%) rename scripts/variantstore/wdl/{ => test}/GvsQuickstartIntegration.wdl (99%) rename scripts/variantstore/wdl/{ => test}/GvsQuickstartVcfIntegration.wdl (99%) rename scripts/variantstore/wdl/{ => test}/GvsTieoutPgenToVcf.wdl (100%) rename scripts/variantstore/wdl/{ => test}/GvsTieoutVcfMaxAltAlleles.wdl (100%) diff --git a/.dockstore.yml b/.dockstore.yml index f7e7800447f..d3871bcd62d 100644 --- a/.dockstore.yml +++ b/.dockstore.yml @@ -101,7 +101,7 @@ workflows: - /.*/ - name: GvsBenchmarkExtractTask subclass: WDL - primaryDescriptorPath: /scripts/variantstore/wdl/GvsBenchmarkExtractTask.wdl + primaryDescriptorPath: /scripts/variantstore/wdl/test/GvsBenchmarkExtractTask.wdl filters: branches: - master @@ -200,7 +200,7 @@ workflows: - /.*/ - name: GvsCreateVATfromVDS subclass: WDL - primaryDescriptorPath: /scripts/variantstore/wdl/GvsCreateVATfromVDS.wdl + primaryDescriptorPath: /scripts/variantstore/wdl/variant-annotations-table/GvsCreateVATfromVDS.wdl filters: branches: - master @@ -209,7 +209,7 @@ workflows: - /.*/ - name: GvsCreateVATFilesFromBigQuery subclass: WDL - primaryDescriptorPath: 
/scripts/variantstore/variant_annotations_table/GvsCreateVATFilesFromBigQuery.wdl + primaryDescriptorPath: /scripts/variantstore/variant-annotations-table/GvsCreateVATFilesFromBigQuery.wdl filters: branches: - master @@ -218,9 +218,9 @@ workflows: - /.*/ - name: GvsValidateVat subclass: WDL - primaryDescriptorPath: /scripts/variantstore/variant_annotations_table/GvsValidateVAT.wdl + primaryDescriptorPath: /scripts/variantstore/variant-annotations-table/GvsValidateVAT.wdl testParameterFiles: - - /scripts/variantstore/variant_annotations_table/GvsValidateVat.example.inputs.json + - /scripts/variantstore/variant-annotations-table/GvsValidateVat.example.inputs.json filters: branches: - master @@ -285,7 +285,7 @@ workflows: - /.*/ - name: GvsQuickstartVcfIntegration subclass: WDL - primaryDescriptorPath: /scripts/variantstore/wdl/GvsQuickstartVcfIntegration.wdl + primaryDescriptorPath: /scripts/variantstore/wdl/test/GvsQuickstartVcfIntegration.wdl filters: branches: - master @@ -294,7 +294,7 @@ workflows: - /.*/ - name: GvsQuickstartHailIntegration subclass: WDL - primaryDescriptorPath: /scripts/variantstore/wdl/GvsQuickstartHailIntegration.wdl + primaryDescriptorPath: /scripts/variantstore/wdl/test/GvsQuickstartHailIntegration.wdl filters: branches: - master @@ -303,7 +303,7 @@ workflows: - /.*/ - name: GvsQuickstartIntegration subclass: WDL - primaryDescriptorPath: /scripts/variantstore/wdl/GvsQuickstartIntegration.wdl + primaryDescriptorPath: /scripts/variantstore/wdl/test/GvsQuickstartIntegration.wdl filters: branches: - master @@ -313,7 +313,7 @@ workflows: - /.*/ - name: GvsIngestTieout subclass: WDL - primaryDescriptorPath: /scripts/variantstore/wdl/GvsIngestTieout.wdl + primaryDescriptorPath: /scripts/variantstore/wdl/test/GvsIngestTieout.wdl filters: branches: - master @@ -358,21 +358,13 @@ workflows: - /.*/ - name: GvsTieoutVcfMaxAltAlleles subclass: WDL - primaryDescriptorPath: /scripts/variantstore/wdl/GvsTieoutVcfMaxAltAlleles.wdl + primaryDescriptorPath: /scripts/variantstore/wdl/test/GvsTieoutVcfMaxAltAlleles.wdl filters: branches: - ah_var_store - master tags: - /.*/ - - name: HailFromWdl - subclass: WDL - primaryDescriptorPath: /scripts/variantstore/wdl/HailFromWdl.wdl - filters: - branches: - - master - tags: - - /.*/ - name: MitochondriaPipeline subclass: WDL primaryDescriptorPath: /scripts/mitochondria_m2_wdl/MitochondriaPipeline.wdl @@ -448,7 +440,7 @@ workflows: - EchoCallset - name: GvsTieoutPgenToVcf subclass: WDL - primaryDescriptorPath: /scripts/variantstore/wdl/GvsTieoutPgenToVcf.wdl + primaryDescriptorPath: /scripts/variantstore/wdl/test/GvsTieoutPgenToVcf.wdl filters: branches: - ah_var_store diff --git a/scripts/variantstore/docs/aou/AOU_DELIVERABLES.md b/scripts/variantstore/docs/aou/AOU_DELIVERABLES.md index a1cb46ae9f6..ed2b601cc81 100644 --- a/scripts/variantstore/docs/aou/AOU_DELIVERABLES.md +++ b/scripts/variantstore/docs/aou/AOU_DELIVERABLES.md @@ -111,7 +111,7 @@ The Callset Stats and S&P files can be simply `gsutil cp`ed to the AoU delivery ## Running the VAT pipeline To create a BigQuery table of variant annotations, you may follow the instructions here: -[process to create variant annotations table](../../variant_annotations_table/README.md) +[process to create variant annotations table](../../variant-annotations-table/README.md) The pipeline takes in the VDS and outputs a variant annotations table in BigQuery. 
Once the VAT table is created and a tsv is exported, the AoU research workbench team should be notified of its creation and permission should be granted so that several members of the team have view permission. diff --git a/scripts/variantstore/variant_annotations_table/Dockerfile b/scripts/variantstore/variant-annotations-table/Dockerfile similarity index 100% rename from scripts/variantstore/variant_annotations_table/Dockerfile rename to scripts/variantstore/variant-annotations-table/Dockerfile diff --git a/scripts/variantstore/variant_annotations_table/GvsCreateVATFilesFromBigQuery.wdl b/scripts/variantstore/variant-annotations-table/GvsCreateVATFilesFromBigQuery.wdl similarity index 100% rename from scripts/variantstore/variant_annotations_table/GvsCreateVATFilesFromBigQuery.wdl rename to scripts/variantstore/variant-annotations-table/GvsCreateVATFilesFromBigQuery.wdl diff --git a/scripts/variantstore/wdl/GvsCreateVATfromVDS.wdl b/scripts/variantstore/variant-annotations-table/GvsCreateVATfromVDS.wdl similarity index 99% rename from scripts/variantstore/wdl/GvsCreateVATfromVDS.wdl rename to scripts/variantstore/variant-annotations-table/GvsCreateVATfromVDS.wdl index 9ddbe7094fc..805d324d861 100644 --- a/scripts/variantstore/wdl/GvsCreateVATfromVDS.wdl +++ b/scripts/variantstore/variant-annotations-table/GvsCreateVATfromVDS.wdl @@ -1,7 +1,7 @@ version 1.0 -import "GvsUtils.wdl" as Utils -import "../variant_annotations_table/GvsCreateVATFilesFromBigQuery.wdl" as GvsCreateVATFilesFromBigQuery +import "../wdl/GvsUtils.wdl" as Utils +import "GvsCreateVATFilesFromBigQuery.wdl" as GvsCreateVATFilesFromBigQuery workflow GvsCreateVATfromVDS { input { diff --git a/scripts/variantstore/variant_annotations_table/GvsValidateVAT.example.inputs.json b/scripts/variantstore/variant-annotations-table/GvsValidateVAT.example.inputs.json similarity index 100% rename from scripts/variantstore/variant_annotations_table/GvsValidateVAT.example.inputs.json rename to scripts/variantstore/variant-annotations-table/GvsValidateVAT.example.inputs.json diff --git a/scripts/variantstore/variant_annotations_table/GvsValidateVAT.wdl b/scripts/variantstore/variant-annotations-table/GvsValidateVAT.wdl similarity index 100% rename from scripts/variantstore/variant_annotations_table/GvsValidateVAT.wdl rename to scripts/variantstore/variant-annotations-table/GvsValidateVAT.wdl diff --git a/scripts/variantstore/variant_annotations_table/README.md b/scripts/variantstore/variant-annotations-table/README.md similarity index 93% rename from scripts/variantstore/variant_annotations_table/README.md rename to scripts/variantstore/variant-annotations-table/README.md index 111a19de57a..ec2cbff612f 100644 --- a/scripts/variantstore/variant_annotations_table/README.md +++ b/scripts/variantstore/variant-annotations-table/README.md @@ -5,8 +5,8 @@ The pipeline takes in a Hail Variant Dataset (VDS), creates a queryable table in ### VAT WDLs -- [GvsCreateVATfromVDS.wdl](/scripts/variantstore/wdl/GvsCreateVATfromVDS.wdl) creates a sites only VCF from a VDS and then uses that and an ancestry file TSV to build the variant annotations table. -- [GvsValidateVAT.wdl](/scripts/variantstore/variant_annotations_table/GvsValidateVAT.wdl) checks and validates the created VAT and prints a report of any failing validation. +- [GvsCreateVATfromVDS.wdl](/scripts/variantstore/variant-annotations-table/GvsCreateVATfromVDS.wdl) creates a sites only VCF from a VDS and then uses that and an ancestry file TSV to build the variant annotations table. 
+- [GvsValidateVAT.wdl](/scripts/variantstore/variant-annotations-table/GvsValidateVAT.wdl) checks and validates the created VAT and prints a report of any failing validation. ### Run GvsCreateVATfromVDS diff --git a/scripts/variantstore/variant_annotations_table/Reference Disk Terra Opt In.png b/scripts/variantstore/variant-annotations-table/Reference Disk Terra Opt In.png similarity index 100% rename from scripts/variantstore/variant_annotations_table/Reference Disk Terra Opt In.png rename to scripts/variantstore/variant-annotations-table/Reference Disk Terra Opt In.png diff --git a/scripts/variantstore/variant_annotations_table/build_docker.sh b/scripts/variantstore/variant-annotations-table/build_docker.sh similarity index 100% rename from scripts/variantstore/variant_annotations_table/build_docker.sh rename to scripts/variantstore/variant-annotations-table/build_docker.sh diff --git a/scripts/variantstore/variant_annotations_table/custom_annotations_template.tsv b/scripts/variantstore/variant-annotations-table/custom_annotations_template.tsv similarity index 100% rename from scripts/variantstore/variant_annotations_table/custom_annotations_template.tsv rename to scripts/variantstore/variant-annotations-table/custom_annotations_template.tsv diff --git a/scripts/variantstore/wdl/HailFromWdl.wdl b/scripts/variantstore/wdl/HailFromWdl.wdl deleted file mode 100644 index 4438525ca2d..00000000000 --- a/scripts/variantstore/wdl/HailFromWdl.wdl +++ /dev/null @@ -1,249 +0,0 @@ -version 1.0 -# Largely "borrowing" from Lee's work -# https://github.com/broadinstitute/aou-ancestry/blob/a57bbab3ccee4d06317fecb8ca109424bca373b7/script/wdl/hail_in_wdl/filter_VDS_and_shard_by_contig.wdl - -# -# Given a VDS and a bed file, render a VCF (sharded by chromosome). -# All bed files referenced in this WDL are UCSC bed files (as opposed to PLINK bed files). -# -# This has not been tested on any reference other than hg38. -# Inputs: -# -# ## ANALYSIS PARAMETERS -# # ie, parameters that go to the Hail python code (submission_script below) -# String vds_url -# -# # Genomic region for the output VCFs to cover -# String bed_url -# -# # VCF Header that will be used in the output -# String vcf_header_url -# -# # Contigs of interest. If a contig is present in the bed file, but not in this list, the contig will be ignored. -# # In other words, this is a contig level intersection with the bed file. -# # This list of contigs that must be present in the reference. Each contig will be processed separately (shard) -# # This list should be ordered. Eg, ["chr21", "chr22"] -# Array[String] contigs -# -# # String used in construction of output filename -# # Cannot contain any special characters, ie, characters must be alphanumeric or "-" -# String prefix -# -# ## CLUSTER PARAMETERS -# # Number of workers (per shard) to use in the Hail cluster. -# Int num_workers -# -# # Set to 'subnetwork' if running in Terra Cromwell -# String gcs_subnetwork_name='subnetwork' -# -# # The script that is run on the cluster -# # See filter_VDS_and_shard_by_contig.py for an example. -# File submission_script -# -# # Set to "us-central1" if running in Terra Cromwell -# String region = "us-central1" -# -# ## VM PARAMETERS -# # Please note that there is a RuntimeAttr struct and a task parameter that can be used to override the defaults -# # of the VM. These are task parameters. -# # However, since this can be a lightweight VM, overriding is unlikely to be necessary. -# -# # The docker to be used on the VM. 
This will need both Hail and Google Cloud SDK installed. -# String hail_docker="us.gcr.io/broad-dsde-methods/lichtens/hail_dataproc_wdl:1.0" -# -# Important notes: -# - Hail will save the VCFs in the cloud. You will need to provide this storage space. In other words, the runtime -# parameters must have enough storage space to support a single contig -# - This WDL script is still dependent on the python/Hail script that it calls. You will see this when the parameters -# are passed into the script. -# - This WDL is boilerplate, except for input parameters, output parameters, and where marked in the main task. -# - We HIGHLY recommend that the WDL is NOT run on a preemptible VM -# (reminder, this is a single VM that spins up the dataproc cluster and submits jobs -- it is not doing any of the -# actual computation. In other words, it does not need to be a heavy machine.) -# In other words, always set `preemptible_tries` to zero (default). -# - -import "GvsUtils.wdl" as Utils - -struct RuntimeAttr { - Float? mem_gb - Int? cpu_cores - Int? disk_gb - Int? boot_disk_gb - Int? preemptible_tries - Int? max_retries -} - -workflow filter_vds_to_VCF_by_chr { - ### Change here: You will need to specify all parameters (both analysis and runtime) that need to go to the - # cluster, VM spinning up the cluster, and the script being run on the cluster. - input { - - ## ANALYSIS PARAMETERS - # ie, parameters that go to the Hail python code (submission_script below) - String vds_url - - String? git_branch_or_tag - String? hail_version - String? worker_machine_type - - # Genomic region for the output VCFs to cover - String bed_url = "gs://broad-public-datasets/gvs/weights/gvs_vet_weights_1kb.bed" - - # VCF Header that will be used in the output - String vcf_header_url = "gs://gvs_quickstart_storage/hail_from_wdl/vcf_header.txt" - - # Contigs of interest. If a contig is present in the bed file, but not in this list, the contig will be ignored. - # In other words, this is a contig level intersection with the bed file. - # This list of contigs that must be present in the reference. Each contig will be processed separately (shard) - # This list should be ordered. Eg, ["chr21", "chr22"] - Array[String] contigs = ["chr20"] - - # String used in construction of output filename - # Cannot contain any special characters, ie, characters must be alphanumeric or "-" - String prefix = "hail-from-wdl" - - ## CLUSTER PARAMETERS - # Number of workers (per shard) to use in the Hail cluster. - Int num_workers = 10 - - # Set to 'subnetwork' if running in Terra Cromwell - String gcs_subnetwork_name = 'subnetwork' - - # The script that is run on the cluster - # See filter_VDS_and_shard_by_contig.py for an example. - File? 
submission_script - - # Set to "us-central1" if running in Terra Cromwell - String region = "us-central1" - } - - call Utils.GetToolVersions - - scatter (contig in contigs) { - call filter_vds_and_export_as_vcf { - input: - vds_url = vds_url, - bed_url = bed_url, - contig = contig, - prefix = prefix, - gcs_project = GetToolVersions.google_project, - num_workers = num_workers, - gcs_subnetwork_name = gcs_subnetwork_name, - vcf_header_url = vcf_header_url, - git_branch_or_tag = git_branch_or_tag, - hail_version = hail_version, - worker_machine_type = worker_machine_type, - submission_script = submission_script, - cloud_sdk_slim_docker = GetToolVersions.cloud_sdk_slim_docker, - region = region, - } - } - - output { - Array[File] vcfs = filter_vds_and_export_as_vcf.vcf - } -} - -task filter_vds_and_export_as_vcf { - input { - # You must treat a VDS as a String, since it is a directory and not a single file - String vds_url - String bed_url - String vcf_header_url - - String? git_branch_or_tag - File? submission_script - String? hail_version - String? worker_machine_type - - # contig must be in the reference - String contig - String prefix - String gcs_project - String region = "us-central1" - Int num_workers - RuntimeAttr? runtime_attr_override - String gcs_subnetwork_name - - String cloud_sdk_slim_docker - } - - RuntimeAttr runtime_default = object { - mem_gb: 30, - disk_gb: 100, - cpu_cores: 1, - preemptible_tries: 0, - max_retries: 0, - boot_disk_gb: 10 - } - RuntimeAttr runtime_override = select_first([runtime_attr_override, runtime_default]) - - String default_script_filename = "filter_VDS_and_shard_by_contig.py" - - command <<< - # Prepend date, time and pwd to xtrace log entries. - PS4='\D{+%F %T} \w $ ' - set -o errexit -o nounset -o pipefail -o xtrace - - account_name=$(gcloud config list account --format "value(core.account)") - - pip3 install --upgrade pip - pip3 install hail~{'==' + hail_version} - pip3 install --upgrade google-cloud-dataproc ijson - - if [[ -z "~{git_branch_or_tag}" && -z "~{submission_script}" ]] || [[ ! -z "~{git_branch_or_tag}" && ! -z "~{submission_script}" ]] - then - echo "Must specify git_branch_or_tag XOR submission_script" - exit 1 - elif [[ ! -z "~{git_branch_or_tag}" ]] - then - script_url="https://raw.githubusercontent.com/broadinstitute/gatk/~{git_branch_or_tag}/scripts/variantstore/wdl/extract/~{default_script_filename}" - curl --silent --location --remote-name "${script_url}" - fi - - if [[ ! 
-z "~{submission_script}" ]] - then - script_path="~{submission_script}" - else - script_path="~{default_script_filename}" - fi - - # Generate a UUIDish random hex string of <8 hex chars (4 bytes)>-<4 hex chars (2 bytes)> - hex="$(head -c4 < /dev/urandom | xxd -p)-$(head -c2 < /dev/urandom | xxd -p)" - - cluster_name="~{prefix}-~{contig}-hail-${hex}" - echo ${cluster_name} > cluster_name.txt - - python3 /app/run_in_hail_cluster.py \ - --script-path ${script_path} \ - --account ${account_name} \ - --num-workers ~{num_workers} \ - ~{'--worker-machine-type' + worker_machine_type} \ - --region ~{region} \ - --gcs-project ~{gcs_project} \ - --cluster-name ${cluster_name} \ - --prefix ~{prefix} \ - --contig ~{contig} \ - --vds-url ~{vds_url} \ - --vcf-header-url ~{vcf_header_url} \ - --bed-url ~{bed_url} - - echo "Complete" - >>> - - output { - String cluster_name = read_string("cluster_name.txt") - File vcf = "~{prefix}.~{contig}.vcf.bgz" - } - - runtime { - memory: select_first([runtime_override.mem_gb, runtime_default.mem_gb]) + " GB" - disks: "local-disk " + select_first([runtime_override.disk_gb, runtime_default.disk_gb]) + " SSD" - cpu: select_first([runtime_override.cpu_cores, runtime_default.cpu_cores]) - preemptible: select_first([runtime_override.preemptible_tries, runtime_default.preemptible_tries]) - maxRetries: select_first([runtime_override.max_retries, runtime_default.max_retries]) - docker: cloud_sdk_slim_docker - bootDiskSizeGb: select_first([runtime_override.boot_disk_gb, runtime_default.boot_disk_gb]) - } -} diff --git a/scripts/variantstore/wdl/ImportArrayManifest.wdl b/scripts/variantstore/wdl/old/ImportArrayManifest.wdl similarity index 99% rename from scripts/variantstore/wdl/ImportArrayManifest.wdl rename to scripts/variantstore/wdl/old/ImportArrayManifest.wdl index 6520571618d..7d9bf2d7c0c 100644 --- a/scripts/variantstore/wdl/ImportArrayManifest.wdl +++ b/scripts/variantstore/wdl/old/ImportArrayManifest.wdl @@ -1,6 +1,6 @@ version 1.0 -import "GvsUtils.wdl" as Utils +import "../GvsUtils.wdl" as Utils workflow ImportArrayManifest { diff --git a/scripts/variantstore/wdl/ImportArrays.wdl b/scripts/variantstore/wdl/old/ImportArrays.wdl similarity index 99% rename from scripts/variantstore/wdl/ImportArrays.wdl rename to scripts/variantstore/wdl/old/ImportArrays.wdl index 808473b89e7..b13098ed760 100644 --- a/scripts/variantstore/wdl/ImportArrays.wdl +++ b/scripts/variantstore/wdl/old/ImportArrays.wdl @@ -1,6 +1,6 @@ version 1.0 -import "GvsUtils.wdl" as Utils +import "../GvsUtils.wdl" as Utils workflow ImportArrays { diff --git a/scripts/variantstore/wdl/GvsBenchmarkExtractTask.wdl b/scripts/variantstore/wdl/test/GvsBenchmarkExtractTask.wdl similarity index 99% rename from scripts/variantstore/wdl/GvsBenchmarkExtractTask.wdl rename to scripts/variantstore/wdl/test/GvsBenchmarkExtractTask.wdl index 39c3f85630c..1a956ea5f03 100644 --- a/scripts/variantstore/wdl/GvsBenchmarkExtractTask.wdl +++ b/scripts/variantstore/wdl/test/GvsBenchmarkExtractTask.wdl @@ -1,6 +1,6 @@ version 1.0 -import "GvsUtils.wdl" as Utils +import "../GvsUtils.wdl" as Utils workflow GvsBenchmarkExtractTask { input { diff --git a/scripts/variantstore/wdl/GvsIngestTieout.wdl b/scripts/variantstore/wdl/test/GvsIngestTieout.wdl similarity index 97% rename from scripts/variantstore/wdl/GvsIngestTieout.wdl rename to scripts/variantstore/wdl/test/GvsIngestTieout.wdl index 68d8d8c6c26..bd4d74dc055 100644 --- a/scripts/variantstore/wdl/GvsIngestTieout.wdl +++ b/scripts/variantstore/wdl/test/GvsIngestTieout.wdl @@ 
-1,8 +1,8 @@ version 1.0 -import "GvsAssignIds.wdl" as GvsAssignIds -import "GvsImportGenomes.wdl" as GvsImportGenomes -import "GvsUtils.wdl" as Utils +import "../GvsAssignIds.wdl" as GvsAssignIds +import "../GvsImportGenomes.wdl" as GvsImportGenomes +import "../GvsUtils.wdl" as Utils workflow GvsIngestTieout { input { diff --git a/scripts/variantstore/wdl/GvsQuickstartHailIntegration.wdl b/scripts/variantstore/wdl/test/GvsQuickstartHailIntegration.wdl similarity index 98% rename from scripts/variantstore/wdl/GvsQuickstartHailIntegration.wdl rename to scripts/variantstore/wdl/test/GvsQuickstartHailIntegration.wdl index 85a006e55dd..a5758b72625 100644 --- a/scripts/variantstore/wdl/GvsQuickstartHailIntegration.wdl +++ b/scripts/variantstore/wdl/test/GvsQuickstartHailIntegration.wdl @@ -1,8 +1,8 @@ version 1.0 -import "GvsUtils.wdl" as Utils -import "GvsExtractAvroFilesForHail.wdl" as ExtractAvroFilesForHail -import "GvsCreateVDS.wdl" as CreateVds +import "../GvsUtils.wdl" as Utils +import "../GvsExtractAvroFilesForHail.wdl" as ExtractAvroFilesForHail +import "../GvsCreateVDS.wdl" as CreateVds import "GvsQuickstartVcfIntegration.wdl" as QuickstartVcfIntegration workflow GvsQuickstartHailIntegration { diff --git a/scripts/variantstore/wdl/GvsQuickstartIntegration.wdl b/scripts/variantstore/wdl/test/GvsQuickstartIntegration.wdl similarity index 99% rename from scripts/variantstore/wdl/GvsQuickstartIntegration.wdl rename to scripts/variantstore/wdl/test/GvsQuickstartIntegration.wdl index 9515fbefe55..f40e33223e2 100644 --- a/scripts/variantstore/wdl/GvsQuickstartIntegration.wdl +++ b/scripts/variantstore/wdl/test/GvsQuickstartIntegration.wdl @@ -2,8 +2,10 @@ version 1.0 import "GvsQuickstartVcfIntegration.wdl" as QuickstartVcfIntegration import "GvsQuickstartHailIntegration.wdl" as QuickstartHailIntegration -import "GvsJointVariantCalling.wdl" as JointVariantCalling -import "GvsUtils.wdl" as Utils +import "../GvsJointVariantCalling.wdl" as JointVariantCalling +import "../GvsUtils.wdl" as Utils + +# comment workflow GvsQuickstartIntegration { input { diff --git a/scripts/variantstore/wdl/GvsQuickstartVcfIntegration.wdl b/scripts/variantstore/wdl/test/GvsQuickstartVcfIntegration.wdl similarity index 99% rename from scripts/variantstore/wdl/GvsQuickstartVcfIntegration.wdl rename to scripts/variantstore/wdl/test/GvsQuickstartVcfIntegration.wdl index 375229db061..96932264a36 100644 --- a/scripts/variantstore/wdl/GvsQuickstartVcfIntegration.wdl +++ b/scripts/variantstore/wdl/test/GvsQuickstartVcfIntegration.wdl @@ -1,7 +1,7 @@ version 1.0 -import "GvsUtils.wdl" as Utils -import "GvsJointVariantCalling.wdl" as JointVariantCalling +import "../GvsUtils.wdl" as Utils +import "../GvsJointVariantCalling.wdl" as JointVariantCalling workflow GvsQuickstartVcfIntegration { input { diff --git a/scripts/variantstore/wdl/GvsTieoutPgenToVcf.wdl b/scripts/variantstore/wdl/test/GvsTieoutPgenToVcf.wdl similarity index 100% rename from scripts/variantstore/wdl/GvsTieoutPgenToVcf.wdl rename to scripts/variantstore/wdl/test/GvsTieoutPgenToVcf.wdl diff --git a/scripts/variantstore/wdl/GvsTieoutVcfMaxAltAlleles.wdl b/scripts/variantstore/wdl/test/GvsTieoutVcfMaxAltAlleles.wdl similarity index 100% rename from scripts/variantstore/wdl/GvsTieoutVcfMaxAltAlleles.wdl rename to scripts/variantstore/wdl/test/GvsTieoutVcfMaxAltAlleles.wdl
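The first commit above documents the `GvsCallsetCost` workflow and shows a per-run query against the `cost_observability` table. For ad-hoc checks, a minimal sketch that rolls the same `event_bytes` data up by GVS step, using only the columns shown in that query and the same $5 / TiB on-demand rate, could look like the following; any extra filtering (for example by callset or date range) is left out because it would rest on assumptions about the table's wider schema.

```sql
-- Sketch: approximate on-demand BigQuery scan cost per GVS step at $5 / TiB,
-- using the cost_observability columns referenced in Cost.md (step, event_bytes).
SELECT
  step,
  ROUND(SUM(event_bytes) / POW(1024, 4), 2) AS tib_scanned,
  ROUND(SUM(event_bytes) * (5 / POW(1024, 4)), 2) AS approx_cost_usd
FROM
  `aou-genomics-curation-prod.aou_wgs_fullref_v2.cost_observability`
GROUP BY
  step
ORDER BY
  approx_cost_usd DESC;
```

This is a rough on-demand scan estimate only; it does not include storage, egress, or Cromwell VM costs, which is why the delivered cost figures also draw on the Terra UI workflow costs and the spreadsheet linked from `Cost.md`.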