
Commit

Merge remote-tracking branch 'origin/ah_var_store' into gg_VS-1422_MergeVATChangesInEchoIntoAhVarStore
gbggrant committed Sep 30, 2024
2 parents a9a44d2 + 4966cff commit 2e2e793
Showing 26 changed files with 140 additions and 412 deletions.
32 changes: 12 additions & 20 deletions .dockstore.yml
@@ -101,7 +101,7 @@ workflows:
- /.*/
- name: GvsBenchmarkExtractTask
subclass: WDL
- primaryDescriptorPath: /scripts/variantstore/wdl/GvsBenchmarkExtractTask.wdl
+ primaryDescriptorPath: /scripts/variantstore/wdl/test/GvsBenchmarkExtractTask.wdl
filters:
branches:
- master
@@ -184,7 +184,7 @@ workflows:
branches:
- master
- ah_var_store
- - vs_1456_status_writes_bug
+ - rsa_vs_1218
tags:
- /.*/
- name: GvsPrepareRangesCallset
@@ -200,7 +200,7 @@
- /.*/
- name: GvsCreateVATfromVDS
subclass: WDL
- primaryDescriptorPath: /scripts/variantstore/wdl/GvsCreateVATfromVDS.wdl
+ primaryDescriptorPath: /scripts/variantstore/wdl/variant-annotations-table/GvsCreateVATfromVDS.wdl
filters:
branches:
- master
@@ -209,7 +209,7 @@
- /.*/
- name: GvsCreateVATFilesFromBigQuery
subclass: WDL
- primaryDescriptorPath: /scripts/variantstore/variant_annotations_table/GvsCreateVATFilesFromBigQuery.wdl
+ primaryDescriptorPath: /scripts/variantstore/variant-annotations-table/GvsCreateVATFilesFromBigQuery.wdl
filters:
branches:
- master
@@ -218,9 +218,9 @@
- /.*/
- name: GvsValidateVat
subclass: WDL
- primaryDescriptorPath: /scripts/variantstore/variant_annotations_table/GvsValidateVAT.wdl
+ primaryDescriptorPath: /scripts/variantstore/variant-annotations-table/GvsValidateVAT.wdl
testParameterFiles:
- - /scripts/variantstore/variant_annotations_table/GvsValidateVat.example.inputs.json
+ - /scripts/variantstore/variant-annotations-table/GvsValidateVat.example.inputs.json
filters:
branches:
- master
@@ -285,7 +285,7 @@
- /.*/
- name: GvsQuickstartVcfIntegration
subclass: WDL
- primaryDescriptorPath: /scripts/variantstore/wdl/GvsQuickstartVcfIntegration.wdl
+ primaryDescriptorPath: /scripts/variantstore/wdl/test/GvsQuickstartVcfIntegration.wdl
filters:
branches:
- master
@@ -294,7 +294,7 @@
- /.*/
- name: GvsQuickstartHailIntegration
subclass: WDL
- primaryDescriptorPath: /scripts/variantstore/wdl/GvsQuickstartHailIntegration.wdl
+ primaryDescriptorPath: /scripts/variantstore/wdl/test/GvsQuickstartHailIntegration.wdl
filters:
branches:
- master
@@ -303,7 +303,7 @@
- /.*/
- name: GvsQuickstartIntegration
subclass: WDL
- primaryDescriptorPath: /scripts/variantstore/wdl/GvsQuickstartIntegration.wdl
+ primaryDescriptorPath: /scripts/variantstore/wdl/test/GvsQuickstartIntegration.wdl
filters:
branches:
- master
@@ -313,7 +313,7 @@
- /.*/
- name: GvsIngestTieout
subclass: WDL
- primaryDescriptorPath: /scripts/variantstore/wdl/GvsIngestTieout.wdl
+ primaryDescriptorPath: /scripts/variantstore/wdl/test/GvsIngestTieout.wdl
filters:
branches:
- master
@@ -358,21 +358,13 @@ workflows:
- /.*/
- name: GvsTieoutVcfMaxAltAlleles
subclass: WDL
- primaryDescriptorPath: /scripts/variantstore/wdl/GvsTieoutVcfMaxAltAlleles.wdl
+ primaryDescriptorPath: /scripts/variantstore/wdl/test/GvsTieoutVcfMaxAltAlleles.wdl
filters:
branches:
- ah_var_store
- master
tags:
- /.*/
- - name: HailFromWdl
- subclass: WDL
- primaryDescriptorPath: /scripts/variantstore/wdl/HailFromWdl.wdl
- filters:
- branches:
- - master
- tags:
- - /.*/
- name: MitochondriaPipeline
subclass: WDL
primaryDescriptorPath: /scripts/mitochondria_m2_wdl/MitochondriaPipeline.wdl
@@ -448,7 +440,7 @@
- EchoCallset
- name: GvsTieoutPgenToVcf
subclass: WDL
- primaryDescriptorPath: /scripts/variantstore/wdl/GvsTieoutPgenToVcf.wdl
+ primaryDescriptorPath: /scripts/variantstore/wdl/test/GvsTieoutPgenToVcf.wdl
filters:
branches:
- ah_var_store
26 changes: 20 additions & 6 deletions scripts/variantstore/docs/aou/AOU_DELIVERABLES.md
@@ -6,7 +6,7 @@
- As described in the "Getting Started" of [Operational concerns for running Hail in Terra Cromwell/WDL](https://docs.google.com/document/d/1_OY2rKwZ-qKCDldSZrte4jRIZf4eAw2d7Jd-Asi50KE/edit?usp=sharing), this workspace will need permission in Terra to run Hail dataproc clusters within WDL. Contact Emily to request this access as part of setting up the new workspace.
- There is a quota that needs to be increased for the Bulk Ingest process.
When we ingest data, we use the Write API, which is part of BQ’s Storage API. Since we are hitting this API with so much data all at once, we want to increase our CreateWriteStream quota. Follow the [Quota Request Template](workspace/CreateWriteStreamRequestIncreasedQuota.md).
- Once that quota has been increased, the `load_data_batch` value needs to be updated based on calculations in the [Quota Request Template](workspace/CreateWriteStreamRequestIncreasedQuota.md) doc. Even if no increased quota is granted, this doc goes over how to choose the value for this param.
+ Once that quota has been increased, the `load_data_scatter_width` value needs to be updated based on that new quota (for information on what we did for Echo, see the "Calculate Quota To be Requested" section in the [Quota Request Template](workspace/CreateWriteStreamRequestIncreasedQuota.md) doc).
- Create and push a feature branch (e.g. `EchoCallset`) based off the `ah_var_store` branch to the GATK GitHub repo.
- Update the .dockstore.yml file on that feature branch to add the feature branch to the branch filters of all the WDLs that will be loaded into the workspace in the next step.
- Once the requested workspace has been created and permissioned, populate with the following WDLs:
@@ -88,17 +88,30 @@
1. `GvsCallsetCost` workflow
- This workflow calculates the total BigQuery cost of generating this callset with the above GVS workflows (a cost that is not represented in the Terra UI total workflow cost); it reports the cost both for the callset as a whole and per sample.


## Internal sign-off protocol

The Variants team currently has the following VDS internal sign-off protocol:

1. Generate a VDS for the candidate callset into the "delivery" bucket.
1. Open up the VDS in a [beefy](vds/cluster/AoU%20VDS%20Cluster%20Configuration.md) notebook and confirm the "shape" looks right.
1. Run `GvsPrepareRangesCallset.wdl` to generate a prepare table of VET data.
1. Run `GvsCallsetStatistics.wdl` to generate callset statistics for the candidate callset using the prepare VET table created in the preceding step.
1. Copy the output of `GvsCallsetStatistics.wdl` into the "delivery" bucket.
1. Email the paths to the VDS and callset statistics to Lee/Wail for QA / approval.


## Main Deliverables (via email to stakeholders once the above steps are complete)

Since they are much smaller, the Callset Stats and S&P files can simply be copied to the AoU delivery bucket with `gsutil cp`.
1. GCS location of the VDS in the AoU delivery bucket
- 2. Fully qualified name of the BigQuery dataset (composed of the `project_id` and `dataset_name` inputs from the workflows)
- 3. GCS location of the CSV output from `GvsCallsetStatistics` workflow in the AoU delivery bucket
- 4. GCS location of the TSV output from `GvsCalculatePrecisionAndSensitivity` in the AoU delivery bucket
+ 1. Fully qualified name of the BigQuery dataset (composed of the `project_id` and `dataset_name` inputs from the workflows)
+ 1. GCS location of the CSV output from `GvsCallsetStatistics` workflow in the AoU delivery bucket
+ 1. GCS location of the TSV output from `GvsCalculatePrecisionAndSensitivity` in the AoU delivery bucket

## Running the VAT pipeline
To create a BigQuery table of variant annotations, you may follow the instructions here:
- [process to create variant annotations table](../../variant_annotations_table/README.md)
+ [process to create variant annotations table](../../variant-annotations-table/README.md)
The pipeline takes in the VDS and outputs a variant annotations table in BigQuery.

Once the VAT table is created and a TSV is exported, notify the AoU research workbench team of its creation and grant view permission to the relevant members of that team.
@@ -159,6 +172,7 @@ You can take advantage of our existing sub-cohort WDL, `GvsExtractCohortFromSamp
- Specify the same `call_set_identifier`, `dataset_name`, `project_id`, `extract_table_prefix`, and `interval_list` that were used in the `GvsPrepareRangesCallset` run documented above.
- Specify the `interval_weights_bed` appropriate for the PGEN extraction run you are performing. `gs://gvs_quickstart_storage/weights/gvs_full_vet_weights_1kb_padded_orig.bed` is the interval weights BED used for Quickstart.
- Select the workflow option "Retry with more memory" and choose a "Memory retry factor" of 1.5
- Set the `extract_maxretries_override` input to 5, `split_intervals_disk_size_override` to 1000, `scatter_count` to 25000, and `y_bed_weight_scaling` to 8 to start (see the inputs sketch after this list); you will likely have to adjust one or more of these values in subsequent attempts.
- `GvsExtractCallsetPgen` currently defaults to 100 alt alleles maximum, which means that any sites having more than that number of alt alleles will be dropped.
- Be sure to set the `output_gcs_dir` to the proper path in the AoU delivery bucket so you don't need to copy the output files there yourself once the workflow has finished.
- For `GvsExtractCallsetPgen` (which is called by `GvsExtractCallsetPgenMerged`), if one (or several) of the `PgenExtractTask` shards fail because of angry cloud, you can re-run the workflow with the exact same inputs with call caching turned on; the successful shards will cache and only the failed ones will re-run.
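Below is a minimal sketch, in Python, of how the starting-point overrides listed above could be captured as a workflow inputs JSON. The `GvsExtractCallsetPgenMerged.` key prefix and the output filename are illustrative assumptions; match them to the actual workflow inputs shown in Terra.

```python
import json

# Hypothetical sketch: starting-point overrides for the PGEN extraction run described above.
# The "GvsExtractCallsetPgenMerged." key prefix is an assumption; adjust it to the workflow's
# actual fully qualified input names as shown in Terra.
pgen_extract_overrides = {
    "GvsExtractCallsetPgenMerged.extract_maxretries_override": 5,
    "GvsExtractCallsetPgenMerged.split_intervals_disk_size_override": 1000,
    "GvsExtractCallsetPgenMerged.scatter_count": 25000,
    "GvsExtractCallsetPgenMerged.y_bed_weight_scaling": 8,
}

# Write the overrides to a JSON file that can be merged into the rest of the workflow inputs.
with open("pgen_extract_overrides.json", "w") as f:
    json.dump(pgen_extract_overrides, f, indent=2)
```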
@@ -213,4 +227,4 @@ Once the VAT has been created, you will need to create a database table mapping

```
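-- Check for VAT vids that are missing from the mapping table; if the mapping is complete, this should return no rows.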
select distinct vid from `<dataset>.<vat_table_name>` where vid not in (select vid from `<dataset>.<mapping_table_name>`) ;
```
53 changes: 53 additions & 0 deletions scripts/variantstore/docs/aou/cleanup/Cost.md
@@ -0,0 +1,53 @@
# Storage vs Regeneration Costs

## Prepare tables

These numbers assume the `GvsPrepareRangesCallset` workflow is invoked with the `only_output_vet_tables` input set
to `true`. If this is not the case, meaning the prepare version of the ref ranges table was also generated, all costs
below should be multiplied by about 4:

* Running
GvsPrepareRangesCallset: [$429.18](https://docs.google.com/spreadsheets/d/1fcmEVWvjsx4XFLT9ZUsruUznnlB94xKgDIIyCGu6ryQ/edit#gid=0)
```
-- Look in the cost observability table for the bytes scanned for the appropriate run of `GvsPrepareRanges`.
SELECT
ROUND(event_bytes * (5 / POW(1024, 4)), 2) AS cost, -- $5 / TiB on demand https://cloud.google.com/bigquery/pricing#on_demand_pricing
call_start_timestamp
FROM
`aou-genomics-curation-prod.aou_wgs_fullref_v2.cost_observability`
WHERE
step = 'GvsPrepareRanges'
ORDER BY call_start_timestamp DESC
```
* Storing prepare data: $878.39 / month
* Assuming compressed pricing, multiply the number of physical bytes by $0.026 / GiB / month.
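As a rough illustration of that calculation, here is a small Python sketch. The byte count is a placeholder back-calculated from the $878.39 figure above, not a measured value; substitute the actual total physical bytes of the prepare VET tables (for example, from BigQuery's `INFORMATION_SCHEMA.TABLE_STORAGE` view).

```python
# Rough sketch of the compressed (physical) storage estimate above.
PHYSICAL_STORAGE_USD_PER_GIB_MONTH = 0.026  # BigQuery physical storage rate cited above

# Placeholder: back-calculated from the ~$878.39 / month figure, not a measured value.
total_physical_bytes = 36_300_000_000_000

gib = total_physical_bytes / 1024**3
monthly_cost = gib * PHYSICAL_STORAGE_USD_PER_GIB_MONTH
print(f"~${monthly_cost:,.2f} / month for {gib:,.0f} GiB of physical storage")
```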

## Avro files

The Avro files generated from the Delta callset onward are very large, several times the size of the final Hail VDS.
For the ~250K sample Delta callset the Avro files consumed nearly 80 TiB of GCS storage while the delivered VDS was
"only" about 26 TiB.

Approximate figures for the ~250K sample Delta callset:

* Avro storage cost: $1568 / month (might be lower if we can get a colder bucket to copy them into)
* `76.61 TiB gs://fc-secure-fb908548-fe3c-41d6-adaf-7ac20d541375/submissions/c86a6e8f-71a1-4c38-9e6a-f5229520641e/GvsExtractAvroFilesForHail/efb3dbe8-13e9-4542-8b69-02237ec77ca5/call-OutputPath/avro`
* [Avro generation cost](https://docs.google.com/spreadsheets/d/1fcmEVWvjsx4XFLT9ZUsruUznnlB94xKgDIIyCGu6ryQ/edit#gid=0):
$3000, 12 hours runtime.
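For reference, the monthly Avro storage figure above is consistent with standard GCS storage pricing of roughly $0.02 per GiB-month (an assumed rate; a colder storage class would be cheaper), as this quick check shows:

```python
# Quick check of the Avro storage cost above: 76.61 TiB at an assumed standard GCS
# storage rate of about $0.02 per GiB-month (colder storage classes cost less).
STANDARD_STORAGE_USD_PER_GIB_MONTH = 0.02

avro_gib = 76.61 * 1024
monthly_cost = avro_gib * STANDARD_STORAGE_USD_PER_GIB_MONTH
print(f"~${monthly_cost:,.0f} / month")  # ~$1,569 / month, in line with the $1568 quoted above
```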

## Hail VariantDataset (VDS)

The Hail VDS generated for the Delta callset consumes about 26 TiB of space in GCS at a cost of approximately $500 /
month. Recreating the VDS from Avro files would take around 10 hours at about $100 / hour in cluster time for a total of
about $1000. Note that re-creating the VDS requires Avro files; if we have not retained the Avro files per the step
above, we would need to regenerate those as well, which would add significantly to the cost.

Approximate figures for the ~250K sample Delta callset:

* VDS storage cost: ~$500 / month. Note that AoU should have exact copies of the VDSes we have delivered for Delta, though it's not certain that these copies will remain accessible to the Variants team in the long term. The delivered VDSes are placed in `gs://prod-drc-broad/`, and we have noted that we need them to remain there for hot-fixes. The Variants team has generated five versions of the Delta VDS so far, one of which (the original) still exists:
  * First version of the callset, which includes many samples that were later removed: `gs://fc-secure-fb908548-fe3c-41d6-adaf-7ac20d541375/vds/2022-10-19/dead_alleles_removed_vs_667_249047_samples/gvs_export.vds`
* VDS regeneration cost: $1000 (~10 hours @ ~$100 / hour cluster cost) + $3000 to regenerate Avro files if necessary.
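To make the storage-versus-regeneration trade-off concrete, here is a small break-even sketch using only the approximate figures quoted in this document:

```python
# Break-even sketch: how many months of storage equal the one-time cost of regeneration,
# using the approximate figures above for the ~250K sample Delta callset.
avro_storage_per_month = 1568   # USD / month
avro_regeneration = 3000        # USD one-time (plus ~12 hours of runtime)

vds_storage_per_month = 500     # USD / month
vds_regeneration = 1000         # USD one-time, assuming the Avro files are still available

print(f"Avro break-even: {avro_regeneration / avro_storage_per_month:.1f} months of storage")
print(f"VDS break-even (Avro retained): {vds_regeneration / vds_storage_per_month:.1f} months of storage")

# If the Avro files have been deleted, rebuilding the VDS also requires regenerating them first.
vds_regen_without_avro = vds_regeneration + avro_regeneration
print(f"VDS break-even (Avro deleted): {vds_regen_without_avro / vds_storage_per_month:.1f} months of storage")
```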
