From 96969e069b8b4160f85bcc480e30a4ac04bb2d3d Mon Sep 17 00:00:00 2001 From: Rebecca Asch Date: Mon, 9 Sep 2024 10:12:07 -0400 Subject: [PATCH 1/8] WIP --- scripts/variantstore/docs/aou/cleanup/cleanup.md | 13 +++++-------- 1 file changed, 5 insertions(+), 8 deletions(-) diff --git a/scripts/variantstore/docs/aou/cleanup/cleanup.md b/scripts/variantstore/docs/aou/cleanup/cleanup.md index af4c2da7013..82f2159e7d1 100644 --- a/scripts/variantstore/docs/aou/cleanup/cleanup.md +++ b/scripts/variantstore/docs/aou/cleanup/cleanup.md @@ -2,15 +2,11 @@ ## Overview -The current Variants policy for AoU callsets is effectively to retain all versions of all artifacts forever. As the -storage costs for these artifacts can be significant (particularly from Delta onward), the Variants team would like to -make the cost of retaining artifacts more clear so conscious choices can be made about what to keep and what to delete. +The current Variants policy for AoU callsets is effectively to retain all versions of all artifacts forever. As the storage costs for these artifacts can be significant (particularly from Delta onward), the Variants team would like to make the cost of retaining artifacts more clear so conscious choices can be made about what to keep and what to delete. -As a general rule, any artifacts that have clearly become obsolete (e.g. VDSes with known issues that have been -superseded by corrected versions, obsolete sets of prepare tables, etc.) should be deleted ASAP. If it's not clear to -the Variants team whether an artifact should be cleaned up or not, we should calculate the monthly cost to preserve the -artifact (e.g. the sum of all relevant GCS or BigQuery storage costs) as well as the cost to regenerate the artifact. -Reach out to Lee with these numbers for his verdict on whether to keep or delete. +As a general rule, any artifacts that have clearly become obsolete (e.g. VDSes with known issues that have been superseded by corrected versions, obsolete sets of prepare tables, etc.) should be deleted ASAP. If it's not clear to the Variants team whether an artifact should be cleaned up or not, we should calculate the monthly cost to preserve the artifact (e.g. the sum of all relevant GCS or BigQuery storage costs) as well as the cost to regenerate the artifact. + +Reach out to leadership with these numbers for his verdict on whether to keep or delete. ## Specific AoU GVS Artifacts @@ -51,6 +47,7 @@ During the course of creating AoU callsets several large and expensive artifacts controls) and most recent versions (without AI/AN or controls, corrected phasing and GT) of the Delta VDS. This can serve as our team's guidance for how to handle multiple VDS versions going forward, though of course we can always ask Lee for explicit guidance. 
+* PGEN/VCF Intermediate Files ## Internal sign-off protocol From 2afc565787f2ecd52d64c2a434d4385155e37e01 Mon Sep 17 00:00:00 2001 From: Rebecca Asch Date: Mon, 9 Sep 2024 10:46:18 -0400 Subject: [PATCH 2/8] more WIP --- .../variantstore/docs/aou/cleanup/cleanup.md | 59 ++++++++----------- 1 file changed, 23 insertions(+), 36 deletions(-) diff --git a/scripts/variantstore/docs/aou/cleanup/cleanup.md b/scripts/variantstore/docs/aou/cleanup/cleanup.md index 82f2159e7d1..a3d32fe3b30 100644 --- a/scripts/variantstore/docs/aou/cleanup/cleanup.md +++ b/scripts/variantstore/docs/aou/cleanup/cleanup.md @@ -13,51 +13,38 @@ Reach out to leadership with these numbers for his verdict on whether to keep or During the course of creating AoU callsets several large and expensive artifacts are created: * Pilot workspace / dataset - * For the AoU Delta callset the Variants team created an AoU 10K workspace and dataset to pilot the Hail-related - processes we were using for the first time. At some point these processes will mature to the point where some or - all of the contents of this workspace and dataset should be deleted. This is likely not an issue for discussion - with Lee as it is internal to the Variants team's development process, but we should be mindful to clean up the - parts of this that we are done using promptly. + * for the Delta callset the Variants team created an AoU 10K workspace and dataset to pilot the Hail/VDS creation + * these processes will mature to the point where some or all of the contents of this workspace and dataset should be deleted * Production BigQuery dataset - * The plan from Delta forward is to use the same production BigQuery dataset for all future callsets. This was also - the plan historically as well, but in practice that didn't work out for various reasons. For Delta in particular - the Variants team was forced to create a new dataset due to the use of drop state NONE for Hail compatibility. If - we are forced to create another BigQuery dataset in the future, discuss with Lee to determine what to do with the - previous dataset(s). In the AoU Delta timeframe, the Variants team additionally reached out to a contact at VUMC - to determine if various datasets from the Alpha / Beta eras were still in use or could possibly be deleted. + * for each previous callset, there was (at least) one new dataset created + * the dream is to keep the same dataset for multiple callsets and just add new samples, regenerate the filter and create new deliverables, but that has yet to happen because of new features requested for each callset (e.g. update to Dragen version, addition of ploidy data, different requirements to use Hail...etc.) * Prepare tables - * These tables are used for callset statistics only starting with Delta but were previously used for extract as well - in pre-Delta callsets. Per 2023-01-25 meeting with Lee it appears this VET prepare table and associated callset - statistics tables can be deleted once the callset statistics have been generated and handed off. -* Terra workspace - * It seems that VAT workflows may generate large amounts of data under the submissions "folder". e.g. ~10 TiB of - data under this folder in the AoU 10K workspace (!). At the time of this writing the VAT process is not fully - defined so this item may especially benefit from updates. 
+  * needed for VCF or PGEN extract
+  * only variant tables are used by `GvsCallsetStatistics.wdl` for callset statistics deliverable
+* Sites-only VCFs
+  * the VAT is created from a sites-only VCF and its creation is the most resource-intensive part of the VAT pipeline
+  * clean up any failed runs once the VAT has been delivered and accepted
 * Avro files (Delta onward)
-  * These are huge, several times larger than the corresponding Hail VDS. It's not clear that there's any point to
-    keeping these files around unless there was a bug in the Hail GVS import code that would require a patch and
-    re-import. Per 2023-01-25 meeting with Lee we have now deleted the Delta versions of these Avro files. Per the
-    preceding comments, going forward these files can be deleted once the Variants team feels reasonably confident
-    that they won't be needed for the current callset any longer.
-* Hail VariantDataset (VDS) (Delta onward)
-  * The Variants team creates a copy of the VDS and then delivers a copy to the AoU preprod datasets bucket. That copy
-    of the VDS seems to stay in the delivery location for at least a few days, but it's not clear if that copy gets
-    cleaned up after AoU later copies the VDS to a production bucket. The Variants team should not rely on this copy
-    of the VDS being available long-term. Per 2023-01-25 meeting with Lee, we have retained the oldest (with AI/AN +
-    controls) and most recent versions (without AI/AN or controls, corrected phasing and GT) of the Delta VDS. This
-    can serve as our team's guidance for how to handle multiple VDS versions going forward, though of course we can
-    always ask Lee for explicit guidance.
+  * huge, several times larger than the corresponding Hail VDS
+  * as long as the VDS has been delivered and accepted, they can be deleted
+* Hail VariantDataset (VDS)
+  * we used to create it in the Terra workspace and then copy it
+  * WDL was updated to create in the "deliverables" GCS bucket so there is only one copy of each one
+  * clean up any failed runs once the VDS has been delivered and accepted
 * PGEN/VCF Intermediate Files
+  * PGEN: multiple versions of the PGEN files are created by `GvsExtractCallsetPgenMerged.wdl` because it delivers files split by chromosome
+  * VCF: only one version of the VCF files and indices are created, but check for failed runs
 
 ## Internal sign-off protocol
 
 The Variants team currently has the following VDS internal sign-off protocol:
 
-* Generate a VDS for the candidate callset
-* Run validation on this VDS
-* Run `GvsPrepareRangesCallset` to generate a prepare table of VET data
-* Generate callset statistics for the candidate callset using the prepare VET created in the preceding step
-* Forward VDS and callset statistics to Lee for QA / approval
+1. Generate a VDS for the candidate callset into the "delivery" bucket.
+1. Open up the VDS in a beefy notebook and confirm the "shape" looks right.
+1. Run `GvsPrepareRangesCallset.wdl` to generate a prepare table of VET data
+1. Run `GvsCallsetStatistics.wdl` to generate callset statistics for the candidate callset using the prepare VET created in the preceding step
+1. Copy the output of `GvsCallsetStatistics.wdl` into the "delivery" bucket.
+1. Email the paths to the VDS and callset statistics to Lee/Wail for QA / approval
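+
+Before kicking off `GvsCallsetStatistics.wdl`, a quick BigQuery sanity check on the prepare table can catch an
+incomplete prepare run early. A sketch only: the `<prefix>__VET_DATA` table name and the `sample_id` column are
+assumptions about the prepare output, so substitute what the run actually produced:
+```
+-- Confirm the prepare VET table has the expected sample count and a plausible row count
+-- before spending money on callset statistics.
+SELECT
+    COUNT(DISTINCT sample_id) AS sample_count,
+    COUNT(*) AS row_count
+FROM
+    `aou-genomics-curation-prod.aou_wgs_fullref_v2.<prefix>__VET_DATA`
+```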

## Storage versus regeneration costs

From 74a4773965605a6ea985c303b8e6fdd2bb08097d Mon Sep 17 00:00:00 2001
From: Rebecca Asch
Date: Mon, 9 Sep 2024 12:31:11 -0400
Subject: [PATCH 3/8] moved cost stuff to new doc to keep cleanup doc about
 cleanup

---
 scripts/variantstore/docs/aou/cleanup/Cost.md | 53 +++++++++++++++++
 .../variantstore/docs/aou/cleanup/cleanup.md  | 57 +------------------
 2 files changed, 55 insertions(+), 55 deletions(-)
 create mode 100644 scripts/variantstore/docs/aou/cleanup/Cost.md

diff --git a/scripts/variantstore/docs/aou/cleanup/Cost.md b/scripts/variantstore/docs/aou/cleanup/Cost.md
new file mode 100644
index 00000000000..ef09d25e76e
--- /dev/null
+++ b/scripts/variantstore/docs/aou/cleanup/Cost.md
@@ -0,0 +1,53 @@
+# Storage versus regeneration costs
+
+## Prepare tables
+
+These numbers assume the `GvsPrepareRangesCallset` workflow is invoked with the `only_output_vet_tables` input set
+to `true`. If this is not the case, meaning the prepare version of the ref ranges table was also generated, all costs
+below should be multiplied by about 4:
+
+* Running `GvsPrepareRangesCallset`: [$429.18](https://docs.google.com/spreadsheets/d/1fcmEVWvjsx4XFLT9ZUsruUznnlB94xKgDIIyCGu6ryQ/edit#gid=0)
+```
+-- Look in the cost observability table for the bytes scanned for the appropriate run of `GvsPrepareRanges`.
+SELECT
+    ROUND(event_bytes * (5 / POW(1024, 4)), 2) AS cost, -- $5 / TiB on demand https://cloud.google.com/bigquery/pricing#on_demand_pricing
+    call_start_timestamp
+FROM
+    `aou-genomics-curation-prod.aou_wgs_fullref_v2.cost_observability`
+WHERE
+    step = 'GvsPrepareRanges'
+ORDER BY call_start_timestamp DESC
+```
+* Storing prepare data: $878.39 / month
+  * Assuming compressed pricing, multiply the number of physical bytes by $0.026 / GiB / month (see the query sketched below).
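+
+A minimal sketch of that lookup using BigQuery's `INFORMATION_SCHEMA.TABLE_STORAGE` view. The `region-us`
+qualifier and the `%__VET_DATA` name pattern are assumptions, so substitute the dataset's actual location and
+the actual prepare table prefix:
+```
+-- Estimate the monthly cost of keeping each prepare table, assuming active physical (compressed)
+-- storage at $0.026 / GiB / month: https://cloud.google.com/bigquery/pricing#storage
+SELECT
+    table_name,
+    ROUND(total_physical_bytes / POW(1024, 3), 2) AS physical_gib,
+    ROUND(total_physical_bytes / POW(1024, 3) * 0.026, 2) AS monthly_cost
+FROM
+    `aou-genomics-curation-prod.region-us.INFORMATION_SCHEMA.TABLE_STORAGE`
+WHERE
+    table_schema = 'aou_wgs_fullref_v2'
+    AND table_name LIKE '%__VET_DATA'
+ORDER BY monthly_cost DESC
+```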
+
+## Avro files
+
+The Avro files generated from the Delta callset onward are very large, several times the size of the final Hail VDS.
+For the ~250K-sample Delta callset the Avro files consumed nearly 80 TiB of GCS storage while the delivered VDS was
+"only" about 26 TiB.
+
+Approximate figures for the ~250K-sample Delta callset:
+
+* Avro storage cost: $1568 / month (might be lower if we can get a colder bucket to copy them into)
+  * `76.61 TiB gs://fc-secure-fb908548-fe3c-41d6-adaf-7ac20d541375/submissions/c86a6e8f-71a1-4c38-9e6a-f5229520641e/GvsExtractAvroFilesForHail/efb3dbe8-13e9-4542-8b69-02237ec77ca5/call-OutputPath/avro`
+* [Avro generation cost](https://docs.google.com/spreadsheets/d/1fcmEVWvjsx4XFLT9ZUsruUznnlB94xKgDIIyCGu6ryQ/edit#gid=0):
+  $3000, 12 hours runtime.
+
+## Hail VariantDataset (VDS)
+
+The Hail VDS generated for the Delta callset consumes about 26 TiB of space in GCS at a cost of approximately $500 /
+month. Recreating the VDS from Avro files would take around 10 hours at about $100 / hour in cluster time for a total of
+about $1000. Note that re-creating the VDS requires Avro files; if we have not retained the Avro files per the step
+above, we would need to regenerate those as well which would add significantly to the cost.
+
+Approximate figures for the ~250K-sample Delta callset:
+
+* VDS storage cost: ~$500 / month. Note AoU should have exact copies of the VDSes we have delivered for Delta, though
+  it's not certain that these copies will remain accessible to the Variants team in the long term. The delivered VDSes
+  are put here `gs://prod-drc-broad/` and we have noted that we need them to remain there for hot-fixes. The Variants
+  team has generated five versions of the Delta VDS so far, one of which (the original) still exists:
+  * First version of the callset, includes many samples that were later
+    removed `gs://fc-secure-fb908548-fe3c-41d6-adaf-7ac20d541375/vds/2022-10-19/dead_alleles_removed_vs_667_249047_samples/gvs_export.vds`
+* VDS regeneration cost: $1000 (~10 hours @ ~$100 / hour cluster cost) + $3000 to regenerate Avro files if necessary.

diff --git a/scripts/variantstore/docs/aou/cleanup/cleanup.md b/scripts/variantstore/docs/aou/cleanup/cleanup.md
index a3d32fe3b30..c7931ddee71 100644
--- a/scripts/variantstore/docs/aou/cleanup/cleanup.md
+++ b/scripts/variantstore/docs/aou/cleanup/cleanup.md
@@ -1,10 +1,10 @@
-# AoU callset cleanup
+# AoU Callset Cleanup
 
 ## Overview
 
 The current Variants policy for AoU callsets is effectively to retain all versions of all artifacts forever. As the storage costs for these artifacts can be significant (particularly from Delta onward), the Variants team would like to make the cost of retaining artifacts more clear so conscious choices can be made about what to keep and what to delete.
 
-As a general rule, any artifacts that have clearly become obsolete (e.g. VDSes with known issues that have been superseded by corrected versions, obsolete sets of prepare tables, etc.) should be deleted ASAP. If it's not clear to the Variants team whether an artifact should be cleaned up or not, we should calculate the monthly cost to preserve the artifact (e.g. the sum of all relevant GCS or BigQuery storage costs) as well as the cost to regenerate the artifact.
+As a general rule, any artifacts that have clearly become obsolete (e.g. VDSes with known issues that have been superseded by corrected versions, obsolete sets of prepare tables, etc.) should be deleted ASAP. If it's not clear to the Variants team whether an artifact should be cleaned up or not, [we should calculate the monthly cost to preserve the artifact (e.g. the sum of all relevant GCS or BigQuery storage costs) as well as the cost to regenerate the artifact](Cost.md).
 
 Reach out to leadership with these numbers for his verdict on whether to keep or delete.
 
@@ -46,56 +46,3 @@ The Variants team currently has the following VDS internal sign-off protocol:
 1. Copy the output of `GvsCallsetStatistics.wdl` into the "delivery" bucket.
 1. Email the paths to the VDS and callset statistics to Lee/Wail for QA / approval
-## Storage versus regeneration costs
-
-### Prepare tables
-
-These numbers assume the `GvsPrepareRangesCallset` workflow is invoked with the `only_output_vet_tables` input set
-to `true`. If this is not the case, meaning the prepare version of the ref ranges table was also generated, all costs
-below should be multiplied by about 4:
-
-* Running
-  GvsPrepareRangesCallset: [$429.18](https://docs.google.com/spreadsheets/d/1fcmEVWvjsx4XFLT9ZUsruUznnlB94xKgDIIyCGu6ryQ/edit#gid=0)
-```
--- Look in the cost observability table for the bytes scanned for the appropriate run of `GvsPrepareRanges`.
-SELECT
-    ROUND(event_bytes * (5 / POW(1024, 4)), 2) AS cost, -- $5 / TiB on demand https://cloud.google.com/bigquery/pricing#on_demand_pricing
-    call_start_timestamp
-FROM
-    `aou-genomics-curation-prod.aou_wgs_fullref_v2.cost_observability`
-WHERE
-    step = 'GvsPrepareRanges'
-ORDER BY call_start_timestamp DESC
-```
-* Storing prepare data: $878.39 / month
-  * Assuming compressed pricing, multiply the number of physical bytes by $0.026 / GiB.
- -### Avro files - -The Avro files generated from the Delta callset onward are very large, several times the size of the final Hail VDS. -For the ~250K sample Delta callset the Avro files consumed nearly 80 TiB of GCS storage while the delivered VDS was -"only" about 26 TiB. - -Approximate figures for the ~250K sample Delta callset: - -* Avro storage cost: $1568 / month (might be lower if we can get a colder bucket to copy them into) - * `76.61 TiB gs://fc-secure-fb908548-fe3c-41d6-adaf-7ac20d541375/submissions/c86a6e8f-71a1-4c38-9e6a-f5229520641e/GvsExtractAvroFilesForHail/efb3dbe8-13e9-4542-8b69-02237ec77ca5/call-OutputPath/avro` -* [Avro generation cost](https://docs.google.com/spreadsheets/d/1fcmEVWvjsx4XFLT9ZUsruUznnlB94xKgDIIyCGu6ryQ/edit#gid=0): - $3000, 12 hours runtime. - -### Hail VariantDataset (VDS) - -The Hail VDS generated for the Delta callset consumes about 26 TiB of space in GCS at a cost of approximately $500 / -month. Recreating the VDS from Avro files would take around 10 hours at about $100 / hour in cluster time for a total of -about $1000. Note that re-creating the VDS requires Avro files; if we have not retained the Avro files per the step -above, we would need to regenerate those as well which would add significantly to the cost. - -Approximate figures for the ~250K samples Delta callset: - -* VDS storage cost: ~$500 / month. Note AoU should have exact copies of the VDSes we have delivered for Delta, though - it's not certain that these copies will remain accessible to the Variants team in the long term. The delivered VDSes are put here `gs://prod-drc-broad/` and we have noted that we need them to remain there for hot-fixes. The Variants team has - generated five versions of the Delta VDS so far, one of which (the original) still exist: - * First version of the callset, includes many samples that were later - removed `gs://fc-secure-fb908548-fe3c-41d6-adaf-7ac20d541375/vds/2022-10-19/dead_alleles_removed_vs_667_249047_samples/gvs_export.vds` -* VDS regeneration cost: $1000 (~10 hours @ ~$100 / hour cluster cost) + $3000 to regenerate Avro files if necessary. - From 627ded568eebe93714b6c3c156eed1a5c1484345 Mon Sep 17 00:00:00 2001 From: Rebecca Asch Date: Mon, 9 Sep 2024 12:31:56 -0400 Subject: [PATCH 4/8] more formatting --- scripts/variantstore/docs/aou/cleanup/Cost.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/scripts/variantstore/docs/aou/cleanup/Cost.md b/scripts/variantstore/docs/aou/cleanup/Cost.md index ef09d25e76e..ce3157ac5c7 100644 --- a/scripts/variantstore/docs/aou/cleanup/Cost.md +++ b/scripts/variantstore/docs/aou/cleanup/Cost.md @@ -1,4 +1,4 @@ -# Storage versus regeneration costs +# Storage vs Regeneration Costs ## Prepare tables From bd0cf01fccaf7bcc3be12f03657b0e754f86bfe7 Mon Sep 17 00:00:00 2001 From: Rebecca Asch Date: Mon, 9 Sep 2024 12:38:53 -0400 Subject: [PATCH 5/8] shuffle around doc contents --- .../variantstore/docs/aou/AOU_DELIVERABLES.md | 21 +++++++++++++++---- .../variantstore/docs/aou/cleanup/cleanup.md | 17 ++++----------- 2 files changed, 21 insertions(+), 17 deletions(-) diff --git a/scripts/variantstore/docs/aou/AOU_DELIVERABLES.md b/scripts/variantstore/docs/aou/AOU_DELIVERABLES.md index d69edaf4136..6f2ebdfce9b 100644 --- a/scripts/variantstore/docs/aou/AOU_DELIVERABLES.md +++ b/scripts/variantstore/docs/aou/AOU_DELIVERABLES.md @@ -88,13 +88,26 @@ 1. 
`GvsCallsetCost` workflow - This workflow calculates the total BigQuery cost of generating this callset (which is not represented in the Terra UI total workflow cost) using the above GVS workflows; it's used to calculate the cost as a whole and by sample.
+
+## Internal sign-off protocol
+
+The Variants team currently has the following VDS internal sign-off protocol:
+
+1. Generate a VDS for the candidate callset into the "delivery" bucket.
+1. Open up the VDS in a beefy notebook and confirm the "shape" looks right.
+1. Run `GvsPrepareRangesCallset.wdl` to generate a prepare table of VET data
+1. Run `GvsCallsetStatistics.wdl` to generate callset statistics for the candidate callset using the prepare VET created in the preceding step
+1. Copy the output of `GvsCallsetStatistics.wdl` into the "delivery" bucket.
+1. Email the paths to the VDS and callset statistics to Lee/Wail for QA / approval
+
+
 ## Main Deliverables (via email to stakeholders once the above steps are complete)
 
 The Callset Stats and S&P files can be simply `gsutil cp`ed to the AoU delivery bucket since they are so much smaller.
 
 1. GCS location of the VDS in the AoU delivery bucket
-2. Fully qualified name of the BigQuery dataset (composed of the `project_id` and `dataset_name` inputs from the workflows)
-3. GCS location of the CSV output from `GvsCallsetStatistics` workflow in the AoU delivery bucket
-4. GCS location of the TSV output from `GvsCalculatePrecisionAndSensitivity` in the AoU delivery bucket
+1.Fully qualified name of the BigQuery dataset (composed of the `project_id` and `dataset_name` inputs from the workflows)
+1. GCS location of the CSV output from `GvsCallsetStatistics` workflow in the AoU delivery bucket
+1. GCS location of the TSV output from `GvsCalculatePrecisionAndSensitivity` in the AoU delivery bucket
 
 ## Running the VAT pipeline
 To create a BigQuery table of variant annotations, you may follow the instructions here:
@@ -207,4 +220,4 @@ Once the VAT has been created, you will need to create a database table mapping
    ```
    select distinct vid from `<dataset_name>.<vat_table_name>` where vid not in (select vid from `<dataset_name>.<mapping_table_name>`) ;
-   ```
\ No newline at end of file
+   ```

diff --git a/scripts/variantstore/docs/aou/cleanup/cleanup.md b/scripts/variantstore/docs/aou/cleanup/cleanup.md
index c7931ddee71..29c527574c6 100644
--- a/scripts/variantstore/docs/aou/cleanup/cleanup.md
+++ b/scripts/variantstore/docs/aou/cleanup/cleanup.md
@@ -18,12 +18,15 @@ During the course of creating AoU callsets several large and expensive artifacts
 * Production BigQuery dataset
   * for each previous callset, there was (at least) one new dataset created
   * the dream is to keep the same dataset for multiple callsets and just add new samples, regenerate the filter and create new deliverables, but that has yet to happen because of new features requested for each callset (e.g. update to Dragen version, addition of ploidy data, different requirements to use Hail...etc.)
+  * if there are datasets from previous callsets that aren't needed anymore (check with AoU and Lee/Wail), they should be deleted
 * Prepare tables
   * needed for VCF or PGEN extract
   * only variant tables are used by `GvsCallsetStatistics.wdl` for callset statistics deliverable
+  * will be given a TTL by default, but if they are no longer needed for any of the above, delete them sooner (see the sketch below)
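+
+A sketch of the BigQuery housekeeping the two bullets above suggest: surveying dataset age to spot leftovers from previous callsets, and pushing out a prepare table's TTL while it is still needed. The `region-us` qualifier and the `<prefix>__VET_DATA` name are assumptions, so substitute the real location and table name:
+```
+-- List datasets in the production project oldest-first to spot candidates for deletion.
+SELECT schema_name, creation_time
+FROM `aou-genomics-curation-prod.region-us.INFORMATION_SCHEMA.SCHEMATA`
+ORDER BY creation_time;
+
+-- Extend a prepare table's TTL by 30 days if it is still needed past the default expiration.
+ALTER TABLE `aou-genomics-curation-prod.aou_wgs_fullref_v2.<prefix>__VET_DATA`
+SET OPTIONS (expiration_timestamp = TIMESTAMP_ADD(CURRENT_TIMESTAMP(), INTERVAL 30 DAY));
+```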
 * Sites-only VCFs
   * the VAT is created from a sites-only VCF and its creation is the most resource-intensive part of the VAT pipeline
-  * clean up any failed runs once the VAT has been delivered and accepted
+  * for Echo, Wail requested that it get copied over to the "delivery" bucket; ask about this before deleting
+  * clean up all runs (check for failures) once the VAT has been delivered and accepted
 * Avro files (Delta onward)
   * huge, several times larger than the corresponding Hail VDS
   * as long as the VDS has been delivered and accepted, they can be deleted
 * Hail VariantDataset (VDS)
   * we used to create it in the Terra workspace and then copy it
   * WDL was updated to create in the "deliverables" GCS bucket so there is only one copy of each one
   * clean up any failed runs once the VDS has been delivered and accepted
 * PGEN/VCF Intermediate Files
   * PGEN: multiple versions of the PGEN files are created by `GvsExtractCallsetPgenMerged.wdl` because it delivers files split by chromosome
   * VCF: only one version of the VCF files and indices are created, but check for failed runs
-
-## Internal sign-off protocol
-
-The Variants team currently has the following VDS internal sign-off protocol:
-
-1. Generate a VDS for the candidate callset into the "delivery" bucket.
-1. Open up the VDS in a beefy notebook and confirm the "shape" looks right.
-1. Run `GvsPrepareRangesCallset.wdl` to generate a prepare table of VET data
-1. Run `GvsCallsetStatistics.wdl` to generate callset statistics for the candidate callset using the prepare VET created in the preceding step
-1. Copy the output of `GvsCallsetStatistics.wdl` into the "delivery" bucket.
-1. Email the paths to the VDS and callset statistics to Lee/Wail for QA / approval
-

From e7a3c40d5a84b46914e4ddf559c911e71564a4c3 Mon Sep 17 00:00:00 2001
From: Rebecca Asch
Date: Mon, 9 Sep 2024 13:50:00 -0400
Subject: [PATCH 6/8] PR review suggestion

---
 scripts/variantstore/docs/aou/AOU_DELIVERABLES.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/scripts/variantstore/docs/aou/AOU_DELIVERABLES.md b/scripts/variantstore/docs/aou/AOU_DELIVERABLES.md
index 6f2ebdfce9b..b222489f621 100644
--- a/scripts/variantstore/docs/aou/AOU_DELIVERABLES.md
+++ b/scripts/variantstore/docs/aou/AOU_DELIVERABLES.md
@@ -94,7 +94,7 @@ The Variants team currently has the following VDS internal sign-off protocol:
 
 1. Generate a VDS for the candidate callset into the "delivery" bucket.
-1. Open up the VDS in a beefy notebook and confirm the "shape" looks right.
+1. Open up the VDS in a [beefy](/scripts/variantstore/docs/vds/cluster/AoU%20VDS%20Cluster%20Configuration.md) notebook and confirm the "shape" looks right.
 1. Run `GvsPrepareRangesCallset.wdl` to generate a prepare table of VET data
 1. Run `GvsCallsetStatistics.wdl` to generate callset statistics for the candidate callset using the prepare VET created in the preceding step
 1. Copy the output of `GvsCallsetStatistics.wdl` into the "delivery" bucket.
From 1825c44cb46ffa0fb111ac95f318bc97f86d74fb Mon Sep 17 00:00:00 2001 From: Rebecca Asch Date: Mon, 9 Sep 2024 13:51:18 -0400 Subject: [PATCH 7/8] get path right --- scripts/variantstore/docs/aou/AOU_DELIVERABLES.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/scripts/variantstore/docs/aou/AOU_DELIVERABLES.md b/scripts/variantstore/docs/aou/AOU_DELIVERABLES.md index b222489f621..21f44c55a61 100644 --- a/scripts/variantstore/docs/aou/AOU_DELIVERABLES.md +++ b/scripts/variantstore/docs/aou/AOU_DELIVERABLES.md @@ -94,7 +94,7 @@ The Variants team currently has the following VDS internal sign-off protocol: 1. Generate a VDS for the candidate callset into the "delivery" bucket. -1. Open up the VDS in a [beefy](/scripts/variantstore/docs/vds/cluster/AoU%20VDS%20Cluster%20Configuration.md) notebook and confirm the "shape" looks right. +1. Open up the VDS in a [beefy](vds/cluster/AoU%20VDS%20Cluster%20Configuration.md) notebook and confirm the "shape" looks right. 1. Run `GvsPrepareRangesCallset.wdl` to generate a prepare table of VET data 1. Run `GvsCallsetStatistics.wdl` to generate callset statistics for the candidate callset using the prepare VET created in the preceding step 1. Copy the output of `GvsCallsetStatistics.wdl` into the "delivery" bucket. From 1525b0d0989667347a401af09b0b2feb917f8d05 Mon Sep 17 00:00:00 2001 From: Rebecca Asch Date: Tue, 10 Sep 2024 18:36:18 -0400 Subject: [PATCH 8/8] more PR feedback --- scripts/variantstore/docs/aou/AOU_DELIVERABLES.md | 8 ++++---- scripts/variantstore/docs/aou/cleanup/cleanup.md | 4 ++-- 2 files changed, 6 insertions(+), 6 deletions(-) diff --git a/scripts/variantstore/docs/aou/AOU_DELIVERABLES.md b/scripts/variantstore/docs/aou/AOU_DELIVERABLES.md index d7bff4a87ef..2eb7bd56fad 100644 --- a/scripts/variantstore/docs/aou/AOU_DELIVERABLES.md +++ b/scripts/variantstore/docs/aou/AOU_DELIVERABLES.md @@ -95,17 +95,17 @@ The Variants team currently has the following VDS internal sign-off protocol: 1. Generate a VDS for the candidate callset into the "delivery" bucket. 1. Open up the VDS in a [beefy](vds/cluster/AoU%20VDS%20Cluster%20Configuration.md) notebook and confirm the "shape" looks right. -1. Run `GvsPrepareRangesCallset.wdl` to generate a prepare table of VET data -1. Run `GvsCallsetStatistics.wdl` to generate callset statistics for the candidate callset using the prepare VET created in the preceding step +1. Run `GvsPrepareRangesCallset.wdl` to generate a prepare table of VET data. +1. Run `GvsCallsetStatistics.wdl` to generate callset statistics for the candidate callset using the prepare VET table created in the preceding step. 1. Copy the output of `GvsCallsetStatistics.wdl` into the "delivery" bucket. -1. Email the paths to the VDS and callset statistics to Lee/Wail for QA / approval +1. Email the paths to the VDS and callset statistics to Lee/Wail for QA / approval. ## Main Deliverables (via email to stakeholders once the above steps are complete) The Callset Stats and S&P files can be simply `gsutil cp`ed to the AoU delivery bucket since they are so much smaller. 1. GCS location of the VDS in the AoU delivery bucket -1.Fully qualified name of the BigQuery dataset (composed of the `project_id` and `dataset_name` inputs from the workflows) +1. Fully qualified name of the BigQuery dataset (composed of the `project_id` and `dataset_name` inputs from the workflows) 1. GCS location of the CSV output from `GvsCallsetStatistics` workflow in the AoU delivery bucket 1. 
GCS location of the TSV output from `GvsCalculatePrecisionAndSensitivity` in the AoU delivery bucket diff --git a/scripts/variantstore/docs/aou/cleanup/cleanup.md b/scripts/variantstore/docs/aou/cleanup/cleanup.md index 29c527574c6..2967d28a1cb 100644 --- a/scripts/variantstore/docs/aou/cleanup/cleanup.md +++ b/scripts/variantstore/docs/aou/cleanup/cleanup.md @@ -4,9 +4,9 @@ The current Variants policy for AoU callsets is effectively to retain all versions of all artifacts forever. As the storage costs for these artifacts can be significant (particularly from Delta onward), the Variants team would like to make the cost of retaining artifacts more clear so conscious choices can be made about what to keep and what to delete. -As a general rule, any artifacts that have clearly become obsolete (e.g. VDSes with known issues that have been superseded by corrected versions, obsolete sets of prepare tables, etc.) should be deleted ASAP. If it's not clear to the Variants team whether an artifact should be cleaned up or not, [we should calculate the monthly cost to preserve the artifact (e.g. the sum of all relevant GCS or BigQuery storage costs) as well as the cost to regenerate the artifact](Cost.md). +As a general rule, any artifacts that have clearly become obsolete (e.g. VDSes with known issues that have been superseded by corrected versions, obsolete sets of prepare tables, etc.) should be deleted ASAP. If it's not clear to the Variants team whether an artifact should be cleaned up or not, [we should calculate the monthly cost to preserve the artifact (e.g. the sum of all relevant GCS or BigQuery storage costs) as well as the cost to regenerate the artifact](Cost.md). -Reach out to leadership with these numbers for his verdict on whether to keep or delete. +Reach out to leadership with these numbers for their verdict on whether to keep or delete. ## Specific AoU GVS Artifacts