From f261eed22571fc8950eed8699b5fbab62ad0e11c Mon Sep 17 00:00:00 2001
From: Karan Jaisingh
Date: Thu, 15 Aug 2024 15:27:04 -0400
Subject: [PATCH] Skip subsampling if batch size is less than n_samples_subsample (#707)

Adds a check that n_samples_subsample is less than the batch size,
i.e. length(samples). If it is not, the subsampling task
RandomSubsampleStringArray is skipped. Also updates the documentation
to reflect this.
---
 .../cohort_mode/cohort_mode_workspace_dashboard.md.tmpl | 2 +-
 wdl/TrainGCNV.wdl                                       | 2 +-
 2 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/inputs/templates/terra_workspaces/cohort_mode/cohort_mode_workspace_dashboard.md.tmpl b/inputs/templates/terra_workspaces/cohort_mode/cohort_mode_workspace_dashboard.md.tmpl
index 7523f1ac1..86e1d8bd0 100644
--- a/inputs/templates/terra_workspaces/cohort_mode/cohort_mode_workspace_dashboard.md.tmpl
+++ b/inputs/templates/terra_workspaces/cohort_mode/cohort_mode_workspace_dashboard.md.tmpl
@@ -166,7 +166,7 @@ Read the full EvidenceQC documentation [here](https://github.com/broadinstitute/
 Read the full TrainGCNV documentation [here](https://github.com/broadinstitute/gatk-sv#gcnv-training-1).
 
 * Before running this workflow, create the batches (~100-500 samples) you will use for the rest of the pipeline based on sample coverage, WGD score (from `02-EvidenceQC`), and PCR status. These will likely not be the same as the batches you used for `02-EvidenceQC`.
-* By default, `03-TrainGCNV` is configured to be run once per `sample_set` on 100 randomly-chosen samples from that set to create a gCNV model for each batch. If your `sample_set` contains fewer than 100 samples (not recommended), you will need to edit the `n_samples_subsample` parameter to be less than or equal to the number of samples.
+* By default, `03-TrainGCNV` is configured to be run once per `sample_set` on 100 randomly-chosen samples from that set to create a gCNV model for each batch. To modify this behavior, you can set the `n_samples_subsample` parameter to the number of samples to use for training.
 
 #### 04-GatherBatchEvidence
 
diff --git a/wdl/TrainGCNV.wdl b/wdl/TrainGCNV.wdl
index 579d1255e..43c41e8fa 100644
--- a/wdl/TrainGCNV.wdl
+++ b/wdl/TrainGCNV.wdl
@@ -112,7 +112,7 @@ workflow TrainGCNV {
     }
   }
 
-  if (defined(n_samples_subsample) && !defined(sample_ids_training_subset)) {
+  if (defined(n_samples_subsample) && (select_first([n_samples_subsample]) < length(samples)) && !defined(sample_ids_training_subset)) {
     call util.RandomSubsampleStringArray {
       input:
         strings = write_lines(samples),