diff --git a/website/docs/gs/runtime_env.md b/website/docs/gs/runtime_env.md index 10b8c6234..65aee6032 100644 --- a/website/docs/gs/runtime_env.md +++ b/website/docs/gs/runtime_env.md @@ -48,4 +48,4 @@ and share code on forked repositories. Here are a some considerations: - The GATK-SV pipeline takes advantage of the massive parallelization possible in the cloud. Local backends may not have the resources to execute all of the workflows. Workflows that use fewer resources or that are less parallelized may be more successful. - For instance, some users have been able to run [GatherSampleEvidence](#gather-sample-evidence) on a SLURM cluster. + For instance, some users have been able to run [GatherSampleEvidence](../modules/gse) on a SLURM cluster. diff --git a/website/docs/modules/evidence_qc.md b/website/docs/modules/evidence_qc.md index 085945f87..5177636da 100644 --- a/website/docs/modules/evidence_qc.md +++ b/website/docs/modules/evidence_qc.md @@ -5,6 +5,8 @@ sidebar_position: 2 slug: eqc --- +import { Highlight, HighlightOptionalArg } from "../../src/components/highlight.js" + Runs ploidy estimation, dosage scoring, and optionally VCF QC. The results from this module can be used for QC and batching. @@ -17,9 +19,36 @@ for further guidance on creating batches. We also recommend using sex assignments generated from the ploidy estimates and incorporating them into the PED file, with sex = 0 for sex aneuploidies. -### Prerequisites +The following diagram illustrates the upstream and downstream workflows of the `EvidenceQC` workflow +in the recommended invocation order. You may refer to +[this diagram](https://github.com/broadinstitute/gatk-sv/blob/main/terra_pipeline_diagram.jpg) +for the overall recommended invocation order. + +
+ +```mermaid + +stateDiagram + direction LR + + classDef inModules stroke-width:0px,fill:#00509d,color:#caf0f8 + classDef thisModule font-weight:bold,stroke-width:0px,fill:#ff9900,color:white + classDef outModules stroke-width:0px,fill:#caf0f8,color:#00509d + + gse: GatherSampleEvidence + eqc: EvidenceQC + batching: Batching, sample QC, and sex assignment + + gse --> eqc + eqc --> batching + + class eqc thisModule + class gse inModules + class batching outModules +``` + +
-- [Gather Sample Evidence](./gse) ### Inputs diff --git a/website/docs/modules/gather_batch_evidence.md b/website/docs/modules/gather_batch_evidence.md index d6de3948f..0bdf8a7a9 100644 --- a/website/docs/modules/gather_batch_evidence.md +++ b/website/docs/modules/gather_batch_evidence.md @@ -5,25 +5,174 @@ sidebar_position: 4 slug: gbe --- -Runs CNV callers (cnMOPs, GATK gCNV) and combines single-sample -raw evidence into a batch. See above for more information on batching. +Runs CNV callers ([cn.MOPS](https://academic.oup.com/nar/article/40/9/e69/1136601), GATK-gCNV) +and combines single-sample raw evidence into a batch. -### Prerequisites +The following diagram illustrates the downstream workflows of the `GatherBatchEvidence` workflow +in the recommended invocation order. You may refer to +[this diagram](https://github.com/broadinstitute/gatk-sv/blob/main/terra_pipeline_diagram.jpg) +for the overall recommended invocation order. -- GatherSampleEvidence -- (Recommended) EvidenceQC -- gCNV training. +```mermaid -### Inputs -- PED file (updated with EvidenceQC sex assignments, including sex = 0 - for sex aneuploidies. Calls will not be made on sex chromosomes - when sex = 0 in order to avoid generating many confusing calls - or upsetting normalized copy numbers for the batch.) 
-- Read count, BAF, PE, SD, and SR files (GatherSampleEvidence) -- Caller VCFs (GatherSampleEvidence) -- Contig ploidy model and gCNV model files (gCNV training) +stateDiagram + direction LR + + classDef inModules stroke-width:0px,fill:#00509d,color:#caf0f8 + classDef thisModule font-weight:bold,stroke-width:0px,fill:#ff9900,color:white + classDef outModules stroke-width:0px,fill:#caf0f8,color:#00509d -### Outputs + gbe: GatherBatchEvidence + t: TrainGCNV + cb: ClusterBatch + t --> gbe + gbe --> cb + + class gbe thisModule + class t inModules + class cb outModules +``` + +## Inputs +This workflow takes as input the read counts, BAF, PE, SD, SR, and per-caller VCF files +produced in the GatherSampleEvidence workflow, and contig ploidy and gCNV models from +the TrainGCNV workflow. +The following is the list of the inputs the GatherBatchEvidence workflow takes. + + +#### `batch` +An identifier for the batch. + + +#### `samples` +Sets the list of sample IDs. + + +#### `counts` +Set to the [`GatherSampleEvidence.coverage_counts`](./gse#coverage-counts) output. + + +#### Raw calls + +The following inputs set the per-caller raw SV calls, and should be set +if the caller was run in the [`GatherSampleEvidence`](./gse) workflow. +You may set each of the following inputs to the linked output from +the GatherSampleEvidence workflow. + + +- `manta_vcfs`: [`GatherSampleEvidence.manta_vcf`](./gse#manta-vcf); +- `melt_vcfs`: [`GatherSampleEvidence.melt_vcf`](./gse#melt-vcf); +- `scramble_vcfs`: [`GatherSampleEvidence.scramble_vcf`](./gse#scramble-vcf); +- `wham_vcfs`: [`GatherSampleEvidence.wham_vcf`](./gse#wham-vcf). + +#### `PE_files` +Set to the [`GatherSampleEvidence.pesr_disc`](./gse#pesr-disc) output. 
+ +#### `SR_files` +Set to the [`GatherSampleEvidence.pesr_split`](./gse#pesr-split) output. + + +#### `SD_files` +Set to the [`GatherSampleEvidence.pesr_sd`](./gse#pesr-sd) output. + + +#### `matrix_qc_distance` +You may refer to [this file](https://github.com/broadinstitute/gatk-sv/blob/main/inputs/templates/terra_workspaces/cohort_mode/workflow_configurations/GatherBatchEvidence.json.tmpl) +for an example value. + + +#### `min_svsize` +Sets the minimum size of SVs to include. + + +#### `ped_file` +A pedigree file describing the familial relationships between the samples in the cohort. +Please refer to [this section](./#ped_file) for details. + + +#### `run_matrix_qc` +Enables or disables running optional QC tasks. + + +#### `gcnv_qs_cutoff` +You may refer to [this file](https://github.com/broadinstitute/gatk-sv/blob/main/inputs/templates/terra_workspaces/cohort_mode/workflow_configurations/GatherBatchEvidence.json.tmpl) +for an example value. + +#### cn.MOPS files +The workflow needs the following cn.MOPS files. + +- `cnmops_chrom_file` and `cnmops_allo_file`: FASTA index files (`.fai`) for respectively + non-sex chromosomes (autosomes) and chromosomes X and Y (allosomes). + The file format is explained [on this page](https://www.htslib.org/doc/faidx.html). + + You may use the following files for these fields: + + ```json + "cnmops_chrom_file": "gs://gcp-public-data--broad-references/hg38/v0/sv-resources/resources/v1/autosome.fai", + "cnmops_allo_file": "gs://gcp-public-data--broad-references/hg38/v0/sv-resources/resources/v1/allosome.fai" + ``` + +- `cnmops_exclude_list`: + You may use [this file](https://github.com/broadinstitute/gatk-sv/blob/d66f760865a89f30dbce456a3f720dec8b70705c/inputs/values/resources_hg38.json#L10) + for this field. + +#### GATK-gCNV inputs + +The following inputs are configured based on the outputs generated in the [`TrainGCNV`](./gcnv) workflow.
+ +- `contig_ploidy_model_tar`: [`TrainGCNV.cohort_contig_ploidy_model_tar`](./gcnv#contig-ploidy-model-tarball) +- `gcnv_model_tars`: [`TrainGCNV.cohort_gcnv_model_tars`](./gcnv#model-tarballs) + + +The workflow also enables setting a few optional arguments of gCNV. +The arguments and their default values are provided +[here](https://github.com/broadinstitute/gatk-sv/blob/main/inputs/templates/terra_workspaces/cohort_mode/workflow_configurations/GatherBatchEvidence.json.tmpl) +as the following, and each argument is documented on +[this page](https://gatk.broadinstitute.org/hc/en-us/articles/360037593411-PostprocessGermlineCNVCalls) +and +[this page](https://gatk.broadinstitute.org/hc/en-us/articles/360047217671-GermlineCNVCaller). + + +#### Docker images + +The workflow needs the following Docker images, the latest versions of which are in +[this file](https://github.com/broadinstitute/gatk-sv/blob/main/inputs/values/dockers.json). + + - `cnmops_docker`; + - `condense_counts_docker`; + - `linux_docker`; + - `sv_base_docker`; + - `sv_base_mini_docker`; + - `sv_pipeline_docker`; + - `sv_pipeline_qc_docker`; + - `gcnv_gatk_docker`; + - `gatk_docker`. + +#### Static inputs + +You may refer to [this reference file](https://github.com/broadinstitute/gatk-sv/blob/main/inputs/values/resources_hg38.json) +for values of the following inputs. + + - `primary_contigs_fai`; + - `cytoband`; + - `ref_dict`; + - `mei_bed`; + - `genome_file`; + - `sd_locs_vcf`. + + +#### Optional Inputs +The following is the list of a few optional inputs of the +workflow, with an example of possible values. 
+ +- `"allosomal_contigs": [["chrX", "chrY"]]` +- `"ploidy_sample_psi_scale": 0.001` + + + + + +## Outputs - Combined read count matrix, SR, PE, and BAF files - Standardized call VCFs diff --git a/website/docs/modules/gather_sample_evidence.md b/website/docs/modules/gather_sample_evidence.md index 918a1b27d..fcc951be3 100644 --- a/website/docs/modules/gather_sample_evidence.md +++ b/website/docs/modules/gather_sample_evidence.md @@ -6,20 +6,77 @@ slug: gse --- Runs raw evidence collection on each sample with the following SV callers: -Manta, Wham, and/or MELT. For guidance on pre-filtering prior to GatherSampleEvidence, +Manta, Wham, Scramble, and/or MELT. For guidance on pre-filtering prior to GatherSampleEvidence, refer to the Sample Exclusion section. -Note: a list of sample IDs must be provided. Refer to the sample ID -requirements for specifications of allowable sample IDs. +The following diagram illustrates the downstream workflows of the `GatherSampleEvidence` workflow +in the recommended invocation order. You may refer to +[this diagram](https://github.com/broadinstitute/gatk-sv/blob/main/terra_pipeline_diagram.jpg) +for the overall recommended invocation order. + + +```mermaid + +stateDiagram + direction LR + + classDef inModules stroke-width:0px,fill:#00509d,color:#caf0f8 + classDef thisModule font-weight:bold,stroke-width:0px,fill:#ff9900,color:white + classDef outModules stroke-width:0px,fill:#caf0f8,color:#00509d + + gse: GatherSampleEvidence + eqc: EvidenceQC + gse --> eqc + + class gse thisModule + class eqc outModules +``` + + +## Inputs + +#### `bam_or_cram_file` +A BAM or CRAM file aligned to hg38. Index file (.bai) must be provided if using BAM. + +#### `sample_id` +Refer to the [sample ID requirements](/docs/gs/inputs#sampleids) for specifications of allowable sample IDs. IDs that do not meet these requirements may cause errors. -### Inputs +#### `preprocessed_intervals` +Picard interval list. 
+ +#### `sd_locs_vcf` +(`sd`: site depth) +A VCF file containing allele counts at common SNP loci of the genome, which is used for calculating BAF. +For human genome, you may use [`dbSNP`](https://www.ncbi.nlm.nih.gov/snp/) +that contains a complete list of common and clinical human single nucleotide variations, +microsatellites, and small-scale insertions and deletions. +You may find a link to the file in +[this reference](https://github.com/broadinstitute/gatk-sv/blob/main/inputs/values/resources_hg38.json). -- Per-sample BAM or CRAM files aligned to hg38. Index files (.bai) must be provided if using BAMs. -### Outputs +## Outputs -- Caller VCFs (Manta, MELT, and/or Wham) - Binned read counts file - Split reads (SR) file - Discordant read pairs (PE) file + +#### `manta_vcf` {#manta-vcf} +A VCF file containing variants called by Manta. + +#### `melt_vcf` {#melt-vcf} +A VCF file containing variants called by MELT. + +#### `scramble_vcf` {#scramble-vcf} +A VCF file containing variants called by Scramble. + +#### `wham_vcf` {#wham-vcf} +A VCF file containing variants called by Wham. + +#### `coverage_counts` {#coverage-counts} + +#### `pesr_disc` {#pesr-disc} + +#### `pesr_split` {#pesr-split} + +#### `pesr_sd` {#pesr-sd} \ No newline at end of file diff --git a/website/docs/modules/index.md b/website/docs/modules/index.md index 2ffea615f..a5f4ad7c5 100644 --- a/website/docs/modules/index.md +++ b/website/docs/modules/index.md @@ -36,3 +36,18 @@ consisting of multiple modules to be executed in the following order. - **Module 09 (in development)** Visualization, including scripts that generates IGV screenshots and rd plots. - Additional modules to be added: de novo and mosaic scripts + + +## Pipeline Parameters + +Several inputs are shared across different modules of the pipeline, which are explained in this section. + +#### `ped_file` + +A pedigree file describing the familial relationships between the samples in the cohort. 
+The file needs to be in the +[PED format](https://gatk.broadinstitute.org/hc/en-us/articles/360035531972-PED-Pedigree-format). +Updated with [EvidenceQC](./eqc) sex assignments, including +`sex = 0` for sex aneuploidies; +genotypes on chrX and chrY for samples with `sex = 0` in the PED file will be set to +`./.` and these samples will be excluded from sex-specific training steps. diff --git a/website/docs/modules/train_gcnv.md b/website/docs/modules/train_gcnv.md index 7bbd8e934..45c7b3cac 100644 --- a/website/docs/modules/train_gcnv.md +++ b/website/docs/modules/train_gcnv.md @@ -5,37 +5,102 @@ sidebar_position: 3 slug: gcnv --- -Trains a gCNV model for use in GatherBatchEvidence. -The WDL can be found at /wdl/TrainGCNV.wdl. - -Both the cohort and single-sample modes use the -GATK gCNV depth calling pipeline, which requires a -trained model as input. The samples used for training -should be technically homogeneous and similar to the -samples to be processed (i.e. same sample type, -library prep protocol, sequencer, sequencing center, etc.). -The samples to be processed may comprise all or a subset -of the training set. For small, relatively homogenous cohorts, -a single gCNV model is usually sufficient. If a cohort -contains multiple data sources, we recommend training a separate -model for each batch or group of batches with similar dosage -score (WGD). The model may be trained on all or a subset of -the samples to which it will be applied; a reasonable default -is 100 randomly-selected samples from the batch (the random -selection can be done as part of the workflow by specifying -a number of samples to the n_samples_subsample input -parameter in /wdl/TrainGCNV.wdl). 
- -### Prerequisites - -- GatherSampleEvidence -- (Recommended) EvidenceQC - -### Inputs - -- Read count files (GatherSampleEvidence) - -### Outputs - -- Contig ploidy model tarball -- gCNV model tarballs \ No newline at end of file +import { Highlight, HighlightOptionalArg } from "../../src/components/highlight.js" + +[GATK-gCNV](https://www.nature.com/articles/s41588-023-01449-0) +is a method for detecting rare germline copy number variants (CNVs) +from short-read sequencing read-depth information. +The [TrainGCNV](https://github.com/broadinstitute/gatk-sv/blob/main/wdl/TrainGCNV.wdl) +module trains a gCNV model for use in the [GatherBatchEvidence](./gbe) workflow. +The upstream and downstream dependencies of the TrainGCNV module are illustrated in the following diagram. + + +The samples used for training should be homogeneous (concerning sequencing platform, +coverage, library preparation, etc.) and similar +to the samples to which the model will be applied in terms of sample type, +library preparation protocol, sequencer, sequencing center, etc. + + +For small, relatively homogeneous cohorts, a single gCNV model is usually sufficient. +However, for larger cohorts, especially those with multiple data sources, +we recommend training a separate model for each batch or group of batches (see +[batching section](/docs/run/joint#batching) for details). +The model can be trained on all or a subset of the samples to which it will be applied. +A subset of 100 randomly selected samples from the batch is a reasonable +input size for training the model; when the `n_samples_subsample` input is provided, +the `TrainGCNV` workflow can automatically perform this random selection. + +The following diagram illustrates the upstream and downstream workflows of the `TrainGCNV` workflow +in the recommended invocation order.
You may refer to +[this diagram](https://github.com/broadinstitute/gatk-sv/blob/main/terra_pipeline_diagram.jpg) +for the overall recommended invocation order. + +```mermaid + +stateDiagram + direction LR + + classDef inModules stroke-width:0px,fill:#00509d,color:#caf0f8 + classDef thisModule font-weight:bold,stroke-width:0px,fill:#ff9900,color:white + classDef outModules stroke-width:0px,fill:#caf0f8,color:#00509d + + batching: Batching, sample QC, and sex assignment + t: TrainGCNV + gbe: GatherBatchEvidence + + batching --> t + t --> gbe + + class t thisModule + class batching inModules + class gbe outModules +``` + +## Inputs + +This section provides a brief description of the _required_ inputs of the TrainGCNV workflow. +For a description of the _optional_ inputs and their default values, you may refer to the +[source code](https://github.com/broadinstitute/gatk-sv/blob/main/wdl/TrainGCNV.wdl) of the TrainGCNV workflow. +Additionally, the majority of the optional inputs of the workflow map to the optional arguments of the +tool the workflow uses, `GATK GermlineCNVCaller`; hence, you may refer to the +[documentation](https://gatk.broadinstitute.org/hc/en-us/articles/360040097712-GermlineCNVCaller) +of the tool for a description of these optional inputs. + +#### `samples` +A list of sample IDs. +The order of IDs in this list should match the order of files in `count_files`. + +#### `count_files` +A list of per-sample coverage counts generated in the [GatherSampleEvidence](./gse#outputs) workflow. + +#### `contig_ploidy_priors` +A tabular file with ploidy prior probability per contig. +You may find the link to this input from +[this reference](https://github.com/broadinstitute/gatk-sv/blob/main/inputs/values/resources_hg38.json) +and a description of the file format +[here](https://gatk.broadinstitute.org/hc/en-us/articles/360037224772-DetermineGermlineContigPloidy).
+ + +#### `reference_fasta` +`reference_fasta`, `reference_index`, `reference_dict` are respectively the +reference genome sequence in the FASTA format, its index file, and a corresponding +[dictionary file](https://gatk.broadinstitute.org/hc/en-us/articles/360035531652-FASTA-Reference-genome-format). +You may find links to these files from +[this reference](https://github.com/broadinstitute/gatk-sv/blob/main/inputs/values/resources_hg38.json). + + +## Outputs + +#### Optional `annotated_intervals` {#annotated-intervals} + +The count files from [GatherSampleEvidence](./gse) with adjacent intervals combined into +locus-sorted `DepthEvidence` files using `GATK CondenseDepthEvidence` tool, which are +annotated with GC content, mappability, and segmental-duplication content using +[`GATK AnnotateIntervals`](https://gatk.broadinstitute.org/hc/en-us/articles/360041416652-AnnotateIntervals) +tool. This output is generated if the optional input `do_explicit_gc_correction` is set to `True`. + +#### Optional `filtered_intervals_cnv` {#filtered-intervals-cnv} + +#### Optional `cohort_contig_ploidy_model_tar` {#contig-ploidy-model-tarball} + +#### Optional `cohort_gcnv_model_tars` {#model-tarballs} diff --git a/website/src/components/highlight.js b/website/src/components/highlight.js new file mode 100644 index 000000000..8f41722bc --- /dev/null +++ b/website/src/components/highlight.js @@ -0,0 +1,25 @@ +const Highlight = ({children, color}) => ( + + {children} + +); + +const HighlightOptionalArg = ({children}) => ( + + {children} + +); + +export { Highlight, HighlightOptionalArg }; diff --git a/website/src/css/custom.css b/website/src/css/custom.css index 2bc6a4cfd..e34580355 100644 --- a/website/src/css/custom.css +++ b/website/src/css/custom.css @@ -15,6 +15,11 @@ --ifm-color-primary-lightest: #3cad6e; --ifm-code-font-size: 95%; --docusaurus-highlighted-code-line-bg: rgba(0, 0, 0, 0.1); + + --highlight-text-color: black; + --highlight-background-color: #7091e6; + 
--highlight-optional-arg-text-color: black; + --highlight-optional-arg-background-color: #7091e6; } /* For readability concerns, you should choose a lighter palette in dark mode. */ @@ -27,4 +32,9 @@ --ifm-color-primary-lighter: #32d8b4; --ifm-color-primary-lightest: #4fddbf; --docusaurus-highlighted-code-line-bg: rgba(0, 0, 0, 0.3); + + --highlight-text-color: black; + --highlight-background-color: #7091e6; + --highlight-optional-arg-text-color: black; + --highlight-optional-arg-background-color: #7091e6; }
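
Reviewer note on the TrainGCNV inputs documented in this diff: the new text states that the order of IDs in `samples` should match the order of files in `count_files`. A check along those lines could be sketched as below. This is only a sketch under a stated assumption, not part of GATK-SV: it assumes each count file is named `<sample_id>.<suffix>` (a hypothetical naming convention), and the function name `find_order_mismatches` is made up for illustration.

```python
from pathlib import PurePosixPath


def find_order_mismatches(samples, count_files):
    """Return (index, sample_id, path) triples whose file name does not match the ID.

    ASSUMPTION (hypothetical, not taken from the GATK-SV docs): every count
    file is named "<sample_id>.<anything>", e.g. "NA12878.counts.tsv.gz".
    """
    if len(samples) != len(count_files):
        raise ValueError(
            f"{len(samples)} sample IDs but {len(count_files)} count files"
        )
    mismatches = []
    for index, (sample_id, path) in enumerate(zip(samples, count_files)):
        # Compare against the last path component only, so bucket and
        # directory names do not affect the check.
        file_name = PurePosixPath(path).name
        if not file_name.startswith(sample_id + "."):
            mismatches.append((index, sample_id, path))
    return mismatches
```

An empty result means the two lists line up under the assumed naming convention; it says nothing about cohorts whose count files are named differently.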