Update GatherSampleEvidence & TrainGCNV docs #681

Merged
merged 26 commits into from
Sep 25, 2024
Changes from 25 commits
Commits
26 commits
8b6931c
Extend docs.
VJalili May 13, 2024
cdbe237
Merge remote-tracking branch 'upstream/main' into docs_gather_sample_…
VJalili May 13, 2024
1fcfe12
Add Scramble to GSE.
VJalili May 29, 2024
19a824a
Update TrainGCNV docs.
VJalili May 29, 2024
d5a3c98
Merge remote-tracking branch 'upstream/main' into docs_gather_sample_…
VJalili May 29, 2024
71de496
Document the annotated_intervals output.
VJalili Jun 6, 2024
6032dcf
Add an option to highlight text.
VJalili Jun 12, 2024
a0f0ced
Extend docs on inputs and outputs of workflows.
VJalili Aug 2, 2024
a06c18b
Fix typo & add diagram for gather sample evidence.
VJalili Aug 2, 2024
00076ec
Update header level to match inputs section.
VJalili Aug 2, 2024
dfb0a72
Update website/docs/modules/gather_sample_evidence.md
VJalili Sep 13, 2024
c7f9795
Update website/docs/modules/gather_sample_evidence.md
VJalili Sep 13, 2024
e1862b3
Update website/docs/modules/train_gcnv.md
VJalili Sep 13, 2024
fd129b3
Replace direct link with a reference to the resources file.
VJalili Sep 13, 2024
2926a89
Update website/docs/modules/train_gcnv.md
VJalili Sep 13, 2024
7d4b503
Replace direct links with references to the resources file.
VJalili Sep 13, 2024
2ab676e
Separate gatk-sv input, & add additional external docs link.
VJalili Sep 13, 2024
fdb33ab
Remove links.
VJalili Sep 13, 2024
b9dfb39
Add a single-line descript to avoid empty section. Needs to be extend…
VJalili Sep 13, 2024
490c739
update diagrams to display recommended invocation order.
VJalili Sep 18, 2024
53df818
add a common inputs section & remove some values.
VJalili Sep 18, 2024
fb0bf8f
make plural
VJalili Sep 18, 2024
d291c6c
update.
VJalili Sep 19, 2024
56dfaa2
update link
VJalili Sep 19, 2024
2e2a4dc
clarify homogeneous
VJalili Sep 19, 2024
b782169
Fix a broken link.
VJalili Sep 25, 2024
33 changes: 31 additions & 2 deletions website/docs/modules/evidence_qc.md
@@ -5,6 +5,8 @@ sidebar_position: 2
slug: eqc
---

import { Highlight, HighlightOptionalArg } from "../../src/components/highlight.js"

Runs ploidy estimation, dosage scoring, and optionally VCF QC.
The results from this module can be used for QC and batching.

@@ -17,9 +19,36 @@ for further guidance on creating batches.
We also recommend using sex assignments generated from the ploidy
estimates and incorporating them into the PED file, with sex = 0 for sex aneuploidies.

### Prerequisites
The following diagram illustrates the upstream and downstream workflows of the `EvidenceQC` workflow
in the recommended invocation order. You may refer to
[this diagram](https://github.com/broadinstitute/gatk-sv/blob/main/terra_pipeline_diagram.jpg)
for the overall recommended invocation order.

<br/>

```mermaid

stateDiagram
direction LR

classDef inModules stroke-width:0px,fill:#00509d,color:#caf0f8
classDef thisModule font-weight:bold,stroke-width:0px,fill:#ff9900,color:white
classDef outModules stroke-width:0px,fill:#caf0f8,color:#00509d

gse: GatherSampleEvidence
eqc: EvidenceQC
batching: Batching, sample QC, and sex assignment

gse --> eqc
eqc --> batching

class eqc thisModule
class gse inModules
class batching outModules
```

<br/>

- [Gather Sample Evidence](./gse)

### Inputs

179 changes: 164 additions & 15 deletions website/docs/modules/gather_batch_evidence.md
@@ -5,25 +5,174 @@ sidebar_position: 4
slug: gbe
---

Runs CNV callers (cnMOPs, GATK gCNV) and combines single-sample
raw evidence into a batch. See above for more information on batching.
Runs CNV callers ([cn.MOPS](https://academic.oup.com/nar/article/40/9/e69/1136601), GATK-gCNV)
and combines single-sample raw evidence into a batch.

### Prerequisites
The following diagram illustrates the downstream workflows of the `GatherBatchEvidence` workflow
in the recommended invocation order. You may refer to
[this diagram](https://github.com/broadinstitute/gatk-sv/blob/main/terra_pipeline_diagram.jpg)
for the overall recommended invocation order.

- GatherSampleEvidence
- (Recommended) EvidenceQC
- gCNV training.
```mermaid

### Inputs
- PED file (updated with EvidenceQC sex assignments, including sex = 0
for sex aneuploidies. Calls will not be made on sex chromosomes
when sex = 0 in order to avoid generating many confusing calls
or upsetting normalized copy numbers for the batch.)
- Read count, BAF, PE, SD, and SR files (GatherSampleEvidence)
- Caller VCFs (GatherSampleEvidence)
- Contig ploidy model and gCNV model files (gCNV training)
stateDiagram
direction LR

classDef inModules stroke-width:0px,fill:#00509d,color:#caf0f8
classDef thisModule font-weight:bold,stroke-width:0px,fill:#ff9900,color:white
classDef outModules stroke-width:0px,fill:#caf0f8,color:#00509d

### Outputs
gbe: GatherBatchEvidence
t: TrainGCNV
cb: ClusterBatch
t --> gbe
gbe --> cb

class gbe thisModule
class t inModules
class cb outModules
```

## Inputs
This workflow takes as input the read counts, BAF, PE, SD, SR, and per-caller VCF files
produced in the GatherSampleEvidence workflow, and contig ploidy and gCNV models from
the TrainGCNV workflow.
The following is the list of the inputs the GatherBatchEvidence workflow takes.


#### `batch`
Collaborator

Is the plan to have detailed documentation for every input like this? Is that necessary?

Maybe it could be collapsible so it's more approachable for users who do not need that level of detail? Most users will just use the pre-configured default inputs and will only need detailed documentation on the pipeline-level inputs and outputs, and I wouldn't want to make it more difficult for them to navigate the documentation.

One other thing to consider is there are places where we do want users to be able to edit inputs as necessary, and I wouldn't want those inputs to get lost among the others - a separate category that does not collapse maybe?

Member Author

The plan is to document every required input of these modules.

We have discussed a few options for those required inputs that do not have values set on Terra, or set values need to be adjusted, or set values need tweaking for cohort-to-cohort, etc. One of the options is tagging/labeling such inputs (similar to labeling optional/conditional outputs) and we can think of other alternatives. However, that is beyond the scope of this PR as here we are just documenting all the required (at least leaving a placeholder for them), and we will revisit their spotlighting later.

An identifier for the batch.


#### `samples`
Sets the list of sample IDs.


#### `counts`
Set to the [`GatherSampleEvidence.coverage_counts`](./gse#coverage-counts) output.


#### Raw calls

The following inputs set the per-caller raw SV calls, and should be set
if the caller was run in the [`GatherSampleEvidence`](./gse) workflow.
You may set each of the following inputs to the linked output from
the GatherSampleEvidence workflow.


- `manta_vcfs`: [`GatherSampleEvidence.manta_vcf`](./gse#manta-vcf);
- `melt_vcfs`: [`GatherSampleEvidence.melt_vcf`](./gse#melt-vcf);
- `scramble_vcfs`: [`GatherSampleEvidence.scramble_vcf`](./gse#scramble-vcf);
- `wham_vcfs`: [`GatherSampleEvidence.wham_vcf`](./gse#wham-vcf).
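
The mapping above can be sketched in a few lines of Python. This is a minimal illustration, not part of the workflow: the `build_raw_call_inputs` helper, the `GatherBatchEvidence.` key prefix, and the example paths are assumptions made for the example. It collects per-sample GatherSampleEvidence outputs into the per-caller list inputs and leaves out any caller that was not run.

```python
import json

# GatherBatchEvidence list input -> GatherSampleEvidence output it is built from
# (taken from the list above).
CALLER_INPUTS = {
    "manta_vcfs": "manta_vcf",
    "melt_vcfs": "melt_vcf",
    "scramble_vcfs": "scramble_vcf",
    "wham_vcfs": "wham_vcf",
}


def build_raw_call_inputs(sample_outputs, prefix="GatherBatchEvidence"):
    """Collect per-sample caller VCFs into per-caller lists.

    sample_outputs: one dict per sample, keyed by GatherSampleEvidence
    output name. A caller is included only if every sample has a VCF for it
    (assumption: a caller is either run on the whole batch or not at all).
    """
    inputs = {}
    for batch_input, sample_output in CALLER_INPUTS.items():
        vcfs = [sample.get(sample_output) for sample in sample_outputs]
        if vcfs and all(vcfs):
            inputs[f"{prefix}.{batch_input}"] = vcfs
    return inputs


# Hypothetical paths, for illustration only.
example_samples = [
    {"manta_vcf": "gs://my-bucket/s1.manta.vcf.gz",
     "wham_vcf": "gs://my-bucket/s1.wham.vcf.gz"},
    {"manta_vcf": "gs://my-bucket/s2.manta.vcf.gz",
     "wham_vcf": "gs://my-bucket/s2.wham.vcf.gz"},
]
raw_call_inputs = build_raw_call_inputs(example_samples)
print(json.dumps(raw_call_inputs, indent=2))
```

Because MELT and Scramble were not run on the example samples, only the Manta and Wham inputs appear in the result.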

#### `PE_files`
Set to the [`GatherSampleEvidence.pesr_disc`](./gse#pesr-disc) output.

#### `SR_files`
Set to the [`GatherSampleEvidence.pesr_split`](./gse#pesr-split) output.


#### `SD_files`
Set to the [`GatherSampleEvidence.pesr_sd`](./gse#pesr-sd) output.


#### `matrix_qc_distance`
You may refer to [this file](https://github.com/broadinstitute/gatk-sv/blob/main/inputs/templates/terra_workspaces/cohort_mode/workflow_configurations/GatherBatchEvidence.json.tmpl)
for an example value.


#### `min_svsize`
Sets the minimum size of SVs to include.


#### `ped_file`
A pedigree file describing the familial relationships between the samples in the cohort.
Please refer to [this section](./#ped_file) for details.


#### `run_matrix_qc`
Enables or disables running optional QC tasks.


#### `gcnv_qs_cutoff`
You may refer to [this file](https://github.com/broadinstitute/gatk-sv/blob/main/inputs/templates/terra_workspaces/cohort_mode/workflow_configurations/GatherBatchEvidence.json.tmpl)
for an example value.

#### cn.MOPS files
The workflow needs the following cn.MOPS files.

- `cnmops_chrom_file` and `cnmops_allo_file`: FASTA index files (`.fai`) for
the autosomes (non-sex chromosomes) and the allosomes (chromosomes X and Y), respectively.
The file format is explained [on this page](https://www.htslib.org/doc/faidx.html).

You may use the following files for these fields:

```json
"cnmops_chrom_file": "gs://gcp-public-data--broad-references/hg38/v0/sv-resources/resources/v1/autosome.fai"
Collaborator

Echoing Mark's comments about not giving specific file paths in this documentation

Member Author

What is a good reference for these?

Collaborator

I think it makes most sense to direct users to the JSONs in inputs/ in general rather than linking to a specific JSON for each input (cluttered, requires more maintenance)

Member Author

I updated it to link to a specific file in the resources JSON; that is not much better than this, but we have ongoing internal discussions on how best to address such inputs. We should not point to the inputs/ directory without direct references as it will be a needle in a haystack given all the resources and a lot of variable name mismatches.

"cnmops_allo_file": "gs://gcp-public-data--broad-references/hg38/v0/sv-resources/resources/v1/allosome.fai"
```

- `cnmops_exclude_list`:
You may use [this file](https://github.com/broadinstitute/gatk-sv/blob/d66f760865a89f30dbce456a3f720dec8b70705c/inputs/values/resources_hg38.json#L10)
for this field.
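
For readers unfamiliar with the `.fai` layout used by the cn.MOPS files above, the following Python sketch parses the five tab-separated columns of a FASTA index (name, length, offset, bases per line, bytes per line) and splits the records by contig name. The helper names and the toy index text are made up for illustration; the published resource files are already split, so this only shows what each file contains.

```python
from typing import NamedTuple


class FaiRecord(NamedTuple):
    name: str       # contig name
    length: int     # number of bases in the contig
    offset: int     # byte offset of the contig's first base
    linebases: int  # bases per line
    linewidth: int  # bytes per line, including the newline


def parse_fai(text):
    """Parse the contents of a FASTA index (.fai) into records."""
    records = []
    for line in text.splitlines():
        if not line.strip():
            continue
        name, length, offset, linebases, linewidth = line.split("\t")[:5]
        records.append(
            FaiRecord(name, int(length), int(offset), int(linebases), int(linewidth))
        )
    return records


def split_contigs(records, allosomes=("chrX", "chrY")):
    """Split records purely by name: allosomes vs. everything else."""
    autosomal = [r for r in records if r.name not in allosomes]
    allosomal = [r for r in records if r.name in allosomes]
    return autosomal, allosomal


# Toy index with made-up lengths and offsets, for illustration only.
toy_fai = (
    "chr1\t1000\t6\t60\t61\n"
    "chrX\t500\t1030\t60\t61\n"
    "chrY\t200\t1546\t60\t61\n"
)
autosomal, allosomal = split_contigs(parse_fai(toy_fai))
print([r.name for r in autosomal], [r.name for r in allosomal])
```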

#### GATK-gCNV inputs

The following inputs are configured based on the outputs generated in the [`TrainGCNV`](./gcnv) workflow.

- `contig_ploidy_model_tar`: [`TrainGCNV.cohort_contig_ploidy_model_tar`](./gcnv#contig-ploidy-model-tarball)
- `gcnv_model_tars`: [`TrainGCNV.cohort_gcnv_model_tars`](./gcnv#model-tarballs)


The workflow also exposes a few optional gCNV arguments.
Their default values are listed in
[this file](https://github.com/broadinstitute/gatk-sv/blob/main/inputs/templates/terra_workspaces/cohort_mode/workflow_configurations/GatherBatchEvidence.json.tmpl),
and each argument is documented on
[this page](https://gatk.broadinstitute.org/hc/en-us/articles/360037593411-PostprocessGermlineCNVCalls)
and
[this page](https://gatk.broadinstitute.org/hc/en-us/articles/360047217671-GermlineCNVCaller).


#### Docker images

The workflow needs the following Docker images, the latest versions of which are in
[this file](https://github.com/broadinstitute/gatk-sv/blob/main/inputs/values/dockers.json).

- `cnmops_docker`;
- `condense_counts_docker`;
- `linux_docker`;
- `sv_base_docker`;
- `sv_base_mini_docker`;
- `sv_pipeline_docker`;
- `sv_pipeline_qc_docker`;
- `gcnv_gatk_docker`;
- `gatk_docker`.
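
One way to catch a missing image before launching the workflow is a small check like the Python sketch below. It assumes the Docker file has been loaded as a flat mapping from input name to image (for example with `json.load`); that layout, the `select_dockers` helper, and the example image names are assumptions for illustration.

```python
# Docker inputs required by GatherBatchEvidence (from the list above).
REQUIRED_DOCKERS = [
    "cnmops_docker",
    "condense_counts_docker",
    "linux_docker",
    "sv_base_docker",
    "sv_base_mini_docker",
    "sv_pipeline_docker",
    "sv_pipeline_qc_docker",
    "gcnv_gatk_docker",
    "gatk_docker",
]


def select_dockers(dockers, required=REQUIRED_DOCKERS):
    """Return only the required entries, failing loudly if any is missing."""
    missing = [name for name in required if name not in dockers]
    if missing:
        raise KeyError(f"missing Docker image entries: {', '.join(missing)}")
    return {name: dockers[name] for name in required}


# Hypothetical image names; real values come from the Docker file linked above.
example_dockers = {name: f"example.registry/{name}:tag" for name in REQUIRED_DOCKERS}
example_dockers["unrelated_docker"] = "example.registry/unrelated:tag"

selected_dockers = select_dockers(example_dockers)
print(sorted(selected_dockers))
```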

#### Static inputs

You may refer to [this reference file](https://github.com/broadinstitute/gatk-sv/blob/main/inputs/values/resources_hg38.json)
for values of the following inputs.

- `primary_contigs_fai`;
- `cytoband`;
- `ref_dict`;
- `mei_bed`;
- `genome_file`;
- `sd_locs_vcf`.


#### Optional Inputs
The following are a few of the workflow's optional inputs,
each with an example value.

- `"allosomal_contigs": [["chrX", "chrY"]]`
- `"ploidy_sample_psi_scale": 0.001`





## Outputs

- Combined read count matrix, SR, PE, and BAF files
- Standardized call VCFs
71 changes: 64 additions & 7 deletions website/docs/modules/gather_sample_evidence.md
@@ -6,20 +6,77 @@ slug: gse
---

Runs raw evidence collection on each sample with the following SV callers:
Manta, Wham, and/or MELT. For guidance on pre-filtering prior to GatherSampleEvidence,
Manta, Wham, Scramble, and/or MELT. For guidance on pre-filtering prior to GatherSampleEvidence,
refer to the Sample Exclusion section.

Note: a list of sample IDs must be provided. Refer to the sample ID
requirements for specifications of allowable sample IDs.
The following diagram illustrates the downstream workflows of the `GatherSampleEvidence` workflow
in the recommended invocation order. You may refer to
[this diagram](https://github.com/broadinstitute/gatk-sv/blob/main/terra_pipeline_diagram.jpg)
for the overall recommended invocation order.


```mermaid

stateDiagram
direction LR

classDef inModules stroke-width:0px,fill:#00509d,color:#caf0f8
classDef thisModule font-weight:bold,stroke-width:0px,fill:#ff9900,color:white
classDef outModules stroke-width:0px,fill:#caf0f8,color:#00509d

gse: GatherSampleEvidence
eqc: EvidenceQC
gse --> eqc

class gse thisModule
class eqc outModules
```


## Inputs

#### `bam_or_cram_file`
A BAM or CRAM file aligned to hg38. An index file (`.bai`) must be provided when using a BAM file.

#### `sample_id`
Refer to the [sample ID requirements](/docs/gs/inputs#sampleids) for specifications of allowable sample IDs.
IDs that do not meet these requirements may cause errors.

### Inputs
#### `preprocessed_intervals`
Picard interval list.

#### `sd_locs_vcf`
(`sd`: site depth)
A VCF file containing allele counts at common SNP loci of the genome, which is used for calculating BAF.
For the human genome, you may use [`dbSNP`](https://www.ncbi.nlm.nih.gov/snp/),
which contains a complete list of common and clinical human single nucleotide variations,
microsatellites, and small-scale insertions and deletions.
You may find a link to the file in
[this reference](https://github.com/broadinstitute/gatk-sv/blob/main/inputs/values/resources_hg38.json).
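
As a rough illustration of how allele counts relate to BAF: the B-allele frequency at a SNP site is commonly defined as the fraction of reads that support the alternate (B) allele. The Python sketch below uses that general definition only; it is not the pipeline's implementation, and the function name and counts are hypothetical.

```python
def b_allele_frequency(ref_count, alt_count):
    """Fraction of reads supporting the alternate allele at one site.

    Returns None when the site has no coverage, since the ratio is undefined.
    """
    depth = ref_count + alt_count
    if depth == 0:
        return None
    return alt_count / depth


# Hypothetical allele counts at three sites: (ref reads, alt reads).
site_counts = [(10, 10), (30, 0), (0, 0)]
bafs = [b_allele_frequency(ref, alt) for ref, alt in site_counts]
print(bafs)
```

A heterozygous site in a copy-neutral region is expected to sit near 0.5, which is why shifts away from that value are informative about copy number.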

- Per-sample BAM or CRAM files aligned to hg38. Index files (.bai) must be provided if using BAMs.

### Outputs
## Outputs

- Caller VCFs (Manta, MELT, and/or Wham)
- Binned read counts file
- Split reads (SR) file
- Discordant read pairs (PE) file

#### `manta_vcf` {#manta-vcf}
A VCF file containing variants called by Manta.

#### `melt_vcf` {#melt-vcf}
A VCF file containing variants called by MELT.

#### `scramble_vcf` {#scramble-vcf}
A VCF file containing variants called by Scramble.

#### `wham_vcf` {#wham-vcf}
A VCF file containing variants called by Wham.

#### `coverage_counts` {#coverage-counts}
Collaborator

Are there supposed to be descriptions here? Feels inconsistent with the other sections

Member Author

I don't have a description of these. We discussed leaving them as placeholders to make sure we will populate them. If you have a description, feel free to suggest one.


#### `pesr_disc` {#pesr-disc}

#### `pesr_split` {#pesr-split}

#### `pesr_sd` {#pesr-sd}
15 changes: 15 additions & 0 deletions website/docs/modules/index.md
@@ -36,3 +36,18 @@ consisting of multiple modules to be executed in the following order.
- **Module 09 (in development)** Visualization, including scripts that generate IGV screenshots and rd plots.

- Additional modules to be added: de novo and mosaic scripts


## Pipeline Parameters

Several inputs are shared across different modules of the pipeline, which are explained in this section.

#### `ped_file`

A pedigree file describing the familial relationships between the samples in the cohort.
The file needs to be in the
[PED format](https://gatk.broadinstitute.org/hc/en-us/articles/360035531972-PED-Pedigree-format).
The file should be updated with [EvidenceQC](./eqc) sex assignments, including
`sex = 0` for sex aneuploidies.
Genotypes on chrX and chrY for samples with `sex = 0` in the PED file will be set to
`./.`, and these samples will be excluded from sex-specific training steps.
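
Updating the sex column can be sketched as follows in Python. The sketch assumes a tab-delimited, six-column PED file (family ID, sample ID, paternal ID, maternal ID, sex, phenotype) and a mapping from sample ID to sex assignment that the user has already derived from the EvidenceQC results; the `update_ped_sex` helper and the example records are hypothetical.

```python
def update_ped_sex(ped_text, sex_assignments):
    """Return PED text with the sex column replaced for the given samples.

    sex_assignments maps sample ID -> 1 (male), 2 (female), or
    0 (for example, a sex aneuploidy). Samples not in the mapping
    are left unchanged, as are blank lines and '#' comment lines.
    """
    lines = []
    for line in ped_text.splitlines():
        if not line.strip() or line.startswith("#"):
            lines.append(line)
            continue
        fields = line.split("\t")
        if len(fields) != 6:
            raise ValueError(f"expected 6 tab-separated columns, got {len(fields)}: {line!r}")
        sample_id = fields[1]
        if sample_id in sex_assignments:
            fields[4] = str(sex_assignments[sample_id])
        lines.append("\t".join(fields))
    return "\n".join(lines) + "\n"


# Hypothetical two-sample pedigree; sample S2 is reassigned to sex = 0.
example_ped = "FAM1\tS1\t0\t0\t1\t0\nFAM1\tS2\t0\t0\t2\t0\n"
updated_ped = update_ped_sex(example_ped, {"S2": 0})
print(updated_ped, end="")
```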