Update GatherSampleEvidence & TrainGCNV docs #681

Merged
merged 26 commits into from
Sep 25, 2024
Changes from 10 commits
Commits
8b6931c
Extend docs.
VJalili May 13, 2024
cdbe237
Merge remote-tracking branch 'upstream/main' into docs_gather_sample_…
VJalili May 13, 2024
1fcfe12
Add Scramble to GSE.
VJalili May 29, 2024
19a824a
Update TrainGCNV docs.
VJalili May 29, 2024
d5a3c98
Merge remote-tracking branch 'upstream/main' into docs_gather_sample_…
VJalili May 29, 2024
71de496
Document the annotated_intervals output.
VJalili Jun 6, 2024
6032dcf
Add an option to highlight text.
VJalili Jun 12, 2024
a0f0ced
Extend docs on inputs and outputs of workflows.
VJalili Aug 2, 2024
a06c18b
Fix typo & add diagram for gather sample evidence.
VJalili Aug 2, 2024
00076ec
Update header level to match inputs section.
VJalili Aug 2, 2024
dfb0a72
Update website/docs/modules/gather_sample_evidence.md
VJalili Sep 13, 2024
c7f9795
Update website/docs/modules/gather_sample_evidence.md
VJalili Sep 13, 2024
e1862b3
Update website/docs/modules/train_gcnv.md
VJalili Sep 13, 2024
fd129b3
Replace direct link with a reference to the resources file.
VJalili Sep 13, 2024
2926a89
Update website/docs/modules/train_gcnv.md
VJalili Sep 13, 2024
7d4b503
Replace direct links with references to the resources file.
VJalili Sep 13, 2024
2ab676e
Separate gatk-sv input, & add additional external docs link.
VJalili Sep 13, 2024
fdb33ab
Remove links.
VJalili Sep 13, 2024
b9dfb39
Add a single-line descript to avoid empty section. Needs to be extend…
VJalili Sep 13, 2024
490c739
update diagrams to display recommended invocation order.
VJalili Sep 18, 2024
53df818
add a common inputs section & remove some values.
VJalili Sep 18, 2024
fb0bf8f
make plural
VJalili Sep 18, 2024
d291c6c
update.
VJalili Sep 19, 2024
56dfaa2
update link
VJalili Sep 19, 2024
2e2a4dc
clarify homogeneous
VJalili Sep 19, 2024
b782169
Fix a broken link.
VJalili Sep 25, 2024
32 changes: 30 additions & 2 deletions website/docs/modules/evidence_qc.md
@@ -5,6 +5,8 @@ sidebar_position: 2
slug: eqc
---

import { Highlight, HighlightOptionalArg } from "../../src/components/highlight.js"

Runs ploidy estimation, dosage scoring, and optionally VCF QC.
The results from this module can be used for QC and batching.

@@ -17,9 +19,35 @@ for further guidance on creating batches.
We also recommend using sex assignments generated from the ploidy
estimates and incorporating them into the PED file, with sex = 0 for sex aneuploidies.

### Prerequisites
The upstream and downstream dependencies of the EvidenceQC workflow
are illustrated in the following diagram.

<br/>

```mermaid

stateDiagram
direction LR

classDef inModules stroke-width:0px,fill:#00509d,color:#caf0f8
classDef thisModule font-weight:bold,stroke-width:0px,fill:#ff9900,color:white
classDef outModules stroke-width:0px,fill:#caf0f8,color:#00509d

gse: GatherSampleEvidence
gbe: GatherBatchEvidence
eqc: EvidenceQC
t: TrainGCNV
gse --> eqc
eqc --> t
eqc --> gbe
Collaborator

Suggested change
eqc --> gbe
t --> gbe

This is related to the existing thread on displaying dependencies. None of the outputs of EvidenceQC are used in GatherBatchEvidence (or TrainGCNV technically). EvidenceQC is recommended to use to create batches for TrainGCNV (and the following steps) but if that's what you wanted to represent I would just exclude GatherBatchEvidence from this diagram since it follows TrainGCNV

Member Author

This is resolved in the updated diagrams; please recheck.


class eqc thisModule
class gse inModules
class t, gbe outModules
```

<br/>

- [Gather Sample Evidence](./gse)

### Inputs

204 changes: 189 additions & 15 deletions website/docs/modules/gather_batch_evidence.md
@@ -5,25 +5,199 @@ sidebar_position: 4
slug: gbe
---

Runs CNV callers (cnMOPs, GATK gCNV) and combines single-sample
raw evidence into a batch. See above for more information on batching.
Runs CNV callers ([cn.MOPS](https://academic.oup.com/nar/article/40/9/e69/1136601), GATK gCNV)
and combines single-sample raw evidence into a batch.

### Prerequisites

- GatherSampleEvidence
- (Recommended) EvidenceQC
- gCNV training.
```mermaid

### Inputs
- PED file (updated with EvidenceQC sex assignments, including sex = 0
for sex aneuploidies. Calls will not be made on sex chromosomes
when sex = 0 in order to avoid generating many confusing calls
or upsetting normalized copy numbers for the batch.)
- Read count, BAF, PE, SD, and SR files (GatherSampleEvidence)
- Caller VCFs (GatherSampleEvidence)
- Contig ploidy model and gCNV model files (gCNV training)
stateDiagram
direction LR

classDef inModules stroke-width:0px,fill:#00509d,color:#caf0f8
classDef thisModule font-weight:bold,stroke-width:0px,fill:#ff9900,color:white
classDef outModules stroke-width:0px,fill:#caf0f8,color:#00509d

### Outputs
gse: GatherSampleEvidence
eqc: EvidenceQC
gcnv: TrainGCNV
gbe: GatherBatchEvidence
cbe: ClusterBatch
gse --> gbe
eqc --> gbe
gcnv --> gbe
gbe --> cbe

class gbe thisModule
class gse, eqc, gcnv inModules
class cbe outModules
```
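
The edges of the diagram above can also be read as a dependency table. The following is a minimal Python sketch, not part of GATK-SV, that derives one valid invocation order from those edges with the standard-library `graphlib` module (Python 3.9+). The names `DEPENDENCIES` and `invocation_order` are illustrative assumptions.

```python
from graphlib import TopologicalSorter

# Edges copied from the diagram: each key lists the modules it depends on.
DEPENDENCIES = {
    "GatherBatchEvidence": {"GatherSampleEvidence", "EvidenceQC", "TrainGCNV"},
    "ClusterBatch": {"GatherBatchEvidence"},
}


def invocation_order(dependencies):
    """Return one run order in which every module follows its prerequisites."""
    return list(TopologicalSorter(dependencies).static_order())


if __name__ == "__main__":
    print(invocation_order(DEPENDENCIES))
```

The relative order of the three upstream modules is not fixed by this diagram alone, so the sketch only guarantees that GatherBatchEvidence comes after all of them and before ClusterBatch.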

## Inputs
This workflow takes as input the read counts, BAF, PE, SD, SR, and per-caller VCF files
produced in the GatherSampleEvidence workflow, and contig ploidy and gCNV models from
the TrainGCNV workflow.
The GatherBatchEvidence workflow takes the following inputs.


#### `batch`
Collaborator

Is the plan to have detailed documentation for every input like this? Is that necessary?

Maybe it could be collapsible so it's more approachable for users who do not need that level of detail? Most users will just use the pre-configured default inputs and will only need detailed documentation on the pipeline-level inputs and outputs, and I wouldn't want to make it more difficult for them to navigate the documentation.

One other thing to consider is there are places where we do want users to be able to edit inputs as necessary, and I wouldn't want those inputs to get lost among the others - a separate category that does not collapse maybe?

Member Author

The plan is to document every required input of these modules.

We have discussed a few options for those required inputs that do not have values set on Terra, or set values need to be adjusted, or set values need tweaking for cohort-to-cohort, etc. One of the options is tagging/labeling such inputs (similar to labeling optional/conditional outputs) and we can think of other alternatives. However, that is beyond the scope of this PR as here we are just documenting all the required (at least leaving a placeholder for them), and we will revisit their spotlighting later.

An identifier for the batch.


#### `samples`
Sets the list of sample IDs.


#### `counts`
Set to the [`GatherSampleEvidence.coverage_counts`](./gse#coverage-counts) output.


#### Raw calls

The following inputs set the per-caller raw SV calls, and should be set
if the caller was run in the [`GatherSampleEvidence`](./gse) workflow.
You may set each of the following inputs to the linked output from
the GatherSampleEvidence workflow.


- `manta_vcfs`: [`GatherSampleEvidence.manta_vcf`](./gse#manta-vcf);
- `melt_vcfs`: [`GatherSampleEvidence.melt_vcf`](./gse#melt-vcf);
- `scramble_vcfs`: [`GatherSampleEvidence.scramble_vcf`](./gse#scramble-vcf);
- `wham_vcfs`: [`GatherSampleEvidence.wham_vcf`](./gse#wham-vcf).

#### `PE_files`
Set to the [`GatherSampleEvidence.pesr_disc`](./gse#pesr-disc) output.

#### `SR_files`
Set to the [`GatherSampleEvidence.pesr_split`](./gse#pesr-split) output.


#### `SD_files`
Set to the [`GatherSampleEvidence.pesr_sd`](./gse#pesr-sd) output.
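
The mapping from per-sample GatherSampleEvidence outputs to the batch-level array inputs listed above can be sketched as follows. This is a hedged illustration in Python, not a GATK-SV utility: the `sample_outputs` layout and the `build_batch_inputs` helper are hypothetical, and the policy of rejecting a caller that was run on only some samples is an assumption of this sketch. In a real inputs JSON the keys are typically also prefixed with the workflow name.

```python
# GatherBatchEvidence input -> GatherSampleEvidence output, as listed above.
INPUT_TO_OUTPUT = {
    "counts": "coverage_counts",
    "manta_vcfs": "manta_vcf",
    "melt_vcfs": "melt_vcf",
    "scramble_vcfs": "scramble_vcf",
    "wham_vcfs": "wham_vcf",
    "PE_files": "pesr_disc",
    "SR_files": "pesr_split",
    "SD_files": "pesr_sd",
}


def build_batch_inputs(batch, sample_outputs):
    """Collect per-sample outputs into per-batch arrays.

    sample_outputs is a hypothetical {sample_id: {output_name: path}} mapping.
    """
    samples = sorted(sample_outputs)
    inputs = {"batch": batch, "samples": samples}
    for batch_input, sample_output in INPUT_TO_OUTPUT.items():
        paths = [sample_outputs[s].get(sample_output) for s in samples]
        if all(p is None for p in paths):
            continue  # output absent for every sample: leave the input unset
        if any(p is None for p in paths):
            # Assumption of this sketch: partial coverage is treated as an error.
            raise ValueError(f"{sample_output} is missing for some samples")
        inputs[batch_input] = paths
    return inputs
```

Sorting the sample IDs once and building every array from that same list keeps the arrays aligned with `samples`.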


#### `matrix_qc_distance`
You may set it to `1000000`.
Collaborator

I think providing specific input values on the website is not what we want to do, as we don't want to have to update the website every time we update the inputs. I think we should refer users to the JSON templates. I also think "You may set it to" is pretty ambiguous - probably we want to refer to these as "recommended input values" or "default settings" or similar.

Member Author

I changed it to reference the external files we have. The text is a placeholder, and we should document what it does and what its impact is.

Collaborator

I guess I prefer to finalize the text while the PR is open. The edits in the PR allow for convenient discussion of these particular lines. And in my experience TODOs often get lost once they're merged!

Member Author

The external reference I mentioned above is my best resource for this; if you have other information on it, happy to extend it.



#### `min_svsize`
Sets the minimum size of SVs to include.
You may set it to `50`.


#### `ped_file`
A pedigree file describing the familial relationships between the samples in the cohort.
The file needs to be in the
[PED format](https://gatk.broadinstitute.org/hc/en-us/articles/360035531972-PED-Pedigree-format).
The file should be updated with [EvidenceQC](./eqc) sex assignments, including
`sex = 0` for sex aneuploidies. Calls will not be made on sex chromosomes
when `sex = 0`, in order to avoid generating many confusing calls
or upsetting normalized copy numbers for the batch.
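
Updating the sex column of a PED file can be sketched as below. This is a minimal Python illustration under stated assumptions: the PED rows are tab-delimited with six columns (family, individual, father, mother, sex, phenotype), and `sex_assignments` is a hypothetical `{sample_id: sex}` mapping that you would derive yourself from the EvidenceQC ploidy results. How EvidenceQC reports those results is not described here.

```python
SEX_COLUMN = 4  # zero-based index of the sex field in a six-column PED row


def update_ped_sex(ped_lines, sex_assignments):
    """Return PED rows with the sex field replaced for the listed samples.

    sex_assignments maps sample ID to 1 (male), 2 (female), or 0
    (for example, a sex aneuploidy). The mapping itself is hypothetical.
    """
    updated = []
    for line in ped_lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) != 6:
            raise ValueError(f"expected 6 tab-delimited fields, got {len(fields)}")
        sample_id = fields[1]
        if sample_id in sex_assignments:
            fields[SEX_COLUMN] = str(sex_assignments[sample_id])
        updated.append("\t".join(fields))
    return updated
```

Samples that are absent from the mapping keep their original sex value.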


#### `run_matrix_qc`
Enables or disables running optional QC tasks.


#### cn.MOPS files
The workflow needs the following cn.MOPS files.

- `cnmops_chrom_file` and `cnmops_allo_file`: FASTA index files (`.fai`) for the non-sex chromosomes (autosomes) and for chromosomes X and Y (allosomes), respectively.
 The format is explained [on this page](https://www.htslib.org/doc/faidx.html),
 and the content of such a file may look like the following.

```bash
chrX 156040895 2903754205 100 101
chrY 57227415 3061355656 100 101
```

You may use the following files for these fields:

```json
"cnmops_chrom_file": "gs://gcp-public-data--broad-references/hg38/v0/sv-resources/resources/v1/autosome.fai"
Collaborator

Echoing Mark's comments about not giving specific file paths in this documentation

Member Author

What is a good reference for these?

Collaborator

I think it makes most sense to direct users to the JSONs in inputs/ in general rather than linking to a specific JSON for each input (cluttered, requires more maintenance)

Member Author

I updated it to link to a specific file in the resources JSON; that is not much better than this, but we have ongoing internal discussions on how best to address such inputs. We should not point to the inputs/ directory without direct references as it will be a needle in a haystack given all the resources and a lot of variable name mismatches.

"cnmops_allo_file": "gs://gcp-public-data--broad-references/hg38/v0/sv-resources/resources/v1/allosome.fai"
```

- `cnmops_exclude_list`: You may use the following file for this field.
```
gs://gcp-public-data--broad-references/hg38/v0/sv-resources/resources/v1/GRCh38_Nmask.bed
```
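
The split between the autosome and allosome index files described above can be sketched in Python. This is an illustration only, not how the published resource files were produced: `ALLOSOMES` and `split_fai` are hypothetical names, and the sketch assumes each `.fai` row starts with the contig name followed by whitespace-separated columns.

```python
ALLOSOMES = {"chrX", "chrY"}  # assumption: hg38-style contig names


def split_fai(fai_lines):
    """Split .fai rows into (autosome_rows, allosome_rows) by contig name."""
    autosomes, allosomes = [], []
    for line in fai_lines:
        row = line.rstrip("\n")
        if not row.strip():
            continue  # skip blank lines
        contig = row.split()[0]
        if contig in ALLOSOMES:
            allosomes.append(row)
        else:
            autosomes.append(row)
    return autosomes, allosomes
```

Any contig that is not named in `ALLOSOMES` ends up in the autosome list, so an index containing other contigs would need an extra filter.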

#### GATK-gCNV inputs

The following inputs are configured based on the outputs generated in the [`TrainGCNV`](./gcnv) workflow.

- `contig_ploidy_model_tar`: [`TrainGCNV.cohort_contig_ploidy_model_tar`](./gcnv#contig-ploidy-model-tarball)
- `gcnv_model_tars`: [`TrainGCNV.cohort_gcnv_model_tars`](./gcnv#model-tarballs)


The workflow also allows setting a few optional gCNV arguments.
The arguments and their default values are listed below,
and each argument is documented on
[this page](https://gatk.broadinstitute.org/hc/en-us/articles/360037593411-PostprocessGermlineCNVCalls).

```json
"gcnv_qs_cutoff": 30,
"gcnv_caller_internal_admixing_rate": 0.5,
"gcnv_caller_update_convergence_threshold": 0.000001,
"gcnv_cnv_coherence_length": 1000,
"gcnv_convergence_snr_averaging_window": 100,
"gcnv_convergence_snr_countdown_window": 10,
"gcnv_convergence_snr_trigger_threshold": 0.2,
"gcnv_copy_number_posterior_expectation_mode": "EXACT",
"gcnv_depth_correction_tau": 10000,
"gcnv_learning_rate": 0.03,
"gcnv_log_emission_sampling_median_rel_error": 0.001,
"gcnv_log_emission_sampling_rounds": 20,
"gcnv_max_advi_iter_first_epoch": 1000,
"gcnv_max_advi_iter_subsequent_epochs": 200,
"gcnv_max_training_epochs": 5,
"gcnv_min_training_epochs": 1,
"gcnv_num_thermal_advi_iters": 250,
"gcnv_p_alt": 0.000001,
"gcnv_sample_psi_scale": 0.000001,
"ref_copy_number_autosomal_contigs": 2
```
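
When only a few of these optional arguments need to change, one way to avoid silent typos in argument names is to merge overrides onto the defaults and reject unknown keys. The following Python sketch shows the idea with a subset of the defaults listed above. The `apply_overrides` helper is hypothetical and is not part of the workflow.

```python
# A subset of the defaults shown above, copied for illustration.
GCNV_DEFAULTS = {
    "gcnv_qs_cutoff": 30,
    "gcnv_learning_rate": 0.03,
    "gcnv_max_training_epochs": 5,
    "gcnv_min_training_epochs": 1,
}


def apply_overrides(defaults, overrides):
    """Return defaults updated with overrides, rejecting unknown argument names."""
    unknown = set(overrides) - set(defaults)
    if unknown:
        raise KeyError(f"unknown gCNV arguments: {sorted(unknown)}")
    return {**defaults, **overrides}
```

The function returns a new dictionary, so the defaults are left unchanged.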


#### Docker images

The workflow needs the following Docker images; links to the latest
images are listed in [this file](https://github.com/broadinstitute/gatk-sv/blob/main/inputs/values/dockers.json).

- `cnmops_docker`;
- `condense_counts_docker`;
- `linux_docker`;
- `sv_base_docker`;
- `sv_base_mini_docker`;
- `sv_pipeline_docker`;
- `sv_pipeline_qc_docker`;
- `gcnv_gatk_docker`;
- `gatk_docker`.

#### Static inputs

```json
"primary_contigs_fai": "gs://gcp-public-data--broad-references/hg38/v0/sv-resources/resources/v1/contig.fai",
"cytoband": "gs://gcp-public-data--broad-references/hg38/v0/sv-resources/resources/v1/cytobands_hg38.bed.gz",
"ref_dict": "gs://gcp-public-data--broad-references/hg38/v0/Homo_sapiens_assembly38.dict",
"mei_bed": "gs://gcp-public-data--broad-references/hg38/v0/sv-resources/resources/v1/mei_hg38.bed.gz",
"genome_file": "gs://gcp-public-data--broad-references/hg38/v0/sv-resources/resources/v1/hg38.genome",
"sd_locs_vcf": "gs://gcp-public-data--broad-references/hg38/v0/Homo_sapiens_assembly38.dbsnp138.vcf"
```

#### Optional Inputs
The following lists a few of the workflow's optional inputs,
each with an example value.

- `"allosomal_contigs": [["chrX", "chrY"]]`
- `"ploidy_sample_psi_scale": 0.001`





## Outputs

- Combined read count matrix, SR, PE, and BAF files
- Standardized call VCFs
69 changes: 63 additions & 6 deletions website/docs/modules/gather_sample_evidence.md
@@ -6,20 +6,77 @@ slug: gse
---

Runs raw evidence collection on each sample with the following SV callers:
Manta, Wham, and/or MELT. For guidance on pre-filtering prior to GatherSampleEvidence,
Manta, Wham, Scramble, and/or MELT. For guidance on pre-filtering prior to GatherSampleEvidence,
refer to the Sample Exclusion section.

Note: a list of sample IDs must be provided. Refer to the sample ID
requirements for specifications of allowable sample IDs.
The downstream dependencies of the GatherSampleEvidence workflow
are illustrated in the following diagram.

```mermaid

stateDiagram
direction LR

classDef inModules stroke-width:0px,fill:#00509d,color:#caf0f8
classDef thisModule font-weight:bold,stroke-width:0px,fill:#ff9900,color:white
classDef outModules stroke-width:0px,fill:#caf0f8,color:#00509d

gse: GatherSampleEvidence
eqc: EvidenceQC
gcnv: TrainGCNV
gbe: GatherBatchEvidence
gse --> eqc
gse --> gcnv
gse --> gbe

class gse thisModule
class eqc, gcnv, gbe outModules
```


## Inputs

#### `bam_or_cram_file`
A BAM or CRAM file aligned to hg38. Index file (.bai) must be provided if using BAM.

#### `sample_id`
Refer to the [sample ID requirements](/docs/gs/inputs#sampleids) for specifications of allowable sample IDs.
IDs that do not meet these requirements may cause errors.

### Inputs
#### `preprocessed_intervals`
Picard interval list.

#### `sd_locs_vcf`
(`sd`: site depth)
A VCF file containing allele counts at common SNP loci of the genome, which is used for calculating BAF.
For the human genome, you may use [`dbSNP`](https://www.ncbi.nlm.nih.gov/snp/)
that contains a complete list of common and clinical human single nucleotide variations,
microsatellites, and small-scale insertions and deletions.
You may download the file from the following link.

- Per-sample BAM or CRAM files aligned to hg38. Index files (.bai) must be provided if using BAMs.
```shell
gs://gcp-public-data--broad-references/hg38/v0/Homo_sapiens_assembly38.dbsnp138.vcf
```
Collaborator

This seems redundant since all resource files are available in /inputs/values and also feels inconsistent to list it for just this input.

Member Author

I agree; this would also lead to having to keep updated references in various places. I changed the link to refer to the reference file, would this be better?


### Outputs
## Outputs

- Caller VCFs (Manta, MELT, and/or Wham)
- Binned read counts file
- Split reads (SR) file
- Discordant read pairs (PE) file

#### `manta_vcf` {#manta-vcf}

#### `melt_vcf` {#melt-vcf}

#### `scramble_vcf` {#scramble-vcf}

#### `wham_vcf` {#wham-vcf}

#### `coverage_counts` {#coverage-counts}
Collaborator

Are there supposed to be descriptions here? Feels inconsistent with the other sections

Member Author

I don't have a description of these. We discussed leaving them as placeholders to make sure we will populate them. If you have a description, feel free to suggest one.


#### `pesr_disc` {#pesr-disc}

#### `pesr_split` {#pesr-split}

#### `pesr_sd` {#pesr-sd}