Commit

Merge branch 'dev' into 124-fix-oom-for-long-contigs
Signed-off-by: Joon Klaps <[email protected]>
Joon-Klaps authored Jun 5, 2024
2 parents d07ff47 + 84e37f2 commit 32e0892
Showing 22 changed files with 207 additions and 115 deletions.
2 changes: 2 additions & 0 deletions CHANGELOG.md
@@ -9,6 +9,8 @@ Initial release of Joon-Klaps/viralgenie, created with the [nf-core](https://nf-

### `Enhancement`

- 119 include sspace cobra for contig extension ([#123](https://github.com/Joon-Klaps/viralgenie/pull/123))

### `Fixed`

- OOM with longer contigs for lowcov_to_reference, uses more RAM now ([#125](https://github.com/Joon-Klaps/viralgenie/pull/125))
4 changes: 4 additions & 0 deletions CITATIONS.md
@@ -115,6 +115,10 @@

> Bankevich, Anton et al. “SPAdes: a new genome assembly algorithm and its applications to single-cell sequencing.” Journal of computational biology : a journal of computational molecular cell biology vol. 19,5 (2012): 455-77. doi:10.1089/cmb.2012.0021
- [SSPACE Basic](https://pubmed.ncbi.nlm.nih.gov/21149342/)

> Boetzer, Marten et al. “Scaffolding pre-assembled contigs using SSPACE.” Bioinformatics (Oxford, England) vol. 27,4 (2011): 578-9. doi:10.1093/bioinformatics/btq683
- [Trimmomatic](https://pubmed.ncbi.nlm.nih.gov/24695404/)

> Bolger, Anthony M et al. “Trimmomatic: a flexible trimmer for Illumina sequence data.” Bioinformatics (Oxford, England) vol. 30,15 (2014): 2114-20. doi:10.1093/bioinformatics/btu170
35 changes: 18 additions & 17 deletions README.md
@@ -18,7 +18,7 @@
<!-- [![Get help on Slack](http://img.shields.io/badge/slack-nf--core%20%23viralgenie-4A154B?labelColor=000000&logo=slack)](https://nfcore.slack.com/channels/viralgenie)-->

> [!TIP]
> Make sure to check out the [viralgenie website](https://joon-klaps.github.io/viralgenie/) for more elaborate documentation!
> Make sure to check out the [viralgenie website](https://joon-klaps.github.io/viralgenie/latest/) for more elaborate documentation!
## Introduction

@@ -41,31 +41,32 @@
- [`Kaiju`](https://kaiju.binf.ku.dk/)
- Plotting Kraken2 and Kaiju ([`Krona`](https://hpc.nih.gov/apps/kronatools.html))
4. Denovo assembly ([`SPAdes`](http://cab.spbu.ru/software/spades/), [`TRINITY`](https://github.com/trinityrnaseq/trinityrnaseq), [`megahit`](https://github.com/voutcn/megahit)), combine contigs.
5. Contig reference identification ([`blastn`](https://blast.ncbi.nlm.nih.gov/Blast.cgi?PAGE_TYPE=BlastSearch))
5. [Optional] Extend the contigs with [sspace_basic](https://github.com/nsoranzo/sspace_basic)
6. Contig reference identification ([`blastn`](https://blast.ncbi.nlm.nih.gov/Blast.cgi?PAGE_TYPE=BlastSearch))
- Identify top 5 blast hits
- Merge blast hit and all contigs of a sample
6. [Optional] Precluster contigs based on taxonomy
7. [Optional] Precluster contigs based on taxonomy
- Identify taxonomy [`Kraken2`](https://ccb.jhu.edu/software/kraken2/) and\or [`Kaiju`](https://kaiju.binf.ku.dk/)
- Resolve potential inconsistencies in taxonomy & taxon filtering | simplification `bin/extract_precluster.py`
7. Cluster contigs (or every taxonomic bin) of samples, options are:
8. Cluster contigs (or every taxonomic bin) of samples, options are:
- [`cdhitest`](https://sites.google.com/view/cd-hit)
- [`vsearch`](https://github.com/torognes/vsearch/wiki/Clustering)
- [`mmseqs-linclust`](https://github.com/soedinglab/MMseqs2/wiki#linear-time-clustering-using-mmseqs-linclust)
- [`mmseqs-cluster`](https://github.com/soedinglab/MMseqs2/wiki#cascaded-clustering)
- [`vRhyme`](https://github.com/AnantharamanLab/vRhyme)
- [`Mash`](https://github.com/marbl/Mash)
8. Scaffolding of contigs to centroid ([`Minimap2`](https://github.com/lh3/minimap2), [`iVar-consensus`](https://andersen-lab.github.io/ivar/html/manualpage.html))
9. [Optional] Annotate 0-depth regions with external reference `bin/lowcov_to_reference.py`.
10. [Optional] Select best reference from `--mapping_constrains`:
9. Scaffolding of contigs to centroid ([`Minimap2`](https://github.com/lh3/minimap2), [`iVar-consensus`](https://andersen-lab.github.io/ivar/html/manualpage.html))
10. [Optional] Annotate 0-depth regions with external reference `bin/lowcov_to_reference.py`.
11. [Optional] Select best reference from `--mapping_constrains`:
- [`Mash sketch`](https://github.com/marbl/Mash)
- [`Mash screen`](https://github.com/marbl/Mash)
11. Mapping filtered reads to supercontig and mapping constrains([`BowTie2`](http://bowtie-bio.sourceforge.net/bowtie2/),[`BWAmem2`](https://github.com/bwa-mem2/bwa-mem2) and [`BWA`](https://github.com/lh3/bwa))
12. [Optional] Deduplicate reads ([`Picard`](https://broadinstitute.github.io/picard/) or if UMI's are used [`UMI-tools`](https://umi-tools.readthedocs.io/en/latest/QUICK_START.html))
13. Variant calling and filtering ([`BCFTools`](http://samtools.github.io/bcftools/bcftools.html),[`iVar`](https://andersen-lab.github.io/ivar/html/manualpage.html))
14. Create consensus genome ([`BCFTools`](http://samtools.github.io/bcftools/bcftools.html),[`iVar`](https://andersen-lab.github.io/ivar/html/manualpage.html))
15. Repeat step 11-14 multiple times for the denovo contig route
16. Consensus evaluation and annotation ([`QUAST`](http://quast.sourceforge.net/quast),[`CheckV`](https://bitbucket.org/berkeleylab/checkv/src/master/),[`blastn`](https://blast.ncbi.nlm.nih.gov/Blast.cgi), [`mmseqs-search`](https://github.com/soedinglab/MMseqs2/wiki#batch-sequence-searching-using-mmseqs-search))
17. Result summary visualisation for raw read, alignment, assembly, variant calling and consensus calling results ([`MultiQC`](http://multiqc.info/))
12. Mapping filtered reads to supercontig and mapping constrains([`BowTie2`](http://bowtie-bio.sourceforge.net/bowtie2/),[`BWAmem2`](https://github.com/bwa-mem2/bwa-mem2) and [`BWA`](https://github.com/lh3/bwa))
13. [Optional] Deduplicate reads ([`Picard`](https://broadinstitute.github.io/picard/) or if UMI's are used [`UMI-tools`](https://umi-tools.readthedocs.io/en/latest/QUICK_START.html))
14. Variant calling and filtering ([`BCFTools`](http://samtools.github.io/bcftools/bcftools.html),[`iVar`](https://andersen-lab.github.io/ivar/html/manualpage.html))
15. Create consensus genome ([`BCFTools`](http://samtools.github.io/bcftools/bcftools.html),[`iVar`](https://andersen-lab.github.io/ivar/html/manualpage.html))
16. Repeat step 12-15 multiple times for the denovo contig route
17. Consensus evaluation and annotation ([`QUAST`](http://quast.sourceforge.net/quast),[`CheckV`](https://bitbucket.org/berkeleylab/checkv/src/master/),[`blastn`](https://blast.ncbi.nlm.nih.gov/Blast.cgi), [`mmseqs-search`](https://github.com/soedinglab/MMseqs2/wiki#batch-sequence-searching-using-mmseqs-search))
18. Result summary visualisation for raw read, alignment, assembly, variant calling and consensus calling results ([`MultiQC`](http://multiqc.info/))


## Usage
@@ -86,7 +87,7 @@ nextflow run Joon-Klaps/viralgenie \
Please provide pipeline parameters via the CLI or Nextflow `-params-file` option. Custom config files including those provided by the `-c` Nextflow option can be used to provide any configuration _**except for parameters**_;
see [docs](https://nf-co.re/usage/configuration#custom-configuration-files).

For more details and further functionality, please refer to the [usage documentation](https://github.io/Joon-klaps/viralgenie/usage) and the [parameter documentation](https://github.io/Joon-klaps/viralgenie/parameters).
For more details and further functionality, please refer to the [usage documentation](https://joon-klaps.github.io/viralgenie/latest/usage) and the [parameter documentation](https://joon-klaps.github.io/viralgenie/latest/parameters).

## Credits

@@ -101,7 +102,7 @@ We thank the following people for their extensive assistance in the development

## Contributions and Support

If you would like to contribute to this pipeline, please see the [contributing guidelines](https://github.io/Joon-klaps/viralgenie/CONTRIBUTING).
If you would like to contribute to this pipeline, please see the [contributing guidelines](https://joon-klaps.github.io/viralgenie/latest/CONTRIBUTING).

<!--
For further information or help, don't hesitate to get in touch on the [Slack `#viralgenie` channel](https://nfcore.slack.com/channels/viralgenie) (you can join with [this invite](https://nf-co.re/join/slack)).
@@ -111,7 +112,7 @@ For further information or help, don't hesitate to get in touch on the [Slack `#

<!-- TODO: If you use Joon-Klaps/viralgenie for your analysis, please cite it using the following doi: [10.5281/zenodo.XXXXXX](https://doi.org/10.5281/zenodo.XXXXXX) -->

An extensive list of references for the tools used by the pipeline can be found in the [`CITATIONS.md`](https://github.io/Joon-klaps/viralgenie/CITATIONS) file.
An extensive list of references for the tools used by the pipeline can be found in the [`CITATIONS.md`](https://joon-klaps.github.io/viralgenie/latest/CITATIONS) file.

You can cite the `nf-core` publication as follows:

2 changes: 1 addition & 1 deletion assets/multiqc_config.yml
@@ -1,7 +1,7 @@
report_comment: >
This report has been generated by the <a href="https://github.com/Joon-Klaps/viralgenie/dev/" target="_blank">Joon-Klaps/viralgenie</a>
analysis pipeline. For information about how to interpret these results, please see the
<a href="https://joon-klaps.github.io/viralgenie/dev/usage/" target="_blank">documentation</a>.
<a href="https://joon-klaps.github.io/viralgenie/latest/dev/usage/" target="_blank">documentation</a>.
export_plots: true

2 changes: 1 addition & 1 deletion bin/extract_clust.py
@@ -386,7 +386,7 @@ def parse_args(argv=None):
metavar="PATTERN",
type=str,
help="Regex pattern to filter clusters by centroid sequence name.",
default="^(TRINITY)|(NODE)|(k\d+)", # Default pattern matches Trinity, SPADes and MEGAHIT assembly names
default=r"^(TRINITY)|(NODE)|(k\d+)|(scaffold\d+)", # Default pattern matches Trinity, SPAdes, MEGAHIT and sspace_basic assembly names (raw string avoids the invalid "\d" escape warning)
)
parser.add_argument(
"-l",
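One detail of the default pattern in this hunk may be worth illustrating: in a regular expression, `^` binds only to the first alternative, so `(NODE)`, `(k\d+)` and `(scaffold\d+)` can match anywhere in a name when the pattern is applied with `re.search`. The sketch below assumes `re.search` semantics (how `extract_clust.py` actually applies the pattern is not shown in this diff), uses made-up sequence names, and introduces a hypothetical `ANCHORED` variant purely for comparison.

```python
import re

# Default pattern from the diff, written as a raw string.
CURRENT = r"^(TRINITY)|(NODE)|(k\d+)|(scaffold\d+)"
# Hypothetical stricter variant (not part of the commit): "^" applies to every alternative.
ANCHORED = r"^(?:TRINITY|NODE|k\d+|scaffold\d+)"

# Illustrative names only; real assembler output may look different.
names = [
    "TRINITY_DN0_c0_g1_i1",
    "NODE_1_length_2000_cov_10.5",
    "k141_12",
    "scaffold3|size1500",
    "MN908947.3",            # reference accession, matches neither pattern
    "ref_with_NODE_inside",  # matches CURRENT because "(NODE)" is not anchored
]


def matching(pattern, candidates):
    """Return the candidates in which the pattern is found (re.search semantics)."""
    compiled = re.compile(pattern)
    return [name for name in candidates if compiled.search(name)]


current_hits = matching(CURRENT, names)
anchored_hits = matching(ANCHORED, names)

print("current :", current_hits)
print("anchored:", anchored_hits)
```

If the script uses `re.match` instead, both patterns behave the same, because `re.match` only tries to match at the start of the string.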
22 changes: 22 additions & 0 deletions conf/modules.config
@@ -434,6 +434,28 @@ process {
]
}

withName: SSPACE_BASIC {
ext.args = [
"-x 1",
"-o 15",
"-r 0.75",
].join(' ').trim()
publishDir = [
[
path: { "${params.outdir}/assembly/assemblers/sspace_basic/scaffolds" },
mode: params.publish_dir_mode,
pattern: '*.fasta',
saveAs: { filename -> params.prefix || params.global_prefix ? "${params.global_prefix}-$filename" : filename }
],
[
path: { "${params.outdir}/assembly/assemblers/sspace_basic/logs" },
mode: params.publish_dir_mode,
pattern: '*.txt',
saveAs: { filename -> params.prefix || params.global_prefix ? "${params.global_prefix}-$filename" : filename }
],
]
}

if (!params.skip_polishing){

withName: BLAST_BLASTN{
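To make the effect of the new `SSPACE_BASIC` block easier to follow, here is a minimal Python sketch of what it configures: the argument string produced by `ext.args` and the directory each published file lands in. It is an approximation under stated assumptions, not the pipeline's code: `publish_target` is a hypothetical helper, `fnmatch` stands in for Nextflow's glob matching, the example filenames are invented, and the `saveAs` condition is simplified to depend on `global_prefix` only. `-x 1` appears to switch on contig extension, in line with the README's description of the step; the exact meaning of `-o` and `-r` is best checked in SSPACE Basic's own usage text.

```python
from fnmatch import fnmatch

# Mirrors ext.args = [...].join(' ').trim() from the SSPACE_BASIC block.
ext_args = " ".join(["-x 1", "-o 15", "-r 0.75"]).strip()

# (glob pattern, sub-directory) pairs copied from the two publishDir entries.
PUBLISH_RULES = [
    ("*.fasta", "assembly/assemblers/sspace_basic/scaffolds"),
    ("*.txt", "assembly/assemblers/sspace_basic/logs"),
]


def publish_target(filename, outdir="results", global_prefix=None):
    """Return the path a file would be published to, or None if no rule matches.

    Simplification: the prefix is added whenever global_prefix is set, whereas
    the config also checks params.prefix.
    """
    for pattern, subdir in PUBLISH_RULES:
        if fnmatch(filename, pattern):
            name = f"{global_prefix}-{filename}" if global_prefix else filename
            return f"{outdir}/{subdir}/{name}"
    return None


print(ext_args)
print(publish_target("sampleA.scaffolds.fasta"))
print(publish_target("sampleA.summary.txt", global_prefix="run1"))
```

Note that the log files are published under `logs/`, which is the directory name to expect in the output folder.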
1 change: 1 addition & 0 deletions conf/tests/test.config
@@ -27,6 +27,7 @@ params {
skip_complexity_filtering = false
trim_tool = 'fastp'
assemblers = 'spades,megahit'
skip_sspace_basic = false
host_k2_db = 'https://github.com/nf-core/test-datasets/raw/viralrecon/genome/kraken2/kraken2_hs22.tar.gz'

skip_read_classification = true
9 changes: 5 additions & 4 deletions conf/tests/test_fail_db.config
@@ -25,17 +25,18 @@ params {
metadata = "${projectDir}/assets/samplesheets/metadata_test.tsv"

skip_complexity_filtering = false
trim_tool ='fastp'
trim_tool = 'fastp'
skip_hostremoval = true
assemblers ='spades'
assemblers = 'spades'
skip_sspace_basic = false
skip_iterative_refinement = true

skip_read_classification = true
skip_read_classification = true
kaiju_db = "https://kaiju-idx.s3.eu-central-1.amazonaws.com/2023/kaiju_db_viruses_2023-05-26.tgz"
reference_pool = "https://raw.githubusercontent.com/Joon-Klaps/nextclade_data/old_datasets/data/nextstrain/ebola/zaire/sequences.fasta"

skip_variant_calling = true
intermediate_mapping_stats = true
intermediate_mapping_stats = true
skip_checkv = true
}

5 changes: 3 additions & 2 deletions conf/tests/test_full.config
@@ -27,11 +27,12 @@ params {


skip_complexity_filtering = false
trim_tool ='fastp'
trim_tool = 'fastp'
host_k2_db = 'https://github.com/nf-core/test-datasets/raw/viralrecon/genome/kraken2/kraken2_hs22.tar.gz'
skip_sspace_basic = false


skip_read_classification = false
skip_read_classification = false
save_databases = true
kaiju_db = "https://kaiju-idx.s3.eu-central-1.amazonaws.com/2023/kaiju_db_viruses_2023-05-26.tgz"
reference_pool = "https://github.com/Joon-Klaps/nextclade_data/raw/old_datasets/data/nextstrain/sars-cov-2/MN908947/sequences.fasta"
1 change: 1 addition & 0 deletions conf/tests/test_umi.config
@@ -36,6 +36,7 @@ params {

skip_read_classification = true
save_databases = true
skip_sspace_basic = false
kaiju_db = "https://kaiju-idx.s3.eu-central-1.amazonaws.com/2023/kaiju_db_viruses_2023-05-26.tgz"

reference_pool = "https://github.com/Joon-Klaps/nextclade_data/raw/old_datasets/data/nextstrain/sars-cov-2/MN908947/sequences.fasta"
10 changes: 10 additions & 0 deletions docs/output.md
@@ -178,6 +178,16 @@ Finally, the results of the assemblers are combined and stored in the `tools_com
- `assemblers`
- `tools_combined/<sample-id>.combined.fa` : Contigs generated by combining the results of the assemblers.

### SSPACE Basic

[SSPACE Basic](https://github.com/nsoranzo/sspace_basic) is a tool for scaffolding contigs using paired-end reads. It is derived from the SSAKE assembler and can also extend contigs using reads that could not be mapped during the contig assembly step.

???- abstract "Output files"

- `sspace_basic/`
- `scaffolds/<sample-id>.scaffolds.fasta`: Scaffolds generated by SSPACE Basic.
- `logs/<sample-id>.*.txt`: Various txt files containing log and summary information on the SSPACE Basic run.

### BLAST

[BLAST](https://blast.ncbi.nlm.nih.gov/Blast.cgi) is a sequence comparison tool that can be used to compare a query sequence against a database of sequences. In viralgenie, BLAST is used to compare the contigs generated by the assemblers to a database of viral sequences.
1 change: 1 addition & 0 deletions docs/workflow/assembly_polishing.md
@@ -23,6 +23,7 @@ Three assemblers are used, [SPAdes](http://cab.spbu.ru/software/spades/), [Megah
> Specify the assemblers to use with the `--assemblers` parameter where the assemblers are separated with a ','. The default is `spades,megahit,trinity`.
Contigs can be extended using [SSPACE Basic](https://github.com/nsoranzo/sspace_basic) by setting the `--skip_sspace_basic false` parameter. SSPACE is a tool for scaffolding contigs using paired-end reads. It is derived from the SSAKE assembler and can also extend contigs using reads that could not be mapped during the contig assembly step.

## Reference Matching
The newly assembled contigs are compared to a reference sequence pool (--reference_pool) using a [BLASTn search](https://www.ncbi.nlm.nih.gov/books/NBK153387/). This process not only helps annotate the contigs but also assists in linking together sets of contigs that are distant within a single genome. Essentially, it aids in identifying contigs belonging to the same genomic segment and choosing the right reference for scaffolding purposes.
3 changes: 2 additions & 1 deletion docs/workflow/preprocessing.md
@@ -3,7 +3,8 @@
Viralgenie offers the following main steps for preprocessing raw sequencing reads:

- [Read quality control](#read-quality-control): read quality assessment and filtering.
- [Read processing](#read-processing): adapter clipping and pair-merging.
- [Adapter trimming](#adapter-trimming): adapter clipping and pair-merging.
- [UMI deduplication](#umi-deduplication): removal of PCR duplicates based on Unique Molecular Identifiers (UMIs) on a read level.
- [Complexity filtering](#complexity-filtering): removal of low-sequence complexity reads.
- [Host read-removal](#host-read-removal): removal of reads aligning to reference genome(s) of a host.

2 changes: 1 addition & 1 deletion mkdocs.yml
@@ -1,7 +1,7 @@
site_name: Viralgenie
repo_name: Joon-Klaps/viralgenie
repo_url: https://github.com/Joon-Klaps/viralgenie
site_url: https://joon-klaps.github.io/viralgenie/
site_url: https://joon-klaps.github.io/viralgenie/latest/

nav:
- Home:
71 changes: 0 additions & 71 deletions modules/local/calib/main.nf

This file was deleted.

2 changes: 1 addition & 1 deletion modules/local/select_reference/environment.yml
@@ -1,4 +1,4 @@
name: blast_filter
name: select_reference
channels:
- conda-forge
- bioconda
@@ -1,7 +1,7 @@
name: calib
name: sspace_basic
channels:
- conda-forge
- bioconda
- defaults
dependencies:
- bioconda::calib=0.3.4
- bioconda::sspace_basic=2.1.1