Merge branch 'dev' into 124-fix-oom-for-long-contigs

Signed-off-by: Joon Klaps <[email protected]>
Joon-Klaps · Jun 5, 2024 · 32e0892
2 parents d07ff47 + 84e37f2

Showing 22 changed files with 207 additions and 115 deletions.
diff --git a/CHANGELOG.md b/CHANGELOG.md
@@ -9,6 +9,8 @@ Initial release of Joon-Klaps/viralgenie, created with the [nf-core](https://nf-
 
 ### `Enhancement`
 
+- 119 include sspace cobra for contig extension ([#123](https://github.com/Joon-Klaps/viralgenie/pull/123))
+
 ### `Fixed`
 
 - OOM with longer contigs for lowcov_to_reference, uses more RAM now ([#125](https://github.com/Joon-Klaps/viralgenie/pull/125))

diff --git a/CITATIONS.md b/CITATIONS.md
@@ -115,6 +115,10 @@
 
     > Bankevich, Anton et al. “SPAdes: a new genome assembly algorithm and its applications to single-cell sequencing.” Journal of computational biology : a journal of computational molecular cell biology vol. 19,5 (2012): 455-77. doi:10.1089/cmb.2012.0021
 
+- [SSPACE Basic](https://pubmed.ncbi.nlm.nih.gov/21149342/)
+
+    > Boetzer, Marten et al. “Scaffolding pre-assembled contigs using SSPACE.” Bioinformatics (Oxford, England) vol. 27,4 (2011): 578-9. doi:10.1093/bioinformatics/btq683
+
 - [Trimmomatic](https://pubmed.ncbi.nlm.nih.gov/24695404/)
 
     > Bolger, Anthony M et al. “Trimmomatic: a flexible trimmer for Illumina sequence data.” Bioinformatics (Oxford, England) vol. 30,15 (2014): 2114-20. doi:10.1093/bioinformatics/btu170

diff --git a/README.md b/README.md
@@ -18,7 +18,7 @@
 <!-- [![Get help on Slack](http://img.shields.io/badge/slack-nf--core%20%23viralgenie-4A154B?labelColor=000000&logo=slack)](https://nfcore.slack.com/channels/viralgenie)-->
 
 > [!TIP]
-> Make sure to checkout the [viralgenie website](https://joon-klaps.github.io/viralgenie/) for more elaborate documentation!
+> Make sure to checkout the [viralgenie website](https://joon-klaps.github.io/viralgenie/latest/) for more elaborate documentation!
 
 ## Introduction
 
@@ -41,31 +41,32 @@
         - [`Kaiju`](https://kaiju.binf.ku.dk/)
     - Plotting Kraken2 and Kaiju ([`Krona`](https://hpc.nih.gov/apps/kronatools.html))
 4. Denovo assembly ([`SPAdes`](http://cab.spbu.ru/software/spades/), [`TRINITY`](https://github.com/trinityrnaseq/trinityrnaseq), [`megahit`](https://github.com/voutcn/megahit)), combine contigs.
-5. Contig reference identification ([`blastn`](https://blast.ncbi.nlm.nih.gov/Blast.cgi?PAGE_TYPE=BlastSearch))
+5. [Optional] Extend the contigs with [sspace_basic](https://github.com/nsoranzo/sspace_basic)
+6. Contig reference identification ([`blastn`](https://blast.ncbi.nlm.nih.gov/Blast.cgi?PAGE_TYPE=BlastSearch))
     -   Identify top 5 blast hits
     -   Merge blast hit and all contigs of a sample
-6. [Optional] Precluster contigs based on taxonomy
+7. [Optional] Precluster contigs based on taxonomy
     - Identify taxonomy [`Kraken2`](https://ccb.jhu.edu/software/kraken2/) and\or [`Kaiju`](https://kaiju.binf.ku.dk/)
     - Resolve potential inconsistencies in taxonomy & taxon filtering | simplification `bin/extract_precluster.py`
-7. Cluster contigs (or every taxonomic bin) of samples, options are:
+8. Cluster contigs (or every taxonomic bin) of samples, options are:
     - [`cdhitest`](https://sites.google.com/view/cd-hit)
     - [`vsearch`](https://github.com/torognes/vsearch/wiki/Clustering)
     - [`mmseqs-linclust`](https://github.com/soedinglab/MMseqs2/wiki#linear-time-clustering-using-mmseqs-linclust)
     - [`mmseqs-cluster`](https://github.com/soedinglab/MMseqs2/wiki#cascaded-clustering)
     - [`vRhyme`](https://github.com/AnantharamanLab/vRhyme)
     - [`Mash`](https://github.com/marbl/Mash)
-8. Scaffolding of contigs to centroid ([`Minimap2`](https://github.com/lh3/minimap2), [`iVar-consensus`](https://andersen-lab.github.io/ivar/html/manualpage.html))
-9. [Optional] Annotate 0-depth regions with external reference `bin/lowcov_to_reference.py`.
-10. [Optional] Select best reference from `--mapping_constrains`:
+9. Scaffolding of contigs to centroid ([`Minimap2`](https://github.com/lh3/minimap2), [`iVar-consensus`](https://andersen-lab.github.io/ivar/html/manualpage.html))
+10. [Optional] Annotate 0-depth regions with external reference `bin/lowcov_to_reference.py`.
+11. [Optional] Select best reference from `--mapping_constrains`:
     - [`Mash sketch`](https://github.com/marbl/Mash)
     - [`Mash screen`](https://github.com/marbl/Mash)
-11. Mapping filtered reads to supercontig and mapping constrains([`BowTie2`](http://bowtie-bio.sourceforge.net/bowtie2/),[`BWAmem2`](https://github.com/bwa-mem2/bwa-mem2) and [`BWA`](https://github.com/lh3/bwa))
-12. [Optional] Deduplicate reads ([`Picard`](https://broadinstitute.github.io/picard/) or if UMI's are used [`UMI-tools`](https://umi-tools.readthedocs.io/en/latest/QUICK_START.html))
-13. Variant calling and filtering ([`BCFTools`](http://samtools.github.io/bcftools/bcftools.html),[`iVar`](https://andersen-lab.github.io/ivar/html/manualpage.html))
-14. Create consensus genome ([`BCFTools`](http://samtools.github.io/bcftools/bcftools.html),[`iVar`](https://andersen-lab.github.io/ivar/html/manualpage.html))
-15. Repeat step 11-14 multiple times for the denovo contig route
-16. Consensus evaluation and annotation ([`QUAST`](http://quast.sourceforge.net/quast),[`CheckV`](https://bitbucket.org/berkeleylab/checkv/src/master/),[`blastn`](https://blast.ncbi.nlm.nih.gov/Blast.cgi), [`mmseqs-search`](https://github.com/soedinglab/MMseqs2/wiki#batch-sequence-searching-using-mmseqs-search))
-17. Result summary visualisation for raw read, alignment, assembly, variant calling and consensus calling results ([`MultiQC`](http://multiqc.info/))
+12. Mapping filtered reads to supercontig and mapping constrains([`BowTie2`](http://bowtie-bio.sourceforge.net/bowtie2/),[`BWAmem2`](https://github.com/bwa-mem2/bwa-mem2) and [`BWA`](https://github.com/lh3/bwa))
+13. [Optional] Deduplicate reads ([`Picard`](https://broadinstitute.github.io/picard/) or if UMI's are used [`UMI-tools`](https://umi-tools.readthedocs.io/en/latest/QUICK_START.html))
+14. Variant calling and filtering ([`BCFTools`](http://samtools.github.io/bcftools/bcftools.html),[`iVar`](https://andersen-lab.github.io/ivar/html/manualpage.html))
+15. Create consensus genome ([`BCFTools`](http://samtools.github.io/bcftools/bcftools.html),[`iVar`](https://andersen-lab.github.io/ivar/html/manualpage.html))
+16. Repeat step 12-15 multiple times for the denovo contig route
+17. Consensus evaluation and annotation ([`QUAST`](http://quast.sourceforge.net/quast),[`CheckV`](https://bitbucket.org/berkeleylab/checkv/src/master/),[`blastn`](https://blast.ncbi.nlm.nih.gov/Blast.cgi), [`mmseqs-search`](https://github.com/soedinglab/MMseqs2/wiki#batch-sequence-searching-using-mmseqs-search))
+18. Result summary visualisation for raw read, alignment, assembly, variant calling and consensus calling results ([`MultiQC`](http://multiqc.info/))
 
 
 ## Usage
@@ -86,7 +87,7 @@ nextflow run Joon-Klaps/viralgenie \
      Please provide pipeline parameters via the CLI or Nextflow `-params-file` option. Custom config files including those provided by the `-c` Nextflow option can be used to provide any configuration _**except for parameters**_;
      see [docs](https://nf-co.re/usage/configuration#custom-configuration-files).
 
-For more details and further functionality, please refer to the [usage documentation](https://github.io/Joon-klaps/viralgenie/usage) and the [parameter documentation](https://github.io/Joon-klaps/viralgenie/parameters).
+For more details and further functionality, please refer to the [usage documentation](https://joon-klaps.github.io/viralgenie/latest/usage) and the [parameter documentation](https://joon-klaps.github.io/viralgenie/latest/parameters).
 
 ## Credits
 
@@ -101,7 +102,7 @@ We thank the following people for their extensive assistance in the development
 
 ## Contributions and Support
 
-If you would like to contribute to this pipeline, please see the [contributing guidelines](https://github.io/Joon-klaps/viralgenie/CONTRIBUTING).
+If you would like to contribute to this pipeline, please see the [contributing guidelines](https://joon-klaps.github.io/viralgenie/latest/CONTRIBUTING).
 
 <!--
 For further information or help, don't hesitate to get in touch on the [Slack `#viralgenie` channel](https://nfcore.slack.com/channels/viralgenie) (you can join with [this invite](https://nf-co.re/join/slack)).
@@ -111,7 +112,7 @@ For further information or help, don't hesitate to get in touch on the [Slack `#
 
 <!-- TODO: If you use  Joon-Klaps/viralgenie for your analysis, please cite it using the following doi: [10.5281/zenodo.XXXXXX](https://doi.org/10.5281/zenodo.XXXXXX) -->
 
-An extensive list of references for the tools used by the pipeline can be found in the [`CITATIONS.md`](https://github.io/Joon-klaps/viralgenie/CITATIONS) file.
+An extensive list of references for the tools used by the pipeline can be found in the [`CITATIONS.md`](https://joon-klaps.github.io/viralgenie/latest/CITATIONS) file.
 
 You can cite the `nf-core` publication as follows:
 

diff --git a/assets/multiqc_config.yml b/assets/multiqc_config.yml
@@ -1,7 +1,7 @@
 report_comment: >
   This report has been generated by the <a href="https://github.com/Joon-Klaps/viralgenie/dev/" target="_blank">Joon-Klaps/viralgenie</a>
   analysis pipeline. For information about how to interpret these results, please see the
-  <a href="https://joon-klaps.github.io/viralgenie/dev/usage/" target="_blank">documentation</a>.
+  <a href="https://joon-klaps.github.io/viralgenie/latest/dev/usage/" target="_blank">documentation</a>.
 
 export_plots: true
 

diff --git a/bin/extract_clust.py b/bin/extract_clust.py
@@ -386,7 +386,7 @@ def parse_args(argv=None):
         metavar="PATTERN",
         type=str,
         help="Regex pattern to filter clusters by centroid sequence name.",
-        default="^(TRINITY)|(NODE)|(k\d+)",  # Default pattern matches Trinity, SPADes and MEGAHIT assembly names
+        default="^(TRINITY)|(NODE)|(k\d+)|(scaffold\d+)",  # Default pattern matches Trinity, SPAdes, MEGAHIT and sspace_basic assembly names
     )
     parser.add_argument(
         "-l",

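The default pattern in this hunk can be tried out in isolation. The following is a minimal sketch, not the code in `bin/extract_clust.py`: the helper `is_assembler_name`, the sample contig names, and the stricter `ANCHORED_PATTERN` are all hypothetical, and whether the script applies the pattern with `re.search` or `re.match` is an assumption. The sketch uses `re.search`, where the leading `^` applies only to the first alternative.

```python
import re

# Default pattern as it appears in the diff, written here as a raw string
# so that `\d` is not treated as a string escape.
DEFAULT_PATTERN = r"^(TRINITY)|(NODE)|(k\d+)|(scaffold\d+)"

# Hypothetical stricter variant: the anchor applies to every alternative.
ANCHORED_PATTERN = r"^(?:TRINITY|NODE|k\d+|scaffold\d+)"


def is_assembler_name(name, pattern=DEFAULT_PATTERN):
    """Return True if the sequence name looks like an assembler-generated contig."""
    return re.search(pattern, name) is not None


# Illustrative names only; real headers depend on the assembler version.
names = [
    "TRINITY_DN0_c0_g1_i1",
    "NODE_1_length_5000_cov_12.5",
    "k141_42",
    "scaffold1|size2345",
    "MN908947.3",
    "ref_NODE_like",
]

for name in names:
    print(name, is_assembler_name(name), is_assembler_name(name, ANCHORED_PATTERN))
```

Under `re.search`, a name such as `ref_NODE_like` matches the default pattern because only `TRINITY` is anchored to the start. If that is not intended, a grouped pattern like the hypothetical `ANCHORED_PATTERN` avoids it.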
diff --git a/conf/modules.config b/conf/modules.config
@@ -434,6 +434,28 @@ process {
             ]
         }
 
+        withName: SSPACE_BASIC {
+            ext.args = [
+                "-x 1",
+                "-o 15",
+                "-r 0.75",
+            ].join(' ').trim()
+            publishDir = [
+                [
+                    path: { "${params.outdir}/assembly/assemblers/sspace_basic/scaffolds" },
+                    mode: params.publish_dir_mode,
+                    pattern: '*.fasta',
+                    saveAs: { filename -> params.prefix || params.global_prefix  ? "${params.global_prefix}-$filename" : filename }
+                ],
+                [
+                    path: { "${params.outdir}/assembly/assemblers/sspace_basic/logs" },
+                    mode: params.publish_dir_mode,
+                    pattern: '*.txt',
+                    saveAs: { filename -> params.prefix || params.global_prefix  ? "${params.global_prefix}-$filename" : filename }
+                ],
+            ]
+        }
+
         if (!params.skip_polishing){
 
             withName: BLAST_BLASTN{

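The `SSPACE_BASIC` block above does two things: it joins the option list into one argument string, and it optionally prefixes published file names. A rough Python sketch of that logic follows, assuming Groovy truthiness (null and empty strings are false) and that `||` binds tighter than `?:`. The function names `build_args` and `publish_name` are hypothetical and do not exist in the pipeline.

```python
def build_args(options):
    """Mimic `[...].join(' ').trim()`: join the options with single spaces."""
    return " ".join(options).strip()


def publish_name(filename, prefix=None, global_prefix=None):
    """Mimic the `saveAs` closure: prefix the file name when either
    `prefix` or `global_prefix` is set, otherwise keep it unchanged."""
    if prefix or global_prefix:
        # As in the config, the prefix text always comes from global_prefix.
        return f"{global_prefix}-{filename}"
    return filename


print(build_args(["-x 1", "-o 15", "-r 0.75"]))
print(publish_name("sample1.scaffolds.fasta"))
print(publish_name("sample1.scaffolds.fasta", global_prefix="run1"))
```

Note that the condition checks two parameters but the resulting name uses only `global_prefix`, so the sketch presumes `global_prefix` is populated whenever `prefix` is set.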
diff --git a/conf/tests/test.config b/conf/tests/test.config
@@ -27,6 +27,7 @@ params {
     skip_complexity_filtering   = false
     trim_tool                   = 'fastp'
     assemblers                  = 'spades,megahit'
+    skip_sspace_basic           = false
     host_k2_db                  = 'https://github.com/nf-core/test-datasets/raw/viralrecon/genome/kraken2/kraken2_hs22.tar.gz'
 
     skip_read_classification    = true

diff --git a/conf/tests/test_fail_db.config b/conf/tests/test_fail_db.config
@@ -25,17 +25,18 @@ params {
     metadata  = "${projectDir}/assets/samplesheets/metadata_test.tsv"
 
     skip_complexity_filtering   = false
-    trim_tool                   ='fastp'
+    trim_tool                   = 'fastp'
     skip_hostremoval            = true
-    assemblers                  ='spades'
+    assemblers                  = 'spades'
+    skip_sspace_basic           = false
     skip_iterative_refinement   = true
 
-    skip_read_classification  = true
+    skip_read_classification    = true
     kaiju_db                    = "https://kaiju-idx.s3.eu-central-1.amazonaws.com/2023/kaiju_db_viruses_2023-05-26.tgz"
     reference_pool              = "https://raw.githubusercontent.com/Joon-Klaps/nextclade_data/old_datasets/data/nextstrain/ebola/zaire/sequences.fasta"
 
     skip_variant_calling        = true
-    intermediate_mapping_stats      = true
+    intermediate_mapping_stats  = true
     skip_checkv                 = true
 }
 

diff --git a/conf/tests/test_full.config b/conf/tests/test_full.config
@@ -27,11 +27,12 @@ params {
 
 
     skip_complexity_filtering   = false
-    trim_tool                   ='fastp'
+    trim_tool                   = 'fastp'
     host_k2_db                  = 'https://github.com/nf-core/test-datasets/raw/viralrecon/genome/kraken2/kraken2_hs22.tar.gz'
+    skip_sspace_basic           = false
 
 
-    skip_read_classification  = false
+    skip_read_classification    = false
     save_databases              = true
     kaiju_db                    = "https://kaiju-idx.s3.eu-central-1.amazonaws.com/2023/kaiju_db_viruses_2023-05-26.tgz"
     reference_pool              = "https://github.com/Joon-Klaps/nextclade_data/raw/old_datasets/data/nextstrain/sars-cov-2/MN908947/sequences.fasta"

diff --git a/conf/tests/test_umi.config b/conf/tests/test_umi.config
@@ -36,6 +36,7 @@ params {
 
     skip_read_classification    = true
     save_databases              = true
+    skip_sspace_basic           = false
     kaiju_db                    = "https://kaiju-idx.s3.eu-central-1.amazonaws.com/2023/kaiju_db_viruses_2023-05-26.tgz"
 
     reference_pool              = "https://github.com/Joon-Klaps/nextclade_data/raw/old_datasets/data/nextstrain/sars-cov-2/MN908947/sequences.fasta"

diff --git a/docs/output.md b/docs/output.md
@@ -178,6 +178,16 @@ Finally, the results of the assemblers are combined and stored in the `tools_com
     - `assemblers`
         - `tools_combined/<sample-id>.combined.fa` : Contigs generated by combining the results of the assemblers.
 
+### SSPACE Basic
+
+[SSPACE Basic](https://github.com/nsoranzo/sspace_basic) is a tool for scaffolding contigs using paired-end reads. It is derived from the SSAKE assembler and can extend contigs using reads that could not be mapped during the contig assembly step.
+
+???- abstract "Output files"
+
+    - `sspace_basic/`
+        - `scaffolds/<sample-id>.scaffolds.fasta`: Scaffolds generated by SSPACE Basic.
+        - `logs/<sample-id>.*.txt`: Various txt files containing log and summary information on the SSPACE Basic run.
+
 ### BLAST
 
 [BLAST](https://blast.ncbi.nlm.nih.gov/Blast.cgi) is a sequence comparison tool that can be used to compare a query sequence against a database of sequences. In viralgenie, BLAST is used to compare the contigs generated by the assemblers to a database of viral sequences.

diff --git a/docs/workflow/assembly_polishing.md b/docs/workflow/assembly_polishing.md
@@ -23,6 +23,7 @@ Three assemblers are used, [SPAdes](http://cab.spbu.ru/software/spades/), [Megah
 
 > Specify the assemblers to use with the `--assemblers` parameter where the assemblers are separated with a ','. The default is `spades,megahit,trinity`.
 
+Contigs can be extended with [SSPACE Basic](https://github.com/nsoranzo/sspace_basic) by setting `--skip_sspace_basic false`. SSPACE Basic scaffolds contigs using paired-end reads. It is derived from the SSAKE assembler and can extend contigs using reads that could not be mapped during the contig assembly step.
 
 ## Reference Matching
 The newly assembled contigs are compared to a reference sequence pool (--reference_pool) using a [BLASTn search](https://www.ncbi.nlm.nih.gov/books/NBK153387/). This process not only helps annotate the contigs but also assists in linking together sets of contigs that are distant within a single genome. Essentially, it aids in identifying contigs belonging to the same genomic segment and choosing the right reference for scaffolding purposes.

diff --git a/docs/workflow/preprocessing.md b/docs/workflow/preprocessing.md
@@ -3,7 +3,8 @@
 Viralgenie offers five main steps for the preprocessing of raw sequencing reads:
 
 - [Read quality control](#read-quality-control): read quality assessment and filtering.
-- [Read processing](#read-processing): adapter clipping and pair-merging.
+- [Adapter trimming](#adapter-trimming): adapter clipping and pair-merging.
+- [UMI deduplication](#umi-deduplication): removal of PCR duplicates based on Unique Molecular Identifiers (UMIs) on a read level.
 - [Complexity filtering](#complexity-filtering): removal of low-sequence complexity reads.
 - [Host read-removal](#host-read-removal): removal of reads aligning to reference genome(s) of a host.
 

diff --git a/mkdocs.yml b/mkdocs.yml
@@ -1,7 +1,7 @@
 site_name: Viralgenie
 repo_name: Joon-Klaps/viralgenie
 repo_url: https://github.com/Joon-Klaps/viralgenie
-site_url: https://joon-klaps.github.io/viralgenie/
+site_url: https://joon-klaps.github.io/viralgenie/latest/
 
 nav:
   - Home:

diff --git a/modules/local/calib/main.nf b/modules/local/calib/main.nf
diff --git a/modules/local/select_reference/environment.yml b/modules/local/select_reference/environment.yml
@@ -1,4 +1,4 @@
-name: blast_filter
+name: select_reference
 channels:
   - conda-forge
   - bioconda

diff --git a/modules/local/calib/environment.yml b/modules/local/sspace_basic/environment.yml
@@ -1,7 +1,7 @@
-name: calib
+name: sspace_basic
 channels:
   - conda-forge
   - bioconda
   - defaults
 dependencies:
-  - bioconda::calib=0.3.4
+  - bioconda::sspace_basic=2.1.1