diff --git a/README.md b/README.md index 86065adaf..a54512da3 100644 --- a/README.md +++ b/README.md @@ -1,438 +1,16 @@ -# GATK-SV +# gnomAD-SV v3 _post hoc_ filtering & QC -A structural variation discovery pipeline for Illumina short-read whole-genome sequencing (WGS) data. +This repo is a fork of [the official GATK-SV repo](https://github.com/broadinstitute/gatk-sv). -## Table of Contents -* [Requirements](#requirements) -* [Citation](#citation) -* [Quickstart](#quickstart) -* [Pipeline Overview](#overview) - * [Cohort mode](#cohort-mode) - * [Single-sample mode](#single-sample-mode) - * [gCNV model](#gcnv-training-overview) -* [Module Descriptions](#descriptions) - * [Module 00a](#module00a) - Raw callers and evidence collection - * [Module 00b](#module00b) - Batch QC - * [gCNV training](#gcnv-training) - gCNV model creation - * [Module 00c](#module00c) - Batch evidence merging, BAF generation, and depth callers - * [Module 01](#module01) - Site clustering - * [Module 02](#module02) - Site metrics - * [Module 03](#module03) - Filtering - * [Gather Cohort VCFs](#gather-vcfs) - Cross-batch site merging - * [Module 04](#module04) - Genotyping - * [Module 04b](#module04b) - Genotype refinement (optional) - * [Module 05/06](#module0506) - Cross-batch integration, complex event resolution, and VCF cleanup - * [Module 07](#module07) - Downstream Filtering - * [Module 08](#module08) - Annotation - * [Module 09](#module09) - QC and Visualization - * Additional modules - Mosaic and de novo -* [Troubleshooting](#troubleshooting) +This repo covers the development of filtering, quality control, and annotation of the gnomAD-SV v3 callset. +For more information on GATK-SV, please refer to [the documentation provided in the main GATK-SV repo](https://github.com/broadinstitute/gatk-sv) or to the supplementary methods from [the gnomAD-SV v2 paper](https://www.nature.com/articles/s41586-020-2287-8). -## Requirements +For more information about gnomAD, please visit [the official gnomAD website](https://gnomad.broadinstitute.org/about). +### Contact & Credits -### Deployment and execution: -* A [Google Cloud](https://cloud.google.com/) account. -* A workflow execution system supporting the [Workflow Description Language](https://openwdl.org/) (WDL), either: - * [Cromwell](https://github.com/broadinstitute/cromwell) (v36 or higher). A dedicated server is highly recommended. - * or [Terra](https://terra.bio/) (note preconfigured GATK-SV workflows are not yet available for this platform) -* Recommended: [MELT](https://melt.igs.umaryland.edu/). Due to licensing restrictions, we cannot provide a public docker image or reference panel VCFs for this algorithm. -* Recommended: [cromshell](https://github.com/broadinstitute/cromshell) for interacting with a dedicated Cromwell server. -* Recommended: [WOMtool](https://cromwell.readthedocs.io/en/stable/WOMtool/) for validating WDL/json files. +Copyright (c) 2021 Talkowski Lab and The Broad Institute of M.I.T. and Harvard +Contact: [Ryan Collins](mailto:rlcollins@g.harvard.edu) -### Data: -* Illumina short-read whole-genome CRAMs or BAMs, aligned to hg38 with [bwa-mem](https://github.com/lh3/bwa). BAMs must also be indexed. -* Indexed GVCFs produced by GATK HaplotypeCaller, or a jointly genotyped VCF. -* Family structure definitions file in [PED format](https://gatk.broadinstitute.org/hc/en-us/articles/360035531972-PED-Pedigree-format). Sex aneuploidies (detected in [Module 00b](#module00b)) should be entered as sex = 0. 
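For illustration only, a minimal PED sketch (hypothetical family and sample IDs, header row shown just for readability) following the standard six-column format, where `sample_03` carries a sex aneuploidy detected in Module 00b and is therefore assigned sex = 0:

```
#FAM_ID  SAMPLE_ID  FATHER_ID  MOTHER_ID  SEX  PHENOTYPE
fam1     sample_01  0          0          1    0
fam1     sample_02  0          0          2    0
fam2     sample_03  0          0          0    0
```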
- -#### Sample ID requirements: - -Sample IDs must: -* Be unique within the cohort -* Contain only alphanumeric characters and underscores (no dashes, whitespace, or special characters) - -Sample IDs should not: -* Contain only numeric characters -* Be a substring of another sample ID in the same cohort -* Contain any of the following substrings: `chr`, `name`, `DEL`, `DUP`, `CPX`, `CHROM` - -The same requirements apply to family IDs in the PED file, as well as batch IDs and the cohort ID provided as workflow inputs. - -Sample IDs are provided to [Module00a](#module00a) directly and need not match sample names from the BAM/CRAM headers or GVCFs. `GetSampleID.wdl` can be used to fetch BAM sample IDs and also generates a set of alternate IDs that are considered safe for this pipeline; alternatively, [this script](https://github.com/talkowski-lab/gnomad_sv_v3/blob/master/sample_id/convert_sample_ids.py) transforms a list of sample IDs to fit these requirements. Currently, sample IDs can be replaced again in [Module 00c](#module00c). - -The following inputs will need to be updated with the transformed sample IDs: -* Sample ID list for [Module00a](#module00a) or [Module 00c](#module00c) -* PED file -* SNP VCF header (if using instead of GVCFs in [Module 00c](#module00c)) - - -## Citation -Please cite the following publication: -[Collins, Brand, et al. 2020. "A structural variation reference for medical and population genetics." Nature 581, 444-451.](https://doi.org/10.1038/s41586-020-2287-8) - -Additional references: -[Werling et al. 2018. "An analytical framework for whole-genome sequence association studies and its implications for autism spectrum disorder." Nature genetics 50.5, 727-736.](http://dx.doi.org/10.1038/s41588-018-0107-y) - - -## Quickstart - -#### WDLs -There are two scripts for running the full pipeline: -* `wdl/GATKSVPipelineBatch.wdl`: Runs GATK-SV on a batch of samples. -* `wdl/GATKSVPipelineSingleSample.wdl`: Runs GATK-SV on a single sample, given a reference panel - -#### Inputs -Example workflow inputs can be found in `/inputs`. All required resources are available in public Google buckets. - -#### MELT -**Important**: The example input files contain MELT inputs that are NOT public (see [Requirements](#requirements)). These include: - -* `GATKSVPipelineSingleSample.melt_docker` and `GATKSVPipelineBatch.melt_docker` - MELT docker URI (see [Docker readme](https://github.com/talkowski-lab/gatk-sv-v1/blob/master/dockerfiles/README.md)) -* `GATKSVPipelineSingleSample.ref_std_melt_vcfs` - Standardized MELT VCFs ([Module00c](#module00c)) - -The input values are provided only as an example and are not publicly accessible. In order to include MELT, these values must be provided by the user. MELT can be disabled by deleting these inputs and setting `GATKSVPipelineBatch.use_melt` to `false`. - -#### Requester pays buckets -**Important**: The following parameters must be set when certain input data is in requester pays (RP) buckets: - -* `GATKSVPipelineSingleSample.requester_pays_cram` and `GATKSVPipelineBatch.Module00aBatch.requester_pays_crams` - set to `True` if inputs are CRAM format and in an RP bucket, otherwise `False`. -* `GATKSVPipelineBatch.GATKSVPipelinePhase1.gcs_project_for_requester_pays` - set to your Google Cloud Project ID if gVCFs are in an RP bucket, otherwise omit this parameter. 
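As a sketch only (the billing project ID below is a placeholder, not a real input value), the requester-pays settings above would be entered in the batch-mode inputs JSON along these lines; omit `gcs_project_for_requester_pays` entirely if the gVCFs are not in an RP bucket:

```
{
  "GATKSVPipelineBatch.Module00aBatch.requester_pays_crams": true,
  "GATKSVPipelineBatch.GATKSVPipelinePhase1.gcs_project_for_requester_pays": "my-gcp-billing-project"
}
```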
- -#### Execution -We recommend running the pipeline on a dedicated [Cromwell](https://github.com/broadinstitute/cromwell) server with a [cromshell](https://github.com/broadinstitute/cromshell) client. A batch run can be started with the following commands: - -``` -> mkdir gatksv_run && cd gatksv_run -> mkdir wdl && cd wdl -> cp $GATK_SV_V1_ROOT/wdl/*.wdl . -> zip dep.zip *.wdl -> cd .. -> cp $GATK_SV_V1_ROOT/inputs/GATKSVPipelineBatch.ref_panel_1kg.json GATKSVPipelineBatch.my_run.json -> cromshell submit wdl/GATKSVPipelineBatch.wdl GATKSVPipelineBatch.my_run.json cromwell_config.json wdl/dep.zip -``` - -where `cromwell_config.json` is a Cromwell [workflow options file](https://cromwell.readthedocs.io/en/stable/wf_options/Overview/). Note users will need to re-populate batch/sample-specific parameters (e.g. BAMs and sample IDs). - -## Pipeline Overview -The pipeline consists of a series of modules that perform the following: -* [Module 00a](#module00a): SV evidence collection, including calls from a configurable set of algorithms (Delly, Manta, MELT, and Wham), read depth (RD), split read positions (SR), and discordant pair positions (PE). -* [Module 00b](#module00b): Dosage bias scoring and ploidy estimation -* [Module 00c](#module00c): Copy number variant calling using cn.MOPS and GATK gCNV; B-allele frequency (BAF) generation; call and evidence aggregation -* [Module 01](#module01): Variant clustering -* [Module 02](#module02): Variant filtering metric generation -* [Module 03](#module03): Variant filtering; outlier exclusion -* [Module 04](#module04): Genotyping -* [Module 05/06](#module0506): Cross-batch integration; complex variant resolution and re-genotyping; vcf cleanup -* [Module 07](#module07): Downstream filtering, including minGQ, batch effect check, outlier samples removal and final recalibration; -* [Module 08](#module08): Annotations, including functional annotation, allele frequency (AF) annotation and AF annotation with external population callsets; -* [Module 09](#module09): Visualization, including scripts that generates IGV screenshots and rd plots. -* Additional modules to be added: de novo and mosaic scripts - - -Repository structure: -* `/inputs`: Example workflow parameter files for running gCNV training, GATK-SV batch mode, and GATK-SV single-sample mode -* `/dockerfiles`: Resources for building pipeline docker images (see [readme](https://github.com/talkowski-lab/gatk-sv-v1/blob/master/dockerfiles/README.md)) -* `/wdl`: WDLs running the pipeline. There is a master WDL for running each module, e.g. `Module01.wdl`. -* `/scripts`: scripts for running tests, building dockers, and analyzing cromwell metadata files -* `/src`: main pipeline scripts - * `/RdTest`: scripts for depth testing - * `/sv-pipeline`: various scripts and packages used throughout the pipeline - * `/svqc`: Python module for checking that pipeline metrics fall within acceptable limits - * `/svtest`: Python module for generating various summary metrics from module outputs - * `/svtk`: Python module of tools for SV-related datafile parsing and analysis - * `/WGD`: whole-genome dosage scoring scripts -* `/test`: WDL test parameter files. Please note that file inputs may not be publicly available. - - -## Cohort mode -A minimum cohort size of 100 with roughly equal number of males and females is recommended. For modest cohorts (~100-500 samples), the pipeline can be run as a single batch using `GATKSVPipelineBatch.wdl`. - -For larger cohorts, samples should be split up into batches of ~100-500 samples. 
We recommend batching based on overall coverage and dosage score (WGD), which can be generated in [Module 00b](#module00b). - -The pipeline should be executed as follows: -* Modules [00a](#module00a) and [00b](#module00b) can be run on arbitrary cohort partitions -* Modules [00c](#module00c), [01](#module01), [02](#module02), and [03](#module03) are run separately per batch -* [Module 04](#module04) is run separately per batch, using filtered variants ([Module 03](#module03) output) combined across all batches -* [Module 05/06](#module0506) and beyond are run on all batches together - -Note: [Module 00c](#module00c) requires a [trained gCNV model](#gcnv-training). - - -## Single-sample mode -`GATKSVPipelineSingleSample.wdl` runs the pipeline on a single sample using a fixed reference panel. An example reference panel containing 156 samples from the [NYGC 1000G Terra workspace](https://app.terra.bio/#workspaces/anvil-datastorage/1000G-high-coverage-2019) is provided with `inputs/GATKSVPipelineSingleSample.ref_panel_1kg.na12878.json`. - -Custom reference panels can be generated by running `GATKSVPipelineBatch.wdl` and `trainGCNV.wdl` and using the outputs to replace the following single-sample workflow inputs: - -* `GATKSVPipelineSingleSample.ref_ped_file` : `batch.ped` - Manually created (see [data requirements](#requirements)) -* `GATKSVPipelineSingleSample.contig_ploidy_model_tar` : `batch-contig-ploidy-model.tar.gz` - gCNV contig ploidy model ([gCNV training](#gcnv-training)) -* `GATKSVPipelineSingleSample.gcnv_model_tars` : `batch-model-files-*.tar.gz` - gCNV model tarballs ([gCNV training](#gcnv-training)) -* `GATKSVPipelineSingleSample.ref_pesr_disc_files` - `sample.disc.txt.gz` - Paired-end evidence files ([Module 00a](#module00a)) -* `GATKSVPipelineSingleSample.ref_pesr_split_files` - `sample.split.txt.gz` - Split read evidence files ([Module 00a](#module00a)) -* `GATKSVPipelineSingleSample.ref_panel_bincov_matrix`: `batch.RD.txt.gz` - Read counts matrix ([Module 00c](#module00c)) -* `GATKSVPipelineSingleSample.ref_panel_del_bed` : `batch.DEL.bed.gz` - Depth deletion calls ([Module 00c](#module00c)) -* `GATKSVPipelineSingleSample.ref_panel_dup_bed` : `batch.DUP.bed.gz` - Depth duplication calls ([Module 00c](#module00c)) -* `GATKSVPipelineSingleSample.ref_samples` - Reference panel sample IDs -* `GATKSVPipelineSingleSample.ref_std_manta_vcfs` - `std_XXX.manta.sample.vcf.gz` - Standardized Manta VCFs ([Module 00c](#module00c)) -* `GATKSVPipelineSingleSample.ref_std_melt_vcfs` - `std_XXX.melt.sample.vcf.gz` - Standardized Melt VCFs ([Module 00c](#module00c)) -* `GATKSVPipelineSingleSample.ref_std_wham_vcfs` - `std_XXX.wham.sample.vcf.gz` - Standardized Wham VCFs ([Module 00c](#module00c)) -* `GATKSVPipelineSingleSample.cutoffs` : `batch.cutoffs` - Filtering cutoffs ([Module 03](#module03)) -* `GATKSVPipelineSingleSample.genotype_pesr_pesr_sepcutoff` : `genotype_pesr.pesr_sepcutoff.txt` - Genotyping cutoffs ([Module 04](#module04)) -* `GATKSVPipelineSingleSample.genotype_pesr_depth_sepcutoff` : `genotype_pesr.depth_sepcutoff.txt` - Genotyping cutoffs ([Module 04](#module04)) -* `GATKSVPipelineSingleSample.genotype_depth_pesr_sepcutoff` : `genotype_depth.pesr_sepcutoff.txt` - Genotyping cutoffs ([Module 04](#module04)) -* `GATKSVPipelineSingleSample.genotype_depth_depth_sepcutoff` : `genotype_depth.depth_sepcutoff.txt` - Genotyping cutoffs ([Module 04](#module04)) -* `GATKSVPipelineSingleSample.PE_metrics` : `pe_metric_file.txt` - Paired-end evidence genotyping metrics ([Module 
04](#module04)) -* `GATKSVPipelineSingleSample.SR_metrics` : `sr_metric_file.txt` - Split read evidence genotyping metrics ([Module 04](#module04)) -* `GATKSVPipelineSingleSample.ref_panel_vcf` : `batch.cleaned.vcf.gz` - Final output VCF ([Module 05/06](#module0506)) - - -## gCNV Training -Both the cohort and single-sample modes use the GATK gCNV depth calling pipeline, which requires a [trained model](#gcnv-training) as input. The samples used for training should be technically homogeneous and similar to the samples to be processed (i.e. same sample type, library prep protocol, sequencer, sequencing center, etc.). The samples to be processed may comprise all or a subset of the training set. For small cohorts, a single gCNV model is usually sufficient. If a cohort contains multiple data sources, we recommend clustering them using the dosage score, and training a separate model for each cluster. - - -## Module Descriptions -The following sections briefly describe each module and highlights inter-dependent input/output files. Note that input/output mappings can also be gleaned from `GATKSVPipelineBatch.wdl`, and example input files for each module can be found in `/test`. - -## Module 00a -Runs raw evidence collection on each sample. - -Note: a list of sample IDs must be provided. Refer to the [sample ID requirements](#sampleids) for specifications of allowable sample IDs. IDs that do not meet these requirements may cause errors. - -#### Inputs: -* Per-sample BAM or CRAM files aligned to hg38. Index files (`.bai`) must be provided if using BAMs. - -#### Outputs: -* Caller VCFs (Delly, Manta, MELT, and/or Wham) -* Binned read counts file -* Split reads (SR) file -* Discordant read pairs (PE) file -* B-allele fraction (BAF) file - - -## Module 00b -Runs ploidy estimation, dosage scoring, and optionally VCF QC. The results from this module can be used for QC and batching. - -For large cohorts, we recommend dividing samples into smaller batches (~500 samples) with ~1:1 male:female ratio. - -We also recommend using sex assignments generated from the ploidy estimates and incorporating them into the PED file. - -#### Prerequisites: -* [Module 00a](#module00a) - -#### Inputs: -* Read count files ([Module 00a](#module00a)) -* (Optional) SV call VCFs ([Module 00a](#module00a)) - -#### Outputs: -* Per-sample dosage scores with plots -* Ploidy estimates, sex assignments, with plots -* (Optional) Outlier samples detected by call counts - - -## gCNV Training -Trains a gCNV model for use in [Module 00c](#module00c). The WDL can be found at `/gcnv/trainGCNV.wdl`. - -#### Prerequisites: -* [Module 00a](#module00a) -* (Recommended) [Module 00b](#module00b) - -#### Inputs: -* Read count files ([Module 00a](#module00a)) - -#### Outputs: -* Contig ploidy model tarball -* gCNV model tarballs - - -## Module 00c -Runs CNV callers (cnMOPs, GATK gCNV) and combines single-sample raw evidence into a batch. See [above]("#cohort-mode") for more information on batching. - -#### Prerequisites: -* [Module 00a](#module00a) -* (Recommended) [Module 00b](#module00b) -* gCNV training - -#### Inputs: -* PED file (updated with [Module 00b](#module00b) sex assignments, including sex = 0 for sex aneuploidies. Calls will not be made on sex chromosomes when sex = 0 in order to avoid generating many confusing calls or upsetting normalized copy numbers for the batch.) 
-* Per-sample GVCFs generated with HaplotypeCaller (`gvcfs` input), or a jointly-genotyped VCF (position-sharded, `snp_vcfs` input) -* Read count, BAF, PE, and SR files ([Module 00a](#module00a)) -* Caller VCFs ([Module 00a](#module00a)) -* Contig ploidy model and gCNV model files (gCNV training) - -#### Outputs: -* Combined read count matrix, SR, PE, and BAF files -* Standardized call VCFs -* Depth-only (DEL/DUP) calls -* Per-sample median coverage estimates -* (Optional) Evidence QC plots - - -## Module 01 -Clusters SV calls across a batch. - -#### Prerequisites: -* [Module 00c](#module00c) - -#### Inputs: -* Standardized call VCFs ([Module 00c](#module00c)) -* Depth-only (DEL/DUP) calls ([Module 00c](#module00c)) - -#### Outputs: -* Clustered SV VCFs -* Clustered depth-only call VCF - - -## Module 02 -Generates variant metrics for filtering. - -#### Prerequisites: -* [Module 01](#module01) - -#### Inputs: -* Combined read count matrix, SR, PE, and BAF files ([Module 00c](#module00c)) -* Per-sample median coverage estimates ([Module 00c](#module00c)) -* Clustered SV VCFs ([Module 01](#module01)) -* Clustered depth-only call VCF ([Module 01](#module01)) - -#### Outputs: -* Metrics file - - -## Module 03 -Filters poor quality variants and filters outlier samples. - -#### Prerequisites: -* [Module 02](#module02) - -#### Inputs: -* Batch PED file -* Metrics file ([Module 02](#module02)) -* Clustered SV and depth-only call VCFs ([Module 01](#module01)) - -#### Outputs: -* Filtered SV (non-depth-only a.k.a. "PESR") VCF with outlier samples excluded -* Filtered depth-only call VCF with outlier samples excluded -* Random forest cutoffs file -* PED file with outlier samples excluded - - -## Merge Cohort VCFs -Combines filtered variants across batches. The WDL can be found at: `/wdl/MergeCohortVcfs.wdl`. - -#### Prerequisites: -* [Module 03](#module03) - -#### Inputs: -* List of filtered PESR VCFs ([Module 03](#module03)) -* List of filtered depth VCFs ([Module 03](#module03)) - -#### Outputs: -* Combined cohort PESR and depth VCFs -* Cohort and clustered depth variant BED files - - -## Module 04 -Genotypes a batch of samples across unfiltered variants combined across all batches. - -#### Prerequisites: -* [Module 03](#module03) -* Merge Cohort VCFs - -#### Inputs: -* Batch PESR and depth VCFs ([Module 03](#module03)) -* Cohort PESR and depth VCFs (Merge Cohort VCFs) -* Batch read count, PE, and SR files ([Module 00c](#module00c)) - -#### Outputs: -* Filtered SV (non-depth-only a.k.a. "PESR") VCF with outlier samples excluded -* Filtered depth-only call VCF with outlier samples excluded -* PED file with outlier samples excluded -* List of SR pass variants -* List of SR fail variants -* (Optional) Depth re-genotyping intervals list - - -## Module 04b -Re-genotypes probable mosaic variants across multiple batches. - -#### Prerequisites: -* [Module 04](#module04) - -#### Inputs: -* Per-sample median coverage estimates ([Module 00c](#module00c)) -* Pre-genotyping depth VCFs ([Module 03](#module03)) -* Batch PED files ([Module 03](#module03)) -* Clustered depth variant BED file (Merge Cohort VCFs) -* Cohort depth VCF (Merge Cohort VCFs) -* Genotyped depth VCFs ([Module 04](#module04)) -* Genotyped depth RD cutoffs file ([Module 04](#module04)) - -#### Outputs: -* Re-genotyped depth VCFs - - -## Module 05/06 -Combines variants across multiple batches, resolves complex variants, re-genotypes, and performs final VCF clean-up. 
- -#### Prerequisites: -* [Module 04](#module04) -* (Optional) [Module 04b](#module04b) - -#### Inputs: -* RD, PE and SR file URIs ([Module 00c](#module00c)) -* Batch filtered PED file URIs ([Module 03](#module03)) -* Genotyped PESR VCF URIs ([Module 04](#module04)) -* Genotyped depth VCF URIs ([Module 04](#module04) or [04b](#module04b)) -* SR pass variant file URIs ([Module 04](#module04)) -* SR fail variant file URIs ([Module 04](#module04)) -* Genotyping cutoff file URIs ([Module 04](#module04)) -* Batch IDs -* Sample ID list URIs - -#### Outputs: -* Finalized "cleaned" VCF and QC plots - -## Module 07 (in development) -Apply downstream filtering steps to the cleaned vcf to further control the false discovery rate; all steps are optional and users should decide based on the specific purpose of their projects. - -Filterings methods include: -* minGQ - remove variants based on the genotype quality across populations. -Note: Trio families are required to build the minGQ filtering model in this step. We provide tables pre-trained with the 1000 genomes samples at different FDR thresholds for projects that lack family structures, and they can be found here: -``` -gs://gatk-sv-resources-public/hg38/v0/sv-resources/ref-panel/1KG/v2/mingq/1KGP_2504_and_698_with_GIAB.10perc_fdr.PCRMINUS.minGQ.filter_lookup_table.txt -gs://gatk-sv-resources-public/hg38/v0/sv-resources/ref-panel/1KG/v2/mingq/1KGP_2504_and_698_with_GIAB.1perc_fdr.PCRMINUS.minGQ.filter_lookup_table.txt -gs://gatk-sv-resources-public/hg38/v0/sv-resources/ref-panel/1KG/v2/mingq/1KGP_2504_and_698_with_GIAB.5perc_fdr.PCRMINUS.minGQ.filter_lookup_table.txt -``` - -* BatchEffect - remove variants that show significant discrepancies in allele frequencies across batches -* FilterOutlierSamples - remove outlier samples with unusually high or low number of SVs -* FilterCleanupQualRecalibration - sanitize filter columns and recalibrate variant QUAL scores for easier interpretation - -## Module 08 (in development) -Add annotations, such as the inferred function and allele frequencies of variants, to final vcf. - -Annotations methods include: -* Functional annotation - annotate SVs with inferred function on protein coding regions, regulatory regions such as UTR and Promoters and other non coding elements; -* Allele Frequency annotation - annotate SVs with their allele frequencies across all samples, and samples of specific sex, as well as specific sub-populations. -* Allele Frequency annotation with external callset - annotate SVs with the allele frequencies of their overlapping SVs in another callset, eg. gnomad SV callset. - -## Module 09 (in development) -Visualize SVs with [IGV](http://software.broadinstitute.org/software/igv/) screenshots and read depth plots. - -Visualization methods include: -* RD Visualization - generate RD plots across all samples, ideal for visualizing large CNVs. -* IGV Visualization - generate IGV plots of each SV for individual sample, ideal for visualizing de novo small SVs. -* Module09.visualize.wdl - generate RD plots and IGV plots, and combine them for easy review. - - -## Troubleshooting - -### VM runs out of memory or disk -* Default pipeline settings are tuned for batches of 100 samples. Larger batches or cohorts may require additional VM resources. Most runtime attributes can be modified through the `RuntimeAttr` inputs. These are formatted like this in the json: -``` -"ModuleX.runtime_attr_override": { - "disk_gb": 100, - "mem_gb": 16 -}, -``` -Note that a subset of the struct attributes can be specified. 
See `wdl/Structs.wdl` for available attributes. +Core analysis team: Ryan Collins, Xuefang Zhao, Mark Walker, Harrison Brand, Emma Pierce-Hoffman, Alba Sanchis-Juan, and Chris Whelan diff --git a/dockerfiles/expansion-hunter-denovo/Dockerfile b/dockerfiles/expansion-hunter-denovo/Dockerfile index 53a19a1df..c10607230 100644 --- a/dockerfiles/expansion-hunter-denovo/Dockerfile +++ b/dockerfiles/expansion-hunter-denovo/Dockerfile @@ -1,10 +1,17 @@ -FROM alpine:latest -RUN apk --no-cache add curl && \ +FROM python:3.7-slim +RUN apt-get update && apt-get install -y \ + wget \ + && apt-get clean \ + && rm -rf /var/lib/apt/lists/* && \ + wget https://github.com/Illumina/ExpansionHunterDenovo/releases/download/v0.9.0/ExpansionHunterDenovo-v0.9.0-linux_x86_64.tar.gz && \ mkdir ehdn_extract && \ tar -xf *.tar.gz --strip-components=1 -C ehdn_extract && \ rm -rf *.tar.gz && \ mkdir ehdn && \ mv ehdn_extract/bin/* ehdn/ && \ - rm -rf ehdn_extract + mv ehdn_extract/scripts ehdn/ && \ + rm -rf ehdn_extract && \ + pip install -r /ehdn/scripts/requirements.txt ENV PATH="/ehdn/:$PATH" +ENV SCRIPTS_DIR /ehdn/scripts diff --git a/dockerfiles/igv/MakeRDtest.py b/dockerfiles/igv/MakeRDtest.py index 7022509de..2506e8948 100755 --- a/dockerfiles/igv/MakeRDtest.py +++ b/dockerfiles/igv/MakeRDtest.py @@ -1,228 +1,263 @@ import os import numpy as np from PIL import Image -import PIL from PIL import ImageFont from PIL import ImageDraw import argparse -#Image helper function -#stack two or more images vertically -def vstack(lst,outname): - # given a list of image files, stack them vertically then save as +# Image helper function +# stack two or more images vertically + + +def vstack(lst, outname): + # given a list of image files, stack them vertically then save as list_im = lst # list of image files - imgs = [ Image.open(i) for i in list_im ] + imgs = [Image.open(i) for i in list_im] # pick the image which is the smallest, and resize the others to match it (can be arbitrary image shape here) - min_shape = sorted( [(np.sum(i.size), i.size ) for i in imgs])[0][1][0] - imgs_comb = np.vstack([np.asarray( i.resize((min_shape,int(i.size[1]/i.size[0]*min_shape))) ) for i in imgs ] ) + min_shape = sorted([(np.sum(i.size), i.size) for i in imgs])[0][1][0] + imgs_comb = np.vstack([np.asarray( + i.resize((min_shape, int(i.size[1] / i.size[0] * min_shape)))) for i in imgs]) # save that beautiful picture - imgs_comb = Image.fromarray( imgs_comb,"RGB") - imgs_comb.save( outname ) + imgs_comb = Image.fromarray(imgs_comb, "RGB") + imgs_comb.save(outname) # combine two images side by side -def hstack(f1,f2,name): + + +def hstack(f1, f2, name): # given two images, put them side by side, then save to name - list_im = [f1,f2] - imgs = [ Image.open(i) for i in list_im ] + list_im = [f1, f2] + imgs = [Image.open(i) for i in list_im] # pick the image which is the smallest, and resize the others to match it (can be arbitrary image shape here) - min_shape = sorted( [(np.sum(i.size), i.size ) for i in imgs])[0][1] + min_shape = sorted([(np.sum(i.size), i.size) for i in imgs])[0][1] # print(min_shape) - imgs_comb = np.hstack( [np.asarray( i.resize((min_shape))) for i in imgs ]) + imgs_comb = np.hstack([np.asarray(i.resize((min_shape))) for i in imgs]) # save that beautiful picture - imgs_comb = Image.fromarray( imgs_comb,"RGB") - imgs_comb.save( name ) + imgs_comb = Image.fromarray(imgs_comb, "RGB") + imgs_comb.save(name) + -def words(STR1,STR2,outfile,n=100): +def words(STR1, STR2, outfile, n=100): font = ImageFont.truetype("arial.ttf", 70) - 
img = Image.new("RGB", (1800,300), (255,255,255)) + img = Image.new("RGB", (1800, 300), (255, 255, 255)) draw = ImageDraw.Draw(img) - draw.text((n,10), STR1, (0,0,0),font=font) + draw.text((n, 10), STR1, (0, 0, 0), font=font) draw = ImageDraw.Draw(img) - draw.text((n,150), STR2, (0,0,0),font=font) + draw.text((n, 150), STR2, (0, 0, 0), font=font) draw = ImageDraw.Draw(img) img.save(outfile) ########## # class Rdplotprefix(): - # def __init__(self,variantfile,GetVariantFunc=GetVariants,pedfile,prefixfile,pesrdir,rddir): - # self.variants=GetVariantFunc(inputfile,pedfile,prefixfile).variants +# def __init__(self,variantfile,GetVariantFunc=GetVariants,pedfile,prefixfile,pesrdir,rddir): +# self.variants=GetVariantFunc(inputfile,pedfile,prefixfile).variants + + class Variant(): - def __init__(self,chr,start,end,name,type,samples,varname,prefix): - self.chr=chr - self.coord=str(chr)+":"+str(start)+"-"+str(end) - self.start=start - self.end=end - self.name=name - self.type=type - self.prefix=prefix - self.varname=varname - self.sample=samples - self.samples=samples.split(",") - def pesrplotname(self,dir): - if os.path.isfile(dir+self.varname+".png"): - return dir+self.varname+".png" - elif os.path.isfile(dir+self.varname+".left.png") and os.path.isfile(dir+self.varname+".right.png"): - hstack(dir+self.varname+".left.png",dir+self.varname+".right.png",dir+self.varname+".png") - return dir+self.varname+".png" - else: - raise Exception(dir+self.varname+".png"+" PESR files not found") - def rdplotname(self,dir,maxcutoff=float("inf")): - if int(self.end)-int(self.start)>maxcutoff: - medium=(int(self.end)+int(self.start))/2 - newstart=str(round(medium-maxcutoff/2)) - newend=str(round(medium+maxcutoff/2)) - else: - newstart=self.start - newend=self.end - if os.path.isfile(dir+self.chr+"_"+newstart+"_"+newend+"_"+self.samples[0]+"_"+self.name+"_"+self.prefix+".jpg"): - return dir+self.chr+"_"+newstart+"_"+newend+"_"+self.samples[0]+"_"+self.name+"_"+self.prefix+".jpg" - elif os.path.isfile(dir+self.chr+"_"+newstart+"_"+newend+"_"+self.samples[0]+"_"+self.name+"_"+self.prefix+".jpg"): - return dir+self.chr+"_"+newstart+"_"+newend+"_"+self.samples[0]+"_"+self.name+"_"+self.prefix+".jpg" - else: - raise Exception(dir+self.chr+"_"+newstart+"_"+newend+"_"+self.samples[0]+"_"+self.name+"_"+self.prefix+".jpg"+" Rdplot not found") - def makeplot(self,pedir,rddir,outdir,flank,build="hg38"): - if self.type!="INS": - if int(self.end)-int(self.start)<2000: - STR2=self.varname+" "+str(int(self.end)-int(self.start))+'bp' - else: - STR2=self.varname+" "+str(int((int(self.end)-int(self.start))/1000))+'kb' - else: - STR2=self.varname - pesrplot=self.pesrplotname(pedir) - if self.type=="DUP" or self.type=="DEL": - rdplot=self.rdplotname(rddir, flank) - img = Image.open(rdplot) #rd plot - img2 = img.crop((0, 230, img.size[0], img.size[1])) # crop out original RD plot annotations - img2.save("croprd.jpg") - # get new annotation - STR1=self.chr+":"+'{0:,}'.format(int(self.start))+'-'+'{0:,}'.format(int(self.end))+" (+"+build+")" - outfile='info.jpg' - words(STR1,STR2,outfile,100) # new Rd plot - vstack(['info.jpg',"croprd.jpg",pesrplot],outdir+self.varname+"_denovo.png") # combine rd pe and sr together - else: - STR1=self.chr+":"+'{0:,}'.format(int(self.start))+'-'+'{0:,}'.format(int(self.end))+" (hg38)" - outfile='info.jpg' - words(STR1,STR2,outfile,50) - vstack(['info.jpg',pesrplot],outdir+self.varname+"_denovo.png") + def __init__(self, chr, start, end, name, type, samples, varname, prefix): + self.chr = chr + self.coord 
= str(chr) + ":" + str(start) + "-" + str(end) + self.start = start + self.end = end + self.name = name + self.type = type + self.prefix = prefix + self.varname = varname + self.sample = samples + self.samples = samples.split(",") + + def pesrplotname(self, dir): + if os.path.isfile(dir + self.varname + ".png"): + return dir + self.varname + ".png" + elif os.path.isfile(dir + self.varname + ".left.png") and os.path.isfile(dir + self.varname + ".right.png"): + hstack(dir + self.varname + ".left.png", dir + + self.varname + ".right.png", dir + self.varname + ".png") + return dir + self.varname + ".png" + else: + raise Exception(dir + self.varname + ".png" + + " PESR files not found") + + def rdplotname(self, dir, maxcutoff=float("inf")): + if int(self.end) - int(self.start) > maxcutoff: + medium = (int(self.end) + int(self.start)) / 2 + newstart = str(round(medium - maxcutoff / 2)) + newend = str(round(medium + maxcutoff / 2)) + else: + newstart = self.start + newend = self.end + if os.path.isfile(dir + self.chr + "_" + newstart + "_" + newend + "_" + self.samples[0] + "_" + self.name + "_" + self.prefix + ".jpg"): + return dir + self.chr + "_" + newstart + "_" + newend + "_" + self.samples[0] + "_" + self.name + "_" + self.prefix + ".jpg" + elif os.path.isfile(dir + self.chr + "_" + newstart + "_" + newend + "_" + self.samples[0] + "_" + self.name + "_" + self.prefix + ".jpg"): + return dir + self.chr + "_" + newstart + "_" + newend + "_" + self.samples[0] + "_" + self.name + "_" + self.prefix + ".jpg" + else: + raise Exception(dir + self.chr + "_" + newstart + "_" + newend + "_" + + self.samples[0] + "_" + self.name + "_" + self.prefix + ".jpg" + " Rdplot not found") + + def makeplot(self, pedir, rddir, outdir, flank, build="hg38"): + if self.type != "INS": + if int(self.end) - int(self.start) < 2000: + STR2 = self.varname + " " + \ + str(int(self.end) - int(self.start)) + 'bp' + else: + STR2 = self.varname + " " + \ + str(int((int(self.end) - int(self.start)) / 1000)) + 'kb' + else: + STR2 = self.varname + pesrplot = self.pesrplotname(pedir) + if self.type == "DUP" or self.type == "DEL": + rdplot = self.rdplotname(rddir, flank) + img = Image.open(rdplot) # rd plot + # crop out original RD plot annotations + img2 = img.crop((0, 230, img.size[0], img.size[1])) + img2.save("croprd.jpg") + # get new annotation + STR1 = self.chr + ":" + \ + '{0:,}'.format(int(self.start)) + '-' + \ + '{0:,}'.format(int(self.end)) + " (+" + build + ")" + outfile = 'info.jpg' + words(STR1, STR2, outfile, 100) # new Rd plot + vstack(['info.jpg', "croprd.jpg", pesrplot], outdir + + self.varname + "_denovo.png") # combine rd pe and sr together + else: + STR1 = self.chr + ":" + \ + '{0:,}'.format(int(self.start)) + '-' + \ + '{0:,}'.format(int(self.end)) + " (hg38)" + outfile = 'info.jpg' + words(STR1, STR2, outfile, 50) + vstack(['info.jpg', pesrplot], outdir + + self.varname + "_denovo.png") + class VariantInfo(): - def __init__(self,pedfile,prefix): - self.pedfile=pedfile - self.prefixdir={} - if os.path.isfile(prefix): - self.prefixfile=prefix - self.prefix=set([]) - with open(self.prefixfile,"r") as f: - for line in f: - if "#" not in line: - prefix,sample=line.rstrip().split() - self.prefixdir[sample]=prefix - self.prefix.add(prefix) - else: - self.prefix=prefix - famdct={} - reversedct={} - self.samplelist=[] - with open(pedfile,"r") as f: - for line in f: - dat=line.split() - [fam,sample,father,mother]=dat[0:4] - if father+","+mother not in famdct.keys(): - famdct[father+","+mother]=[sample] + def 
__init__(self, pedfile, prefix): + self.pedfile = pedfile + self.prefixdir = {} + if os.path.isfile(prefix): + self.prefixfile = prefix + self.prefix = set([]) + with open(self.prefixfile, "r") as f: + for line in f: + if "#" not in line: + prefix, sample = line.rstrip().split() + self.prefixdir[sample] = prefix + self.prefix.add(prefix) else: - famdct[father+","+mother].append(sample) - reversedct[sample]=father+","+mother - self.samplelist.append(sample) - self.famdct=famdct - self.reversedct=reversedct - ## QC - # if self.prefixdir!={}: - # if set(self.samplelist)!=set(self.prefixdir.keys()): + self.prefix = prefix + famdct = {} + reversedct = {} + self.samplelist = [] + with open(pedfile, "r") as f: + for line in f: + dat = line.split() + [fam, sample, father, mother] = dat[0:4] + if father + "," + mother not in famdct.keys(): + famdct[father + "," + mother] = [sample] + else: + famdct[father + "," + mother].append(sample) + reversedct[sample] = father + "," + mother + self.samplelist.append(sample) + self.famdct = famdct + self.reversedct = reversedct + # QC + # if self.prefixdir!={}: + # if set(self.samplelist)!=set(self.prefixdir.keys()): # raise Exception("prefix file and ped file has samples mismatch") - - def getprefix(self,sample): - if self.prefixdir=={}: - return self.prefix - else: - return self.prefixdir[sample] - def getnuclear(self,sample): - parents=self.reversedct[sample] - if parents!="0,0": - kids=self.famdct[parents].copy() - kids.remove(sample) - return sample+','+parents - else: - return sample + + def getprefix(self, sample): + if self.prefixdir == {}: + return self.prefix + else: + return self.prefixdir[sample] + + def getnuclear(self, sample): + parents = self.reversedct[sample] + if parents != "0,0": + kids = self.famdct[parents].copy() + kids.remove(sample) + return sample + ',' + parents + else: + return sample + class GetVariants(): - def __init__(self,inputfile,pedfile,prefix): - self.inputfile=inputfile - self.variants=[] - self.variantinfo=VariantInfo(pedfile,prefix) - with open(inputfile,"r") as f: - for line in f: - if "#" not in line: - dat=line.rstrip().split("\t") - [chr,start,end,name,type,samples]=dat[0:6] - sample=samples.split(',')[0] - varname=samples.split(',')[0]+'_'+name - if "," in sample: - raise Exception("should only have 1 sample per variant") - prefix=self.variantinfo.getprefix(sample) - nuclearfam=self.variantinfo.getnuclear(sample) - variant=Variant(chr,start,end,name,type,nuclearfam,varname,prefix) - self.variants.append(variant) - def GetRdfiles(self): - with open(self.inputfile+".igv","w") as g: - if self.variantinfo.prefixdir!={}: - for prefix in self.variantinfo.prefix: - open(self.inputfile+'_'+prefix+".txt", 'w').close() - else: - open(self.inputfile+'_'+self.variantinfo.prefix+".txt", 'w').close() - for variant in self.variants: - f=open(self.inputfile+'_'+variant.prefix+".txt",'a') - f.write("\t".join([variant.chr,variant.start,variant.end,variant.name,variant.type,variant.sample])+'\n') - g.write("\t".join([variant.chr,variant.start,variant.end,variant.name,variant.type,variant.sample,variant.varname])+'\n') - f.close() + def __init__(self, inputfile, pedfile, prefix): + self.inputfile = inputfile + self.variants = [] + self.variantinfo = VariantInfo(pedfile, prefix) + with open(inputfile, "r") as f: + for line in f: + if "#" not in line: + dat = line.rstrip().split("\t") + [chr, start, end, name, type, samples] = dat[0:6] + sample = samples.split(',')[0] + varname = samples.split(',')[0] + '_' + name + if "," in sample: + 
raise Exception( + "should only have 1 sample per variant") + prefix = self.variantinfo.getprefix(sample) + nuclearfam = self.variantinfo.getnuclear(sample) + variant = Variant(chr, start, end, name, + type, nuclearfam, varname, prefix) + self.variants.append(variant) + + def GetRdfiles(self): + with open(self.inputfile + ".igv", "w") as g: + if self.variantinfo.prefixdir != {}: + for prefix in self.variantinfo.prefix: + open(self.inputfile + '_' + prefix + ".txt", 'w').close() + else: + open(self.inputfile + '_' + + self.variantinfo.prefix + ".txt", 'w').close() + for variant in self.variants: + f = open(self.inputfile + '_' + variant.prefix + ".txt", 'a') + f.write("\t".join([variant.chr, variant.start, variant.end, + variant.name, variant.type, variant.sample]) + '\n') + g.write("\t".join([variant.chr, variant.start, variant.end, + variant.name, variant.type, variant.sample, variant.varname]) + '\n') + f.close() + class GetDenovoPlots(): - def __init__(self,inputfile,pedfile,prefix,pedir,rddir,outdir,flank,build="hg38",GetVariantFunc=GetVariants): - self.variants=GetVariantFunc(inputfile,pedfile,prefix).variants - if pedir[-1]=="/": - self.pedir=pedir - else: - self.pedir=pedir+"/" - if rddir[-1]=="/": - self.rddir=rddir - else: - self.rddir=rddir+"/" - if outdir[-1]=="/": - self.outdir=outdir - else: - self.outdir=outdir+"/" - self.build=build - self.flank=flank - def getplots(self): - for variant in self.variants: - variant.makeplot(self.pedir,self.rddir,self.outdir, self.flank ,self.build) - -#Main block -def main(): - parser = argparse.ArgumentParser( - description=__doc__, - formatter_class=argparse.RawDescriptionHelpFormatter) - parser.add_argument('varfile') - parser.add_argument('pedfile') - parser.add_argument('prefix') - parser.add_argument('flank') - parser.add_argument('pedir') - parser.add_argument('rddir') - parser.add_argument('outdir') - args = parser.parse_args() - obj=GetDenovoPlots(args.varfile,args.pedfile,args.prefix,args.pedir,args.rddir,args.outdir,int(args.flank),"hg38",GetVariants) - obj.getplots() -if __name__ == '__main__': - main() + def __init__(self, inputfile, pedfile, prefix, pedir, rddir, outdir, flank, build="hg38", GetVariantFunc=GetVariants): + self.variants = GetVariantFunc(inputfile, pedfile, prefix).variants + if pedir[-1] == "/": + self.pedir = pedir + else: + self.pedir = pedir + "/" + if rddir[-1] == "/": + self.rddir = rddir + else: + self.rddir = rddir + "/" + if outdir[-1] == "/": + self.outdir = outdir + else: + self.outdir = outdir + "/" + self.build = build + self.flank = flank + + def getplots(self): + for variant in self.variants: + variant.makeplot(self.pedir, self.rddir, + self.outdir, self.flank, self.build) +# Main block +def main(): + parser = argparse.ArgumentParser( + description=__doc__, + formatter_class=argparse.RawDescriptionHelpFormatter) + parser.add_argument('varfile') + parser.add_argument('pedfile') + parser.add_argument('prefix') + parser.add_argument('flank') + parser.add_argument('pedir') + parser.add_argument('rddir') + parser.add_argument('outdir') + args = parser.parse_args() + obj = GetDenovoPlots(args.varfile, args.pedfile, args.prefix, args.pedir, + args.rddir, args.outdir, int(args.flank), "hg38", GetVariants) + obj.getplots() + +if __name__ == '__main__': + main() diff --git a/dockerfiles/igv/igv.py b/dockerfiles/igv/igv.py index d7de680de..1a124f446 100755 --- a/dockerfiles/igv/igv.py +++ b/dockerfiles/igv/igv.py @@ -1,13 +1,12 @@ import sys -import sys -[_,varfile]=sys.argv -plotdir="plots" -igvfile="igv.txt" 
-igvsh="igv.sh" -with open(varfile,'r') as f: +[_, varfile] = sys.argv +plotdir = "plots" +igvfile = "igv.txt" +igvsh = "igv.sh" +with open(varfile, 'r') as f: for line in f: - dat=line.split('\t') - chr=dat[0] - start=dat[1] - end=dat[2] - data=dat[3].split(',') \ No newline at end of file + dat = line.split('\t') + chr = dat[0] + start = dat[1] + end = dat[2] + data = dat[3].split(',') diff --git a/dockerfiles/igv/makeigv_cram.py b/dockerfiles/igv/makeigv_cram.py index 8fe4731fa..24d4fbc1b 100755 --- a/dockerfiles/igv/makeigv_cram.py +++ b/dockerfiles/igv/makeigv_cram.py @@ -1,18 +1,22 @@ -import sys,os,argparse -#[_,varfile,buff,fasta]=sys.argv #assume the varfile has *.bed in the end -# Example +import os +import argparse +# [_,varfile,buff,fasta]=sys.argv #assume the varfile has *.bed in the end +# Example # python makeigv.py /data/talkowski/xuefang/local/src/IGV_2.4.14/IL_DUP/IL.DUP.HG00514.V2.bed /data/talkowski/Samples/1000Genomes/HGSV_Illumina_Alignment_GRCh38 400 # bash IL.DUP.HG00514.V2.sh # bash igv.sh -b IL.DUP.HG00514.V2.txt parser = argparse.ArgumentParser("makeigvsplit_cram.py") -parser.add_argument('varfile', type=str, help='name of variant file in bed format, with cram and SVID in last two columns') -parser.add_argument('buff', type=str, help='length of buffer to add around variants') +parser.add_argument('varfile', type=str, + help='name of variant file in bed format, with cram and SVID in last two columns') +parser.add_argument( + 'buff', type=str, help='length of buffer to add around variants') parser.add_argument('fasta', type=str, help='reference sequences') parser.add_argument('sample', type=str, help='name of sample to make igv on') -parser.add_argument('chromosome', type=str, help='name of chromosome to plot igv on') +parser.add_argument('chromosome', type=str, + help='name of chromosome to plot igv on') args = parser.parse_args() @@ -20,51 +24,58 @@ fasta = args.fasta varfile = args.varfile -outstring=os.path.basename(varfile)[0:-4] -bamdir="bam" -outdir="screenshot" -igvfile="igv.txt" -bamfiscript="igv.sh" +outstring = os.path.basename(varfile)[0:-4] +bamdir = "bam" +outdir = "screenshot" +igvfile = "igv.txt" +bamfiscript = "igv.sh" ################################### -with open(bamfiscript,'w') as h: +with open(bamfiscript, 'w') as h: h.write("#!/bin/bash\n") h.write("set -e\n") h.write("mkdir -p {}\n".format(bamdir)) h.write("mkdir -p {}\n".format(outdir)) - with open(igvfile,'w') as g: + with open(igvfile, 'w') as g: g.write('new\n') g.write('genome {}\n'.format(fasta)) - with open(varfile,'r') as f: + with open(varfile, 'r') as f: for line in f: - dat=line.rstrip().split("\t") - Chr=dat[0] - if not Chr == args.chromosome: continue - Start=str(int(dat[1])-buff) - End=str(int(dat[2])+buff) - Dat=dat[3].split(',') - ID=dat[4] + dat = line.rstrip().split("\t") + Chr = dat[0] + if not Chr == args.chromosome: + continue + Start = str(int(dat[1]) - buff) + End = str(int(dat[2]) + buff) + Dat = dat[3].split(',') + ID = dat[4] for cram in Dat: - #sample=cram.split("/")[-1].split('.')[0] - g.write('load '+bamdir+'/'+args.sample+'_'+args.chromosome+'.bam\n') - if int(End)-int(Start)<10000: - g.write('goto '+Chr+":"+Start+'-'+End+'\n') + # sample=cram.split("/")[-1].split('.')[0] + g.write('load ' + bamdir + '/' + args.sample + + '_' + args.chromosome + '.bam\n') + if int(End) - int(Start) < 10000: + g.write('goto ' + Chr + ":" + Start + '-' + End + '\n') g.write('sort base\n') g.write('viewaspairs\n') g.write('collapse\n') - g.write('snapshotDirectory '+outdir+'\n') - 
g.write('snapshot '+args.sample+'_'+ID+'.png\n' ) + g.write('snapshotDirectory ' + outdir + '\n') + g.write('snapshot ' + args.sample + '_' + ID + '.png\n') else: - g.write('goto '+Chr+":"+Start+'-'+str(int(Start)+1000)+'\n') # Extra 1kb buffer if variant large + # Extra 1kb buffer if variant large + g.write('goto ' + Chr + ":" + Start + + '-' + str(int(Start) + 1000) + '\n') g.write('sort base\n') g.write('viewaspairs\n') g.write('collapse\n') - g.write('snapshotDirectory '+outdir+'\n') - g.write('snapshot '+args.sample+'_'+ID+'.left.png\n' ) - g.write('goto '+Chr+":"+str(int(End)-1000)+'-'+End+'\n') + g.write('snapshotDirectory ' + outdir + '\n') + g.write('snapshot ' + args.sample + + '_' + ID + '.left.png\n') + g.write('goto ' + Chr + ":" + + str(int(End) - 1000) + '-' + End + '\n') g.write('sort base\n') g.write('collapse\n') - g.write('snapshotDirectory '+outdir+'\n') - g.write('snapshot '+args.sample+'_'+ID+'.right.png\n' ) + g.write('snapshotDirectory ' + outdir + '\n') + g.write('snapshot ' + args.sample + + '_' + ID + '.right.png\n') # g.write('goto '+Chr+":"+Start+'-'+End+'\n') # g.write('sort base\n') # g.write('viewaspairs\n') @@ -73,4 +84,4 @@ # g.write('snapshot '+ID+'.png\n' ) g.write('new\n') g.write('exit\n') -# with open(bamfiscript,'w') as g: \ No newline at end of file +# with open(bamfiscript,'w') as g: diff --git a/dockerfiles/igv/makeigvpesr_cram.py b/dockerfiles/igv/makeigvpesr_cram.py index 05a01bc99..6e700001b 100755 --- a/dockerfiles/igv/makeigvpesr_cram.py +++ b/dockerfiles/igv/makeigvpesr_cram.py @@ -1,21 +1,26 @@ -import sys,os,argparse -#[_,varfile,buff,fasta]=sys.argv #assume the varfile has *.bed in the end -# Usage +import os +import argparse +# [_,varfile,buff,fasta]=sys.argv #assume the varfile has *.bed in the end +# Usage # python makeigvpesr_cram.py varfile fasta sample ped cram_list buffer chromosome # bash IL.DUP.HG00514.V2.sh # bash igv.sh -b IL.DUP.HG00514.V2.txt parser = argparse.ArgumentParser("makeigvsplit_cram.py") -parser.add_argument('varfile', type=str, help='name of variant file in bed format, with cram and SVID in last two columns') +parser.add_argument('varfile', type=str, + help='name of variant file in bed format, with cram and SVID in last two columns') parser.add_argument('fasta', type=str, help='reference sequences') -#parser.add_argument('bam', type=str, help='name of bam to make igv on') +# parser.add_argument('bam', type=str, help='name of bam to make igv on') parser.add_argument('sample', type=str, help='name of sample to make igv on') parser.add_argument('ped', type=str, help='name of ped file') -parser.add_argument('cram_list', type=str, help='a file including sample and cram path') -parser.add_argument('outdir', type=str, help = 'output folder') -parser.add_argument('-b','--buff', type=str, help='length of buffer to add around variants', default=500) -parser.add_argument('-c','--chromosome', type=str, help='name of chromosome to make igv on', default='all') +parser.add_argument('cram_list', type=str, + help='a file including sample and cram path') +parser.add_argument('outdir', type=str, help='output folder') +parser.add_argument('-b', '--buff', type=str, + help='length of buffer to add around variants', default=500) +parser.add_argument('-c', '--chromosome', type=str, + help='name of chromosome to make igv on', default='all') args = parser.parse_args() @@ -24,84 +29,91 @@ fasta = args.fasta varfile = args.varfile -outstring=os.path.basename(varfile)[0:-4] -bamdir="pe_bam" -outdir=args.outdir -igvfile="pe.txt" 
-bamfiscript="pe.sh" +outstring = os.path.basename(varfile)[0:-4] +bamdir = "pe_bam" +outdir = args.outdir +igvfile = "pe.txt" +bamfiscript = "pe.sh" ################################### sample = args.sample chromosome = args.chromosome + def ped_info_readin(ped_file): - out={} - fin=open(ped_file) + out = {} + fin = open(ped_file) for line in fin: - pin=line.strip().split() + pin = line.strip().split() if not pin[1] in out.keys(): - out[pin[1]]=[pin[1]] - if not(pin[2])==0: + out[pin[1]] = [pin[1]] + if not(pin[2]) == 0: out[pin[1]].append(pin[2]) - if not(pin[3])==0: + if not(pin[3]) == 0: out[pin[1]].append(pin[3]) fin.close() return out + def cram_info_readin(cram_file): - out={} - fin=open(cram_file) + out = {} + fin = open(cram_file) for line in fin: - pin=line.strip().split() + pin = line.strip().split() if not pin[0] in out.keys(): - out[pin[0]]=pin[1:] + out[pin[0]] = pin[1:] fin.close() return(out) + ped_info = ped_info_readin(args.ped) cram_info = cram_info_readin(args.cram_list) -cram_list=[] +cram_list = [] for member in ped_info[sample]: if member in cram_info.keys(): cram_list.append(cram_info[member][0]) -with open(bamfiscript,'w') as h: +with open(bamfiscript, 'w') as h: h.write("#!/bin/bash\n") h.write("set -e\n") h.write("mkdir -p {}\n".format(bamdir)) h.write("mkdir -p {}\n".format(outdir)) - with open(igvfile,'w') as g: + with open(igvfile, 'w') as g: g.write('new\n') g.write('genome {}\n'.format(fasta)) - with open(varfile,'r') as f: + with open(varfile, 'r') as f: for line in f: - dat=line.rstrip().split("\t") - Chr=dat[0] - if not chromosome=='all': - if not Chr == chromosome: continue - Start=str(int(dat[1])-buff) - End=str(int(dat[2])+buff) - ID=dat[4] + dat = line.rstrip().split("\t") + Chr = dat[0] + if not chromosome == 'all': + if not Chr == chromosome: + continue + Start = str(int(dat[1]) - buff) + End = str(int(dat[2]) + buff) + ID = dat[4] for cram in cram_list: - g.write('load '+cram+'\n') - if int(End)-int(Start)<10000: - g.write('goto '+Chr+":"+Start+'-'+End+'\n') + g.write('load ' + cram + '\n') + if int(End) - int(Start) < 10000: + g.write('goto ' + Chr + ":" + Start + '-' + End + '\n') g.write('sort base\n') g.write('viewaspairs\n') g.write('squish\n') - g.write('snapshotDirectory '+outdir+'\n') - g.write('snapshot '+sample+'_'+ID+'.png\n' ) + g.write('snapshotDirectory ' + outdir + '\n') + g.write('snapshot ' + sample + '_' + ID + '.png\n') else: - g.write('goto '+Chr+":"+Start+'-'+str(int(Start)+1000)+'\n') # Extra 1kb buffer if variant large + # Extra 1kb buffer if variant large + g.write('goto ' + Chr + ":" + Start + + '-' + str(int(Start) + 1000) + '\n') g.write('sort base\n') g.write('viewaspairs\n') g.write('squish\n') - g.write('snapshotDirectory '+outdir+'\n') - g.write('snapshot '+sample+'_'+ID+'.left.png\n' ) - g.write('goto '+Chr+":"+str(int(End)-1000)+'-'+End+'\n') + g.write('snapshotDirectory ' + outdir + '\n') + g.write('snapshot ' + sample + '_' + ID + '.left.png\n') + g.write('goto ' + Chr + ":" + + str(int(End) - 1000) + '-' + End + '\n') g.write('sort base\n') g.write('squish\n') - g.write('snapshotDirectory '+outdir+'\n') - g.write('snapshot '+sample+'_'+ID+'.right.png\n' ) + g.write('snapshotDirectory ' + outdir + '\n') + g.write('snapshot ' + sample + '_' + ID + '.right.png\n') # g.write('goto '+Chr+":"+Start+'-'+End+'\n') # g.write('sort base\n') # g.write('viewaspairs\n') @@ -110,4 +122,4 @@ def cram_info_readin(cram_file): # g.write('snapshot '+ID+'.png\n' ) g.write('new\n') g.write('exit\n') -# with open(bamfiscript,'w') as g: 
\ No newline at end of file +# with open(bamfiscript,'w') as g: diff --git a/dockerfiles/igv/makeigvpesr_trio.py b/dockerfiles/igv/makeigvpesr_trio.py index f1a39655f..dcdf078fb 100755 --- a/dockerfiles/igv/makeigvpesr_trio.py +++ b/dockerfiles/igv/makeigvpesr_trio.py @@ -1,19 +1,24 @@ -import sys,os,argparse -#[_,varfile,buff,fasta]=sys.argv #assume the varfile has *.bed in the end -# Usage +import os +import argparse +# [_,varfile,buff,fasta]=sys.argv #assume the varfile has *.bed in the end +# Usage # python makeigvpesr_cram.py varfile fasta sample ped cram_list buffer chromosome # bash IL.DUP.HG00514.V2.sh # bash igv.sh -b IL.DUP.HG00514.V2.txt parser = argparse.ArgumentParser("makeigvsplit_cram.py") -parser.add_argument('varfile', type=str, help='name of variant file in bed format, with cram and SVID in last two columns') +parser.add_argument('varfile', type=str, + help='name of variant file in bed format, with cram and SVID in last two columns') parser.add_argument('fasta', type=str, help='reference sequences') parser.add_argument('sample', type=str, help='name of sample to make igv on') -parser.add_argument('cram_list', type=str, help='a file including sample and cram path') -parser.add_argument('outdir', type=str, help = 'output folder') -parser.add_argument('-b','--buff', type=str, help='length of buffer to add around variants', default=500) -parser.add_argument('-c','--chromosome', type=str, help='name of chromosome to make igv on', default='all') +parser.add_argument('cram_list', type=str, + help='a file including sample and cram path') +parser.add_argument('outdir', type=str, help='output folder') +parser.add_argument('-b', '--buff', type=str, + help='length of buffer to add around variants', default=500) +parser.add_argument('-c', '--chromosome', type=str, + help='name of chromosome to make igv on', default='all') args = parser.parse_args() @@ -22,81 +27,88 @@ fasta = args.fasta varfile = args.varfile -outstring=os.path.basename(varfile)[0:-4] -bamdir="pe_bam" -outdir=args.outdir -igvfile="pe.txt" -bamfiscript="pe.sh" +outstring = os.path.basename(varfile)[0:-4] +bamdir = "pe_bam" +outdir = args.outdir +igvfile = "pe.txt" +bamfiscript = "pe.sh" ################################### sample = args.sample chromosome = args.chromosome + def ped_info_readin(ped_file): - out={} - fin=open(ped_file) + out = {} + fin = open(ped_file) for line in fin: - pin=line.strip().split() + pin = line.strip().split() if not pin[1] in out.keys(): - out[pin[1]]=[pin[1]] - if not(pin[2])==0: + out[pin[1]] = [pin[1]] + if not(pin[2]) == 0: out[pin[1]].append(pin[2]) - if not(pin[3])==0: + if not(pin[3]) == 0: out[pin[1]].append(pin[3]) fin.close() return out + def cram_info_readin(cram_file): - out={} - fin=open(cram_file) + out = {} + fin = open(cram_file) for line in fin: - pin=line.strip().split() + pin = line.strip().split() if not pin[0] in out.keys(): - out[pin[0]]=pin[1:] + out[pin[0]] = pin[1:] fin.close() return(out) -#ped_info = ped_info_readin(args.ped) -#cram_info = cram_info_readin(args.cram_list) -cram_list=args.cram_list.split(',') -with open(bamfiscript,'w') as h: +# ped_info = ped_info_readin(args.ped) +# cram_info = cram_info_readin(args.cram_list) +cram_list = args.cram_list.split(',') + +with open(bamfiscript, 'w') as h: h.write("#!/bin/bash\n") h.write("set -e\n") h.write("mkdir -p {}\n".format(bamdir)) h.write("mkdir -p {}\n".format(outdir)) - with open(igvfile,'w') as g: + with open(igvfile, 'w') as g: g.write('new\n') g.write('genome {}\n'.format(fasta)) - with open(varfile,'r') 
as f: + with open(varfile, 'r') as f: for line in f: - dat=line.rstrip().split("\t") - Chr=dat[0] - if not chromosome=='all': - if not Chr == chromosome: continue - Start=str(int(dat[1])-buff) - End=str(int(dat[2])+buff) - ID=dat[4] + dat = line.rstrip().split("\t") + Chr = dat[0] + if not chromosome == 'all': + if not Chr == chromosome: + continue + Start = str(int(dat[1]) - buff) + End = str(int(dat[2]) + buff) + ID = dat[4] for cram in cram_list: - g.write('load '+cram+'\n') - if int(End)-int(Start)<10000: - g.write('goto '+Chr+":"+Start+'-'+End+'\n') + g.write('load ' + cram + '\n') + if int(End) - int(Start) < 10000: + g.write('goto ' + Chr + ":" + Start + '-' + End + '\n') g.write('sort base\n') g.write('viewaspairs\n') g.write('squish\n') - g.write('snapshotDirectory '+outdir+'\n') - g.write('snapshot '+sample+'_'+ID+'.png\n' ) + g.write('snapshotDirectory ' + outdir + '\n') + g.write('snapshot ' + sample + '_' + ID + '.png\n') else: - g.write('goto '+Chr+":"+Start+'-'+str(int(Start)+1000)+'\n') # Extra 1kb buffer if variant large + # Extra 1kb buffer if variant large + g.write('goto ' + Chr + ":" + Start + + '-' + str(int(Start) + 1000) + '\n') g.write('sort base\n') g.write('viewaspairs\n') g.write('squish\n') - g.write('snapshotDirectory '+outdir+'\n') - g.write('snapshot '+sample+'_'+ID+'.left.png\n' ) - g.write('goto '+Chr+":"+str(int(End)-1000)+'-'+End+'\n') + g.write('snapshotDirectory ' + outdir + '\n') + g.write('snapshot ' + sample + '_' + ID + '.left.png\n') + g.write('goto ' + Chr + ":" + + str(int(End) - 1000) + '-' + End + '\n') g.write('sort base\n') g.write('squish\n') - g.write('snapshotDirectory '+outdir+'\n') - g.write('snapshot '+sample+'_'+ID+'.right.png\n' ) + g.write('snapshotDirectory ' + outdir + '\n') + g.write('snapshot ' + sample + '_' + ID + '.right.png\n') # g.write('goto '+Chr+":"+Start+'-'+End+'\n') # g.write('sort base\n') # g.write('viewaspairs\n') @@ -105,4 +117,4 @@ def cram_info_readin(cram_file): # g.write('snapshot '+ID+'.png\n' ) g.write('new\n') g.write('exit\n') -# with open(bamfiscript,'w') as g: \ No newline at end of file +# with open(bamfiscript,'w') as g: diff --git a/dockerfiles/igv/makeigvsplit_cram.py b/dockerfiles/igv/makeigvsplit_cram.py index a1d1859e2..246bc7882 100755 --- a/dockerfiles/igv/makeigvsplit_cram.py +++ b/dockerfiles/igv/makeigvsplit_cram.py @@ -1,19 +1,23 @@ -import sys,os,argparse -#[_,varfile,buff,fasta]=sys.argv #assume the varfile has *.bed in the end -# Example +import os +import argparse +# [_,varfile,buff,fasta]=sys.argv #assume the varfile has *.bed in the end +# Example # python makeigv.py /data/talkowski/xuefang/local/src/IGV_2.4.14/IL_DUP/IL.DUP.HG00514.V2.bed /data/talkowski/Samples/1000Genomes/HGSV_Illumina_Alignment_GRCh38 400 # bash IL.DUP.HG00514.V2.sh # bash igv.sh -b IL.DUP.HG00514.V2.txt parser = argparse.ArgumentParser("makeigvsplit_cram.py") -parser.add_argument('varfile', type=str, help='name of variant file in bed format, with cram and SVID in last two columns') -parser.add_argument('buff', type=str, help='length of buffer to add around variants') +parser.add_argument('varfile', type=str, + help='name of variant file in bed format, with cram and SVID in last two columns') +parser.add_argument( + 'buff', type=str, help='length of buffer to add around variants') parser.add_argument('fasta', type=str, help='reference sequences') parser.add_argument('bam', type=str, help='name of bam to make igv on') parser.add_argument('sample', type=str, help='name of sample to make igv on') 
-parser.add_argument('chromosome', type=str, help='name of chromosome to make igv on', default='all') +parser.add_argument('chromosome', type=str, + help='name of chromosome to make igv on', default='all') args = parser.parse_args() @@ -22,53 +26,57 @@ fasta = args.fasta varfile = args.varfile -outstring=os.path.basename(varfile)[0:-4] -bamdir="pe_bam" -outdir="pe_screenshot" -igvfile="pe.txt" -bamfiscript="pe.sh" +outstring = os.path.basename(varfile)[0:-4] +bamdir = "pe_bam" +outdir = "pe_screenshot" +igvfile = "pe.txt" +bamfiscript = "pe.sh" ################################### sample = args.sample chromosome = args.chromosome -with open(bamfiscript,'w') as h: +with open(bamfiscript, 'w') as h: h.write("#!/bin/bash\n") h.write("set -e\n") h.write("mkdir -p {}\n".format(bamdir)) h.write("mkdir -p {}\n".format(outdir)) - with open(igvfile,'w') as g: + with open(igvfile, 'w') as g: g.write('new\n') g.write('genome {}\n'.format(fasta)) - with open(varfile,'r') as f: + with open(varfile, 'r') as f: for line in f: - dat=line.rstrip().split("\t") - Chr=dat[0] - if not chromosome=='all': - if not Chr == chromosome: continue - Start=str(int(dat[1])-buff) - End=str(int(dat[2])+buff) - Dat=dat[3].split(',') - ID=dat[4] + dat = line.rstrip().split("\t") + Chr = dat[0] + if not chromosome == 'all': + if not Chr == chromosome: + continue + Start = str(int(dat[1]) - buff) + End = str(int(dat[2]) + buff) + Dat = dat[3].split(',') + ID = dat[4] for cram in Dat: - g.write('load '+args.bam+'\n') - if int(End)-int(Start)<10000: - g.write('goto '+Chr+":"+Start+'-'+End+'\n') + g.write('load ' + args.bam + '\n') + if int(End) - int(Start) < 10000: + g.write('goto ' + Chr + ":" + Start + '-' + End + '\n') g.write('sort base\n') g.write('viewaspairs\n') g.write('squish\n') - g.write('snapshotDirectory '+outdir+'\n') - g.write('snapshot '+sample+'_'+ID+'.png\n' ) + g.write('snapshotDirectory ' + outdir + '\n') + g.write('snapshot ' + sample + '_' + ID + '.png\n') else: - g.write('goto '+Chr+":"+Start+'-'+str(int(Start)+1000)+'\n') # Extra 1kb buffer if variant large + # Extra 1kb buffer if variant large + g.write('goto ' + Chr + ":" + Start + + '-' + str(int(Start) + 1000) + '\n') g.write('sort base\n') g.write('viewaspairs\n') g.write('squish\n') - g.write('snapshotDirectory '+outdir+'\n') - g.write('snapshot '+sample+'_'+ID+'.left.png\n' ) - g.write('goto '+Chr+":"+str(int(End)-1000)+'-'+End+'\n') + g.write('snapshotDirectory ' + outdir + '\n') + g.write('snapshot ' + sample + '_' + ID + '.left.png\n') + g.write('goto ' + Chr + ":" + + str(int(End) - 1000) + '-' + End + '\n') g.write('sort base\n') g.write('squish\n') - g.write('snapshotDirectory '+outdir+'\n') - g.write('snapshot '+sample+'_'+ID+'.right.png\n' ) + g.write('snapshotDirectory ' + outdir + '\n') + g.write('snapshot ' + sample + '_' + ID + '.right.png\n') # g.write('goto '+Chr+":"+Start+'-'+End+'\n') # g.write('sort base\n') # g.write('viewaspairs\n') @@ -77,5 +85,3 @@ # g.write('snapshot '+ID+'.png\n' ) g.write('new\n') g.write('exit\n') - - diff --git a/dockerfiles/rdpesr/Modify_vcf_by_steps.py b/dockerfiles/rdpesr/Modify_vcf_by_steps.py index cdb029a14..aafa73643 100755 --- a/dockerfiles/rdpesr/Modify_vcf_by_steps.py +++ b/dockerfiles/rdpesr/Modify_vcf_by_steps.py @@ -10,42 +10,45 @@ import pysam import argparse -def modify_vcf(vcf_in_file,vcf_out_file,step_size,contig): + +def modify_vcf(vcf_in_file, vcf_out_file, step_size, contig): vcf_in = pysam.VariantFile(vcf_in_file) - vcf_out = pysam.VariantFile(vcf_out_file,'w',header = vcf_in.header) 
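Modify_vcf_by_steps.py (this diff) shifts every record's POS and END by a fixed --step_size and keeps only records whose shifted coordinates stay inside the contig. Its contig_readin() helper (continued below) reads just the first two whitespace-delimited columns, name and length, so a samtools .fai index can be passed directly as --contig. A hedged usage sketch follows; the file names and the 10 kb step are examples only, not values used anywhere in this pipeline.

# Hypothetical invocation:
#   python Modify_vcf_by_steps.py in.vcf.gz shifted.vcf -s 10000 -c ref.fa.fai
# The bounds check applied to each shifted record is equivalent to:
def keep_record(new_pos, new_stop, contig_len):
    # keep records that still start after position 0 and end within the contig
    return new_pos > 0 and not new_stop > contig_len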
+ vcf_out = pysam.VariantFile(vcf_out_file, 'w', header=vcf_in.header) for rec in vcf_in: - rec.pos+=step_size - rec.stop+=step_size - if rec.pos>0 and not rec.stop>contig[rec.contig]: + rec.pos += step_size + rec.stop += step_size + if rec.pos > 0 and not rec.stop > contig[rec.contig]: vcf_out.write(rec) vcf_out.close() vcf_in.close() def contig_readin(contig): - out={} - fin=open(contig) + out = {} + fin = open(contig) for line in fin: - pin=line.strip().split() - out[pin[0]]=int(pin[1]) + pin = line.strip().split() + out[pin[0]] = int(pin[1]) fin.close() return out + def main(): - parser = argparse.ArgumentParser(description='Shift each variants in vcf by a fixed step.') - parser.add_argument('vcf_in', metavar='', type=str, + parser = argparse.ArgumentParser( + description='Shift each variants in vcf by a fixed step.') + parser.add_argument('vcf_in', metavar='', type=str, help='name of vcf file to be modified') - parser.add_argument('vcf_out', metavar='', type=str, + parser.add_argument('vcf_out', metavar='', type=str, help='name of output vcf file') - parser.add_argument('-s','--step_size', type=int, + parser.add_argument('-s', '--step_size', type=int, help='size of step to be shifted.') - parser.add_argument('-c','--contig', type=str, + parser.add_argument('-c', '--contig', type=str, help='contig files, or reference index.') args = parser.parse_args() contig = contig_readin(args.contig) - modify_vcf(args.vcf_in,args.vcf_out,args.step_size,contig) + modify_vcf(args.vcf_in, args.vcf_out, args.step_size, contig) + if __name__ == "__main__": main() - diff --git a/dockerfiles/rdpesr/add_RD_to_SVs.py b/dockerfiles/rdpesr/add_RD_to_SVs.py index 1c381f6a9..15d2d7722 100755 --- a/dockerfiles/rdpesr/add_RD_to_SVs.py +++ b/dockerfiles/rdpesr/add_RD_to_SVs.py @@ -1,63 +1,75 @@ -#script to add cov to SVs - -def add_ILL_cov(pb_uni_svs,bincov): - for i in pb_uni_svs.keys(): - for j in pb_uni_svs[i]: - cov_list=cov_SV_readin(j, bincov) - if len(cov_list)>0: - j+=[len(cov_list),np.median(cov_list), np.mean(cov_list),np.std(cov_list)] - else: - j+=[0, 'nan', 'nan', 'nan'] - #print(j) - return pb_uni_svs +# script to add cov to SVs + +import os +import argparse +import numpy as np + + +def add_ILL_cov(pb_uni_svs, bincov): + for i in pb_uni_svs.keys(): + for j in pb_uni_svs[i]: + cov_list = cov_SV_readin(j, bincov) + if len(cov_list) > 0: + j += [len(cov_list), np.median(cov_list), + np.mean(cov_list), np.std(cov_list)] + else: + j += [0, 'nan', 'nan', 'nan'] + # print(j) + return pb_uni_svs + def bed_info_readin(input): - fin=open(input) - out={} - for line in fin: - pin=line.strip().split() - if pin[0][0]=='#': continue - if not pin[0] in out.keys(): - out[pin[0]]=[] - out[pin[0]].append([pin[0],int(pin[1]),int(pin[2])]+pin[3:]) - fin.close() - return out + fin = open(input) + out = {} + for line in fin: + pin = line.strip().split() + if pin[0][0] == '#': + continue + if not pin[0] in out.keys(): + out[pin[0]] = [] + out[pin[0]].append([pin[0], int(pin[1]), int(pin[2])] + pin[3:]) + fin.close() + return out + def cov_SV_readin(svpos, bincov): - fin=os.popen(r'''tabix %s %s:%d-%d'''%(bincov, svpos[0],svpos[1],svpos[2])) - normCov_list=[] - for line in fin: - pin=line.strip().split() - normCov_list.append(float(pin[-1])) - fin.close() - return normCov_list + fin = os.popen(r'''tabix %s %s:%d-%d''' % + (bincov, svpos[0], svpos[1], svpos[2])) + normCov_list = [] + for line in fin: + pin = line.strip().split() + normCov_list.append(float(pin[-1])) + fin.close() + return normCov_list + def path_modify(path): 
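add_RD_to_SVs.py (the diff above, continuing below) annotates each interval of the input BED with read-depth evidence: cov_SV_readin() pulls the matching rows from the tabix-indexed bincov file and keeps the last column as normalized coverage, and add_ILL_cov() then appends four values per SV. A small sketch of the appended columns, using a made-up coverage vector:

# Hypothetical example of the four columns appended to each SV record:
# number of bins, median, mean, and standard deviation of normalized coverage.
import numpy as np

cov_list = [0.51, 0.48, 0.55, 0.50]     # made-up per-bin normalized coverage values
sv = ["chr1", 1000, 5000, "DEL_example"]
if len(cov_list) > 0:
    sv += [len(cov_list), np.median(cov_list), np.mean(cov_list), np.std(cov_list)]
else:
    sv += [0, "nan", "nan", "nan"]      # same placeholder the script writes
print(sv)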
- if not path[-1]=='/': - path+='/' - return path + if not path[-1] == '/': + path += '/' + return path -def write_output(output,pb_uni_svs): - fo=open(output,'w') - for k1 in pb_uni_svs.keys(): - for k2 in pb_uni_svs[k1]: - print('\t'.join([str(i) for i in k2]),file=fo) - fo.close() -def main(): - parser = argparse.ArgumentParser(description='S2a.calcu.Seq_Cov.of.PB_Uni.py') - parser.add_argument('input', help='name of input file containing PacBio unique SVs in bed format') - parser.add_argument('bincov',help='name of bincov metrics of the sample to be processed') - parser.add_argument('output',help='name of bincov metrics of the sample to be processed') - args = parser.parse_args() - pb_uni_svs=bed_info_readin(args.input) - pb_uni_svs=add_ILL_cov(pb_uni_svs,args.bincov) - write_output(args.output,pb_uni_svs) +def write_output(output, pb_uni_svs): + fo = open(output, 'w') + for k1 in pb_uni_svs.keys(): + for k2 in pb_uni_svs[k1]: + print('\t'.join([str(i) for i in k2]), file=fo) + fo.close() -import os -import numpy as np -import argparse -main() +def main(): + parser = argparse.ArgumentParser( + description='S2a.calcu.Seq_Cov.of.PB_Uni.py') + parser.add_argument( + 'input', help='name of input file containing PacBio unique SVs in bed format') + parser.add_argument( + 'bincov', help='name of bincov metrics of the sample to be processed') + parser.add_argument( + 'output', help='name of bincov metrics of the sample to be processed') + args = parser.parse_args() + pb_uni_svs = bed_info_readin(args.input) + pb_uni_svs = add_ILL_cov(pb_uni_svs, args.bincov) + write_output(args.output, pb_uni_svs) +main() diff --git a/dockerfiles/rdpesr/add_SR_PE_to_PB_INS.V2.py b/dockerfiles/rdpesr/add_SR_PE_to_PB_INS.V2.py index 860f14f6c..a0c8c4a43 100755 --- a/dockerfiles/rdpesr/add_SR_PE_to_PB_INS.V2.py +++ b/dockerfiles/rdpesr/add_SR_PE_to_PB_INS.V2.py @@ -1,126 +1,147 @@ +import os + + def INS_readin(filein): - fin=open(filein) - out=[] + fin = open(filein) + out = [] for line in fin: - pin=line.strip().split() - if pin[0][0]=='#': continue - #if pin[4]=='INS': + pin = line.strip().split() + if pin[0][0] == '#': + continue + # if pin[4]=='INS': out.append(pin) fin.close() return out -def add_Num_SR_le(sr_index,info, flank_length=100): - #eg of info: ['chr1', '137221', '137339', 'HOM', 'INS'] - fin=os.popen(r'''tabix %s %s:%d-%d'''%(sr_index, info[0],int(info[1])-flank_length, int(info[1])+flank_length)) - tmp=[] + +def add_Num_SR_le(sr_index, info, flank_length=100): + # eg of info: ['chr1', '137221', '137339', 'HOM', 'INS'] + fin = os.popen(r'''tabix %s %s:%d-%d''' % (sr_index, + info[0], int(info[1]) - flank_length, int(info[1]) + flank_length)) + tmp = [] for line in fin: - pin=line.strip().split() + pin = line.strip().split() tmp.append(pin) fin.close() - if len(tmp)==0: + if len(tmp) == 0: return 0 else: return max([int(i[3]) for i in tmp]) -def add_Num_SR_ri(sr_index,info, flank_length=100): - #eg of info: ['chr1', '137221', '137339', 'HOM', 'INS'] - fin=os.popen(r'''tabix %s %s:%d-%d'''%(sr_index, info[0],int(info[2])-flank_length, int(info[2])+flank_length)) - tmp=[] + +def add_Num_SR_ri(sr_index, info, flank_length=100): + # eg of info: ['chr1', '137221', '137339', 'HOM', 'INS'] + fin = os.popen(r'''tabix %s %s:%d-%d''' % (sr_index, + info[0], int(info[2]) - flank_length, int(info[2]) + flank_length)) + tmp = [] for line in fin: - pin=line.strip().split() + pin = line.strip().split() tmp.append(pin) fin.close() - if len(tmp)==0: + if len(tmp) == 0: return 0 else: return max([int(i[3]) for i in 
tmp]) + def add_Num_PE_le(pe_index, info, flank_length=300): - fin=os.popen(r'''tabix %s %s:%d-%d'''%(pe_index, info[0],int(info[1])-2*flank_length, int(info[1])+flank_length)) - tmp=[] + fin = os.popen(r'''tabix %s %s:%d-%d''' % (pe_index, + info[0], int(info[1]) - 2 * flank_length, int(info[1]) + flank_length)) + tmp = [] for line in fin: - pin=line.strip().split() - if 'INS' in pin[4] or pin[4] in ['INS','ALU','LINE1','SVA']: + pin = line.strip().split() + if 'INS' in pin[4] or pin[4] in ['INS', 'ALU', 'LINE1', 'SVA']: tmp.append(pin) else: - if pin[0]==pin[3]: - if abs(int(pin[4])-int(pin[1]))>100*(int(info[2])-int(info[1])): continue - else: tmp.append(pin) + if pin[0] == pin[3]: + if abs(int(pin[4]) - int(pin[1])) > 100 * (int(info[2]) - int(info[1])): + continue + else: + tmp.append(pin) fin.close() - #if len(tmp)==0: + # if len(tmp)==0: # return 0 - #else: + # else: # cluster_hash= cluster_pe_mate(tmp) # return cluster_hash[0] return len(tmp) + def add_Num_PE_ri(pe_index, info, flank_length=300): - fin=os.popen(r'''tabix %s %s:%d-%d'''%(pe_index, info[0],int(info[2])-flank_length, int(info[2])+2*flank_length)) - tmp=[] + fin = os.popen(r'''tabix %s %s:%d-%d''' % (pe_index, + info[0], int(info[2]) - flank_length, int(info[2]) + 2 * flank_length)) + tmp = [] for line in fin: - pin=line.strip().split() - if 'INS' in pin[4] or pin[4] in ['INS','ALU','LINE1','SVA']: + pin = line.strip().split() + if 'INS' in pin[4] or pin[4] in ['INS', 'ALU', 'LINE1', 'SVA']: tmp.append(pin) else: - if pin[0]==pin[3]: - if abs(int(pin[4])-int(pin[1]))>100*(int(info[2])-int(info[1])): continue - else: tmp.append(pin) + if pin[0] == pin[3]: + if abs(int(pin[4]) - int(pin[1])) > 100 * (int(info[2]) - int(info[1])): + continue + else: + tmp.append(pin) fin.close() - #if len(tmp)==0: + # if len(tmp)==0: # return 0 - #else: + # else: # cluster_hash= cluster_pe_mate(tmp) # return cluster_hash[0] return len(tmp) + def cluster_pe_mate(tmp): - out={} + out = {} for i in tmp: if not i[3] in out.keys(): - out[i[3]]=[] + out[i[3]] = [] out[i[3]].append(int(i[4])) - key_name=[i for i in out.keys()] - key_lengh=[len(out[i]) for i in key_name] + key_name = [i for i in out.keys()] + key_lengh = [len(out[i]) for i in key_name] most_abundant = key_name[key_lengh.index(max(key_lengh))] - return [most_abundant,sorted(out[most_abundant])] + return [most_abundant, sorted(out[most_abundant])] + def write_Num_SR(info_list, fileout): - fo=open(fileout, 'w') + fo = open(fileout, 'w') for i in info_list: print('\t'.join([str(j) for j in i]), file=fo) fo.close() + def main(): import argparse parser = argparse.ArgumentParser("add_SR_PE_to_PB_INS.py") - parser.add_argument('PB_bed', type=str, help='name of input PacBio bed file') - parser.add_argument('pe_file', type=str, help='name of pe files with index') - parser.add_argument('sr_file', type=str, help='name of sr files with index') - parser.add_argument('output', type=str, help='name of output files with index') + parser.add_argument('PB_bed', type=str, + help='name of input PacBio bed file') + parser.add_argument('pe_file', type=str, + help='name of pe files with index') + parser.add_argument('sr_file', type=str, + help='name of sr files with index') + parser.add_argument('output', type=str, + help='name of output files with index') args = parser.parse_args() - import os filein = args.PB_bed pe_index = args.pe_file sr_index = args.sr_file - info_list=INS_readin(filein) + info_list = INS_readin(filein) for i in info_list: - i+=[add_Num_PE_le(pe_index,i)] - 
i+=[add_Num_PE_ri(pe_index,i)] - if i[4]=='INS' or i[4]=='MEI': - i+=[add_Num_SR_le(sr_index,i,50)] - i+=[add_Num_SR_ri(sr_index,i,50)] + i += [add_Num_PE_le(pe_index, i)] + i += [add_Num_PE_ri(pe_index, i)] + if i[4] == 'INS' or i[4] == 'MEI': + i += [add_Num_SR_le(sr_index, i, 50)] + i += [add_Num_SR_ri(sr_index, i, 50)] else: - if int(i[5])<300: - i+=[add_Num_SR_le(sr_index,i,int(i[5])/2)] - i+=[add_Num_SR_ri(sr_index,i,int(i[5])/2)] - else: - i+=[add_Num_SR_le(sr_index,i,150)] - i+=[add_Num_SR_ri(sr_index,i,150)] - i+=[add_Num_SR_le(sr_index,i,0)] - i+=[add_Num_SR_ri(sr_index,i,0)] + if int(i[5]) < 300: + i += [add_Num_SR_le(sr_index, i, int(i[5]) / 2)] + i += [add_Num_SR_ri(sr_index, i, int(i[5]) / 2)] + else: + i += [add_Num_SR_le(sr_index, i, 150)] + i += [add_Num_SR_ri(sr_index, i, 150)] + i += [add_Num_SR_le(sr_index, i, 0)] + i += [add_Num_SR_ri(sr_index, i, 0)] write_Num_SR(info_list, args.output) -import os if __name__ == '__main__': - main() + main() diff --git a/dockerfiles/rdpesr/add_SR_PE_to_breakpoints.py b/dockerfiles/rdpesr/add_SR_PE_to_breakpoints.py index b26ff646e..9dbe207d5 100755 --- a/dockerfiles/rdpesr/add_SR_PE_to_breakpoints.py +++ b/dockerfiles/rdpesr/add_SR_PE_to_breakpoints.py @@ -1,69 +1,75 @@ +import os + + def INS_readin(filein): - fin=open(filein) - out=[] + fin = open(filein) + out = [] for line in fin: - pin=line.strip().split() - #if pin[4]=='INS': + pin = line.strip().split() + # if pin[4]=='INS': out.append(pin) fin.close() return out -def add_Num_SR(sr_index,info, flank_length=50): - #eg of info: ['chr1', '137221', '137339', 'HOM', 'INS'] - fin=os.popen(r'''tabix %s %s:%d-%d'''%(sr_index, info[0],int(info[1])-flank_length, int(info[1])+flank_length)) - tmp=[] + +def add_Num_SR(sr_index, info, flank_length=50): + # eg of info: ['chr1', '137221', '137339', 'HOM', 'INS'] + fin = os.popen(r'''tabix %s %s:%d-%d''' % (sr_index, + info[0], int(info[1]) - flank_length, int(info[1]) + flank_length)) + tmp = [] for line in fin: - pin=line.strip().split() + pin = line.strip().split() tmp.append(pin) fin.close() - if len(tmp)==0: + if len(tmp) == 0: return 0 else: return max([int(i[3]) for i in tmp]) + def add_Num_PE(pe_index, info, flank_length=100): - fin=os.popen(r'''tabix %s %s:%d-%d'''%(pe_index, info[0],int(info[1])-flank_length, int(info[1])+flank_length)) - tmp=[] + fin = os.popen(r'''tabix %s %s:%d-%d''' % (pe_index, + info[0], int(info[1]) - flank_length, int(info[1]) + flank_length)) + tmp = [] for line in fin: - pin=line.strip().split() - #if int(pin[2])-int(pin[1])>100*(int(info[2])-int(info[1])): continue + pin = line.strip().split() + # if int(pin[2])-int(pin[1])>100*(int(info[2])-int(info[1])): continue tmp.append(pin) fin.close() - #if len(tmp)==0: + # if len(tmp)==0: # return 0 - #else: + # else: # cluster_hash= cluster_pe_mate(tmp) # return cluster_hash[0] return len(tmp) + def write_Num_SR(info_list, fileout): - fo=open(fileout, 'w') + fo = open(fileout, 'w') for i in info_list: print('\t'.join([str(j) for j in i]), file=fo) fo.close() + def main(): import argparse parser = argparse.ArgumentParser("add_SR_PE_to_PB_INS.py") - parser.add_argument('PB_bed', type=str, help='name of input PacBio bed file') - parser.add_argument('pe_file', type=str, help='name of pe files with index') - #parser.add_argument('sr_file', type=str, help='name of sr files with index') + parser.add_argument('PB_bed', type=str, + help='name of input PacBio bed file') + parser.add_argument('pe_file', type=str, + help='name of pe files with index') + # 
parser.add_argument('sr_file', type=str, help='name of sr files with index') args = parser.parse_args() - import os filein = args.PB_bed pe_index = args.pe_file - #sr_index = args.sr_file - info_list=INS_readin(filein) + # sr_index = args.sr_file + info_list = INS_readin(filein) for i in info_list: - #i+=[add_Num_SR(sr_index,i)] - i+=[add_Num_PE(pe_index,i)] + # i+=[add_Num_SR(sr_index,i)] + i += [add_Num_PE(pe_index, i)] print(i) - write_Num_SR(info_list, filein+'.with_INS_PE') + write_Num_SR(info_list, filein + '.with_INS_PE') -import os if __name__ == '__main__': - main() - - - + main() diff --git a/dockerfiles/rdpesr/calcu_inheri_stat.py b/dockerfiles/rdpesr/calcu_inheri_stat.py index 0c2e9fb74..3ccba3afd 100755 --- a/dockerfiles/rdpesr/calcu_inheri_stat.py +++ b/dockerfiles/rdpesr/calcu_inheri_stat.py @@ -1,98 +1,108 @@ -def calcu_inheri_hash(vcf_file,fam_file): - fam_info = trio_info_readin(fam_file) - fvcf = pysam.VariantFile(vcf_file) - inheri_hash={} - for child in fam_info.keys(): - inheri_hash[child] = [] - for record in fvcf: - print(record.id) - for child in fam_info.keys(): - trio = fam_info[child]+[child] - trio_len = sum([1 if i in record.samples.keys() else 0 for i in trio]) - if trio_len==3: - gt = [record.samples[i]['GT'] for i in trio] - if (None, None) in gt: continue - if gt == [(0, 0), (0, 0), (0, 0)]: continue - else: - if gt[1]==(0, 0) and not gt[0]==(0, 0) and not gt[2]==(0, 0): - inheri_hash[child].append(['fa_pb',record.info['SVTYPE']]) - if gt[0]==(0, 0) and not gt[1]==(0, 0) and not gt[2]==(0, 0): - inheri_hash[child].append(['mo_pb',record.info['SVTYPE']]) - if not gt[0]==(0, 0) and not gt[1]==(0, 0) and not gt[2]==(0, 0): - inheri_hash[child].append(['fa_mo_pb',record.info['SVTYPE']]) - if gt[0]==(0, 0) and gt[1]==(0, 0) and not gt[2]==(0, 0): - inheri_hash[child].append(['denovo',record.info['SVTYPE']]) - fvcf.close() - return inheri_hash +import pysam +import argparse + + +def calcu_inheri_hash(vcf_file, fam_file): + fam_info = trio_info_readin(fam_file) + fvcf = pysam.VariantFile(vcf_file) + inheri_hash = {} + for child in fam_info.keys(): + inheri_hash[child] = [] + for record in fvcf: + print(record.id) + for child in fam_info.keys(): + trio = fam_info[child] + [child] + trio_len = sum( + [1 if i in record.samples.keys() else 0 for i in trio]) + if trio_len == 3: + gt = [record.samples[i]['GT'] for i in trio] + if (None, None) in gt: + continue + if gt == [(0, 0), (0, 0), (0, 0)]: + continue + else: + if gt[1] == (0, 0) and not gt[0] == (0, 0) and not gt[2] == (0, 0): + inheri_hash[child].append( + ['fa_pb', record.info['SVTYPE']]) + if gt[0] == (0, 0) and not gt[1] == (0, 0) and not gt[2] == (0, 0): + inheri_hash[child].append( + ['mo_pb', record.info['SVTYPE']]) + if not gt[0] == (0, 0) and not gt[1] == (0, 0) and not gt[2] == (0, 0): + inheri_hash[child].append( + ['fa_mo_pb', record.info['SVTYPE']]) + if gt[0] == (0, 0) and gt[1] == (0, 0) and not gt[2] == (0, 0): + inheri_hash[child].append( + ['denovo', record.info['SVTYPE']]) + fvcf.close() + return inheri_hash + def inheri_hash_to_stat(inheri_hash): - inheri_stat={} - for child in inheri_hash.keys(): - inheri_stat[child] = {} - for rec in inheri_hash[child]: - if not rec[1] in inheri_stat[child].keys(): - inheri_stat[child][rec[1]]={} - if not rec[0] in inheri_stat[child][rec[1]].keys(): - inheri_stat[child][rec[1]][rec[0]]=0 - inheri_stat[child][rec[1]][rec[0]]+=1 - return inheri_stat + inheri_stat = {} + for child in inheri_hash.keys(): + inheri_stat[child] = {} + for rec in 
inheri_hash[child]: + if not rec[1] in inheri_stat[child].keys(): + inheri_stat[child][rec[1]] = {} + if not rec[0] in inheri_stat[child][rec[1]].keys(): + inheri_stat[child][rec[1]][rec[0]] = 0 + inheri_stat[child][rec[1]][rec[0]] += 1 + return inheri_stat + def trio_info_readin(fam_file): - fam_info = {} - fin=open(fam_file) - for line in fin: - pin=line.strip().split() - if pin[2]=='0' and pin[3]=='0': continue - if not pin[1] in fam_info.keys(): - fam_info[pin[1]]=pin[2:4] - fin.close() - return fam_info + fam_info = {} + fin = open(fam_file) + for line in fin: + pin = line.strip().split() + if pin[2] == '0' and pin[3] == '0': + continue + if not pin[1] in fam_info.keys(): + fam_info[pin[1]] = pin[2:4] + fin.close() + return fam_info + def unique_list(list): - out=[] - for i in list: - if not i in out: - out.append(i) - return out + out = [] + for i in list: + if i not in out: + out.append(i) + return out -def write_output_stat(fileout,inheri_stat): - fo=open(fileout,'w') - print('\t'.join(['sample', 'svtype', 'fa_mo_pb','fa_pb','mo_pb','denovo']), file=fo) - for samp in inheri_stat.keys(): - for svt in inheri_stat[samp].keys(): - tmp = [] - for inh in ['fa_mo_pb','fa_pb','mo_pb','denovo']: - if inh in inheri_stat[samp][svt].keys(): - tmp.append(inheri_stat[samp][svt][inh]) - else: - tmp.append(0) - print('\t'.join([str(i) for i in [samp, svt]+tmp]), file=fo) - fo.close() +def write_output_stat(fileout, inheri_stat): + fo = open(fileout, 'w') + print('\t'.join(['sample', 'svtype', 'fa_mo_pb', + 'fa_pb', 'mo_pb', 'denovo']), file=fo) + for samp in inheri_stat.keys(): + for svt in inheri_stat[samp].keys(): + tmp = [] + for inh in ['fa_mo_pb', 'fa_pb', 'mo_pb', 'denovo']: + if inh in inheri_stat[samp][svt].keys(): + tmp.append(inheri_stat[samp][svt][inh]) + else: + tmp.append(0) + print('\t'.join([str(i) for i in [samp, svt] + tmp]), file=fo) + fo.close() -import os -import argparse -import pysam def main(): - import argparse - parser = argparse.ArgumentParser("GATK-SV.S1.vcf2bed.py") - parser.add_argument('fam_file', type=str, help='fam / ped file') - parser.add_argument('vcf_file', type=str, help='vcf file') - parser.add_argument('inheri_stat', type=str, help='name of output stat') - args = parser.parse_args() - #read_write_basic_vcf(args.vcfname,args.bedname) - fam_file = args.fam_file - vcf_file = args.vcf_file - fileout = args.inheri_stat - ##readin fam information - ## only complete trios would be read in here - inheri_hash = calcu_inheri_hash(vcf_file,fam_file) - inheri_stat=inheri_hash_to_stat(inheri_hash) - write_output_stat(fileout,inheri_stat) - -import os -if __name__ == '__main__': - main() + parser = argparse.ArgumentParser("GATK-SV.S1.vcf2bed.py") + parser.add_argument('fam_file', type=str, help='fam / ped file') + parser.add_argument('vcf_file', type=str, help='vcf file') + parser.add_argument('inheri_stat', type=str, help='name of output stat') + args = parser.parse_args() + # read_write_basic_vcf(args.vcfname,args.bedname) + fam_file = args.fam_file + vcf_file = args.vcf_file + fileout = args.inheri_stat + # readin fam information + # only complete trios would be read in here + inheri_hash = calcu_inheri_hash(vcf_file, fam_file) + inheri_stat = inheri_hash_to_stat(inheri_hash) + write_output_stat(fileout, inheri_stat) +if __name__ == '__main__': + main() diff --git a/dockerfiles/str/Dockerfile b/dockerfiles/str/Dockerfile new file mode 100644 index 000000000..3cea38b1c --- /dev/null +++ b/dockerfiles/str/Dockerfile @@ -0,0 +1,63 @@ +# This docker image contains the 
following +# list of tools and their dependencies: +# - GangSTR +# - TRTools +# - ExpansionHunter + +FROM ubuntu:20.04 + +RUN apt-get update && DEBIAN_FRONTEND="noninteractive" apt-get install --no-install-recommends -qqy \ + python3-dev \ + python3-pip \ + python \ + python-dev \ + awscli \ + build-essential \ + git \ + libbz2-dev \ + liblzma-dev \ + make \ + pkg-config \ + wget \ + unzip \ + zlib1g-dev + +RUN pip3 install pybedtools==0.8.2 pyvcf==0.6.8 scipy==1.7.1 numpy==1.21.1 + +# Install samtools (needed to index reference fasta files) +RUN wget -O samtools-1.9.tar.bz2 https://github.com/samtools/samtools/releases/download/1.9/samtools-1.9.tar.bz2 \ + && tar -xjf samtools-1.9.tar.bz2 \ + && cd samtools-1.9 \ + && ./configure --without-curses && make && make install \ + && cd .. + +# Install bedtools (needed for DumpSTR) +## Option 1: install from source +RUN wget -O bedtools-2.27.1.tar.gz https://github.com/arq5x/bedtools2/releases/download/v2.27.1/bedtools-2.27.1.tar.gz +RUN tar -xzvf bedtools-2.27.1.tar.gz +WORKDIR bedtools2 +RUN make && make install +WORKDIR .. +## Option 2: install from apt +#RUN apt-get install bedtools + +# Download, compile, and install GangSTR +RUN wget -O GangSTR-2.4.tar.gz https://github.com/gymreklab/GangSTR/releases/download/v2.4/GangSTR-2.4.tar.gz \ + && tar -xzvf GangSTR-2.4.tar.gz \ + && cd GangSTR-2.4 \ + && ./install-gangstr.sh \ + && ldconfig \ + && cd .. + +# Download and install TRTools +RUN git clone https://github.com/gymreklab/TRTools \ + && cd TRTools \ + && python3 setup.py install \ + && cd .. + +ENV EH_VERSION=v4.0.2 +RUN wget https://github.com/Illumina/ExpansionHunter/releases/download/${EH_VERSION}/ExpansionHunter-${EH_VERSION}-linux_x86_64.tar.gz \ + && tar xzf ExpansionHunter-${EH_VERSION}-linux_x86_64.tar.gz \ + && rm ExpansionHunter-${EH_VERSION}-linux_x86_64.tar.gz \ + && mv /ExpansionHunter-${EH_VERSION}-linux_x86_64 /ExpansionHunter +ENV PATH="/ExpansionHunter/bin/:$PATH" diff --git a/dockerfiles/sv-pipeline-base/Dockerfile b/dockerfiles/sv-pipeline-base/Dockerfile index a3d8db155..a83db01ba 100644 --- a/dockerfiles/sv-pipeline-base/Dockerfile +++ b/dockerfiles/sv-pipeline-base/Dockerfile @@ -64,7 +64,7 @@ ARG CONDA_DEP_TRANSIENT="make git wget" ARG CONDA_DEP="software-properties-common zlib1g-dev libbz2-dev liblzma-dev libcurl4-openssl-dev libssl-dev libblas-dev liblapack-dev libatlas-base-dev g++ gfortran ${CONDA_DEP_TRANSIENT}" # versions of bedtools > 2.27.0 seem to have lost the ability to read gzipped files # pandas 1.0.0 causes problem with bedtools in aggregate.py -ARG PYTHON_PKGS="wheel=0.34.2 bzip2=1.0.8 cython=0.29.14 numpy=1.18.1 pandas=0.25.3 scikit-learn=0.22.1 scipy=1.4.1 intervaltree=3.0.2 matplotlib=3.1.3 natsort=7.0.1 bedtools=2.27.0 pybedtools=0.8.1 pysam=0.14.1=py36_htslib1.7_0" +ARG PYTHON_PKGS="setuptools=52.0.0 wheel=0.34.2 bzip2=1.0.8 cython=0.29.14 numpy=1.18.1 pandas=0.25.3 scikit-learn=0.22.1 scipy=1.4.1 intervaltree=3.0.2 matplotlib=3.1.3 natsort=7.0.1 bedtools=2.27.0 pybedtools=0.8.1 pysam=0.14.1=py36_htslib1.7_0" ENV LANG=C.UTF-8 ENV LC_ALL=C.UTF-8 ARG CONDA_INSTALL_DIR="/opt/conda" diff --git a/scripts/cromwell/analyze_monitoring_logs.py b/scripts/cromwell/analyze_monitoring_logs.py index 194af4c18..dc0f34081 100644 --- a/scripts/cromwell/analyze_monitoring_logs.py +++ b/scripts/cromwell/analyze_monitoring_logs.py @@ -39,264 +39,282 @@ def write_data(data, file_path, header): - with open(file_path, 'w') as f: - f.write(header) - for key in data.index: - f.write(key + '\t' + '\t'.join([str(x) for x in 
data.loc(key)]) + '\n') + with open(file_path, 'w') as f: + f.write(header) + for key in data.index: + f.write(key + '\t' + '\t'.join([str(x) + for x in data.loc(key)]) + '\n') + def read_data(dir, overhead_min=0): - data = {} - for filepath in glob.glob(dir + '/*.monitoring.log'): - with open(filepath, 'r') as f: - mem_gb_data_f = [] - disk_gb_data_f = [] - mem_pct_data_f = [] - disk_pct_data_f = [] - cpu_pct_data_f = [] - total_mem = 0 - total_disk = 0 - total_cpu = 0 - start_time = None - end_time = None - for line in f: - tokens = line.strip().split(' ') - if start_time is None and line.startswith('['): - start_time = datetime.strptime(line.strip()[1:-1], TIME_FORMAT) - if line.startswith('['): - end_time = datetime.strptime(line.strip()[1:-1], TIME_FORMAT) - if line.startswith('Total Memory:'): - total_mem = float(tokens[2]) - elif line.startswith('#CPU:'): - total_cpu = float(tokens[1]) - elif line.startswith('Total Disk space:'): - total_disk = float(tokens[3]) - elif line.startswith('* Memory usage:'): - mem_gb = float(tokens[3]) - mem_pct = float(tokens[5][:-1]) / 100.0 - mem_gb_data_f.append(mem_gb) - mem_pct_data_f.append(mem_pct) - elif line.startswith('* Disk usage:'): - disk_gb = float(tokens[3]) - disk_pct = float(tokens[5][:-1]) / 100.0 - disk_gb_data_f.append(disk_gb) - disk_pct_data_f.append(disk_pct) - elif line.startswith('* CPU usage:'): - if len(tokens) == 4: - cpu_pct = float(tokens[3].replace("%","")) / 100.0 - else: - cpu_pct = 1 - cpu_pct_data_f.append(cpu_pct) - if len(mem_gb_data_f) > 0 and len(disk_gb_data_f) > 0: - filename = filepath.split('/')[-1] - entry = filename.replace(".monitoring.log", "") - task = entry.split('.')[0] - - max_mem_gb = max(mem_gb_data_f) - max_mem_pct = max(mem_pct_data_f) - max_disk_gb = max(disk_gb_data_f) - max_disk_pct = max(disk_pct_data_f) - max_cpu_pct = max(cpu_pct_data_f) - max_cpu = max_cpu_pct * total_cpu - - delta_time = end_time - start_time - delta_hour = (delta_time.total_seconds() / 3600.) 
+ (overhead_min / 60.0) - cpu_hour = total_cpu * delta_hour - mem_hour = total_mem * delta_hour - disk_hour = total_disk * delta_hour - max_cpu_hour = max_cpu_pct * total_cpu * delta_hour - max_mem_hour = max_mem_gb * delta_hour - max_disk_hour = max_disk_gb * delta_hour - - cost_mem = COST_PER_GB_MEM_HR * mem_hour - cost_mem_opt = COST_PER_GB_MEM_HR * max(max_mem_gb, MIN_MEM_GB) * delta_hour - - cost_disk = COST_PER_GB_DISK_HR * (total_disk + BOOT_DISK_GB) * delta_hour - cost_disk_opt = COST_PER_GB_DISK_HR * (max(max_disk_gb, MIN_DISK_GB) + BOOT_DISK_GB) * delta_hour - - cost_cpu = COST_CPU_HR * total_cpu * delta_hour - cost_cpu_opt = COST_CPU_HR * max(max_cpu, MIN_MEM_GB) * delta_hour - - data[entry] = { - "task": task, - "delta_hour": delta_hour, - "total_cpu": total_cpu, - "total_mem": total_mem, - "total_disk": total_disk, - "max_cpu" : max_cpu, - "max_cpu_pct": max_cpu_pct, - "max_mem_gb": max_mem_gb, - "max_mem_pct": max_mem_pct, - "max_disk_gb": max_disk_gb, - "max_disk_pct": max_disk_pct, - "cpu_hour" : cpu_hour, - "mem_hour": mem_hour, - "disk_hour": disk_hour, - "max_cpu_hour" : max_cpu_hour, - "max_mem_hour": max_mem_hour, - "max_disk_hour": max_disk_hour, - "cost_cpu": cost_cpu, - "cost_cpu_opt": cost_cpu_opt, - "cost_mem": cost_mem, - "cost_mem_opt": cost_mem_opt, - "cost_disk": cost_disk, - "cost_disk_opt": cost_disk_opt - } - return data + data = {} + for filepath in glob.glob(dir + '/*.monitoring.log'): + with open(filepath, 'r') as f: + mem_gb_data_f = [] + disk_gb_data_f = [] + mem_pct_data_f = [] + disk_pct_data_f = [] + cpu_pct_data_f = [] + total_mem = 0 + total_disk = 0 + total_cpu = 0 + start_time = None + end_time = None + for line in f: + tokens = line.strip().split(' ') + if start_time is None and line.startswith('['): + start_time = datetime.strptime( + line.strip()[1:-1], TIME_FORMAT) + if line.startswith('['): + end_time = datetime.strptime( + line.strip()[1:-1], TIME_FORMAT) + if line.startswith('Total Memory:'): + total_mem = float(tokens[2]) + elif line.startswith('#CPU:'): + total_cpu = float(tokens[1]) + elif line.startswith('Total Disk space:'): + total_disk = float(tokens[3]) + elif line.startswith('* Memory usage:'): + mem_gb = float(tokens[3]) + mem_pct = float(tokens[5][:-1]) / 100.0 + mem_gb_data_f.append(mem_gb) + mem_pct_data_f.append(mem_pct) + elif line.startswith('* Disk usage:'): + disk_gb = float(tokens[3]) + disk_pct = float(tokens[5][:-1]) / 100.0 + disk_gb_data_f.append(disk_gb) + disk_pct_data_f.append(disk_pct) + elif line.startswith('* CPU usage:'): + if len(tokens) == 4: + cpu_pct = float(tokens[3].replace("%", "")) / 100.0 + else: + cpu_pct = 1 + cpu_pct_data_f.append(cpu_pct) + if len(mem_gb_data_f) > 0 and len(disk_gb_data_f) > 0: + filename = filepath.split('/')[-1] + entry = filename.replace(".monitoring.log", "") + task = entry.split('.')[0] + + max_mem_gb = max(mem_gb_data_f) + max_mem_pct = max(mem_pct_data_f) + max_disk_gb = max(disk_gb_data_f) + max_disk_pct = max(disk_pct_data_f) + max_cpu_pct = max(cpu_pct_data_f) + max_cpu = max_cpu_pct * total_cpu + + delta_time = end_time - start_time + delta_hour = (delta_time.total_seconds() / + 3600.) 
+ (overhead_min / 60.0) + cpu_hour = total_cpu * delta_hour + mem_hour = total_mem * delta_hour + disk_hour = total_disk * delta_hour + max_cpu_hour = max_cpu_pct * total_cpu * delta_hour + max_mem_hour = max_mem_gb * delta_hour + max_disk_hour = max_disk_gb * delta_hour + + cost_mem = COST_PER_GB_MEM_HR * mem_hour + cost_mem_opt = COST_PER_GB_MEM_HR * \ + max(max_mem_gb, MIN_MEM_GB) * delta_hour + + cost_disk = COST_PER_GB_DISK_HR * \ + (total_disk + BOOT_DISK_GB) * delta_hour + cost_disk_opt = COST_PER_GB_DISK_HR * \ + (max(max_disk_gb, MIN_DISK_GB) + BOOT_DISK_GB) * delta_hour + + cost_cpu = COST_CPU_HR * total_cpu * delta_hour + cost_cpu_opt = COST_CPU_HR * \ + max(max_cpu, MIN_MEM_GB) * delta_hour + + data[entry] = { + "task": task, + "delta_hour": delta_hour, + "total_cpu": total_cpu, + "total_mem": total_mem, + "total_disk": total_disk, + "max_cpu": max_cpu, + "max_cpu_pct": max_cpu_pct, + "max_mem_gb": max_mem_gb, + "max_mem_pct": max_mem_pct, + "max_disk_gb": max_disk_gb, + "max_disk_pct": max_disk_pct, + "cpu_hour": cpu_hour, + "mem_hour": mem_hour, + "disk_hour": disk_hour, + "max_cpu_hour": max_cpu_hour, + "max_mem_hour": max_mem_hour, + "max_disk_hour": max_disk_hour, + "cost_cpu": cost_cpu, + "cost_cpu_opt": cost_cpu_opt, + "cost_mem": cost_mem, + "cost_mem_opt": cost_mem_opt, + "cost_disk": cost_disk, + "cost_disk_opt": cost_disk_opt + } + return data + def get_data_field(name, data): - return [x[name] for x in data] + return [x[name] for x in data] + def calc_group(data): - task_names = data.task.unique() - group_data = {} - for task in task_names: - d = data.loc[data['task'] == task] - hours = np.sum(d["delta_hour"]) - avg_cpu = np.mean(d["total_cpu"]) - avg_mem = np.mean(d["total_mem"]) - max_mem = np.max(d["max_mem_gb"]) - max_cpu = np.max(d["max_cpu"]) - max_cpu_pct = np.max(d["max_cpu_pct"]) - max_mem_pct = np.max(d["max_mem_pct"]) - avg_disk = np.mean(d["total_disk"]) - max_disk = np.max(d["max_disk_gb"]) - max_disk_pct = np.max(d["max_disk_pct"]) - cpu_hour = np.sum(d["cpu_hour"]) - mem_hour = np.sum(d["mem_hour"]) - disk_hour = np.sum(d["disk_hour"]) - max_cpu_hour = np.max(d["max_cpu_hour"]) - max_mem_hour = np.max(d["max_mem_hour"]) - max_disk_hour = np.max(d["max_disk_hour"]) - cost_cpu = np.sum(d["cost_cpu"]) - cost_cpu_dyn = np.sum(d["cost_cpu_opt"]) - cost_mem = np.sum(d["cost_mem"]) - cost_mem_dyn = np.sum(d["cost_mem_opt"]) - cost_disk = np.sum(d["cost_disk"]) - cost_disk_dyn = np.sum(d["cost_disk_opt"]) - - cost_cpu_static = COST_CPU_HR * max(max_cpu, MIN_CPU) * hours - cost_mem_static = COST_PER_GB_MEM_HR * max(max_mem, MIN_MEM_GB) * hours - cost_disk_static = COST_PER_GB_DISK_HR * (max(max_disk, MIN_DISK_GB) + BOOT_DISK_GB) * hours - - group_data[task] = { - "hours": hours, - "avg_cpu": avg_cpu, - "avg_mem": avg_mem, - "avg_disk": avg_disk, - "max_cpu": max_cpu, - "max_cpu_pct": max_cpu_pct, - "max_mem": max_mem, - "max_mem_pct": max_mem_pct, - "max_disk": max_disk, - "max_disk_pct": max_disk_pct, - "cpu_hour": cpu_hour, - "mem_hour": mem_hour, - "disk_hour": disk_hour, - "max_cpu_hour": max_cpu_hour, - "max_mem_hour": max_mem_hour, - "max_disk_hour": max_disk_hour, - "cost_cpu": cost_cpu, - "cost_cpu_static": cost_cpu_static, - "cost_cpu_dyn": cost_cpu_dyn, - "cost_mem": cost_mem, - "cost_mem_static": cost_mem_static, - "cost_mem_dyn": cost_mem_dyn, - "cost_disk": cost_disk, - "cost_disk_static": cost_disk_static, - "cost_disk_dyn": cost_disk_dyn, - "total_cost": cost_cpu + cost_mem + cost_disk, - "total_cost_static": cost_cpu_static + cost_mem_static + 
cost_disk_static, - "total_cost_dyn": cost_cpu_dyn + cost_mem_dyn + cost_disk_dyn - } - return group_data + task_names = data.task.unique() + group_data = {} + for task in task_names: + d = data.loc[data['task'] == task] + hours = np.sum(d["delta_hour"]) + avg_cpu = np.mean(d["total_cpu"]) + avg_mem = np.mean(d["total_mem"]) + max_mem = np.max(d["max_mem_gb"]) + max_cpu = np.max(d["max_cpu"]) + max_cpu_pct = np.max(d["max_cpu_pct"]) + max_mem_pct = np.max(d["max_mem_pct"]) + avg_disk = np.mean(d["total_disk"]) + max_disk = np.max(d["max_disk_gb"]) + max_disk_pct = np.max(d["max_disk_pct"]) + cpu_hour = np.sum(d["cpu_hour"]) + mem_hour = np.sum(d["mem_hour"]) + disk_hour = np.sum(d["disk_hour"]) + max_cpu_hour = np.max(d["max_cpu_hour"]) + max_mem_hour = np.max(d["max_mem_hour"]) + max_disk_hour = np.max(d["max_disk_hour"]) + cost_cpu = np.sum(d["cost_cpu"]) + cost_cpu_dyn = np.sum(d["cost_cpu_opt"]) + cost_mem = np.sum(d["cost_mem"]) + cost_mem_dyn = np.sum(d["cost_mem_opt"]) + cost_disk = np.sum(d["cost_disk"]) + cost_disk_dyn = np.sum(d["cost_disk_opt"]) + + cost_cpu_static = COST_CPU_HR * max(max_cpu, MIN_CPU) * hours + cost_mem_static = COST_PER_GB_MEM_HR * max(max_mem, MIN_MEM_GB) * hours + cost_disk_static = COST_PER_GB_DISK_HR * \ + (max(max_disk, MIN_DISK_GB) + BOOT_DISK_GB) * hours + + group_data[task] = { + "hours": hours, + "avg_cpu": avg_cpu, + "avg_mem": avg_mem, + "avg_disk": avg_disk, + "max_cpu": max_cpu, + "max_cpu_pct": max_cpu_pct, + "max_mem": max_mem, + "max_mem_pct": max_mem_pct, + "max_disk": max_disk, + "max_disk_pct": max_disk_pct, + "cpu_hour": cpu_hour, + "mem_hour": mem_hour, + "disk_hour": disk_hour, + "max_cpu_hour": max_cpu_hour, + "max_mem_hour": max_mem_hour, + "max_disk_hour": max_disk_hour, + "cost_cpu": cost_cpu, + "cost_cpu_static": cost_cpu_static, + "cost_cpu_dyn": cost_cpu_dyn, + "cost_mem": cost_mem, + "cost_mem_static": cost_mem_static, + "cost_mem_dyn": cost_mem_dyn, + "cost_disk": cost_disk, + "cost_disk_static": cost_disk_static, + "cost_disk_dyn": cost_disk_dyn, + "total_cost": cost_cpu + cost_mem + cost_disk, + "total_cost_static": cost_cpu_static + cost_mem_static + cost_disk_static, + "total_cost_dyn": cost_cpu_dyn + cost_mem_dyn + cost_disk_dyn + } + return group_data def do_simple_bar(data, xticks, path, bar_width=0.35, height=12, width=12, xtitle='', ytitle='', title='', bottom_adjust=0, legend=[], yscale='linear', sort_values=None): - num_groups = max([d.shape[0] for d in data]) - if sort_values is not None: - sort_indexes = np.flip(np.argsort(sort_values)) - else: - sort_indexes = np.arange(num_groups) - plt.figure(num=None, figsize=(width, height), dpi=100, facecolor='w', edgecolor='k') - for i in range(len(data)): - if i < len(legend): - label = legend[i] + num_groups = max([d.shape[0] for d in data]) + if sort_values is not None: + sort_indexes = np.flip(np.argsort(sort_values)) else: - label = "data" + str(i) - x = (np.arange(num_groups)*len(data) + i) * bar_width - plt.bar(x, data[i][sort_indexes], label=label) - x = (np.arange(num_groups)*len(data)) * bar_width - plt.xticks(x, [xticks[i] for i in sort_indexes], rotation='vertical') - plt.xlabel(xtitle) - plt.ylabel(ytitle) - plt.title(title) - plt.subplots_adjust(bottom=bottom_adjust) - plt.yscale(yscale) - plt.legend() - plt.savefig(path) + sort_indexes = np.arange(num_groups) + plt.figure(num=None, figsize=(width, height), + dpi=100, facecolor='w', edgecolor='k') + for i in range(len(data)): + if i < len(legend): + label = legend[i] + else: + label = "data" + str(i) + x = 
(np.arange(num_groups) * len(data) + i) * bar_width + plt.bar(x, data[i][sort_indexes], label=label) + x = (np.arange(num_groups) * len(data)) * bar_width + plt.xticks(x, [xticks[i] for i in sort_indexes], rotation='vertical') + plt.xlabel(xtitle) + plt.ylabel(ytitle) + plt.title(title) + plt.subplots_adjust(bottom=bottom_adjust) + plt.yscale(yscale) + plt.legend() + plt.savefig(path) def create_graphs(data, out_files_base, semilog=False, num_samples=None): - tasks = data.index - if num_samples is not None: - data = data / num_samples - ytitle = "Cost, $/sample" - title = "Estimated Cost Per Sample" - else: - ytitle = "Cost, $" - title = "Estimated Total Cost" - - if semilog: - yscale = "log" - else: - yscale = "linear" - - do_simple_bar(data=[data["total_cost"], data["total_cost_static"], data["total_cost_dyn"]], - xticks=tasks, - path=out_files_base + ".cost.png", - bar_width=1, - height=8, - width=12, - xtitle="Task", - ytitle=ytitle, - title=title, - bottom_adjust=0.35, - legend=["Current", "Unif", "Pred"], - yscale=yscale, - sort_values=data["total_cost"]) + tasks = data.index + if num_samples is not None: + data = data / num_samples + ytitle = "Cost, $/sample" + title = "Estimated Cost Per Sample" + else: + ytitle = "Cost, $" + title = "Estimated Total Cost" + + if semilog: + yscale = "log" + else: + yscale = "linear" + + do_simple_bar(data=[data["total_cost"], data["total_cost_static"], data["total_cost_dyn"]], + xticks=tasks, + path=out_files_base + ".cost.png", + bar_width=1, + height=8, + width=12, + xtitle="Task", + ytitle=ytitle, + title=title, + bottom_adjust=0.35, + legend=["Current", "Unif", "Pred"], + yscale=yscale, + sort_values=data["total_cost"]) # Main function def main(): - parser = argparse.ArgumentParser() - parser.add_argument("log_dir", help="Path containing monitoring script logs ending in \".monitoring.log\"") - parser.add_argument("output_file", help="Output tsv file base path") - parser.add_argument("--overhead", help="Localization overhead in minutes") - parser.add_argument("--semilog", help="Plot semilog y", action="store_true") - parser.add_argument("--plot-norm", help="Specify number of samples to normalize plots to per sample") - args = parser.parse_args() - - if not args.overhead: - overhead = DEFAULT_OVERHEAD_MIN - else: - overhead = float(args.overhead) - - if args.plot_norm: - plot_norm = int(args.plot_norm) - else: - plot_norm = None - - log_dir = args.log_dir - out_file = args.output_file - data = read_data(log_dir, overhead_min=overhead) - df = pd.DataFrame(data).T - group_data = calc_group(df) - group_df = pd.DataFrame(group_data).T - df.to_csv(path_or_buf=out_file + ".all.tsv", sep="\t") - group_df.to_csv(path_or_buf=out_file + ".grouped.tsv", sep="\t") - create_graphs(group_df, out_file, semilog=args.semilog, num_samples=plot_norm) - -if __name__== "__main__": - main() + parser = argparse.ArgumentParser() + parser.add_argument( + "log_dir", help="Path containing monitoring script logs ending in \".monitoring.log\"") + parser.add_argument("output_file", help="Output tsv file base path") + parser.add_argument("--overhead", help="Localization overhead in minutes") + parser.add_argument("--semilog", help="Plot semilog y", + action="store_true") + parser.add_argument( + "--plot-norm", help="Specify number of samples to normalize plots to per sample") + args = parser.parse_args() + + if not args.overhead: + overhead = DEFAULT_OVERHEAD_MIN + else: + overhead = float(args.overhead) + + if args.plot_norm: + plot_norm = int(args.plot_norm) + else: + 
plot_norm = None + + log_dir = args.log_dir + out_file = args.output_file + data = read_data(log_dir, overhead_min=overhead) + df = pd.DataFrame(data).T + group_data = calc_group(df) + group_df = pd.DataFrame(group_data).T + df.to_csv(path_or_buf=out_file + ".all.tsv", sep="\t") + group_df.to_csv(path_or_buf=out_file + ".grouped.tsv", sep="\t") + create_graphs(group_df, out_file, semilog=args.semilog, + num_samples=plot_norm) + + +if __name__ == "__main__": + main() diff --git a/scripts/cromwell/analyze_monitoring_logs2.py b/scripts/cromwell/analyze_monitoring_logs2.py index 4fa8ce59e..49603e3fd 100644 --- a/scripts/cromwell/analyze_monitoring_logs2.py +++ b/scripts/cromwell/analyze_monitoring_logs2.py @@ -42,249 +42,272 @@ def check_table_columns(columns): - column_set = set(columns) - required_input_columns = ['ElapsedTime', 'nCPU', 'CPU', 'TotMem', 'Mem', 'MemPct', 'TotDisk', 'Disk', - 'DiskPct', 'task'] - missing_cols = [] - missing = False - for col in required_input_columns: - if col not in column_set: - missing = True - missing_cols.append(col) - if missing: - raise RuntimeError("Malformed input table; missing column(s): %s. Use TSV from get_cromwell_resource_usage2.sh -u -r" % ", ".join(missing_cols)) + column_set = set(columns) + required_input_columns = ['ElapsedTime', 'nCPU', 'CPU', 'TotMem', 'Mem', 'MemPct', 'TotDisk', 'Disk', + 'DiskPct', 'task'] + missing_cols = [] + missing = False + for col in required_input_columns: + if col not in column_set: + missing = True + missing_cols.append(col) + if missing: + raise RuntimeError( + "Malformed input table; missing column(s): %s. Use TSV from get_cromwell_resource_usage2.sh -u -r" % ", ".join(missing_cols)) def load_data(log_file, overhead_mins): - # columns in input: - # ['ElapsedTime', 'nCPU', 'CPU', 'TotMem', 'Mem', 'MemPct', 'TotDisk', 'Disk', 'DiskPct', 'IORead', 'IOWrite', 'task'] - data = pd.read_table(log_file, usecols=lambda x: x not in ('IORead', 'IOWrite')) - check_table_columns(data.columns) - # rename some columns for consistency, clarity - data.rename({'task': 'Task', 'ElapsedTime': 'Hours', 'Mem': 'MaxMem', 'CPU': 'PctCPU', - 'Disk': 'MaxDisk', 'MemPct': 'PctMem', 'DiskPct': 'PctDisk'}, axis='columns', inplace=True) - # add MaxCPU column - data['MaxCPU'] = (data['PctCPU'] / 100) * data['nCPU'] - # reorder so Task column is first, MaxCPU is after nCPU (without assuming input order of columns) - cols = data.columns.tolist() - cols = [col for col in cols if col not in ('Task', 'MaxCPU')] - cpu_ind = cols.index('nCPU') - cols = ['Task'] + cols[:cpu_ind + 1] + ['MaxCPU'] + cols[cpu_ind + 1:] - data = data[cols] - # modify formats - data['Hours'] = pd.to_timedelta(data['Hours']).dt.total_seconds() / 3600.0 # convert ElapsedTime to hours (float) - data['Hours'] += overhead_mins / 60.0 - # keep last (most specific) task name, attempt number, and shard number, if present - data['Task'] = data['Task'].str.replace('/shard', '.shard', regex=False) \ - .str.replace('/attempt', '.attempt', regex=False) \ - .str.rsplit('/', n=1).str[-1] - - return data + # columns in input: + # ['ElapsedTime', 'nCPU', 'CPU', 'TotMem', 'Mem', 'MemPct', 'TotDisk', 'Disk', 'DiskPct', 'IORead', 'IOWrite', 'task'] + data = pd.read_table( + log_file, usecols=lambda x: x not in ('IORead', 'IOWrite')) + check_table_columns(data.columns) + # rename some columns for consistency, clarity + data.rename({'task': 'Task', 'ElapsedTime': 'Hours', 'Mem': 'MaxMem', 'CPU': 'PctCPU', + 'Disk': 'MaxDisk', 'MemPct': 'PctMem', 'DiskPct': 'PctDisk'}, axis='columns', 
inplace=True) + # add MaxCPU column + data['MaxCPU'] = (data['PctCPU'] / 100) * data['nCPU'] + # reorder so Task column is first, MaxCPU is after nCPU (without assuming input order of columns) + cols = data.columns.tolist() + cols = [col for col in cols if col not in ('Task', 'MaxCPU')] + cpu_ind = cols.index('nCPU') + cols = ['Task'] + cols[:cpu_ind + 1] + ['MaxCPU'] + cols[cpu_ind + 1:] + data = data[cols] + # modify formats + data['Hours'] = pd.to_timedelta(data['Hours']).dt.total_seconds( + ) / 3600.0 # convert ElapsedTime to hours (float) + data['Hours'] += overhead_mins / 60.0 + # keep last (most specific) task name, attempt number, and shard number, if present + data['Task'] = data['Task'].str.replace('/shard', '.shard', regex=False) \ + .str.replace('/attempt', '.attempt', regex=False) \ + .str.rsplit('/', n=1).str[-1] + + return data def estimate_costs_per_task(data): - # columns after load_data(): - # ['Hours', 'nCPU', 'MaxCPU', 'PctCPU', 'TotMem', 'MaxMem', 'PctMem', 'TotDisk', 'MaxDisk', 'PctDisk', 'Task'] - # compute resource-hours : actual and with optimal settings based on maximum usage - data['TotCPUHour'] = data['nCPU'] * data['Hours'] - data['MaxCPUHour'] = data['MaxCPU'] * data['Hours'] - data['TotMemHour'] = data['TotMem'] * data['Hours'] - data['MaxMemHour'] = data['MaxMem'] * data['Hours'] - data['TotDiskHour'] = data['TotDisk'] * data['Hours'] - data['MaxDiskHour'] = data['MaxDisk'] * data['Hours'] - - # compute cost estimates : actual and with optimal resource settings based on maximum usage (per-task, so dynamic) - data['TotCPUCost'] = data['TotCPUHour'] * COST_CPU_HR - data['OptCPUCost'] = np.multiply(np.fmax(data['MaxCPU'], MIN_CPU), data['Hours']) * COST_CPU_HR - data['TotMemCost'] = data['TotMemHour'] * COST_PER_GB_MEM_HR - data['OptMemCost'] = np.multiply(np.fmax(data['MaxMem'], MIN_MEM_GB), data['Hours']) * COST_PER_GB_MEM_HR - data['TotDiskCost'] = np.multiply((data['TotDisk'] + BOOT_DISK_GB), data['Hours']) * COST_PER_GB_DISK_HR - data['OptDiskCost'] = np.multiply((np.fmax(data['MaxDisk'], MIN_DISK_GB) + BOOT_DISK_GB), data['Hours']) * COST_PER_GB_DISK_HR - data['TotTaskCost'] = data['TotCPUCost'] + data['TotMemCost'] + data['TotDiskCost'] - data['OptTaskCost'] = data['OptCPUCost'] + data['OptMemCost'] + data['OptDiskCost'] - - data.sort_values(by='TotTaskCost', inplace=True, ascending=False) - return data + # columns after load_data(): + # ['Hours', 'nCPU', 'MaxCPU', 'PctCPU', 'TotMem', 'MaxMem', 'PctMem', 'TotDisk', 'MaxDisk', 'PctDisk', 'Task'] + # compute resource-hours : actual and with optimal settings based on maximum usage + data['TotCPUHour'] = data['nCPU'] * data['Hours'] + data['MaxCPUHour'] = data['MaxCPU'] * data['Hours'] + data['TotMemHour'] = data['TotMem'] * data['Hours'] + data['MaxMemHour'] = data['MaxMem'] * data['Hours'] + data['TotDiskHour'] = data['TotDisk'] * data['Hours'] + data['MaxDiskHour'] = data['MaxDisk'] * data['Hours'] + + # compute cost estimates : actual and with optimal resource settings based on maximum usage (per-task, so dynamic) + data['TotCPUCost'] = data['TotCPUHour'] * COST_CPU_HR + data['OptCPUCost'] = np.multiply( + np.fmax(data['MaxCPU'], MIN_CPU), data['Hours']) * COST_CPU_HR + data['TotMemCost'] = data['TotMemHour'] * COST_PER_GB_MEM_HR + data['OptMemCost'] = np.multiply( + np.fmax(data['MaxMem'], MIN_MEM_GB), data['Hours']) * COST_PER_GB_MEM_HR + data['TotDiskCost'] = np.multiply( + (data['TotDisk'] + BOOT_DISK_GB), data['Hours']) * COST_PER_GB_DISK_HR + data['OptDiskCost'] = np.multiply((np.fmax( + 
data['MaxDisk'], MIN_DISK_GB) + BOOT_DISK_GB), data['Hours']) * COST_PER_GB_DISK_HR + data['TotTaskCost'] = data['TotCPUCost'] + \ + data['TotMemCost'] + data['TotDiskCost'] + data['OptTaskCost'] = data['OptCPUCost'] + \ + data['OptMemCost'] + data['OptDiskCost'] + + data.sort_values(by='TotTaskCost', inplace=True, ascending=False) + return data def estimate_costs_per_group(data): - data['TaskGroup'] = data['Task'].str.split('.').str[0] # remove shard number, attempt number if present - groups = data['TaskGroup'].unique() - data_grouped = pd.DataFrame(columns=['Task', 'Hours', 'AvgCPU', 'MaxCPU', 'PctCPU', 'AvgMem', 'MaxMem', 'PctMem', - 'AvgDisk', 'MaxDisk', 'PctDisk', 'TotCPUHour', 'PeakCPUHour', 'TotMemHour', 'PeakMemHour', - 'TotDiskHour', 'PeakDiskHour', 'TotCPUCost', 'StaticCPUCost', 'DynCPUCost', 'TotMemCost', - 'StaticMemCost', 'DynMemCost', 'TotDiskCost', 'StaticDiskCost', 'DynDiskCost', 'TotCost', - 'StaticCost', 'DynCost']) - - for group in groups: - """ - columns of d: ['Task', 'Hours', 'nCPU', 'MaxCPU', 'PctCPU', 'TotMem', 'MaxMem', - 'PctMem', 'TotDisk', 'MaxDisk', 'PctDisk', 'TotCPUHour', 'MaxCPUHour', - 'TotMemHour', 'MaxMemHour', 'TotDiskHour', 'MaxDiskHour', 'TotCPUCost', - 'OptCPUCost', 'TotMemCost', 'OptMemCost', 'TotDiskCost', 'OptDiskCost', - 'TotTaskCost', 'OptTaskCost'] - """ - d = data.loc[data['TaskGroup'] == group] - hours = np.sum(d['Hours']) - max_cpu = np.nan if np.isnan(d['MaxCPU']).all() else np.max(d['MaxCPU']) - max_mem = np.nan if np.isnan(d['MaxMem']).all() else np.max(d['MaxMem']) - max_disk = np.nan if np.isnan(d['MaxDisk']).all() else np.max(d['MaxDisk']) - group_data = { - 'Task': group, - 'Hours': hours, - 'AvgCPU': np.mean(d['nCPU']), - 'AvgMem': np.mean(d['TotMem']), - 'AvgDisk': np.mean(d['TotDisk']), - 'MaxCPU': max_cpu, - 'MaxMem': max_mem, - 'MaxDisk': max_disk, - 'PctCPU': np.nan if np.isnan(d['PctCPU']).all() else np.nanmax(d['PctCPU']), - 'PctMem': np.nan if np.isnan(d['PctMem']).all() else np.nanmax(d['PctMem']), - 'PctDisk': np.nan if np.isnan(d['PctDisk']).all() else np.nanmax(d['PctDisk']), - 'TotCPUHour': np.sum(d['TotCPUHour']), - 'TotMemHour': np.sum(d['TotMemHour']), - 'TotDiskHour': np.sum(d['TotDiskHour']), - 'PeakCPUHour': np.nan if np.isnan(d['MaxCPUHour']).all() else np.nanmax(d['MaxCPUHour']), - 'PeakMemHour': np.nan if np.isnan(d['MaxMemHour']).all() else np.nanmax(d['MaxMemHour']), - 'PeakDiskHour': np.nan if np.isnan(d['MaxDiskHour']).all() else np.nanmax(d['MaxDiskHour']), - 'TotCPUCost': np.sum(d['TotCPUCost']), - 'TotMemCost': np.sum(d['TotMemCost']), - 'TotDiskCost': np.sum(d['TotDiskCost']), - 'DynCPUCost': np.sum(d['OptCPUCost']), - 'DynMemCost': np.sum(d['OptMemCost']), - 'DynDiskCost': np.sum(d['OptDiskCost']), - 'StaticCPUCost': COST_CPU_HR * np.nanmax((max_cpu, MIN_CPU)) * hours, - 'StaticMemCost': COST_PER_GB_MEM_HR * np.nanmax((max_mem, MIN_MEM_GB)) * hours, - 'StaticDiskCost': COST_PER_GB_DISK_HR * (np.nanmax((max_disk, MIN_DISK_GB)) + BOOT_DISK_GB) * hours - } - group_data['TotCost'] = sum((group_data['TotCPUCost'], group_data['TotMemCost'], group_data['TotDiskCost'])) - group_data['StaticCost'] = sum((group_data['StaticCPUCost'], group_data['StaticMemCost'], group_data['StaticDiskCost'])) - group_data['DynCost'] = sum((group_data['DynCPUCost'], group_data['DynMemCost'], group_data['DynDiskCost'])) - - data_grouped = data_grouped.append(group_data, ignore_index=True) - - data_grouped.sort_values(by='TotCost', inplace=True, ascending=False) - return data_grouped + # remove shard number, attempt number if 
present + data['TaskGroup'] = data['Task'].str.split('.').str[0] + groups = data['TaskGroup'].unique() + data_grouped = pd.DataFrame(columns=['Task', 'Hours', 'AvgCPU', 'MaxCPU', 'PctCPU', 'AvgMem', 'MaxMem', 'PctMem', + 'AvgDisk', 'MaxDisk', 'PctDisk', 'TotCPUHour', 'PeakCPUHour', 'TotMemHour', 'PeakMemHour', + 'TotDiskHour', 'PeakDiskHour', 'TotCPUCost', 'StaticCPUCost', 'DynCPUCost', 'TotMemCost', + 'StaticMemCost', 'DynMemCost', 'TotDiskCost', 'StaticDiskCost', 'DynDiskCost', 'TotCost', + 'StaticCost', 'DynCost']) + + for group in groups: + """ + columns of d: ['Task', 'Hours', 'nCPU', 'MaxCPU', 'PctCPU', 'TotMem', 'MaxMem', + 'PctMem', 'TotDisk', 'MaxDisk', 'PctDisk', 'TotCPUHour', 'MaxCPUHour', + 'TotMemHour', 'MaxMemHour', 'TotDiskHour', 'MaxDiskHour', 'TotCPUCost', + 'OptCPUCost', 'TotMemCost', 'OptMemCost', 'TotDiskCost', 'OptDiskCost', + 'TotTaskCost', 'OptTaskCost'] + """ + d = data.loc[data['TaskGroup'] == group] + hours = np.sum(d['Hours']) + max_cpu = np.nan if np.isnan( + d['MaxCPU']).all() else np.max(d['MaxCPU']) + max_mem = np.nan if np.isnan( + d['MaxMem']).all() else np.max(d['MaxMem']) + max_disk = np.nan if np.isnan( + d['MaxDisk']).all() else np.max(d['MaxDisk']) + group_data = { + 'Task': group, + 'Hours': hours, + 'AvgCPU': np.mean(d['nCPU']), + 'AvgMem': np.mean(d['TotMem']), + 'AvgDisk': np.mean(d['TotDisk']), + 'MaxCPU': max_cpu, + 'MaxMem': max_mem, + 'MaxDisk': max_disk, + 'PctCPU': np.nan if np.isnan(d['PctCPU']).all() else np.nanmax(d['PctCPU']), + 'PctMem': np.nan if np.isnan(d['PctMem']).all() else np.nanmax(d['PctMem']), + 'PctDisk': np.nan if np.isnan(d['PctDisk']).all() else np.nanmax(d['PctDisk']), + 'TotCPUHour': np.sum(d['TotCPUHour']), + 'TotMemHour': np.sum(d['TotMemHour']), + 'TotDiskHour': np.sum(d['TotDiskHour']), + 'PeakCPUHour': np.nan if np.isnan(d['MaxCPUHour']).all() else np.nanmax(d['MaxCPUHour']), + 'PeakMemHour': np.nan if np.isnan(d['MaxMemHour']).all() else np.nanmax(d['MaxMemHour']), + 'PeakDiskHour': np.nan if np.isnan(d['MaxDiskHour']).all() else np.nanmax(d['MaxDiskHour']), + 'TotCPUCost': np.sum(d['TotCPUCost']), + 'TotMemCost': np.sum(d['TotMemCost']), + 'TotDiskCost': np.sum(d['TotDiskCost']), + 'DynCPUCost': np.sum(d['OptCPUCost']), + 'DynMemCost': np.sum(d['OptMemCost']), + 'DynDiskCost': np.sum(d['OptDiskCost']), + 'StaticCPUCost': COST_CPU_HR * np.nanmax((max_cpu, MIN_CPU)) * hours, + 'StaticMemCost': COST_PER_GB_MEM_HR * np.nanmax((max_mem, MIN_MEM_GB)) * hours, + 'StaticDiskCost': COST_PER_GB_DISK_HR * (np.nanmax((max_disk, MIN_DISK_GB)) + BOOT_DISK_GB) * hours + } + group_data['TotCost'] = sum( + (group_data['TotCPUCost'], group_data['TotMemCost'], group_data['TotDiskCost'])) + group_data['StaticCost'] = sum( + (group_data['StaticCPUCost'], group_data['StaticMemCost'], group_data['StaticDiskCost'])) + group_data['DynCost'] = sum( + (group_data['DynCPUCost'], group_data['DynMemCost'], group_data['DynDiskCost'])) + + data_grouped = data_grouped.append(group_data, ignore_index=True) + + data_grouped.sort_values(by='TotCost', inplace=True, ascending=False) + return data_grouped def get_out_file_path(output_base, output_end): - sep = "." - if basename(output_base) == "": - sep = "" - out_file = output_base + sep + output_end - return out_file + sep = "." 
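Both analyze_monitoring_logs scripts price a task the same way: each resource is billed per hour at flat rates (the COST_CPU_HR, COST_PER_GB_MEM_HR, COST_PER_GB_DISK_HR, BOOT_DISK_GB, and MIN_* constants defined near the top of each script, outside these hunks). The "Opt"/"Dyn" costs re-price the run at the observed peak usage clamped to the minimums, and the "Static" costs re-price it at the per-task-group maximum. A worked sketch of the per-task estimate follows; the rates below are invented placeholders, not the constants the scripts actually define.

# Worked example of the per-task cost estimate with assumed rates.
COST_CPU_HR = 0.03            # $ per vCPU-hour (assumed)
COST_PER_GB_MEM_HR = 0.004    # $ per GB-hour of memory (assumed)
COST_PER_GB_DISK_HR = 0.0001  # $ per GB-hour of disk (assumed)
BOOT_DISK_GB = 10.0           # assumed boot-disk size

hours, n_cpu, tot_mem_gb, tot_disk_gb = 2.0, 4, 16.0, 100.0
cost_cpu = COST_CPU_HR * n_cpu * hours
cost_mem = COST_PER_GB_MEM_HR * tot_mem_gb * hours
cost_disk = COST_PER_GB_DISK_HR * (tot_disk_gb + BOOT_DISK_GB) * hours
total_cost = cost_cpu + cost_mem + cost_disk   # 0.24 + 0.128 + 0.022 = 0.39
print(round(total_cost, 3))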
+ if basename(output_base) == "": + sep = "" + out_file = output_base + sep + output_end + return out_file def write_data(data, out_file): - logging.info("Writing %s" % out_file) - data.to_csv(out_file, sep='\t', na_rep='NaN', index=False) + logging.info("Writing %s" % out_file) + data.to_csv(out_file, sep='\t', na_rep='NaN', index=False) def do_simple_bar(data, xticks, path, bar_width=0.35, height=12, width=12, xtitle='', ytitle='', title='', bottom_adjust=0, legend=[], yscale='linear', sort_values=None): - num_groups = max([d.shape[0] for d in data]) - if sort_values is not None: - sort_indexes = np.flip(np.argsort(sort_values)) - else: - sort_indexes = np.arange(num_groups) - plt.figure(num=None, figsize=(width, height), dpi=100, facecolor='w', edgecolor='k') - for i in range(len(data)): - if i < len(legend): - label = legend[i] + num_groups = max([d.shape[0] for d in data]) + if sort_values is not None: + sort_indexes = np.flip(np.argsort(sort_values)) else: - label = "data" + str(i) - x = (np.arange(num_groups) * len(data) + i) * bar_width - plt.bar(x, data[i][sort_indexes], label=label) - x = (np.arange(num_groups) * len(data)) * bar_width - plt.xticks(x, [xticks[i] for i in sort_indexes], rotation='vertical') - plt.xlabel(xtitle) - plt.ylabel(ytitle) - plt.title(title) - plt.subplots_adjust(bottom=bottom_adjust) - plt.yscale(yscale) - plt.legend() - plt.savefig(path) + sort_indexes = np.arange(num_groups) + plt.figure(num=None, figsize=(width, height), + dpi=100, facecolor='w', edgecolor='k') + for i in range(len(data)): + if i < len(legend): + label = legend[i] + else: + label = "data" + str(i) + x = (np.arange(num_groups) * len(data) + i) * bar_width + plt.bar(x, data[i][sort_indexes], label=label) + x = (np.arange(num_groups) * len(data)) * bar_width + plt.xticks(x, [xticks[i] for i in sort_indexes], rotation='vertical') + plt.xlabel(xtitle) + plt.ylabel(ytitle) + plt.title(title) + plt.subplots_adjust(bottom=bottom_adjust) + plt.yscale(yscale) + plt.legend() + plt.savefig(path) def create_graphs(data, out_file, semilog=False, num_samples=None): - logging.info("Writing %s" % out_file) - data = data.loc[data.notna().all(axis=1)] # drop rows with any NA values before making plot - data.reset_index(drop=True, inplace=True) - if num_samples is not None: - data = data / num_samples - ytitle = "Cost ($/sample)" - title = "Estimated Cost Per Sample" - else: - ytitle = "Cost ($)" - title = "Estimated Total Cost" - - if semilog: - yscale = "log" - else: - yscale = "linear" - - do_simple_bar(data=[data["TotCost"], data["StaticCost"], data["DynCost"]], - xticks=data['Task'], - path=out_file, - bar_width=1, - height=8, - width=12, - xtitle="Task", - ytitle=ytitle, - title=title, - bottom_adjust=0.35, - legend=["Current", "Uniform", "Dynamic"], - yscale=yscale, - sort_values=data["TotCost"]) + logging.info("Writing %s" % out_file) + # drop rows with any NA values before making plot + data = data.loc[data.notna().all(axis=1)] + data.reset_index(drop=True, inplace=True) + if num_samples is not None: + data = data / num_samples + ytitle = "Cost ($/sample)" + title = "Estimated Cost Per Sample" + else: + ytitle = "Cost ($)" + title = "Estimated Total Cost" + + if semilog: + yscale = "log" + else: + yscale = "linear" + + do_simple_bar(data=[data["TotCost"], data["StaticCost"], data["DynCost"]], + xticks=data['Task'], + path=out_file, + bar_width=1, + height=8, + width=12, + xtitle="Task", + ytitle=ytitle, + title=title, + bottom_adjust=0.35, + legend=["Current", "Uniform", "Dynamic"], + 
yscale=yscale, + sort_values=data["TotCost"]) def check_file_nonempty(f): - if not isfile(f): - raise RuntimeError("Required input file %s does not exist." % f) - elif getsize(f) == 0: - raise RuntimeError("Required input file %s is empty." % f) + if not isfile(f): + raise RuntimeError("Required input file %s does not exist." % f) + elif getsize(f) == 0: + raise RuntimeError("Required input file %s is empty." % f) # Main function def main(): - parser = argparse.ArgumentParser() - parser.add_argument("log_summary_file", help="Path to log summary TSV from get_cromwell_resource_usage2.sh -u -r") - parser.add_argument("output_base", help="Output tsv file base path") - parser.add_argument("--overhead", help="Localization overhead in minutes") - parser.add_argument("--semilog", help="Plot semilog y", action="store_true") - parser.add_argument("--plot-norm", help="Specify number of samples to normalize plots to per sample") - parser.add_argument("--log-level", - help="Specify level of logging information, ie. info, warning, error (not case-sensitive)", - required=False, default="INFO") - args = parser.parse_args() - - if not args.overhead: - overhead = DEFAULT_OVERHEAD_MIN - else: - overhead = float(args.overhead) - - if args.plot_norm: - plot_norm = int(args.plot_norm) - else: - plot_norm = None - - log_level = args.log_level - numeric_level = getattr(logging, log_level.upper(), None) - if not isinstance(numeric_level, int): - raise ValueError('Invalid log level: %s' % log_level) - logging.basicConfig(level=numeric_level, format='%(levelname)s: %(message)s') - - log_file, output_base = args.log_summary_file, args.output_base - check_file_nonempty(log_file) - - data = load_data(log_file, overhead) - data = estimate_costs_per_task(data) - write_data(data, get_out_file_path(output_base, "all.tsv")) - grouped_data = estimate_costs_per_group(data) - write_data(grouped_data, get_out_file_path(output_base, "grouped.tsv")) - create_graphs(grouped_data, get_out_file_path(output_base, "cost.png"), semilog=args.semilog, num_samples=plot_norm) + parser = argparse.ArgumentParser() + parser.add_argument( + "log_summary_file", help="Path to log summary TSV from get_cromwell_resource_usage2.sh -u -r") + parser.add_argument("output_base", help="Output tsv file base path") + parser.add_argument("--overhead", help="Localization overhead in minutes") + parser.add_argument("--semilog", help="Plot semilog y", + action="store_true") + parser.add_argument( + "--plot-norm", help="Specify number of samples to normalize plots to per sample") + parser.add_argument("--log-level", + help="Specify level of logging information, ie. 
info, warning, error (not case-sensitive)",
+                        required=False, default="INFO")
+    args = parser.parse_args()
+
+    if not args.overhead:
+        overhead = DEFAULT_OVERHEAD_MIN
+    else:
+        overhead = float(args.overhead)
+
+    if args.plot_norm:
+        plot_norm = int(args.plot_norm)
+    else:
+        plot_norm = None
+
+    log_level = args.log_level
+    numeric_level = getattr(logging, log_level.upper(), None)
+    if not isinstance(numeric_level, int):
+        raise ValueError('Invalid log level: %s' % log_level)
+    logging.basicConfig(level=numeric_level,
+                        format='%(levelname)s: %(message)s')
+
+    log_file, output_base = args.log_summary_file, args.output_base
+    check_file_nonempty(log_file)
+
+    data = load_data(log_file, overhead)
+    data = estimate_costs_per_task(data)
+    write_data(data, get_out_file_path(output_base, "all.tsv"))
+    grouped_data = estimate_costs_per_group(data)
+    write_data(grouped_data, get_out_file_path(output_base, "grouped.tsv"))
+    create_graphs(grouped_data, get_out_file_path(
+        output_base, "cost.png"), semilog=args.semilog, num_samples=plot_norm)


 if __name__ == "__main__":
-  main()
+    main()
diff --git a/scripts/cromwell/analyze_resource_acquisition.py b/scripts/cromwell/analyze_resource_acquisition.py
index 731b9962c..859e54466 100644
--- a/scripts/cromwell/analyze_resource_acquisition.py
+++ b/scripts/cromwell/analyze_resource_acquisition.py
@@ -43,459 +43,488 @@ def get_disk_info(metadata):
-  """
-  Modified from: https://github.com/broadinstitute/dsde-pipelines/blob/develop/scripts/calculate_cost.py
-  Modified to return (hdd_size, ssd_size)
-  """
-  if "runtimeAttributes" in metadata and "disks" in metadata['runtimeAttributes']:
-    boot_disk_gb = 0.0
-    if "bootDiskSizeGb" in metadata['runtimeAttributes']:
-      boot_disk_gb = float(metadata['runtimeAttributes']['bootDiskSizeGb'])
-    # Note - am lumping boot disk in with requested disk. Assuming boot disk is same type as requested.
-    # i.e. is it possible that boot disk is HDD when requested is SDD.
-    (name, disk_size, disk_type) = metadata['runtimeAttributes']["disks"].split()
-    if disk_type == "HDD":
-      return float(disk_size) + boot_disk_gb, float(0)
-    elif disk_type == "SSD":
-      return float(0), float(disk_size) + boot_disk_gb
+    """
+    Modified from: https://github.com/broadinstitute/dsde-pipelines/blob/develop/scripts/calculate_cost.py
+    Modified to return (hdd_size, ssd_size)
+    """
+    if "runtimeAttributes" in metadata and "disks" in metadata['runtimeAttributes']:
+        boot_disk_gb = 0.0
+        if "bootDiskSizeGb" in metadata['runtimeAttributes']:
+            boot_disk_gb = float(
+                metadata['runtimeAttributes']['bootDiskSizeGb'])
+        # Note - am lumping boot disk in with requested disk. Assuming boot disk is same type as requested.
+        # i.e. is it possible that boot disk is HDD when requested is SSD.
+ (name, disk_size, + disk_type) = metadata['runtimeAttributes']["disks"].split() + if disk_type == "HDD": + return float(disk_size) + boot_disk_gb, float(0) + elif disk_type == "SSD": + return float(0), float(disk_size) + boot_disk_gb + else: + return float(0), float(0) else: - return float(0), float(0) - else: - # we can't tell disk size in this case so just return nothing - return float(0), float(0) + # we can't tell disk size in this case so just return nothing + return float(0), float(0) def was_preemptible_vm(metadata, was_cached): - """ - Modified from: https://github.com/broadinstitute/dsde-pipelines/blob/develop/scripts/calculate_cost.py - """ - if was_cached: - return True # if call cached, not any type of VM, but don't inflate nonpreemptible count - elif "runtimeAttributes" in metadata and "preemptible" in metadata['runtimeAttributes']: - pe_count = int(metadata['runtimeAttributes']["preemptible"]) - attempt = int(metadata['attempt']) - - return attempt <= pe_count - else: - # we can't tell (older metadata) so conservatively return false - return False + """ + Modified from: https://github.com/broadinstitute/dsde-pipelines/blob/develop/scripts/calculate_cost.py + """ + if was_cached: + return True # if call cached, not any type of VM, but don't inflate nonpreemptible count + elif "runtimeAttributes" in metadata and "preemptible" in metadata['runtimeAttributes']: + pe_count = int(metadata['runtimeAttributes']["preemptible"]) + attempt = int(metadata['attempt']) + + return attempt <= pe_count + else: + # we can't tell (older metadata) so conservatively return false + return False def used_cached_results(metadata): - """ - Modified from: https://github.com/broadinstitute/dsde-pipelines/blob/develop/scripts/calculate_cost.py - """ - return "callCaching" in metadata and "hit" in metadata["callCaching"] and metadata["callCaching"]["hit"] + """ + Modified from: https://github.com/broadinstitute/dsde-pipelines/blob/develop/scripts/calculate_cost.py + """ + return "callCaching" in metadata and "hit" in metadata["callCaching"] and metadata["callCaching"]["hit"] def calculate_start_end(call_info, override_warning=False, alias=None): - """ - Modified from: https://github.com/broadinstitute/dsde-pipelines/blob/develop/scripts/calculate_cost.py - """ - if 'jobId' in call_info: - job_id = call_info['jobId'].split('/')[-1] - if alias is None or alias == "": - alias = job_id - else: - alias += "." + job_id - elif alias is None or alias == "": - alias = "NA" - - # get start (start time of VM start) & end time (end time of 'ok') according to metadata - start = None - end = None - - if 'executionEvents' in call_info: - for x in call_info['executionEvents']: - # ignore incomplete executionEvents (could be due to server restart or similar) - if 'description' not in x: - continue - y = x['description'] - - if 'backend' in call_info and call_info['backend'] == 'PAPIv2': - if y.startswith("PreparingJob"): - start = dateutil.parser.parse(x['startTime']) - if y.startswith("Worker released"): - end = dateutil.parser.parse(x['endTime']) - else: - if y.startswith("start"): - start = dateutil.parser.parse(x['startTime']) - if y.startswith("ok"): - end = dateutil.parser.parse(x['endTime']) - - # if we are preempted or if cromwell used previously cached results, we don't even get a start time from JES. - # if cromwell was restarted, the start time from JES might not have been written to the metadata. - # in either case, use the Cromwell start time which is earlier but not wrong. 
- if start is None: - start = dateutil.parser.parse(call_info['start']) - - # if we are preempted or if cromwell used previously cached results, we don't get an endTime from JES right now. - # if cromwell was restarted, the start time from JES might not have been written to the metadata. - # in either case, use the Cromwell end time which is later but not wrong - if end is None: - if 'end' in call_info: - end = dateutil.parser.parse(call_info['end']) - elif override_warning: - logging.warning("End time not found, omitting job {}".format(alias)) - end = start - else: - raise RuntimeError((f"End time not found for job {alias} (may be running or have been aborted)." - " Run again with --override-warning to continue anyway and omit the job.")) - - return start, end + """ + Modified from: https://github.com/broadinstitute/dsde-pipelines/blob/develop/scripts/calculate_cost.py + """ + if 'jobId' in call_info: + job_id = call_info['jobId'].split('/')[-1] + if alias is None or alias == "": + alias = job_id + else: + alias += "." + job_id + elif alias is None or alias == "": + alias = "NA" + + # get start (start time of VM start) & end time (end time of 'ok') according to metadata + start = None + end = None + + if 'executionEvents' in call_info: + for x in call_info['executionEvents']: + # ignore incomplete executionEvents (could be due to server restart or similar) + if 'description' not in x: + continue + y = x['description'] + + if 'backend' in call_info and call_info['backend'] == 'PAPIv2': + if y.startswith("PreparingJob"): + start = dateutil.parser.parse(x['startTime']) + if y.startswith("Worker released"): + end = dateutil.parser.parse(x['endTime']) + else: + if y.startswith("start"): + start = dateutil.parser.parse(x['startTime']) + if y.startswith("ok"): + end = dateutil.parser.parse(x['endTime']) + + # if we are preempted or if cromwell used previously cached results, we don't even get a start time from JES. + # if cromwell was restarted, the start time from JES might not have been written to the metadata. + # in either case, use the Cromwell start time which is earlier but not wrong. + if start is None: + start = dateutil.parser.parse(call_info['start']) + + # if we are preempted or if cromwell used previously cached results, we don't get an endTime from JES right now. + # if cromwell was restarted, the start time from JES might not have been written to the metadata. + # in either case, use the Cromwell end time which is later but not wrong + if end is None: + if 'end' in call_info: + end = dateutil.parser.parse(call_info['end']) + elif override_warning: + logging.warning( + "End time not found, omitting job {}".format(alias)) + end = start + else: + raise RuntimeError((f"End time not found for job {alias} (may be running or have been aborted)." 
+ " Run again with --override-warning to continue anyway and omit the job.")) + + return start, end def get_mem_cpu(m): - """ - Modified from: https://github.com/broadinstitute/dsde-pipelines/blob/develop/scripts/calculate_cost.py - """ - cpu = 'na' - memory = 'na' - if 'runtimeAttributes' in m: - if 'cpu' in m['runtimeAttributes']: - cpu = int(m['runtimeAttributes']['cpu']) - if 'memory' in m['runtimeAttributes']: - mem_str = m['runtimeAttributes']['memory'] - memory = float(mem_str[:mem_str.index(" ")]) - return cpu, memory + """ + Modified from: https://github.com/broadinstitute/dsde-pipelines/blob/develop/scripts/calculate_cost.py + """ + cpu = 'na' + memory = 'na' + if 'runtimeAttributes' in m: + if 'cpu' in m['runtimeAttributes']: + cpu = int(m['runtimeAttributes']['cpu']) + if 'memory' in m['runtimeAttributes']: + mem_str = m['runtimeAttributes']['memory'] + memory = float(mem_str[:mem_str.index(" ")]) + return cpu, memory def add_label_to_alias(alias, labels): - # In alias, track hierarchy of workflow/task up to current task nicely without repetition - if alias is None: - alias = "" - to_add = "" - if 'wdl-call-alias' in labels: - to_add = labels['wdl-call-alias'] - elif 'wdl-task-name' in labels: - to_add = labels['wdl-task-name'] - if to_add != "" and not alias.endswith(to_add): - if alias != "" and alias[-1] != ".": - alias += "." - alias += to_add - - return alias + # In alias, track hierarchy of workflow/task up to current task nicely without repetition + if alias is None: + alias = "" + to_add = "" + if 'wdl-call-alias' in labels: + to_add = labels['wdl-call-alias'] + elif 'wdl-task-name' in labels: + to_add = labels['wdl-task-name'] + if to_add != "" and not alias.endswith(to_add): + if alias != "" and alias[-1] != ".": + alias += "." + alias += to_add + + return alias def get_call_alias(alias, call): - # In call_alias, track hierarchy of workflow/task up to current call nicely without repetition - if alias is None: - alias = "" - call_split = call.split('.') - call_name = call - if alias.endswith(call_split[0]): - call_name = call_split[1] - call_alias = alias - if call_alias != "" and call_alias[-1] != ".": - call_alias += "." - call_alias += call_name + # In call_alias, track hierarchy of workflow/task up to current call nicely without repetition + if alias is None: + alias = "" + call_split = call.split('.') + call_name = call + if alias.endswith(call_split[0]): + call_name = call_split[1] + call_alias = alias + if call_alias != "" and call_alias[-1] != ".": + call_alias += "." 
+ call_alias += call_name - return call_alias + return call_alias def update_nonpreemptible_counters(alias): - global NUM_NONPREEMPTIBLE - global NONPREEMPTIBLE_TASKS - NUM_NONPREEMPTIBLE += 1 - if alias in NONPREEMPTIBLE_TASKS: - NONPREEMPTIBLE_TASKS[alias] += 1 - else: - NONPREEMPTIBLE_TASKS[alias] = 1 + global NUM_NONPREEMPTIBLE + global NONPREEMPTIBLE_TASKS + NUM_NONPREEMPTIBLE += 1 + if alias in NONPREEMPTIBLE_TASKS: + NONPREEMPTIBLE_TASKS[alias] += 1 + else: + NONPREEMPTIBLE_TASKS[alias] = 1 def update_cached_counters(alias): - global CACHED - global NUM_CACHED - NUM_CACHED += 1 - if alias in CACHED: - CACHED[alias] += 1 - else: - CACHED[alias] = 1 - - -def get_calls(m, override_warning=False, alias=None): - """ - Modified from download_monitoring_logs.py script by Mark Walker - https://github.com/broadinstitute/gatk-sv/blob/master/scripts/cromwell/download_monitoring_logs.py - """ - if isinstance(m, list): - call_metadata = [] - for m_shard in m: - call_metadata.extend(get_calls(m_shard, override_warning, alias=alias)) - return call_metadata - - if 'labels' in m: - alias = add_label_to_alias(alias, m['labels']) - - call_metadata = [] - if 'calls' in m: - for call in m['calls']: - # Skips scatters that don't contain calls - if '.' not in call: - continue - call_alias = get_call_alias(alias, call) - # recursively get metadata - call_metadata.extend(get_calls(m['calls'][call], override_warning, alias=call_alias)) - - if 'subWorkflowMetadata' in m: - call_metadata.extend(get_calls(m['subWorkflowMetadata'], override_warning, alias=alias)) - - # in a call - if alias and ('stderr' in m): - start, end = calculate_start_end(m, override_warning, alias) - - cpu, memory = get_mem_cpu(m) - - cached = used_cached_results(m) - - preemptible = was_preemptible_vm(m, cached) - preemptible_cpu = 0 - nonpreemptible_cpu = 0 - if preemptible: - preemptible_cpu = cpu + global CACHED + global NUM_CACHED + NUM_CACHED += 1 + if alias in CACHED: + CACHED[alias] += 1 else: - nonpreemptible_cpu = cpu + CACHED[alias] = 1 - hdd_size, ssd_size = get_disk_info(m) - call_metadata.append((start, 1, cpu, preemptible_cpu, nonpreemptible_cpu, memory, hdd_size, ssd_size)) - call_metadata.append((end, -1, -1 * cpu, -1 * preemptible_cpu, -1 * nonpreemptible_cpu, -1 * memory, -1 * hdd_size, - -1 * ssd_size)) - if not preemptible: - update_nonpreemptible_counters(alias) +def get_calls(m, override_warning=False, alias=None): + """ + Modified from download_monitoring_logs.py script by Mark Walker + https://github.com/broadinstitute/gatk-sv/blob/master/scripts/cromwell/download_monitoring_logs.py + """ + if isinstance(m, list): + call_metadata = [] + for m_shard in m: + call_metadata.extend( + get_calls(m_shard, override_warning, alias=alias)) + return call_metadata + + if 'labels' in m: + alias = add_label_to_alias(alias, m['labels']) - if cached: - update_cached_counters(alias) + call_metadata = [] + if 'calls' in m: + for call in m['calls']: + # Skips scatters that don't contain calls + if '.' 
not in call: + continue + call_alias = get_call_alias(alias, call) + # recursively get metadata + call_metadata.extend( + get_calls(m['calls'][call], override_warning, alias=call_alias)) + + if 'subWorkflowMetadata' in m: + call_metadata.extend( + get_calls(m['subWorkflowMetadata'], override_warning, alias=alias)) + + # in a call + if alias and ('stderr' in m): + start, end = calculate_start_end(m, override_warning, alias) + + cpu, memory = get_mem_cpu(m) + + cached = used_cached_results(m) + + preemptible = was_preemptible_vm(m, cached) + preemptible_cpu = 0 + nonpreemptible_cpu = 0 + if preemptible: + preemptible_cpu = cpu + else: + nonpreemptible_cpu = cpu + + hdd_size, ssd_size = get_disk_info(m) + + call_metadata.append((start, 1, cpu, preemptible_cpu, + nonpreemptible_cpu, memory, hdd_size, ssd_size)) + call_metadata.append((end, -1, -1 * cpu, -1 * preemptible_cpu, -1 * nonpreemptible_cpu, -1 * memory, -1 * hdd_size, + -1 * ssd_size)) + if not preemptible: + update_nonpreemptible_counters(alias) + + if cached: + update_cached_counters(alias) - return call_metadata + return call_metadata def check_workflow_valid(metadata, metadata_file, override_warning): - # these errors cannot be overcome - if 'status' not in metadata: - raise RuntimeError("Incomplete metadata input file %s. File lacks workflow status field." % metadata_file) - if metadata['status'] == "fail": # Unrecognized workflow ID failure - unable to download metadata - err_msg = "Workflow metadata download failure." - if 'message' in metadata: - err_msg += " Message: " + metadata['message'] - raise RuntimeError(err_msg) - - # these errors may be able to be overcome for partial output - found_retryable_error = False - if metadata['status'] == "Failed": - logging.warning("Workflow failed, which is likely to impact plot accuracy.") - found_retryable_error = True - for event in metadata['workflowProcessingEvents']: - if event['description'] == "Released": - logging.warning("Server was interrupted during workflow execution, which is likely to impact plot accuracy.") - found_retryable_error = True - break - if found_retryable_error: - if override_warning: - logging.info("Override_warning=TRUE. Proceeding with caution.") - else: - raise RuntimeError(("One or more retryable errors encountered (see logging info for warnings). " - "To attempt to proceed anyway, re-run the script with the --override-warning flag.")) + # these errors cannot be overcome + if 'status' not in metadata: + raise RuntimeError( + "Incomplete metadata input file %s. File lacks workflow status field." % metadata_file) + # Unrecognized workflow ID failure - unable to download metadata + if metadata['status'] == "fail": + err_msg = "Workflow metadata download failure." + if 'message' in metadata: + err_msg += " Message: " + metadata['message'] + raise RuntimeError(err_msg) + + # these errors may be able to be overcome for partial output + found_retryable_error = False + if metadata['status'] == "Failed": + logging.warning( + "Workflow failed, which is likely to impact plot accuracy.") + found_retryable_error = True + for event in metadata['workflowProcessingEvents']: + if event['description'] == "Released": + logging.warning( + "Server was interrupted during workflow execution, which is likely to impact plot accuracy.") + found_retryable_error = True + break + if found_retryable_error: + if override_warning: + logging.info("Override_warning=TRUE. 
Proceeding with caution.") + else: + raise RuntimeError(("One or more retryable errors encountered (see logging info for warnings). " + "To attempt to proceed anyway, re-run the script with the --override-warning flag.")) def get_call_metadata(metadata_file, override_warning=False): - """ - Based on: https://github.com/broadinstitute/gatk-sv/blob/master/scripts/cromwell/download_monitoring_logs.py - """ - metadata = json.load(open(metadata_file, 'r')) - check_workflow_valid(metadata, metadata_file, override_warning) - colnames = ['timestamp', 'vm_delta', 'cpu_all_delta', 'cpu_preemptible_delta', 'cpu_nonpreemptible_delta', - 'memory_delta', 'hdd_delta', 'ssd_delta'] - - call_metadata = get_calls(metadata, override_warning) - if len(call_metadata) == 0: - raise RuntimeError("No calls in workflow metadata.") - call_metadata = pd.DataFrame(call_metadata, columns=colnames) + """ + Based on: https://github.com/broadinstitute/gatk-sv/blob/master/scripts/cromwell/download_monitoring_logs.py + """ + metadata = json.load(open(metadata_file, 'r')) + check_workflow_valid(metadata, metadata_file, override_warning) + colnames = ['timestamp', 'vm_delta', 'cpu_all_delta', 'cpu_preemptible_delta', 'cpu_nonpreemptible_delta', + 'memory_delta', 'hdd_delta', 'ssd_delta'] + + call_metadata = get_calls(metadata, override_warning) + if len(call_metadata) == 0: + raise RuntimeError("No calls in workflow metadata.") + call_metadata = pd.DataFrame(call_metadata, columns=colnames) - return call_metadata + return call_metadata def transform_call_metadata(call_metadata): - """ - Based on: https://github.com/broadinstitute/dsde-pipelines/blob/master/scripts/quota_usage.py - """ - call_metadata = call_metadata.sort_values(by='timestamp') - # make timestamps start from 0 by subtracting minimum (at index 0 after sorting) - call_metadata['timestamp_zero'] = call_metadata['timestamp'] - call_metadata.timestamp.iloc[0] - # get timedelta in seconds because plot labels won't format correctly otherwise - call_metadata['seconds'] = call_metadata['timestamp_zero'].dt.total_seconds() - - call_metadata['vm'] = call_metadata.vm_delta.cumsum() - call_metadata['cpu_all'] = call_metadata.cpu_all_delta.cumsum() - call_metadata['cpu_preemptible'] = call_metadata.cpu_preemptible_delta.cumsum() - call_metadata['cpu_nonpreemptible'] = call_metadata.cpu_nonpreemptible_delta.cumsum() - call_metadata['memory'] = call_metadata.memory_delta.cumsum() - call_metadata['ssd'] = call_metadata.ssd_delta.cumsum() - call_metadata['hdd'] = call_metadata.hdd_delta.cumsum() - - return call_metadata + """ + Based on: https://github.com/broadinstitute/dsde-pipelines/blob/master/scripts/quota_usage.py + """ + call_metadata = call_metadata.sort_values(by='timestamp') + # make timestamps start from 0 by subtracting minimum (at index 0 after sorting) + call_metadata['timestamp_zero'] = call_metadata['timestamp'] - \ + call_metadata.timestamp.iloc[0] + # get timedelta in seconds because plot labels won't format correctly otherwise + call_metadata['seconds'] = call_metadata['timestamp_zero'].dt.total_seconds() + + call_metadata['vm'] = call_metadata.vm_delta.cumsum() + call_metadata['cpu_all'] = call_metadata.cpu_all_delta.cumsum() + call_metadata['cpu_preemptible'] = call_metadata.cpu_preemptible_delta.cumsum() + call_metadata['cpu_nonpreemptible'] = call_metadata.cpu_nonpreemptible_delta.cumsum() + call_metadata['memory'] = call_metadata.memory_delta.cumsum() + call_metadata['ssd'] = call_metadata.ssd_delta.cumsum() + call_metadata['hdd'] = 
call_metadata.hdd_delta.cumsum() + + return call_metadata def plot_resources_time(df, title_name, output_name): - """ - Modified from: https://github.com/broadinstitute/dsde-pipelines/blob/master/scripts/quota_usage.py - """ - logging.info("Writing " + output_name) - colors = { - "vm": "#006FA6", # blue - "cpu_all": "black", - "cpu_preemptible": "#10a197", # turquoise - "cpu_nonpreemptible": "#A30059", # dark pink - "memory": "#FF4A46", # coral red - "hdd": "#72418F", # purple - "ssd": "#008941", # green - } - LABEL_SIZE = 17 - TITLE_SIZE = 20 - TICK_SIZE = 15 - - fig, ax = plt.subplots(4, 1, figsize=(14, 26), sharex=True) - ax[0].set_title(title_name + "Resource Acquisition Over Time", fontsize=TITLE_SIZE) - - ax[0].plot(df['seconds'], df['vm'], color=colors["vm"]) - ax[0].set_ylabel("VMs", fontsize=LABEL_SIZE) - plt.setp(ax[0].get_yticklabels(), fontsize=TICK_SIZE) - - ax[1].plot(df['seconds'], df['cpu_all'], color=colors["cpu_all"], linewidth=2, label="All") - ax[1].plot(df['seconds'], df['cpu_preemptible'], color=colors["cpu_preemptible"], linestyle="dashed", label="Preemptible") - ax[1].plot(df['seconds'], df['cpu_nonpreemptible'], color=colors["cpu_nonpreemptible"], linestyle="dashed", label="Non-preemptible") - ax[1].set_ylabel("CPU Cores", fontsize=LABEL_SIZE) - plt.setp(ax[1].get_yticklabels(), fontsize=TICK_SIZE) - ax[1].legend(loc="upper right", title="CPU Types", fontsize=TICK_SIZE, title_fontsize=TICK_SIZE) - - ax[2].plot(df['seconds'], df['memory'], color=colors["memory"]) - ax[2].set_ylabel("RAM (GiB)", fontsize=LABEL_SIZE) - plt.setp(ax[2].get_yticklabels(), fontsize=TICK_SIZE) - - ax[3].plot(df['seconds'], df['hdd'], color=colors["hdd"], label="HDD") - ax[3].plot(df['seconds'], df['ssd'], color=colors["ssd"], label="SSD") - ax[3].set_ylabel("Disk Memory (GiB)", fontsize=LABEL_SIZE) - plt.setp(ax[3].get_yticklabels(), fontsize=TICK_SIZE) - ax[3].legend(loc="upper right", title="Disk Types", fontsize=TICK_SIZE, title_fontsize=TICK_SIZE) - - formatter = matplotlib.ticker.FuncFormatter(lambda x, pos: str(datetime.timedelta(seconds=x))) - ax[3].xaxis.set_major_formatter(formatter) - plt.setp(ax[3].get_xticklabels(), rotation=15, fontsize=TICK_SIZE) - ax[3].set_xlabel("Time", fontsize=LABEL_SIZE) - - fig.savefig(output_name, bbox_inches='tight') + """ + Modified from: https://github.com/broadinstitute/dsde-pipelines/blob/master/scripts/quota_usage.py + """ + logging.info("Writing " + output_name) + colors = { + "vm": "#006FA6", # blue + "cpu_all": "black", + "cpu_preemptible": "#10a197", # turquoise + "cpu_nonpreemptible": "#A30059", # dark pink + "memory": "#FF4A46", # coral red + "hdd": "#72418F", # purple + "ssd": "#008941", # green + } + LABEL_SIZE = 17 + TITLE_SIZE = 20 + TICK_SIZE = 15 + + fig, ax = plt.subplots(4, 1, figsize=(14, 26), sharex=True) + ax[0].set_title( + title_name + "Resource Acquisition Over Time", fontsize=TITLE_SIZE) + + ax[0].plot(df['seconds'], df['vm'], color=colors["vm"]) + ax[0].set_ylabel("VMs", fontsize=LABEL_SIZE) + plt.setp(ax[0].get_yticklabels(), fontsize=TICK_SIZE) + + ax[1].plot(df['seconds'], df['cpu_all'], + color=colors["cpu_all"], linewidth=2, label="All") + ax[1].plot(df['seconds'], df['cpu_preemptible'], + color=colors["cpu_preemptible"], linestyle="dashed", label="Preemptible") + ax[1].plot(df['seconds'], df['cpu_nonpreemptible'], + color=colors["cpu_nonpreemptible"], linestyle="dashed", label="Non-preemptible") + ax[1].set_ylabel("CPU Cores", fontsize=LABEL_SIZE) + plt.setp(ax[1].get_yticklabels(), fontsize=TICK_SIZE) + 
ax[1].legend(loc="upper right", title="CPU Types",
+                 fontsize=TICK_SIZE, title_fontsize=TICK_SIZE)
+
+    ax[2].plot(df['seconds'], df['memory'], color=colors["memory"])
+    ax[2].set_ylabel("RAM (GiB)", fontsize=LABEL_SIZE)
+    plt.setp(ax[2].get_yticklabels(), fontsize=TICK_SIZE)
+
+    ax[3].plot(df['seconds'], df['hdd'], color=colors["hdd"], label="HDD")
+    ax[3].plot(df['seconds'], df['ssd'], color=colors["ssd"], label="SSD")
+    ax[3].set_ylabel("Disk Memory (GiB)", fontsize=LABEL_SIZE)
+    plt.setp(ax[3].get_yticklabels(), fontsize=TICK_SIZE)
+    ax[3].legend(loc="upper right", title="Disk Types",
+                 fontsize=TICK_SIZE, title_fontsize=TICK_SIZE)
+
+    formatter = matplotlib.ticker.FuncFormatter(
+        lambda x, pos: str(datetime.timedelta(seconds=x)))
+    ax[3].xaxis.set_major_formatter(formatter)
+    plt.setp(ax[3].get_xticklabels(), rotation=15, fontsize=TICK_SIZE)
+    ax[3].set_xlabel("Time", fontsize=LABEL_SIZE)
+
+    fig.savefig(output_name, bbox_inches='tight')


 def write_resources_time_table(call_metadata, table_file):
-  logging.info("Writing " + table_file)
-  call_metadata.to_csv(
-    table_file,
-    columns=["timestamp", "seconds", "vm", "cpu_all", "cpu_preemptible", "cpu_nonpreemptible", "memory", "hdd", "ssd"],
-    sep='\t',
-    index=False,
-    date_format='%Y-%m-%dT%H:%M%:%SZ'
-  )
+    logging.info("Writing " + table_file)
+    call_metadata.to_csv(
+        table_file,
+        columns=["timestamp", "seconds", "vm", "cpu_all",
+                 "cpu_preemptible", "cpu_nonpreemptible", "memory", "hdd", "ssd"],
+        sep='\t',
+        index=False,
+        date_format='%Y-%m-%dT%H:%M:%SZ'
+    )


 def write_peak_usage(m, peak_file):
-  logging.info("Writing " + peak_file)
-  with open(peak_file, 'w') as out:
-    out.write("peak_vms\t" + str(max(m['vm'])) + "\n")
-    out.write("peak_cpu_all\t" + str(max(m['cpu_all'])) + "\n")
-    out.write("peak_cpu_preemptible\t" + str(max(m['cpu_preemptible'])) + "\n")
-    out.write("peak_cpu_nonpreemptible\t" + str(max(m['cpu_nonpreemptible'])) + "\n")
-    out.write("peak_ram_gib\t" + "{:.2f}".format(max(m['memory'])) + "\n")
-    out.write("peak_disk_hdd_gib\t" + str(max(m['hdd'])) + "\n")
-    out.write("peak_disk_ssd_gib\t" + str(max(m['ssd'])) + "\n")
+    logging.info("Writing " + peak_file)
+    with open(peak_file, 'w') as out:
+        out.write("peak_vms\t" + str(max(m['vm'])) + "\n")
+        out.write("peak_cpu_all\t" + str(max(m['cpu_all'])) + "\n")
+        out.write("peak_cpu_preemptible\t" +
+                  str(max(m['cpu_preemptible'])) + "\n")
+        out.write("peak_cpu_nonpreemptible\t" +
+                  str(max(m['cpu_nonpreemptible'])) + "\n")
+        out.write("peak_ram_gib\t" + "{:.2f}".format(max(m['memory'])) + "\n")
+        out.write("peak_disk_hdd_gib\t" + str(max(m['hdd'])) + "\n")
+        out.write("peak_disk_ssd_gib\t" + str(max(m['ssd'])) + "\n")


 def write_cached_warning(cached_file):
-  global CACHED
-  global NUM_CACHED
-  if NUM_CACHED > 0:
-    logging.info("%d cached task(s) found, writing task(s) to %s." % (NUM_CACHED, cached_file))
-    with open(cached_file, 'w') as cached_out:
-      cached_out.write("#task_name\tnum_cached\n")
-      cached_out.write("all_tasks\t%d\n" % NUM_CACHED)
-      cached_out.write("\n".join([x + '\t' + str(CACHED[x]) for x in sorted(list(CACHED.keys()))]) + "\n")
-  else:
-    logging.info("0 cached tasks found.")
+    global CACHED
+    global NUM_CACHED
+    if NUM_CACHED > 0:
+        logging.info("%d cached task(s) found, writing task(s) to %s."
% + (NUM_CACHED, cached_file)) + with open(cached_file, 'w') as cached_out: + cached_out.write("#task_name\tnum_cached\n") + cached_out.write("all_tasks\t%d\n" % NUM_CACHED) + cached_out.write("\n".join( + [x + '\t' + str(CACHED[x]) for x in sorted(list(CACHED.keys()))]) + "\n") + else: + logging.info("0 cached tasks found.") def write_nonpreemptible_vms(vms_file): - global NUM_NONPREEMPTIBLE - global NONPREEMPTIBLE_TASKS - if NUM_NONPREEMPTIBLE > 0: - logging.info("%d non-preemptible VM(s) found, writing task(s) to %s." % (NUM_NONPREEMPTIBLE, vms_file)) - with open(vms_file, 'w') as vms_out: - vms_out.write("#task_name\tnum_nonpreemptible\n") - vms_out.write("all_tasks\t%d\n" % NUM_NONPREEMPTIBLE) - vms_out.write("\n".join([x + '\t' + str(NONPREEMPTIBLE_TASKS[x]) for x in sorted(list(NONPREEMPTIBLE_TASKS.keys()))]) + '\n') - else: - logging.info("0 non-preemptible VMs found.") + global NUM_NONPREEMPTIBLE + global NONPREEMPTIBLE_TASKS + if NUM_NONPREEMPTIBLE > 0: + logging.info("%d non-preemptible VM(s) found, writing task(s) to %s." % + (NUM_NONPREEMPTIBLE, vms_file)) + with open(vms_file, 'w') as vms_out: + vms_out.write("#task_name\tnum_nonpreemptible\n") + vms_out.write("all_tasks\t%d\n" % NUM_NONPREEMPTIBLE) + vms_out.write("\n".join([x + '\t' + str(NONPREEMPTIBLE_TASKS[x]) + for x in sorted(list(NONPREEMPTIBLE_TASKS.keys()))]) + '\n') + else: + logging.info("0 non-preemptible VMs found.") def check_file_nonempty(f): - if not isfile(f): - raise RuntimeError("Required metadata input file %s does not exist." % f) - elif getsize(f) == 0: - raise RuntimeError("Required metadata input file %s is empty." % f) + if not isfile(f): + raise RuntimeError( + "Required metadata input file %s does not exist." % f) + elif getsize(f) == 0: + raise RuntimeError("Required metadata input file %s is empty." % f) # Main function def main(): - parser = argparse.ArgumentParser() - parser.add_argument("workflow_metadata", help="Workflow metadata JSON file") - parser.add_argument("output_base", help="Output directory + basename") - parser.add_argument("--plot-title", - help="Provide workflow name for plot title: Resource Acquisition Over Time", - required=False, default="") - parser.add_argument("--override-warning", - help="Execute script despite workflow warning (server interrupted, workflow failed, etc.), \ + parser = argparse.ArgumentParser() + parser.add_argument("workflow_metadata", + help="Workflow metadata JSON file") + parser.add_argument("output_base", help="Output directory + basename") + parser.add_argument("--plot-title", + help="Provide workflow name for plot title: Resource Acquisition Over Time", + required=False, default="") + parser.add_argument("--override-warning", + help="Execute script despite workflow warning (server interrupted, workflow failed, etc.), \ which may impact plot accuracy", - required=False, default=False, action='store_true') - parser.add_argument("--save-table", help="Save TSV copy of resources over time table used to make plot", - required=False, default=False, action='store_true') - parser.add_argument("--log-level", - help="Specify level of logging information, ie. 
info, warning, error (not case-sensitive)", - required=False, default="INFO") - args = parser.parse_args() - - # get args as variables - metadata_file, output_base = args.workflow_metadata, args.output_base # required args - plt_title, override_warning, save_table, log_level = args.plot_title, args.override_warning, args.save_table, args.log_level # optional args - - # set attributes based on input parameters - numeric_level = getattr(logging, log_level.upper(), None) - if not isinstance(numeric_level, int): - raise ValueError('Invalid log level: %s' % log_level) - logging.basicConfig(level=numeric_level, format='%(levelname)s: %(message)s') - if plt_title != "": - plt_title += " " - sep = "." - if basename(output_base) == "": - sep = "" - - check_file_nonempty(metadata_file) - call_metadata = get_call_metadata(metadata_file, override_warning) - call_metadata = transform_call_metadata(call_metadata) - - cached_file = output_base + sep + "cached.tsv" - write_cached_warning(cached_file) - - vms_file = output_base + sep + "vms_file.tsv" - write_nonpreemptible_vms(vms_file) - - plot_file = output_base + sep + "plot.png" - plot_resources_time(call_metadata, plt_title, plot_file) - - if save_table: - table_file = output_base + sep + "table.tsv" - write_resources_time_table(call_metadata, table_file) - - peak_file = output_base + sep + "peaks.tsv" - write_peak_usage(call_metadata, peak_file) + required=False, default=False, action='store_true') + parser.add_argument("--save-table", help="Save TSV copy of resources over time table used to make plot", + required=False, default=False, action='store_true') + parser.add_argument("--log-level", + help="Specify level of logging information, ie. info, warning, error (not case-sensitive)", + required=False, default="INFO") + args = parser.parse_args() + + # get args as variables + metadata_file, output_base = args.workflow_metadata, args.output_base # required args + plt_title, override_warning, save_table, log_level = args.plot_title, args.override_warning, args.save_table, args.log_level # optional args + + # set attributes based on input parameters + numeric_level = getattr(logging, log_level.upper(), None) + if not isinstance(numeric_level, int): + raise ValueError('Invalid log level: %s' % log_level) + logging.basicConfig(level=numeric_level, + format='%(levelname)s: %(message)s') + if plt_title != "": + plt_title += " " + sep = "." 
+ if basename(output_base) == "": + sep = "" + + check_file_nonempty(metadata_file) + call_metadata = get_call_metadata(metadata_file, override_warning) + call_metadata = transform_call_metadata(call_metadata) + + cached_file = output_base + sep + "cached.tsv" + write_cached_warning(cached_file) + + vms_file = output_base + sep + "vms_file.tsv" + write_nonpreemptible_vms(vms_file) + + plot_file = output_base + sep + "plot.png" + plot_resources_time(call_metadata, plt_title, plot_file) + + if save_table: + table_file = output_base + sep + "table.tsv" + write_resources_time_table(call_metadata, table_file) + + peak_file = output_base + sep + "peaks.tsv" + write_peak_usage(call_metadata, peak_file) if __name__ == "__main__": - main() + main() diff --git a/scripts/cromwell/copy_outputs.py b/scripts/cromwell/copy_outputs.py index c8a8d0def..3c24f3fa1 100644 --- a/scripts/cromwell/copy_outputs.py +++ b/scripts/cromwell/copy_outputs.py @@ -78,12 +78,14 @@ def copy_blob(storage_client, bucket_name, blob_name, destination_bucket_name, d source_uri = f"gs://{source_bucket.name}/{source_blob.name}" destination_uri = f"gs://{destination_bucket.name}/{destination_blob_name}" if destination_blob.exists(): - sys.stderr.write(f"Target {destination_uri} exists, cautiously refusing to overwrite. Aborting...\n") + sys.stderr.write( + f"Target {destination_uri} exists, cautiously refusing to overwrite. Aborting...\n") sys.exit(1) sys.stderr.write(f"Copying {source_uri}...") (token, bytes_rewritten, total_bytes) = destination_blob.rewrite(source=source_blob) while token is not None: - (token, bytes_rewritten, total_bytes) = destination_blob.rewrite(source=source_blob, token=token) + (token, bytes_rewritten, total_bytes) = destination_blob.rewrite( + source=source_blob, token=token) size_kb = int(bytes_rewritten / 1024) sys.stderr.write(f"done ({size_kb} KB)\n") @@ -96,15 +98,18 @@ def _parse_uri(uri): return bucket_name, bucket_object source_bucket_name, source_blob_name = _parse_uri(source_uri) dest_bucket_name, dest_blob_name = _parse_uri(dest_uri) - copy_blob(storage_client, source_bucket_name, source_blob_name, dest_bucket_name, dest_blob_name) + copy_blob(storage_client, source_bucket_name, + source_blob_name, dest_bucket_name, dest_blob_name) # Main function def main(): parser = argparse.ArgumentParser() parser.add_argument("--name", help="Batch or cohort name", required=True) - parser.add_argument("--metadata", help="Workflow metadata JSON file", required=True) - parser.add_argument("--dest", help="Destination GCS URI (e.g. \"gs://my-bucket/output\")", required=True) + parser.add_argument( + "--metadata", help="Workflow metadata JSON file", required=True) + parser.add_argument( + "--dest", help="Destination GCS URI (e.g. 
\"gs://my-bucket/output\")", required=True) args = parser.parse_args() metadata = json.load(open(args.metadata, 'r')) output_uris = get_uris(metadata, args.name, args.dest) @@ -113,5 +118,5 @@ def main(): copy_uri(source_uri, dest_uri, client) -if __name__== "__main__": +if __name__ == "__main__": main() diff --git a/scripts/cromwell/download_monitoring_logs.py b/scripts/cromwell/download_monitoring_logs.py index 833c1a538..17f704570 100644 --- a/scripts/cromwell/download_monitoring_logs.py +++ b/scripts/cromwell/download_monitoring_logs.py @@ -27,82 +27,91 @@ NUM_THREADS = 8 RAND_SEED = 7282993 -def getCalls(m, alias=None): - if isinstance(m, list): - call_metadata = [] - for m_shard in m: - call_metadata.extend(getCalls(m_shard, alias=alias)) - return call_metadata - if 'labels' in m: - if 'wdl-call-alias' in m['labels']: - alias = m['labels']['wdl-call-alias'] - elif 'wdl-task-name' in m['labels']: - alias = m['labels']['wdl-task-name'] - - shard_index = '-2' - if 'shardIndex' in m: - shard_index = m['shardIndex'] - - attempt = '0' - if 'attempt' in m: - attempt = m['attempt'] +def getCalls(m, alias=None): + if isinstance(m, list): + call_metadata = [] + for m_shard in m: + call_metadata.extend(getCalls(m_shard, alias=alias)) + return call_metadata + + if 'labels' in m: + if 'wdl-call-alias' in m['labels']: + alias = m['labels']['wdl-call-alias'] + elif 'wdl-task-name' in m['labels']: + alias = m['labels']['wdl-task-name'] + + shard_index = '-2' + if 'shardIndex' in m: + shard_index = m['shardIndex'] + + attempt = '0' + if 'attempt' in m: + attempt = m['attempt'] + + job_id = 'na' + if 'jobId' in m: + job_id = m['jobId'].split('/')[-1] - job_id = 'na' - if 'jobId' in m: - job_id = m['jobId'].split('/')[-1] + call_metadata = [] + if 'calls' in m: + for call in m['calls']: + # Skips scatters that don't contain calls + if '.' not in call: + continue + call_alias = call.split('.')[1] + call_metadata.extend(getCalls(m['calls'][call], alias=call_alias)) - call_metadata = [] - if 'calls' in m: - for call in m['calls']: - # Skips scatters that don't contain calls - if '.' not in call: - continue - call_alias = call.split('.')[1] - call_metadata.extend(getCalls(m['calls'][call], alias=call_alias)) + if 'subWorkflowMetadata' in m: + call_metadata.extend(getCalls(m['subWorkflowMetadata'], alias=alias)) - if 'subWorkflowMetadata' in m: - call_metadata.extend(getCalls(m['subWorkflowMetadata'], alias=alias)) + # in a call + if alias and ('monitoringLog' in m): + call_metadata.append((m, alias, shard_index, attempt, job_id)) - # in a call - if alias and ('monitoringLog' in m): - call_metadata.append((m, alias, shard_index, attempt, job_id)) + return call_metadata - return call_metadata def download(data, output_dir): - (m, alias, shard_index, attempt, job_id) = data - if job_id != 'na': - output_dest = output_dir + '/' + alias + '.' + str(shard_index) + '.' + str(attempt) + '.' + job_id + '.monitoring.log' - log_url = m['monitoringLog'] - if os.path.isfile(output_dest): - print("skipping " + log_url) - return - with open(output_dest, 'wb') as f: - client = storage.Client() - tokens = log_url.split('/') - bucket_name = tokens[2] - bucket_object = '/'.join(tokens[3:]) - bucket = client.get_bucket(bucket_name) - blob = bucket.get_blob(bucket_object) - if blob: - print(log_url) - blob.download_to_file(f) + (m, alias, shard_index, attempt, job_id) = data + if job_id != 'na': + output_dest = output_dir + '/' + alias + '.' + \ + str(shard_index) + '.' + str(attempt) + \ + '.' 
+ job_id + '.monitoring.log' + log_url = m['monitoringLog'] + if os.path.isfile(output_dest): + print("skipping " + log_url) + return + with open(output_dest, 'wb') as f: + client = storage.Client() + tokens = log_url.split('/') + bucket_name = tokens[2] + bucket_object = '/'.join(tokens[3:]) + bucket = client.get_bucket(bucket_name) + blob = bucket.get_blob(bucket_object) + if blob: + print(log_url) + blob.download_to_file(f) # Main function + + def main(): - parser = argparse.ArgumentParser() - parser.add_argument("workflow_metadata", help="Workflow metadata JSON file") - parser.add_argument("output_dir", help="Output directory") - args = parser.parse_args() - random.seed(RAND_SEED) + parser = argparse.ArgumentParser() + parser.add_argument("workflow_metadata", + help="Workflow metadata JSON file") + parser.add_argument("output_dir", help="Output directory") + args = parser.parse_args() + random.seed(RAND_SEED) + + metadata_file = args.workflow_metadata + output_dir = args.output_dir - metadata_file = args.workflow_metadata - output_dir = args.output_dir + metadata = json.load(open(metadata_file, 'r')) + call_metadata = getCalls(metadata, metadata['workflowName']) + Parallel(n_jobs=NUM_THREADS)(delayed(download)(d, output_dir) + for d in call_metadata) - metadata = json.load(open(metadata_file, 'r')) - call_metadata = getCalls(metadata, metadata['workflowName']) - Parallel(n_jobs=NUM_THREADS)(delayed(download)(d, output_dir) for d in call_metadata) -if __name__== "__main__": - main() +if __name__ == "__main__": + main() diff --git a/scripts/cromwell/generate_inputs.py b/scripts/cromwell/generate_inputs.py index 4e96f3b30..60c5e3bfd 100644 --- a/scripts/cromwell/generate_inputs.py +++ b/scripts/cromwell/generate_inputs.py @@ -12,7 +12,7 @@ # # Usage: # python generate_inputs.py workflow.wdl.example.json prereq_metadata_files.json -# +# # Parameters: # worfklow.wdl.example.json : Workflow input file containing all parameters. This is used to populate default values for parameters not determined from metadata. # prereq_metadata_files.json : JSON-encoded set of prerequisite workflow metadata files (see generate_inputs_examples directory) @@ -20,325 +20,368 @@ # Author: Mark Walker (markw@broadinstitute.org) # Prints error message and quits + + def raise_error(msg): - raise ValueError(msg) - sys.exit(1) + raise ValueError(msg) + sys.exit(1) # Prints warning message to stderr + + def print_warning(msg): - sys.stderr.write("Warning: " + msg) + sys.stderr.write("Warning: " + msg) # Workflow-specific configuration class + + class ScriptConfig: - def __init__(self, data_map, sample_ids_keys = None, sample_specific_file_lists = None): - self.data_map = data_map - self.sample_ids_keys = sample_ids_keys - self.sample_specific_file_lists = sample_specific_file_lists + def __init__(self, data_map, sample_ids_keys=None, sample_specific_file_lists=None): + self.data_map = data_map + self.sample_ids_keys = sample_ids_keys + self.sample_specific_file_lists = sample_specific_file_lists - def requires_sample_ids(self): - if self.sample_ids_keys: - return True - return False + def requires_sample_ids(self): + if self.sample_ids_keys: + return True + return False # Definitions of prerequisite workflows and mappings from their inputs/outputs to the current workflow's input # i.e. 
X_MAP[PREREQ_WORKFLOW][INPUT/OUTPUT][PREREQ_OUTPUT] = X_INPUT + # TODO : add gCNV GATKSVPIPELINEPHASE1_MAP = { - "Module00a" : { - "inputs" : { - "samples" : "samples" - }, - "outputs" : { - "BAF_out" : "BAF_files", - "coverage_counts" : "counts", - "delly_vcf" : "delly_vcfs", - "manta_vcf" : "manta_vcfs", - "melt_vcf" : "melt_vcfs", - "pesr_disc" : "PE_files", - "pesr_split" : "SR_files", - "wham_vcf" : "wham_vcfs" + "Module00a": { + "inputs": { + "samples": "samples" + }, + "outputs": { + "BAF_out": "BAF_files", + "coverage_counts": "counts", + "delly_vcf": "delly_vcfs", + "manta_vcf": "manta_vcfs", + "melt_vcf": "melt_vcfs", + "pesr_disc": "PE_files", + "pesr_split": "SR_files", + "wham_vcf": "wham_vcfs" + } } - } } MODULE00B_MAP = { - "Module00a" : { - "inputs" : { - "samples" : "samples" - }, - "outputs" : { - "coverage_counts" : "counts", - "delly_vcf" : "delly_vcfs", - "manta_vcf" : "manta_vcfs", - "melt_vcf" : "melt_vcfs", - "wham_vcf" : "wham_vcfs" + "Module00a": { + "inputs": { + "samples": "samples" + }, + "outputs": { + "coverage_counts": "counts", + "delly_vcf": "delly_vcfs", + "manta_vcf": "manta_vcfs", + "melt_vcf": "melt_vcfs", + "wham_vcf": "wham_vcfs" + } } - } } MODULE00C_MAP = { - "Module00a" : { - "inputs" : { - "samples" : "samples" - }, - "outputs" : { - "BAF_out" : "BAF_files", - "coverage_counts" : "counts", - "delly_vcf" : "delly_vcfs", - "manta_vcf" : "manta_vcfs", - "melt_vcf" : "melt_vcfs", - "pesr_disc" : "PE_files", - "pesr_split" : "SR_files", - "wham_vcf" : "wham_vcfs" + "Module00a": { + "inputs": { + "samples": "samples" + }, + "outputs": { + "BAF_out": "BAF_files", + "coverage_counts": "counts", + "delly_vcf": "delly_vcfs", + "manta_vcf": "manta_vcfs", + "melt_vcf": "melt_vcfs", + "pesr_disc": "PE_files", + "pesr_split": "SR_files", + "wham_vcf": "wham_vcfs" + } } - } } MODULE01_MAP = { - "Module00c" : { - "inputs" : { - "batch" : "batch" - }, - "outputs" : { - "std_manta_vcf" : "manta_vcfs", - "std_delly_vcf" : "delly_vcfs", - "std_melt_vcf" : "melt_vcfs", - "std_wham_vcf" : "wham_vcfs", - "merged_dels" : "del_bed", - "merged_dups" : "dup_bed" + "Module00c": { + "inputs": { + "batch": "batch" + }, + "outputs": { + "std_manta_vcf": "manta_vcfs", + "std_delly_vcf": "delly_vcfs", + "std_melt_vcf": "melt_vcfs", + "std_wham_vcf": "wham_vcfs", + "merged_dels": "del_bed", + "merged_dups": "dup_bed" + } } - } } MODULE02_MAP = { - "Module00c" : { - "inputs" : { - "samples" : "samples", - "batch" : "batch" + "Module00c": { + "inputs": { + "samples": "samples", + "batch": "batch" + }, + "outputs": { + "merged_BAF": "baf_metrics", + "merged_SR": "splitfile", + "merged_PE": "discfile", + "merged_bincov": "coveragefile", + "median_cov": "medianfile" + } }, - "outputs" : { - "merged_BAF" : "baf_metrics", - "merged_SR" : "splitfile", - "merged_PE" : "discfile", - "merged_bincov" : "coveragefile", - "median_cov" : "medianfile" + "Module01": { + "outputs": { + "depth_vcf": "depth_vcf", + "manta_vcf": "manta_vcf", + "delly_vcf": "delly_vcf", + "wham_vcf": "wham_vcf", + "melt_vcf": "melt_vcf" + } } - }, - "Module01" : { - "outputs" : { - "depth_vcf" : "depth_vcf", - "manta_vcf" : "manta_vcf", - "delly_vcf" : "delly_vcf", - "wham_vcf" : "wham_vcf", - "melt_vcf" : "melt_vcf" - } - } } MODULE03_MAP = { - "Module01" : { - "inputs" : { - "samples" : "samples", - "batch" : "batch" + "Module01": { + "inputs": { + "samples": "samples", + "batch": "batch" + }, + "outputs": { + "depth_vcf": "depth_vcf", + "manta_vcf": "manta_vcf", + "delly_vcf": "delly_vcf", + "wham_vcf": "wham_vcf", 
+ "melt_vcf": "melt_vcf" + } }, - "outputs" : { - "depth_vcf" : "depth_vcf", - "manta_vcf" : "manta_vcf", - "delly_vcf" : "delly_vcf", - "wham_vcf" : "wham_vcf", - "melt_vcf" : "melt_vcf" - } - }, - "Module02" : { - "outputs" : { - "metrics" : "evidence_metrics" + "Module02": { + "outputs": { + "metrics": "evidence_metrics" + } } - } } MODULE04_MAP = { - "Module00c" : { - "inputs" : { - "batch" : "batch" + "Module00c": { + "inputs": { + "batch": "batch" + }, + "outputs": { + "merged_SR": "splitfile", + "merged_PE": "discfile", + "merged_bincov": "coveragefile", + "median_cov": "medianfile" + } }, - "outputs" : { - "merged_SR" : "splitfile", - "merged_PE" : "discfile", - "merged_bincov" : "coveragefile", - "median_cov" : "medianfile" + "Module03": { + "outputs": { + "filtered_depth_vcf": "batch_depth_vcf", + "filtered_pesr_vcf": "batch_pesr_vcf", + "ped_file_postOutlierExclusion": "famfile", + "batch_samples_postOutlierExclusion": "samples", + "cutoffs": "rf_cutoffs" + } } - }, - "Module03" : { - "outputs" : { - "filtered_depth_vcf" : "batch_depth_vcf", - "filtered_pesr_vcf" : "batch_pesr_vcf", - "ped_file_postOutlierExclusion" : "famfile", - "batch_samples_postOutlierExclusion" : "samples", - "cutoffs" : "rf_cutoffs" - } - } } SCRIPT_CONFIGS = { - "GATKSVPipelinePhase1" : ScriptConfig(GATKSVPIPELINEPHASE1_MAP, - sample_ids_keys = ("Module00a","inputs","samples"), - sample_specific_file_lists = ["BAF_files", "PE_files", "SR_files", "counts", "genotyped_segments_vcfs", "manta_vcfs", "delly_vcfs", "melt_vcfs", "wham_vcfs"]), - "Module00b" : ScriptConfig(MODULE00B_MAP, - sample_ids_keys = ("Module00a","inputs","samples"), - sample_specific_file_lists = ["counts", "manta_vcfs", "delly_vcfs", "melt_vcfs", "wham_vcfs"]), - "Module00c" : ScriptConfig(MODULE00C_MAP, - sample_specific_file_lists = ["BAF_files", "PE_files", "SR_files", "counts", "manta_vcfs", "delly_vcfs", "melt_vcfs", "wham_vcfs"]), - "Module01" : ScriptConfig(MODULE01_MAP, - sample_ids_keys = ("Module00c","inputs","samples"), - sample_specific_file_lists = ["manta_vcfs", "delly_vcfs", "melt_vcfs", "wham_vcfs"]), - "Module02" : ScriptConfig(MODULE02_MAP, - sample_ids_keys = ("Module00c","inputs","samples")), - "Module03" : ScriptConfig(MODULE03_MAP, - sample_ids_keys = ("Module01","inputs","samples")), - "Module04" : ScriptConfig(MODULE04_MAP) # No sample order checking post-exclusion + "GATKSVPipelinePhase1": ScriptConfig(GATKSVPIPELINEPHASE1_MAP, + sample_ids_keys=( + "Module00a", "inputs", "samples"), + sample_specific_file_lists=["BAF_files", "PE_files", "SR_files", "counts", "genotyped_segments_vcfs", "manta_vcfs", "delly_vcfs", "melt_vcfs", "wham_vcfs"]), + "Module00b": ScriptConfig(MODULE00B_MAP, + sample_ids_keys=( + "Module00a", "inputs", "samples"), + sample_specific_file_lists=["counts", "manta_vcfs", "delly_vcfs", "melt_vcfs", "wham_vcfs"]), + "Module00c": ScriptConfig(MODULE00C_MAP, + sample_specific_file_lists=["BAF_files", "PE_files", "SR_files", "counts", "manta_vcfs", "delly_vcfs", "melt_vcfs", "wham_vcfs"]), + "Module01": ScriptConfig(MODULE01_MAP, + sample_ids_keys=( + "Module00c", "inputs", "samples"), + sample_specific_file_lists=["manta_vcfs", "delly_vcfs", "melt_vcfs", "wham_vcfs"]), + "Module02": ScriptConfig(MODULE02_MAP, + sample_ids_keys=("Module00c", "inputs", "samples")), + "Module03": ScriptConfig(MODULE03_MAP, + sample_ids_keys=("Module01", "inputs", "samples")), + # No sample order checking post-exclusion + "Module04": ScriptConfig(MODULE04_MAP) } + def load_json(filepath): - with open(filepath, 'r') 
as f: - return json.load(f) - return + with open(filepath, 'r') as f: + return json.load(f) + + def determine_workflow_name(default_inputs): - workflow_name = "" - for key in default_inputs: - if '.' not in key: - raise_error('Missing "." in WDL input field: ' + key) - tokens = key.split('.') + workflow_name = "" + for key in default_inputs: + if '.' not in key: + raise_error('Missing "." in WDL input field: ' + key) + tokens = key.split('.') + if not workflow_name: + workflow_name = tokens[0] + else: + if tokens[0] != workflow_name: + raise_error('Inconsistent workflow name: ' + tokens[0]) if not workflow_name: - workflow_name = tokens[0] - else: - if tokens[0] != workflow_name: - raise_error('Inconsistent workflow name: ' + tokens[0]) - if not workflow_name: - raise_error('Workflow name could not be determined from the WDL input file') - return workflow_name + raise_error( + 'Workflow name could not be determined from the WDL input file') + return workflow_name + + def get_workflow_config(workflow_name): - if workflow_name not in SCRIPT_CONFIGS: - raise_error('Could not find workflow "' + workflow_name + '", options are: ' + str(SCRIPT_CONFIGS.keys())) - return SCRIPT_CONFIGS[workflow_name] + if workflow_name not in SCRIPT_CONFIGS: + raise_error('Could not find workflow "' + workflow_name + + '", options are: ' + str(SCRIPT_CONFIGS.keys())) + return SCRIPT_CONFIGS[workflow_name] + + def check_all_metadata_present(script_config, metadata_files): - if script_config.data_map.keys() != metadata_files.keys(): - raise_error('Script config workflows and metadata file workflows did not match. Script config expected ' + str(workflows) + ' but got metadata for ' + str(metadata_files.keys())) + if script_config.data_map.keys() != metadata_files.keys(): + raise_error('Script config workflows and metadata file workflows did not match. Script config expected ' + + str(script_config.data_map.keys()) + ' but got metadata for ' + str(metadata_files.keys())) + + def check_expected_workflow_fields(script_config, default_inputs, workflow_name): - for workflow in script_config.data_map: - if "outputs" in script_config.data_map[workflow]: - for output_name in script_config.data_map[workflow]["outputs"]: - wdl_input_name = workflow_name + "." + script_config.data_map[workflow]["outputs"][output_name] - if wdl_input_name not in default_inputs: - raise_error('Script configuration expected field ' + wdl_input_name + ' but it was not found in the WDL input file') + for workflow in script_config.data_map: + if "outputs" in script_config.data_map[workflow]: + for output_name in script_config.data_map[workflow]["outputs"]: + wdl_input_name = workflow_name + "."
+ \ + script_config.data_map[workflow]["outputs"][output_name] + if wdl_input_name not in default_inputs: + raise_error('Script configuration expected field ' + + wdl_input_name + ' but it was not found in the WDL input file') + def load_prerequisite_metadata(metadata_files): - prereq_metadata = {} - for workflow in metadata_files: - with open(metadata_files[workflow], 'r') as f: - m = json.load(f) - if 'outputs' not in m: - raise_error('Metadata ' + metadata_files[workflow] + ' did not have an outputs field') - prereq_metadata[workflow] = m - return prereq_metadata + prereq_metadata = {} + for workflow in metadata_files: + with open(metadata_files[workflow], 'r') as f: + m = json.load(f) + if 'outputs' not in m: + raise_error( + 'Metadata ' + metadata_files[workflow] + ' did not have an outputs field') + prereq_metadata[workflow] = m + return prereq_metadata + def get_preqreq_values(workflow_map, workflow_metadata, script_config, prereq_attr_prefix, workflow_name, inputs): - for expected_name in workflow_map: - name = prereq_attr_prefix + expected_name - if name not in workflow_metadata or not workflow_metadata[name]: - print_warning('could not find metadata for attribute ' + name + ', using default value if provided\n') - else: - input_name = workflow_name + "." + workflow_map[expected_name] - inputs[input_name] = workflow_metadata[name] + for expected_name in workflow_map: + name = prereq_attr_prefix + expected_name + if name not in workflow_metadata or not workflow_metadata[name]: + print_warning('could not find metadata for attribute ' + + name + ', using default value if provided\n') + else: + input_name = workflow_name + "." + workflow_map[expected_name] + inputs[input_name] = workflow_metadata[name] + def get_workflow_inputs(prereq_metadata, script_config, default_inputs, workflow_name): - inputs = {} - for prereq_workflow_name in script_config.data_map: - workflow_metadata = prereq_metadata[prereq_workflow_name] - data_maps = script_config.data_map[prereq_workflow_name] - if "inputs" in data_maps: - get_preqreq_values(data_maps["inputs"], workflow_metadata["inputs"], script_config, "", workflow_name, inputs) - if "outputs" in data_maps: - get_preqreq_values(data_maps["outputs"], workflow_metadata["outputs"], script_config, prereq_workflow_name + ".", workflow_name, inputs) - #Fill in rest of the fields with defaults inputs file - for key in default_inputs: - if key not in inputs: - inputs[key] = default_inputs[key] - return inputs + inputs = {} + for prereq_workflow_name in script_config.data_map: + workflow_metadata = prereq_metadata[prereq_workflow_name] + data_maps = script_config.data_map[prereq_workflow_name] + if "inputs" in data_maps: + get_preqreq_values( + data_maps["inputs"], workflow_metadata["inputs"], script_config, "", workflow_name, inputs) + if "outputs" in data_maps: + get_preqreq_values(data_maps["outputs"], workflow_metadata["outputs"], + script_config, prereq_workflow_name + ".", workflow_name, inputs) + # Fill in rest of the fields with defaults inputs file + for key in default_inputs: + if key not in inputs: + inputs[key] = default_inputs[key] + return inputs + def get_samples_list(script_config, prereq_metadata): - samples_workflow = script_config.sample_ids_keys[0] - samples_workflow_metadata_key = script_config.sample_ids_keys[1] - samples_attr = script_config.sample_ids_keys[2] - if samples_workflow not in prereq_metadata: - raise_error("Expected metadata for workflow " + samples_workflow) - if samples_workflow_metadata_key not in 
prereq_metadata[samples_workflow]: - raise_error("Expected to find key " + samples_workflow_metadata_key + " in workflow " + samples_workflow + " metadata, but found: " + str(prereq_metadata[samples_workflow].keys())) - if samples_attr not in prereq_metadata[samples_workflow][samples_workflow_metadata_key]: - raise_error("Expected to find attribute " + samples_workflow_metadata_key + " : { " + samples_attr + " } in workflow " + samples_workflow) - return prereq_metadata[samples_workflow][samples_workflow_metadata_key][samples_attr] + samples_workflow = script_config.sample_ids_keys[0] + samples_workflow_metadata_key = script_config.sample_ids_keys[1] + samples_attr = script_config.sample_ids_keys[2] + if samples_workflow not in prereq_metadata: + raise_error("Expected metadata for workflow " + samples_workflow) + if samples_workflow_metadata_key not in prereq_metadata[samples_workflow]: + raise_error("Expected to find key " + samples_workflow_metadata_key + " in workflow " + + samples_workflow + " metadata, but found: " + str(prereq_metadata[samples_workflow].keys())) + if samples_attr not in prereq_metadata[samples_workflow][samples_workflow_metadata_key]: + raise_error("Expected to find attribute " + samples_workflow_metadata_key + + " : { " + samples_attr + " } in workflow " + samples_workflow) + return prereq_metadata[samples_workflow][samples_workflow_metadata_key][samples_attr] + def cross_check_sample_order(workflow_name, script_config, inputs, samples_list): - for sample_specific_name in [workflow_name + "." + name for name in script_config.sample_specific_file_lists]: - if sample_specific_name not in inputs: - raise_error('Expected to find sample-specific parameter list ' + sample_specific_name) - sample_specific_values = inputs[sample_specific_name] - if not isinstance(sample_specific_values, list): - raise_error('Expected sample-specific value ' + sample_specific_name + ' to be of type list but found ' + str(type(sample_specific_values))) - if len(sample_specific_values) != len(samples_list): - print_warning('Length of samples list is ' + str(len(samples_list)) + ' but length of sample-specific parameter ' + sample_specific_name + ' was ' + str(len(sample_specific_values))) - for i in range(len(samples_list)): - sample_id = samples_list[i] - if sample_id not in sample_specific_values[i]: - print_warning('Did not find sample id ' + sample_id + ' in input ' + sample_specific_name + '[' + str(i) + '], found ' + str(sample_specific_values[i])) + for sample_specific_name in [workflow_name + "." 
+ name for name in script_config.sample_specific_file_lists]: + if sample_specific_name not in inputs: + raise_error( + 'Expected to find sample-specific parameter list ' + sample_specific_name) + sample_specific_values = inputs[sample_specific_name] + if not isinstance(sample_specific_values, list): + raise_error('Expected sample-specific value ' + sample_specific_name + + ' to be of type list but found ' + str(type(sample_specific_values))) + if len(sample_specific_values) != len(samples_list): + print_warning('Length of samples list is ' + str(len(samples_list)) + + ' but length of sample-specific parameter ' + sample_specific_name + ' was ' + str(len(sample_specific_values))) + for i in range(len(samples_list)): + sample_id = samples_list[i] + if sample_id not in sample_specific_values[i]: + print_warning('Did not find sample id ' + sample_id + ' in input ' + + sample_specific_name + '[' + str(i) + '], found ' + str(sample_specific_values[i])) # Main function + + def main(): - parser = argparse.ArgumentParser() - parser.add_argument("default_inputs", help="Inputs JSON file containing default parameter values") - parser.add_argument("prereq_workflow_paths", help="JSON file specifying metadata file paths for each prerequisite workflow with format \"workflow_name\" : \"/path/to/metadata\"") - args = parser.parse_args() + parser = argparse.ArgumentParser() + parser.add_argument( + "default_inputs", help="Inputs JSON file containing default parameter values") + parser.add_argument("prereq_workflow_paths", + help="JSON file specifying metadata file paths for each prerequisite workflow with format \"workflow_name\" : \"/path/to/metadata\"") + args = parser.parse_args() - #Load the inputs file for the workflow - default_inputs = load_json(args.default_inputs) + # Load the inputs file for the workflow + default_inputs = load_json(args.default_inputs) - #Load preqreq metadata file paths, if provided - metadata_files = load_json(args.prereq_workflow_paths) + # Load preqreq metadata file paths, if provided + metadata_files = load_json(args.prereq_workflow_paths) - #Determine name of the current workflow - workflow_name = determine_workflow_name(default_inputs) + # Determine name of the current workflow + workflow_name = determine_workflow_name(default_inputs) - #Check that workflow is defined and retrieve it - script_config = get_workflow_config(workflow_name) + # Check that workflow is defined and retrieve it + script_config = get_workflow_config(workflow_name) - #Check that the expected prerequisite workflow metadata files were provided - check_all_metadata_present(script_config, metadata_files) + # Check that the expected prerequisite workflow metadata files were provided + check_all_metadata_present(script_config, metadata_files) - #Check that all script config fields are present in the WDL inputs file - #check_expected_workflow_fields(script_config, default_inputs, workflow_name) + # Check that all script config fields are present in the WDL inputs file + # check_expected_workflow_fields(script_config, default_inputs, workflow_name) - #Load prerequisite metadata outputs - prereq_metadata = load_prerequisite_metadata(metadata_files) + # Load prerequisite metadata outputs + prereq_metadata = load_prerequisite_metadata(metadata_files) - #Map metadata outputs to workflow inputs and fill in default values - inputs = get_workflow_inputs(prereq_metadata, script_config, default_inputs, workflow_name) + # Map metadata outputs to workflow inputs and fill in default values + inputs = 
get_workflow_inputs( + prereq_metadata, script_config, default_inputs, workflow_name) - #Use samples list if provided - if script_config.requires_sample_ids(): - samples_attr = script_config.sample_ids_keys[2] - samples_name = workflow_name + "." + samples_attr - inputs[samples_name] = get_samples_list(script_config, prereq_metadata) - samples_list = inputs[samples_name] - #Checks that sample-specific lists contain sample ids in correct order - if script_config.sample_specific_file_lists: - cross_check_sample_order(workflow_name, script_config, inputs, samples_list) - - #Print output - print json.dumps(inputs, sort_keys=True, indent=2) - -if __name__== "__main__": - main() + # Use samples list if provided + if script_config.requires_sample_ids(): + samples_attr = script_config.sample_ids_keys[2] + samples_name = workflow_name + "." + samples_attr + inputs[samples_name] = get_samples_list(script_config, prereq_metadata) + samples_list = inputs[samples_name] + # Checks that sample-specific lists contain sample ids in correct order + if script_config.sample_specific_file_lists: + cross_check_sample_order( + workflow_name, script_config, inputs, samples_list) + + # Print output + print json.dumps(inputs, sort_keys=True, indent=2) + + +if __name__ == "__main__": + main() diff --git a/scripts/cromwell/get_inputs_outputs.py b/scripts/cromwell/get_inputs_outputs.py index 151fe58b5..6f06c39ee 100644 --- a/scripts/cromwell/get_inputs_outputs.py +++ b/scripts/cromwell/get_inputs_outputs.py @@ -16,64 +16,70 @@ # # Author: Mark Walker (markw@broadinstitute.org) + def getSubworkflows(m, alias): - if isinstance(m, list): - return getSubworkflows(m[0], alias) + if isinstance(m, list): + return getSubworkflows(m[0], alias) + + task = '' + if 'workflowName' in m: + task = m['workflowName'] - task = '' - if 'workflowName' in m: - task = m['workflowName'] + # in a call + if not ('subWorkflowMetadata' in m or 'calls' in m): + return [] - #in a call - if not ('subWorkflowMetadata' in m or 'calls' in m): - return [] + call_metadata = [] + if 'calls' in m: + for call in m['calls']: + call_metadata.extend(getSubworkflows(m['calls'][call], call)) - call_metadata = [] - if 'calls' in m: - for call in m['calls']: - call_metadata.extend(getSubworkflows(m['calls'][call], call)) + if 'subWorkflowMetadata' in m: + call_metadata.extend(getSubworkflows(m['subWorkflowMetadata'], alias)) - if 'subWorkflowMetadata' in m: - call_metadata.extend(getSubworkflows(m['subWorkflowMetadata'], alias)) + if ('inputs' in m and 'outputs' in m and task): + call_metadata.append((m, task, alias)) - if ('inputs' in m and 'outputs' in m and task): - call_metadata.append((m, task, alias)) + return call_metadata - return call_metadata def write_files(workflow_metadata, output_dir): - for (m, task, alias) in workflow_metadata: - m_copy = {} - m_copy['inputs'] = m['inputs'] - m_copy['outputs'] = m['outputs'] - for key in list(m_copy['inputs']): - if m_copy['inputs'][key]: - m_copy['inputs'][task + '.' 
+ key] = m_copy['inputs'][key] - del m_copy['inputs'][key] - for key in list(m_copy['outputs']): - if not m_copy['outputs'][key]: - del m_copy['outputs'][key] - - inputs_path = os.path.join(output_dir, alias + '.inputs.json') - outputs_path = os.path.join(output_dir, alias + '.outputs.json') - with open(inputs_path, 'w') as f: - f.write(json.dumps(m_copy['inputs'], sort_keys=True, indent=2)) - with open(outputs_path, 'w') as f: - f.write(json.dumps(m_copy['outputs'], sort_keys=True, indent=2)) + for (m, task, alias) in workflow_metadata: + m_copy = {} + m_copy['inputs'] = m['inputs'] + m_copy['outputs'] = m['outputs'] + for key in list(m_copy['inputs']): + if m_copy['inputs'][key]: + m_copy['inputs'][task + '.' + key] = m_copy['inputs'][key] + del m_copy['inputs'][key] + for key in list(m_copy['outputs']): + if not m_copy['outputs'][key]: + del m_copy['outputs'][key] + + inputs_path = os.path.join(output_dir, alias + '.inputs.json') + outputs_path = os.path.join(output_dir, alias + '.outputs.json') + with open(inputs_path, 'w') as f: + f.write(json.dumps(m_copy['inputs'], sort_keys=True, indent=2)) + with open(outputs_path, 'w') as f: + f.write(json.dumps(m_copy['outputs'], sort_keys=True, indent=2)) # Main function + + def main(): - parser = argparse.ArgumentParser() - parser.add_argument("workflow_metadata", help="Workflow metadata JSON file") - parser.add_argument("output_dir", help="Output directory") - args = parser.parse_args() + parser = argparse.ArgumentParser() + parser.add_argument("workflow_metadata", + help="Workflow metadata JSON file") + parser.add_argument("output_dir", help="Output directory") + args = parser.parse_args() + + metadata_file = args.workflow_metadata + output_dir = args.output_dir - metadata_file = args.workflow_metadata - output_dir = args.output_dir + metadata = json.load(open(metadata_file, 'r')) + workflow_metadata = getSubworkflows(metadata, metadata['workflowName']) + write_files(workflow_metadata, output_dir) - metadata = json.load(open(metadata_file, 'r')) - workflow_metadata = getSubworkflows(metadata, metadata['workflowName']) - write_files(workflow_metadata, output_dir) -if __name__== "__main__": - main() +if __name__ == "__main__": + main() diff --git a/scripts/cromwell/get_output_paths.py b/scripts/cromwell/get_output_paths.py new file mode 100644 index 000000000..aaadfda69 --- /dev/null +++ b/scripts/cromwell/get_output_paths.py @@ -0,0 +1,263 @@ +#!/usr/bin/env python3 + +import argparse +import json +import logging +import re +import os.path +from urllib.parse import urlparse + +from google.cloud import storage + +""" +Summary: Find GCS paths for specified workflow file outputs for multiple workflows at once without downloading metadata. + +Caveats: Assumes cromwell file structure. Recommended for use with cromwell final_workflow_outputs_dir + to reduce number of files to search. Requires file suffixes for each output file that are + unique within the workflow directory. + +For usage & parameters: Run python get_output_paths.py --help + +Output: TSV file with columns for each output variable and a row for each + batch (or entity, if providing --entities-file), containing GCS output paths + +Author: Emma Pierce-Hoffman (epierceh@broadinstitute.org) +""" + + +def check_file_nonempty(f): + # Validate existence of file and that it is > 0 bytes + if not os.path.isfile(f): + raise RuntimeError("Required input file %s does not exist." % f) + elif os.path.getsize(f) == 0: + raise RuntimeError("Required input file %s is empty." % f)
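+
+# Illustrative usage sketch (the batch, bucket, and output names below are hypothetical,
+# not pipeline defaults):
+#   filenames.json:  { "clean_vcf": ".cleaned.vcf.gz", "qc_table": ".qc_metrics.tsv" }
+#   workflows.tsv:   one tab-separated "<batch_name><TAB><workflow_id>" pair per line
+#   python get_output_paths.py -w workflows.tsv -f filenames.json \
+#       -b gs://my-bucket/final_outputs/GATKSVPipelineBatch -o output_paths.tsv
+# Each output column then lists the GCS path(s) of files whose names end with the given suffix.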
+ + def read_entities_file(entities_file): + # Get list of entities from -e entities file + entities = [] + if entities_file is not None: + # proceed with reading file - must not be None at this point + check_file_nonempty(entities_file) + with open(entities_file, 'r') as f: + for line in f: + entities.append(line.strip()) + return entities + + def load_filenames(filenames): + # Read -f filenames / output names JSON + files_dict = json.load(open(filenames, 'r')) + output_names = sorted(files_dict.keys()) + if len(output_names) == 0: + raise ValueError("No output filenames to search for were found in required -f/--filenames JSON %s." % filenames) + return files_dict, output_names + + def split_bucket_subdir(directory): + # Parse -b URI input into top-level bucket name (no gs://) and subdirectory path + uri = urlparse(directory) + return uri.netloc, uri.path.lstrip("/") + + def get_batch_dirs(workflows, workflow_id, directory): + # Return list of (batch_name, batch_subdirectory) and top-level bucket parsed from -b URI input + batches_dirs = [] # to hold tuples of (batch, dir) in order given in input + bucket, subdir = split_bucket_subdir(directory) + # If using -i input, just add workflow ID to subdirectory path and return + if workflow_id is not None: + return [("placeholder_batch", os.path.join(subdir, workflow_id))], bucket + # If using -w input, read workflows file to get batch names and workflow IDs + with open(workflows, 'r') as inp: + for line in inp: + if line.strip() == "": + continue + (batch, workflow) = line.strip().split('\t') + batch_dir = os.path.join(subdir, workflow) + batches_dirs.append((batch, batch_dir)) + return batches_dirs, bucket + + def find_batch_output_files(batch, bucket, prefix, files_dict, output_names, num_outputs): + # Search batch directory for files with the specified filename suffixes + + # Get all objects in directory + storage_client = storage.Client() + blobs = storage_client.list_blobs(bucket, prefix=prefix, + delimiter=None) # only one workflow per batch - assumes caching if multiple + + # Go through each object in directory once, checking if it matches any filenames not yet found + batch_outputs = {file: [] for file in output_names} + names_left = list(output_names) + num_found = 0 + for blob in blobs: + blob_name = blob.name.strip() + # in case multiple files, continue matching on suffixes even if already found file match(es) + for name in output_names: + if blob_name.endswith(files_dict[name]): + blob_path = os.path.join("gs://", bucket, blob_name) # reconstruct URI + if len(batch_outputs[name]) == 0: + num_found += 1 + names_left.remove(name) + batch_outputs[name].append(blob_path) + break + + # Warn if some outputs not found + if num_found < num_outputs: + for name in names_left: + logging.warning(f"{batch} output file {name} not found in gs://{bucket}/{prefix}.
Outputting empty string") + + return batch_outputs + + +def sort_files_by_shard(file_list): + # Attempt to sort file list by shard number based on last occurrence of "shard-" in URI + if len(file_list) < 2: + return file_list + regex = r'^(shard-)([0-9]+)(/.*)' # extract shard number for sorting - group 2 + shard_numbers = [] + check_different_shard = None + for file in file_list: + index = file.rfind("shard-") # find index of last occurrence of shard- substring in file path + if index == -1: + return file_list # abandon sorting if no shard- substring + shard = int(re.match(regex, file[index:]).group(2)) + # make sure first two shard numbers actually differ + if check_different_shard is None: + check_different_shard = shard + elif check_different_shard != -1: + if shard == check_different_shard: + return file_list # if first two shard numbers match, then abandon sorting by shard + check_different_shard = -1 + shard_numbers.append(shard) + return [x for _, x in sorted(zip(shard_numbers, file_list), key=lambda pair: pair[0])] + + +def format_batch_line(batch, output_names, batch_outputs): + # Format line with batch and outputs (if not using entities option) + batch_line = batch + "\t" + batch_line += "\t".join(",".join(sort_files_by_shard(batch_outputs[name])) for name in output_names) + batch_line += "\n" + return batch_line + + +def update_entity_outputs(output_names, batch_outputs, entities, entity_outputs): + # Edit entity_outputs dict in place: add new batch outputs to each corresponding entity + for output_index, name in enumerate(output_names): + filepaths = batch_outputs[name] + filenames = [path.split("/")[-1] for path in filepaths] + for entity in entities: # not efficient but should be <500 entities and filenames to search + for i, filename in enumerate(filenames): + # cannot handle Array[File] output for one entity + if entity in filename and entity_outputs[entity][output_index] == "": + entity_outputs[entity][output_index] = filepaths[i] + entity_outputs[entity].append(filepaths[i]) + filenames.remove(filename) + filepaths.remove(filepaths[i]) + break + + +def write_entity_outputs(entity_outputs, keep_all_entities, entities, output_stream): + # Check, format, and write entity outputs + # do write inside function to be able to print line-by-line + for entity in entities: + # check for blank entities + if all(element == "" for element in entity_outputs[entity]): + if keep_all_entities: + logging.info(f"No output files found for entity '{entity}' in provided directories. " + f"Outputting blank entry. Remove -k argument to exclude empty entities.") + else: + logging.info(f"No output files found for entity '{entity}' in provided directories. " + f"Omitting from output. 
Use -k argument to include empty entities.") + continue + output_stream.write(entity + "\t" + "\t".join(entity_outputs[entity]) + "\n") + + +def retrieve_and_write_output_files(batches_dirs, bucket, files_dict, output_names, output_file, + entities, entity_type, keep_all_entities): + num_outputs = len(output_names) + num_entities = len(entities) + entity_outputs = {entity: [""] * num_outputs for entity in entities} # empty if entities is empty + logging.info("Writing %s" % output_file) + with open(output_file, 'w') as out: + out.write(entity_type + "\t" + "\t".join(output_names) + "\n") + for batch, batch_dir in batches_dirs: + logging.info("Searching for outputs for %s" % batch) + batch_outputs = find_batch_output_files(batch, bucket, batch_dir, files_dict, output_names, num_outputs) + if num_entities > 0: + update_entity_outputs(output_names, batch_outputs, entities, entity_outputs) + else: + batch_line = format_batch_line(batch, output_names, batch_outputs) + out.write(batch_line) + if num_entities > 0: + write_entity_outputs(entity_outputs, keep_all_entities, entities, out) + logging.info("Done!") + + +# Main function +def main(): + parser = argparse.ArgumentParser() + group = parser.add_mutually_exclusive_group(required=True) + group.add_argument("-w", "--workflows-file", + help="TSV file (no header) with batch (or sample) names and workflow IDs (one workflow " + "per batch). Either -i or -w required.") + group.add_argument("-i", "--workflow-id", + help="Workflow ID provided directly on the command line; alternative to -w if only " + "one workflow. Either -i or -w required.") + parser.add_argument("-f", "--filenames", required=True, + help="JSON file with workflow output file names (for column names in output TSV) and a " + "unique filename suffix expected for each workflow output. " + "Format is { \"output_file_name\": \"unique_file_suffix\" }.") + parser.add_argument("-o", "--output-file", required=True, help="Output file path to create") + parser.add_argument("-b", "--bucket", required=True, + help="Google bucket path to search for files - should include all subdirectories " + "preceding the workflow ID, including the workflow name.") + parser.add_argument("-l", "--log-level", required=False, default="INFO", + help="Specify level of logging information, ie. info, warning, error (not case-sensitive). " + "Default: INFO") + parser.add_argument("-e", "--entities-file", required=False, + help="Newline-separated text file of entity (ie. sample, batch) names (no header). " + "Entity here refers to units, like samples within a batch or batches within a cohort, " + "for which the workflow(s) produced outputs; the script expects one output per entity " + "for all outputs, with the filename containing the entity ID provided in the entities " + "file. Output will have one line per entity in the order provided. " + "If multiple batches, outputs will be concatenated and order may be affected.") + parser.add_argument("-t", "--entity-type", required=False, default="batch", + help="Entity type (ie. sample, batch) of each line of output. If using -e, then define " + "what each entity name in the file is (ie. a sample, a batch). Otherwise, define " + "what each workflow corresponds to. This type will be the first column name. 
" + "Default: batch") + parser.add_argument("-k", "--keep-all-entities", required=False, default=False, action='store_true', + help="With --entities-file, output a line for every entity, even if none of the " + "output files are found.") + args = parser.parse_args() + + # Set logging level from -l input + log_level = args.log_level + numeric_level = getattr(logging, log_level.upper(), None) + if not isinstance(numeric_level, int): + raise ValueError('Invalid log level: %s' % log_level) + logging.basicConfig(level=numeric_level, format='%(levelname)s: %(message)s') + + # Set required arguments. Validate existence of & read filenames JSON + filenames, output_file, bucket = args.filenames, args.output_file, args.bucket # required + check_file_nonempty(filenames) + files_dict, output_names = load_filenames(filenames) + + # Determine workflow IDs from -w or -i arguments. Get subdirectories + workflows, workflow_id = args.workflows_file, args.workflow_id + if workflows is not None: + check_file_nonempty(workflows) + batches_dirs, bucket = get_batch_dirs(workflows, workflow_id, bucket) + + # Set entity arguments and read entities file + entity_type, entities_file, keep_all_entities = args.entity_type, args.entities_file, args.keep_all_entities + entities = read_entities_file(entities_file) + + # Core functionality + retrieve_and_write_output_files(batches_dirs, bucket, files_dict, output_names, output_file, + entities, entity_type, keep_all_entities) + + +if __name__ == "__main__": + main() diff --git a/scripts/docker/build_docker.py b/scripts/docker/build_docker.py index 665b86210..c9042a4fd 100644 --- a/scripts/docker/build_docker.py +++ b/scripts/docker/build_docker.py @@ -1,6 +1,9 @@ -import argparse, os, os.path +import argparse +import os +import os.path import time -import tempfile, shutil +import tempfile +import shutil from termcolor import colored ############################################################ @@ -16,90 +19,98 @@ ############################################################ # dummy Exceptions for use when OS docker build fails + + class UserError(Exception): pass + class DockerBuildError(Exception): pass ############################################################ # parsing and checking arguments + + class CMD_line_args_parser: def __init__(self, args_list): - parser = argparse.ArgumentParser(description='='*50 + "\nBuilding docker images for GATK-SV pipeline v1.\n" + '='*50, + parser = argparse.ArgumentParser(description='=' * 50 + "\nBuilding docker images for GATK-SV pipeline v1.\n" + '=' * 50, formatter_class=argparse.RawDescriptionHelpFormatter) # required arguments - required_args_group = parser.add_argument_group('Required', 'required arguments') + required_args_group = parser.add_argument_group( + 'Required', 'required arguments') required_args_group.add_argument('--targets', - nargs = '+', - type = str, - required = True, - help = 'the sub project docker(s) you want to build (note "all" does not include melt)') + nargs='+', + type=str, + required=True, + help='the sub project docker(s) you want to build (note "all" does not include melt)') required_args_group.add_argument('--image-tag', - type = str, - required = True, - help = 'tag to be applied to all images being built') + type=str, + required=True, + help='tag to be applied to all images being built') # to build from local or remote git tag/hash values - git_args_group = parser.add_argument_group('Mutex args', 'remote git tag/hash values (mutually exclusive)') + git_args_group = parser.add_argument_group( + 
'Mutex args', 'remote git tag/hash values (mutually exclusive)') git_mutex_args_group = git_args_group.add_mutually_exclusive_group() git_mutex_args_group.add_argument('--remote-git-tag', - type = str, - help = 'release tag on Github; this indicates pulling from Github to a staging dir') + type=str, + help='release tag on Github; this indicates pulling from Github to a staging dir') git_mutex_args_group.add_argument('--remote-git-hash', - type = str, - help = 'a hash value on Github; this indicates pulling from Github to a staging dir') + type=str, + help='a hash value on Github; this indicates pulling from Github to a staging dir') # to build from remote/Github (staging required) remote_git_args_group = parser.add_argument_group('Remote git', 'args involved when building from remote git tags/hashes') remote_git_args_group.add_argument('--staging-dir', - type = str, - help = 'a temporary staging directory to store builds; required only when pulling from Github, ignored otherwise') + type=str, + help='a temporary staging directory to store builds; required only when pulling from Github, ignored otherwise') remote_git_args_group.add_argument('--use-ssh', - action = 'store_true', - help = 'use SSH to pull from github') + action='store_true', + help='use SSH to pull from github') # flag to turn on push to Dockerhub and/or GCR docker_remote_args_group = parser.add_argument_group('Docker push', 'controlling behavior related pushing dockers to remote repos') docker_remote_args_group.add_argument('--gcr-project', - type = str, - help = 'GCR billing project to push the images to') + type=str, + help='GCR billing project to push the images to. If not given, the ' + 'built image(s) will not be pushed to GCR.') docker_remote_args_group.add_argument('--update-latest', - action = 'store_true', - help = 'also update \"latest\" tag in remote docker repo(s)') + action='store_true', + help='also update \"latest\" tag in remote docker repo(s)') # flag to turn off git protection (default mode is refusing to build when there are untracked files and/or uncommitted changes) parser.add_argument('--disable-git-protect', - action = 'store_true', - help = 'disable git check/protect when building from local files (will use uncommited changes to build)') + action='store_true', + help='disable git check/protect when building from local files (will use uncommitted changes to build)') parser.add_argument('--skip-base-image-build', - action = 'store_true', - help = 'skip rebuild of the target\'s base image(s). Assumes that the base image(s) already exist with same tag.') + action='store_true', + help='skip rebuild of the target\'s base image(s). Assumes that the base image(s) already exist with same tag.') parser.add_argument('--skip-cleanup', - action = 'store_true', - help = 'skip cleanup after successful and unsuccessful build attempts.
This will speed up subsequent builds.') # parse and consistency check parsed_args = parser.parse_args(args_list) CMD_line_args_parser.consistency_check(parsed_args) self.project_args = parsed_args - # cmd line args consistency check + @staticmethod def consistency_check(argparse_namespace_obj): @@ -110,32 +121,40 @@ def consistency_check(argparse_namespace_obj): if ("all" in argparse_namespace_obj.targets): if 1 != len(argparse_namespace_obj.targets): - raise UserError("when \"all\" is provided, no other target values allowed") + raise UserError( + "when \"all\" is provided, no other target values allowed") # if "use_ssh" flag is turned on, remote git tag/hash should be provided if (argparse_namespace_obj.use_ssh is True): if (argparse_namespace_obj.remote_git_tag is None) and (argparse_namespace_obj.remote_git_hash is None): - raise UserError("\"use_ssh\" is specified but remote git tag/hash is not") + raise UserError( + "\"use_ssh\" is specified but remote git tag/hash is not") # if remote git tag/hash and/or is specified, staging dir should be specified if (argparse_namespace_obj.remote_git_tag is not None) or (argparse_namespace_obj.remote_git_hash is not None): if argparse_namespace_obj.staging_dir is None: - raise UserError("remote git tag/hash is specified but staging_dir is not") + raise UserError( + "remote git tag/hash is specified but staging_dir is not") # if requesting to update "latest" tag in remote docker repo(s), remote git release tag must be specified if (argparse_namespace_obj.update_latest is True): if argparse_namespace_obj.remote_git_tag is None: - raise UserError("publishing \"latest\" docker images requires a remote Github release tag") + raise UserError( + "publishing \"latest\" docker images requires a remote Github release tag") # if there're un-committed changes when building from local files, raise exception if (argparse_namespace_obj.staging_dir is None and not argparse_namespace_obj.disable_git_protect): - s = os.popen("git status -s | wc -l | tr -d ' ' | tr -d '\n'").read() + s = os.popen( + "git status -s | wc -l | tr -d ' ' | tr -d '\n'").read() ret = int(s) if 0 != ret: - raise UserError("Current directory has uncommited changes or untracked files. Cautiously refusing to proceed.") + raise UserError( + "Current directory has uncommited changes or untracked files. Cautiously refusing to proceed.") ############################################################ # controlling the build and push of a single image + + class Docker_Build: NON_PUBLIC_DOCKERS = ('melt') @@ -150,7 +169,7 @@ def __init__(self, name, tag, build_context, remote_docker_repos): def build(self, built_time_args_dict): # get to the requested directory - docker_build_command = "cd " + self.build_context + " && \\\n" + docker_build_command = "cd " + self.build_context + " && \\\n" # standard build command docker_build_command += "docker build --progress plain \\\n " docker_build_command += "--tag " + self.name + ":" + self.tag + " \\\n " @@ -158,7 +177,8 @@ def build(self, built_time_args_dict): for key, value in built_time_args_dict.items(): docker_build_command += "--build-arg " + key + "=" + value + " \\\n " - will_push = 0!=len(self.remote_docker_repos) and any(e is not None for e in self.remote_docker_repos) + will_push = 0 != len(self.remote_docker_repos) and any( + e is not None for e in self.remote_docker_repos) docker_build_command += "--squash . " if (will_push) else ". 
" # build and time it @@ -166,7 +186,8 @@ def build(self, built_time_args_dict): start_time = time.time() ret = os.system(docker_build_command) if 0 != ret: - raise DockerBuildError("Failed to build image " + self.name + ":" + self.tag) + raise DockerBuildError( + "Failed to build image " + self.name + ":" + self.tag) elapsed_time = time.time() - start_time elapsed_min, elapsed_sec = divmod(elapsed_time, 60) print("Time spent on docker build:") @@ -176,32 +197,37 @@ def push(self, is_update_latest): for rep in self.remote_docker_repos: # do not push images with very restrictive licenses - if (self.name in Docker_Build.NON_PUBLIC_DOCKERS) and ( not rep.startswith('us.gcr.io') ): - print(colored("Refusing to push non-public image " + self.name + " to " + rep, "red")) + if (self.name in Docker_Build.NON_PUBLIC_DOCKERS) and (not rep.startswith('us.gcr.io')): + print(colored("Refusing to push non-public image " + + self.name + " to " + rep, "red")) next remote_tag = "latest" if (is_update_latest) else self.tag - docker_tag_command = "docker tag " + self.name + ":" + self.tag + " " + rep + "/" + self.name + ":" + remote_tag + docker_tag_command = "docker tag " + self.name + ":" + \ + self.tag + " " + rep + "/" + self.name + ":" + remote_tag docker_push_command = "docker push " + rep + "/" + self.name + ":" + remote_tag print(docker_tag_command) print(docker_push_command) ret = os.system(docker_tag_command) if 0 != ret: - raise DockerBuildError("Failed to tag image for pushing to remote") + raise DockerBuildError( + "Failed to tag image for pushing to remote") ret = os.system(docker_push_command) if 0 != ret: raise DockerBuildError("Failed to push image") ############################################################ # controlling the build and push of all requested images + + class Project_Build: INTERM_RESOURCE_IMG = 'gatksv-pipeline-v1-resources' - GITHUB_ORG = 'broadinstitute' - GITHUB_REPO = 'gatk-sv' + GITHUB_ORG = 'broadinstitute' + GITHUB_REPO = 'gatk-sv' - #### for constructing an ordered build chain, to resolve dependency + # for constructing an ordered build chain, to resolve dependency DEP_DICT = {'delly': None, 'manta': None, 'melt': None, 'wham': None, 'sv-base-mini': None, 'samtools-cloud': 'sv-base-mini', @@ -219,25 +245,27 @@ class Project_Build: 'sv-pipeline-children-r': 3, 'sv-pipeline-rdtest': 4, 'sv-pipeline-qc': 4} # for use when a single target is to be built + @staticmethod def get_ordered_build_chain_single(target_name): chain = () t = target_name - while ( Project_Build.DEP_DICT[t] is not None): + while (Project_Build.DEP_DICT[t] is not None): t = Project_Build.DEP_DICT[t] chain = (t,) + chain return chain + (target_name,) # for use when multiple targets are to be built + @staticmethod def get_ordered_build_chain_list(target_name_list): dup_chain = [] for t in target_name_list: - dup_chain.extend( list(Project_Build.get_ordered_build_chain_single(t)) ) + dup_chain.extend( + list(Project_Build.get_ordered_build_chain_single(t))) agg_chain = list(set(dup_chain)) # uniquify - build_chain = sorted(agg_chain, key=lambda name : Project_Build.BUILD_PRIORITY[name]) - return tuple(build_chain) # immutable, at least attempt - - + build_chain = sorted( + agg_chain, key=lambda name: Project_Build.BUILD_PRIORITY[name]) + return tuple(build_chain) # immutable, at least attempt def __init__(self, project_arguments, launch_script_path): @@ -250,7 +278,6 @@ def __init__(self, project_arguments, launch_script_path): self.working_dir = None self.successfully_built_images = [] - def 
go_to_workdir(self): tmp_dir_path = None @@ -265,9 +292,11 @@ def go_to_workdir(self): else: # if staging is required, mkdir, cd, and pull if self.project_arguments.staging_dir.endswith("/"): - tmp_dir_path = tempfile.mkdtemp(prefix = self.project_arguments.staging_dir) + tmp_dir_path = tempfile.mkdtemp( + prefix=self.project_arguments.staging_dir) else: - tmp_dir_path = tempfile.mkdtemp(prefix = self.project_arguments.staging_dir + "/") + tmp_dir_path = tempfile.mkdtemp( + prefix=self.project_arguments.staging_dir + "/") connect_mode = "git@github.com:" if self.project_arguments.use_ssh else "https://github.com" ret = os.system("git clone " + connect_mode + @@ -281,18 +310,19 @@ def go_to_workdir(self): # checkout desired hash or tag, if building remotely if (self.project_arguments.remote_git_tag is not None): git_checkout_cmd = "git checkout tags/" + self.project_arguments.remote_git_tag - ret = os.system( git_checkout_cmd ) + ret = os.system(git_checkout_cmd) if 0 != ret: - raise UserError("Seems that the provided git tag [" - + self.project_arguments.remote_git_tag - + "] does not exist") + raise UserError("Seems that the provided git tag [" + + self.project_arguments.remote_git_tag + + "] does not exist") elif (self.project_arguments.remote_git_hash is not None): - git_checkout_cmd = "git checkout " + self.project_arguments.remote_git_hash - ret = os.system( git_checkout_cmd ) + git_checkout_cmd = "git checkout " + \ + self.project_arguments.remote_git_hash + ret = os.system(git_checkout_cmd) if 0 != ret: - raise UserError("Seems that the provided git hash [" - + self.project_arguments.remote_git_hash - + "] does not exist") + raise UserError("Seems that the provided git hash [" + + self.project_arguments.remote_git_hash + + "] does not exist") print("Working directory: " + os.getcwd()) return tmp_dir_path @@ -300,15 +330,19 @@ def go_to_workdir(self): def build_and_push(self): # start docker daemon, if one hasn't been started yet - os.system("open --background -a Docker && while ! docker system info > /dev/null 2>&1; do sleep 1; done") + os.system( + "open --background -a Docker && while ! docker system info > /dev/null 2>&1; do sleep 1; done") # prepare resources docker print(colored('#################################################', 'magenta')) a = colored("Building intermediate resource image", "grey") - b = colored(Project_Build.INTERM_RESOURCE_IMG + ":" + self.project_arguments.image_tag, "yellow", attrs=['bold']) + b = colored(Project_Build.INTERM_RESOURCE_IMG + ":" + + self.project_arguments.image_tag, "yellow", attrs=['bold']) c = colored(" ...", "grey") print(a, b, c) - resource_docker_build_cmd = "docker build -t " + Project_Build.INTERM_RESOURCE_IMG + ":" + self.project_arguments.image_tag + " -f scripts/docker/resources.Dockerfile ." + resource_docker_build_cmd = "docker build -t " + Project_Build.INTERM_RESOURCE_IMG + \ + ":" + self.project_arguments.image_tag + \ + " -f scripts/docker/resources.Dockerfile ." 
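+        # For example, with --image-tag my-tag (an illustrative tag, not a default), the assembled
+        # command would be:
+        #   docker build -t gatksv-pipeline-v1-resources:my-tag -f scripts/docker/resources.Dockerfile .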
print(resource_docker_build_cmd) os.system(resource_docker_build_cmd) print(colored('#################################################', 'magenta')) @@ -316,11 +350,13 @@ def build_and_push(self): # if build all, easy # otherwise construct build chain on the fly if "all" in self.project_arguments.targets: - expanded_build_targets = tuple(target for target in INDIVIDUAL_TARGET_VALUES if target != "melt") + expanded_build_targets = tuple( + target for target in INDIVIDUAL_TARGET_VALUES if target != "melt") elif self.project_arguments.skip_base_image_build: expanded_build_targets = self.project_arguments.targets else: - expanded_build_targets = Project_Build.get_ordered_build_chain_list(tuple(self.project_arguments.targets)) + expanded_build_targets = Project_Build.get_ordered_build_chain_list( + tuple(self.project_arguments.targets)) print("Building the following targets in order:") print(expanded_build_targets) print(colored('#################################################', 'magenta')) @@ -328,44 +364,46 @@ def build_and_push(self): for proj in expanded_build_targets: a = colored("Building image ", "grey") - b = colored(proj + ":" + self.project_arguments.image_tag, "yellow", attrs=['bold']) + b = colored(proj + ":" + self.project_arguments.image_tag, + "yellow", attrs=['bold']) c = colored(" ...", "grey") print(a, b, c) build_time_args = {} if (proj == "sv-base" or proj == "samtools-cloud"): build_time_args = { - "MINIBASE_IMAGE" : "sv-base-mini:" + self.project_arguments.image_tag + "MINIBASE_IMAGE": "sv-base-mini:" + self.project_arguments.image_tag } elif (proj == "cnmops" or proj == "sv-pipeline-base"): build_time_args = { - "GATKSV_PIPELINE_V1_RESOURCES_IMAGE" : Project_Build.INTERM_RESOURCE_IMG + ":" + self.project_arguments.image_tag, - "SVBASE_IMAGE" : "sv-base:" + self.project_arguments.image_tag + "GATKSV_PIPELINE_V1_RESOURCES_IMAGE": Project_Build.INTERM_RESOURCE_IMG + ":" + self.project_arguments.image_tag, + "SVBASE_IMAGE": "sv-base:" + self.project_arguments.image_tag } elif (proj == "sv-pipeline-children-r"): build_time_args = { - "SV_PIPELINE_BASE_IMAGE" : "sv-pipeline-base:" + self.project_arguments.image_tag + "SV_PIPELINE_BASE_IMAGE": "sv-pipeline-base:" + self.project_arguments.image_tag } elif (proj == "sv-pipeline-base"): build_time_args = { - "GATKSV_PIPELINE_V1_RESOURCES_IMAGE" : Project_Build.INTERM_RESOURCE_IMG + ":" + self.project_arguments.image_tag, - "SVBASE_IMAGE" : "sv-base:" + self.project_arguments.image_tag + "GATKSV_PIPELINE_V1_RESOURCES_IMAGE": Project_Build.INTERM_RESOURCE_IMG + ":" + self.project_arguments.image_tag, + "SVBASE_IMAGE": "sv-base:" + self.project_arguments.image_tag } elif (proj == "sv-pipeline"): build_time_args = { - "GATKSV_PIPELINE_V1_RESOURCES_IMAGE" : Project_Build.INTERM_RESOURCE_IMG + ":" + self.project_arguments.image_tag, - "SV_PIPELINE_BASE_IMAGE" : "sv-pipeline-base:" + self.project_arguments.image_tag + "GATKSV_PIPELINE_V1_RESOURCES_IMAGE": Project_Build.INTERM_RESOURCE_IMG + ":" + self.project_arguments.image_tag, + "SV_PIPELINE_BASE_IMAGE": "sv-pipeline-base:" + self.project_arguments.image_tag } elif (proj.startswith("sv-pipeline")): build_time_args = { - "SV_PIPELINE_BASE_R_IMAGE" : "sv-pipeline-children-r:" + self.project_arguments.image_tag + "SV_PIPELINE_BASE_R_IMAGE": "sv-pipeline-children-r:" + self.project_arguments.image_tag } build_context = self.working_dir + "/dockerfiles/" + proj remote_docker_repos = [] if (self.project_arguments.gcr_project is not None): - remote_docker_repos.append("us.gcr.io/" + 
self.project_arguments.gcr_project) + remote_docker_repos.append( + "us.gcr.io/" + self.project_arguments.gcr_project) docker = Docker_Build(proj, self.project_arguments.image_tag, build_context, remote_docker_repos) @@ -385,7 +423,8 @@ def cleanup(self, tmp_dir_path): os.system(clean_dangling_images) # clean intermediate image that are not to be pushed - os.system("docker rmi --force " + Project_Build.INTERM_RESOURCE_IMG + ":" + self.project_arguments.image_tag) + os.system("docker rmi --force " + Project_Build.INTERM_RESOURCE_IMG + + ":" + self.project_arguments.image_tag) # "rm -rf" staging dir, if was specified if(self.project_arguments.staging_dir is not None) and (tmp_dir_path is not None): @@ -394,6 +433,8 @@ def cleanup(self, tmp_dir_path): ############################################################ # static function to control the build and cleanup + + def parse_and_build(parsed_project_arguments, launch_script_path): import pprint @@ -401,24 +442,24 @@ def parse_and_build(parsed_project_arguments, launch_script_path): pprint.pprint(vars(parsed_project_arguments)) print("") - my_build_project = Project_Build(parsed_project_arguments, launch_script_path) + my_build_project = Project_Build( + parsed_project_arguments, launch_script_path) possible_tmp_dir_path = my_build_project.go_to_workdir() try: my_build_project.build_and_push() except UserError as a: - raise Exception("Build Process Errored due to an assertion error!!!\n" - + str(a)) + raise Exception("Build Process Errored due to an assertion error!!!\n" + str(a)) except DockerBuildError as d: - raise Exception("Build Process Errored due to a docker build error!!!\n" - + str(d)) + raise Exception("Build Process Errored due to a docker build error!!!\n" + str(d)) finally: if not my_build_project.project_arguments.skip_cleanup: my_build_project.cleanup(possible_tmp_dir_path) ############################################################ + if __name__ == "__main__": import sys if 1 == len(sys.argv): diff --git a/scripts/inputs/build_inputs.py b/scripts/inputs/build_inputs.py index 3d9eb5354..0745b9bba 100755 --- a/scripts/inputs/build_inputs.py +++ b/scripts/inputs/build_inputs.py @@ -1,8 +1,9 @@ #!/usr/bin/env python # -*- coding: utf-8 -*- -import sys -import argparse, os, os.path +import argparse +import os +import os.path import glob import json from jinja2 import Environment, FileSystemLoader, Undefined @@ -53,50 +54,60 @@ # this class drops logs undefined value references in the "undefined_names" list undefined_names = [] + + class TrackMissingValuesUndefined(Undefined): def __init__(self, *args, **kwargs): super().__init__(*args, **kwargs) - #dict.__init__(self, fname=fname) + # dict.__init__(self, fname=fname) if 'name' in self._undefined_obj: - undefined_names.append(self._undefined_obj['name'] + "." + self._undefined_name) + undefined_names.append( + self._undefined_obj['name'] + "." 
+ self._undefined_name) else: undefined_names.append(self._undefined_name) - #self._fail_with_undefined_error() + # self._fail_with_undefined_error() def __str__(self): return "" + def to_json_custom(value, *args, **kwargs): if isinstance(value, Undefined): return "UNDEFINED" else: return json.dumps(value, *args, **kwargs) + def main(): parser = argparse.ArgumentParser( description=__doc__, formatter_class=argparse.RawDescriptionHelpFormatter) - parser.add_argument("input_values_directory", help="Directory containing input value map JSON files") - parser.add_argument("template_path", help = "Path to template directory or file (directories will be processed recursively)") - parser.add_argument("output_directory", help = "Directory to create output files in") - parser.add_argument('-a', '--aliases', type=json.loads, default={}, help="Aliases for input value bundles") + parser.add_argument("input_values_directory", + help="Directory containing input value map JSON files") + parser.add_argument( + "template_path", help="Path to template directory or file (directories will be processed recursively)") + parser.add_argument("output_directory", + help="Directory to create output files in") + parser.add_argument('-a', '--aliases', type=json.loads, + default={}, help="Aliases for input value bundles") args = parser.parse_args() # prepare input values and bundle aliases input_directory = args.input_values_directory input_files = glob.glob(input_directory + "/*.json") - raw_input_bundles = {os.path.splitext(os.path.basename(input_file))[0]:json.load(open(input_file, "r")) for input_file in input_files} + raw_input_bundles = {os.path.splitext(os.path.basename(input_file))[ + 0]: json.load(open(input_file, "r")) for input_file in input_files} raw_input_bundles['test_batch_empty'] = {} raw_input_bundles['test_batch_empty']['name'] = 'test_batch' raw_input_bundles['single_sample_none'] = {} raw_input_bundles['single_sample_none']['name'] = 'single_sample' - default_aliases = { 'dockers' : 'dockers', - 'ref_panel' : 'ref_panel_v1b', - 'reference_resources' : 'resources_hg38', - 'test_batch' : 'test_batch_empty', - 'single_sample' : 'single_sample_none'} + default_aliases = {'dockers': 'dockers', + 'ref_panel': 'ref_panel_v1b', + 'reference_resources': 'resources_hg38', + 'test_batch': 'test_batch_empty', + 'single_sample': 'single_sample_none'} # prepare the input_dict using default, document default, and user-specified aliases input_dict = {} @@ -108,20 +119,21 @@ def main(): for alias in user_aliases: input_dict[alias] = raw_input_bundles[user_aliases[alias]] - template_path = args.template_path target_directory = args.output_directory if os.path.isdir(template_path): process_directory(input_dict, template_path, target_directory) else: - process_file(input_dict, os.path.dirname(template_path), os.path.basename(template_path), target_directory) + process_file(input_dict, os.path.dirname(template_path), + os.path.basename(template_path), target_directory) def process_directory(input_dict, template_dir, target_directory): template_dir_split = template_dir.split(os.sep) template_root = template_dir_split[len(template_dir_split) - 1] - template_base = os.sep.join(template_dir_split[0:len(template_dir_split) - 1]) + template_base = os.sep.join( + template_dir_split[0:len(template_dir_split) - 1]) target_dir_split = target_directory.split(os.sep) target_root = target_dir_split[len(target_dir_split) - 1] target_base = os.sep.join(target_dir_split[0:len(target_dir_split) - 1]) @@ -129,7 +141,8 @@ def 
process_directory(input_dict, template_dir, target_directory): stripped_subdir = subdir[(len(template_base) + len(os.sep)):] stripped_subdir = stripped_subdir[(len(template_root) + len(os.sep)):] if len(stripped_subdir) > 0: - target_subdir = os.sep.join([target_base, target_root, stripped_subdir]) + target_subdir = os.sep.join( + [target_base, target_root, stripped_subdir]) else: target_subdir = os.sep.join([target_base, target_root]) for file in fileList: @@ -142,17 +155,20 @@ def process_file(input_dict, template_subdir, template_file, target_subdir): # only process files that end with .tmpl if not template_file.endswith(".tmpl"): - print("WARNING: skipping file " + template_file_path + " because it does not have .tmpl extension") + print("WARNING: skipping file " + template_file_path + + " because it does not have .tmpl extension") return target_file = template_file.rsplit('.', 1)[0] target_file_path = os.sep.join([target_subdir, target_file]) - env = Environment(loader=FileSystemLoader(template_subdir), undefined=TrackMissingValuesUndefined) + env = Environment(loader=FileSystemLoader(template_subdir), + undefined=TrackMissingValuesUndefined) env.policies['json.dumps_function'] = to_json_custom print(template_file_path + " -> " + target_file_path) processed_content = env.get_template(template_file).render(input_dict) if len(undefined_names) > 0: - print("WARNING: skipping file " + template_file_path + " due to missing values " + str(undefined_names)) + print("WARNING: skipping file " + template_file_path + + " due to missing values " + str(undefined_names)) else: os.makedirs(target_subdir, exist_ok=True) target_file = open(target_file_path, "w") @@ -161,4 +177,4 @@ def process_file(input_dict, template_subdir, template_file, target_subdir): if __name__ == "__main__": - main() \ No newline at end of file + main() diff --git a/scripts/test/check_gs_urls.py b/scripts/test/check_gs_urls.py index 3fd79c286..698929e5c 100644 --- a/scripts/test/check_gs_urls.py +++ b/scripts/test/check_gs_urls.py @@ -26,44 +26,52 @@ GOOGLE_STORAGE = 'gs' # Checks if the string is a Google bucket URL + + def is_gcs_url(str): - return urlparse(str).scheme == GOOGLE_STORAGE + return urlparse(str).scheme == GOOGLE_STORAGE # Checks if the object exists in GCS + + def check_gcs_url(source_uri, client, project_id): - def _parse_uri(uri): - parsed = urlparse(uri) - bucket_name = parsed.netloc - bucket_object = parsed.path[1:] - return bucket_name, bucket_object - source_bucket_name, source_blob_name = _parse_uri(source_uri) - source_bucket = client.bucket(source_bucket_name, user_project=project_id) - source_blob = source_bucket.blob(source_blob_name) - try: - if not source_blob.exists(): - print(f"{source_uri} not found") - return False - except exceptions.BadRequest as e: - print(e) - return True + def _parse_uri(uri): + parsed = urlparse(uri) + bucket_name = parsed.netloc + bucket_object = parsed.path[1:] + return bucket_name, bucket_object + source_bucket_name, source_blob_name = _parse_uri(source_uri) + source_bucket = client.bucket(source_bucket_name, user_project=project_id) + source_blob = source_bucket.blob(source_blob_name) + try: + if not source_blob.exists(): + print(f"{source_uri} not found") + return False + except exceptions.BadRequest as e: + print(e) + return True # Main function + + def main(): - parser = argparse.ArgumentParser() - parser.add_argument("inputs_json") - parser.add_argument("--project-id", required=False, help="Project ID to charge for requester pays buckets") - args = 
parser.parse_args() + parser = argparse.ArgumentParser() + parser.add_argument("inputs_json") + parser.add_argument("--project-id", required=False, + help="Project ID to charge for requester pays buckets") + args = parser.parse_args() + + with open(args.inputs_json, 'r') as f: + client = storage.Client() + inputs = json.load(f) + for x in inputs: + if isinstance(inputs[x], str) and is_gcs_url(inputs[x]): + check_gcs_url(inputs[x], client, args.project_id) + elif isinstance(inputs[x], list): + for y in inputs[x]: + if is_gcs_url(y): + check_gcs_url(y, client, args.project_id) + - with open(args.inputs_json, 'r') as f: - client = storage.Client() - inputs = json.load(f) - for x in inputs: - if isinstance(inputs[x], str) and is_gcs_url(inputs[x]): - check_gcs_url(inputs[x], client, args.project_id) - elif isinstance(inputs[x], list): - for y in inputs[x]: - if is_gcs_url(y): - check_gcs_url(y, client, args.project_id) - -if __name__== "__main__": - main() +if __name__ == "__main__": + main() diff --git a/scripts/test/compare_files.py b/scripts/test/compare_files.py new file mode 100644 index 000000000..1ac77b6c2 --- /dev/null +++ b/scripts/test/compare_files.py @@ -0,0 +1,274 @@ +import argparse +import gzip +import json +import os +from metadata import ITaskOutputFilters, Metadata +from subprocess import DEVNULL, STDOUT, check_call + + +# For coloring the prints; see the following SO +# answer for details: https://stackoverflow.com/a/287944/947889 +COLOR_ENDC = "\033[0m" +COLOR_ULINE = "\033[04m" +COLOR_BLINKING = "\033[05m" +COLOR_RED = "\033[91m" +COLOR_GREEN = "\033[92m" +COLOR_YELLOW = "\033[93m" + + +class FilterBasedOnExtensions(ITaskOutputFilters): + + def __init__(self, extensions): + self.extensions = extensions + + def filter(self, metadata, outputs): + """ + Iterates through the outputs of a task and + filters the outputs whose file type match + the types subject to comparison (i.e., + types defined in filetypes_to_compare). + + :return: An array of the filtered outputs. + """ + filtered_outputs = {} + if not isinstance(outputs, list): + outputs = [outputs] + + for task_output in outputs: + if not isinstance(task_output, str): + # Happens when output is not a file, + # e.g., when it is a number. + continue + for ext in self.extensions: + if task_output.endswith(ext): + if ext not in filtered_outputs: + filtered_outputs[ext] = [] + filtered_outputs[ext].append(task_output) + return filtered_outputs + + +class BaseCompareAgent: + def __init__(self, working_dir=None): + self.working_dir = working_dir + + def get_filename(self, obj): + return obj.replace("gs://", os.path.join(self.working_dir, "")) + + def get_obj(self, obj): + """ + Ensures the given Google Cloud Storage + object (obj) is available in the working + directory, and returns its filename in + the working directory. + """ + raise NotImplementedError + + +class VCFCompareAgent(BaseCompareAgent): + def __init__(self, working_dir=None): + super(VCFCompareAgent, self).__init__(working_dir) + + # Delimiter + self.d = "\t" + self.id_col = 2 + + def get_obj(self, obj): + """ + Ensures the given VCF object is + available in the working directory: + if it exists, returns its filename, and + If it does not, downloads the VCF object and + returns its filename. 
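+
+        As a sketch of the path convention (paths are illustrative):
+        if the agent was created with working_dir="/tmp/wd", the object
+        "gs://bucket/a.vcf.gz" is staged to "/tmp/wd/bucket/a.vcf.gz"
+        with `gsutil -m cp` on the first request and reused on
+        subsequent calls.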
+ """ + filename = self.get_filename(obj) + if not os.path.isfile(filename): + if not os.path.isfile(filename): + check_call( + ["gsutil", "-m", "cp", obj, filename], + stdout=DEVNULL, stderr=STDOUT) + return filename + + def equals(self, x, y): + """ + Gets two VCF objects (Google Cloud Storage URI), + x and y, and returns true if files are identical, + and false if otherwise. Additionally, it returns the + compared files. + """ + x = self.get_obj(x) + y = self.get_obj(y) + + with gzip.open(x, "rt", encoding="utf-8") as X, \ + gzip.open(y, "rt", encoding="utf-8") as Y: + for x_line, y_line in zip(X, Y): + if x_line.startswith("#") and y_line.startswith("#"): + continue + + x_columns = x_line.strip().split(self.d) + y_columns = y_line.strip().split(self.d) + + if len(x_columns) != len(y_columns): + return False, x, y + + if any(x_columns[c] != y_columns[c] + for c in range(0, len(x_columns)) + if c != self.id_col): + return False, x, y + return True, x, y + + +class CompareWorkflowOutputs: + def __init__(self, working_dir): + self.working_dir = working_dir + self.filetypes_to_compare = { + "vcf.gz": VCFCompareAgent(self.working_dir) + } + + def get_mismatches(self, reference_metadata, + target_metadata, + traverse_sub_workflows=False): + """ + Takes two metadata files (both belonging to a common + workflow execution), iterates through the outputs of + their task, downloads the objects if not already exist + in the working directory, compares the corresponding + files, and returns the files that do not match. + """ + def record_compare_result(match, reference, target): + if not match: + if call not in mismatches: + mismatches[call] = [] + mismatches[call].append([reference, target]) + + # First we define a method that takes a list + # of a task outputs, and keeps only those that + # are files and their extension match the + # file types that we want to compare + # (e.g., filter only VCF files). + filter_method = FilterBasedOnExtensions( + self.filetypes_to_compare.keys()).filter + + # Then we create two instances of the Metadata + # class, one for each metadata file, and we + # invoke the `get_outputs` method which traverses + # the outputs of task, and returns those filtered + # by the above-defined filter. + ref_output_files = Metadata(reference_metadata).get_outputs( + traverse_sub_workflows, filter_method) + test_output_files = Metadata(target_metadata).get_outputs( + traverse_sub_workflows, filter_method) + + mismatches = {} + i = 0 + + r_t = ref_output_files.keys() - test_output_files.keys() + t_r = test_output_files.keys() - ref_output_files.keys() + if r_t or t_r: + print(f"\n{COLOR_BLINKING}WARNING!{COLOR_ENDC}") + print(f"The reference and test metadata files differ " + f"in their outputs; " + f"{COLOR_ULINE}the differences will be skipped.{COLOR_ENDC}") + if r_t: + print(f"\t{len(r_t)}/{len(ref_output_files.keys())} " + f"outputs of the reference are not in the test:") + for x in r_t: + print(f"\t\t- {x}") + if t_r: + print(f"\t{len(t_r)}/{len(test_output_files.keys())} " + f"outputs of the test are not in the reference:") + for x in t_r: + print(f"\t\t- {x}") + print("\n") + + [ref_output_files.pop(x) for x in r_t] + print(f"{COLOR_YELLOW}Comparing {len(ref_output_files)} " + f"files that are common between reference and test " + f"metadata files and their respective task is executed " + f"successfully.{COLOR_ENDC}") + for call, ref_outputs in ref_output_files.items(): + i += 1 + matched = True + print(f"Comparing\t{i}/{len(ref_output_files)}\t{call} ... 
", end="") + for extension, objs in ref_outputs.items(): + if len(objs) != len(test_output_files[call][extension]): + record_compare_result(False, objs, test_output_files[call][extension]) + matched = False + continue + for idx, obj in enumerate(objs): + equals, x, y = \ + self.filetypes_to_compare[extension].equals( + obj, test_output_files[call][extension][idx]) + record_compare_result(equals, x, y) + if not equals: + matched = False + if matched: + print(f"{COLOR_GREEN}match{COLOR_ENDC}") + else: + print(f"{COLOR_RED}mismatch{COLOR_ENDC}") + return mismatches + + +def main(): + parser = argparse.ArgumentParser( + description="Takes two cromwell metadata files as input, " + "reference and target, compares their corresponding " + "output files, and reports the files that do not match. " + "The two metadata files should belong to the execution " + "of a common workflow (e.g., one workflow with different " + "inputs). The script requires `gsutil` and `gzip` to be " + "installed and configured. If the output of a task is an " + "array of files, the reference and target arrays are " + "expected to be in the same order." + "\n\n" + "The currently supported file types are as follows." + "\n\t- VCF (.vcf.gz): The non-header lines of VCF files" + "are compared; except for the ID column, all the other " + "columns of a variation are expected to be identical. " + "The two files are expected to be equally ordered (i.e., " + "n-th variation in one file is compared to the " + "n-th variation on the other file).", + formatter_class=argparse.RawTextHelpFormatter) + + parser.add_argument( + "reference_metadata", + help="Reference cromwell metadata file.") + parser.add_argument( + "target_metadata", + help="Target cromwell metadata file.") + parser.add_argument( + "-w", "--working_dir", + help="The directory where the files will " + "be downloaded; default is the " + "invocation directory.") + parser.add_argument( + "-o", "--output", + help="Output file to store mismatches " + "(in JSON format); defaults to `output.json`.") + parser.add_argument( + "-d", "--deep", + action="store_true", + help="Include sub-workflows traversing the metadata files.") + + args = parser.parse_args() + + wd = args.working_dir if args.working_dir else "." + comparer = CompareWorkflowOutputs(wd) + mismatches = comparer.get_mismatches( + args.reference_metadata, + args.target_metadata, + args.deep) + + if len(mismatches) == 0: + print(f"{COLOR_GREEN}All the compared files matched.{COLOR_ENDC}") + else: + print(f"{COLOR_RED}{len(mismatches)} of the compared files did not match.{COLOR_ENDC}") + output_file = \ + args.output if args.output else \ + os.path.join(wd, "output.json") + with open(output_file, "w") as f: + json.dump(mismatches, f, indent=2) + print(f"Mismatches are persisted in {output_file}.") + + +if __name__ == '__main__': + main() diff --git a/scripts/test/metadata.py b/scripts/test/metadata.py new file mode 100644 index 000000000..ffb048e0e --- /dev/null +++ b/scripts/test/metadata.py @@ -0,0 +1,129 @@ +import json +import types +from abc import ABC, abstractmethod + + +class ITaskOutputFilters(ABC): + """ + An interface that should be implemented by + custom filtering methods to be used with Metadata. + + This design follows the principles of strategy pattern, + where a custom method can be used to augment the default + behavior of an algorithm. Here, this design is used to + decouple the filtering of tasks outputs (e.g., only extract + files with certain extension) from metadata traversal. 
+ """ + + @abstractmethod + def filter(self, metadata, outputs): + """ + How to filter the output of a task. + + Note that the method is stateful; i.e., + it has references to both self and to + the instance of Metadata class that + invokes this method. + + :param metadata: `self` of the instance + of the Metadata class that calls this method. + + :param outputs: The values of a key in the + `outputs` field in a metadata file. e.g., + `metadata` in the following object is `a.vcf`: + 'outputs': {'merged': 'a.vcf'} + + :return: Filtered task outputs. + """ + pass + + +class Metadata: + """ + Implements utilities for traversing, processing, and + querying the resulting metadata (in JSON) of running + a workflow on Cromwell. + """ + def __init__(self, filename): + self.filename = filename + + @staticmethod + def _get_output_label(parent_workflow, workflow, output_var, shard_index): + """ + Composes a label for a task output. + :return: Some examples of constructed labels are: + - GATKSVPipelineSingleSample.Module00c.Module00c.PreprocessPESR.std_manta_vcf + - Module00c.PreprocessPESR.PreprocessPESR.StandardizeVCFs.std_vcf.0 + """ + return \ + ((parent_workflow + ".") if parent_workflow else "") + \ + f"{workflow}.{output_var}" + \ + (("." + str(shard_index)) if shard_index != -1 else "") + + @staticmethod + def _get_filtered_outputs(outputs): + return outputs + + def _traverse_outputs(self, calls, parent_workflow="", deep=False): + output_files = {} + + def update_output_files(outputs): + if run["executionStatus"] == "Done" and len(outputs) > 0: + output_files[self._get_output_label( + parent_workflow, workflow, out_label, + run["shardIndex"])] = outputs + + for workflow, runs in calls.items(): + for run in runs: + if "outputs" in run: + for out_label, out_files in run["outputs"].items(): + if not out_files: + continue + update_output_files(self._get_filtered_outputs(out_files)) + if deep and "subWorkflowMetadata" in run: + output_files.update( + self._traverse_outputs( + run["subWorkflowMetadata"]["calls"], + workflow, deep)) + return output_files + + def get_outputs(self, include_sub_workflows=False, filter_method=None): + """ + Iterates through a given cromwell metadata file + and filters the output files to be compared. + + :param include_sub_workflows: Boolean, if set to True, + output files generated in sub-workflows will be traversed. + + :param filter_method: A method to override the default + filter method. This method should be implement the + ITaskOutputFilters interface. Every traversed output of tasks + will be passed to this method, and this method's returned + value will be aggregated and returned. For instance, see + FilterBasedOnExtensions class for how the filter method can + be used to extract files with certain extension from the + metadata. + + :return: A dictionary with keys and values being a composite label + for tasks outputs and the values of the task output, respectively. 
+ For instance (serialized to JSON and simplified for brevity): + { + "GATKSVPipelineSingleSample.FilterMelt.out":{ + "vcf.gz":["NA12878.melt.NA12878.vcf.gz"] + } + } + """ + if filter_method: + if not issubclass(type(filter_method.__self__), + ITaskOutputFilters): + raise TypeError(f"The class {type(filter_method.__self__)} " + f"should implement the interface " + f"{ITaskOutputFilters}.") + self._get_filtered_outputs = types.MethodType(filter_method, self) + + with open(self.filename, "r") as metadata_file: + metadata = json.load(metadata_file) + output_files = self._traverse_outputs( + metadata["calls"], + deep=include_sub_workflows) + return output_files diff --git a/scripts/test/terra_validation.py b/scripts/test/terra_validation.py new file mode 100644 index 000000000..34bacc672 --- /dev/null +++ b/scripts/test/terra_validation.py @@ -0,0 +1,125 @@ +#!/bin/python + +import argparse +import json +import logging +import subprocess +from os import listdir +import os.path + +""" +Summary: Cursory validation of Terra input JSONs for pipeline workflows. + Checks if JSONs contain all required inputs and that + they do not contain extraneous inputs. Does not perform type-checking. + Also checks that all expected JSONs have been generated. + +Usage: + python scripts/test/terra_validation.py -d /path/to/base/dir -j /path/to/womtool/jar [optional inputs] + +Parameters: + /path/to/base/dir: path to base directory of gatk-sv repo + /path/to/womtool/jar: path to user's womtool jar file + Optional inputs: + -n,--num-input-jsons INT: override default expected number of Terra input + JSONs + --log-level LEVEL: specify level of logging information to print, + ie. INFO, WARNING, ERROR - not case-sensitive) + +Outputs: If successful, last line of printout should read + " of Terra input JSONs exist and passed validation." + Prior lines will detail the JSONs that were examined and any errors found. 
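+
+Example invocation (paths are illustrative; the womtool jar version will vary):
+    python scripts/test/terra_validation.py \
+        -d . -j ~/womtool-84.jar -n 19 --log-level WARNING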
+""" + +WDLS_PATH = "wdl/" +TERRA_INPUTS_PATH = "inputs/terra_workspaces/cohort_mode/workflow_configurations/" + + +def list_jsons(inputs_path, expected_num_jsons, subdir="", description=""): + jsons = [os.path.join(subdir, x) for x in listdir(os.path.join(inputs_path, subdir)) if x.endswith(".json")] + num_input_jsons = len(jsons) + if num_input_jsons < expected_num_jsons: + raise Exception(f"Expected {expected_num_jsons} Terra {description}input JSONs but found {num_input_jsons}.") + jsons.sort() + return jsons + + +def get_wdl_json_pairs(wdl_path, terra_inputs_path, expected_num_inputs): + jsons = list_jsons(terra_inputs_path, expected_num_inputs) + + for json_file in jsons: + path_to_wdl = os.path.join(wdl_path, os.path.basename(json_file).split(".")[0] + ".wdl") + if os.path.isfile(path_to_wdl): + yield path_to_wdl, os.path.join(terra_inputs_path, json_file) + else: + logging.warning(f"Can't find WDL corresponding to {os.path.basename(json_file)} at {path_to_wdl}.") + + +def validate_terra_json(wdl, terra_json, womtool_jar): + womtool_command = "java -jar " + womtool_jar + " inputs " + wdl + womtool_json = subprocess.run(womtool_command, shell=True, stdout=subprocess.PIPE) + + with open(terra_json, 'r') as terra_json_file: + womtool_inputs = json.loads(womtool_json.stdout) + terra_inputs = json.load(terra_json_file) + wdl_name = os.path.basename(wdl) + + valid = True + for inp in womtool_inputs: + if "optional" in womtool_inputs[inp]: + continue + elif inp not in terra_inputs: + logging.error(f"Missing input: Required input {inp} for {wdl_name} missing from {terra_json}") + valid = False + + for inp in terra_inputs: + if inp not in womtool_inputs: + logging.error(f"Unexpected input: {terra_json} contains unexpected input {inp} for {wdl_name}") + valid = False + + if valid: + logging.info(f"PASS: {terra_json} is a valid Terra input JSON for {wdl_name}") + + return int(valid) # return 1 if valid, 0 if not valid + + +def validate_all_terra_jsons(base_dir, womtool_jar, expected_num_inputs): + successes = 0 + for wdl, json_file in get_wdl_json_pairs(os.path.join(base_dir, WDLS_PATH), + os.path.join(base_dir, TERRA_INPUTS_PATH), + expected_num_inputs): + successes += validate_terra_json(wdl, json_file, womtool_jar) + + print("\n") + if successes != expected_num_inputs: + raise RuntimeError(f"{successes} of {expected_num_inputs} Terra input JSONs passed validation.") + else: + print(f"{successes} of {expected_num_inputs} Terra input JSONs passed validation.") + + +# Main function +def main(): + parser = argparse.ArgumentParser() + parser.add_argument("-d", "--base-dir", help="Relative path to base of gatk-sv repo", required=True) + parser.add_argument("-j", "--womtool-jar", help="Path to womtool jar", required=True) + parser.add_argument("-n", "--num-input-jsons", + help="Number of Terra input JSONs expected", + required=False, default=19, type=int) + parser.add_argument("--log-level", + help="Specify level of logging information, ie. 
info, warning, error (not case-sensitive)", + required=False, default="INFO") + args = parser.parse_args() + + # get args as variables + base_dir, womtool_jar, log_level = args.base_dir, args.womtool_jar, args.log_level + expected_num_inputs = args.num_input_jsons + + numeric_level = getattr(logging, log_level.upper(), None) + if not isinstance(numeric_level, int): + raise ValueError('Invalid log level: %s' % log_level) + logging.basicConfig(level=numeric_level, format='%(levelname)s: %(message)s') + + validate_all_terra_jsons(base_dir, womtool_jar, expected_num_inputs) + + +if __name__ == "__main__": + main() diff --git a/scripts/test/validate.sh b/scripts/test/validate.sh index 2e389abcf..e88d6302b 100755 --- a/scripts/test/validate.sh +++ b/scripts/test/validate.sh @@ -14,9 +14,10 @@ set -e function usage() { printf "Usage: \n \ - %s -d -j \n \ + %s -d -j [-t] \n \ \t path to gatk-sv base directory \n \ - \t path to womtool jar (downloaded from https://github.com/broadinstitute/cromwell/releases) \n" "$1" + \t path to womtool jar (downloaded from https://github.com/broadinstitute/cromwell/releases) \n \ + [-t] \t optional flag to run validation on Terra cohort mode input JSONs in addition to test inputs \n" "$1" } if [[ "$#" == 0 ]]; then @@ -26,10 +27,11 @@ fi ################################################# # Parsing arguments ################################################# -while getopts "j:d:" option; do +while getopts "j:d:t" option; do case "$option" in j) WOMTOOL_JAR="$OPTARG" ;; d) BASE_DIR="$OPTARG" ;; + t) TERRA_VALIDATION=true ;; *) usage "$0" && exit 1 ;; esac done @@ -81,3 +83,10 @@ fi echo "" echo "#############################################################" echo "${COUNTER} TESTS PASSED SUCCESSFULLY!" + +if [ "$TERRA_VALIDATION" = true ]; then + echo "" + echo "#############################################################" + echo "RUNNING TERRA INPUT VALIDATION NOW" + eval "python3 ${BASE_DIR}/scripts/test/terra_validation.py -d ${BASE_DIR} -j ${WOMTOOL_JAR}" +fi diff --git a/src/sv-pipeline/scripts/vcf_qc/analyze_fams.R b/src/sv-pipeline/scripts/vcf_qc/analyze_fams.R index e11ef86db..76c705628 100755 --- a/src/sv-pipeline/scripts/vcf_qc/analyze_fams.R +++ b/src/sv-pipeline/scripts/vcf_qc/analyze_fams.R @@ -1228,10 +1228,10 @@ masterInhWrapper <- function(fam.dat.list,fam.type, gq=T){ ###RSCRIPT FUNCTIONALITY ######################## ###Load libraries as needed -require(optparse) -require(beeswarm) -require(vioplot) -require(zoo) +require(optparse, quietly=T) +require(beeswarm, quietly=T) +require(vioplot, quietly=T) +require(zoo, quietly=T) ###List of command-line options option_list <- list( @@ -1263,12 +1263,12 @@ svtypes.file <- opts$svtypes multiallelics <- opts$multiallelics # #Dev parameters -# dat.in <- "~/scratch/xfer/gnomAD_v2_SV_MASTER_resolved_VCF.VCF_sites.stats.bed.gz" -# famfile.in <- "~/scratch/xfer/cleaned.fam" -# perSampDir <- "~/scratch/xfer/gnomAD_v2_SV_MASTER_resolved_VCF_perSample_VIDs_merged/" +# dat.in <- "~/scratch/gnomAD-SV_v3.chr19_to_22.v1.VCF_sites.stats.bed.gz" +# famfile.in <- "~/scratch/cleaned.fam" +# perSampDir <- "~/scratch/gnomAD-SV_v3.chr19_to_22.v1_perSample_VIDs_merged/gnomAD-SV_v3.chr19_to_22.v1_perSample_VID_lists/" # OUTDIR <- "~/scratch/famQC_plots_test/" # # OUTDIR <- "~/scratch/VCF_plots_test/" -# svtypes.file <- "~/Desktop/Collins/Talkowski/code/sv-pipeline/ref/vcf_qc_refs/SV_colors.txt" +# svtypes.file <- 
"/Users/collins/Desktop/Collins/Talkowski/NGS/SV_Projects/gnomAD_v3/gnomad-sv-v3-qc//src/sv-pipeline/scripts/vcf_qc/SV_colors.txt" # multiallelics <- F ###Prepares I/O files @@ -1277,7 +1277,7 @@ dat <- read.table(dat.in,comment.char="",sep="\t",header=T,check.names=F) colnames(dat)[1] <- "chr" #Restrict data to autosomes only, and exclude multiallelics (if optioned) allosome.exclude.idx <- which(!(dat$chr %in% c(1:22,paste("chr",1:22,sep="")))) -multi.exclude.idx <- which(dat$other_gts>0) +multi.exclude.idx <- which(dat$other_gts>0 | dat$svtype %in% c("CNV", "MCNV")) cat(paste("NOTE: only autosomes considered during transmission analyses. Excluded ", prettyNum(length(allosome.exclude.idx),big.mark=","),"/", prettyNum(nrow(dat),big.mark=",")," (", @@ -1295,10 +1295,11 @@ if(multiallelics==F){ prettyNum(nrow(dat),big.mark=",")," (", round(100*length(all.exclude.idx)/nrow(dat),1), "%) of all variants due to autosomal and/or multiallelic filters.\n",sep="")) - - dat <- dat[-all.exclude.idx,] }else{ - dat <- dat[-allosome.exclude.idx,] + all.exclude.idx <- allosome.exclude.idx +} +if(length(all.exclude.idx) > 0){ + dat <- dat[-all.exclude.idx,] } cat(paste("NOTE: retained ", prettyNum(nrow(dat),big.mark=","), @@ -1337,14 +1338,10 @@ if(!dir.exists(paste(OUTDIR,"/supporting_plots/sv_inheritance_plots/",sep=""))){ ###Performs trio analyses, if any trios exist if(nrow(trios)>0){ - #Downsample to 100 trios if necessary - if(nrow(trios)>100){ - trios <- trios[sample(1:nrow(trios),100,replace=F),] - } #Read data trio.dat <- apply(trios[,2:4],1,function(IDs){ IDs <- as.character(IDs) - return(getFamDat(dat=dat,proband=IDs[1],father=IDs[2],mother=IDs[3],biallelic=!multiallelics)) + return(getFamDat(dat=dat,proband=IDs[1], father=IDs[2], mother=IDs[3], biallelic=!multiallelics)) }) names(trio.dat) <- trios[,1] diff --git a/src/sv-pipeline/scripts/vcf_qc/collectQC.external_benchmarking.sh b/src/sv-pipeline/scripts/vcf_qc/collectQC.external_benchmarking.sh index 97dc7c9c8..a39149814 100755 --- a/src/sv-pipeline/scripts/vcf_qc/collectQC.external_benchmarking.sh +++ b/src/sv-pipeline/scripts/vcf_qc/collectQC.external_benchmarking.sh @@ -13,7 +13,7 @@ set -e usage(){ cat < ${QCTMP}/svtypes.txt +# Gather list of external BEDs to compare +for bed in $( find ${BENCHDIR} -name "*.bed.gz" ); do + prefix=$( basename $bed | sed 's/\.bed\.gz//g' ) + echo -e "${prefix}\t${bed}" +done > comparator_beds.tsv ###GATHER EXTERNAL BENCHMARKING @@ -107,61 +114,23 @@ cut -f1 ${SVTYPES} | sort | uniq > ${QCTMP}/svtypes.txt if [ ${QUIET} == 0 ]; then echo -e "$( date ) - VCF QC STATUS: Starting external benchmarking" fi -#1000G (Sudmant) with allele frequencies -if [ ${COMPARATOR} == "1000G_Sudmant" ]; then - for pop in ALL AFR AMR EAS EUR SAS; do - #Print status - if [ ${QUIET} == 0 ]; then - echo -e "$( date ) - VCF QC STATUS: Benchmarking ${pop} samples in ${COMPARATOR}" - fi - ${BIN}/compare_callsets.sh \ - -O ${QCTMP}/1000G_Sudmant.SV.${pop}.overlaps.bed \ - -p 1000G_Sudmant_${pop}_Benchmarking_SV \ - ${STATS} \ - ${BENCHDIR}/1000G_Sudmant.SV.${pop}.bed.gz - cp ${QCTMP}/1000G_Sudmant.SV.${pop}.overlaps.bed \ - ${OUTDIR}/data/1000G_Sudmant.SV.${pop}.overlaps.bed - bgzip -f ${OUTDIR}/data/1000G_Sudmant.SV.${pop}.overlaps.bed - tabix -f ${OUTDIR}/data/1000G_Sudmant.SV.${pop}.overlaps.bed.gz - done -fi -#ASC (Werling) with carrier frequencies -if [ ${COMPARATOR} == "ASC_Werling" ]; then - for pop in ALL EUR OTH; do - #Print status - if [ ${QUIET} == 0 ]; then - echo -e "$( date ) - VCF QC STATUS: Benchmarking ${pop} samples 
in ${COMPARATOR}" - fi - ${BIN}/compare_callsets.sh -C \ - -O ${QCTMP}/ASC_Werling.SV.${pop}.overlaps.bed \ - -p ASC_Werling_${pop}_Benchmarking_SV \ - ${STATS} \ - ${BENCHDIR}/ASC_Werling.SV.${pop}.bed.gz - cp ${QCTMP}/ASC_Werling.SV.${pop}.overlaps.bed \ - ${OUTDIR}/data/ASC_Werling.SV.${pop}.overlaps.bed - bgzip -f ${OUTDIR}/data/ASC_Werling.SV.${pop}.overlaps.bed - tabix -f ${OUTDIR}/data/ASC_Werling.SV.${pop}.overlaps.bed.gz - done -fi -#HGSV (Chaisson) with carrier frequencies -if [ ${COMPARATOR} == "HGSV_Chaisson" ]; then - for pop in ALL AFR AMR EAS; do - #Print status - if [ ${QUIET} == 0 ]; then - echo -e "$( date ) - VCF QC STATUS: Benchmarking ${pop} samples in ${COMPARATOR}" - fi - ${BIN}/compare_callsets.sh -C \ - -O ${QCTMP}/HGSV_Chaisson.SV.${pop}.overlaps.bed \ - -p HGSV_Chaisson_${pop}_Benchmarking_SV \ - ${STATS} \ - ${BENCHDIR}/HGSV_Chaisson.SV.hg19_liftover.${pop}.bed.gz - cp ${QCTMP}/HGSV_Chaisson.SV.${pop}.overlaps.bed \ - ${OUTDIR}/data/HGSV_Chaisson.SV.${pop}.overlaps.bed - bgzip -f ${OUTDIR}/data/HGSV_Chaisson.SV.${pop}.overlaps.bed - tabix -f ${OUTDIR}/data/HGSV_Chaisson.SV.${pop}.overlaps.bed.gz - done -fi +while read prefix bed; do + #Print status + if [ ${QUIET} == 0 ]; then + echo -e "$( date ) - VCF QC STATUS: Benchmarking samples in ${prefix}" + fi + ${BIN}/compare_callsets.sh \ + -O ${QCTMP}/${prefix}.overlaps.bed \ + -p ${prefix}_Benchmarking_SV \ + ${STATS} \ + ${bed} \ + ${CONTIGS} + cp ${QCTMP}/${prefix}.overlaps.bed ${OUTDIR}/data/ + bgzip -f ${OUTDIR}/data/${prefix}.overlaps.bed + tabix -f ${OUTDIR}/data/${prefix}.overlaps.bed.gz +done < comparator_beds.tsv ###CLEAN UP rm -rf ${QCTMP} + diff --git a/src/sv-pipeline/scripts/vcf_qc/collectQC.perSample_benchmarking.sh b/src/sv-pipeline/scripts/vcf_qc/collectQC.perSample_benchmarking.sh index 3e1eaf291..6cd5ac7c6 100755 --- a/src/sv-pipeline/scripts/vcf_qc/collectQC.perSample_benchmarking.sh +++ b/src/sv-pipeline/scripts/vcf_qc/collectQC.perSample_benchmarking.sh @@ -13,13 +13,14 @@ set -e usage(){ cat < \ ${OVRTMP}/SET1_calls/${ID}.SET1.SV_calls.bed zcat ${VIDlist} | cut -f1 | fgrep -wf - <( zcat ${VCFSTATS} ) >> \ @@ -163,6 +176,7 @@ ${BIN}/compare_callsets_perSample.sh \ ${OVRTMP}/SET1_calls.tar.gz \ ${SET2} \ ${SAMPLES} \ + ${CONTIGS} \ ${OUTDIR} diff --git a/src/sv-pipeline/scripts/vcf_qc/compare_callsets.sh b/src/sv-pipeline/scripts/vcf_qc/compare_callsets.sh index b6cd31905..fe62947a7 100755 --- a/src/sv-pipeline/scripts/vcf_qc/compare_callsets.sh +++ b/src/sv-pipeline/scripts/vcf_qc/compare_callsets.sh @@ -12,13 +12,14 @@ set -e usage(){ cat < ${OVRTMP}/set1.bed - zcat ${SET1} | fgrep -v "#" | awk -v OFS="\t" '{ if ($3<$2) $3=$2; print }' | \ - sort -Vk1,1 -k2,2n -k3,3n | uniq >> ${OVRTMP}/set1.bed + zcat ${SET1} | fgrep -v "#" | awk -v OFS="\t" '{ if ($3<$2) $3=$2; print }' \ + | grep -f <( awk '{ print "^"$1"\t" }' ${CONTIGS} ) \ + | sed -e 's/MEI\|LINE1\|SVA\|ALU/INS/g' -e 's/MCNV/CNV/g' \ + | sort -Vk1,1 -k2,2n -k3,3n | uniq >> ${OVRTMP}/set1.bed else cat ${SET1} | fgrep "#" > ${OVRTMP}/set1.bed - cat ${SET1} | fgrep -v "#" | awk -v OFS="\t" '{ if ($3<$2) $3=$2; print }' | \ - sort -Vk1,1 -k2,2n -k3,3n | uniq >> ${OVRTMP}/set1.bed + cat ${SET1} | fgrep -v "#" | awk -v OFS="\t" '{ if ($3<$2) $3=$2; print }' \ + | grep -f <( awk '{ print "^"$1"\t" }' ${CONTIGS} ) \ + | sed -e 's/MEI\|LINE1\|SVA\|ALU/INS/g' -e 's/MCNV/CNV/g' \ + | sort -Vk1,1 -k2,2n -k3,3n | uniq >> ${OVRTMP}/set1.bed fi #Set carrierFrequency as final column, if optioned if [ ${CARRIER} == 1 ]; then - idx=$( head -n1 
${OVRTMP}/set1.bed | sed 's/\t/\n/g' | \ - awk '{ if ($1=="carrierFreq") print NR }' ) + idx=$( head -n1 ${OVRTMP}/set1.bed | sed 's/\t/\n/g' \ + | awk '{ if ($1=="carrierFreq") print NR }' ) awk -v FS="\t" -v OFS="\t" -v idx=${idx} \ '{ print $0, $(idx) }' ${OVRTMP}/set1.bed > \ ${OVRTMP}/set1.bed2 mv ${OVRTMP}/set1.bed2 ${OVRTMP}/set1.bed fi #Unzip & format SET2, if gzipped -if [ $( file ${SET2} | fgrep " gzip " | wc -l ) -gt 0 ]; then +if [ $( file ${SET2} | fgrep " gzip " | wc -l ) -gt 0 ] || \ + [ $( echo ${SET2} | awk -v FS="." '{ if ($NF ~ /gz|bgz/) print "TRUE" }' ) ]; then zcat ${SET2} | fgrep "#" | awk -v OFS="\t" \ '{ print $1, $2, $3, "VID", $4, $5, $6 }' > ${OVRTMP}/set2.bed - zcat ${SET2} | fgrep -v "#" | \ - awk -v OFS="\t" -v PREFIX=${PREFIX} \ - '{ if ($3<$2) $3=$2; print $1, $2, $3, PREFIX"_"NR, $4, $5, $6 }' | \ - sort -Vk1,1 -k2,2n -k3,3n | uniq >> ${OVRTMP}/set2.bed + zcat ${SET2} | fgrep -v "#" \ + | awk -v OFS="\t" -v PREFIX=${PREFIX} \ + '{ if ($3<$2) $3=$2; print $1, $2, $3, PREFIX"_"NR, $4, $5, $6 }' \ + | grep -f <( awk '{ print "^"$1"\t" }' ${CONTIGS} ) \ + | sed -e 's/MEI\|LINE1\|SVA\|ALU/INS/g' -e 's/MCNV/CNV/g' \ + | sort -Vk1,1 -k2,2n -k3,3n | uniq >> ${OVRTMP}/set2.bed else cat ${SET2} | fgrep "#" | awk -v OFS="\t" \ '{ print $1, $2, $3, "VID", $4, $5, $6 }' > ${OVRTMP}/set2.bed cat ${SET2} | fgrep -v "#" | \ awk -v OFS="\t" -v PREFIX=${PREFIX} \ - '{ if ($3<$2) $3=$2; print $1, $2, $3, PREFIX"_"NR, $4, $5, $6 }' | \ - sort -Vk1,1 -k2,2n -k3,3n | uniq >> ${OVRTMP}/set2.bed + '{ if ($3<$2) $3=$2; print $1, $2, $3, PREFIX"_"NR, $4, $5, $6 }' \ + | grep -f <( awk '{ print "^"$1"\t" }' ${CONTIGS} ) \ + | sed -e 's/MEI\|LINE1\|SVA\|ALU/INS/g' -e 's/MCNV/CNV/g' \ + | sort -Vk1,1 -k2,2n -k3,3n | uniq >> ${OVRTMP}/set2.bed fi ###RUN INTERSECTIONS -#Intersect method 1 data -bedtools intersect -loj -r -f 0.5 \ - -a <( awk -v small_cutoff=5000 -v OFS="\t" '{ if ($6>=small_cutoff) print $0 }' ${OVRTMP}/set2.bed ) \ - -b ${OVRTMP}/set1.bed > \ - ${OVRTMP}/OVR1.raw.bed -bedtools intersect -loj -r -f 0.1 \ - -a <( awk -v small_cutoff=5000 -v OFS="\t" '{ if ($6> \ - ${OVRTMP}/OVR1.raw.bed -#Intersect method 1a: 50% reciprocal overlap (10% for small SV), matching SV types -awk -v FS="\t" -v OFS="\t" \ -'{ if ($5==$12 || $5=="DUP" && $12=="MCNV" || $12=="DUP" && $5=="MCNV" || $5=="DEL" && $12=="MCNV" || $5=="MCNV" && $12=="DEL") print $4, $NF; else if ($12==".") print $4, "NO_OVR" }' \ -${OVRTMP}/OVR1.raw.bed | sort -Vk1,1 -k2,2n | uniq | \ -awk -v OFS="\t" '{ if ($2=="NA") $2="1"; print $1, $2 }' > \ -${OVRTMP}/OVR1a.raw.txt -cut -f1 ${OVRTMP}/OVR1a.raw.txt | fgrep -wvf - ${OVRTMP}/OVR1.raw.bed | \ -awk -v OFS="\t" '{ print $4, "NO_OVR" }' | sort -Vk1,1 -k2,2n | uniq >> \ -${OVRTMP}/OVR1a.raw.txt -sort -Vk1,1 -k2,2n ${OVRTMP}/OVR1a.raw.txt | uniq > ${OVRTMP}/OVR1a.raw.txt2 -mv ${OVRTMP}/OVR1a.raw.txt2 ${OVRTMP}/OVR1a.raw.txt -#Intersect method 1b: 50% reciprocal overlap (10% for small SV), any SV types -awk -v FS="\t" -v OFS="\t" \ -'{ if ($12==".") print $4, "NO_OVR"; else print $4, $NF }' \ -${OVRTMP}/OVR1.raw.bed | sort -Vk1,1 -k2,2n | uniq | \ -awk -v OFS="\t" '{ if ($2=="NA") $2="1"; print $1, $2 }' > \ -${OVRTMP}/OVR1b.raw.txt -cut -f1 ${OVRTMP}/OVR1b.raw.txt | fgrep -wvf - ${OVRTMP}/OVR1.raw.bed | \ -awk -v OFS="\t" '{ print $4, "NO_OVR" }' | sort -Vk1,1 -k2,2n | uniq >> \ -${OVRTMP}/OVR1b.raw.txt -sort -Vk1,1 -k2,2n ${OVRTMP}/OVR1b.raw.txt | uniq > ${OVRTMP}/OVR1b.raw.txt2 -mv ${OVRTMP}/OVR1b.raw.txt2 ${OVRTMP}/OVR1b.raw.txt -#Intersect method 2 data -bedtools 
intersect -loj -a ${OVRTMP}/set2.bed -b ${OVRTMP}/set1.bed > \ -${OVRTMP}/OVR2.raw.bed -#Intersect method 2a: any overlap, breakpoints within $DIST, matching SV types -awk -v FS="\t" -v OFS="\t" -v DIST=${DIST} \ -'{ if ($12!="." && ($2-$9<=DIST && $2-$9>=-DIST) && ($3-$10<=DIST && $3-$10>=-DIST) && ($5==$12 || $5=="DUP" && $12=="MCNV" || $12=="DUP" && $5=="MCNV" || $5=="DEL" && $12=="MCNV" || $5=="MCNV" && $12=="DEL")) print $4, $NF }' \ -${OVRTMP}/OVR2.raw.bed | sort -Vk1,1 -k2,2n | uniq | \ -awk -v OFS="\t" '{ if ($2=="NA") $2="1"; print $1, $2 }' > \ -${OVRTMP}/OVR2a.raw.txt -cut -f1 ${OVRTMP}/OVR2a.raw.txt | sort | uniq | fgrep -wvf - ${OVRTMP}/OVR2.raw.bed | \ -awk -v OFS="\t" '{ print $4, "NO_OVR" }' | sort -Vk1,1 -k2,2n | uniq >> \ -${OVRTMP}/OVR2a.raw.txt -sort -Vk1,1 -k2,2n ${OVRTMP}/OVR2a.raw.txt | uniq > ${OVRTMP}/OVR2a.raw.txt2 -mv ${OVRTMP}/OVR2a.raw.txt2 ${OVRTMP}/OVR2a.raw.txt -#Intersect method 2b: any overlap, breakpoints within $DIST, any SV types -awk -v FS="\t" -v OFS="\t" -v DIST=${DIST} \ -'{ if ($12!="." && ($2-$9<=DIST && $2-$9>=-DIST) && ($3-$10<=DIST && $3-$10>=-DIST)) print $4, $NF }' \ -${OVRTMP}/OVR2.raw.bed | sort -Vk1,1 -k2,2n | uniq | \ -awk -v OFS="\t" '{ if ($2=="NA") $2="1"; print $1, $2 }' > \ -${OVRTMP}/OVR2b.raw.txt -cut -f1 ${OVRTMP}/OVR2b.raw.txt | sort | uniq | fgrep -wvf - ${OVRTMP}/OVR2.raw.bed | \ -awk -v OFS="\t" '{ print $4, "NO_OVR" }' | sort -Vk1,1 -k2,2n | uniq >> \ -${OVRTMP}/OVR2b.raw.txt -sort -Vk1,1 -k2,2n ${OVRTMP}/OVR2b.raw.txt | uniq > ${OVRTMP}/OVR2b.raw.txt2 -mv ${OVRTMP}/OVR2b.raw.txt2 ${OVRTMP}/OVR2b.raw.txt -#Intersect method 3: any overlap, buffer ± $DIST, any svtype -bedtools intersect -loj -a ${OVRTMP}/set2.bed \ --b <( awk -v OFS="\t" -v DIST=${DIST} '{ $2=$2-DIST; $3=$3+DIST; print }' \ - ${OVRTMP}/set1.bed | awk -v OFS="\t" '{ if ($2<0) $2=0; print }' ) | \ -sort -Vk1,1 -k2,2n | uniq | awk -v FS="\t" -v OFS="\t" \ -'{ if ($12==".") print $4, "NO_OVR"; else print $4, $NF }' \ -| sort -Vk1,1 -k2,2n | uniq | \ -awk -v OFS="\t" '{ if ($2=="NA") $2="1"; print $1, $2 }' > \ -${OVRTMP}/OVR3.raw.txt - - -###CONVERT INTERSECTIONS TO FINAL TABLE -${BIN}/compare_callsets_helper.R \ - ${OVRTMP}/set2.bed \ - ${OVRTMP}/OVR1a.raw.txt \ - ${OVRTMP}/OVR1b.raw.txt \ - ${OVRTMP}/OVR2a.raw.txt \ - ${OVRTMP}/OVR2b.raw.txt \ - ${OVRTMP}/OVR3.raw.txt \ - ${OUTFILE} +# Check if any variants are left in set 2 (benchmarking set) after subsetting to contigs of interest +if [ $( cat ${OVRTMP}/set2.bed | fgrep -v "#" | wc -l ) -gt 0 ]; then + #Intersect method 1 data + bedtools intersect -loj -r -f 0.5 \ + -a <( awk -v small_cutoff=5000 -v OFS="\t" '{ if ($6>=small_cutoff) print $0 }' ${OVRTMP}/set2.bed ) \ + -b ${OVRTMP}/set1.bed > \ + ${OVRTMP}/OVR1.raw.bed + bedtools intersect -loj -r -f 0.1 \ + -a <( awk -v small_cutoff=5000 -v OFS="\t" '{ if ($6> \ + ${OVRTMP}/OVR1.raw.bed + #Intersect method 1a: 50% reciprocal overlap (10% for small SV), matching SV types + awk -v FS="\t" -v OFS="\t" \ + '{ if ($5==$12 \ + || $5=="DEL" && $12=="CNV" \ + || $5=="DUP" && $12=="CNV" || $12=="DUP" && $5=="INS" \ + || $5=="CNV" && $12=="DEL" || $5=="CNV" && $12=="DUP" || $5=="CNV" && $12=="INS" \ + || $5=="INS" && $12=="DUP" || $5=="INS" && $12=="CNV" \ + || $5=="INV" && $12=="CPX" || $5=="CPX" && $12=="INV") \ + print $4, $NF; else if ($12==".") print $4, "NO_OVR" }' \ + ${OVRTMP}/OVR1.raw.bed | sort -Vk1,1 -k2,2n | uniq | \ + awk -v OFS="\t" '{ if ($2=="NA") $2="1"; print $1, $2 }' > \ + ${OVRTMP}/OVR1a.raw.txt + cut -f1 ${OVRTMP}/OVR1a.raw.txt | fgrep -wvf - 
${OVRTMP}/OVR1.raw.bed | \ + awk -v OFS="\t" '{ print $4, "NO_OVR" }' | sort -Vk1,1 -k2,2n | uniq >> \ + ${OVRTMP}/OVR1a.raw.txt + sort -Vk1,1 -k2,2n ${OVRTMP}/OVR1a.raw.txt | uniq > ${OVRTMP}/OVR1a.raw.txt2 + mv ${OVRTMP}/OVR1a.raw.txt2 ${OVRTMP}/OVR1a.raw.txt + #Intersect method 1b: 50% reciprocal overlap (10% for small SV), any SV types + awk -v FS="\t" -v OFS="\t" \ + '{ if ($12==".") print $4, "NO_OVR"; else print $4, $NF }' \ + ${OVRTMP}/OVR1.raw.bed | sort -Vk1,1 -k2,2n | uniq | \ + awk -v OFS="\t" '{ if ($2=="NA") $2="1"; print $1, $2 }' > \ + ${OVRTMP}/OVR1b.raw.txt + cut -f1 ${OVRTMP}/OVR1b.raw.txt | fgrep -wvf - ${OVRTMP}/OVR1.raw.bed | \ + awk -v OFS="\t" '{ print $4, "NO_OVR" }' | sort -Vk1,1 -k2,2n | uniq >> \ + ${OVRTMP}/OVR1b.raw.txt + sort -Vk1,1 -k2,2n ${OVRTMP}/OVR1b.raw.txt | uniq > ${OVRTMP}/OVR1b.raw.txt2 + mv ${OVRTMP}/OVR1b.raw.txt2 ${OVRTMP}/OVR1b.raw.txt + #Intersect method 2 data + bedtools intersect -loj -a ${OVRTMP}/set2.bed -b ${OVRTMP}/set1.bed > \ + ${OVRTMP}/OVR2.raw.bed + #Intersect method 2a: any overlap, breakpoints within $DIST, matching SV types + awk -v FS="\t" -v OFS="\t" -v DIST=${DIST} \ + '{ if ($5==$12 \ + || $5=="DEL" && $12=="CNV" \ + || $5=="DUP" && $12=="CNV" || $12=="DUP" && $5=="INS" \ + || $5=="CNV" && $12=="DEL" || $5=="CNV" && $12=="DUP" || $5=="CNV" && $12=="INS" \ + || $5=="INS" && $12=="DUP" || $5=="INS" && $12=="CNV" \ + || $5=="INV" && $12=="CPX" || $5=="CPX" && $12=="INV") \ + print $4, $NF; else if ($12==".") print $4, "NO_OVR" }' \ + ${OVRTMP}/OVR2.raw.bed | sort -Vk1,1 -k2,2n | uniq | \ + awk -v OFS="\t" '{ if ($2=="NA") $2="1"; print $1, $2 }' > \ + ${OVRTMP}/OVR2a.raw.txt + cut -f1 ${OVRTMP}/OVR2a.raw.txt | sort | uniq | fgrep -wvf - ${OVRTMP}/OVR2.raw.bed | \ + awk -v OFS="\t" '{ print $4, "NO_OVR" }' | sort -Vk1,1 -k2,2n | uniq >> \ + ${OVRTMP}/OVR2a.raw.txt + sort -Vk1,1 -k2,2n ${OVRTMP}/OVR2a.raw.txt | uniq > ${OVRTMP}/OVR2a.raw.txt2 + mv ${OVRTMP}/OVR2a.raw.txt2 ${OVRTMP}/OVR2a.raw.txt + #Intersect method 2b: any overlap, breakpoints within $DIST, any SV types + awk -v FS="\t" -v OFS="\t" -v DIST=${DIST} \ + '{ if ($12!="." 
&& ($2-$9<=DIST && $2-$9>=-DIST) && ($3-$10<=DIST && $3-$10>=-DIST)) print $4, $NF }' \ + ${OVRTMP}/OVR2.raw.bed | sort -Vk1,1 -k2,2n | uniq | \ + awk -v OFS="\t" '{ if ($2=="NA") $2="1"; print $1, $2 }' > \ + ${OVRTMP}/OVR2b.raw.txt + cut -f1 ${OVRTMP}/OVR2b.raw.txt | sort | uniq | fgrep -wvf - ${OVRTMP}/OVR2.raw.bed | \ + awk -v OFS="\t" '{ print $4, "NO_OVR" }' | sort -Vk1,1 -k2,2n | uniq >> \ + ${OVRTMP}/OVR2b.raw.txt + sort -Vk1,1 -k2,2n ${OVRTMP}/OVR2b.raw.txt | uniq > ${OVRTMP}/OVR2b.raw.txt2 + mv ${OVRTMP}/OVR2b.raw.txt2 ${OVRTMP}/OVR2b.raw.txt + #Intersect method 3: any overlap, buffer ± $DIST, any svtype + bedtools intersect -loj -a ${OVRTMP}/set2.bed \ + -b <( awk -v OFS="\t" -v DIST=${DIST} '{ $2=$2-DIST; $3=$3+DIST; print }' \ + ${OVRTMP}/set1.bed | awk -v OFS="\t" '{ if ($2<0) $2=0; print }' ) | \ + sort -Vk1,1 -k2,2n | uniq | awk -v FS="\t" -v OFS="\t" \ + '{ if ($12==".") print $4, "NO_OVR"; else print $4, $NF }' \ + | sort -Vk1,1 -k2,2n | uniq | \ + awk -v OFS="\t" '{ if ($2=="NA") $2="1"; print $1, $2 }' > \ + ${OVRTMP}/OVR3.raw.txt + + ###CONVERT INTERSECTIONS TO FINAL TABLE + ${BIN}/compare_callsets_helper.R \ + ${OVRTMP}/set2.bed \ + ${OVRTMP}/OVR1a.raw.txt \ + ${OVRTMP}/OVR1b.raw.txt \ + ${OVRTMP}/OVR2a.raw.txt \ + ${OVRTMP}/OVR2b.raw.txt \ + ${OVRTMP}/OVR3.raw.txt \ + ${OUTFILE} + +# If no variants remain in set 2 after subsetting to contigs of interest, output empty OUTFILE +else + echo -e "#chr\tstart\tend\tVID\tsvtype\tlength\tAF\tovr1a\tovr1b\tovr2a\tovr2b\tovr3" > ${OUTFILE} +fi ###CLEAN UP rm -rf ${QCTMP} diff --git a/src/sv-pipeline/scripts/vcf_qc/compare_callsets_perSample.sh b/src/sv-pipeline/scripts/vcf_qc/compare_callsets_perSample.sh index 4ebc0fca5..bce248160 100755 --- a/src/sv-pipeline/scripts/vcf_qc/compare_callsets_perSample.sh +++ b/src/sv-pipeline/scripts/vcf_qc/compare_callsets_perSample.sh @@ -12,7 +12,7 @@ set -e usage(){ cat < \ @@ -234,10 +247,11 @@ else -p ${ID}_${PREFIX}_specificity \ -O ${OUTDIR}/${ID}.specificity.bed \ ${OVRTMP}/${ID}.SET2.spec.cleaned.bed.gz \ - ${OVRTMP}/${ID}.SET1.spec.cleaned.bed.gz - bgzip -f ${OUTDIR}/${ID}.specificity.bed - rm ${OVRTMP}/${ID}.SET1.spec.cleaned.bed.gz \ - ${OVRTMP}/${ID}.SET2.spec.cleaned.bed.gz + ${OVRTMP}/${ID}.SET1.spec.cleaned.bed.gz \ + ${CONTIGS} + bgzip -f ${OUTDIR}/${ID}.specificity.bed + rm ${OVRTMP}/${ID}.SET1.spec.cleaned.bed.gz \ + ${OVRTMP}/${ID}.SET2.spec.cleaned.bed.gz #Report counter, if relevant i=$(( ${i} + 1 )) @@ -249,6 +263,12 @@ else fi fi done < ${OVRTMP}/int_samples_and_paths.list + + # Report when complete + if [ ${QUIET} == 0 ]; then + echo -e "$( date ) - PER-SAMPLE COMPARISON STATUS: Finished comparisons for all ${nsamps} samples" + fi + fi diff --git a/src/sv-pipeline/scripts/vcf_qc/plotQC.external_benchmarking.helper.sh b/src/sv-pipeline/scripts/vcf_qc/plotQC.external_benchmarking.helper.sh index 8c78cefd4..61028bd0b 100755 --- a/src/sv-pipeline/scripts/vcf_qc/plotQC.external_benchmarking.helper.sh +++ b/src/sv-pipeline/scripts/vcf_qc/plotQC.external_benchmarking.helper.sh @@ -12,15 +12,14 @@ set -e usage(){ cat < input_data.tsv +cat input_data.tsv +while read compdat; do + outprefix=$( basename ${compdat} | sed 's/\.overlaps\.bed\.gz//g' ) if [ -e ${compdat} ] && [ -s ${compdat} ]; then #Print status - echo -e "$( date ) - VCF QC STATUS: Plotting benchmarking for ${pop} samples in ${COMPARATOR}" + echo -e "$( date ) - VCF QC STATUS: Plotting benchmarking for ${outprefix} subset from ${PREFIX}" #Plot benchmarking ${BIN}/plot_callset_comparison.R \ - ${carrierFlag} \ - -p 
${COMPARATOR}_${pop} \ + -p ${outprefix} \ ${compdat} \ - ${OUTDIR}/plots/${COMPARATOR}_${pop}_samples/ + ${OUTDIR}/plots/${outprefix}/ fi -done +done < input_data.tsv diff --git a/src/sv-pipeline/scripts/vcf_qc/plot_callset_comparison.R b/src/sv-pipeline/scripts/vcf_qc/plot_callset_comparison.R index aceda40b4..573559a23 100755 --- a/src/sv-pipeline/scripts/vcf_qc/plot_callset_comparison.R +++ b/src/sv-pipeline/scripts/vcf_qc/plot_callset_comparison.R @@ -119,21 +119,18 @@ categoryBreakdownByClass <- function(dat,norm=F){ "Nearby\n(+/- 250bp)","No Overlap") return(mat) } -#Extract best matching AF pairs +#Extract best-matching AF pairs (after excluding category 3) getFreqPairs <- function(dat){ - freqPairs <- as.data.frame(t(apply(dat[,7:ncol(dat)],1,function(vals){ - benchmark.AF <- as.numeric(vals[1]) - callset.AF <- NULL - # for(i in c(2,4,3,5,6)){ - for(i in c(4,2)){ - if(is.null(callset.AF)){ - if(!(vals[i]) %in% c("NS","NO_OVR")){ - callset.AF <- as.numeric(vals[i]) - } - } - } - if(is.null(callset.AF)){ + freqPairs <- as.data.frame(t(apply(dat[,7:11],1,function(vals){ + vals <- as.numeric(vals) + benchmark.AF <- vals[1] + callset.AFs <- vals[-1] + deltas <- abs(callset.AFs - benchmark.AF) + if(all(is.na(deltas))){ callset.AF <- NA + }else{ + best.match.AF <- min(deltas, na.rm=T) + callset.AF <- callset.AFs[head(which(deltas == best.match.AF), 1)] } return(c(benchmark.AF,callset.AF)) }))) @@ -993,18 +990,6 @@ OUTDIR <- args$args[2] prefix <- opts$prefix carrierFreqs <- opts$carrierFreqs -#Dev parameters -# INFILE <- "~/scratch/vcf_qc_output_pesr/data/ASC_Werling.SV.overlaps.bed.gz" -# prefix <- "ASC_Werling" -# carrierFreqs <- T -# INFILE <- "~/scratch/vcf_qc_output_rd/data/1000G_Sudmant.SV.overlaps.bed.gz" -# prefix <- "1000G_Sudmant" -# carrierFreqs <- F -# INFILE <- "~/scratch/vcf_qc_output_pesr/data/HGSV_Chaisson.SV.overlaps.bed.gz" -# prefix <- "HGSV_Chaisson" -# carrierFreqs <- T -# OUTDIR <- "~/scratch/callset_comparison_plots/" - ###Prepares I/O files #Read & clean data dat <- read.table(INFILE,comment.char="",sep="\t",header=T,check.names=F) diff --git a/src/sv-pipeline/scripts/vcf_qc/plot_perSample_benchmarking.R b/src/sv-pipeline/scripts/vcf_qc/plot_perSample_benchmarking.R index bada2766d..fc0cfa0bc 100755 --- a/src/sv-pipeline/scripts/vcf_qc/plot_perSample_benchmarking.R +++ b/src/sv-pipeline/scripts/vcf_qc/plot_perSample_benchmarking.R @@ -28,7 +28,7 @@ ovr.cat.cols <- c("#76E349","#4DAC26","#2C750E", ###GENERAL HELPER FUNCTIONS ########################### #Read overlap data for a list of samples -readMultiSampDat <- function(samples,measurement="sensitivity"){ +readMultiSampDat <- function(samples, measurement="sensitivity"){ #Iterate over samples dat <- lapply(samples,function(ID){ #Set path @@ -594,9 +594,11 @@ plotLinesByClass <- function(means,ci_adj,nsamp,xlab=NULL,ylab=NULL,title=NULL, #Add legend if(legend==T){ idx.for.legend <- which(apply(means,1,function(vals){any(!is.na(vals))})) - legend("bottomleft",bg="white",pch=19,cex=0.7*lab.cex,lwd=2, - legend=rownames(means)[idx.for.legend], - col=colors[idx.for.legend]) + if(length(idx.for.legend) > 0){ + legend("bottomleft",bg="white",pch=19,cex=0.7*lab.cex,lwd=2, + legend=rownames(means)[idx.for.legend], + col=colors[idx.for.legend]) + } } #Add cleanup boxes @@ -862,9 +864,9 @@ masterWrapper <- function(plotDat.all,compset.prefix){ ###RSCRIPT FUNCTIONALITY ######################## ###Load libraries as needed -require(optparse) -require(vioplot) -require(beeswarm) +require(optparse, quietly=T) +require(vioplot, 
quietly=T) +require(beeswarm, quietly=T) ###List of command-line options option_list <- list( @@ -892,11 +894,11 @@ OUTDIR <- args$args[4] compset.prefix <- opts$comparisonSetName # #Dev parameters -# perSampDir <- "/Users/rlc/Downloads/gnomAD_v2_SV_PCRPLUS_Q4_batch_1.manta_03_filtered_vcf_Werling_2018_WGS_results_merged/" -# samples.in <- "/Users/rlc/Downloads/gnomAD_v2_SV_PCRPLUS_Q4_batch_1.manta_03_filtered_vcf.shard..analysis_samples.list" +# perSampDir <- "/Users/collins/scratch/gnomAD-SV_v3.chr19_to_22.v1/" +# samples.in <- "/Users/collins/scratch/gnomAD-SV_v3.chr19_to_22.v1.chr19.shard.shard_.analysis_samples.list" # OUTDIR <- "~/scratch/perSample_benchmarking_plots_test/" -# svtypes.file <- "~/Desktop/Collins/Talkowski/code/sv-pipeline/ref/vcf_qc_refs/SV_colors.txt" -# compset.prefix <- "Sanders_2015_array" +# svtypes.file <- "/Users/collins/Desktop/Collins/Talkowski/NGS/SV_Projects/gnomAD_v3/gnomad-sv-v3-qc//src/sv-pipeline/scripts/vcf_qc/SV_colors.txt" +# compset.prefix <- "HGSV_Ebert_perSample" ###Read & process input data #Read list of samples @@ -926,7 +928,7 @@ plotDat.all <- lapply(list("sensitivity","specificity"),function(measurement){ measurement.lab <- paste(toupper(substr(measurement,1,1)),substr(measurement,2,nchar(measurement)),sep="") #Reads list of per-sample overlap data - dat <- readMultiSampDat(samples,measurement=measurement) + dat <- readMultiSampDat(samples, measurement=measurement) #Get number of samples nsamp <- length(dat) diff --git a/src/sv-pipeline/scripts/vcf_qc/plot_sv_vcf_distribs.R b/src/sv-pipeline/scripts/vcf_qc/plot_sv_vcf_distribs.R index d36723db2..67363c9d0 100755 --- a/src/sv-pipeline/scripts/vcf_qc/plot_sv_vcf_distribs.R +++ b/src/sv-pipeline/scripts/vcf_qc/plot_sv_vcf_distribs.R @@ -19,6 +19,7 @@ medium.max.size <- 2500 medlarge.max.size <- 10000 large.max.size <- 50000 huge.max.size <- 300000000 +sex.chroms <- c(1:22, paste("chr", 1:22, sep="")) ################### @@ -115,26 +116,27 @@ plotSVCountBars <- function(dat,svtypes,title=NULL,ylab="SV Count"){ } #Plot dot for fraction of total SV per chromosome plotDotsSVperChrom <- function(dat,svtypes,title=NULL,ylab="Fraction of SV Type"){ + contigs <- sort(unique(dat$chr)) #Compute table mat <- sapply(svtypes$svtype,function(svtype){ - counts <- sapply(c(1:22,"X","Y"),function(contig){ + counts <- sapply(contigs, function(contig){ length(which(dat$chr==contig & dat$svtype==svtype)) }) if(sum(counts,na.rm=T)>0){ return(counts/sum(counts)) }else{ - return(rep(0,24)) + return(rep(0, length(contigs))) } }) #Prep plotting area par(bty="n",mar=c(3,4.5,2.5,0.5)) - plot(x=c(0,24),y=c(0,1.15*max(mat)),type="n", + plot(x=c(0, length(contigs)),y=c(0,1.15*max(mat)),type="n", xaxt="n",yaxt="n",xlab="",ylab="",xaxs="i",yaxs="i") #Add axes and title - sapply(1:24,function(i){ - axis(1,at=i-0.5,tick=F,labels=c(1:22,"X","Y")[i],line=-0.8,cex.axis=0.7) + sapply(1:length(contigs),function(i){ + axis(1,at=i-0.5,tick=F,labels=contigs[i],line=-0.8,cex.axis=0.7) }) mtext(1,text="Chromosome",line=1.5) axis(2,at=axTicks(2),labels=NA) @@ -145,7 +147,7 @@ plotDotsSVperChrom <- function(dat,svtypes,title=NULL,ylab="Fraction of SV Type" #Plot per-svtype information sapply(1:ncol(mat),function(i){ - points(x=(1:24)-seq(0.8,0.2,by=-0.6/(nrow(svtypes)-1))[i], + points(x=(1:length(contigs))-seq(0.8,0.2,by=-0.6/(nrow(svtypes)-1))[i], y=mat[,i],type="l",lwd=0.5,col=adjustcolor(svtypes$color[i],alpha=0.5)) }) sapply(1:nrow(mat),function(i){ @@ -162,6 +164,7 @@ plotDotsSVperChrom <- function(dat,svtypes,title=NULL,ylab="Fraction 
of SV Type" legend("topright",legend=svtypes$svtype, pch=19,col=svtypes$color,cex=0.7,border=NA,bty="n") } + #Wrapper to plot all barplots of SV counts wrapperPlotAllCountBars <- function(){ #All SV @@ -376,14 +379,14 @@ wrapperPlotAllCountBars <- function(){ #####Size plots ############### #Plot single size distribution -plotSizeDistrib <- function(dat,svtypes,n.breaks=250,k=5, - min.size=50,max.size=1000000, - autosomal=F,biallelic=F, - title=NULL,legend=F,lwd.cex=1){ +plotSizeDistrib <- function(dat, svtypes, n.breaks=150, k=10, + min.size=50, max.size=1000000, + autosomal=F, biallelic=F, + title=NULL, legend=F, lwd.cex=1, text.cex=1){ #Filter/process sizes & compute range + breaks filter.legend <- NULL if(autosomal==T){ - dat <- dat[which(dat$chr %in% c(1:22,paste("chr",1:22,sep=""))),] + dat <- dat[which(dat$chr %in% sex.chroms),] filter.legend <- c(filter.legend,"Autosomal SV only") } if(biallelic==T){ @@ -412,7 +415,7 @@ plotSizeDistrib <- function(dat,svtypes,n.breaks=250,k=5, dens$ALL <- as.numeric(all.h$counts/length(all.vals)) #Prepare plot area - ylims <- c(0,quantile(unlist(dens),probs=0.995,na.rm=T)) + ylims <- c(0,quantile(unlist(dens),probs=0.99,na.rm=T)) dens <- lapply(dens,function(vals){ vals[which(vals>max(ylims))] <- max(ylims) return(vals) @@ -437,11 +440,11 @@ plotSizeDistrib <- function(dat,svtypes,n.breaks=250,k=5, axis(1,at=logscale.major,tck=-0.03,labels=NA) axis(1,at=logscale.minor,tick=F,cex.axis=0.8,line=-0.4,las=2, labels=logscale.minor.labs) - mtext(1,text="Size",line=2.25,cex=lwd.cex) + mtext(1,text="Size",line=2.25,cex=text.cex) axis(2,at=axTicks(2),tck=-0.025,labels=NA) axis(2,at=axTicks(2),tick=F,line=-0.4,cex.axis=0.8,las=2, labels=paste(round(100*axTicks(2),1),"%",sep="")) - mtext(2,text="Fraction of SV",line=2,cex=lwd.cex) + mtext(2,text="Fraction of SV",line=2,cex=text.cex) sapply(1:2,function(i){ axis(3,at=log10(c(300,6000))[i],labels=NA,tck=-0.01) axis(3,at=log10(c(300,6000))[i],tick=F,line=-0.9,cex.axis=0.8, @@ -449,7 +452,7 @@ plotSizeDistrib <- function(dat,svtypes,n.breaks=250,k=5, }) axis(3,at=log10(c(1000,2000)),labels=NA,tck=-0.01) axis(3,at=mean(log10(c(1000,2000))),tick=F,line=-0.9,cex.axis=0.8,labels="SVA",font=3) - mtext(3,line=1.5,text=title,font=2,cex=lwd.cex) + mtext(3,line=1.5,text=title,font=2,cex=text.cex) #Add points per SV type sapply(1:length(dens),function(i){ @@ -482,8 +485,17 @@ plotSizeDistrib <- function(dat,svtypes,n.breaks=250,k=5, #Add sv type legend if(legend==T){ idx.for.legend <- which(unlist(lapply(dens,function(vals){any(!is.na(vals) & !is.infinite(vals) & vals>0)}))) - legend("right",bg=NA,bty="n",pch=NA,cex=lwd.cex*0.7,lwd=3, - legend=rbind(svtypes,c("ALL","gray15"))$svtype[idx.for.legend], + counts.for.legend <- sapply(names(idx.for.legend), function(svtype){ + if(svtype == "ALL"){ + nrow(dat) + }else{ + length(which(dat$svtype==svtype)) + } + }) + legend("right",bg=NA,bty="n",pch=NA,cex=text.cex*0.7,lwd=3, + legend=paste(rbind(svtypes, c("ALL","gray15"))$svtype[idx.for.legend], + " (N=", prettyNum(counts.for.legend, big.mark=","), + ")", sep=""), col=rbind(svtypes,c("ALL","gray15"))$color[idx.for.legend]) } }else{ @@ -491,7 +503,7 @@ plotSizeDistrib <- function(dat,svtypes,n.breaks=250,k=5, plot(x=c(0,1),y=c(0,1),type="n", xaxt="n",yaxt="n",xlab="",ylab="",yaxs="i") text(x=0.5,y=0.5,labels="No Data") - mtext(3,line=1.5,text=title,font=2,cex=lwd.cex) + mtext(3,line=1.5,text=title,font=2,cex=text.cex) } #Add number of SV to plot @@ -500,18 +512,18 @@ plotSizeDistrib <- function(dat,svtypes,n.breaks=250,k=5, #Add 
filter labels if(!is.null(filter.legend)){ - legend("topright",bg=NA,bty="n",pch=NA,legend=filter.legend,cex=lwd.cex) + legend("topright",bg=NA,bty="n",pch=NA,legend=filter.legend,cex=text.cex) } } + #Plot comparative size distributions for a series of AC & AF restrictions -plotSizeDistribSeries <- function(dat,svtypes,max.AFs,legend.labs, - n.breaks=100,min.size=50,max.size=1000000, - autosomal=T,biallelic=T, - title=NULL){ +plotSizeDistribSeries <- function(dat, svtypes, max.AFs, legend.labs, + n.breaks=100, min.size=50, max.size=1000000, + autosomal=F, biallelic=T, title=NULL, lwd.cex=1){ #Process sizes & compute range + breaks filter.legend <- NULL if(autosomal==T){ - dat <- dat[which(dat$chr %in% c(1:22,paste("chr",1:22,sep=""))),] + dat <- dat[which(dat$chr %in% sex.chroms),] filter.legend <- c(filter.legend,"Autosomal SV only") } if(biallelic==T){ @@ -519,72 +531,84 @@ plotSizeDistribSeries <- function(dat,svtypes,max.AFs,legend.labs, filter.legend <- c(filter.legend,"Biallelic SV only") } sizes <- log10(dat$length) - sizes <- lapply(1:length(max.AFs),function(i){ - if(i==1){ - return(sizes[which(dat$AF<=max.AFs[i])]) - }else{ - return(sizes[which(dat$AF>max.AFs[i-1] & dat$AF<=max.AFs[i])]) - } - }) - xlims <- range(sizes[which(!is.infinite(unlist(sizes)))],na.rm=T) - xlims[1] <- max(c(log10(min.size),xlims[1])) - xlims[2] <- min(c(log10(max.size),xlims[2])) - breaks <- seq(xlims[1],xlims[2],by=(xlims[2]-xlims[1])/n.breaks) - mids <- (breaks[1:(length(breaks)-1)]+breaks[2:length(breaks)])/2 - - #Gather size densities per AF tranche - dens <- lapply(sizes,function(vals){ - h <- hist(vals[which(!is.infinite(vals) & vals>=xlims[1] & vals<=xlims[2])],plot=F,breaks=breaks) - h$counts[1] <- h$counts[1]+length(which(!is.infinite(vals) & valsxlims[2])) - return(h$counts/length(vals)) - }) - - #Prepare plot area - ylims <- c(0,quantile(unlist(dens),probs=0.995,na.rm=T)) - dens <- lapply(dens,function(vals){ - vals[which(vals>max(ylims))] <- max(ylims) - return(vals) - }) - par(bty="n",mar=c(3.5,3.5,3,0.5)) - plot(x=xlims,y=ylims,type="n", - xaxt="n",yaxt="n",xlab="",ylab="",yaxs="i") - - #Add vertical gridlines - logscale.all <- log10(as.numeric(sapply(0:8,function(i){(1:9)*10^i}))) - logscale.minor <- log10(as.numeric(sapply(0:8,function(i){c(5,10)*10^i}))) - logscale.minor.labs <- as.character(sapply(c("bp","kb","Mb"),function(suf){paste(c(1,5,10,50,100,500),suf,sep="")})) - logscale.minor.labs <- c(logscale.minor.labs[-1],"1Gb") - logscale.major <- log10(as.numeric(10^(0:8))) - abline(v=logscale.all,col="gray97") - abline(v=logscale.minor,col="gray92") - abline(v=logscale.major,col="gray85") - - #Add axes, title, and Alu/SVA/L1 ticks - axis(1,at=logscale.all,tck=-0.015,col="gray50",labels=NA) - axis(1,at=logscale.minor,tck=-0.0225,col="gray20",labels=NA) - axis(1,at=logscale.major,tck=-0.03,labels=NA) - axis(1,at=logscale.minor,tick=F,cex.axis=0.8,line=-0.4,las=2, - labels=logscale.minor.labs) - mtext(1,text="Size",line=2.25) - axis(2,at=axTicks(2),tck=-0.025,labels=NA) - axis(2,at=axTicks(2),tick=F,line=-0.4,cex.axis=0.8,las=2, - labels=paste(round(100*axTicks(2),1),"%",sep="")) - mtext(2,text="Fraction of SV",line=2) - sapply(1:2,function(i){ - axis(3,at=log10(c(300,6000))[i],labels=NA,tck=-0.01) - axis(3,at=log10(c(300,6000))[i],tick=F,line=-0.9,cex.axis=0.8, - labels=c("Alu","L1")[i],font=3) - }) - axis(3,at=log10(c(1000,2000)),labels=NA,tck=-0.01) - axis(3,at=mean(log10(c(1000,2000))),tick=F,line=-0.9,cex.axis=0.8,labels="SVA",font=3) - mtext(3,line=1.5,text=title,font=2) - - #Add 
lines per AF tranche - col.pal <- rev(colorRampPalette(c("#440154","#365C8C","#25A584","#FDE725"))(length(sizes))) - sapply(length(dens):1,function(i){ - points(x=mids,y=dens[[i]],type="l",lwd=2,col=col.pal[i]) - }) + if(length(sizes) > 0){ + sizes <- lapply(1:length(max.AFs),function(i){ + if(i==1){ + return(sizes[which(dat$AF<=max.AFs[i])]) + }else{ + return(sizes[which(dat$AF>max.AFs[i-1] & dat$AF<=max.AFs[i])]) + } + }) + xlims <- range(sizes[which(!is.infinite(unlist(sizes)))],na.rm=T) + xlims[1] <- max(c(log10(min.size),xlims[1])) + xlims[2] <- min(c(log10(max.size),xlims[2])) + breaks <- seq(xlims[1],xlims[2],by=(xlims[2]-xlims[1])/n.breaks) + mids <- (breaks[1:(length(breaks)-1)]+breaks[2:length(breaks)])/2 + + #Gather size densities per AF tranche + dens <- lapply(sizes,function(vals){ + h <- hist(vals[which(!is.infinite(vals) & vals>=xlims[1] & vals<=xlims[2])],plot=F,breaks=breaks) + h$counts[1] <- h$counts[1]+length(which(!is.infinite(vals) & valsxlims[2])) + return(h$counts/length(vals)) + }) + + #Prepare plot area + ylims <- c(0,quantile(unlist(dens),probs=0.99,na.rm=T)) + dens <- lapply(dens,function(vals){ + vals[which(vals>max(ylims))] <- max(ylims) + return(vals) + }) + par(bty="n",mar=c(3.5,3.5,3,0.5)) + plot(x=xlims,y=ylims,type="n", + xaxt="n",yaxt="n",xlab="",ylab="",yaxs="i") + + #Add vertical gridlines + logscale.all <- log10(as.numeric(sapply(0:8,function(i){(1:9)*10^i}))) + logscale.minor <- log10(as.numeric(sapply(0:8,function(i){c(5,10)*10^i}))) + logscale.minor.labs <- as.character(sapply(c("bp","kb","Mb"),function(suf){paste(c(1,5,10,50,100,500),suf,sep="")})) + logscale.minor.labs <- c(logscale.minor.labs[-1],"1Gb") + logscale.major <- log10(as.numeric(10^(0:8))) + abline(v=logscale.all,col="gray97") + abline(v=logscale.minor,col="gray92") + abline(v=logscale.major,col="gray85") + + #Add axes, title, and Alu/SVA/L1 ticks + axis(1,at=logscale.all,tck=-0.015,col="gray50",labels=NA) + axis(1,at=logscale.minor,tck=-0.0225,col="gray20",labels=NA) + axis(1,at=logscale.major,tck=-0.03,labels=NA) + axis(1,at=logscale.minor,tick=F,cex.axis=0.8,line=-0.4,las=2, + labels=logscale.minor.labs) + mtext(1,text="Size",line=2.25) + axis(2,at=axTicks(2),tck=-0.025,labels=NA) + axis(2,at=axTicks(2),tick=F,line=-0.4,cex.axis=0.8,las=2, + labels=paste(round(100*axTicks(2),1),"%",sep="")) + mtext(2,text="Fraction of SV",line=2) + sapply(1:2,function(i){ + axis(3,at=log10(c(300,6000))[i],labels=NA,tck=-0.01) + axis(3,at=log10(c(300,6000))[i],tick=F,line=-0.9,cex.axis=0.8, + labels=c("Alu","L1")[i],font=3) + }) + axis(3,at=log10(c(1000,2000)),labels=NA,tck=-0.01) + axis(3,at=mean(log10(c(1000,2000))),tick=F,line=-0.9,cex.axis=0.8,labels="SVA",font=3) + mtext(3,line=1.5,text=title,font=2) + + #Add points & rolling mean per AF tranche + col.pal <- rev(colorRampPalette(c("#440154","#365C8C","#25A584","#FDE725"))(length(sizes))) + sapply(1:length(dens), function(i){ + #Points per individual bin + points(x=mids, y=dens[[i]], pch=19, cex=0.25, col=col.pal[i]) + #Rolling mean for line + points(x=mids, y=rollapply(dens[[i]], width=5, mean, partial=T), + type="l", lwd=lwd.cex, col=col.pal[i]) + }) + }else{ + par(bty="n",mar=c(3.5,3.5,3,0.5)) + plot(x=c(0,1),y=c(0,1),type="n", + xaxt="n",yaxt="n",xlab="",ylab="",yaxs="i") + text(x=0.5,y=0.5,labels="No Data") + mtext(3,line=1.5,text=title,font=2,cex=lwd.cex) + } #Add filter labels if(!is.null(filter.legend)){ @@ -594,8 +618,10 @@ plotSizeDistribSeries <- function(dat,svtypes,max.AFs,legend.labs, #Add freq legend 
legend("right",bg="white",bty="n",lwd=3,col=col.pal,legend=legend.labs,cex=0.8) } + #Wrapper to plot all size distributions wrapperPlotAllSizeDistribs <- function(){ + #All SV pdf(paste(OUTDIR,"/supporting_plots/vcf_summary_plots/size_distribution.all_sv.pdf",sep=""), height=4,width=6) @@ -603,46 +629,52 @@ wrapperPlotAllSizeDistribs <- function(){ title="Size Distribution (All SV)", legend=T) dev.off() + #Singletons pdf(paste(OUTDIR,"/supporting_plots/vcf_summary_plots/size_distribution.singletons.pdf",sep=""), height=4,width=6) plotSizeDistrib(dat=dat[which(dat$AC==1),],svtypes=svtypes, - autosomal=T,biallelic=T, + autosomal=F,biallelic=T, title="Size Distribution (Singletons; AC = 1)", legend=T) dev.off() + #Rare (>1 & <1%) pdf(paste(OUTDIR,"/supporting_plots/vcf_summary_plots/size_distribution.rare_sv.pdf",sep=""), height=4,width=6) plotSizeDistrib(dat=dat[which(dat$AC>1 & dat$AF=rare.max.freq & dat$AF=uncommon.max.freq & dat$AF=common.max.freq),],svtypes=svtypes, - autosomal=T,biallelic=T, + autosomal=F, biallelic=T, title="Size Distribution (AF > 50%)", legend=T) dev.off() + #Frequency series pdf(paste(OUTDIR,"/supporting_plots/vcf_summary_plots/size_distribution.across_freqs.pdf",sep=""), height=4,width=6) @@ -652,6 +684,7 @@ wrapperPlotAllSizeDistribs <- function(){ legend.labs=c("Singleton","<1%","1-10%","10-50%",">50%"), title="Size Distributions by Allele Frequency") dev.off() + #Merged pdf(paste(OUTDIR,"/main_plots/VCF_QC.size_distributions.merged.pdf",sep=""), height=6,width=10) @@ -659,27 +692,23 @@ wrapperPlotAllSizeDistribs <- function(){ heights=c(4,2)) plotSizeDistrib(dat=dat,svtypes=svtypes, title="Size Distribution (All SV)", - legend=T) + legend=T, lwd.cex=1.5) plotSizeDistribSeries(dat=dat,svtypes=svtypes, max.AFs=c(1.1/(2*nsamp),rare.max.freq,uncommon.max.freq, common.max.freq,major.max.freq), legend.labs=c("Singleton","<1%","1-10%","10-50%",">50%"), - title="Size Distributions by Allele Frequency") + title="Size Distributions by Allele Frequency", + lwd.cex=2) plotSizeDistrib(dat=dat[which(dat$AC==1),],svtypes=svtypes, - autosomal=T,biallelic=T, - title="AC = 1",lwd.cex=0.75) + autosomal=F, biallelic=T, title="AC = 1", text.cex=0.75) plotSizeDistrib(dat=dat[which(dat$AC>1 & dat$AF=rare.max.freq & dat$AF=uncommon.max.freq & dat$AF=common.max.freq),],svtypes=svtypes, - autosomal=T,biallelic=T, - title="> 50%",lwd.cex=0.75) + autosomal=F, biallelic=T, title="> 50%", text.cex=0.75) dev.off() } @@ -688,13 +717,13 @@ wrapperPlotAllSizeDistribs <- function(){ #####Allele frequency plots ########################### #Plot single AF spectrum -plotFreqDistrib <- function(dat,svtypes, - autosomal=T,biallelic=T, - title=NULL,lwd.cex=1,legend=F){ +plotFreqDistrib <- function(dat, svtypes, + autosomal=F, biallelic=T, + title=NULL, lwd.cex=1, legend=F){ #Process freqs & compute range + breaks filter.legend <- NULL if(autosomal==T){ - dat <- dat[which(dat$chr %in% c(1:22,paste("chr",1:22,sep=""))),] + dat <- dat[which(dat$chr %in% sex.chroms),] filter.legend <- c(filter.legend,"Autosomal SV only") } if(biallelic==T){ @@ -704,7 +733,7 @@ plotFreqDistrib <- function(dat,svtypes, freqs <- log10(dat$AF) if(length(freqs)>0){ xlims <- range(freqs[which(!is.infinite(freqs))],na.rm=T) - breaks <- seq(xlims[1],xlims[2],by=(xlims[2]-xlims[1])/(20*abs(floor(xlims)[1]))) + breaks <- seq(xlims[1],xlims[2],by=(xlims[2]-xlims[1])/(25*abs(floor(xlims)[1]))) mids <- (breaks[1:(length(breaks)-1)]+breaks[2:length(breaks)])/2 #Gather freq densities per class @@ -721,7 +750,7 @@ plotFreqDistrib <- 
function(dat,svtypes, dens$ALL <- as.numeric(all.h$counts/length(all.vals)) #Prepare plot area - ylims <- range(c(0,unlist(dens)),na.rm=T) + ylims <- c(0, quantile(unlist(dens), probs=0.99, na.rm=T)) par(bty="n",mar=c(4.5,3.5,3,0.5)) plot(x=xlims,y=ylims,type="n", xaxt="n",yaxt="n",xlab="",ylab="",yaxs="i") @@ -762,7 +791,7 @@ plotFreqDistrib <- function(dat,svtypes, if(any(dens[[i]]>0)){ points(x=mids[which(dens[[i]]>0)],y=dens[[i]][which(dens[[i]]>0)],col=color,pch=19,cex=0.3*lwd.cex) points(x=mids[which(dens[[i]]>0)], - y=rollapply(dens[[i]][which(dens[[i]]>0)],3,mean,partial=T), + y=rollapply(dens[[i]][which(dens[[i]]>0)],width=10,mean,partial=T), type="l",lwd=lwd.cex*lwd,col=color) } } @@ -792,14 +821,14 @@ plotFreqDistrib <- function(dat,svtypes, legend("topright",bg=NA,bty="n",pch=NA,legend=filter.legend,cex=lwd.cex*0.8) } } + #Plot AF spectrum series by sizes -plotFreqDistribSeries <- function(dat,svtypes,max.sizes,legend.labs, - autosomal=T,biallelic=T, - title=NULL){ +plotFreqDistribSeries <- function(dat, svtypes, max.sizes, legend.labs, + autosomal=F, biallelic=T, title=NULL){ #Process freqs & compute range + breaks filter.legend <- NULL if(autosomal==T){ - dat <- dat[which(dat$chr %in% c(1:22,paste("chr",1:22,sep=""))),] + dat <- dat[which(dat$chr %in% sex.chroms),] filter.legend <- c(filter.legend,"Autosomal SV only") } if(biallelic==T){ @@ -807,65 +836,73 @@ plotFreqDistribSeries <- function(dat,svtypes,max.sizes,legend.labs, filter.legend <- c(filter.legend,"Biallelic SV only") } freqs <- log10(dat$AF) - freqs <- lapply(1:length(max.sizes),function(i){ - if(i==1){ - return(freqs[which(dat$length<=max.sizes[i])]) - }else{ - return(freqs[which(dat$length>max.sizes[i-1] & dat$length<=max.sizes[i])]) - } - }) - xlims <- range(freqs[which(!is.infinite(unlist(freqs)))],na.rm=T) - breaks <- seq(xlims[1],xlims[2],by=(xlims[2]-xlims[1])/(20*abs(floor(xlims)[1]))) - mids <- (breaks[1:(length(breaks)-1)]+breaks[2:length(breaks)])/2 - - #Gather freq densities per class - dens <- lapply(freqs,function(vals){ - h <- hist(vals[which(!is.infinite(vals) & vals>=xlims[1] & vals<=xlims[2])],plot=F,breaks=breaks) - h$counts[1] <- h$counts[1]+length(which(!is.infinite(vals) & valsxlims[2])) - return(h$counts/length(vals)) - }) - - #Prepare plot area - ylims <- range(c(0,unlist(dens)),na.rm=T) - par(bty="n",mar=c(4.5,3.5,3,0.5)) - plot(x=xlims,y=ylims,type="n", - xaxt="n",yaxt="n",xlab="",ylab="",yaxs="i") - - #Add vertical gridlines - logscale.all <- log10(as.numeric(sapply(min(floor(xlims)):0,function(i){(1:9)*10^i}))) - logscale.minor <- log10(as.numeric(sapply(min(floor(xlims)):0,function(i){c(5,10)*10^i}))) - logscale.minor.labs <- as.character(paste(100*round(10^logscale.minor,10),"%",sep="")) - logscale.major <- log10(as.numeric(10^(min(floor(xlims)):0))) - abline(v=logscale.all,col="gray97") - abline(v=logscale.minor,col="gray92") - abline(v=logscale.major,col="gray85") - - #Add axes & title - axis(1,at=logscale.all,tck=-0.015,col="gray50",labels=NA) - axis(1,at=logscale.minor,tck=-0.0225,col="gray20",labels=NA) - axis(1,at=logscale.major,tck=-0.03,labels=NA) - axis(1,at=logscale.minor,tick=F,cex.axis=0.8,line=-0.4,las=2, - labels=logscale.minor.labs) - mtext(1,text="Allele Frequency",line=3) - axis(2,at=axTicks(2),tck=-0.025,labels=NA) - axis(2,at=axTicks(2),tick=F,line=-0.4,cex.axis=0.8,las=2, - labels=paste(round(100*axTicks(2),1),"%",sep="")) - mtext(2,text="Fraction of SV",line=2.2) - mtext(3,line=1.5,text=title,font=2) - - #Add points & rolling mean lines per size tranche - col.pal 
<- colorRampPalette(c("#440154","#365C8C","#25A584","#FDE725"))(length(freqs)) - sapply(1:length(dens),function(i){ - if(all(!is.nan(dens[[i]]))){ - if(any(dens[[i]]>0)){ - points(x=mids[which(dens[[i]]>0)],y=dens[[i]][which(dens[[i]]>0)],col=col.pal[i],pch=19,cex=0.3) - points(x=mids[which(dens[[i]]>0)], - y=rollapply(dens[[i]][which(dens[[i]]>0)],3,mean,partial=T), - type="l",lwd=2,col=col.pal[i]) + if(length(freqs) > 0){ + freqs <- lapply(1:length(max.sizes),function(i){ + if(i==1){ + return(freqs[which(dat$length<=max.sizes[i])]) + }else{ + return(freqs[which(dat$length>max.sizes[i-1] & dat$length<=max.sizes[i])]) } - } - }) + }) + xlims <- range(freqs[which(!is.infinite(unlist(freqs)))],na.rm=T) + breaks <- seq(xlims[1],xlims[2],by=(xlims[2]-xlims[1])/(20*abs(floor(xlims)[1]))) + mids <- (breaks[1:(length(breaks)-1)]+breaks[2:length(breaks)])/2 + + #Gather freq densities per class + dens <- lapply(freqs,function(vals){ + h <- hist(vals[which(!is.infinite(vals) & vals>=xlims[1] & vals<=xlims[2])],plot=F,breaks=breaks) + h$counts[1] <- h$counts[1]+length(which(!is.infinite(vals) & valsxlims[2])) + return(h$counts/length(vals)) + }) + + #Prepare plot area + ylims <- c(0, quantile(unlist(dens), probs=0.99, na.rm=T)) + par(bty="n",mar=c(4.5,3.5,3,0.5)) + plot(x=xlims,y=ylims,type="n", + xaxt="n",yaxt="n",xlab="",ylab="",yaxs="i") + + #Add vertical gridlines + logscale.all <- log10(as.numeric(sapply(min(floor(xlims)):0,function(i){(1:9)*10^i}))) + logscale.minor <- log10(as.numeric(sapply(min(floor(xlims)):0,function(i){c(5,10)*10^i}))) + logscale.minor.labs <- as.character(paste(100*round(10^logscale.minor,10),"%",sep="")) + logscale.major <- log10(as.numeric(10^(min(floor(xlims)):0))) + abline(v=logscale.all,col="gray97") + abline(v=logscale.minor,col="gray92") + abline(v=logscale.major,col="gray85") + + #Add axes & title + axis(1,at=logscale.all,tck=-0.015,col="gray50",labels=NA) + axis(1,at=logscale.minor,tck=-0.0225,col="gray20",labels=NA) + axis(1,at=logscale.major,tck=-0.03,labels=NA) + axis(1,at=logscale.minor,tick=F,cex.axis=0.8,line=-0.4,las=2, + labels=logscale.minor.labs) + mtext(1,text="Allele Frequency",line=3) + axis(2,at=axTicks(2),tck=-0.025,labels=NA) + axis(2,at=axTicks(2),tick=F,line=-0.4,cex.axis=0.8,las=2, + labels=paste(round(100*axTicks(2),1),"%",sep="")) + mtext(2,text="Fraction of SV",line=2.2) + mtext(3,line=1.5,text=title,font=2) + + #Add points & rolling mean lines per size tranche + col.pal <- colorRampPalette(c("#440154","#365C8C","#25A584","#FDE725"))(length(freqs)) + sapply(1:length(dens),function(i){ + if(all(!is.nan(dens[[i]]))){ + if(any(dens[[i]]>0)){ + points(x=mids[which(dens[[i]]>0)],y=dens[[i]][which(dens[[i]]>0)],col=col.pal[i],pch=19,cex=0.3) + points(x=mids[which(dens[[i]]>0)], + y=rollapply(dens[[i]][which(dens[[i]]>0)],3,mean,partial=T), + type="l",lwd=2,col=col.pal[i]) + } + } + }) + }else{ + par(bty="n",mar=c(4.5,3.5,3,0.5)) + plot(x=c(0,1),y=c(0,1),type="n", + xaxt="n",yaxt="n",xlab="",ylab="",yaxs="i") + text(x=0.5,y=0.5,labels="No Data") + mtext(3,line=1.5,text=title,font=2,cex=lwd.cex) + } #Add filter labels if(!is.null(filter.legend)){ @@ -876,6 +913,7 @@ plotFreqDistribSeries <- function(dat,svtypes,max.sizes,legend.labs, legend("right",bg="white",bty="n",lwd=3,col=col.pal,cex=0.7, legend=gsub("\n","",legend.labs,fixed=T)) } + #Wrapper to plot all AF distributions wrapperPlotAllFreqDistribs <- function(){ #All SV @@ -885,6 +923,7 @@ wrapperPlotAllFreqDistribs <- function(){ title="AF Distribution (All SV)", legend=T) dev.off() + #Tiny 
(<100bp) pdf(paste(OUTDIR,"/supporting_plots/vcf_summary_plots/freq_distribution.tiny_sv.pdf",sep=""), height=4,width=4) @@ -892,6 +931,7 @@ wrapperPlotAllFreqDistribs <- function(){ title="AF Distribution (< 100bp)", legend=T) dev.off() + #Small (>100bp & <500bp) pdf(paste(OUTDIR,"/supporting_plots/vcf_summary_plots/freq_distribution.small_sv.pdf",sep=""), height=4,width=4) @@ -899,6 +939,7 @@ wrapperPlotAllFreqDistribs <- function(){ title="AF Distribution (100bp - 500bp)", legend=T) dev.off() + #Medium (>500bp & <2.5kb) pdf(paste(OUTDIR,"/supporting_plots/vcf_summary_plots/freq_distribution.medium_sv.pdf",sep=""), height=4,width=4) @@ -906,6 +947,7 @@ wrapperPlotAllFreqDistribs <- function(){ title="AF Distribution (500bp - 2.5kb)", legend=T) dev.off() + #Med-Large (>2.5kb & <10kb) pdf(paste(OUTDIR,"/supporting_plots/vcf_summary_plots/freq_distribution.medlarge_sv.pdf",sep=""), height=4,width=4) @@ -913,6 +955,7 @@ wrapperPlotAllFreqDistribs <- function(){ title="AF Distribution (2.5kb - 10kb)", legend=T) dev.off() + #Large (>10kb & <50kb) pdf(paste(OUTDIR,"/supporting_plots/vcf_summary_plots/freq_distribution.large_sv.pdf",sep=""), height=4,width=4) @@ -920,6 +963,7 @@ wrapperPlotAllFreqDistribs <- function(){ title="AF Distribution (10kb - 50kb)", legend=T) dev.off() + #Huge (>50kb) pdf(paste(OUTDIR,"/supporting_plots/vcf_summary_plots/freq_distribution.huge_sv.pdf",sep=""), height=4,width=4) @@ -927,6 +971,7 @@ wrapperPlotAllFreqDistribs <- function(){ title="AF Distribution (> 50kb)", legend=T) dev.off() + #Size series pdf(paste(OUTDIR,"/supporting_plots/vcf_summary_plots/freq_distribution.across_sizes.pdf",sep=""), height=4,width=4) @@ -937,6 +982,7 @@ wrapperPlotAllFreqDistribs <- function(){ "2.5-10kb","10kb-50kb",">50kb"), title="AF Distributions by SV Size") dev.off() + #Merged pdf(paste(OUTDIR,"/main_plots/VCF_QC.freq_distributions.merged.pdf",sep=""), height=6,width=10) @@ -975,7 +1021,7 @@ wrapperPlotAllFreqDistribs <- function(){ #Plot single HW ternary comparison plotHWSingle <- function(dat,svtypes,title=NULL,full.legend=T,lab.cex=1){ #Restrict data to biallelic, autosomal sites - HW.dat <- dat[which(dat$chr %in% c(1:22,paste("chr",1:22,sep="")) & !is.na(dat$AN)),] + HW.dat <- dat[which(dat$chr %in% sex.chroms & !is.na(dat$AN)),] #Only run if there's data if(nrow(HW.dat)>0){ @@ -1071,7 +1117,7 @@ plotAlleleCarrierCorrelation <- function(dat,autosomal=T,biallelic=T, #Process freqs filter.legend <- NULL if(autosomal==T){ - dat <- dat[which(dat$chr %in% c(1:22,paste("chr",1:22,sep=""))),] + dat <- dat[which(dat$chr %in% sex.chroms),] filter.legend <- c(filter.legend,"Autosomal SV only") } if(biallelic==T){ @@ -1120,6 +1166,7 @@ plotAlleleCarrierCorrelation <- function(dat,autosomal=T,biallelic=T, legend("topleft",bg=NA,bty="n",pch=NA,legend=filter.legend,cex=0.8) } } + #Wrapper to plot all HW distributions wrapperPlotAllHWDistribs <- function(){ #All SV @@ -1199,10 +1246,10 @@ wrapperPlotAllHWDistribs <- function(){ ###RSCRIPT FUNCTIONALITY ######################## ###Load libraries as needed -require(optparse) -require(RColorBrewer) -require(zoo) -require(HardyWeinberg) +require(optparse, quietly=T) +require(RColorBrewer, quietly=T) +require(zoo, quietly=T) +require(HardyWeinberg, quietly=T) ###List of command-line options option_list <- list( @@ -1231,12 +1278,6 @@ OUTDIR <- args$args[2] nsamp <- opts$nsamp svtypes.file <- opts$svtypes -#Dev parameters -# INFILE <- "~/scratch/gnomAD_v2_SV_MASTER_RD_VCF.VCF_sites.stats.bed.gz" -# OUTDIR <- "~/scratch/VCF_plots_test/" -# 
nsamp <- 14245 -# svtypes.file <- "~/Desktop/Collins/Talkowski/code/sv-pipeline/ref/vcf_qc_refs/SV_colors.txt" - ###Prepares I/O files #Read & clean data dat <- read.table(INFILE,comment.char="",sep="\t",header=T,check.names=F) @@ -1269,10 +1310,13 @@ if(!is.null(svtypes.file)){ ###Plotting block #SV counts wrapperPlotAllCountBars() + #SV sizes wrapperPlotAllSizeDistribs() + #SV frequencies wrapperPlotAllFreqDistribs() + #Genotype frequencies wrapperPlotAllHWDistribs() diff --git a/src/sv-pipeline/scripts/vcf_qc/runIRS.sh b/src/sv-pipeline/scripts/vcf_qc/runIRS.sh new file mode 100755 index 000000000..f74c292d2 --- /dev/null +++ b/src/sv-pipeline/scripts/vcf_qc/runIRS.sh @@ -0,0 +1,40 @@ +#!/usr/bin/env bash +# Author: asanchis@broadinstitute.org +export SV_DIR=/data/talkowski/an436/software/svtoolkit +mx="-Xmx64g" +classpath="${SV_DIR}/lib/SVToolkit.jar:${SV_DIR}/lib/gatk/GenomeAnalysisTK.jar:${SV_DIR}/lib/gatk/Queue.jar" +while getopts s:o:r:d:g:a: flag +do + case "${flag}" in + s) sites=${OPTARG};; + o) output=${OPTARG};; + r) report=${OPTARG};; + d) discovery=${OPTARG};; + g) genome=${OPTARG};; + a) array=${OPTARG};; + esac +done +echo "sites: $sites"; +echo "output: $output"; +echo "report: $report"; +echo "discovery: $discovery"; +echo "genome: $genome"; +echo "array: $array"; +if [[ $genome = "37" ]]; then + genome_path=/data/talkowski/an436/resources/genomes/human_g1k_v37_chrom/human_g1k_v37.chr.canonic.fasta +elif [[ $genome = "38" ]]; then + genome_path=/data/talkowski/xuefang/data/reference/GRCh38.1KGP/GRCh38_full_analysis_set_plus_decoy_hla.fa +fi +echo "genome_path: $genome_path" +# Run discovery +java ${mx} -cp ${classpath} \ + org.broadinstitute.sv.main.SVAnnotator \ + -A IntensityRankSum \ + -R $genome_path \ + -vcf $sites \ + -O $output \ + -arrayIntensityFile $array \ + -sample $discovery \ + -irsSampleTag SAMPLES \ + -writeReport true \ + -reportFile $report diff --git a/wdl/AnnotateGenomicContext.wdl b/wdl/AnnotateGenomicContext.wdl new file mode 100755 index 000000000..a1c9b2d83 --- /dev/null +++ b/wdl/AnnotateGenomicContext.wdl @@ -0,0 +1,296 @@ +version 1.0 + +# Author: Xuefang Zhao + +import "Structs.wdl" + +# Workflow to annotate vcf file with genomic context +workflow AnnotateSVsWithGenomicContext { + input { + File vcf + File vcf_index + File Repeat_Masks + File Simple_Repeats + File Segmental_Duplicates + + String sv_base_mini_docker + String sv_benchmark_docker + String sv_pipeline_docker + + # overrides for MiniTasks + RuntimeAttr? runtime_override_extract_SV_sites + RuntimeAttr? runtime_override_vcf_to_bed + RuntimeAttr? runtime_attr_override_anno_gc + RuntimeAttr? 
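For reference, the new runIRS.sh wrapper above drives the SVToolkit (Genome STRiP) SVAnnotator IntensityRankSum module and parses its arguments with getopts. A hypothetical invocation is sketched below; every file name is a placeholder, and the hard-coded SV_DIR/svtoolkit install referenced inside the script must exist on the host.

```bash
# Hypothetical invocation of runIRS.sh; all file names are placeholders.
#   -s  sites VCF to annotate        -o  annotated output VCF
#   -r  IRS report file              -d  discovery sample list (passed to -sample)
#   -g  genome build ("37" or "38")  -a  array intensity matrix
bash runIRS.sh \
  -s candidate_sites.vcf.gz \
  -o candidate_sites.IRS.vcf.gz \
  -r candidate_sites.IRS_report.dat \
  -d discovery_samples.list \
  -g 38 \
  -a array_intensities.matrix.dat
```

Note that the script only resolves a reference FASTA for builds "37" and "38"; any other -g value leaves genome_path unset.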
runtime_attr_override_inte_gc + } + + call ExtractSitesFromVcf { + input: + vcf = vcf, + vcf_index = vcf_index, + sv_base_mini_docker=sv_base_mini_docker, + runtime_attr_override=runtime_override_extract_SV_sites + } + + call Vcf2Bed{ + input: + vcf = ExtractSitesFromVcf.out, + vcf_index = ExtractSitesFromVcf.out_idx, + sv_pipeline_docker=sv_pipeline_docker, + runtime_attr_override=runtime_override_vcf_to_bed + } + + call AnnotateGenomicContext{ + input: + bed_gz = Vcf2Bed.out, + simp_rep = Simple_Repeats, + seg_dup = Segmental_Duplicates, + rep_mask = Repeat_Masks, + sv_base_mini_docker = sv_base_mini_docker, + runtime_attr_override = runtime_attr_override_anno_gc + } + + call IntegrateGenomicContext{ + input: + bed_gz = Vcf2Bed.out, + le_bp_vs_sr = AnnotateGenomicContext.le_bp_vs_sr, + le_bp_vs_sd = AnnotateGenomicContext.le_bp_vs_sd, + le_bp_vs_rm = AnnotateGenomicContext.le_bp_vs_rm, + ri_bp_vs_sr = AnnotateGenomicContext.ri_bp_vs_sr, + ri_bp_vs_sd = AnnotateGenomicContext.ri_bp_vs_sd, + ri_bp_vs_rm = AnnotateGenomicContext.ri_bp_vs_rm, + lg_cnv_vs_sr = AnnotateGenomicContext.lg_cnv_vs_sr, + lg_cnv_vs_sd = AnnotateGenomicContext.lg_cnv_vs_sd, + lg_cnv_vs_rm = AnnotateGenomicContext.lg_cnv_vs_rm, + sv_benchmark_docker = sv_benchmark_docker, + runtime_attr_override = runtime_attr_override_inte_gc + } + + output{ + File annotated_SVs = IntegrateGenomicContext.anno + } +} + +task ExtractSitesFromVcf { + input { + File vcf + File vcf_index + String sv_base_mini_docker + RuntimeAttr? runtime_attr_override + } + + Float vcf_size = size(vcf, "GiB") + Int vm_disk_size = ceil(vcf_size * 1.2) + + RuntimeAttr runtime_default = object { + mem_gb: 2, + disk_gb: vm_disk_size, + cpu_cores: 1, + preemptible_tries: 1, + max_retries: 1, + boot_disk_gb: 10 + } + RuntimeAttr runtime_override = select_first([runtime_attr_override, runtime_default]) + + runtime { + memory: "~{select_first([runtime_override.mem_gb, runtime_default.mem_gb])} GiB" + disks: "local-disk ~{select_first([runtime_override.disk_gb, runtime_default.disk_gb])} HDD" + cpu: select_first([runtime_override.cpu_cores, runtime_default.cpu_cores]) + preemptible: select_first([runtime_override.preemptible_tries, runtime_default.preemptible_tries]) + maxRetries: select_first([runtime_override.max_retries, runtime_default.max_retries]) + docker: sv_base_mini_docker + bootDiskSizeGb: select_first([runtime_override.boot_disk_gb, runtime_default.boot_disk_gb]) + } + + String prefix = basename(vcf, ".vcf.gz") + command <<< + set -euxo pipefail + + zcat ~{vcf} | cut -f1-10 |bgzip > ~{prefix}.sites.vcf.gz + tabix ~{prefix}.sites.vcf.gz + >>> + + output { + File out = "~{prefix}.sites.vcf.gz" + File out_idx = "~{prefix}.sites.vcf.gz.tbi" + } +} + +task Vcf2Bed{ + input { + File vcf + File vcf_index + String sv_pipeline_docker + RuntimeAttr? 
runtime_attr_override + } + + Float vcf_size = size(vcf, "GiB") + Int vm_disk_size = ceil(vcf_size * 1.2) + + RuntimeAttr runtime_default = object { + mem_gb: 2, + disk_gb: vm_disk_size, + cpu_cores: 1, + preemptible_tries: 1, + max_retries: 1, + boot_disk_gb: 10 + } + RuntimeAttr runtime_override = select_first([runtime_attr_override, runtime_default]) + + runtime { + memory: "~{select_first([runtime_override.mem_gb, runtime_default.mem_gb])} GiB" + disks: "local-disk ~{select_first([runtime_override.disk_gb, runtime_default.disk_gb])} HDD" + cpu: select_first([runtime_override.cpu_cores, runtime_default.cpu_cores]) + preemptible: select_first([runtime_override.preemptible_tries, runtime_default.preemptible_tries]) + maxRetries: select_first([runtime_override.max_retries, runtime_default.max_retries]) + docker: sv_pipeline_docker + bootDiskSizeGb: select_first([runtime_override.boot_disk_gb, runtime_default.boot_disk_gb]) + } + + String prefix = basename(vcf, ".vcf.gz") + command <<< + set -euxo pipefail + + svtk vcf2bed -i SVTYPE -i SVLEN ~{vcf} ~{prefix}.bed + cut -f1-4,7,8 ~{prefix}.bed | bgzip > ~{prefix}.bed.gz + >>> + + output { + File out = "~{prefix}.bed.gz" + } +} + +task AnnotateGenomicContext{ + input { + File bed_gz + File simp_rep + File seg_dup + File rep_mask + String sv_base_mini_docker + RuntimeAttr? runtime_attr_override + } + RuntimeAttr runtime_default = object { + mem_gb: 3, + disk_gb: 10, + cpu_cores: 1, + preemptible_tries: 1, + max_retries: 1, + boot_disk_gb: 10 + } + RuntimeAttr runtime_override = select_first([runtime_attr_override, runtime_default]) + + runtime { + memory: "~{select_first([runtime_override.mem_gb, runtime_default.mem_gb])} GiB" + disks: "local-disk ~{select_first([runtime_override.disk_gb, runtime_default.disk_gb])} HDD" + cpu: select_first([runtime_override.cpu_cores, runtime_default.cpu_cores]) + preemptible: select_first([runtime_override.preemptible_tries, runtime_default.preemptible_tries]) + maxRetries: select_first([runtime_override.max_retries, runtime_default.max_retries]) + docker: sv_base_mini_docker + bootDiskSizeGb: select_first([runtime_override.boot_disk_gb, runtime_default.boot_disk_gb]) + } + + String prefix = basename(bed_gz, ".bed.gz") + command <<< + set -euxo pipefail + + zcat ~{bed_gz} | awk '{print $1,$2,$2,$4,$5}' | sed -e 's/ /\t/g' > ~{prefix}.le_bp + zcat ~{bed_gz} | awk '{print $1,$3,$3,$4,$5}' | sed -e 's/ /\t/g' > ~{prefix}.ri_bp + zcat ~{bed_gz} | awk '{if ($5=="DEL" || $5=="DUP" || $5=="CNV" ) print}' | awk '{if ($3-$2>5000) print}' | cut -f1-5 > ~{prefix}.lg_cnv + + bedtools coverage -a ~{prefix}.le_bp -b ~{simp_rep} | awk '{if ($9>0) print}'> ~{prefix}.le_bp.vs.SR + bedtools coverage -a ~{prefix}.le_bp -b ~{seg_dup} | awk '{if ($9>0) print}'> ~{prefix}.le_bp.vs.SD + bedtools coverage -a ~{prefix}.le_bp -b ~{rep_mask} | awk '{if ($9>0) print}'> ~{prefix}.le_bp.vs.RM + + bedtools coverage -a ~{prefix}.ri_bp -b ~{simp_rep} | awk '{if ($9>0) print}'> ~{prefix}.ri_bp.vs.SR + bedtools coverage -a ~{prefix}.ri_bp -b ~{seg_dup} | awk '{if ($9>0) print}'> ~{prefix}.ri_bp.vs.SD + bedtools coverage -a ~{prefix}.ri_bp -b ~{rep_mask} | awk '{if ($9>0) print}'> ~{prefix}.ri_bp.vs.RM + + bedtools coverage -a ~{prefix}.lg_cnv -b ~{simp_rep} > ~{prefix}.lg_cnv.vs.SR + bedtools coverage -a ~{prefix}.lg_cnv -b ~{seg_dup} > ~{prefix}.lg_cnv.vs.SD + bedtools coverage -a ~{prefix}.lg_cnv -b ~{rep_mask} > ~{prefix}.lg_cnv.vs.RM + + >>> + + output { + File le_bp_vs_sr = "~{prefix}.le_bp.vs.SR" + File le_bp_vs_sd = 
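The AnnotateGenomicContext task above classifies breakpoints by intersecting them with simple-repeat (SR), segmental-duplication (SD), and RepeatMasker (RM) tracks, keeping any record with a nonzero coverage fraction; with a five-column query BED, bedtools coverage appends four columns, so that fraction lands in field $9. A minimal, self-contained check of that column arithmetic (toy placeholder files, using 1 bp intervals rather than the zero-length breakpoints built above):

```bash
# Toy data: one breakpoint inside a simple repeat, one outside (placeholders).
printf 'chr1\t10100\t10101\tSV1\tDEL\nchr1\t50000\t50001\tSV2\tDUP\n' > le_bp.bed
printf 'chr1\t10000\t10468\n' > simple_repeats.bed

# bedtools coverage appends: overlap count, covered bases, interval length,
# and covered fraction -> with a 5-column query BED the fraction is field $9.
bedtools coverage -a le_bp.bed -b simple_repeats.bed \
  | awk '{ if ($9 > 0) print $4 "\tleft breakpoint overlaps a simple repeat" }'
# Expected output: SV1 only.
```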
"~{prefix}.le_bp.vs.SD" + File le_bp_vs_rm = "~{prefix}.le_bp.vs.RM" + File ri_bp_vs_sr = "~{prefix}.ri_bp.vs.SR" + File ri_bp_vs_sd = "~{prefix}.ri_bp.vs.SD" + File ri_bp_vs_rm = "~{prefix}.ri_bp.vs.RM" + File lg_cnv_vs_sr = "~{prefix}.lg_cnv.vs.SR" + File lg_cnv_vs_sd = "~{prefix}.lg_cnv.vs.SD" + File lg_cnv_vs_rm = "~{prefix}.lg_cnv.vs.RM" + } +} + +task IntegrateGenomicContext{ + input { + File bed_gz + File le_bp_vs_sr + File le_bp_vs_sd + File le_bp_vs_rm + File ri_bp_vs_sr + File ri_bp_vs_sd + File ri_bp_vs_rm + File lg_cnv_vs_sr + File lg_cnv_vs_sd + File lg_cnv_vs_rm + + String sv_benchmark_docker + RuntimeAttr? runtime_attr_override + } + RuntimeAttr runtime_default = object { + mem_gb: 1, + disk_gb: 10, + cpu_cores: 1, + preemptible_tries: 1, + max_retries: 1, + boot_disk_gb: 10 + } + + RuntimeAttr runtime_override = select_first([runtime_attr_override, runtime_default]) + + runtime { + memory: "~{select_first([runtime_override.mem_gb, runtime_default.mem_gb])} GiB" + disks: "local-disk ~{select_first([runtime_override.disk_gb, runtime_default.disk_gb])} HDD" + cpu: select_first([runtime_override.cpu_cores, runtime_default.cpu_cores]) + preemptible: select_first([runtime_override.preemptible_tries, runtime_default.preemptible_tries]) + maxRetries: select_first([runtime_override.max_retries, runtime_default.max_retries]) + docker: sv_benchmark_docker + bootDiskSizeGb: select_first([runtime_override.boot_disk_gb, runtime_default.boot_disk_gb]) + } + + String prefix = basename(bed_gz, ".bed.gz") + command <<< + set -euxo pipefail + + Rscript /src/integrate_Genomic_Content_annotations.R \ + --bed ~{bed_gz} \ + --out ~{prefix}.GC \ + --le_bp_vs_sr ~{le_bp_vs_sr} \ + --le_bp_vs_sd ~{le_bp_vs_sd} \ + --le_bp_vs_rm ~{le_bp_vs_rm} \ + --ri_bp_vs_sr ~{ri_bp_vs_sr} \ + --ri_bp_vs_sd ~{ri_bp_vs_sd} \ + --ri_bp_vs_rm ~{ri_bp_vs_rm} \ + --lg_cnv_vs_sr ~{lg_cnv_vs_sr} \ + --lg_cnv_vs_sd ~{lg_cnv_vs_sd} \ + --lg_cnv_vs_rm ~{lg_cnv_vs_rm} + + >>> + + output { + File anno = "~{prefix}.GC" + } +} + + + + + + + + + + diff --git a/wdl/AnnotateVcfWithRandomForestScores.wdl b/wdl/AnnotateVcfWithRandomForestScores.wdl new file mode 100755 index 000000000..07dfafd57 --- /dev/null +++ b/wdl/AnnotateVcfWithRandomForestScores.wdl @@ -0,0 +1,251 @@ +## Copyright Broad Institute, 2022 +## +## +## Consolidate boost scores per sample across all batches and write those scores +## directly into an input VCF +## +## +## LICENSING : +## This script is released under the WDL source code license (BSD-3) (see LICENSE in +## https://github.com/broadinstitute/wdl). Note however that the programs it calls may +## be subject to different licenses. Users are responsible for checking that they are +## authorized to run all programs before running this script. Please see the docker +## page at https://hub.docker.com/r/broadinstitute/genomes-in-the-cloud/ for detailed +## licensing information pertaining to the included programs. + +version 1.0 + +import "Structs.wdl" + + +workflow AnnotateVcfWithRandomForestScores { + input { + File vcf + File vcf_idx + Array[File] boost_score_tarballs + String sv_base_mini_docker + String sv_benchmark_docker + RuntimeAttr? runtime_override_subset_vcf + RuntimeAttr? runtime_override_annotate_vcf + RuntimeAttr? 
runtime_attr_override_merge_vcfs + } + + # Scatter over tarballs of boost scores + scatter ( boost_res in boost_score_tarballs ) { + + # Subset VCF to samples in tarball + call SubsetVcf { + input: + vcf=vcf, + vcf_idx=vcf_idx, + boost_tarball=boost_res, + sv_base_mini_docker=sv_base_mini_docker, + runtime_attr_override=runtime_override_subset_vcf + } + + # Annotate boost scores + call AnnotateRFScores { + input: + vcf=SubsetVcf.subsetted_vcf, + vcf_idx=SubsetVcf.subsetted_vcf_idx, + boost_tarball=boost_res, + sv_benchmark_docker=sv_benchmark_docker, + runtime_attr_override=runtime_override_annotate_vcf + } + } + + # Column-wise merge of all annotated VCFs + call MergeVcfs { + input: + vcfs=AnnotateRFScores.annotated_vcf, + vcf_idxs=AnnotateRFScores.annotated_vcf_idx, + out_prefix=basename(vcf, ".vcf.gz") + ".boost_annotated", + sv_base_mini_docker=sv_base_mini_docker, + runtime_attr_override = runtime_attr_override_merge_vcfs + } + + output { + File annotated_vcf = MergeVcfs.merged_vcf + File annotated_vcf_idx = MergeVcfs.merged_vcf_idx + } +} + + +task SubsetVcf { + input { + File vcf + File vcf_idx + File boost_tarball + String sv_base_mini_docker + RuntimeAttr? runtime_attr_override + } + + String vcf_out_prefix = basename(boost_tarball, ".tar.gz") + + Float input_size = size(select_all([vcf, boost_tarball]), "GB") + Float base_disk_gb = 10.0 + RuntimeAttr runtime_default = object { + mem_gb: 4, + disk_gb: ceil(base_disk_gb + (input_size * 10.0)), + cpu_cores: 1, + preemptible_tries: 3, + max_retries: 1, + boot_disk_gb: 10 + } + RuntimeAttr runtime_override = select_first([runtime_attr_override, runtime_default]) + runtime { + memory: "~{select_first([runtime_override.mem_gb, runtime_default.mem_gb])} GB" + disks: "local-disk ~{select_first([runtime_override.disk_gb, runtime_default.disk_gb])} HDD" + cpu: select_first([runtime_override.cpu_cores, runtime_default.cpu_cores]) + preemptible: select_first([runtime_override.preemptible_tries, runtime_default.preemptible_tries]) + maxRetries: select_first([runtime_override.max_retries, runtime_default.max_retries]) + docker: sv_base_mini_docker + bootDiskSizeGb: select_first([runtime_override.boot_disk_gb, runtime_default.boot_disk_gb]) + } + + command <<< + set -eu -o pipefail + + # Decompress & reorganize tarball + mkdir boost_results + tar -xzvf ~{boost_tarball} --directory boost_results/ + find boost_results -name "RF_results.*.tsv" \ + | xargs -I {} mv {} boost_results/ + + # Get list of sample IDs + find boost_results/ -name "*.tsv" \ + | xargs -I {} basename {} \ + | sed 's/\./\t/g' | cut -f2 \ + | sort -Vk1,1 | uniq \ + > samples.list + + # Subset & index VCF + bcftools view \ + -S samples.list \ + --force-samples \ + -O z \ + -o ~{vcf_out_prefix}.subsetted.vcf.gz \ + ~{vcf} + tabix -p vcf ~{vcf_out_prefix}.subsetted.vcf.gz + >>> + + output { + File subsetted_vcf = "~{vcf_out_prefix}.subsetted.vcf.gz" + File subsetted_vcf_idx = "~{vcf_out_prefix}.subsetted.vcf.gz.tbi" + } +} + + +task AnnotateRFScores { + input { + File vcf + File vcf_idx + File boost_tarball + String sv_benchmark_docker + RuntimeAttr? 
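Because the SubsetVcf task above derives its sample list from the RF_results.*.tsv file names and passes --force-samples to bcftools, samples missing from the VCF are skipped with only a warning rather than a hard failure. A small optional sanity check (not part of the task; file names are placeholders) can surface a tarball/VCF batch mismatch:

```bash
# Compare the requested sample list against what actually ended up in the subset.
bcftools query -l SUBSET_PREFIX.subsetted.vcf.gz | sort > vcf_samples.list
sort samples.list > requested_samples.list
if ! diff -q requested_samples.list vcf_samples.list > /dev/null; then
  echo "WARNING: samples requested from the Boost tarball but absent from the VCF:" >&2
  comm -23 requested_samples.list vcf_samples.list >&2
fi
```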
runtime_attr_override + } + + String out_prefix = basename(vcf, ".vcf.gz") + ".boost_anno" + + Float input_size = size([vcf, boost_tarball], "GB") + Float base_disk_gb = 50.0 + Float compression_factor = 10.0 + RuntimeAttr runtime_default = object { + mem_gb: 16, + disk_gb: ceil(base_disk_gb + (input_size * compression_factor)), + cpu_cores: 1, + preemptible_tries: 3, + max_retries: 1, + boot_disk_gb: 10 + } + RuntimeAttr runtime_override = select_first([runtime_attr_override, runtime_default]) + runtime { + memory: "~{select_first([runtime_override.mem_gb, runtime_default.mem_gb])} GB" + disks: "local-disk ~{select_first([runtime_override.disk_gb, runtime_default.disk_gb])} HDD" + cpu: select_first([runtime_override.cpu_cores, runtime_default.cpu_cores]) + preemptible: select_first([runtime_override.preemptible_tries, runtime_default.preemptible_tries]) + maxRetries: select_first([runtime_override.max_retries, runtime_default.max_retries]) + docker: sv_benchmark_docker + bootDiskSizeGb: select_first([runtime_override.boot_disk_gb, runtime_default.boot_disk_gb]) + } + + command <<< + set -eu -o pipefail + + # Get list of variant IDs and samples in VCF + bcftools query -f '%ID\n' ~{vcf} > vids.list + bcftools query -l ~{vcf} > samples.list + + # Decompress Boost results and subset to variant IDs present in VCF + mkdir boost_results + mkdir boost_results/raw + tar -xzvf ~{boost_tarball} --directory boost_results/raw/ + while read sample; do + infile=$( find boost_results -name "RF_results.$sample.tsv" | sed -n '1p' ) + if ! [ -z $infile ]; then + awk -F'\t' -v OFS='\t' 'ARGIND==1{inFileA[$1]; next} {if ($1 in inFileA) print }' vids.list $infile > boost_results/$sample.scores.tsv + echo -e "$sample\tboost_results/$sample.scores.tsv" >> boost_anno_inputs.tsv + fi + done < samples.list + + # Add Boost scores to VCF + /src/add_random_forest_scores_to_vcf.py \ + --vcf ~{vcf} \ + --RF_tsv boost_anno_inputs.tsv \ + --outfile ~{out_prefix}.vcf.gz + tabix -p vcf ~{out_prefix}.vcf.gz + >>> + + output { + File annotated_vcf = "~{out_prefix}.vcf.gz" + File annotated_vcf_idx = "~{out_prefix}.vcf.gz.tbi" + } +} + + +task MergeVcfs { + input { + Array[File] vcfs + Array[File] vcf_idxs + String out_prefix + String sv_base_mini_docker + RuntimeAttr? 
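The per-sample score filter in AnnotateRFScores above uses awk's ARGIND variable, which is a gawk extension. If the benchmark docker image ever ships a non-GNU awk, the same "keep only VIDs present in the VCF" join can be written portably with FNR==NR; a sketch, where RF_results.SAMPLE.tsv stands in for one per-sample score file:

```bash
# FNR==NR is true only while the first file (vids.list) is being read, so its
# first column is loaded into `keep`; lines from the second file are printed
# only when their variant ID appears in that set.
awk -F'\t' -v OFS='\t' 'FNR==NR { keep[$1]; next } ($1 in keep)' \
  vids.list RF_results.SAMPLE.tsv \
  > SAMPLE.scores.tsv
```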
runtime_attr_override + } + + Float input_size = size(vcfs, "GB") + Float base_disk_gb = 10.0 + RuntimeAttr runtime_default = object { + mem_gb: 2, + disk_gb: ceil(base_disk_gb + (input_size * 4.0)), + cpu_cores: 1, + preemptible_tries: 0, + max_retries: 1, + boot_disk_gb: 10 + } + RuntimeAttr runtime_override = select_first([runtime_attr_override, runtime_default]) + runtime { + memory: "~{select_first([runtime_override.mem_gb, runtime_default.mem_gb])} GB" + disks: "local-disk ~{select_first([runtime_override.disk_gb, runtime_default.disk_gb])} HDD" + cpu: select_first([runtime_override.cpu_cores, runtime_default.cpu_cores]) + preemptible: select_first([runtime_override.preemptible_tries, runtime_default.preemptible_tries]) + maxRetries: select_first([runtime_override.max_retries, runtime_default.max_retries]) + docker: sv_base_mini_docker + bootDiskSizeGb: select_first([runtime_override.boot_disk_gb, runtime_default.boot_disk_gb]) + } + + command <<< + set -eu -o pipefail + + bcftools merge \ + -m id \ + -O z \ + -o ~{out_prefix}.vcf.gz \ + ~{sep=" " vcfs} + tabix -p vcf ~{out_prefix}.vcf.gz + >>> + + output { + File merged_vcf = "~{out_prefix}.vcf.gz" + File merged_vcf_idx = "~{out_prefix}.vcf.gz.tbi" + } +} diff --git a/wdl/CollectQcPerSample.wdl b/wdl/CollectQcPerSample.wdl index d195d6bb8..57627e771 100644 --- a/wdl/CollectQcPerSample.wdl +++ b/wdl/CollectQcPerSample.wdl @@ -4,13 +4,13 @@ version 1.0 import "Tasks0506.wdl" as MiniTasks -# Workflow to gather lists of variant IDs per sample from an SV VCF +# Workflow to gather lists of variant IDs per sample from one or more SV VCFs workflow CollectQcPerSample { input { - File vcf + Array[File] vcfs + Boolean vcf_format_has_cn = true File samples_list String prefix - Int samples_per_shard String sv_base_mini_docker String sv_pipeline_docker @@ -20,25 +20,16 @@ workflow CollectQcPerSample { # overrides for mini tasks RuntimeAttr? runtime_override_split_samples_list - RuntimeAttr? runtime_override_tar_shard_vid_lists + RuntimeAttr? 
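The MergeVcfs task above performs the column-wise (sample-wise) merge with `bcftools merge -m id`, which requires every input VCF to be bgzipped and indexed; that is why the workflow threads the .tbi files through as vcf_idxs. A minimal local sketch with placeholder file names:

```bash
# Two per-batch, sample-disjoint VCFs annotated with Boost scores (placeholders).
for f in batch1.boost_anno.vcf.gz batch2.boost_anno.vcf.gz; do
  tabix -f -p vcf "$f"   # merge refuses to run on unindexed inputs
done

# -m id merges records that share the same variant ID across batches.
bcftools merge -m id -O z -o cohort.boost_annotated.vcf.gz \
  batch1.boost_anno.vcf.gz batch2.boost_anno.vcf.gz
tabix -p vcf cohort.boost_annotated.vcf.gz

# The merged VCF should carry the union of samples from both batches.
bcftools query -l cohort.boost_annotated.vcf.gz | wc -l
```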
runtime_override_merge_sharded_per_sample_vid_lists } - # Shard sample list - call MiniTasks.SplitUncompressed as SplitSamplesList { - input: - whole_file=samples_list, - lines_per_shard=samples_per_shard, - shard_prefix=prefix + ".list_shard.", - sv_pipeline_docker=sv_pipeline_docker, - runtime_attr_override=runtime_override_split_samples_list - } - - # Collect VCF-wide summary stats per sample list - scatter (sublist in SplitSamplesList.shards) { + # Collect VCF-wide summary stats per sample list per VCF + scatter ( vcf in vcfs ) { call CollectVidsPerSample { input: vcf=vcf, - samples_list=sublist, + vcf_format_has_cn=vcf_format_has_cn, + samples_list=samples_list, prefix=prefix, sv_pipeline_docker=sv_pipeline_docker, runtime_attr_override=runtime_override_collect_vids_per_sample @@ -46,18 +37,18 @@ workflow CollectQcPerSample { } # Merge all VID lists into single output directory and tar it - call MiniTasks.FilesToTarredFolder as TarShardVidLists { + call MergeShardedPerSampleVidLists { input: - in_files=flatten(CollectVidsPerSample.vid_lists), - folder_name=prefix + "_perSample_VIDs_merged", - tarball_prefix=prefix + "_perSample_VIDs", + tarballs=CollectVidsPerSample.vid_lists_tarball, + samples_list=samples_list, + prefix=prefix, sv_base_mini_docker=sv_base_mini_docker, - runtime_attr_override=runtime_override_tar_shard_vid_lists + runtime_attr_override=runtime_override_merge_sharded_per_sample_vid_lists } # Final output output { - File vid_lists = TarShardVidLists.tarball + File vid_lists = MergeShardedPerSampleVidLists.merged_tarball } } @@ -66,23 +57,25 @@ workflow CollectQcPerSample { task CollectVidsPerSample { input { File vcf + Boolean vcf_format_has_cn = true File samples_list String prefix String sv_pipeline_docker RuntimeAttr? runtime_attr_override } + + String outdirprefix = prefix + "_perSample_VIDs" - # when filtering/sorting/etc, memory usage will likely go up (much of the data will have to - # be held in memory or disk while working, potentially in a form that takes up more space) + # Must scale disk proportionally to size of input VCF Float input_size = size([vcf, samples_list], "GiB") - Float compression_factor = 5.0 - Float base_disk_gb = 5.0 - Float base_mem_gb = 2.0 + Float disk_scaling_factor = 1.5 + Float base_disk_gb = 10.0 + Float base_mem_gb = 3.75 RuntimeAttr runtime_default = object { - mem_gb: base_mem_gb + compression_factor * input_size, - disk_gb: ceil(base_disk_gb + input_size * (2.0 + 2.0 * compression_factor)), + mem_gb: 3.75, + disk_gb: ceil(base_disk_gb + (input_size * disk_scaling_factor)), cpu_cores: 1, - preemptible_tries: 3, + preemptible_tries: 1, max_retries: 1, boot_disk_gb: 10 } @@ -98,62 +91,101 @@ task CollectVidsPerSample { } command <<< + set -eu -o pipefail - - # For purposes of memory, cut vcf to samples of interest - zcat ~{vcf} > uncompressed.vcf - rm ~{vcf} - grep -B9999999999 -m1 -Ev "^#" uncompressed.vcf | sed '$ d' > header.vcf \ - || cp uncompressed.vcf header.vcf - N_HEADER=$(wc -l < header.vcf) - IDXS=$( grep -Ev "^##" header.vcf \ - | sed 's/\t/\n/g' \ - | awk -v OFS="\t" '{ print $1, NR }' \ - | fgrep -wf ~{samples_list} \ - | cut -f2 \ - | sort -nk1,1 \ - | uniq \ - | paste -s -d, \ - | awk '{ print "1-9,"$1 }' \ - || printf "") - - if [ -z "$IDXS" ]; then - # nothing to find, make empty dir for output glob to look at - mkdir -p "~{prefix}_perSample_VIDs" + + # Make output directory + mkdir -p ~{outdirprefix} + + # Filter VCF to list of samples of interest, split into list of genotypes per + # sample, and write one .tsv 
file per sample to output directory + if [ ~{vcf_format_has_cn} == "true" ]; then + bcftools view -S ~{samples_list} ~{vcf} \ + | bcftools view --min-ac 1 \ + | bcftools query -f '[%SAMPLE\t%ID\t%ALT\t%GT\t%GQ\t%CN\t%CNQ\n]' \ + | awk '{OFS="\t"; gt = $4; gq = $5; if ($3 == "") { gq = $7; if ($6 == 2) { gt = "0/0" } else if ($6 == 1 || $6 == 3) { gt = "0/1" } else { gt = "1/1"} }; print $1, $2, gt, gq}' \ + | awk -v outprefix="~{outdirprefix}" '$3 != "0/0" && $3 != "./." {OFS="\t"; print $2, $3, $4 >> outprefix"/"$1".VIDs_genotypes.txt" }' else - cut -f"$IDXS" uncompressed.vcf \ - | vcftools --vcf - --stdout --non-ref-ac-any 1 --recode --recode-INFO-all \ - | grep -v "^#" \ - | cut -f3 \ - > VIDs_to_keep.list - - # Gather list of VIDs and genotypes per sample - { - grep -Ev "^##" header.vcf | cut -f"$IDXS"; - tail -n+$((N_HEADER + 1)) uncompressed.vcf | cut -f"$IDXS" | fgrep -wf VIDs_to_keep.list; - } \ - | /opt/sv-pipeline/scripts/vcf_qc/perSample_vcf_parsing_helper.R \ - /dev/stdin \ - ~{samples_list} \ - "~{prefix}_perSample_VIDs/" - - # Gzip all output lists - for FILE in ~{prefix}_perSample_VIDs/*.VIDs_genotypes.txt; do - gzip -f "$FILE" - rm -f "$FILE" - done - - # Check if one file per sample is present - NUM_GENOTYPE_FILES=$(find "~{prefix}_perSample_VIDs/" -name "*.VIDs_genotypes.txt.gz" | wc -l) - NUM_SAMPLES=$(sort ~{samples_list} | uniq | wc -l) - if [ $NUM_GENOTYPE_FILES -lt $NUM_SAMPLES ]; then - echo "ERROR IN TASK collect_VIDs_perSample! FEWER PER-SAMPLE GENOTYPE FILES LOCATED THAN NUMBER OF INPUT SAMPLES" - exit 1 - fi + bcftools view -S ~{samples_list} ~{vcf} \ + | bcftools view --min-ac 1 \ + | bcftools query -f '[%SAMPLE\t%ID\t%ALT\t%GT\t%GQ\n]' \ + | awk '{OFS="\t"; gt = $4; gq = $5; if ($3 ~ /CN0/) { if ($4 == "0/2") { gt = "0/0" } else if ($4 == "0/1" || $4 == "0/3") { gt = "0/1" } else { gt = "1/1"} }; print $1, $2, gt, gq}' \ + | awk -v outprefix="~{outdirprefix}" '$3 != "0/0" && $3 != "./." {OFS="\t"; print $2, $3, $4 >> outprefix"/"$1".VIDs_genotypes.txt" }' fi + + # Gzip all output lists + for FILE in ~{outdirprefix}/*.VIDs_genotypes.txt; do + gzip -f "$FILE" + rm -f "$FILE" + done + + # Bundle all files as a tarball (to make it easier on call caching for large cohorts) + cd ~{outdirprefix} && \ + tar -czvf ../~{outdirprefix}.tar.gz *.VIDs_genotypes.txt.gz && \ + cd - + >>> + + output { + File vid_lists_tarball = "~{outdirprefix}.tar.gz" + } +} + + +# Merge multiple tarballs of per-sample VID lists +task MergeShardedPerSampleVidLists { + input { + Array[File] tarballs + File samples_list + String prefix + String sv_base_mini_docker + RuntimeAttr? 
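The genotype normalization above flattens multiallelic CNV records into pseudo-genotypes: when the queried ALT field comes back empty, CN=2 is treated as 0/0, CN=1 or CN=3 as 0/1, anything else as 1/1, and CNQ stands in for GQ. The toy run below replays that awk logic on two hand-written records; an explicit tab field separator is added so the empty ALT column in the toy input survives field splitting.

```bash
# Columns mirror the bcftools query format string:
# SAMPLE  ID  ALT  GT  GQ  CN  CNQ
printf 'sampleA\tCNV_1\t\t./.\t.\t2\t99\nsampleB\tCNV_1\t\t./.\t.\t3\t52\n' \
  | awk -F'\t' '{OFS="\t"; gt = $4; gq = $5;
      if ($3 == "") { gq = $7;
        if ($6 == 2)                 { gt = "0/0" }
        else if ($6 == 1 || $6 == 3) { gt = "0/1" }
        else                         { gt = "1/1" } };
      print $1, $2, gt, gq}'
# sampleA -> 0/0 (dropped by the downstream $3 != "0/0" filter)
# sampleB -> 0/1 with GQ taken from CNQ (52)
```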
runtime_attr_override + } + RuntimeAttr runtime_default = object { + mem_gb: 3.75, + disk_gb: 20, + cpu_cores: 1, + preemptible_tries: 1, + max_retries: 1, + boot_disk_gb: 10 + } + RuntimeAttr runtime_override = select_first([runtime_attr_override, runtime_default]) + runtime { + memory: "~{select_first([runtime_override.mem_gb, runtime_default.mem_gb])} GiB" + disks: "local-disk ~{select_first([runtime_override.disk_gb, runtime_default.disk_gb])} HDD" + cpu: select_first([runtime_override.cpu_cores, runtime_default.cpu_cores]) + preemptible: select_first([runtime_override.preemptible_tries, runtime_default.preemptible_tries]) + maxRetries: select_first([runtime_override.max_retries, runtime_default.max_retries]) + docker: sv_base_mini_docker + bootDiskSizeGb: select_first([runtime_override.boot_disk_gb, runtime_default.boot_disk_gb]) + } + + command <<< + set -eu -o pipefail + + # Create final output directory + mkdir "~{prefix}_perSample_VID_lists" + + # Extract each tarball into its own unique directory + mkdir shards + while read i tarball_path; do + mkdir "shards/shard_$i" + tar -xzvf "$tarball_path" --directory "shards/shard_$i"/ + done < <( awk -v OFS="\t" '{ print NR, $1 }' ~{write_lines(tarballs)} ) + + # Merge all shards per sample and write to final directory + while read sample; do + find shards/ -name "$sample.VIDs_genotypes.txt.gz" \ + | xargs -I {} zcat {} \ + | sort -Vk1,1 -k2,2n -k3,3n \ + | gzip -c \ + > "~{prefix}_perSample_VID_lists/$sample.VIDs_genotypes.txt.gz" + done < ~{samples_list} + + # Compress final output directory + tar -czvf "~{prefix}_perSample_VID_lists.tar.gz" "~{prefix}_perSample_VID_lists" >>> output { - Array[File] vid_lists = glob("~{prefix}_perSample_VIDs/*.VIDs_genotypes.txt.gz") + File merged_tarball = "~{prefix}_perSample_VID_lists.tar.gz" } } diff --git a/wdl/MainVcfQc.wdl b/wdl/MainVcfQc.wdl new file mode 100644 index 000000000..1c1df8973 --- /dev/null +++ b/wdl/MainVcfQc.wdl @@ -0,0 +1,845 @@ +version 1.0 + +# Author: Ryan Collins + +# Note: this WDL has been customized specifically for the gnomAD-SV v3 callset +# Some components of this WDL will not be generalizable for most cohorts + +import "ShardedQcCollection.wdl" as ShardedQcCollection +import "CollectQcPerSample.wdl" as CollectQcPerSample +import "ShardedCohortBenchmarking.wdl" as CohortExternalBenchmark +import "PerSampleExternalBenchmark.wdl" as PerSampleExternalBenchmark +import "Tasks0506.wdl" as MiniTasks +import "Utils.wdl" as Utils + +# Main workflow to perform comprehensive quality control (QC) on +# an SV VCF output by GATK-SV +workflow MasterVcfQc { + input { + Array[File] vcfs + Array[File] vcf_idxs + Boolean vcf_format_has_cn = true + File? ped_file + File? list_of_samples_to_include + Int max_trios = 1000 + String prefix + Int sv_per_shard + Int samples_per_shard + Array[Array[String]]? site_level_comparison_datasets # Array of two-element arrays, one per dataset, each of format [prefix, gs:// path to directory with one BED per population] + Array[Array[String]]? sample_level_comparison_datasets # Array of two-element arrays, one per dataset, each of format [prefix, gs:// path to per-sample tarballs] + Array[String] contigs + Int? random_seed + + String sv_base_mini_docker + String sv_pipeline_docker + String sv_pipeline_qc_docker + + # overrides for local tasks + RuntimeAttr? runtime_override_plot_qc_vcf_wide + RuntimeAttr? runtime_override_site_level_benchmark_plot + RuntimeAttr? runtime_override_custom_external + RuntimeAttr? 
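For the two optional benchmarking inputs declared above, each element is a two-item [prefix, gs:// path] pair, as described in the inline comments. A hypothetical Cromwell inputs snippet (dataset labels and bucket paths are placeholders, not real resources) might look like:

```bash
# Write a minimal inputs JSON for the external-benchmarking arrays of MasterVcfQc.
cat > MasterVcfQc.benchmarking_inputs.json << 'EOF'
{
  "MasterVcfQc.site_level_comparison_datasets": [
    ["gnomAD_v2_sites", "gs://my-bucket/benchmarking/site_level_beds/"]
  ],
  "MasterVcfQc.sample_level_comparison_datasets": [
    ["gnomAD_v2_samples", "gs://my-bucket/benchmarking/per_sample_tarballs/"]
  ]
}
EOF
```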
runtime_override_plot_qc_per_sample + RuntimeAttr? runtime_override_plot_qc_per_family + RuntimeAttr? runtime_override_per_sample_benchmark_plot + RuntimeAttr? runtime_override_sanitize_outputs + + # overrides for MiniTasks or Utils + RuntimeAttr? runtime_overrite_subset_vcf + RuntimeAttr? runtime_override_merge_vcfwide_stat_shards + RuntimeAttr? runtime_override_merge_vcf_2_bed + + # overrides for ShardedQcCollection + RuntimeAttr? runtime_override_collect_sharded_vcf_stats + RuntimeAttr? runtime_override_svtk_vcf_2_bed + RuntimeAttr? runtime_override_split_vcf_to_qc + RuntimeAttr? runtime_override_merge_subvcf_stat_shards + + # overrides for ShardedCohortBenchmarking + RuntimeAttr? runtime_override_site_level_benchmark + RuntimeAttr? runtime_override_merge_site_level_benchmark + + # overrides for CollectQcPerSample + RuntimeAttr? runtime_override_collect_vids_per_sample + RuntimeAttr? runtime_override_split_samples_list + RuntimeAttr? runtime_override_tar_shard_vid_lists + RuntimeAttr? runtime_override_merge_sharded_per_sample_vid_lists + + # overrides for PerSampleExternalBenchmark + RuntimeAttr? runtime_override_benchmark_samples + RuntimeAttr? runtime_override_split_shuffled_list + RuntimeAttr? runtime_override_merge_and_tar_shard_benchmarks + } + + # Restrict to a subset of all samples, if optioned. This can be useful to + # exclude outlier samples, or restrict to males/females on X/Y (for example) + + if (defined(list_of_samples_to_include)) { + scatter ( vcf_info in zip(vcfs, vcf_idxs) ) { + call Utils.SubsetVcfBySamplesList as SubsetVcf { + input: + vcf=vcf_info.left, + vcf_idx=vcf_info.right, + list_of_samples_to_keep=select_first([list_of_samples_to_include]), + subset_name=basename(vcf_info.left, '.vcf.gz') + ".subsetted", + sv_base_mini_docker=sv_base_mini_docker, + runtime_attr_override=runtime_overrite_subset_vcf + } + } + } + + Array[File] vcfs_for_qc = select_first([SubsetVcf.vcf_subset, vcfs]) + Array[File] vcf_idxs_for_qc = select_first([SubsetVcf.vcf_subset_idx, vcf_idxs]) + + # Scatter raw variant data collection per chromosome + scatter ( contig in contigs ) { + # Collect VCF-wide summary stats + call ShardedQcCollection.ShardedQcCollection as CollectQcVcfwide { + input: + vcfs=vcfs_for_qc, + vcf_idxs=vcf_idxs_for_qc, + contig=contig, + sv_per_shard=sv_per_shard, + prefix="~{prefix}.~{contig}.shard", + sv_base_mini_docker=sv_base_mini_docker, + sv_pipeline_docker=sv_pipeline_docker, + runtime_override_collect_sharded_vcf_stats=runtime_override_collect_sharded_vcf_stats, + runtime_override_svtk_vcf_2_bed=runtime_override_svtk_vcf_2_bed, + runtime_override_split_vcf_to_qc=runtime_override_split_vcf_to_qc, + runtime_override_merge_subvcf_stat_shards=runtime_override_merge_subvcf_stat_shards, + runtime_override_merge_svtk_vcf_2_bed=runtime_override_merge_vcf_2_bed + } + } + + # Merge shards into single VCF stats file + call MiniTasks.ConcatBeds as MergeVcfwideStatShards { + input: + shard_bed_files=CollectQcVcfwide.vcf_stats, + prefix=prefix + ".VCF_sites.stats", + index_output=true, + sv_base_mini_docker=sv_base_mini_docker, + runtime_attr_override=runtime_override_merge_vcfwide_stat_shards + } + + # Merge vcf2bed output + call MiniTasks.ConcatBeds as MergeVcf2Bed { + input: + shard_bed_files=CollectQcVcfwide.vcf2bed_out, + prefix=prefix + ".vcf2bed", + sv_base_mini_docker=sv_base_mini_docker, + runtime_attr_override=runtime_override_merge_vcf_2_bed + } + + # Plot VCF-wide summary stats + call PlotQcVcfWide { + input: + vcf_stats=MergeVcfwideStatShards.merged_bed_file, + 
samples_list=CollectQcVcfwide.samples_list[0], + prefix=prefix, + sv_pipeline_qc_docker=sv_pipeline_qc_docker, + runtime_attr_override=runtime_override_plot_qc_vcf_wide + } + + # Collect and plot site-level benchmarking vs. external datasets + if (defined(site_level_comparison_datasets)) { + scatter ( comparison_dataset_info in select_first([site_level_comparison_datasets, + [[], []]]) ) { + + # Collect site-level external benchmarking data + call CohortExternalBenchmark.ShardedCohortBenchmarking as CollectSiteLevelBenchmarking { + input: + vcf_stats=MergeVcfwideStatShards.merged_bed_file, + prefix=prefix, + contigs=contigs, + benchmarking_bucket=comparison_dataset_info[1], + comparator=comparison_dataset_info[0], + sv_pipeline_qc_docker=sv_pipeline_qc_docker, + sv_base_mini_docker=sv_base_mini_docker, + runtime_override_site_level_benchmark=runtime_override_site_level_benchmark, + runtime_override_merge_site_level_benchmark=runtime_override_merge_site_level_benchmark + } + + # Plot site-level benchmarking results + call PlotQcExternalBenchmarking as PlotSiteLevelBenchmarking { + input: + benchmarking_tarball=CollectSiteLevelBenchmarking.benchmarking_results_tarball, + prefix=prefix, + comparator=comparison_dataset_info[0], + sv_pipeline_qc_docker=sv_pipeline_qc_docker, + runtime_attr_override=runtime_override_site_level_benchmark_plot + } + } + } + + # Shard sample list + call MiniTasks.SplitUncompressed as SplitSamplesList { + input: + whole_file=CollectQcVcfwide.samples_list[0], + lines_per_shard=samples_per_shard, + shard_prefix=prefix + ".list_shard.", + sv_pipeline_docker=sv_pipeline_docker, + runtime_attr_override=runtime_override_split_samples_list + } + + # Collect per-sample VID lists for each sample shard + scatter ( shard in SplitSamplesList.shards ) { + call CollectQcPerSample.CollectQcPerSample as CollectPerSampleVidLists { + input: + vcfs=vcfs, + vcf_format_has_cn=vcf_format_has_cn, + samples_list=shard, + prefix=prefix, + sv_base_mini_docker=sv_base_mini_docker, + sv_pipeline_docker=sv_pipeline_docker, + runtime_override_collect_vids_per_sample=runtime_override_collect_vids_per_sample, + runtime_override_merge_sharded_per_sample_vid_lists=runtime_override_merge_sharded_per_sample_vid_lists + } + } + + # Merge all VID lists into single output directory and tar it + call TarShardVidLists { + input: + in_tarballs=CollectPerSampleVidLists.vid_lists, + folder_name=prefix + "_perSample_VIDs_merged", + tarball_prefix=prefix + "_perSample_VIDs", + sv_base_mini_docker=sv_base_mini_docker, + runtime_attr_override=runtime_override_tar_shard_vid_lists + } + + # Plot per-sample stats + call PlotQcPerSample { + input: + vcf_stats=MergeVcfwideStatShards.merged_bed_file, + samples_list=CollectQcVcfwide.samples_list[0], + per_sample_tarball=TarShardVidLists.vid_lists, + prefix=prefix, + sv_pipeline_qc_docker=sv_pipeline_qc_docker, + runtime_attr_override=runtime_override_plot_qc_per_sample + } + + # Plot per-family stats if .ped file provided as input + if (defined(ped_file)) { + call PlotQcPerFamily { + input: + vcf_stats=MergeVcfwideStatShards.merged_bed_file, + samples_list=CollectQcVcfwide.samples_list[0], + ped_file=select_first([ped_file]), + max_trios=max_trios, + per_sample_tarball=TarShardVidLists.vid_lists, + prefix=prefix, + sv_pipeline_qc_docker=sv_pipeline_qc_docker, + runtime_attr_override=runtime_override_plot_qc_per_family + } + } + + # Collect and plot per-sample benchmarking vs. 
external callsets + if (defined(sample_level_comparison_datasets)) { + scatter ( comparison_dataset_info in select_first([sample_level_comparison_datasets, + [[], []]]) ) { + + # Collect per-sample external benchmarking data + call PerSampleExternalBenchmark.PerSampleExternalBenchmark as CollectPerSampleBenchmarking { + input: + vcf_stats=MergeVcfwideStatShards.merged_bed_file, + samples_list=CollectQcVcfwide.samples_list[0], + per_sample_tarball=TarShardVidLists.vid_lists, + comparison_tarball=select_first([comparison_dataset_info[1]]), + prefix=prefix, + contigs=contigs, + comparison_set_name=comparison_dataset_info[0], + samples_per_shard=samples_per_shard, + random_seed=random_seed, + sv_base_mini_docker=sv_base_mini_docker, + sv_pipeline_docker=sv_pipeline_docker, + sv_pipeline_qc_docker=sv_pipeline_qc_docker, + runtime_override_benchmark_samples=runtime_override_benchmark_samples, + runtime_override_split_shuffled_list=runtime_override_split_shuffled_list, + runtime_override_merge_and_tar_shard_benchmarks=runtime_override_merge_and_tar_shard_benchmarks + } + + # Plot per-sample benchmarking results + call PlotQcPerSampleBenchmarking as PlotPerSampleBenchmarking { + input: + per_sample_benchmarking_tarball=CollectPerSampleBenchmarking.benchmarking_results_tarball, + samples_list=CollectQcVcfwide.samples_list[0], + comparison_set_name=comparison_dataset_info[0], + prefix=prefix, + sv_pipeline_qc_docker=sv_pipeline_qc_docker, + runtime_attr_override=runtime_override_per_sample_benchmark_plot + } + } + } + + # Sanitize all outputs + call SanitizeOutputs { + input: + prefix=prefix, + samples_list=CollectQcVcfwide.samples_list[0], + vcf_stats=MergeVcfwideStatShards.merged_bed_file, + vcf_stats_idx=MergeVcfwideStatShards.merged_bed_idx, + plot_qc_vcfwide_tarball=PlotQcVcfWide.plots_tarball, + plot_qc_site_level_external_benchmarking_tarballs=PlotSiteLevelBenchmarking.tarball_wPlots, + collect_qc_per_sample_tarball=TarShardVidLists.vid_lists, + plot_qc_per_sample_tarball=PlotQcPerSample.perSample_plots_tarball, + plot_qc_per_family_tarball=PlotQcPerFamily.perFamily_plots_tarball, + cleaned_fam_file=PlotQcPerFamily.cleaned_fam_file, + plot_qc_per_sample_external_benchmarking_tarballs=PlotPerSampleBenchmarking.perSample_plots_tarball, + sv_base_mini_docker=sv_base_mini_docker, + runtime_attr_override=runtime_override_sanitize_outputs + } + + # Final output + output { + File sv_vcf_qc_output = SanitizeOutputs.vcf_qc_tarball + File vcf2bed_output = MergeVcf2Bed.merged_bed_file + } +} + + +# Plot VCF-wide QC stats +task PlotQcVcfWide { + input { + File vcf_stats + File samples_list + String prefix + String sv_pipeline_qc_docker + RuntimeAttr? 
runtime_attr_override + } + RuntimeAttr runtime_default = object { + mem_gb: 3.75, + disk_gb: 20, + cpu_cores: 1, + preemptible_tries: 1, + max_retries: 1, + boot_disk_gb: 10 + } + RuntimeAttr runtime_override = select_first([runtime_attr_override, runtime_default]) + runtime { + memory: "~{select_first([runtime_override.mem_gb, runtime_default.mem_gb])} GiB" + disks: "local-disk ~{select_first([runtime_override.disk_gb, runtime_default.disk_gb])} HDD" + cpu: select_first([runtime_override.cpu_cores, runtime_default.cpu_cores]) + preemptible: select_first([runtime_override.preemptible_tries, runtime_default.preemptible_tries]) + maxRetries: select_first([runtime_override.max_retries, runtime_default.max_retries]) + docker: sv_pipeline_qc_docker + bootDiskSizeGb: select_first([runtime_override.boot_disk_gb, runtime_default.boot_disk_gb]) + } + + command <<< + set -eu -o pipefail + + # Plot VCF-wide distributions + /opt/sv-pipeline/scripts/vcf_qc/plot_sv_vcf_distribs.R \ + -N $( cat ~{samples_list} | sort | uniq | wc -l ) \ + -S /opt/sv-pipeline/scripts/vcf_qc/SV_colors.txt \ + ~{vcf_stats} \ + plotQC_vcfwide_output/ + + # Prep outputs + tar -czvf ~{prefix}.plotQC_vcfwide_output.tar.gz \ + plotQC_vcfwide_output + >>> + + output { + File plots_tarball = "~{prefix}.plotQC_vcfwide_output.tar.gz" + } +} + + +# Task to merge VID lists across shards +task TarShardVidLists { + input { + Array[File] in_tarballs + String? folder_name + String? tarball_prefix + String sv_base_mini_docker + RuntimeAttr? runtime_attr_override + } + + String tar_folder_name = select_first([folder_name, "merged"]) + String outfile_name = select_first([tarball_prefix, tar_folder_name]) + ".tar.gz" + + # Since the input files are often/always compressed themselves, assume compression factor for tarring is 1.0 + Float input_size = size(in_tarballs, "GB") + Float base_disk_gb = 10.0 + Float base_mem_gb = 2.0 + RuntimeAttr runtime_default = object { + mem_gb: base_mem_gb, + disk_gb: ceil(base_disk_gb + input_size * 2.0), + cpu_cores: 1, + preemptible_tries: 1, + max_retries: 1, + boot_disk_gb: 10 + } + RuntimeAttr runtime_override = select_first([runtime_attr_override, runtime_default]) + runtime { + memory: "~{select_first([runtime_override.mem_gb, runtime_default.mem_gb])} GB" + disks: "local-disk ~{select_first([runtime_override.disk_gb, runtime_default.disk_gb])} HDD" + cpu: select_first([runtime_override.cpu_cores, runtime_default.cpu_cores]) + preemptible: select_first([runtime_override.preemptible_tries, runtime_default.preemptible_tries]) + maxRetries: select_first([runtime_override.max_retries, runtime_default.max_retries]) + docker: sv_base_mini_docker + bootDiskSizeGb: select_first([runtime_override.boot_disk_gb, runtime_default.boot_disk_gb]) + } + + command <<< + # Create final output directory + mkdir "~{tar_folder_name}" + + while read tarball_path; do + tar -xzvf "$tarball_path" --directory ~{tar_folder_name}/ + done < ~{write_lines(in_tarballs)} + + # Compress final output directory + tar -czvf "~{outfile_name}" "~{tar_folder_name}" + >>> + + output { + File vid_lists = outfile_name + } +} + + +# Plot external benchmarking results +task PlotQcExternalBenchmarking { + input { + File benchmarking_tarball + String prefix + String comparator + String sv_pipeline_qc_docker + RuntimeAttr? 
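PlotQcVcfWide above is a thin wrapper around the VCF-wide plotting script (presumably the same R script patched earlier in this diff): it passes the deduplicated sample count as -N and the SV color table as -S. Running the same step outside Cromwell, with placeholder file names, would look roughly like:

```bash
# Inside the sv-pipeline-qc docker image, where the script and color table live.
N_SAMPLES=$( sort samples.list | uniq | wc -l )
/opt/sv-pipeline/scripts/vcf_qc/plot_sv_vcf_distribs.R \
  -N "$N_SAMPLES" \
  -S /opt/sv-pipeline/scripts/vcf_qc/SV_colors.txt \
  gnomAD-SV_v3.VCF_sites.stats.bed.gz \
  plotQC_vcfwide_output/
tar -czvf gnomAD-SV_v3.plotQC_vcfwide_output.tar.gz plotQC_vcfwide_output
```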
runtime_attr_override + } + RuntimeAttr runtime_default = object { + mem_gb: 3.75, + disk_gb: 20, + cpu_cores: 1, + preemptible_tries: 1, + max_retries: 1, + boot_disk_gb: 10 + } + RuntimeAttr runtime_override = select_first([runtime_attr_override, runtime_default]) + runtime { + memory: "~{select_first([runtime_override.mem_gb, runtime_default.mem_gb])} GiB" + disks: "local-disk ~{select_first([runtime_override.disk_gb, runtime_default.disk_gb])} HDD" + cpu: select_first([runtime_override.cpu_cores, runtime_default.cpu_cores]) + preemptible: select_first([runtime_override.preemptible_tries, runtime_default.preemptible_tries]) + maxRetries: select_first([runtime_override.max_retries, runtime_default.max_retries]) + docker: sv_pipeline_qc_docker + bootDiskSizeGb: select_first([runtime_override.boot_disk_gb, runtime_default.boot_disk_gb]) + } + + command <<< + set -eu -o pipefail + + # Plot benchmarking stats + /opt/sv-pipeline/scripts/vcf_qc/plotQC.external_benchmarking.helper.sh \ + ~{benchmarking_tarball} \ + ~{comparator} + + # Prep outputs + tar -czvf ~{prefix}.collectQC_benchmarking_~{comparator}_output.wPlots.tar.gz \ + collectQC_benchmarking_~{comparator}_output + >>> + + output { + File tarball_wPlots = "~{prefix}.collectQC_benchmarking_~{comparator}_output.wPlots.tar.gz" + } +} + + +# Plot per-sample stats +task PlotQcPerSample { + input { + File vcf_stats + File samples_list + File per_sample_tarball + String prefix + String sv_pipeline_qc_docker + RuntimeAttr? runtime_attr_override + } + RuntimeAttr runtime_default = object { + mem_gb: 7.75, + disk_gb: 50, + cpu_cores: 1, + preemptible_tries: 1, + max_retries: 1, + boot_disk_gb: 10 + } + + RuntimeAttr runtime_override = select_first([runtime_attr_override, runtime_default]) + runtime { + memory: "~{select_first([runtime_override.mem_gb, runtime_default.mem_gb])} GiB" + disks: "local-disk ~{select_first([runtime_override.disk_gb, runtime_default.disk_gb])} HDD" + cpu: select_first([runtime_override.cpu_cores, runtime_default.cpu_cores]) + preemptible: select_first([runtime_override.preemptible_tries, runtime_default.preemptible_tries]) + maxRetries: select_first([runtime_override.max_retries, runtime_default.max_retries]) + docker: sv_pipeline_qc_docker + bootDiskSizeGb: select_first([runtime_override.boot_disk_gb, runtime_default.boot_disk_gb]) + } + + command <<< + set -eu -o pipefail + + # Make per-sample directory + mkdir ~{prefix}_perSample/ + + # Untar per-sample VID lists + mkdir tmp_untar/ + tar -xvzf ~{per_sample_tarball} \ + --directory tmp_untar/ + find tmp_untar/ -name "*.VIDs_genotypes.txt.gz" | while read FILE; do + mv $FILE ~{prefix}_perSample/ + done + + # Plot per-sample distributions + /opt/sv-pipeline/scripts/vcf_qc/plot_sv_perSample_distribs.R \ + -S /opt/sv-pipeline/scripts/vcf_qc/SV_colors.txt \ + ~{vcf_stats} \ + ~{samples_list} \ + ~{prefix}_perSample/ \ + ~{prefix}_perSample_plots/ + + # Prepare output + tar -czvf ~{prefix}.plotQC_perSample.tar.gz \ + ~{prefix}_perSample_plots + >>> + + output { + File perSample_plots_tarball = "~{prefix}.plotQC_perSample.tar.gz" + } +} + + +# Plot per-family stats +task PlotQcPerFamily { + input { + File vcf_stats + File samples_list + File ped_file + File per_sample_tarball + Int max_trios + Int? random_seed = 2021 + String prefix + String sv_pipeline_qc_docker + RuntimeAttr? 
runtime_attr_override + } + RuntimeAttr runtime_default = object { + mem_gb: 7.75, + disk_gb: 50, + cpu_cores: 1, + preemptible_tries: 1, + max_retries: 1, + boot_disk_gb: 10 + } + RuntimeAttr runtime_override = select_first([runtime_attr_override, runtime_default]) + runtime { + memory: "~{select_first([runtime_override.mem_gb, runtime_default.mem_gb])} GiB" + disks: "local-disk ~{select_first([runtime_override.disk_gb, runtime_default.disk_gb])} HDD" + cpu: select_first([runtime_override.cpu_cores, runtime_default.cpu_cores]) + preemptible: select_first([runtime_override.preemptible_tries, runtime_default.preemptible_tries]) + maxRetries: select_first([runtime_override.max_retries, runtime_default.max_retries]) + docker: sv_pipeline_qc_docker + bootDiskSizeGb: select_first([runtime_override.boot_disk_gb, runtime_default.boot_disk_gb]) + } + + command <<< + set -eu -o pipefail + + # Clean fam file + /opt/sv-pipeline/scripts/vcf_qc/cleanFamFile.sh \ + ~{samples_list} \ + ~{ped_file} \ + cleaned.fam + rm ~{ped_file} ~{samples_list} + + # Only run if any families remain after cleaning + n_fams=$( grep -Ev "^#" cleaned.fam | wc -l ) + echo -e "DETECTED $n_fams FAMILIES" + if [ $n_fams -gt 0 ]; then + + # Make per-sample directory + mkdir ~{prefix}_perSample/ + + # Untar per-sample VID lists + mkdir tmp_untar/ + tar -xvzf ~{per_sample_tarball} \ + --directory tmp_untar/ + for FILE in $( find tmp_untar/ -name "*.VIDs_genotypes.txt.gz" ); do + mv -v $FILE ~{prefix}_perSample/ + done + + # Subset fam file, if optioned + n_trios=$( grep -Ev "^#" cleaned.fam \ + | awk '{ if ($2 != "0" && $2 != "." && \ + $3 != "0" && $3 != "." && \ + $4 != "0" && $4 != ".") print $0 }' \ + | wc -l ) + echo -e "DETECTED $n_trios COMPLETE TRIOS" + if [ $n_trios -gt ~{max_trios} ]; then + grep -E '^#' cleaned.fam > fam_header.txt + grep -Ev "^#" cleaned.fam \ + | awk '{ if ($2 != "0" && $2 != "." && \ + $3 != "0" && $3 != "." && \ + $4 != "0" && $4 != ".") print $0 }' \ + | sort -R --random-source <( yes ~{random_seed} ) \ + > cleaned.shuffled.fam + awk -v max_trios="~{max_trios}" 'NR <= max_trios' cleaned.shuffled.fam \ + | cat fam_header.txt - \ + > cleaned.subset.fam + echo -e "SUBSETTED TO $( cat cleaned.subset.fam | wc -l | awk '{ print $1-1 }' ) RANDOM FAMILIES" + else + echo -e "NUMBER OF TRIOS DETECTED ( $n_trios ) LESS THAN MAX_TRIOS ( ~{max_trios} ); PROCEEDING WITHOUT DOWNSAMPLING" + cp cleaned.fam cleaned.subset.fam + fi + + # Run family analysis + echo -e "STARTING FAMILY-BASED ANALYSIS" + /opt/sv-pipeline/scripts/vcf_qc/analyze_fams.R \ + -S /opt/sv-pipeline/scripts/vcf_qc/SV_colors.txt \ + ~{vcf_stats} \ + cleaned.subset.fam \ + ~{prefix}_perSample/ \ + ~{prefix}_perFamily_plots/ + + else + + mkdir ~{prefix}_perFamily_plots/ + + fi + + # Prepare output + echo -e "COMPRESSING RESULTS AS A TARBALL" + tar -czvf ~{prefix}.plotQC_perFamily.tar.gz \ + ~{prefix}_perFamily_plots + >>> + + output { + File perFamily_plots_tarball = "~{prefix}.plotQC_perFamily.tar.gz" + File cleaned_fam_file = "cleaned.fam" + } +} + + +# Plot per-sample benchmarking +task PlotQcPerSampleBenchmarking { + input { + File per_sample_benchmarking_tarball + File samples_list + String comparison_set_name + String prefix + String sv_pipeline_qc_docker + Int? max_samples = 3000 + Int? random_seed = 2021 + RuntimeAttr? 
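+    # Note: at most `max_samples` samples are plotted; larger callsets are downsampled
+    # to a reproducible random subset (shuffling is seeded with `random_seed`).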
runtime_attr_override + } + RuntimeAttr runtime_default = object { + mem_gb: 7.75, + disk_gb: 50, + cpu_cores: 1, + preemptible_tries: 1, + max_retries: 1, + boot_disk_gb: 10 + } + + RuntimeAttr runtime_override = select_first([runtime_attr_override, runtime_default]) + runtime { + memory: "~{select_first([runtime_override.mem_gb, runtime_default.mem_gb])} GiB" + disks: "local-disk ~{select_first([runtime_override.disk_gb, runtime_default.disk_gb])} HDD" + cpu: select_first([runtime_override.cpu_cores, runtime_default.cpu_cores]) + preemptible: select_first([runtime_override.preemptible_tries, runtime_default.preemptible_tries]) + maxRetries: select_first([runtime_override.max_retries, runtime_default.max_retries]) + docker: sv_pipeline_qc_docker + bootDiskSizeGb: select_first([runtime_override.boot_disk_gb, runtime_default.boot_disk_gb]) + } + + command <<< + set -eu -o pipefail + + # Untar per-sample benchmarking results + mkdir tmp_untar/ + tar -xvzf ~{per_sample_benchmarking_tarball} \ + --directory tmp_untar/ + mkdir results/ + + # Subset to max_samples + find tmp_untar/ -name "*.sensitivity.bed.gz" \ + | xargs -I {} basename {} | sed 's/\.sensitivity\.bed\.gz//g' \ + | sort -V \ + > all_samples.list + n_samples_all=$( cat all_samples.list | wc -l ) + echo -e "IDENTIFIED $n_samples_all TOTAL SAMPLES" + if [ $n_samples_all -gt ~{max_samples} ]; then + echo -e "SUBSETTING TO ~{max_samples} SAMPLES" + cat all_samples.list \ + | sort -R --random-source <( yes ~{random_seed} ) \ + | awk -v max_samples=~{max_samples} '{ if (NR<=max_samples) print }' \ + > ~{prefix}.plotted_samples.list + else + cp all_samples.list ~{prefix}.plotted_samples.list + fi + + while read ID; do + find tmp_untar -name "$ID.*.bed.gz" | xargs -I {} mv {} results/ + done < ~{prefix}.plotted_samples.list + + # Plot per-sample benchmarking + /opt/sv-pipeline/scripts/vcf_qc/plot_perSample_benchmarking.R \ + -c ~{comparison_set_name} \ + results/ \ + ~{samples_list} \ + /opt/sv-pipeline/scripts/vcf_qc/SV_colors.txt \ + ~{prefix}.~{comparison_set_name}_perSample_benchmarking_plots/ + + # Prepare output + tar -czvf ~{prefix}.~{comparison_set_name}_perSample_benchmarking_plots.tar.gz \ + ~{prefix}.~{comparison_set_name}_perSample_benchmarking_plots + >>> + + output { + File perSample_plots_tarball = "~{prefix}.~{comparison_set_name}_perSample_benchmarking_plots.tar.gz" + File samples_plotted = "~{prefix}.plotted_samples.list" + } +} + + +# Sanitize final output +task SanitizeOutputs { + input { + String prefix + File samples_list + File vcf_stats + File vcf_stats_idx + File plot_qc_vcfwide_tarball + Array[File]? plot_qc_site_level_external_benchmarking_tarballs + File collect_qc_per_sample_tarball + File plot_qc_per_sample_tarball + File? plot_qc_per_family_tarball + File? cleaned_fam_file + Array[File]? plot_qc_per_sample_external_benchmarking_tarballs + String sv_base_mini_docker + RuntimeAttr? 
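+    # Optional override of the default runtime attributes declared below; any field left
+    # unset falls back to `runtime_default` via select_first(). For example (hypothetical
+    # values), a caller could pass `object { mem_gb: 4, disk_gb: 100 }` to bump only
+    # memory and disk.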
runtime_attr_override
+  }
+
+  # simple compress + tar workflow
+  # Count each input at most once when estimating disk needs; optional inputs
+  # contribute only if they are defined.
+  Float input_size = size(
+    flatten([[ vcf_stats, samples_list, vcf_stats_idx, plot_qc_vcfwide_tarball,
+               collect_qc_per_sample_tarball, plot_qc_per_sample_tarball ],
+             select_all([plot_qc_per_family_tarball, cleaned_fam_file]),
+             select_first([plot_qc_site_level_external_benchmarking_tarballs, []]),
+             select_first([plot_qc_per_sample_external_benchmarking_tarballs, []])]),
+    "GiB"
+  )
+  Float compression_factor = 5.0
+  Float base_disk_gb = 5.0
+  Float base_mem_gb = 2.0
+  RuntimeAttr runtime_default = object {
+    mem_gb: base_mem_gb,
+    disk_gb: ceil(base_disk_gb + input_size * (2.0 + compression_factor)),
+    cpu_cores: 1,
+    preemptible_tries: 1,
+    max_retries: 1,
+    boot_disk_gb: 10
+  }
+  RuntimeAttr runtime_override = select_first([runtime_attr_override, runtime_default])
+  runtime {
+    memory: "~{select_first([runtime_override.mem_gb, runtime_default.mem_gb])} GiB"
+    disks: "local-disk ~{select_first([runtime_override.disk_gb, runtime_default.disk_gb])} HDD"
+    cpu: select_first([runtime_override.cpu_cores, runtime_default.cpu_cores])
+    preemptible: select_first([runtime_override.preemptible_tries, runtime_default.preemptible_tries])
+    maxRetries: select_first([runtime_override.max_retries, runtime_default.max_retries])
+    docker: sv_base_mini_docker
+    bootDiskSizeGb: select_first([runtime_override.boot_disk_gb, runtime_default.boot_disk_gb])
+  }
+
+  command <<<
+    set -eu -o pipefail
+
+    # Prep output directory tree
+    mkdir ~{prefix}_SV_VCF_QC_output/
+    mkdir ~{prefix}_SV_VCF_QC_output/data/
+    mkdir ~{prefix}_SV_VCF_QC_output/data/variant_info_per_sample/
+    mkdir ~{prefix}_SV_VCF_QC_output/plots/
+    mkdir ~{prefix}_SV_VCF_QC_output/plots/main_plots/
+    mkdir ~{prefix}_SV_VCF_QC_output/plots/supplementary_plots/
+    mkdir ~{prefix}_SV_VCF_QC_output/plots/supplementary_plots/vcf_summary_plots/
+    mkdir ~{prefix}_SV_VCF_QC_output/plots/supplementary_plots/external_benchmarking_tarballs/
+    for tarball_fname in ~{sep=" " plot_qc_site_level_external_benchmarking_tarballs}; do
+      dname="$( basename -s '.tar.gz' $tarball_fname )_site_level_benchmarking_plots/"
+      mkdir ~{prefix}_SV_VCF_QC_output/plots/supplementary_plots/$dname
+    done
+    mkdir ~{prefix}_SV_VCF_QC_output/plots/supplementary_plots/per_sample_plots/
+    if ~{defined(plot_qc_per_family_tarball)}; then
+      mkdir ~{prefix}_SV_VCF_QC_output/plots/supplementary_plots/sv_inheritance_plots/
+    fi
+    for tarball_fname in ~{sep=" " plot_qc_per_sample_external_benchmarking_tarballs}; do
+      dname="$( basename -s '.tar.gz' $tarball_fname )_per_sample_benchmarking_plots/"
+      mkdir ~{prefix}_SV_VCF_QC_output/plots/supplementary_plots/$dname
+    done
+
+    # Process VCF-wide stats
+    cp ~{vcf_stats} \
+      ~{prefix}_SV_VCF_QC_output/data/~{prefix}.VCF_sites.stats.bed.gz
+    cp ~{vcf_stats_idx} \
+      ~{prefix}_SV_VCF_QC_output/data/~{prefix}.VCF_sites.stats.bed.gz.tbi
+
+    # Process VCF-wide plots
+    tar -xzvf ~{plot_qc_vcfwide_tarball}
+    cp plotQC_vcfwide_output/main_plots/* \
+      ~{prefix}_SV_VCF_QC_output/plots/main_plots/
+    cp plotQC_vcfwide_output/supporting_plots/vcf_summary_plots/* \
+      ~{prefix}_SV_VCF_QC_output/plots/supplementary_plots/vcf_summary_plots/
+
+    # Process site-level external benchmarking plots
+    if ~{defined(plot_qc_site_level_external_benchmarking_tarballs)}; then
+      # For now, just dump them all into a tmp holding directory
+      # TODO: clean this up so it appropriately relocates all files & plots
+      cp ~{sep=" " 
plot_qc_site_level_external_benchmarking_tarballs} \ + ~{prefix}_SV_VCF_QC_output/plots/supplementary_plots/external_benchmarking_tarballs/ + fi + + # Process per-sample stats + tar -xzvf ~{collect_qc_per_sample_tarball} + cp ~{prefix}_perSample_VIDs_merged/*.VIDs_genotypes.txt.gz \ + ~{prefix}_SV_VCF_QC_output/data/variant_info_per_sample/ + + # Process per-sample plots + tar -xzvf ~{plot_qc_per_sample_tarball} + cp ~{prefix}_perSample_plots/main_plots/* \ + ~{prefix}_SV_VCF_QC_output/plots/main_plots/ + cp ~{prefix}_perSample_plots/supporting_plots/per_sample_plots/* \ + ~{prefix}_SV_VCF_QC_output/plots/supplementary_plots/per_sample_plots/ + + # Process per-family plots + if ~{defined(plot_qc_per_family_tarball)}; then + tar -xzvf ~{plot_qc_per_family_tarball} + cp ~{prefix}_perFamily_plots/main_plots/* \ + ~{prefix}_SV_VCF_QC_output/plots/main_plots/ || true + cp ~{prefix}_perFamily_plots/supporting_plots/sv_inheritance_plots/* \ + ~{prefix}_SV_VCF_QC_output/plots/supplementary_plots/sv_inheritance_plots/ || true + fi + + # Process per-sample external benchmarking plots + if ~{defined(plot_qc_per_sample_external_benchmarking_tarballs)}; then + # For now, just dump them all into a tmp holding directory + # TODO: clean this up so it appropriately relocates all files & plots + cp ~{sep=" " plot_qc_per_sample_external_benchmarking_tarballs} \ + ~{prefix}_SV_VCF_QC_output/plots/supplementary_plots/external_benchmarking_tarballs/ + fi + + # Process misc files + if ~{defined(cleaned_fam_file)}; then + cp ~{cleaned_fam_file} \ + ~{prefix}_SV_VCF_QC_output/data/~{prefix}.cleaned_trios.fam + fi + cp ~{samples_list} \ + ~{prefix}_SV_VCF_QC_output/data/~{prefix}.samples_analyzed.list + + # Compress final output + tar -czvf ~{prefix}_SV_VCF_QC_output.tar.gz \ + ~{prefix}_SV_VCF_QC_output + >>> + + output { + File vcf_qc_tarball = "~{prefix}_SV_VCF_QC_output.tar.gz" + } +} + diff --git a/wdl/MasterVcfQc.wdl b/wdl/MasterVcfQc.wdl deleted file mode 100644 index 480cbcca4..000000000 --- a/wdl/MasterVcfQc.wdl +++ /dev/null @@ -1,936 +0,0 @@ -version 1.0 - -# Author: Ryan Collins - -import "ShardedQcCollection.wdl" as ShardedQcCollection -import "CollectQcPerSample.wdl" as CollectQcPerSample -import "PerSampleExternalBenchmark.wdl" as PerSampleExternalBenchmark - -import "Tasks0506.wdl" as MiniTasks - -# Master workflow to perform comprehensive quality control (QC) on -# an SV VCF output by GATK-SV -workflow MasterVcfQc { - input { - File vcf - File ped_file - String prefix - Int sv_per_shard - Int samples_per_shard - Array[File]? thousand_genomes_tarballs - Array[File]? hgsv_tarballs - Array[File]? asc_tarballs - File? sanders_2015_tarball - File? collins_2017_tarball - File? werling_2018_tarball - Array[String] contigs - Int? random_seed - File? vcf_idx - - String sv_base_mini_docker - String sv_pipeline_docker - String sv_pipeline_qc_docker - - # overrides for local tasks - RuntimeAttr? runtime_override_plot_qc_vcf_wide - RuntimeAttr? runtime_override_thousand_g_benchmark - RuntimeAttr? runtime_override_thousand_g_plot - RuntimeAttr? runtime_override_asc_benchmark - RuntimeAttr? runtime_override_asc_plot - RuntimeAttr? runtime_override_custom_external - RuntimeAttr? runtime_override_hgsv_benchmark - RuntimeAttr? runtime_override_hgsv_plot - RuntimeAttr? runtime_override_plot_qc_per_sample - RuntimeAttr? runtime_override_plot_qc_per_family - RuntimeAttr? runtime_override_sanders_per_sample_plot - RuntimeAttr? runtime_override_collins_per_sample_plot - RuntimeAttr? 
runtime_override_werling_per_sample_plot - RuntimeAttr? runtime_override_sanitize_outputs - - # overrides for MiniTasks - RuntimeAttr? runtime_override_merge_vcfwide_stat_shards - RuntimeAttr? runtime_override_merge_vcf_2_bed - - # overrides for ShardedQcCollection - RuntimeAttr? runtime_override_collect_sharded_vcf_stats - RuntimeAttr? runtime_override_svtk_vcf_2_bed - RuntimeAttr? runtime_override_split_vcf_to_qc - RuntimeAttr? runtime_override_merge_subvcf_stat_shards - RuntimeAttr? runtime_override_merge_svtk_vcf_2_bed - - # overrides for CollectQcPerSample - RuntimeAttr? runtime_override_collect_vids_per_sample - RuntimeAttr? runtime_override_split_samples_list - RuntimeAttr? runtime_override_tar_shard_vid_lists - - # overrides for PerSampleExternalBenchmark - RuntimeAttr? runtime_override_benchmark_samples - RuntimeAttr? runtime_override_split_shuffled_list - RuntimeAttr? runtime_override_merge_and_tar_shard_benchmarks - } - # Scatter raw variant data collection per chromosome - scatter ( contig in contigs ) { - # Collect VCF-wide summary stats - call ShardedQcCollection.ShardedQcCollection as CollectQcVcfwide { - input: - vcf=vcf, - vcf_idx=vcf_idx, - contig=contig, - sv_per_shard=sv_per_shard, - prefix="~{prefix}.~{contig}.shard", - sv_base_mini_docker=sv_base_mini_docker, - sv_pipeline_docker=sv_pipeline_docker, - runtime_override_collect_sharded_vcf_stats=runtime_override_collect_sharded_vcf_stats, - runtime_override_svtk_vcf_2_bed=runtime_override_svtk_vcf_2_bed, - runtime_override_split_vcf_to_qc=runtime_override_split_vcf_to_qc, - runtime_override_merge_subvcf_stat_shards=runtime_override_merge_subvcf_stat_shards, - runtime_override_merge_svtk_vcf_2_bed=runtime_override_merge_svtk_vcf_2_bed - } - } - - # Merge shards into single VCF stats file - call MiniTasks.ConcatBeds as MergeVcfwideStatShards { - input: - shard_bed_files=CollectQcVcfwide.vcf_stats, - prefix=prefix + ".VCF_sites.stats", - index_output=true, - sv_base_mini_docker=sv_base_mini_docker, - runtime_attr_override=runtime_override_merge_vcfwide_stat_shards - - } - - # Merge vcf2bed output - call MiniTasks.ConcatBeds as MergeVcf2Bed { - input: - shard_bed_files=CollectQcVcfwide.vcf2bed_out, - prefix=prefix + ".vcf2bed", - sv_base_mini_docker=sv_base_mini_docker, - runtime_attr_override=runtime_override_merge_vcf_2_bed - } - - # Plot VCF-wide summary stats - call PlotQcVcfWide { - input: - vcf_stats=MergeVcfwideStatShards.merged_bed_file, - samples_list=CollectQcVcfwide.samples_list[0], - prefix=prefix, - sv_pipeline_qc_docker=sv_pipeline_qc_docker, - runtime_attr_override=runtime_override_plot_qc_vcf_wide - } - # Collect per-sample VID lists - call CollectQcPerSample.CollectQcPerSample as CollectPerSampleVidLists { - input: - vcf=vcf, - samples_list=CollectQcVcfwide.samples_list[0], - prefix=prefix, - samples_per_shard=samples_per_shard, - sv_base_mini_docker=sv_base_mini_docker, - sv_pipeline_docker=sv_pipeline_docker, - runtime_override_collect_vids_per_sample=runtime_override_collect_vids_per_sample, - runtime_override_split_samples_list=runtime_override_split_samples_list, - runtime_override_tar_shard_vid_lists=runtime_override_tar_shard_vid_lists, - sv_base_mini_docker=sv_base_mini_docker, - sv_pipeline_docker=sv_pipeline_docker - } - - # Plot per-sample stats - call PlotQcPerSample { - input: - vcf_stats=MergeVcfwideStatShards.merged_bed_file, - samples_list=CollectQcVcfwide.samples_list[0], - per_sample_tarball=CollectPerSampleVidLists.vid_lists, - prefix=prefix, - sv_pipeline_qc_docker=sv_pipeline_qc_docker, 
- runtime_attr_override=runtime_override_plot_qc_per_sample - } - - # Plot per-family stats - call PlotQcPerFamily { - input: - vcf_stats=MergeVcfwideStatShards.merged_bed_file, - samples_list=CollectQcVcfwide.samples_list[0], - ped_file=ped_file, - per_sample_tarball=CollectPerSampleVidLists.vid_lists, - prefix=prefix, - sv_pipeline_qc_docker=sv_pipeline_qc_docker, - runtime_attr_override=runtime_override_plot_qc_per_family - } - # Sanitize all outputs - call SanitizeOutputs { - input: - prefix=prefix, - samples_list=CollectQcVcfwide.samples_list[0], - vcf_stats=MergeVcfwideStatShards.merged_bed_file, - vcf_stats_idx=MergeVcfwideStatShards.merged_bed_idx, - plot_qc_vcfwide_tarball=PlotQcVcfWide.plots_tarball, - plot_qc_external_benchmarking_thousand_g_tarball=ThousandGPlot.tarball_wPlots, - plot_qc_external_benchmarking_asc_tarball=AscPlot.tarball_wPlots, - plot_qc_external_benchmarking_hgsv_tarball=HgsvPlot.tarball_wPlots, - collect_qc_per_sample_tarball=CollectPerSampleVidLists.vid_lists, - plot_qc_per_sample_tarball=PlotQcPerSample.perSample_plots_tarball, - plot_qc_per_family_tarball=PlotQcPerFamily.perFamily_plots_tarball, - cleaned_fam_file=PlotQcPerFamily.cleaned_fam_file, - plot_qc_per_sample_sanders_tarball=SandersPerSamplePlot.perSample_plots_tarball, - plot_qc_per_sample_collins_tarball=CollinsPerSamplePlot.perSample_plots_tarball, - plot_qc_per_sample_werling_tarball=WerlingPerSamplePlot.perSample_plots_tarball, - sv_base_mini_docker=sv_base_mini_docker, - runtime_attr_override=runtime_override_sanitize_outputs - } - - # Collect external benchmarking vs 1000G - if (defined(thousand_genomes_tarballs)) { - call VcfExternalBenchmark as ThousandGBenchmark { - input: - vcf_stats=MergeVcfwideStatShards.merged_bed_file, - prefix=prefix, - benchmarking_archives=select_first([thousand_genomes_tarballs]), - comparator="1000G_Sudmant", - sv_pipeline_qc_docker=sv_pipeline_qc_docker, - runtime_attr_override=runtime_override_thousand_g_benchmark - } - - # Plot external benchmarking vs 1000G - call PlotQcExternalBenchmarking as ThousandGPlot { - input: - benchmarking_tarball=ThousandGBenchmark.benchmarking_results_tarball, - prefix=prefix, - comparator="1000G_Sudmant", - sv_pipeline_qc_docker=sv_pipeline_qc_docker, - runtime_attr_override=runtime_override_thousand_g_plot - } - } - - # Collect external benchmarking vs ASC - if (defined(hgsv_tarballs)) { - call VcfExternalBenchmark as AscBenchmark { - input: - vcf_stats=MergeVcfwideStatShards.merged_bed_file, - prefix=prefix, - benchmarking_archives=select_first([asc_tarballs]), - comparator="ASC_Werling", - sv_pipeline_qc_docker=sv_pipeline_qc_docker, - runtime_attr_override=runtime_override_asc_benchmark - } - - # Plot external benchmarking vs ASC - call PlotQcExternalBenchmarking as AscPlot { - input: - benchmarking_tarball=AscBenchmark.benchmarking_results_tarball, - prefix=prefix, - comparator="ASC_Werling", - sv_pipeline_qc_docker=sv_pipeline_qc_docker, - runtime_attr_override=runtime_override_asc_plot - } - } - - # Collect external benchmarking vs HGSV - if (defined(hgsv_tarballs)) { - call VcfExternalBenchmark as HgsvBenchmark { - input: - vcf_stats=MergeVcfwideStatShards.merged_bed_file, - prefix=prefix, - benchmarking_archives=select_first([hgsv_tarballs]), - comparator="HGSV_Chaisson", - sv_pipeline_qc_docker=sv_pipeline_qc_docker, - runtime_attr_override=runtime_override_hgsv_benchmark - } - - # Plot external benchmarking vs HGSV - call PlotQcExternalBenchmarking as HgsvPlot { - input: - 
benchmarking_tarball=HgsvBenchmark.benchmarking_results_tarball, - prefix=prefix, - comparator="HGSV_Chaisson", - sv_pipeline_qc_docker=sv_pipeline_qc_docker, - runtime_attr_override=runtime_override_hgsv_plot - } - } - - if (defined(sanders_2015_tarball)) { - # Collect per-sample external benchmarking vs Sanders 2015 arrays - call PerSampleExternalBenchmark.PerSampleExternalBenchmark as SandersPerSampleBenchmark { - input: - vcf_stats=MergeVcfwideStatShards.merged_bed_file, - samples_list=CollectQcVcfwide.samples_list[0], - per_sample_tarball=CollectPerSampleVidLists.vid_lists, - comparison_tarball=select_first([sanders_2015_tarball]), - prefix=prefix, - comparison_set_name="Sanders_2015_array", - samples_per_shard=samples_per_shard, - random_seed=random_seed, - sv_base_mini_docker=sv_base_mini_docker, - sv_pipeline_docker=sv_pipeline_docker, - runtime_override_benchmark_samples=runtime_override_benchmark_samples, - runtime_override_split_shuffled_list=runtime_override_split_shuffled_list, - runtime_override_merge_and_tar_shard_benchmarks=runtime_override_merge_and_tar_shard_benchmarks - } - - # Plot per-sample external benchmarking vs Sanders 2015 arrays - call PlotQcPerSampleBenchmarking as SandersPerSamplePlot { - input: - per_sample_benchmarking_tarball=SandersPerSampleBenchmark.benchmarking_results_tarball, - samples_list=CollectQcVcfwide.samples_list[0], - comparison_set_name="Sanders_2015_array", - prefix=prefix, - sv_pipeline_qc_docker=sv_pipeline_qc_docker, - runtime_attr_override=runtime_override_sanders_per_sample_plot - } - } - - if (defined(collins_2017_tarball)) { - # Collect per-sample external benchmarking vs Collins 2017 liWGS - call PerSampleExternalBenchmark.PerSampleExternalBenchmark as CollinsPerSampleBenchmark { - input: - vcf_stats=MergeVcfwideStatShards.merged_bed_file, - samples_list=CollectQcVcfwide.samples_list[0], - per_sample_tarball=CollectPerSampleVidLists.vid_lists, - comparison_tarball=select_first([collins_2017_tarball]), - prefix=prefix, - comparison_set_name="Collins_2017_liWGS", - samples_per_shard=samples_per_shard, - random_seed=random_seed, - sv_base_mini_docker=sv_base_mini_docker, - sv_pipeline_docker=sv_pipeline_docker, - runtime_override_benchmark_samples=runtime_override_benchmark_samples, - runtime_override_split_shuffled_list=runtime_override_split_shuffled_list, - runtime_override_merge_and_tar_shard_benchmarks=runtime_override_merge_and_tar_shard_benchmarks - } - - # Plot per-sample external benchmarking vs Collins 2017 liWGS - call PlotQcPerSampleBenchmarking as CollinsPerSamplePlot { - input: - per_sample_benchmarking_tarball=CollinsPerSampleBenchmark.benchmarking_results_tarball, - samples_list=CollectQcVcfwide.samples_list[0], - comparison_set_name="Collins_2017_liWGS", - prefix=prefix, - sv_pipeline_qc_docker=sv_pipeline_qc_docker, - runtime_attr_override=runtime_override_collins_per_sample_plot - } - } - - if (defined(werling_2018_tarball)) { - # Collect per-sample external benchmarking vs Werling 2018 WGS - call PerSampleExternalBenchmark.PerSampleExternalBenchmark as WerlingPerSampleBenchmark { - input: - vcf_stats=MergeVcfwideStatShards.merged_bed_file, - samples_list=CollectQcVcfwide.samples_list[0], - per_sample_tarball=CollectPerSampleVidLists.vid_lists, - comparison_tarball=select_first([werling_2018_tarball]), - prefix=prefix, - comparison_set_name="Werling_2018_WGS", - samples_per_shard=samples_per_shard, - random_seed=random_seed, - sv_base_mini_docker=sv_base_mini_docker, - sv_pipeline_docker=sv_pipeline_docker, - 
runtime_override_benchmark_samples=runtime_override_benchmark_samples, - runtime_override_split_shuffled_list=runtime_override_split_shuffled_list, - runtime_override_merge_and_tar_shard_benchmarks=runtime_override_merge_and_tar_shard_benchmarks - } - - # Plot per-sample external benchmarking vs Werling 2018 WGS - call PlotQcPerSampleBenchmarking as WerlingPerSamplePlot { - input: - per_sample_benchmarking_tarball=WerlingPerSampleBenchmark.benchmarking_results_tarball, - samples_list=CollectQcVcfwide.samples_list[0], - comparison_set_name="Werling_2018_WGS", - prefix=prefix, - sv_pipeline_qc_docker=sv_pipeline_qc_docker, - runtime_attr_override=runtime_override_werling_per_sample_plot - } - } - - # Final output - output { - File sv_vcf_qc_output = SanitizeOutputs.vcf_qc_tarball - File vcf2bed_output = MergeVcf2Bed.merged_bed_file - } -} - - -# Plot VCF-wide QC stats -task PlotQcVcfWide { - input { - File vcf_stats - File samples_list - String prefix - String sv_pipeline_qc_docker - RuntimeAttr? runtime_attr_override - } - - # when filtering/sorting/etc, memory usage will likely go up (much of the data will have to - # be held in memory or disk while working, potentially in a form that takes up more space) - Float input_size = size([vcf_stats, samples_list], "GiB") - Float compression_factor = 5.0 - Float base_disk_gb = 5.0 - # give extra base memory in case the plotting functions are very inefficient - Float base_mem_gb = 3.75 - RuntimeAttr runtime_default = object { - mem_gb: base_mem_gb + compression_factor * input_size, - disk_gb: ceil(base_disk_gb + input_size * (2.0 + 2.0 * compression_factor)), - cpu_cores: 1, - preemptible_tries: 3, - max_retries: 1, - boot_disk_gb: 10 - } - RuntimeAttr runtime_override = select_first([runtime_attr_override, runtime_default]) - runtime { - memory: "~{select_first([runtime_override.mem_gb, runtime_default.mem_gb])} GiB" - disks: "local-disk ~{select_first([runtime_override.disk_gb, runtime_default.disk_gb])} HDD" - cpu: select_first([runtime_override.cpu_cores, runtime_default.cpu_cores]) - preemptible: select_first([runtime_override.preemptible_tries, runtime_default.preemptible_tries]) - maxRetries: select_first([runtime_override.max_retries, runtime_default.max_retries]) - docker: sv_pipeline_qc_docker - bootDiskSizeGb: select_first([runtime_override.boot_disk_gb, runtime_default.boot_disk_gb]) - } - - command <<< - set -eu -o pipefail - - # Plot VCF-wide distributions - /opt/sv-pipeline/scripts/vcf_qc/plot_sv_vcf_distribs.R \ - -N $( cat ~{samples_list} | sort | uniq | wc -l ) \ - -S /opt/sv-pipeline/scripts/vcf_qc/SV_colors.txt \ - ~{vcf_stats} \ - plotQC_vcfwide_output/ - - # Prep outputs - tar -czvf ~{prefix}.plotQC_vcfwide_output.tar.gz \ - plotQC_vcfwide_output - >>> - - output { - File plots_tarball = "~{prefix}.plotQC_vcfwide_output.tar.gz" - } -} - - -# Task to collect external benchmarking data -task VcfExternalBenchmark { - input { - File vcf_stats - Array[File] benchmarking_archives - String prefix - String comparator - String sv_pipeline_qc_docker - RuntimeAttr? runtime_attr_override - } - - # when filtering/sorting/etc, memory usage will likely go up (much of the data will have to - # be held in memory or disk while working, potentially in a form that takes up more space) - # NOTE: in this case, double input size because it will be compared to a data set stored in - # the docker. 
Other than having space for the at-rest compressed data, this is like - # processing another data set of comparable size to the input - Float input_size = 2 * size(vcf_stats, "GiB") - Float compression_factor = 5.0 - Float base_disk_gb = 5.0 - Float base_mem_gb = 2.0 - RuntimeAttr runtime_default = object { - mem_gb: base_mem_gb + compression_factor * input_size, - disk_gb: ceil(base_disk_gb + input_size * (2.0 + 2.0 * compression_factor)), - cpu_cores: 1, - preemptible_tries: 3, - max_retries: 1, - boot_disk_gb: 10 - } - RuntimeAttr runtime_override = select_first([runtime_attr_override, runtime_default]) - runtime { - memory: "~{select_first([runtime_override.mem_gb, runtime_default.mem_gb])} GiB" - disks: "local-disk ~{select_first([runtime_override.disk_gb, runtime_default.disk_gb])} HDD" - cpu: select_first([runtime_override.cpu_cores, runtime_default.cpu_cores]) - preemptible: select_first([runtime_override.preemptible_tries, runtime_default.preemptible_tries]) - maxRetries: select_first([runtime_override.max_retries, runtime_default.max_retries]) - docker: sv_pipeline_qc_docker - bootDiskSizeGb: select_first([runtime_override.boot_disk_gb, runtime_default.boot_disk_gb]) - } - - command <<< - set -eu -o pipefail - mkdir benchmarks - cp ~{sep=" " benchmarking_archives} benchmarks/ - - # Run benchmarking script - /opt/sv-pipeline/scripts/vcf_qc/collectQC.external_benchmarking.sh \ - ~{vcf_stats} \ - /opt/sv-pipeline/scripts/vcf_qc/SV_colors.txt \ - ~{comparator} \ - benchmarks \ - collectQC_benchmarking_~{comparator}_output/ - - # Prep outputs - tar -czvf ~{prefix}.collectQC_benchmarking_~{comparator}_output.tar.gz \ - collectQC_benchmarking_~{comparator}_output - >>> - - output { - File benchmarking_results_tarball = "~{prefix}.collectQC_benchmarking_~{comparator}_output.tar.gz" - } -} - -# Plot external benchmarking results -task PlotQcExternalBenchmarking { - input { - File benchmarking_tarball - String prefix - String comparator - String sv_pipeline_qc_docker - RuntimeAttr? 
runtime_attr_override - } - - # when filtering/sorting/etc, memory usage will likely go up (much of the data will have to - # be held in memory or disk while working, potentially in a form that takes up more space) - Float input_size = size(benchmarking_tarball, "GiB") - Float compression_factor = 5.0 - Float base_disk_gb = 5.0 - # give extra base memory in case the plotting functions are very inefficient - Float base_mem_gb = 8.0 - RuntimeAttr runtime_default = object { - mem_gb: base_mem_gb + compression_factor * input_size, - disk_gb: ceil(base_disk_gb + input_size * (2.0 + 2.0 * compression_factor)), - cpu_cores: 1, - preemptible_tries: 3, - max_retries: 1, - boot_disk_gb: 10 - } - RuntimeAttr runtime_override = select_first([runtime_attr_override, runtime_default]) - runtime { - memory: "~{select_first([runtime_override.mem_gb, runtime_default.mem_gb])} GiB" - disks: "local-disk ~{select_first([runtime_override.disk_gb, runtime_default.disk_gb])} HDD" - cpu: select_first([runtime_override.cpu_cores, runtime_default.cpu_cores]) - preemptible: select_first([runtime_override.preemptible_tries, runtime_default.preemptible_tries]) - maxRetries: select_first([runtime_override.max_retries, runtime_default.max_retries]) - docker: sv_pipeline_qc_docker - bootDiskSizeGb: select_first([runtime_override.boot_disk_gb, runtime_default.boot_disk_gb]) - } - - command <<< - set -eu -o pipefail - - # Plot benchmarking stats - /opt/sv-pipeline/scripts/vcf_qc/plotQC.external_benchmarking.helper.sh \ - ~{benchmarking_tarball} \ - ~{comparator} - - # Prep outputs - tar -czvf ~{prefix}.collectQC_benchmarking_~{comparator}_output.wPlots.tar.gz \ - collectQC_benchmarking_~{comparator}_output - >>> - - output { - File tarball_wPlots = "~{prefix}.collectQC_benchmarking_~{comparator}_output.wPlots.tar.gz" - } -} - - -# Plot per-sample stats -task PlotQcPerSample { - input { - File vcf_stats - File samples_list - File per_sample_tarball - String prefix - String sv_pipeline_qc_docker - RuntimeAttr? 
runtime_attr_override - } - - # when filtering/sorting/etc, memory usage will likely go up (much of the data will have to - # be held in memory or disk while working, potentially in a form that takes up more space) - Float input_size = size([vcf_stats, samples_list], "GiB") - Float compression_factor = 5.0 - Float base_disk_gb = 5.0 - # give extra base memory in case the plotting functions are very inefficient - Float base_mem_gb = 3.75 - RuntimeAttr runtime_default = object { - mem_gb: base_mem_gb + compression_factor * input_size, - disk_gb: ceil(base_disk_gb + input_size * (2.0 + 2.0 * compression_factor)), - cpu_cores: 1, - preemptible_tries: 3, - max_retries: 1, - boot_disk_gb: 10 - } - RuntimeAttr runtime_override = select_first([runtime_attr_override, runtime_default]) - runtime { - memory: "~{select_first([runtime_override.mem_gb, runtime_default.mem_gb])} GiB" - disks: "local-disk ~{select_first([runtime_override.disk_gb, runtime_default.disk_gb])} HDD" - cpu: select_first([runtime_override.cpu_cores, runtime_default.cpu_cores]) - preemptible: select_first([runtime_override.preemptible_tries, runtime_default.preemptible_tries]) - maxRetries: select_first([runtime_override.max_retries, runtime_default.max_retries]) - docker: sv_pipeline_qc_docker - bootDiskSizeGb: select_first([runtime_override.boot_disk_gb, runtime_default.boot_disk_gb]) - } - - command <<< - set -eu -o pipefail - - # Make per-sample directory - mkdir ~{prefix}_perSample/ - - # Untar per-sample VID lists - mkdir tmp_untar/ - tar -xvzf ~{per_sample_tarball} \ - --directory tmp_untar/ - find tmp_untar/ -name "*.VIDs_genotypes.txt.gz" | while read FILE; do - mv $FILE ~{prefix}_perSample/ - done - - # Plot per-sample distributions - /opt/sv-pipeline/scripts/vcf_qc/plot_sv_perSample_distribs.R \ - -S /opt/sv-pipeline/scripts/vcf_qc/SV_colors.txt \ - ~{vcf_stats} \ - ~{samples_list} \ - ~{prefix}_perSample/ \ - ~{prefix}_perSample_plots/ - - # Prepare output - tar -czvf ~{prefix}.plotQC_perSample.tar.gz \ - ~{prefix}_perSample_plots - >>> - - output { - File perSample_plots_tarball = "~{prefix}.plotQC_perSample.tar.gz" - } -} - - -# Plot per-family stats -task PlotQcPerFamily { - input { - File vcf_stats - File samples_list - File ped_file - File per_sample_tarball - String prefix - String sv_pipeline_qc_docker - RuntimeAttr? 
runtime_attr_override - } - - # when filtering/sorting/etc, memory usage will likely go up (much of the data will have to - # be held in memory or disk while working, potentially in a form that takes up more space) - Float input_size = size([vcf_stats, samples_list, ped_file, per_sample_tarball], "GiB") - Float compression_factor = 5.0 - Float base_disk_gb = 5.0 - # give extra base memory in case the plotting functions are very inefficient - Float base_mem_gb = 3.75 - RuntimeAttr runtime_default = object { - mem_gb: base_mem_gb + compression_factor * input_size, - disk_gb: ceil(base_disk_gb + input_size * (2.0 + 2.0 * compression_factor)), - cpu_cores: 1, - preemptible_tries: 3, - max_retries: 1, - boot_disk_gb: 10 - } - RuntimeAttr runtime_override = select_first([runtime_attr_override, runtime_default]) - runtime { - memory: "~{select_first([runtime_override.mem_gb, runtime_default.mem_gb])} GiB" - disks: "local-disk ~{select_first([runtime_override.disk_gb, runtime_default.disk_gb])} HDD" - cpu: select_first([runtime_override.cpu_cores, runtime_default.cpu_cores]) - preemptible: select_first([runtime_override.preemptible_tries, runtime_default.preemptible_tries]) - maxRetries: select_first([runtime_override.max_retries, runtime_default.max_retries]) - docker: sv_pipeline_qc_docker - bootDiskSizeGb: select_first([runtime_override.boot_disk_gb, runtime_default.boot_disk_gb]) - } - - command <<< - set -eu -o pipefail - - # Clean fam file - /opt/sv-pipeline/scripts/vcf_qc/cleanFamFile.sh \ - ~{samples_list} \ - ~{ped_file} \ - cleaned.fam - rm ~{ped_file} ~{samples_list} - - # Only run if any families remain after cleaning - if [ $( grep -Ev "^#" cleaned.fam | wc -l ) -gt 0 ]; then - - # Make per-sample directory - mkdir ~{prefix}_perSample/ - - # Untar per-sample VID lists - mkdir tmp_untar/ - tar -xvzf ~{per_sample_tarball} \ - --directory tmp_untar/ - find tmp_untar/ -name "*.VIDs_genotypes.txt.gz" | while read FILE; do - mv $FILE ~{prefix}_perSample/ - done - - # Run family analysis - /opt/sv-pipeline/scripts/vcf_qc/analyze_fams.R \ - -S /opt/sv-pipeline/scripts/vcf_qc/SV_colors.txt \ - ~{vcf_stats} \ - cleaned.fam \ - ~{prefix}_perSample/ \ - ~{prefix}_perFamily_plots/ - - else - - mkdir ~{prefix}_perFamily_plots/ - - fi - - # Prepare output - tar -czvf ~{prefix}.plotQC_perFamily.tar.gz \ - ~{prefix}_perFamily_plots - >>> - - output { - File perFamily_plots_tarball = "~{prefix}.plotQC_perFamily.tar.gz" - File cleaned_fam_file = "cleaned.fam" - } -} - - -# Plot per-sample benchmarking -task PlotQcPerSampleBenchmarking { - input { - File per_sample_benchmarking_tarball - File samples_list - String comparison_set_name - String prefix - String sv_pipeline_qc_docker - RuntimeAttr? 
runtime_attr_override - } - - # when filtering/sorting/etc, memory usage will likely go up (much of the data will have to - # be held in memory or disk while working, potentially in a form that takes up more space) - Float input_size = size([per_sample_benchmarking_tarball, samples_list], "GiB") - Float compression_factor = 5.0 - Float base_disk_gb = 5.0 - # give extra base memory in case the plotting functions are very inefficient - Float base_mem_gb = 8.0 - RuntimeAttr runtime_default = object { - mem_gb: base_mem_gb + compression_factor * input_size, - disk_gb: ceil(base_disk_gb + input_size * (2.0 + 2.0 * compression_factor)), - cpu_cores: 1, - preemptible_tries: 3, - max_retries: 1, - boot_disk_gb: 10 - } - RuntimeAttr runtime_override = select_first([runtime_attr_override, runtime_default]) - runtime { - memory: "~{select_first([runtime_override.mem_gb, runtime_default.mem_gb])} GiB" - disks: "local-disk ~{select_first([runtime_override.disk_gb, runtime_default.disk_gb])} HDD" - cpu: select_first([runtime_override.cpu_cores, runtime_default.cpu_cores]) - preemptible: select_first([runtime_override.preemptible_tries, runtime_default.preemptible_tries]) - maxRetries: select_first([runtime_override.max_retries, runtime_default.max_retries]) - docker: sv_pipeline_qc_docker - bootDiskSizeGb: select_first([runtime_override.boot_disk_gb, runtime_default.boot_disk_gb]) - } - - command <<< - set -eu -o pipefail - - # Untar per-sample benchmarking results - mkdir tmp_untar/ - tar -xvzf ~{per_sample_benchmarking_tarball} \ - --directory tmp_untar/ - mkdir results/ - find tmp_untar/ -name "*.sensitivity.bed.gz" | while read FILE; do - mv $FILE results/ - done - find tmp_untar/ -name "*.specificity.bed.gz" | while read FILE; do - mv $FILE results/ - done - - # Plot per-sample benchmarking - /opt/sv-pipeline/scripts/vcf_qc/plot_perSample_benchmarking.R \ - -c ~{comparison_set_name} \ - results/ \ - ~{samples_list} \ - /opt/sv-pipeline/scripts/vcf_qc/SV_colors.txt \ - ~{prefix}.~{comparison_set_name}_perSample_benchmarking_plots/ - - # Prepare output - tar -czvf ~{prefix}.~{comparison_set_name}_perSample_benchmarking_plots.tar.gz \ - ~{prefix}.~{comparison_set_name}_perSample_benchmarking_plots - >>> - - output { - File perSample_plots_tarball = "~{prefix}.~{comparison_set_name}_perSample_benchmarking_plots.tar.gz" - } -} - - -# Sanitize final output -task SanitizeOutputs { - input { - String prefix - File samples_list - File vcf_stats - File vcf_stats_idx - File plot_qc_vcfwide_tarball - File? plot_qc_external_benchmarking_thousand_g_tarball - File? plot_qc_external_benchmarking_asc_tarball - File? plot_qc_external_benchmarking_hgsv_tarball - File collect_qc_per_sample_tarball - File plot_qc_per_sample_tarball - File plot_qc_per_family_tarball - File cleaned_fam_file - File? plot_qc_per_sample_sanders_tarball - File? plot_qc_per_sample_collins_tarball - File? plot_qc_per_sample_werling_tarball - String sv_base_mini_docker - RuntimeAttr? 
runtime_attr_override - } - - # simple compress + tar workf - Float input_size = size( - [ vcf_stats, samples_list, vcf_stats, vcf_stats_idx, plot_qc_vcfwide_tarball, - plot_qc_external_benchmarking_thousand_g_tarball, plot_qc_external_benchmarking_asc_tarball, - plot_qc_external_benchmarking_hgsv_tarball, collect_qc_per_sample_tarball, - plot_qc_per_sample_tarball, plot_qc_per_family_tarball, cleaned_fam_file ], - "GiB" - ) + size( - [ plot_qc_per_sample_sanders_tarball, plot_qc_per_sample_collins_tarball, - plot_qc_per_sample_werling_tarball ], - "GiB" - ) - Float compression_factor = 5.0 - Float base_disk_gb = 5.0 - Float base_mem_gb = 2.0 - RuntimeAttr runtime_default = object { - mem_gb: base_mem_gb, - disk_gb: ceil(base_disk_gb + input_size * (2.0 + compression_factor)), - cpu_cores: 1, - preemptible_tries: 3, - max_retries: 1, - boot_disk_gb: 10 - } - RuntimeAttr runtime_override = select_first([runtime_attr_override, runtime_default]) - runtime { - memory: "~{select_first([runtime_override.mem_gb, runtime_default.mem_gb])} GiB" - disks: "local-disk ~{select_first([runtime_override.disk_gb, runtime_default.disk_gb])} HDD" - cpu: select_first([runtime_override.cpu_cores, runtime_default.cpu_cores]) - preemptible: select_first([runtime_override.preemptible_tries, runtime_default.preemptible_tries]) - maxRetries: select_first([runtime_override.max_retries, runtime_default.max_retries]) - docker: sv_base_mini_docker - bootDiskSizeGb: select_first([runtime_override.boot_disk_gb, runtime_default.boot_disk_gb]) - } - - command <<< - set -eu -o pipefail - - # Prep output directory tree - mkdir ~{prefix}_SV_VCF_QC_output/ - mkdir ~{prefix}_SV_VCF_QC_output/data/ - mkdir ~{prefix}_SV_VCF_QC_output/data/variant_info_per_sample/ - mkdir ~{prefix}_SV_VCF_QC_output/plots/ - mkdir ~{prefix}_SV_VCF_QC_output/plots/main_plots/ - mkdir ~{prefix}_SV_VCF_QC_output/plots/supplementary_plots/ - mkdir ~{prefix}_SV_VCF_QC_output/plots/supplementary_plots/vcf_summary_plots/ - mkdir ~{prefix}_SV_VCF_QC_output/plots/supplementary_plots/1000G_Sudmant_benchmarking_plots/ - mkdir ~{prefix}_SV_VCF_QC_output/plots/supplementary_plots/ASC_Werling_benchmarking_plots/ - mkdir ~{prefix}_SV_VCF_QC_output/plots/supplementary_plots/HGSV_Chaisson_benchmarking_plots/ - mkdir ~{prefix}_SV_VCF_QC_output/plots/supplementary_plots/per_sample_plots/ - mkdir ~{prefix}_SV_VCF_QC_output/plots/supplementary_plots/sv_inheritance_plots/ - mkdir ~{prefix}_SV_VCF_QC_output/plots/supplementary_plots/Sanders_2015_array_perSample_benchmarking_plots/ - mkdir ~{prefix}_SV_VCF_QC_output/plots/supplementary_plots/Collins_2017_liWGS_perSample_benchmarking_plots/ - mkdir ~{prefix}_SV_VCF_QC_output/plots/supplementary_plots/Werling_2018_WGS_perSample_benchmarking_plots/ - - # Process VCF-wide stats - cp ~{vcf_stats} \ - ~{prefix}_SV_VCF_QC_output/data/~{prefix}.VCF_sites.stats.bed.gz - cp ~{vcf_stats_idx} \ - ~{prefix}_SV_VCF_QC_output/data/~{prefix}.VCF_sites.stats.bed.gz.tbi - - # Process VCF-wide plots - tar -xzvf ~{plot_qc_vcfwide_tarball} - cp plotQC_vcfwide_output/main_plots/* \ - ~{prefix}_SV_VCF_QC_output/plots/main_plots/ - cp plotQC_vcfwide_output/supporting_plots/vcf_summary_plots/* \ - ~{prefix}_SV_VCF_QC_output/plots/supplementary_plots/vcf_summary_plots/ - - if ~{defined(plot_qc_external_benchmarking_thousand_g_tarball)}; then - # Process 1000G benchmarking stats & plots - tar -xzvf ~{plot_qc_external_benchmarking_thousand_g_tarball} - cp collectQC_benchmarking_1000G_Sudmant_output/data/1000G_Sudmant.SV.ALL.overlaps.bed.gz* \ - 
~{prefix}_SV_VCF_QC_output/data/ - cp collectQC_benchmarking_1000G_Sudmant_output/plots/1000G_Sudmant_ALL_samples/main_plots/VCF_QC.1000G_Sudmant_ALL.callset_benchmarking.png \ - ~{prefix}_SV_VCF_QC_output/plots/main_plots/ - cp -r collectQC_benchmarking_1000G_Sudmant_output/plots/* \ - ~{prefix}_SV_VCF_QC_output/plots/supplementary_plots/1000G_Sudmant_benchmarking_plots/ - fi - - if ~{defined(plot_qc_external_benchmarking_asc_tarball)}; then - # Process ASC benchmarking stats & plots - tar -xzvf ~{plot_qc_external_benchmarking_asc_tarball} - cp collectQC_benchmarking_ASC_Werling_output/data/ASC_Werling.SV.ALL.overlaps.bed.gz* \ - ~{prefix}_SV_VCF_QC_output/data/ - cp collectQC_benchmarking_ASC_Werling_output/plots/ASC_Werling_ALL_samples/main_plots/VCF_QC.ASC_Werling_ALL.callset_benchmarking.png \ - ~{prefix}_SV_VCF_QC_output/plots/main_plots/ - cp -r collectQC_benchmarking_ASC_Werling_output/plots/* \ - ~{prefix}_SV_VCF_QC_output/plots/supplementary_plots/ASC_Werling_benchmarking_plots/ - fi - - if ~{defined(plot_qc_external_benchmarking_hgsv_tarball)}; then - # Process HGSV benchmarking stats & plots - tar -xzvf ~{plot_qc_external_benchmarking_hgsv_tarball} - cp collectQC_benchmarking_HGSV_Chaisson_output/data/HGSV_Chaisson.SV.ALL.overlaps.bed.gz* \ - ~{prefix}_SV_VCF_QC_output/data/ - cp collectQC_benchmarking_HGSV_Chaisson_output/plots/HGSV_Chaisson_ALL_samples/main_plots/VCF_QC.HGSV_Chaisson_ALL.callset_benchmarking.png \ - ~{prefix}_SV_VCF_QC_output/plots/main_plots/ - cp -r collectQC_benchmarking_HGSV_Chaisson_output/plots/* \ - ~{prefix}_SV_VCF_QC_output/plots/supplementary_plots/HGSV_Chaisson_benchmarking_plots/ - fi - - # Process per-sample stats - tar -xzvf ~{collect_qc_per_sample_tarball} - cp ~{prefix}_perSample_VIDs_merged/*.VIDs_genotypes.txt.gz \ - ~{prefix}_SV_VCF_QC_output/data/variant_info_per_sample/ - - # Process per-sample plots - tar -xzvf ~{plot_qc_per_sample_tarball} - cp ~{prefix}_perSample_plots/main_plots/* \ - ~{prefix}_SV_VCF_QC_output/plots/main_plots/ - cp ~{prefix}_perSample_plots/supporting_plots/per_sample_plots/* \ - ~{prefix}_SV_VCF_QC_output/plots/supplementary_plots/per_sample_plots/ - - # Process per-family plots - tar -xzvf ~{plot_qc_per_family_tarball} - cp ~{prefix}_perFamily_plots/main_plots/* \ - ~{prefix}_SV_VCF_QC_output/plots/main_plots/ || true - cp ~{prefix}_perFamily_plots/supporting_plots/sv_inheritance_plots/* \ - ~{prefix}_SV_VCF_QC_output/plots/supplementary_plots/sv_inheritance_plots/ || true - - if ~{defined(plot_qc_per_sample_sanders_tarball)}; then - # Process Sanders per-sample benchmarking plots - tar -xzvf ~{plot_qc_per_sample_sanders_tarball} - cp ~{prefix}.Sanders_2015_array_perSample_benchmarking_plots/main_plots/* \ - ~{prefix}_SV_VCF_QC_output/plots/main_plots/ || true - cp ~{prefix}.Sanders_2015_array_perSample_benchmarking_plots/supporting_plots/per_sample_benchmarking_Sanders_2015_array/* \ - ~{prefix}_SV_VCF_QC_output/plots/supplementary_plots/Sanders_2015_array_perSample_benchmarking_plots/ || true - fi - - if ~{defined(plot_qc_per_sample_collins_tarball)}; then - # Process Collins per-sample benchmarking plots - tar -xzvf ~{plot_qc_per_sample_collins_tarball} - cp ~{prefix}.Collins_2017_liWGS_perSample_benchmarking_plots/main_plots/* \ - ~{prefix}_SV_VCF_QC_output/plots/main_plots/ || true - cp ~{prefix}.Collins_2017_liWGS_perSample_benchmarking_plots/supporting_plots/per_sample_benchmarking_Collins_2017_liWGS/* \ - ~{prefix}_SV_VCF_QC_output/plots/supplementary_plots/Collins_2017_liWGS_perSample_benchmarking_plots/ 
|| true - fi - - if ~{defined(plot_qc_per_sample_werling_tarball)}; then - # Process Werling per-sample benchmarking plots - tar -xzvf ~{plot_qc_per_sample_werling_tarball} - cp ~{prefix}.Werling_2018_WGS_perSample_benchmarking_plots/main_plots/* \ - ~{prefix}_SV_VCF_QC_output/plots/main_plots/ || true - cp ~{prefix}.Werling_2018_WGS_perSample_benchmarking_plots/supporting_plots/per_sample_benchmarking_Werling_2018_WGS/* \ - ~{prefix}_SV_VCF_QC_output/plots/supplementary_plots/Werling_2018_WGS_perSample_benchmarking_plots/ || true - fi - - # Process misc files - cp ~{cleaned_fam_file} \ - ~{prefix}_SV_VCF_QC_output/data/~{prefix}.cleaned_trios.fam - cp ~{samples_list} \ - ~{prefix}_SV_VCF_QC_output/data/~{prefix}.samples_analyzed.list - - # Compress final output - tar -czvf ~{prefix}_SV_VCF_QC_output.tar.gz \ - ~{prefix}_SV_VCF_QC_output - >>> - - output { - File vcf_qc_tarball = "~{prefix}_SV_VCF_QC_output.tar.gz" - } -} - diff --git a/wdl/PerSampleExternalBenchmark.wdl b/wdl/PerSampleExternalBenchmark.wdl index 7ebadcbf6..b434a3595 100644 --- a/wdl/PerSampleExternalBenchmark.wdl +++ b/wdl/PerSampleExternalBenchmark.wdl @@ -12,12 +12,14 @@ workflow PerSampleExternalBenchmark { File per_sample_tarball File comparison_tarball String prefix + Array[String] contigs String comparison_set_name Int samples_per_shard Int? random_seed String sv_base_mini_docker String sv_pipeline_docker + String sv_pipeline_qc_docker # overrides for local tasks RuntimeAttr? runtime_override_benchmark_samples @@ -47,23 +49,24 @@ workflow PerSampleExternalBenchmark { per_sample_tarball=per_sample_tarball, comparison_tarball=comparison_tarball, prefix=prefix, + contigs=contigs, comparison_set_name=comparison_set_name, - sv_pipeline_docker=sv_pipeline_docker, + sv_pipeline_qc_docker=sv_pipeline_qc_docker, runtime_attr_override=runtime_override_benchmark_samples } } - call MiniTasks.FilesToTarredFolder as MergeAndTarShardBenchmarks { + call MergeTarballs as MergeTarredResults { input: - in_files=flatten(BenchmarkSamples.benchmarking_results), - folder_name="~{prefix}_~{comparison_set_name}_results_merged", + in_tarballs=BenchmarkSamples.benchmarking_results, + folder_name=prefix + "_vs_" + comparison_set_name, sv_base_mini_docker=sv_base_mini_docker, runtime_attr_override=runtime_override_merge_and_tar_shard_benchmarks } # Return tarball of results output { - File benchmarking_results_tarball = MergeAndTarShardBenchmarks.tarball + File benchmarking_results_tarball = MergeTarredResults.tarball } } @@ -76,8 +79,9 @@ task BenchmarkSamples { File per_sample_tarball File comparison_tarball String prefix + Array[String] contigs String comparison_set_name - String sv_pipeline_docker + String sv_pipeline_qc_docker RuntimeAttr? 
runtime_attr_override } @@ -86,14 +90,14 @@ task BenchmarkSamples { # when filtering/sorting/etc, memory usage will likely go up (much of the data will have to # be held in memory or disk while working, potentially in a form that takes up more space) Float input_size = size([vcf_stats, samples_list, per_sample_tarball, comparison_tarball], "GiB") - Float compression_factor = 5.0 + Float compression_factor = 1.5 Float base_disk_gb = 5.0 - Float base_mem_gb = 2.0 + Float base_mem_gb = 3.0 RuntimeAttr runtime_default = object { mem_gb: base_mem_gb + compression_factor * input_size, disk_gb: ceil(base_disk_gb + input_size * (2.0 + 2.0 * compression_factor)), cpu_cores: 1, - preemptible_tries: 3, + preemptible_tries: 1, max_retries: 1, boot_disk_gb: 10 } @@ -104,7 +108,7 @@ task BenchmarkSamples { cpu: select_first([runtime_override.cpu_cores, runtime_default.cpu_cores]) preemptible: select_first([runtime_override.preemptible_tries, runtime_default.preemptible_tries]) maxRetries: select_first([runtime_override.max_retries, runtime_default.max_retries]) - docker: sv_pipeline_docker + docker: sv_pipeline_qc_docker bootDiskSizeGb: select_first([runtime_override.boot_disk_gb, runtime_default.boot_disk_gb]) } @@ -117,16 +121,70 @@ task BenchmarkSamples { -p ~{comparison_set_name} \ ~{vcf_stats} \ ~{samples_list} \ + ~{write_lines(contigs)} \ ~{per_sample_tarball} \ ~{comparison_tarball} \ ~{output_folder}/ + + # Tar benchmarking results for easier caching of downstream steps + tar -czvf ~{output_folder}.tar.gz ~{output_folder} >>> output { - Array[File] benchmarking_results = flatten([ - glob("~{output_folder}/*.sensitivity.bed.gz"), - glob("~{output_folder}/*.specificity.bed.gz") - ]) + File benchmarking_results = "~{output_folder}.tar.gz" } } + +# Task to merge benchmarking results across shards +task MergeTarballs { + input { + Array[File] in_tarballs + String? folder_name + String? tarball_prefix + String sv_base_mini_docker + RuntimeAttr? 
runtime_attr_override + } + + String tar_folder_name = select_first([folder_name, "merged"]) + String outfile_name = select_first([tarball_prefix, tar_folder_name]) + ".tar.gz" + + # Since the input files are often/always compressed themselves, assume compression factor for tarring is 1.0 + Float input_size = size(in_tarballs, "GB") + Float base_disk_gb = 10.0 + Float base_mem_gb = 2.0 + RuntimeAttr runtime_default = object { + mem_gb: base_mem_gb, + disk_gb: ceil(base_disk_gb + input_size * 2.0), + cpu_cores: 1, + preemptible_tries: 1, + max_retries: 1, + boot_disk_gb: 10 + } + RuntimeAttr runtime_override = select_first([runtime_attr_override, runtime_default]) + runtime { + memory: "~{select_first([runtime_override.mem_gb, runtime_default.mem_gb])} GB" + disks: "local-disk ~{select_first([runtime_override.disk_gb, runtime_default.disk_gb])} HDD" + cpu: select_first([runtime_override.cpu_cores, runtime_default.cpu_cores]) + preemptible: select_first([runtime_override.preemptible_tries, runtime_default.preemptible_tries]) + maxRetries: select_first([runtime_override.max_retries, runtime_default.max_retries]) + docker: sv_base_mini_docker + bootDiskSizeGb: select_first([runtime_override.boot_disk_gb, runtime_default.boot_disk_gb]) + } + + command <<< + # Create final output directory + mkdir "~{tar_folder_name}" + + while read tarball_path; do + tar -xzvf "$tarball_path" --directory ~{tar_folder_name}/ + done < ~{write_lines(in_tarballs)} + + # Compress final output directory + tar -czvf "~{outfile_name}" "~{tar_folder_name}" + >>> + + output { + File tarball = outfile_name + } +} diff --git a/wdl/ReClusterCleanVcfAcrossGenomicContext.wdl b/wdl/ReClusterCleanVcfAcrossGenomicContext.wdl new file mode 100755 index 000000000..194596456 --- /dev/null +++ b/wdl/ReClusterCleanVcfAcrossGenomicContext.wdl @@ -0,0 +1,131 @@ +version 1.0 + +# Author: Xuefang Zhao + +import "Structs.wdl" +import "AnnotateGenomicContext.wdl" as AnnotateGenomicContext +import "ReClusterCleanVcfSVsUnit.wdl" as ReClusterCleanVcfSVsUnit +import "TasksBenchmark.wdl" as mini_tasks + +# Workflow to annotate vcf file with genomic context +workflow ReClusterCleanVcfAcrossGenomicContext { + input { + File vcf + File vcf_index + File Repeat_Masks + File Simple_Repeats + File Segmental_Duplicates + File? annotated_SV_origin + + String prefix + String vid_prefix + String svtype + Array[String] Genomic_Context_list + Array[Array[Int]] size_list + + Int num_vids + Int num_samples + Int dist + Int svsize + Float frac + Float sample_overlap + + String sv_base_mini_docker + String sv_benchmark_docker + String sv_pipeline_docker + + # overrides for MiniTasks + RuntimeAttr? runtime_override_vcf_to_bed + RuntimeAttr? runtime_attr_override_anno_gc + RuntimeAttr? runtime_attr_override_inte_gc + RuntimeAttr? runtime_override_extract_SV_sites + RuntimeAttr? runtime_attr_override_svtk_vcfcluster + RuntimeAttr? runtime_attr_override_extract_svid + RuntimeAttr? runtime_attr_override_integrate_vcfs + RuntimeAttr? runtime_attr_override_concat_vcfs + RuntimeAttr? runtime_attr_override_sort_reclustered_vcf + RuntimeAttr? 
runtime_attr_override_concat_clustered_SVID
+
+  }
+
+  if(!defined(annotated_SV_origin)){
+    call AnnotateGenomicContext.AnnotateSVsWithGenomicContext as AnnotateSVsWithGenomicContext {
+      input:
+        vcf = vcf,
+        vcf_index = vcf_index,
+        Repeat_Masks = Repeat_Masks,
+        Simple_Repeats = Simple_Repeats,
+        Segmental_Duplicates = Segmental_Duplicates,
+
+        sv_base_mini_docker = sv_base_mini_docker,
+        sv_benchmark_docker = sv_benchmark_docker,
+        sv_pipeline_docker = sv_pipeline_docker,
+
+        runtime_override_extract_SV_sites = runtime_override_extract_SV_sites,
+        runtime_override_vcf_to_bed = runtime_override_vcf_to_bed,
+        runtime_attr_override_anno_gc = runtime_attr_override_anno_gc,
+        runtime_attr_override_inte_gc = runtime_attr_override_inte_gc
+    }
+  }
+
+  File annotated_SVs = select_first([annotated_SV_origin, AnnotateSVsWithGenomicContext.annotated_SVs])
+
+  scatter (i in range(length(Genomic_Context_list))) {
+    call ReClusterCleanVcfSVsUnit.ReClusterCleanVcfSVs as ReClusterCleanVcfSVs {
+      input:
+        vcf = vcf,
+        vcf_index = vcf_index,
+        annotated_SV_origin = annotated_SVs,
+
+        Repeat_Masks = Repeat_Masks,
+        Simple_Repeats = Simple_Repeats,
+        Segmental_Duplicates = Segmental_Duplicates,
+
+        prefix = "~{prefix}_~{Genomic_Context_list[i]}",
+        vid_prefix = "~{vid_prefix}_~{Genomic_Context_list[i]}",
+        svtype = svtype,
+        Genomic_Context = Genomic_Context_list[i],
+        min_size = size_list[i][0],
+        max_size = size_list[i][1],
+
+        num_vids = num_vids,
+        num_samples = num_samples,
+        dist = dist,
+        svsize = svsize,
+        frac = frac,
+        sample_overlap = sample_overlap,
+
+        sv_base_mini_docker = sv_base_mini_docker,
+        sv_benchmark_docker = sv_benchmark_docker,
+        sv_pipeline_docker = sv_pipeline_docker,
+
+        # each runtime override is passed through exactly once
+        runtime_override_vcf_to_bed = runtime_override_vcf_to_bed,
+        runtime_attr_override_anno_gc = runtime_attr_override_anno_gc,
+        runtime_attr_override_inte_gc = runtime_attr_override_inte_gc,
+        runtime_override_extract_SV_sites = runtime_override_extract_SV_sites,
+        runtime_attr_override_svtk_vcfcluster = runtime_attr_override_svtk_vcfcluster,
+        runtime_attr_override_extract_svid = runtime_attr_override_extract_svid,
+        runtime_attr_override_integrate_vcfs = runtime_attr_override_integrate_vcfs,
+        runtime_attr_override_sort_reclustered_vcf = runtime_attr_override_sort_reclustered_vcf
+    }
+  }
+
+  call mini_tasks.ConcatVcfs as ConcatVcfs {
+    input:
+      vcfs = ReClusterCleanVcfSVs.reclustered_SV,
+      vcfs_idx = ReClusterCleanVcfSVs.reclustered_SV_idx,
+      outfile_prefix = prefix,
+      sv_base_mini_docker = sv_base_mini_docker,
+      runtime_attr_override = runtime_attr_override_concat_vcfs
+  }
+
+  output {
+    File reclustered_SV_genomic_context = ConcatVcfs.concat_vcf
+    File reclustered_SV_genomic_context_idx = ConcatVcfs.concat_vcf_idx
+  }
+}
+
+
diff --git a/wdl/ReClusterCleanVcfAcrossSVTYPE.wdl b/wdl/ReClusterCleanVcfAcrossSVTYPE.wdl
new file mode 100755
index 000000000..f9429d950
--- /dev/null
+++ b/wdl/ReClusterCleanVcfAcrossSVTYPE.wdl
@@ -0,0 +1,153 @@
+version 1.0
+
+# Author: Xuefang Zhao
+
+import "Structs.wdl"
+import "AnnotateGenomicContext.wdl" as AnnotateGenomicContext
+import "ReClusterCleanVcfAcrossGenomicContext.wdl" as ReClusterCleanVcfAcrossGenomicContext
+import "TasksBenchmark.wdl" as mini_tasks
+# Workflow to re-cluster SVs from the cleaned VCF, scattered across SV types
+workflow ReClusterCleanVcfAcrossSVTYPE {
+  input {
+    File vcf
+    File vcf_index
+    File Repeat_Masks
+    File Simple_Repeats
+    File 
Segmental_Duplicates + File? annotated_SV_origin + + String prefix + String vid_prefix + Array[String] svtype_list + Array[Array[String]] Genomic_Context_list_list + Array[Array[Array[Int]]] size_list_list + + Int num_vids + Int num_samples + Int dist + Int svsize + Float frac + Float sample_overlap + + String sv_base_mini_docker + String sv_benchmark_docker + String sv_pipeline_docker + + # overrides for MiniTasks + RuntimeAttr? runtime_override_vcf_to_bed + RuntimeAttr? runtime_attr_override_anno_gc + RuntimeAttr? runtime_attr_override_inte_gc + RuntimeAttr? runtime_override_extract_SV_sites + RuntimeAttr? runtime_attr_override_svtk_vcfcluster + RuntimeAttr? runtime_attr_override_extract_svid + RuntimeAttr? runtime_attr_override_integrate_vcfs + RuntimeAttr? runtime_attr_override_concat_vcfs + RuntimeAttr? runtime_attr_override_sort_reclustered_vcf + RuntimeAttr? runtime_attr_override_concat_clustered_SVID + } + + if(!defined(annotated_SV_origin)){ + call AnnotateGenomicContext.AnnotateSVsWithGenomicContext as AnnotateSVsWithGenomicContext { + input: + vcf = vcf, + vcf_index = vcf_index, + Repeat_Masks = Repeat_Masks, + Simple_Repeats = Simple_Repeats, + Segmental_Duplicates = Segmental_Duplicates, + + sv_base_mini_docker =sv_base_mini_docker, + sv_benchmark_docker = sv_benchmark_docker, + sv_pipeline_docker = sv_pipeline_docker, + + runtime_override_extract_SV_sites = runtime_override_extract_SV_sites, + runtime_override_vcf_to_bed = runtime_override_vcf_to_bed, + runtime_attr_override_anno_gc = runtime_attr_override_anno_gc, + runtime_attr_override_inte_gc = runtime_attr_override_inte_gc + } + } + + File annotated_SVs = select_first([annotated_SV_origin, AnnotateSVsWithGenomicContext.annotated_SVs]) + + scatter (i in range(length(svtype_list))) { + call ReClusterCleanVcfAcrossGenomicContext.ReClusterCleanVcfAcrossGenomicContext as ReClusterCleanVcfAcrossGenomicContext{ + input: + vcf = vcf, + vcf_index = vcf_index, + annotated_SV_origin = annotated_SVs, + + Repeat_Masks = Repeat_Masks, + Simple_Repeats = Simple_Repeats, + Segmental_Duplicates = Segmental_Duplicates, + + prefix = "~{prefix}_~{svtype_list[i]}", + vid_prefix = "~{vid_prefix}_~{svtype_list[i]}", + svtype = svtype_list[i], + Genomic_Context_list = Genomic_Context_list_list[i], + size_list = size_list_list[i], + + num_vids = num_vids, + num_samples = num_samples, + dist = dist, + svsize = svsize, + frac = frac, + sample_overlap = sample_overlap, + + sv_base_mini_docker =sv_base_mini_docker, + sv_benchmark_docker = sv_benchmark_docker, + sv_pipeline_docker = sv_pipeline_docker, + + runtime_attr_override_extract_svid = runtime_attr_override_extract_svid, + runtime_override_vcf_to_bed = runtime_override_vcf_to_bed, + + runtime_override_vcf_to_bed = runtime_override_vcf_to_bed, + runtime_attr_override_anno_gc = runtime_attr_override_anno_gc, + runtime_attr_override_inte_gc = runtime_attr_override_inte_gc, + runtime_override_extract_SV_sites = runtime_override_extract_SV_sites, + runtime_attr_override_svtk_vcfcluster = runtime_attr_override_svtk_vcfcluster, + runtime_attr_override_extract_svid = runtime_attr_override_extract_svid, + runtime_attr_override_integrate_vcfs = runtime_attr_override_integrate_vcfs, + runtime_attr_override_sort_reclustered_vcf= runtime_attr_override_sort_reclustered_vcf + + } + } + + + call mini_tasks.ConcatVcfs as ConcatVcfs{ + input: + vcfs = ReClusterCleanVcfAcrossGenomicContext.reclustered_SV_genomic_context, + vcfs_idx = ReClusterCleanVcfAcrossGenomicContext.reclustered_SV_genomic_context_idx, + 
outfile_prefix = prefix, + sv_base_mini_docker = sv_base_mini_docker, + runtime_attr_override = runtime_attr_override_concat_vcfs + } + + call mini_tasks.IntegrateReClusterdVcfs{ + input: + vcf_all = vcf, + vcf_all_idx = vcf_index, + vcf_recluster = ConcatVcfs.concat_vcf, + vcf_recluster_idx = ConcatVcfs.concat_vcf_idx, + sv_pipeline_docker = sv_pipeline_docker, + runtime_attr_override = runtime_attr_override_integrate_vcfs + } + + call mini_tasks.SortReClusterdVcfs{ + input: + vcf_1 = IntegrateReClusterdVcfs.reclustered_Part1, + vcf_2 = IntegrateReClusterdVcfs.reclustered_Part2, + vcf_1_idx = IntegrateReClusterdVcfs.reclustered_Part1_idx, + vcf_2_idx = IntegrateReClusterdVcfs.reclustered_Part2_idx, + sv_pipeline_docker = sv_pipeline_docker, + runtime_attr_override = runtime_attr_override_sort_reclustered_vcf + } + + + + output{ + File svid_annotation = IntegrateReClusterdVcfs.SVID_anno + File reclustered_SV_svtype = SortReClusterdVcfs.sorted_vcf + File reclustered_SV_svtype_idx = SortReClusterdVcfs.sorted_vcf_idx + } +} + + diff --git a/wdl/ReClusterCleanVcfSVsUnit.wdl b/wdl/ReClusterCleanVcfSVsUnit.wdl new file mode 100755 index 000000000..5b368ecdd --- /dev/null +++ b/wdl/ReClusterCleanVcfSVsUnit.wdl @@ -0,0 +1,235 @@ +version 1.0 + +# Author: Xuefang Zhao + +import "Structs.wdl" +import "AnnotateGenomicContext.wdl" as AnnotateGenomicContext +import "TasksBenchmark.wdl" as mini_tasks + +# Workflow to annotate vcf file with genomic context +workflow ReClusterCleanVcfSVs { + input { + File vcf + File vcf_index + File Repeat_Masks + File Simple_Repeats + File Segmental_Duplicates + File? annotated_SV_origin + + String prefix + String vid_prefix + String svtype + String Genomic_Context + Int min_size + Int max_size + + Int num_vids + Int num_samples + Int dist + Int svsize + Float frac + Float sample_overlap + + String sv_base_mini_docker + String sv_benchmark_docker + String sv_pipeline_docker + + # overrides for MiniTasks + RuntimeAttr? runtime_override_vcf_to_bed + RuntimeAttr? runtime_attr_override_anno_gc + RuntimeAttr? runtime_attr_override_inte_gc + RuntimeAttr? runtime_override_extract_SV_sites + RuntimeAttr? runtime_attr_override_svtk_vcfcluster + RuntimeAttr? runtime_attr_override_extract_svid + RuntimeAttr? runtime_attr_override_integrate_vcfs + RuntimeAttr? 
runtime_attr_override_sort_reclustered_vcf + } + + if(!defined(annotated_SV_origin)){ + call AnnotateGenomicContext.AnnotateSVsWithGenomicContext as AnnotateSVsWithGenomicContext { + input: + vcf = vcf, + vcf_index = vcf_index, + Repeat_Masks = Repeat_Masks, + Simple_Repeats = Simple_Repeats, + Segmental_Duplicates = Segmental_Duplicates, + + sv_base_mini_docker =sv_base_mini_docker, + sv_benchmark_docker = sv_benchmark_docker, + sv_pipeline_docker = sv_pipeline_docker, + + runtime_override_extract_SV_sites = runtime_override_extract_SV_sites, + runtime_override_vcf_to_bed = runtime_override_vcf_to_bed, + runtime_attr_override_anno_gc = runtime_attr_override_anno_gc, + runtime_attr_override_inte_gc = runtime_attr_override_inte_gc + } + } + + File annotated_SVs = select_first([annotated_SV_origin, AnnotateSVsWithGenomicContext.annotated_SVs]) + + call Extract_SVID_by_GC { + input: + vcf = vcf, + vcf_index = vcf_index, + SVID_GC = annotated_SVs, + Genomic_Context = Genomic_Context, + svtype = svtype, + min_size = min_size, + max_size = max_size, + + sv_benchmark_docker = sv_benchmark_docker, + runtime_attr_override = runtime_attr_override_extract_svid + } + + call SvtkVcfCluster { + input: + vcf = Extract_SVID_by_GC.out_vcf, + vcf_idx = Extract_SVID_by_GC.out_idx, + prefix = prefix, + vid_prefix = vid_prefix, + num_vids = num_vids, + num_samples = num_samples, + dist = dist, + frac = frac, + sample_overlap = sample_overlap, + svsize = svsize, + svtype = svtype, + sv_pipeline_docker = sv_pipeline_docker, + runtime_attr_override = runtime_attr_override_svtk_vcfcluster + } + + output{ + File reclustered_SV = SvtkVcfCluster.vcf_out + File reclustered_SV_idx = SvtkVcfCluster.vcf_out_idx + } +} + + + +task SvtkVcfCluster { + input { + File vcf + File vcf_idx + String prefix + String vid_prefix + Int num_vids + Int num_samples + Int dist + Float frac + Float sample_overlap + File? exclude_list + File? exclude_list_idx + Int svsize + String svtype + String sv_pipeline_docker + RuntimeAttr? 
runtime_attr_override + } + + Float default_mem_gb = 10 + (120.0 * (num_vids / 19000.0) * (num_samples / 140000.0)) + RuntimeAttr runtime_default = object { + mem_gb: default_mem_gb, + disk_gb: ceil(30.0 + size(vcf, "GiB") * 30.0), + cpu_cores: 1, + preemptible_tries: 1, + max_retries: 1, + boot_disk_gb: 10 + } + RuntimeAttr runtime_override = select_first([runtime_attr_override, runtime_default]) + runtime { + memory: "~{select_first([runtime_override.mem_gb, runtime_default.mem_gb])} GiB" + disks: "local-disk ~{select_first([runtime_override.disk_gb, runtime_default.disk_gb])} HDD" + cpu: select_first([runtime_override.cpu_cores, runtime_default.cpu_cores]) + preemptible: select_first([runtime_override.preemptible_tries, runtime_default.preemptible_tries]) + maxRetries: select_first([runtime_override.max_retries, runtime_default.max_retries]) + docker: sv_pipeline_docker + bootDiskSizeGb: select_first([runtime_override.boot_disk_gb, runtime_default.boot_disk_gb]) + } + + command <<< + set -euo pipefail + ~{if defined(exclude_list) && !defined(exclude_list_idx) then "tabix -p bed ~{exclude_list}" else ""} + + #Run clustering + svtk vcfcluster <(echo "~{vcf}") - \ + -d ~{dist} \ + -f ~{frac} \ + ~{if defined(exclude_list) then "-x ~{exclude_list}" else ""} \ + -z ~{svsize} \ + -p ~{vid_prefix} \ + -t ~{svtype} \ + -o ~{sample_overlap} \ + --preserve-ids \ + --preserve-genotypes \ + --preserve-header \ + | bcftools sort -O z -o ~{prefix}.vcf.gz - + tabix -p vcf ~{prefix}.vcf.gz + >>> + + output { + File vcf_out = "~{prefix}.vcf.gz" + File vcf_out_idx = "~{prefix}.vcf.gz.tbi" + } +} + + +task Extract_SVID_by_GC{ + input { + File? SVID_GC + File vcf + File vcf_index + String Genomic_Context + String svtype + Int min_size + Int max_size + String sv_benchmark_docker + RuntimeAttr? 
runtime_attr_override + } + + Float vcf_size = size(vcf, "GiB") + Int vm_disk_size = ceil(vcf_size * 2) + + RuntimeAttr runtime_default = object { + mem_gb: 1, + disk_gb: vm_disk_size, + cpu_cores: 1, + preemptible_tries: 1, + max_retries: 1, + boot_disk_gb: 10 + } + RuntimeAttr runtime_override = select_first([runtime_attr_override, runtime_default]) + + runtime { + memory: "~{select_first([runtime_override.mem_gb, runtime_default.mem_gb])} GiB" + disks: "local-disk ~{select_first([runtime_override.disk_gb, runtime_default.disk_gb])} HDD" + cpu: select_first([runtime_override.cpu_cores, runtime_default.cpu_cores]) + preemptible: select_first([runtime_override.preemptible_tries, runtime_default.preemptible_tries]) + maxRetries: select_first([runtime_override.max_retries, runtime_default.max_retries]) + docker: sv_benchmark_docker + bootDiskSizeGb: select_first([runtime_override.boot_disk_gb, runtime_default.boot_disk_gb]) + } + + String prefix = basename(vcf, ".vcf.gz") + + command <<< + set -euxo pipefail + + python /src/extract_SVs_by_svtype_genomiccontect.py \ + ~{vcf} \ + ~{prefix}.~{svtype}.~{Genomic_Context}.vcf.gz \ + ~{SVID_GC} \ + ~{svtype} \ + ~{Genomic_Context} \ + ~{min_size} \ + ~{max_size} + + tabix -p vcf ~{prefix}.~{svtype}.~{Genomic_Context}.vcf.gz + + >>> + + output { + File out_vcf = "~{prefix}.~{svtype}.~{Genomic_Context}.vcf.gz" + File out_idx = "~{prefix}.~{svtype}.~{Genomic_Context}.vcf.gz.tbi" + } +} + + diff --git a/wdl/ReClusterCleanVcfWholeGenome.wdl b/wdl/ReClusterCleanVcfWholeGenome.wdl new file mode 100755 index 000000000..2cd14af4d --- /dev/null +++ b/wdl/ReClusterCleanVcfWholeGenome.wdl @@ -0,0 +1,95 @@ +version 1.0 + +# Author: Xuefang Zhao + +import "Structs.wdl" +import "ReClusterCleanVcfAcrossSVTYPE.wdl" as ReClusterCleanVcfAcrossSVTYPE +# Workflow to annotate vcf file with genomic context + +workflow ReClusterCleanVcfWholeGenome { + input { + Array[File] vcf_list + Array[File] vcf_index_list + Array[String] chromosome_name_list + String prefix + String vid_prefix + + File Repeat_Masks + File Simple_Repeats + File Segmental_Duplicates + + Array[String] svtype_list + Array[Array[String]] Genomic_Context_list_list + Array[Array[Array[Int]]] size_list_list + + Int num_vids + Int num_samples + Int dist + Int svsize + Float frac + Float sample_overlap + + String sv_base_mini_docker + String sv_benchmark_docker + String sv_pipeline_docker + + # overrides for MiniTasks + RuntimeAttr? runtime_override_vcf_to_bed + RuntimeAttr? runtime_attr_override_anno_gc + RuntimeAttr? runtime_attr_override_inte_gc + RuntimeAttr? runtime_override_extract_SV_sites + RuntimeAttr? runtime_attr_override_svtk_vcfcluster + RuntimeAttr? runtime_attr_override_extract_svid + RuntimeAttr? runtime_attr_override_integrate_vcfs + RuntimeAttr? runtime_attr_override_concat_clustered_SVID + RuntimeAttr? runtime_attr_override_sort_reclustered_vcf + RuntimeAttr? 
runtime_attr_override_concat_vcfs + } + + scatter(i in range(length(vcf_list))){ + call ReClusterCleanVcfAcrossSVTYPE.ReClusterCleanVcfAcrossSVTYPE as ReClusterCleanVcfAcrossSVTYPE{ + input: + vcf = vcf_list[i], + vcf_index = vcf_index_list[i], + prefix = "~{prefix}_~{chromosome_name_list[i]}", + vid_prefix = "~{vid_prefix}_~{chromosome_name_list[i]}", + + Repeat_Masks = Repeat_Masks, + Simple_Repeats = Simple_Repeats, + Segmental_Duplicates = Segmental_Duplicates, + + svtype_list = svtype_list, + Genomic_Context_list_list = Genomic_Context_list_list, + size_list_list = size_list_list, + + num_vids = num_vids, + num_samples = num_samples, + dist = dist, + svsize = svsize, + frac = frac, + sample_overlap = sample_overlap, + sv_base_mini_docker = sv_base_mini_docker, + sv_benchmark_docker = sv_benchmark_docker, + sv_pipeline_docker = sv_pipeline_docker, + runtime_override_vcf_to_bed = runtime_override_vcf_to_bed, + runtime_attr_override_anno_gc = runtime_attr_override_anno_gc, + runtime_attr_override_inte_gc = runtime_attr_override_inte_gc, + runtime_override_extract_SV_sites = runtime_override_extract_SV_sites, + runtime_attr_override_svtk_vcfcluster = runtime_attr_override_svtk_vcfcluster, + runtime_attr_override_extract_svid = runtime_attr_override_extract_svid, + runtime_attr_override_integrate_vcfs = runtime_attr_override_integrate_vcfs, + runtime_attr_override_concat_vcfs = runtime_attr_override_concat_vcfs, + runtime_attr_override_sort_reclustered_vcf = runtime_attr_override_sort_reclustered_vcf, + runtime_attr_override_concat_clustered_SVID = runtime_attr_override_concat_clustered_SVID + } + } + + output{ + Array[File] svid_annotation = ReClusterCleanVcfAcrossSVTYPE.svid_annotation + Array[File] out_vcfs = ReClusterCleanVcfAcrossSVTYPE.reclustered_SV_svtype + Array[File] out_vcf_idxes = ReClusterCleanVcfAcrossSVTYPE.reclustered_SV_svtype_idx + } +} + + + diff --git a/wdl/ShardedCohortBenchmarking.wdl b/wdl/ShardedCohortBenchmarking.wdl new file mode 100644 index 000000000..25da087ad --- /dev/null +++ b/wdl/ShardedCohortBenchmarking.wdl @@ -0,0 +1,166 @@ +version 1.0 + +# Author: Ryan Collins + +# Workflow to scatter site-level benchmarking vs. an external dataset by chromosome + +import "Tasks0506.wdl" as MiniTasks + +workflow ShardedCohortBenchmarking { + input { + File vcf_stats + String prefix + Array[String] contigs + String benchmarking_bucket + String comparator + String sv_pipeline_qc_docker + String sv_base_mini_docker + RuntimeAttr? runtime_override_site_level_benchmark + RuntimeAttr? 
runtime_override_merge_site_level_benchmark + } + + # Collect site-level external benchmarking data per chromosome + scatter ( contig in contigs ) { + call VcfExternalBenchmarkSingleChrom as CollectSiteLevelBenchmarking { + input: + vcf_stats=vcf_stats, + prefix=prefix, + contig=contig, + benchmarking_bucket=benchmarking_bucket, + comparator=comparator, + sv_pipeline_qc_docker=sv_pipeline_qc_docker, + runtime_attr_override=runtime_override_site_level_benchmark + } + } + + # Merge results across chromosomes + call MergeContigBenchmarks as MergeBenchmarking { + input: + in_tarballs=CollectSiteLevelBenchmarking.benchmarking_results_tarball, + comparator=comparator, + sv_base_mini_docker=sv_base_mini_docker, + runtime_attr_override=runtime_override_merge_site_level_benchmark + } + + output { + File benchmarking_results_tarball = MergeBenchmarking.merged_results_tarball + } +} + + +# Task to collect external benchmarking data for a single chromosome +task VcfExternalBenchmarkSingleChrom { + input { + File vcf_stats + String benchmarking_bucket + String prefix + String contig + String comparator + String sv_pipeline_qc_docker + RuntimeAttr? runtime_attr_override + } + RuntimeAttr runtime_default = object { + mem_gb: 3.75, + disk_gb: 40, + cpu_cores: 1, + preemptible_tries: 1, + max_retries: 1, + boot_disk_gb: 10 + } + RuntimeAttr runtime_override = select_first([runtime_attr_override, runtime_default]) + runtime { + memory: "~{select_first([runtime_override.mem_gb, runtime_default.mem_gb])} GiB" + disks: "local-disk ~{select_first([runtime_override.disk_gb, runtime_default.disk_gb])} HDD" + cpu: select_first([runtime_override.cpu_cores, runtime_default.cpu_cores]) + preemptible: select_first([runtime_override.preemptible_tries, runtime_default.preemptible_tries]) + maxRetries: select_first([runtime_override.max_retries, runtime_default.max_retries]) + docker: sv_pipeline_qc_docker + bootDiskSizeGb: select_first([runtime_override.boot_disk_gb, runtime_default.boot_disk_gb]) + } + + command <<< + set -eu -o pipefail + + # Copy benchmarking BED files to local directory + mkdir benchmarks + gsutil -m cp ~{benchmarking_bucket}/*.bed.gz benchmarks/ + + # Run benchmarking script + echo ~{contig} > contigs.list + /opt/sv-pipeline/scripts/vcf_qc/collectQC.external_benchmarking.sh \ + ~{vcf_stats} \ + /opt/sv-pipeline/scripts/vcf_qc/SV_colors.txt \ + contigs.list \ + benchmarks \ + collectQC_benchmarking_~{comparator}_~{contig}_output/ + + # Prep outputs + tar -czvf ~{prefix}.collectQC_benchmarking_~{comparator}_~{contig}_output.tar.gz \ + collectQC_benchmarking_~{comparator}_~{contig}_output + >>> + + output { + File benchmarking_results_tarball = "~{prefix}.collectQC_benchmarking_~{comparator}_~{contig}_output.tar.gz" + } +} + + +# Task to merge external benchmarking data across chromosomes +task MergeContigBenchmarks { + input { + Array[File] in_tarballs + String comparator + String sv_base_mini_docker + RuntimeAttr? 
runtime_attr_override + } + RuntimeAttr runtime_default = object { + mem_gb: 3.75, + disk_gb: 40, + cpu_cores: 1, + preemptible_tries: 1, + max_retries: 1, + boot_disk_gb: 10 + } + RuntimeAttr runtime_override = select_first([runtime_attr_override, runtime_default]) + runtime { + memory: "~{select_first([runtime_override.mem_gb, runtime_default.mem_gb])} GiB" + disks: "local-disk ~{select_first([runtime_override.disk_gb, runtime_default.disk_gb])} HDD" + cpu: select_first([runtime_override.cpu_cores, runtime_default.cpu_cores]) + preemptible: select_first([runtime_override.preemptible_tries, runtime_default.preemptible_tries]) + maxRetries: select_first([runtime_override.max_retries, runtime_default.max_retries]) + docker: sv_base_mini_docker + bootDiskSizeGb: select_first([runtime_override.boot_disk_gb, runtime_default.boot_disk_gb]) + } + + command <<< + # Untar all shards + mkdir sharded_results + while read tarball_path; do + tar -xzvf "$tarball_path" --directory sharded_results/ + done < ~{write_lines(in_tarballs)} + + # Create final output directory + mkdir collectQC_benchmarking_~{comparator}_output + mkdir collectQC_benchmarking_~{comparator}_output/data + + # Merge each unique BED + find sharded_results/ -name "*.bed.gz" | xargs -I {} basename {} | sort -V | uniq > bed_filenames.list + while read fname; do + find sharded_results/ -name $fname > matching_beds.list + sed -n '1p' matching_beds.list | xargs -I {} zcat {} | sed -n '1p' > header.bed + cat matching_beds.list | xargs -I {} zcat {} | fgrep -v "#" \ + | sort -Vk1,1 -k2,2n -k3,3n | cat header.bed - | bgzip -c \ + > collectQC_benchmarking_~{comparator}_output/data/$fname + tabix -f collectQC_benchmarking_~{comparator}_output/data/$fname + done < bed_filenames.list + + # Compress final output directory + tar -czvf \ + collectQC_benchmarking_~{comparator}_output.tar.gz \ + collectQC_benchmarking_~{comparator}_output + >>> + + output { + File merged_results_tarball = "collectQC_benchmarking_~{comparator}_output.tar.gz" + } +} diff --git a/wdl/ShardedQcCollection.wdl b/wdl/ShardedQcCollection.wdl index ffd227bdc..a6e4c34a7 100644 --- a/wdl/ShardedQcCollection.wdl +++ b/wdl/ShardedQcCollection.wdl @@ -2,19 +2,18 @@ version 1.0 # Author: Ryan Collins -# Workflow to gather SV VCF summary stats for an input VCF +# Workflow to gather SV VCF summary stats for one or more input VCFs import "Tasks0506.wdl" as MiniTasks workflow ShardedQcCollection { input { - File vcf + Array[File] vcfs + Array[File] vcf_idxs String contig Int sv_per_shard String prefix - File? vcf_idx - String sv_base_mini_docker String sv_pipeline_docker @@ -28,20 +27,23 @@ workflow ShardedQcCollection { RuntimeAttr? 
runtime_override_merge_svtk_vcf_2_bed } - # Tabix to chromosome of interest, and shard input VCF for stats collection - call MiniTasks.SplitVcf as SplitVcfToQc { - input: - vcf=vcf, - vcf_idx=vcf_idx, - contig=contig, - min_vars_per_shard=sv_per_shard, - prefix="vcf.shard.", - sv_base_mini_docker=sv_base_mini_docker, - runtime_attr_override=runtime_override_split_vcf_to_qc + # Tabix each VCF to chromosome of interest, and shard input VCF for stats collection + scatter ( vcf_info in zip(vcfs, vcf_idxs) ) { + call MiniTasks.SplitVcf as SplitVcfToQc { + input: + vcf=vcf_info.left, + vcf_idx=vcf_info.right, + contig=contig, + min_vars_per_shard=sv_per_shard, + prefix="vcf.shard.", + sv_base_mini_docker=sv_base_mini_docker, + runtime_attr_override=runtime_override_split_vcf_to_qc + } } + Array[File] vcf_shards = flatten(SplitVcfToQc.vcf_shards) # Scatter over VCF shards - scatter (shard in SplitVcfToQc.vcf_shards) { + scatter (shard in vcf_shards) { # Collect VCF-wide summary stats call CollectShardedVcfStats { input: @@ -100,14 +102,14 @@ task CollectShardedVcfStats { # when filtering/sorting/etc, memory usage will likely go up (much of the data will have to # be held in memory or disk while working, potentially in a form that takes up more space) Float input_size = size(vcf, "GiB") - Float compression_factor = 5.0 + Float compression_factor = 2.0 Float base_disk_gb = 5.0 - Float base_mem_gb = 2.0 + Float base_mem_gb = 3.75 RuntimeAttr runtime_default = object { mem_gb: base_mem_gb + compression_factor * input_size, disk_gb: ceil(base_disk_gb + input_size * (2.0 + 2.0 * compression_factor)), cpu_cores: 1, - preemptible_tries: 3, + preemptible_tries: 1, max_retries: 1, boot_disk_gb: 10 } @@ -171,7 +173,7 @@ task SvtkVcf2bed { mem_gb: base_mem_gb, disk_gb: ceil(base_disk_gb + input_size * 2.0), cpu_cores: 1, - preemptible_tries: 3, + preemptible_tries: 1, max_retries: 1, boot_disk_gb: 10 } diff --git a/wdl/TasksBenchmark.wdl b/wdl/TasksBenchmark.wdl index 021ccd83d..21424f347 100644 --- a/wdl/TasksBenchmark.wdl +++ b/wdl/TasksBenchmark.wdl @@ -2,6 +2,249 @@ version 1.0 import "Structs.wdl" +# use zcat to concatenate compressed files +# -replaces "combine" task in some workflows +# -if filter_command is omitted, input files will be concatenated as +# usual +# -if filter_command is passed, it must be a valid bash command, +# accepting the resulting file via pipe on stdin, and outputing the +# desired file on stdout +task ZcatCompressedFiles { + input { + Array[File] shards + String? outfile_name + String? filter_command + String sv_base_mini_docker + RuntimeAttr? 
runtime_attr_override
+  }
+
+  String output_file_name = select_first([outfile_name, "output.txt.gz"])
+  Boolean do_filter = defined(filter_command) && filter_command != ""
+
+  # when filtering/sorting/etc, memory usage will likely go up (much of the data will have to
+  # be held in memory or disk while working, potentially in a form that takes up more space)
+  Float input_size = size(shards, "GB")
+  Float compression_factor = 5.0
+  Float base_disk_gb = 5.0
+  Float base_mem_gb = 2.0
+  RuntimeAttr runtime_default = object {
+    mem_gb: base_mem_gb + if do_filter then compression_factor * input_size else 0.0,
+    disk_gb: ceil(base_disk_gb + input_size * if do_filter then 2.0 + compression_factor else 2.0),
+    cpu_cores: 1,
+    preemptible_tries: 3,
+    max_retries: 1,
+    boot_disk_gb: 10
+  }
+  RuntimeAttr runtime_override = select_first([runtime_attr_override, runtime_default])
+  runtime {
+    memory: "~{select_first([runtime_override.mem_gb, runtime_default.mem_gb])} GB"
+    disks: "local-disk ~{select_first([runtime_override.disk_gb, runtime_default.disk_gb])} HDD"
+    cpu: select_first([runtime_override.cpu_cores, runtime_default.cpu_cores])
+    preemptible: select_first([runtime_override.preemptible_tries, runtime_default.preemptible_tries])
+    maxRetries: select_first([runtime_override.max_retries, runtime_default.max_retries])
+    docker: sv_base_mini_docker
+    bootDiskSizeGb: select_first([runtime_override.boot_disk_gb, runtime_default.boot_disk_gb])
+  }
+
+  command {
+    set -eu -o pipefail
+
+    while read SHARD; do
+      if [ -n "$SHARD" ]; then
+        zcat "$SHARD"
+      fi
+    done < ~{write_lines(shards)} \
+      ~{if do_filter then "| " + select_first([filter_command]) else ""} \
+      | bgzip -c \
+      > "~{output_file_name}"
+  }
+
+  output {
+    File outfile = output_file_name
+  }
+}
+
+# concatenate uncompressed files
+# -replaces "combine" task in some workflows
+# -if filter_command is omitted, input files will be concatenated as
+#  usual
+# -if filter_command is passed, it must be a valid bash command,
+#  accepting the resulting file via pipe on stdin, and outputting the
+#  desired file on stdout
+task CatUncompressedFiles {
+  input {
+    Array[File] shards
+    String? outfile_name
+    String? filter_command
+    String sv_base_mini_docker
+    RuntimeAttr?
runtime_attr_override + } + + String output_file_name = select_first([outfile_name, "output.txt"]) + Boolean do_filter = defined(filter_command) && filter_command != "" + + # when filtering/sorting/etc, memory usage will likely go up (much of the data will have to + # be held in memory or disk while working, potentially in a form that takes up more space) + Float input_size = size(shards, "GB") + Float base_mem_gb = 2.0 + Float base_disk_gb = 5.0 + RuntimeAttr runtime_default = object { + mem_gb: base_mem_gb + (if do_filter then input_size else 0.0), + disk_gb: ceil(base_disk_gb + input_size * (if do_filter then 3.0 else 2.0)), + cpu_cores: 1, + preemptible_tries: 3, + max_retries: 1, + boot_disk_gb: 10 + } + RuntimeAttr runtime_override = select_first([runtime_attr_override, runtime_default]) + runtime { + memory: "~{select_first([runtime_override.mem_gb, runtime_default.mem_gb])} GB" + disks: "local-disk ~{select_first([runtime_override.disk_gb, runtime_default.disk_gb])} HDD" + cpu: select_first([runtime_override.cpu_cores, runtime_default.cpu_cores]) + preemptible: select_first([runtime_override.preemptible_tries, runtime_default.preemptible_tries]) + maxRetries: select_first([runtime_override.max_retries, runtime_default.max_retries]) + docker: sv_base_mini_docker + bootDiskSizeGb: select_first([runtime_override.boot_disk_gb, runtime_default.boot_disk_gb]) + } + + command <<< + while read SHARD; do + if [ -n "$SHARD" ]; then + cat "$SHARD" + fi + done < ~{write_lines(shards)} \ + ~{if do_filter then "| " + select_first([filter_command]) else ""} \ + > ~{output_file_name} + >>> + + output { + File outfile=output_file_name + } + } + +# Combine multiple sorted VCFs +task ConcatVcfs { + input { + Array[File] vcfs + Array[File]? vcfs_idx + Boolean merge_sort = false + String? outfile_prefix + String sv_base_mini_docker + RuntimeAttr? 
runtime_attr_override + } + + String outfile_name = outfile_prefix + ".vcf.gz" + String merge_flag = if merge_sort then "--allow-overlaps" else "" + + # when filtering/sorting/etc, memory usage will likely go up (much of the data will have to + # be held in memory or disk while working, potentially in a form that takes up more space) + Float input_size = size(vcfs, "GB") + Float compression_factor = 5.0 + Float base_disk_gb = 5.0 + Float base_mem_gb = 2.0 + RuntimeAttr runtime_default = object { + mem_gb: base_mem_gb + compression_factor * input_size, + disk_gb: ceil(base_disk_gb + input_size * (2.0 + compression_factor)), + cpu_cores: 1, + preemptible_tries: 3, + max_retries: 1, + boot_disk_gb: 10 + } + RuntimeAttr runtime_override = select_first([runtime_attr_override, runtime_default]) + runtime { + memory: "~{select_first([runtime_override.mem_gb, runtime_default.mem_gb])} GB" + disks: "local-disk ~{select_first([runtime_override.disk_gb, runtime_default.disk_gb])} HDD" + cpu: select_first([runtime_override.cpu_cores, runtime_default.cpu_cores]) + preemptible: select_first([runtime_override.preemptible_tries, runtime_default.preemptible_tries]) + maxRetries: select_first([runtime_override.max_retries, runtime_default.max_retries]) + docker: sv_base_mini_docker + bootDiskSizeGb: select_first([runtime_override.boot_disk_gb, runtime_default.boot_disk_gb]) + } + + command <<< + set -euo pipefail + VCFS="~{write_lines(vcfs)}" + if ~{!defined(vcfs_idx)}; then + cat ${VCFS} | xargs -n1 tabix + fi + bcftools concat -a ~{merge_flag} --output-type z --file-list ${VCFS} --output ~{outfile_name} + tabix -p vcf ~{outfile_name} + >>> + + output { + File concat_vcf = "~{outfile_name}" + File concat_vcf_idx = "~{outfile_name}.tbi" + } + } + +# Merge shards after VCF stats collection +task ConcatBeds { + input { + Array[File] shard_bed_files + String prefix + Boolean? index_output + String sv_base_mini_docker + RuntimeAttr? 
runtime_attr_override + } + + Boolean call_tabix = select_first([index_output, true]) + String output_file="~{prefix}.bed.gz" + + # when filtering/sorting/etc, memory usage will likely go up (much of the data will have to + # be held in memory or disk while working, potentially in a form that takes up more space) + Float input_size = size(shard_bed_files, "GB") + Float compression_factor = 5.0 + Float base_disk_gb = 5.0 + Float base_mem_gb = 2.0 + RuntimeAttr runtime_default = object { + mem_gb: base_mem_gb + compression_factor * input_size, + disk_gb: ceil(base_disk_gb + input_size * (2.0 + compression_factor)), + cpu_cores: 1, + preemptible_tries: 3, + max_retries: 1, + boot_disk_gb: 10 + } + RuntimeAttr runtime_override = select_first([runtime_attr_override, runtime_default]) + runtime { + memory: "~{select_first([runtime_override.mem_gb, runtime_default.mem_gb])} GB" + disks: "local-disk ~{select_first([runtime_override.disk_gb, runtime_default.disk_gb])} HDD" + cpu: select_first([runtime_override.cpu_cores, runtime_default.cpu_cores]) + preemptible: select_first([runtime_override.preemptible_tries, runtime_default.preemptible_tries]) + maxRetries: select_first([runtime_override.max_retries, runtime_default.max_retries]) + docker: sv_base_mini_docker + bootDiskSizeGb: select_first([runtime_override.boot_disk_gb, runtime_default.boot_disk_gb]) + } + + command <<< + set -eu + + # note head -n1 stops reading early and sends SIGPIPE to zcat, + # so setting pipefail here would result in early termination + + # no more early stopping + set -o pipefail + + while read SPLIT; do + zcat $SPLIT + done < ~{write_lines(shard_bed_files)} \ + | (grep -Ev "^#" || printf "") \ + | sort -Vk1,1 -k2,2n -k3,3n \ + | bgzip -c \ + > ~{output_file} + + if ~{call_tabix}; then + tabix -f -p bed ~{output_file} + else + touch ~{output_file}.tbi + fi + >>> + + output { + File merged_bed_file = output_file + File merged_bed_idx = output_file + ".tbi" + } + } + # Merge shards after VaPoR task ConcatVaPoR { input { @@ -23,13 +266,13 @@ task ConcatVaPoR { Float base_disk_gb = 5.0 Float base_mem_gb = 2.0 RuntimeAttr runtime_default = object { - mem_gb: base_mem_gb + compression_factor * input_size, - disk_gb: ceil(base_disk_gb + input_size * (2.0 + compression_factor)), - cpu_cores: 1, - preemptible_tries: 3, - max_retries: 1, - boot_disk_gb: 10 - } + mem_gb: base_mem_gb + compression_factor * input_size, + disk_gb: ceil(base_disk_gb + input_size * (2.0 + compression_factor)), + cpu_cores: 1, + preemptible_tries: 3, + max_retries: 1, + boot_disk_gb: 10 + } RuntimeAttr runtime_override = select_first([runtime_attr_override, runtime_default]) runtime { memory: "~{select_first([runtime_override.mem_gb, runtime_default.mem_gb])} GB" @@ -51,8 +294,8 @@ task ConcatVaPoR { set -o pipefail while read SPLIT; do - zcat $SPLIT | tail -n+2 - done < ~{write_lines(shard_bed_files)} \ + zcat $SPLIT | tail -n+2 + done < ~{write_lines(shard_bed_files)} \ | sort -Vk1,1 -k2,2n -k3,3n \ | bgzip -c \ > ~{output_file} @@ -64,10 +307,13 @@ task ConcatVaPoR { fi mkdir ~{prefix}.plots + mkdir ~{prefix}.tmp_plots + while read SPLIT; do - tar zxvf $SPLIT -C ~{prefix}.plots/ + tar zxvf $SPLIT -C ~{prefix}.tmp_plots/ done < ~{write_lines(shard_plots)} + mv ~{prefix}.tmp_plots/*/* ~{prefix}.plots/ tar -czf ~{prefix}.plots.tar.gz ~{prefix}.plots/ >>> @@ -75,10 +321,323 @@ task ConcatVaPoR { File merged_bed_file = output_file File merged_bed_plot = "~{prefix}.plots.tar.gz" } -} + } + + +# Task to merge VID lists across shards +task 
FilesToTarredFolder { + input { + Array[File] in_files + String? folder_name + String? tarball_prefix + String sv_base_mini_docker + RuntimeAttr? runtime_attr_override + } + + String tar_folder_name = select_first([folder_name, "merged"]) + String outfile_name = select_first([tarball_prefix, tar_folder_name]) + ".tar.gz" + + # Since the input files are often/always compressed themselves, assume compression factor for tarring is 1.0 + Float input_size = size(in_files, "GB") + Float base_disk_gb = 5.0 + Float base_mem_gb = 2.0 + RuntimeAttr runtime_default = object { + mem_gb: base_mem_gb, + disk_gb: ceil(base_disk_gb + input_size * 2.0), + cpu_cores: 1, + preemptible_tries: 3, + max_retries: 1, + boot_disk_gb: 10 + } + RuntimeAttr runtime_override = select_first([runtime_attr_override, runtime_default]) + runtime { + memory: "~{select_first([runtime_override.mem_gb, runtime_default.mem_gb])} GB" + disks: "local-disk ~{select_first([runtime_override.disk_gb, runtime_default.disk_gb])} HDD" + cpu: select_first([runtime_override.cpu_cores, runtime_default.cpu_cores]) + preemptible: select_first([runtime_override.preemptible_tries, runtime_default.preemptible_tries]) + maxRetries: select_first([runtime_override.max_retries, runtime_default.max_retries]) + docker: sv_base_mini_docker + bootDiskSizeGb: select_first([runtime_override.boot_disk_gb, runtime_default.boot_disk_gb]) + } + + command <<< + # Create final output directory + mkdir "~{tar_folder_name}" + + while read VID_LIST; do + mv "$VID_LIST" "~{tar_folder_name}" + done < ~{write_lines(in_files)} + + # Compress final output directory + tar -czvf "~{outfile_name}" "~{tar_folder_name}" + >>> + + output { + File tarball = outfile_name + } + } + + +#Create input file for per-batch genotyping of predicted CPX CNV intervals +task PasteFiles { + input { + Array[String] input_strings + Array[File] input_files + String outfile_name + String sv_base_mini_docker + RuntimeAttr? 
runtime_attr_override + } + + # when filtering/sorting/etc, memory usage will likely go up (much of the data will have to + # be held in memory or disk while working, potentially in a form that takes up more space) + Float input_size = size(input_files, "GB") + Float base_disk_gb = 5.0 + Float base_mem_gb = 2.0 + RuntimeAttr runtime_default = object { + mem_gb: base_mem_gb, + disk_gb: ceil(base_disk_gb + input_size * 2.0), + cpu_cores: 1, + preemptible_tries: 3, + max_retries: 1, + boot_disk_gb: 10 + } + RuntimeAttr runtime_override = select_first([runtime_attr_override, runtime_default]) + runtime { + memory: "~{select_first([runtime_override.mem_gb, runtime_default.mem_gb])} GB" + disks: "local-disk ~{select_first([runtime_override.disk_gb, runtime_default.disk_gb])} HDD" + cpu: select_first([runtime_override.cpu_cores, runtime_default.cpu_cores]) + preemptible: select_first([runtime_override.preemptible_tries, runtime_default.preemptible_tries]) + maxRetries: select_first([runtime_override.max_retries, runtime_default.max_retries]) + docker: sv_base_mini_docker + bootDiskSizeGb: select_first([runtime_override.boot_disk_gb, runtime_default.boot_disk_gb]) + } + + command <<< + set -eu -o pipefail + + paste ~{sep=' ' input_files} \ + > ~{outfile_name} + >>> + + output { + File outfile = outfile_name + } + } + +# Select a subset of vcf records by passing a bash filter command +# records_filter must be a bash command accepting vcf records passed via +# pipe, and outputing the desired records to stdout +task FilterVcf { + input { + File vcf + String outfile_prefix + String records_filter + Boolean? index_output + String sv_base_mini_docker + RuntimeAttr? runtime_attr_override + } + + String outfile_name = outfile_prefix + ".vcf.gz" + Boolean call_tabix = select_first([index_output, true]) + + # when filtering/sorting/etc, memory usage will likely go up (much of the data will have to + # be held in memory or disk while working, potentially in a form that takes up more space) + Float input_size = size(vcf, "GB") + Float compression_factor = 5.0 + Float base_disk_gb = 5.0 + Float base_mem_gb = 2.0 + RuntimeAttr runtime_default = object { + mem_gb: base_mem_gb + compression_factor * input_size, + disk_gb: ceil(base_disk_gb + input_size * 2.0 * (1 + compression_factor)), + cpu_cores: 1, + preemptible_tries: 3, + max_retries: 1, + boot_disk_gb: 10 + } + RuntimeAttr runtime_override = select_first([runtime_attr_override, runtime_default]) + runtime { + memory: "~{select_first([runtime_override.mem_gb, runtime_default.mem_gb])} GB" + disks: "local-disk ~{select_first([runtime_override.disk_gb, runtime_default.disk_gb])} HDD" + cpu: select_first([runtime_override.cpu_cores, runtime_default.cpu_cores]) + preemptible: select_first([runtime_override.preemptible_tries, runtime_default.preemptible_tries]) + maxRetries: select_first([runtime_override.max_retries, runtime_default.max_retries]) + docker: sv_base_mini_docker + bootDiskSizeGb: select_first([runtime_override.boot_disk_gb, runtime_default.boot_disk_gb]) + } + + command <<< + set -eu -o pipefail + + # uncompress vcf + zcat ~{vcf} > uncompressed.vcf + + # Extract vcf header: + # search for first line not starting with '#', stop immediately, + # take everything up to that point, then remove last line. 
+ ONLY_HEADER=false + grep -B9999999999 -m1 -Ev "^#" uncompressed.vcf | sed '$ d' > header.vcf \ + || ONLY_HEADER=true + + if $ONLY_HEADER; then + # no records were found, so filter is trivial, just use original vcf + mv ~{vcf} ~{outfile_name} + else + N_HEADER=$(wc -l < header.vcf) + + # Put filter inside subshell so that there is no pipefail if there are no matches + # NOTE: this is dangerous in the event that the filter is buggy + tail -n+$((N_HEADER+1)) uncompressed.vcf \ + | { ~{records_filter} || true; }\ + | cat header.vcf - \ + | vcf-sort \ + | bgzip -c \ + > "~{outfile_name}" + fi + + if ~{call_tabix}; then + tabix -p vcf -f "~{outfile_name}" + else + touch "~{outfile_name}.tbi" + fi + >>> + + output { + File filtered_vcf = outfile_name + File filtered_vcf_idx = outfile_name + ".tbi" + } + } + +# Find intersection of Variant IDs from vid_list with those present in vcf, return as filtered_vid_list +task SubsetVariantList { + input { + File vid_list + File vcf + String outfile_name + String sv_base_mini_docker + RuntimeAttr? runtime_attr_override + } + + # when filtering/sorting/etc, memory usage will likely go up (much of the data will have to + # be held in memory or disk while working, potentially in a form that takes up more space) + Float vid_list_size = size(vid_list, "GB") + Float vcf_size = size(vcf, "GB") + Float compression_factor = 5.0 + Float cut_factor = 2.0 + Float base_disk_gb = 5.0 + Float base_mem_gb = 2.0 + RuntimeAttr runtime_default = object { + mem_gb: base_mem_gb + compression_factor / cut_factor * vcf_size, + disk_gb: ceil(base_disk_gb + vid_list_size * 2.0 + vcf_size * (1 + compression_factor / cut_factor)), + cpu_cores: 1, + preemptible_tries: 3, + max_retries: 1, + boot_disk_gb: 10 + } + RuntimeAttr runtime_override = select_first([runtime_attr_override, runtime_default]) + runtime { + memory: "~{select_first([runtime_override.mem_gb, runtime_default.mem_gb])} GB" + disks: "local-disk ~{select_first([runtime_override.disk_gb, runtime_default.disk_gb])} HDD" + cpu: select_first([runtime_override.cpu_cores, runtime_default.cpu_cores]) + preemptible: select_first([runtime_override.preemptible_tries, runtime_default.preemptible_tries]) + maxRetries: select_first([runtime_override.max_retries, runtime_default.max_retries]) + docker: sv_base_mini_docker + bootDiskSizeGb: select_first([runtime_override.boot_disk_gb, runtime_default.boot_disk_gb]) + } + + command <<< + set -eu -o pipefail + + #Get list of variant IDs present in VCF + zcat ~{vcf} | (grep -vE "^#" || printf "") | cut -f3 > valid_vids.list + #Restrict input variant ID list to valid VIDs + (fgrep -wf valid_vids.list ~{vid_list} || printf "") > "~{outfile_name}" + >>> + + output { + File filtered_vid_list = outfile_name + } + } + + +# evenly split text file into even chunks +# if shuffle_file is set to true, shuffle the file before splitting (default = false) +task SplitUncompressed { + input { + File whole_file + Int lines_per_shard + String? shard_prefix + Boolean? shuffle_file + Int? random_seed + String sv_pipeline_docker + RuntimeAttr? 
runtime_attr_override
+  }
+
+  String split_prefix=select_first([shard_prefix, "shard_"])
+  Boolean do_shuffle=select_first([shuffle_file, false])
+  Int random_seed_ = if defined(random_seed) then select_first([random_seed]) else 0
+
+  # when filtering/sorting/etc, memory usage will likely go up (much of the data will have to
+  # be held in memory or disk while working, potentially in a form that takes up more space)
+  Float input_size = size(whole_file, "GB")
+  Float base_disk_gb = 5.0
+  Float base_mem_gb = 2.0
+  RuntimeAttr runtime_default = object {
+    mem_gb: base_mem_gb,
+    disk_gb: ceil(base_disk_gb + input_size * 2.0),
+    cpu_cores: 1,
+    preemptible_tries: 3,
+    max_retries: 1,
+    boot_disk_gb: 10
+  }
+  RuntimeAttr runtime_override = select_first([runtime_attr_override, runtime_default])
+  runtime {
+    memory: "~{select_first([runtime_override.mem_gb, runtime_default.mem_gb])} GB"
+    disks: "local-disk ~{select_first([runtime_override.disk_gb, runtime_default.disk_gb])} HDD"
+    cpu: select_first([runtime_override.cpu_cores, runtime_default.cpu_cores])
+    preemptible: select_first([runtime_override.preemptible_tries, runtime_default.preemptible_tries])
+    maxRetries: select_first([runtime_override.max_retries, runtime_default.max_retries])
+    docker: sv_pipeline_docker
+    bootDiskSizeGb: select_first([runtime_override.boot_disk_gb, runtime_default.boot_disk_gb])
+  }
+
+  command <<<
+    set -eu -o pipefail
+
+    function get_seeded_random() {
+      openssl enc -aes-256-ctr -pass pass:"$1" -nosalt </dev/zero 2>/dev/null
+    }
+    # note for if ~do_shuffle is true: shuf is faster than sort --random-sort, but
+    # sort --random-sort will predictably fit in memory, making it a better choice for VMs
+    ~{if do_shuffle then
+        "sort --random-sort --random-source=<(get_seeded_random ~{random_seed_}) -o ~{whole_file} ~{whole_file}"
+      else
+        ""
+    }
+
+    N_LINES=$(wc -l < ~{whole_file})
+    N_CHUNKS=$((N_LINES / ~{lines_per_shard}))
+    if [ "$N_CHUNKS" -eq "0" ]; then N_CHUNKS=1; fi
+    N_DIGITS=${#N_CHUNKS}
+
+    split -d -a $N_DIGITS -n l/$N_CHUNKS \
+      --numeric-suffixes=$(printf "%0${N_DIGITS}d" 1) \
+      ~{whole_file} \
+      ~{shard_prefix}
+
+    # remove whole file if its name starts with split_prefix, to prevent including in glob
+    if [[ "~{whole_file}" =~ ^"~{split_prefix}".* ]]; then
+      rm -f "~{whole_file}"
+    fi
+  >>>
+
+  output {
+    Array[File] shards=glob("~{shard_prefix}*")
+  }
+}
+
 #localize a specific contig of a bam/cram file
-task LocalizeCram {
+task LocalizeCram{
   input{
     String contig
     File ref_fasta
@@ -112,10 +671,12 @@
     set -Eeuo pipefail
     java -Xmx~{java_mem_mb}M -jar ${GATK_JAR} PrintReads \
-    -I ~{bam_or_cram_file} \
-    -L ~{contig} \
-    -O ~{contig}.bam \
-    -R ~{ref_fasta}
+      -I ~{bam_or_cram_file} \
+      -L ~{contig} \
+      -O ~{contig}.bam \
+      -R ~{ref_fasta}
+
+    #samtools index ~{contig}.bam
   >>>
   runtime {
@@ -136,8 +697,8 @@
     File ref_fai
     File ref_dict
     String project_id
-    String bam_or_cram_file
-    String bam_or_cram_index
+    String? bam_or_cram_file
+    String? bam_or_cram_index
     String sv_pipeline_docker
     RuntimeAttr? runtime_attr_override
  }
@@ -164,11 +725,13 @@
     set -Eeuo pipefail
     java -Xmx~{java_mem_mb}M -jar ${GATK_JAR} PrintReads \
-    -I ~{bam_or_cram_file} \
-    -L ~{contig} \
-    -O ~{contig}.bam \
-    -R ~{ref_fasta} \
-    --gcs-project-for-requester-pays ~{project_id}
+      -I ~{bam_or_cram_file} \
+      -L ~{contig} \
+      -O ~{contig}.bam \
+      -R ~{ref_fasta} \
+      --gcs-project-for-requester-pays ~{project_id}
+
+    samtools index ~{contig}.bam
   >>>
   runtime {
@@ -272,7 +835,6 @@
     maxRetries: select_first([runtime_attr.max_retries, default_attr.max_retries])
   }
 }
-
 task vcf2bed{
   input{
     File vcf
@@ -326,6 +888,145 @@
   }
 }
 
+# For the tasks that recluster SVs of a specific type that fall within specific genomic regions
+task IntegrateReClusterdVcfs{
+  input{
+    File vcf_all
+    File vcf_all_idx
+    File vcf_recluster
+    File vcf_recluster_idx
+
+    String sv_pipeline_docker
+
+    RuntimeAttr? runtime_attr_override
+  }
+
+  RuntimeAttr runtime_default = object {
+    mem_gb: 7.5,
+    disk_gb: ceil(10.0 + size(vcf_all, "GiB")*4),
+    cpu_cores: 1,
+    preemptible_tries: 1,
+    max_retries: 1,
+    boot_disk_gb: 10
+  }
+
+  RuntimeAttr runtime_override = select_first([runtime_attr_override, runtime_default])
+
+  runtime {
+    memory: "~{select_first([runtime_override.mem_gb, runtime_default.mem_gb])} GiB"
+    disks: "local-disk ~{select_first([runtime_override.disk_gb, runtime_default.disk_gb])} HDD"
+    cpu: select_first([runtime_override.cpu_cores, runtime_default.cpu_cores])
+    preemptible: select_first([runtime_override.preemptible_tries, runtime_default.preemptible_tries])
+    maxRetries: select_first([runtime_override.max_retries, runtime_default.max_retries])
+    docker: sv_pipeline_docker
+    bootDiskSizeGb: select_first([runtime_override.boot_disk_gb, runtime_default.boot_disk_gb])
+  }
+
+  String prefix = basename(vcf_recluster, ".vcf.gz")
+  String prefix_all = basename(vcf_all, ".vcf.gz")
+
+  command <<<
+    svtk vcf2bed -i MEMBERS ~{vcf_recluster} ~{prefix}.bed
+    cut -f4,7 ~{prefix}.bed > ~{prefix}.SVID_anno
+
+    python3 <<CODE
+    import pysam
+    # ~{prefix}.SVID_anno columns: reclustered variant ID, comma-delimited member IDs
+    # keep only clusters that merged more than one original variant, and map each
+    # member ID back to its new cluster ID
+    SVID_list = []
+    clustered_vs_origin_SVID = {}
+    fin = open("~{prefix}.SVID_anno")
+    for line in fin:
+      pin = line.strip().split()
+      if len(pin[1].split(','))>1:
+        SVID_list.append(pin[0])
+        for i in pin[1].split(','):
+          if not i in clustered_vs_origin_SVID.keys():
+            clustered_vs_origin_SVID[i] = pin[0]
+    fin.close()
+
+    fin = pysam.VariantFile("~{vcf_all}")
+    fin2 = pysam.VariantFile("~{vcf_recluster}")
+    fo = pysam.VariantFile("~{prefix_all}.ReClustered_unsort.vcf.gz", 'w', header = fin2.header)
+    fo2 = pysam.VariantFile("~{prefix}.ReClustered_unsort.vcf.gz", 'w', header = fin2.header)
+    for record in fin:
+      if not record.id in clustered_vs_origin_SVID.keys():
+        fo.write(record)
+    fin.close()
+
+    for record in fin2:
+      if record.id in SVID_list:
+        print(record.id)
+        fo2.write(record)
+    fin2.close()
+    fo.close()
+    fo2.close()
+    print("done")
+    CODE
+
+    tabix -p vcf "~{prefix_all}.ReClustered_unsort.vcf.gz"
+    tabix -p vcf "~{prefix}.ReClustered_unsort.vcf.gz"
+  >>>
+
+  output{
+    File SVID_anno = "~{prefix}.SVID_anno"
+    File reclustered_Part1 = "~{prefix_all}.ReClustered_unsort.vcf.gz"
+    File reclustered_Part2 = "~{prefix}.ReClustered_unsort.vcf.gz"
+    File reclustered_Part1_idx = "~{prefix_all}.ReClustered_unsort.vcf.gz.tbi"
+    File reclustered_Part2_idx = "~{prefix}.ReClustered_unsort.vcf.gz.tbi"
+  }
+}
+
+task SortReClusterdVcfs{
+  input{
+    File vcf_1
+    File vcf_2
+    File vcf_1_idx
+    File vcf_2_idx
+
+    String sv_pipeline_docker
+    RuntimeAttr?
runtime_attr_override + } + + RuntimeAttr runtime_default = object { + mem_gb: 7.5, + disk_gb: ceil(10.0 + size(vcf_1, "GiB")*2 + size(vcf_2, "GiB")*2), + cpu_cores: 1, + preemptible_tries: 1, + max_retries: 1, + boot_disk_gb: 10 + } + + RuntimeAttr runtime_override = select_first([runtime_attr_override, runtime_default]) + + runtime { + memory: "~{select_first([runtime_override.mem_gb, runtime_default.mem_gb])} GiB" + disks: "local-disk ~{select_first([runtime_override.disk_gb, runtime_default.disk_gb])} HDD" + cpu: select_first([runtime_override.cpu_cores, runtime_default.cpu_cores]) + preemptible: select_first([runtime_override.preemptible_tries, runtime_default.preemptible_tries]) + maxRetries: select_first([runtime_override.max_retries, runtime_default.max_retries]) + docker: sv_pipeline_docker + bootDiskSizeGb: select_first([runtime_override.boot_disk_gb, runtime_default.boot_disk_gb]) + } + + String prefix_all = basename(vcf_1, ".ReClustered_unsort.vcf.gz") + + command <<< + echo "~{vcf_1}" >> vcf_list + echo "~{vcf_2}" >> vcf_list + bcftools concat --allow-overlaps --output-type z --file-list vcf_list --output ~{prefix_all}.ReClustered.vcf.gz + tabix -p vcf ~{prefix_all}.ReClustered.vcf.gz + + >>> + + output{ + File sorted_vcf = "~{prefix_all}.ReClustered.vcf.gz" + File sorted_vcf_idx = "~{prefix_all}.ReClustered.vcf.gz.tbi" + } +} + + diff --git a/wdl/Utils.wdl b/wdl/Utils.wdl index fdd7b2221..557883007 100644 --- a/wdl/Utils.wdl +++ b/wdl/Utils.wdl @@ -223,3 +223,64 @@ task SubsetPedFile { maxRetries: select_first([runtime_attr.max_retries, default_attr.max_retries]) } } + +# Subset a VCF to a specific subset of samples +task SubsetVcfBySamplesList { + input { + File vcf + File? vcf_idx + File list_of_samples_to_keep + String subset_name = "subset" + String sv_base_mini_docker + RuntimeAttr? runtime_attr_override + } + + String vcf_subset_filename = basename(vcf, ".vcf.gz") + ".~{subset_name}.vcf.gz" + String vcf_subset_idx_filename = vcf_subset_filename + ".tbi" + + # Disk must be scaled proportionally to the size of the VCF + Float input_size = size(vcf, "GiB") + Float disk_scaling_factor = 1.5 + Float base_disk_gb = 10.0 + RuntimeAttr default_attr = object { + mem_gb: 3.75, + disk_gb: ceil(base_disk_gb + (input_size * disk_scaling_factor)), + cpu_cores: 1, + preemptible_tries: 3, + max_retries: 1, + boot_disk_gb: 10 + } + RuntimeAttr runtime_attr = select_first([runtime_attr_override, default_attr]) + + command <<< + + set -euo pipefail + + bcftools view \ + -S ~{list_of_samples_to_keep} \ + --force-samples \ + ~{vcf} \ + | bcftools view \ + --min-ac 1 \ + -O z \ + -o ~{vcf_subset_filename} + + tabix -f -p vcf ~{vcf_subset_filename} + + >>> + + output { + File vcf_subset = vcf_subset_filename + File vcf_subset_idx = vcf_subset_idx_filename + } + + runtime { + cpu: select_first([runtime_attr.cpu_cores, default_attr.cpu_cores]) + memory: select_first([runtime_attr.mem_gb, default_attr.mem_gb]) + " GiB" + disks: "local-disk " + select_first([runtime_attr.disk_gb, default_attr.disk_gb]) + " HDD" + bootDiskSizeGb: select_first([runtime_attr.boot_disk_gb, default_attr.boot_disk_gb]) + docker: sv_base_mini_docker + preemptible: select_first([runtime_attr.preemptible_tries, default_attr.preemptible_tries]) + maxRetries: select_first([runtime_attr.max_retries, default_attr.max_retries]) + } +}
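
Note on the reclustering inputs: in `ReClusterCleanVcfWholeGenome` (and the per-SVTYPE and per-genomic-context workflows above), `svtype_list`, `Genomic_Context_list_list`, and `size_list_list` are index-matched. The sketch below is illustrative only; the SV types, context labels, and size cutoffs are placeholders, not recommended settings.

```python
# Illustration of how the nested scatter inputs line up (placeholder values only):
# svtype_list[i] is reclustered once per context in Genomic_Context_list_list[i],
# and size_list_list[i][j] = [min_size, max_size] applies to that (svtype, context) pair.
svtype_list = ["DEL", "DUP"]
Genomic_Context_list_list = [["SD", "SR"],
                             ["SD", "SR"]]
size_list_list = [[[5000, 10000000], [300, 5000]],
                  [[5000, 10000000], [300, 5000]]]

for svtype, contexts, sizes in zip(svtype_list, Genomic_Context_list_list, size_list_list):
    for context, (min_size, max_size) in zip(contexts, sizes):
        print(f"recluster {svtype} in {context} context, sizes {min_size}-{max_size} bp")
```

Each (svtype, context, size window) combination becomes one Extract_SVID_by_GC + SvtkVcfCluster shard; the reclustered shards are concatenated per context and per SV type, then re-integrated with the untouched records by IntegrateReClusterdVcfs and sorted by SortReClusterdVcfs, scattered per chromosome in the whole-genome workflow.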