Skip to content

Using both phylogenetic and time-series data from the COVID19 pandemic to estimate number of missing infections and predict forward epidemic trajectory.

License

Notifications You must be signed in to change notification settings

czbiohub-sf/EpiGen-COVID19

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

Estimating coronavirus infections using phylogenies and time series data

Preprint out on MedRxiv: Using viral genomics to estimate undetected infections and extent of superspreading events for COVID-19.

Highlights

  • The number of missing infectious is an important parameter to estimate because it provides information about the scale of the epidemic, which in turn affects resource allocation. Also, it affects how controllable the epidemic is and for how long the epidemic will go on.
  • Incidence or prevalence time series are generally the lower bound of the total number of infections due to under-reporting. Viral phylogenies provide can fill in this knowledge gap because viruses continue to evolve as they spread through asymptomatic human populations.

EpiGenMCMC program

Inference uses the EpiGenMCMC program. See also Li et al. (2017) for more details.

Data

Genomic data are manually downloaded from GISAID and stored in a file named ‘data/sequences/gisaid_cov2020_sequences.fasta’. The accompanying metadata is also manually downloaded and stored in ‘data/sequences/gisaid_cov2020_acknowledgement_table.xls’.

Time series data are from Johns Hopkins github page.

Analysis steps

0. Download time series data

Input: data from Johns Hopkins GitHub repo.

Output: data/time_series/CSSE/*.csv, data/sequences/gisaid_metadata.tsv

./00_download_data.sh

1. Clean and transform time series data

Input: data from Johns Hopkins GitHub repo at data/time_series/CSSE/*.csv, early WHO sitrep (data/time_series/WHO_sitreps_20200121-20200122.tsv), data from Li et al. (2020) NEJM (data/time_series/li2020nejm_wuhan_incidence.tsv), data/sequences/gisaid_metadata.tsv

Output: data/timeseries/summary_{region}timeseries{cumulative|new_cases}.tsv, data/timeseries/timeseries_new_cases.tsv, data/timeseries/timeseries_cumulative_cases.tsv

Rscript 01_curate_time_series.R

2. Rename sequences to include dates of collection

Input: data/sequences/gisaid_cov2020_sequences.fasta, data/sequences/gisaid_metadata.tsv

Output: msa/{region}/input_mafft_{analysisid}.fasta, msa/{region}/input_mafft_{analysisid}.fasta

{analysisid} refers to the date of the last data pull.

Rscript 02_filter_seq.R

3. Align sequences against each other using MAFFT

Input: msa/{region}/input_mafft_{analysisid}.fasta

Output: msa/{region}/msa_{analysisid}.fasta

find msa -maxdepth 1 -mindepth 1 -type d | parallel -j 8 ./03_multi_sequence_alignment.sh {}

4. Create a maximum-likelihood phylogeny using iqtree

Input: msa/{region}/msa_mafft_{analysisid}.fasta

Output: tree/iqtree_{analysisid}.{bionj|boottrees|ckp.gz|contree|iqtree|log|mldist|model.gz|treefile}

find msa -maxdepth 1 -mindepth 1 -type d | sed "s/msa/tree/" | parallel -j 8 ./04_build_ml_tree.sh {}

5. Treetime

Compute dates of branching events using TreeTime

Input: tree/{region}/iqtree_{analysisid}.treefile

Output: tree/{region}/iqtree_{analysisid}dates.tsv, tree/{region}/treetime{analysisid}/{ancestral_sequences.fasta|dates.tsv|divergence_tree.nexus|divergence_tree.nexus|molecular_clock.txt|root_to_tip_regression.pdf|sequence_evolution_model.txt|timetree.nexus|timetree.pdf}

ls tree/*/*treefile | xargs -I{} dirname {} | parallel ./05_treetime.sh {}

6. Initial parameter tuning for EpiGenMCMC

Tips with divergence more than expected for the sampling time were excluded.

Generate a grid of parameter combinations to determine which starting parameters to use for the EpiGenMCMC algorithm.

In total, 100 different parameter combinations are tested for each country, using just the time series data, using just the phylogenetic data, and using both time series and phylogenetic data.

EpiGenMCMC estimates parameter values for each parameter combination and for each of the 3 time series dataset.

Input: tree/{region}/treetime_{analysisid}/{timetree.nexus|dates.tsv}, data/timeseries/summary_{region}_timeseries_new_cases.tsv,

Output: - epigenmcmc_results/covid19_{analysisid}_commands - this file contains all the bash commands for running EpiGenMCMC

For each analysis, the R Script generates these files: - prefix: epigenmcmc_results/{region}/covid19_*/{region}{both|epi|gen}{runid}_

  • suffix: epi_data.txt, gen_data.txt, init_states.txt, mcmc_options.txt, params.txt
Rscript 06_create_EpiGenMCMC_inputs.R

Input: epigenmcmc_results/covid19_{analysisid}commands Output: epigenmcmc/{region}/covid19{analysisid}/*logfile.txt

./epigenmcmc_results/covid19_{analysisid}/commands

7. Use initial parameter search to generate input files for model fitting

Use the initial grid search to set the input parameter values.

Input: epigenmcmc/{region}/covid19_{analysisid}/{logfile}.txt Output: epigenmcmc/{region}/covid19_{analysisid}/inference_commands, epigenmcmc/{region}/covid19_{analysisid}/inference_.txt

Rscript 07_summarize_initial_param_search.R

Input: epigenmcmc/{region}/covid19_{analysisid}/inference_commands Output: epigenmcmc/{region}/covid19_{analysisid}/inference_*_{logfile|trajfile}.txt

./epigenmcmc/covid19_{analysisid}/inference_commands

8. Visualize and summarize results

Summarize MCMC results.

Input: epigenmcmc/{region}/covid19_{analysisid}/inference_*_{logfile|trajfile}.txt Output: epigenmcmc/{figures|tables|files}

Rscript 08_sumarize_results.R

9. Simulation results

Rscript 09_params_to_simulate.R

About

Using both phylogenetic and time-series data from the COVID19 pandemic to estimate number of missing infections and predict forward epidemic trajectory.

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published