Estimating coronavirus infections using phylogenies and time series data

Preprint out on MedRxiv: Using viral genomics to estimate undetected infections and extent of superspreading events for COVID-19.

Highlights

The number of missing infectious is an important parameter to estimate because it provides information about the scale of the epidemic, which in turn affects resource allocation. Also, it affects how controllable the epidemic is and for how long the epidemic will go on.
Incidence or prevalence time series are generally the lower bound of the total number of infections due to under-reporting. Viral phylogenies provide can fill in this knowledge gap because viruses continue to evolve as they spread through asymptomatic human populations.

EpiGenMCMC program

Inference uses the EpiGenMCMC program. See also Li et al. (2017) for more details.

Data

Genomic data are manually downloaded from GISAID and stored in a file named ‘data/sequences/gisaid_cov2020_sequences.fasta’. The accompanying metadata is also manually downloaded and stored in ‘data/sequences/gisaid_cov2020_acknowledgement_table.xls’.

Time series data are from Johns Hopkins github page.

Analysis steps

0. Download time series data

Input: data from Johns Hopkins GitHub repo.

Output: data/time_series/CSSE/*.csv, data/sequences/gisaid_metadata.tsv

./00_download_data.sh

1. Clean and transform time series data

Input: data from Johns Hopkins GitHub repo at data/time_series/CSSE/*.csv, early WHO sitrep (data/time_series/WHO_sitreps_20200121-20200122.tsv), data from Li et al. (2020) NEJM (data/time_series/li2020nejm_wuhan_incidence.tsv), data/sequences/gisaid_metadata.tsv

Output: data/timeseries/summary_{region}timeseries{cumulative|new_cases}.tsv, data/timeseries/timeseries_new_cases.tsv, data/timeseries/timeseries_cumulative_cases.tsv

Rscript 01_curate_time_series.R

2. Rename sequences to include dates of collection

Input: data/sequences/gisaid_cov2020_sequences.fasta, data/sequences/gisaid_metadata.tsv

Output: msa/{region}/input_mafft_{analysisid}.fasta, msa/{region}/input_mafft_{analysisid}.fasta

{analysisid} refers to the date of the last data pull.

Rscript 02_filter_seq.R

3. Align sequences against each other using `MAFFT`

Input: msa/{region}/input_mafft_{analysisid}.fasta

Output: msa/{region}/msa_{analysisid}.fasta

find msa -maxdepth 1 -mindepth 1 -type d | parallel -j 8 ./03_multi_sequence_alignment.sh {}

4. Create a maximum-likelihood phylogeny using `iqtree`

Input: msa/{region}/msa_mafft_{analysisid}.fasta

Output: tree/iqtree_{analysisid}.{bionj|boottrees|ckp.gz|contree|iqtree|log|mldist|model.gz|treefile}

find msa -maxdepth 1 -mindepth 1 -type d | sed "s/msa/tree/" | parallel -j 8 ./04_build_ml_tree.sh {}

5. Treetime

Compute dates of branching events using TreeTime

Input: tree/{region}/iqtree_{analysisid}.treefile

Output: tree/{region}/iqtree_{analysisid}dates.tsv, tree/{region}/treetime{analysisid}/{ancestral_sequences.fasta|dates.tsv|divergence_tree.nexus|divergence_tree.nexus|molecular_clock.txt|root_to_tip_regression.pdf|sequence_evolution_model.txt|timetree.nexus|timetree.pdf}

ls tree/*/*treefile | xargs -I{} dirname {} | parallel ./05_treetime.sh {}

6. Initial parameter tuning for EpiGenMCMC

Tips with divergence more than expected for the sampling time were excluded.

Generate a grid of parameter combinations to determine which starting parameters to use for the EpiGenMCMC algorithm.

In total, 100 different parameter combinations are tested for each country, using just the time series data, using just the phylogenetic data, and using both time series and phylogenetic data.

EpiGenMCMC estimates parameter values for each parameter combination and for each of the 3 time series dataset.

Input: tree/{region}/treetime_{analysisid}/{timetree.nexus|dates.tsv}, data/timeseries/summary_{region}_timeseries_new_cases.tsv,

Output: - epigenmcmc_results/covid19_{analysisid}_commands - this file contains all the bash commands for running EpiGenMCMC

For each analysis, the R Script generates these files: - prefix: epigenmcmc_results/{region}/covid19_*/{region}{both|epi|gen}{runid}_

suffix: epi_data.txt, gen_data.txt, init_states.txt, mcmc_options.txt, params.txt

Rscript 06_create_EpiGenMCMC_inputs.R

Input: epigenmcmc_results/covid19_{analysisid}commands Output: epigenmcmc/{region}/covid19{analysisid}/*logfile.txt

./epigenmcmc_results/covid19_{analysisid}/commands

7. Use initial parameter search to generate input files for model fitting

Use the initial grid search to set the input parameter values.

Input: epigenmcmc/{region}/covid19_{analysisid}/{logfile}.txt Output: epigenmcmc/{region}/covid19_{analysisid}/inference_commands, epigenmcmc/{region}/covid19_{analysisid}/inference_.txt

Rscript 07_summarize_initial_param_search.R

Input: epigenmcmc/{region}/covid19_{analysisid}/inference_commands Output: epigenmcmc/{region}/covid19_{analysisid}/inference_*_{logfile|trajfile}.txt

./epigenmcmc/covid19_{analysisid}/inference_commands

8. Visualize and summarize results

Summarize MCMC results.

Input: epigenmcmc/{region}/covid19_{analysisid}/inference_*_{logfile|trajfile}.txt Output: epigenmcmc/{figures|tables|files}

Rscript 08_sumarize_results.R

9. Simulation results

Rscript 09_params_to_simulate.R

Name		Name	Last commit message	Last commit date
Latest commit History 139 Commits
data		data
documents		documents
figures		figures
files		files
packrat		packrat
tables		tables
.Rprofile		.Rprofile
.gitignore		.gitignore
00_download_data.sh		00_download_data.sh
01_curate_time_series.R		01_curate_time_series.R
02_filter_seq.R		02_filter_seq.R
03_multi_sequence_alignment.sh		03_multi_sequence_alignment.sh
04_build_ml_tree.sh		04_build_ml_tree.sh
05_treetime.sh		05_treetime.sh
05a_create_mrbayes_input_file.R		05a_create_mrbayes_input_file.R
05b_mrbayes.sh		05b_mrbayes.sh
06_create_EpiGenMCMC_inputs.R		06_create_EpiGenMCMC_inputs.R
07_summarize_initial_param_search.R		07_summarize_initial_param_search.R
08_summarize_results.R		08_summarize_results.R
08a_priors_plots.R		08a_priors_plots.R
08b_formatted_data.R		08b_formatted_data.R
09_params_to_simulate.R		09_params_to_simulate.R
COVID19.Rproj		COVID19.Rproj
LICENSE.txt		LICENSE.txt
README.Rmd		README.Rmd
README.md		README.md
environment.yml		environment.yml
utils.R		utils.R

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Estimating coronavirus infections using phylogenies and time series data

Highlights

EpiGenMCMC program

Data

Analysis steps

0. Download time series data

1. Clean and transform time series data

2. Rename sequences to include dates of collection

3. Align sequences against each other using `MAFFT`

4. Create a maximum-likelihood phylogeny using `iqtree`

5. Treetime

6. Initial parameter tuning for EpiGenMCMC

7. Use initial parameter search to generate input files for model fitting

8. Visualize and summarize results

9. Simulation results

About

Releases

Packages

Contributors 2

Languages

License

czbiohub-sf/EpiGen-COVID19

Folders and files

Latest commit

History

Repository files navigation

Estimating coronavirus infections using phylogenies and time series data

Highlights

EpiGenMCMC program

Data

Analysis steps

0. Download time series data

1. Clean and transform time series data

2. Rename sequences to include dates of collection

3. Align sequences against each other using MAFFT

4. Create a maximum-likelihood phylogeny using iqtree

5. Treetime

6. Initial parameter tuning for EpiGenMCMC

7. Use initial parameter search to generate input files for model fitting

8. Visualize and summarize results

9. Simulation results

About

Resources

License

Stars

Watchers

Forks

Releases

Packages 0

Contributors 2

Languages

3. Align sequences against each other using `MAFFT`

4. Create a maximum-likelihood phylogeny using `iqtree`

Packages