Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Rewrite the SplitVariants task command in TasksGenotypeBatch.wdl to call svtk only once #618

Merged
merged 15 commits into from
Jan 5, 2024

Conversation

kirtanav98
Copy link
Contributor

@kirtanav98 kirtanav98 commented Nov 20, 2023

This pull request was created to improve the efficiency of the SplitVariants task command in the TasksGenotypeBatch.wdl. The SplitVariants task took an input bed file and a number of splits, called svtk to produce multiple bed files and then used awk to filter and split each bed file based on the number of required splits, proving inefficient with both memory and time and cost.
Instead, this request is to convert the actions of the task into a python script which is called on the input vcf file with the input number of lines per file as well as the variable indicating if bca's should be examined. Svtk is called only once on the vcf to produce a bed file. Each line of the bed file is read only once; the appropriate line is then added to a corresponding text file with the appropriate prefix based on which of the conditions it matches: gt5kb (greater than 5kb in length), lt5kb (less than 5kb in length), and if the bca's are indicated to be examined, if it is a bca or insertion. Once each file has reached the maximum capacity, it is closed an a new file with the same prefix and a new suffix is opened.
This implementation is an improvement in memory and time since svtk is called on the vcf only once, and each line in the bed file is only parsed once as well.
The docker images were updated and implemented.
The appropriate changes were made to the TasksGenotypeBatch. wdl where the SplitVariants task is defined, as well as in GenotypeBatch.wdl. All changes passed validation with womtool and cromwell. Testing involved changing the parameters of bca to True and False with a combination of different input values for n_per_split.
Using a standard bed file with the default parameter of n_per_split=5000 and checking for BCAs, the runtime was about 1.5 times faster than the previous task using the awk commands. The time increases as the n_per_split increases as well but not drastically.

inputs/values/dockers.json Outdated Show resolved Hide resolved
Copy link
Collaborator

@epiercehoffman epiercehoffman left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for all your work on this, it has the potential to really speed up this task! And the code has matured a lot over the time you've been working on it.

I've left specific comments on functionality and style throughout the code, and I also have some more general comments and questions here:

  1. Can you add more comments to your script to summarize the functionality of each section and clarify any details that might be hard for a new reader to understand at first glance?

  2. Vahid's comment is correct - you do need to build your own docker image for testing (best stored under broad-dsde-methods) but we don't commit those changes to the main branch because dockers are handled automatically, so you'll need to revert the changes to dockers.json before this PR can be merged. You can make a local copy of the dockers.json outside of your local clone of the repo to save your changes, then revert the dockers.json changes on your branch before pushing.

  3. Can you also give more details on the testing you performed? SplitVariants is called multiple times with multiple combinations of parameters so it's necessary to test all of those cases.

  4. What was the difference in runtime/memory between the version of SplitVariants in the main branch and your branch? This would be good to benchmark (it may only be more efficient for larger cohorts so it may not show improvements in the reference panel, but we at least don't want to see a performance reduction).

wdl/TasksGenotypeBatch.wdl Outdated Show resolved Hide resolved
src/sv-pipeline/scripts/SplitVariants.py Outdated Show resolved Hide resolved
src/sv-pipeline/scripts/SplitVariants.py Outdated Show resolved Hide resolved
src/sv-pipeline/scripts/SplitVariants.py Outdated Show resolved Hide resolved
src/sv-pipeline/scripts/SplitVariants.py Outdated Show resolved Hide resolved
src/sv-pipeline/scripts/SplitVariants.py Outdated Show resolved Hide resolved
src/sv-pipeline/scripts/SplitVariants.py Outdated Show resolved Hide resolved
src/sv-pipeline/scripts/SplitVariants.py Outdated Show resolved Hide resolved
src/sv-pipeline/scripts/SplitVariants.py Outdated Show resolved Hide resolved
src/sv-pipeline/scripts/SplitVariants.py Outdated Show resolved Hide resolved
Copy link
Collaborator

@epiercehoffman epiercehoffman left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to merge! Thanks for your patience finalizing all those details! I'm looking forward to seeing the performance improvement for this task :)

@kirtanav98 kirtanav98 merged commit 310eb7b into main Jan 5, 2024
5 checks passed
@kirtanav98 kirtanav98 deleted the kv_split_variants branch January 5, 2024 19:43
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants