Reduce SVConcordance memory footprint #8623

Merged 3 commits on Dec 14, 2023

mwalker174
Contributor

The SVConcordance tool currently uses too much memory, requiring several hundred GB of heap space on ~100K samples. This PR aims to reduce memory usage in two ways:

  1. Truth VCF records are stripped of all genotype fields except GT and CN, which are necessary and sufficient for concordance computations.
  2. A new option --do-not-sort is introduced to skip output record sorting. A major source of heap usage is the output buffer in the ClosestSVFinder class, which ensures records are emitted in coordinate-sorted order. This buffer fills quickly whenever at least one record being actively clustered spans a large interval, because the buffer cannot be flushed until a variant beyond that record's maximal clusterable coordinate is encountered. The option lets users substantially reduce peak heap usage on larger call sets (a single SVRecord can consume ~100MB with 100K samples).
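
As a rough illustration of the first item, the sketch below keeps only the GT and CN entries of a per-sample genotype. It is a toy model that represents a genotype as a plain map; the class name GenotypeStripSketch, the strip method, and the sample field values are assumptions made for this example, not the code in this PR.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;

public class GenotypeStripSketch {
    // The two fields the description names as sufficient for concordance.
    static final Set<String> KEEP = Set.of("GT", "CN");

    // Hypothetical helper: returns a copy of one sample's genotype fields
    // containing only the keys in KEEP. The input map is not modified.
    static Map<String, Object> strip(Map<String, Object> genotype) {
        Map<String, Object> out = new LinkedHashMap<>();
        for (Map.Entry<String, Object> e : genotype.entrySet()) {
            if (KEEP.contains(e.getKey())) {
                out.put(e.getKey(), e.getValue());
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // Illustrative genotype; the extra fields stand in for whatever
        // per-sample annotations a truth VCF might carry.
        Map<String, Object> genotype = new LinkedHashMap<>();
        genotype.put("GT", "0/1");
        genotype.put("CN", 1);
        genotype.put("GQ", 99);
        genotype.put("EXTRA", "dropped");
        System.out.println(strip(genotype));
    }
}
```

Because the saving applies per sample per record, dropping even a few fields adds up quickly at ~100K samples.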

Includes an integration test to cover the --do-not-sort functionality.
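
To make the buffering behavior concrete, here is a minimal toy model of a coordinate-sorting output buffer. It is a sketch under simplifying assumptions, not the ClosestSVFinder implementation: the names SortBufferSketch, Sv, Result, and run are hypothetical, a record is treated as finished once the next input starts beyond its end plus a fixed window, and a finished record is held back while any still-active record starts before it.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;

public class SortBufferSketch {
    static final class Sv {
        final String id; final int start; final int end;
        Sv(String id, int start, int end) { this.id = id; this.start = start; this.end = end; }
    }

    static final class Result {
        final List<String> emitted = new ArrayList<>();
        int peakBuffered = 0; // most finished records held back at once
    }

    // Input must be sorted by start. With sortOutput=false, finished records
    // are emitted immediately and nothing is ever held in the buffer.
    static Result run(List<Sv> input, int window, boolean sortOutput) {
        Result result = new Result();
        List<Sv> active = new ArrayList<>();
        PriorityQueue<Sv> buffer = new PriorityQueue<>(Comparator.comparingInt((Sv s) -> s.start));
        for (Sv next : input) {
            List<Sv> stillActive = new ArrayList<>();
            for (Sv a : active) {
                if (next.start > a.end + window) {
                    // 'a' can no longer cluster with anything: it is finished.
                    if (sortOutput) buffer.add(a); else result.emitted.add(a.id);
                } else {
                    stillActive.add(a);
                }
            }
            stillActive.add(next);
            active = stillActive;
            flush(buffer, active, result);
        }
        // End of input: everything still active is finished.
        for (Sv a : active) {
            if (sortOutput) buffer.add(a); else result.emitted.add(a.id);
        }
        flush(buffer, new ArrayList<>(), result);
        return result;
    }

    // Emit buffered records that cannot be preceded by any active record.
    static void flush(PriorityQueue<Sv> buffer, List<Sv> active, Result result) {
        int minActiveStart = Integer.MAX_VALUE;
        for (Sv a : active) minActiveStart = Math.min(minActiveStart, a.start);
        while (!buffer.isEmpty() && buffer.peek().start <= minActiveStart) {
            result.emitted.add(buffer.poll().id);
        }
        result.peakBuffered = Math.max(result.peakBuffered, buffer.size());
    }

    public static void main(String[] args) {
        List<Sv> input = List.of(
            new Sv("big", 100, 10000), new Sv("s1", 200, 210), new Sv("s2", 300, 310),
            new Sv("s3", 400, 410), new Sv("s4", 500, 510), new Sv("far", 20000, 20010));
        Result sorted = run(input, 50, true);
        Result unsorted = run(input, 50, false);
        System.out.println("sorted:   " + sorted.emitted + " peak=" + sorted.peakBuffered);
        System.out.println("unsorted: " + unsorted.emitted + " peak=" + unsorted.peakBuffered);
    }
}
```

In this model the small records nested inside the large one pile up in the buffer until an input record passes the large record's end plus the window, while the unsorted run emits the same records with nothing held back. That is the trade-off the option exposes: lower peak memory in exchange for output that may need sorting afterwards.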

Member

@cwhelan cwhelan left a comment

This looks fine to me. It took me a little while to re-understand the relationship between the "flush" methods in ClosestSVFinder and SVConcordance, but it looks good.

@mwalker174 mwalker174 merged commit b68fadc into master Dec 14, 2023
20 checks passed
@mwalker174 mwalker174 deleted the mw_sv_concordance_opt branch December 14, 2023 20:08