Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Reimplement ParseGenotypes in GenotypeComplexVariants #614

Merged
merged 4 commits into from
Jun 25, 2024

Conversation

mwalker174
Copy link
Collaborator

A critical optimization to the ParseGenotypes task that reimplements process_posthoc_cpx_depth_regenotyping.py with greatly accelerated computations.

The previous version of the script used many repetitive quadratic (N^2) commands that caused it to grind to a halt on a 98,000-sample call set in problematic regions, with shards taking >1 week (possibly longer) to complete. A call cached run of the same workflow on chr2 took <24 hr to run and cost $2.73, requiring <15GB memory per shard in ParseGenotypes.

In a few places, I've pointed out some possible bugs with #TODOs. In the interest of preserving exact functionality, I did not attempt to address these (they mostly seemed minor), but they should be revisited in the future.

An issue where two UNRESOLVED filter statuses were being applied to some records was also corrected.

The new implementation was tested on the 1KGP reference panel and produced identical output through CleanVcf, with the exception of one record reflecting a rare edge case that previously resulted in a false positive large CPX CNV (relating to an INS record with END<<POS).

Copy link
Collaborator

@epiercehoffman epiercehoffman left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A very belated thanks for this rewrite and its massive scaling benefits in such a short time frame. This will also be great to have in python, although I must admit it's still pretty difficult to follow (and I imagine even more difficult to write). It would definitely benefit from more comments throughout.

I did leave a bunch of comments and questions and some of them might prompt minor changes (or could inform places needing further documentation), but this has already been scale-tested and used so I'm going to go ahead and approve.

genotype_list_dict[vid] = list()
matching_variant_info = None
for variant_info in genotype_list_dict[vid]:
if variant_info.chrom == chrom and variant_info.start == start and variant_info.end == end:
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It took me a couple reads to understand what the inputs look like, and why you end up with multiple variants per VID here, and to distinguish between the multiple different intervals/variants/genotypes from the different inputs that are all called variant or genotype in the code, so I think that would be helpful to document

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added a few inline comments here

def get_overlap(chrom1, start1, end1, chrom2, start2, end2):
if chrom1 != chrom2:
return 0
if not (start1 <= end2 and start1 <= end1):
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
if not (start1 <= end2 and start1 <= end1):
if not (start1 <= end2 and start2 <= end1):

This is what I expected to see if you're just checking whether the intervals overlap?

I don't think it would impact the overall script results though. If the intervals don't overlap and interval 1 < interval 2 then the result here would just be negative

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch, luckily yeah I don't think it's of any consequence.

control_cn = [var.genotype_copy_numbers[s] for s in control_samples for var in matching_variants
if var.genotype_copy_numbers.get(s, None) is not None]
if len(control_cn) == 0:
# If no valid carriers, assume default
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
# If no valid carriers, assume default
# If no valid controls, assume default


# Get valid matching variant genotype records
variant_info_list = genotype_dict[vid]
if end - start < 1000000:
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see that this matches the original script, but why do we require 95% RO for variants <1Mb but one-way 100% overlap for variants >1Mb?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm unsure about this

else:
intervals = ["UNK_" + s for s in intervals_list_dict[query_record.merged_vid]]
for interval_str in intervals:
tokens = interval_str.replace(":", "\t").replace("_", "\t").replace("-", "\t").split("\t")
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could do this in one line with a regex like ([A-Z]*)_(chr[XY0-9]*):([0-9]*)-([0-9]*)

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes but I think the replace() method is clearer and shouldn't make much performance difference

# Default to test for INS-iDEL
def _evaluate_ins_idel(r, records_list, default_sv_type, default_cpx_type):
if r.sink_chrom is not None:
# TODO shell script uses bedtools intersect with -v flag (=>"not") that probably isn't supposed to be used
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yikes

m.cnv_assessment == "DEL" and not has_reciprocal_overlap(
r.sink_chrom, r.sink_start, r.sink_end,
m.chrom, m.start, m.end, 0.95)
for m in records_list
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could just iterate over sinks

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch, I rewrote this

new_sv_type="CPX",
new_cpx_type="INS_iDEL",
new_cpx_intervals=f"DEL_{r.sink_string()}",
new_svlen=None + r.source_size(),
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
new_svlen=None + r.source_size(),
new_svlen=r.sink_size() + r.source_size(),

?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed. Interesting that this case must have never been hit because this should throw an error.

for m in records_list
)
else:
sink_is_del = False
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seems like both sink_is_del and sink_is_not_del could be true, in which case sink_is_del takes precedence... I know this is just matching the behavior of the old script but is that what we want...?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes the logic does seem inconsistent. I'll add a TODO note.

)
if dup_confirmed:
# If dup confirms, resolve as palindromic inverted duplication
new_cpx_type = "piDUP_FR" if default_cpx_type == "INVERSION_SINGLE_ENDER_" else "piDUP_RF"
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just a note that I've never seen a piDUP_FR or piDUP_RF in one of our VCFs, and downstream steps like manual review and annotation do not handle them. Might be worth revisiting to understand better

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added a TODO note

Update docker

Fix errors

Update dockers

Fix sort in wdl

Fix alt alleles

Perfect parity

Update dockers

Remove UNRESOLVED filter status

Delete unused scripts
@mwalker174 mwalker174 merged commit 1982f2c into main Jun 25, 2024
9 checks passed
@mwalker174 mwalker174 deleted the mw_posthoc_cpx branch June 25, 2024 15:05
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants