Enhance EvidenceQC outputs to better align with sample QC #728

kjaisingh · 2024-09-26T17:17:23Z

Description

This PR introduces several enhancements to the qc_matrix output of the EvidenceQC workflow so that it aligns better with the sample QC process, which is orchestrated by the QC Notebook. The enhancements are:

Set any metrics with values of NaN to be 0.
Rename index columns for tables created to sample_id.
Add column for sex assignment data.

Testing

This Terra job shows an example run of the pipeline for the 1KGP cohort with this change.
This TSV file, produced by the job above, contains an example qc_matrix output from the 1KGP cohort.
Validated all WDLs with womtool.

Pre-Merge Changes Required

Remove automated Dockstore image sync for development branch.

mwalker174 · 2024-09-30T14:06:53Z

What's the reason for changing NaN to 0?

kjaisingh · 2024-09-30T14:57:49Z

What's the reason for changing NaN to 0?

@mwalker174 Some samples have a value of NaN for certain metrics (e.g. high_overall_outliers), though in reality these simply reflect a value of 0. In the QC notebook, we plot distributions of various metrics and still want to include samples for which the metric value is 0. To achieve this, we currently convert these NaN values to 0 in the QC notebook itself, but the thought here is that we could convert them to 0 in the EvidenceQC output table instead, to minimize notebook-specific table processing.
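For readers following along, here is a minimal sketch of how such NaN values can arise from the outer merge used in make_evidence_qc_table.py. The sample IDs, the median_coverage column, and all values are hypothetical and chosen only for illustration; the real script merges many more tables.

```python
from functools import reduce

import pandas as pd

ID_COL = "sample_id"

# Hypothetical inputs: every sample has a coverage metric, but only samples
# flagged as outliers appear in the outlier table.
df_coverage = pd.DataFrame(
    {ID_COL: ["s1", "s2", "s3"], "median_coverage": [30.1, 29.8, 31.2]}
)
df_outliers = pd.DataFrame({ID_COL: ["s2"], "high_overall_outliers": [1]})

# Same merge pattern as in the script: an outer join on the ID column.
merged = reduce(
    lambda left, right: pd.merge(left, right, on=ID_COL, how="outer"),
    [df_coverage, df_outliers],
)

# Samples missing from the outlier table get NaN rather than 0.
print(merged)
```

In this sketch, s1 and s3 end up with NaN in high_overall_outliers even though the intended meaning is "zero outlier calls".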

epiercehoffman

Thanks for making these changes! I have a few implementation notes.

epiercehoffman · 2024-09-30T21:01:45Z

src/sv-pipeline/scripts/make_evidence_qc_table.py

           df_manta_low_outlier, df_melt_low_outlier, df_wham_low_outlier, df_total_low_outliers,
           df_melt_insert_size]
    for df in dfs:
        df[ID_COL] = df[ID_COL].astype(object)
    output_df = reduce(lambda left, right: pd.merge(left, right, on=ID_COL, how="outer"), dfs)
    output_df = output_df[output_df[ID_COL] != EMPTY_OUTLIERS]
+    output_df = output_df.replace([None, np.nan], 0.0)


We do want to set samples that are not outliers in any category to 0 in the outlier columns, rather than NaN, for interpretability. But I think it makes more sense to do that for just the outlier columns, not the entire dataframe. There may not be other metrics that routinely have NaNs at the moment, but if a NaN did show up in another metric (current or future), it would not necessarily make sense to set it to 0.

Updated accordingly - thanks for clarifying.
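One way the suggestion above could look is sketched below. This is not the code that was committed; it assumes the outlier columns can be identified by the substring "outlier" in their names, and the column names and values are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical merged table: one ordinary metric with a genuinely missing
# value, plus two outlier columns where NaN means "no outlier calls".
output_df = pd.DataFrame(
    {
        "sample_id": ["s1", "s2", "s3"],
        "median_coverage": [30.1, np.nan, 31.2],
        "manta_high_outlier": [np.nan, 1.0, np.nan],
        "overall_low_outlier": [np.nan, np.nan, 2.0],
    }
)

# Assumption: outlier columns are recognizable by name.
outlier_cols = [col for col in output_df.columns if "outlier" in col]

# Fill NaN with 0 only in the outlier columns; other metrics keep their NaN.
output_df[outlier_cols] = output_df[outlier_cols].fillna(0)

print(output_df)
```

Restricting the fill this way keeps a missing median_coverage visible as missing, rather than silently turning it into a coverage of 0.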

epiercehoffman · 2024-09-30T21:02:16Z

src/sv-pipeline/scripts/make_evidence_qc_table.py

+        filename: A tab-delimited file containing estimated copy numbers.
+    Returns:
+        A pandas DataFrame containing the following columns:
+        [id, chr1_CopyNumber, ..., chr22_CopyNumber, chrX_CopyNumber, chrY_CopyNumber, chrX_CopyNumber_rounded].


I assume this documentation was a copy/paste error and should be updated to refer to the sex assignment data?

Updated accordingly - thanks for catching this.

epiercehoffman · 2024-09-30T21:04:46Z

src/sv-pipeline/scripts/make_evidence_qc_table.py

           df_manta_low_outlier, df_melt_low_outlier, df_wham_low_outlier, df_total_low_outliers,
           df_melt_insert_size]
    for df in dfs:
        df[ID_COL] = df[ID_COL].astype(object)
    output_df = reduce(lambda left, right: pd.merge(left, right, on=ID_COL, how="outer"), dfs)
    output_df = output_df[output_df[ID_COL] != EMPTY_OUTLIERS]
+    output_df = output_df.replace([None, np.nan], 0.0)
+    output_df.rename(columns={ID_COL: NEW_ID_COL}, inplace=True)


Is there a reason for making this change after the fact, rather than updating the value of ID_COL?

Side note that it may be worth asking around if people have strong feelings about using # to indicate the header line in this file. I don't think it's necessary for this file, and I think you have the right idea about updating it for concordance with other files and ease of use with pandas, but it is a common convention in the field, which is why the original header had #ID.

Yes, this was done mainly so that upstream files did not have to be changed to the new ID naming standard.

Now that you mention that this is a common convention though, I'm wondering if we can/should just leave it as is?

Updated all upstream tables to use sample_id - have hence removed this renaming from the script.
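To illustrate the trade-off discussed in this thread, the sketch below reads two small in-memory tables with pandas: one with the legacy #ID header and one with sample_id. The table contents and the median_coverage column are hypothetical, and the rename at the end is shown only as the kind of after-the-fact step that updating the upstream tables makes unnecessary.

```python
import io

import pandas as pd

# Hypothetical tab-delimited inputs with the two header conventions.
legacy_tsv = "#ID\tmedian_coverage\ns1\t30.1\ns2\t29.8\n"
updated_tsv = "sample_id\tmedian_coverage\ns1\t30.1\ns2\t29.8\n"

df_legacy = pd.read_csv(io.StringIO(legacy_tsv), sep="\t")
df_updated = pd.read_csv(io.StringIO(updated_tsv), sep="\t")

# With default settings, pandas keeps the leading "#" as part of the name.
print(list(df_legacy.columns))
print(list(df_updated.columns))

# After-the-fact rename, needed only while upstream tables still use #ID.
df_legacy = df_legacy.rename(columns={"#ID": "sample_id"})
```

With every upstream table already using sample_id, a single ID_COL value works for all merges and no rename is needed at the end.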

Initial commit

5040823

kjaisingh added the enhancement New feature or request label Sep 26, 2024

kjaisingh self-assigned this Sep 26, 2024

kjaisingh added 2 commits September 26, 2024 13:36

Cleared python syntax issues

e819249

Rename output column

874e0c4

kjaisingh requested review from mwalker174 and epiercehoffman September 26, 2024 23:47

kjaisingh marked this pull request as ready for review September 26, 2024 23:47

epiercehoffman requested changes Sep 30, 2024


kjaisingh added 3 commits September 30, 2024 17:29

Minor edits based on PR feedback

6743cd5

Replaced all references of #ID with sample_id

a7a0855

Further changes to rename columns to sample_id

7b7afd8

kjaisingh requested a review from epiercehoffman October 3, 2024 21:36

Removed preprocessing renaming

b7e9ed4
