Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Enhance EvidenceQC outputs to better align with sample QC #728

Open
wants to merge 7 commits into
base: main
Choose a base branch
from

Conversation

kjaisingh
Copy link
Collaborator

@kjaisingh kjaisingh commented Sep 26, 2024

Description

This PR is intended to introduce several enhancements to qc_matrix output as part of the EvidenceQC workflow that enable it to align better with the sample QC process, which is orchestrated by the QC Notebook. These enhancements are currently composed of the following:

  • Set any metrics with values of NaN to be 0.
  • Rename index columns for tables created to sample_id.
  • Add column for sex assignment data.

Testing

  • This Terra job shows an example run of the pipeline for the 1KGP cohort with this change.
  • This TSV file produced by the job above shows contains an example qc_matrix output from the 1KGP cohort.
  • Validated all WDLs with womtool.

Pre-Merge Changes Required

Remove automated Dockstore image sync for development branch.

@kjaisingh kjaisingh added the enhancement New feature or request label Sep 26, 2024
@kjaisingh kjaisingh self-assigned this Sep 26, 2024
@kjaisingh kjaisingh marked this pull request as ready for review September 26, 2024 23:47
@mwalker174
Copy link
Collaborator

What's the reason for changing NaN to 0?

@kjaisingh
Copy link
Collaborator Author

What's the reason for changing NaN to 0?

@mwalker174 Some samples have a value of NaN for certain metrics (e.g. high_overall_outliers) - though in reality, these are simply reflect a value of 0. In the QC notebook, we plot distributions of various metrics, but still want to include samples for which the metric value is 0. To achieve this, we currently convert these NaN values to 0 in the QC notebook itself, but the thought here is that we could convert these to 0 in the EvidenceQC output table to minimize notebook-specific table processing.

Copy link
Collaborator

@epiercehoffman epiercehoffman left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for making these changes! I have a few implementation notes

df_manta_low_outlier, df_melt_low_outlier, df_wham_low_outlier, df_total_low_outliers,
df_melt_insert_size]
for df in dfs:
df[ID_COL] = df[ID_COL].astype(object)
output_df = reduce(lambda left, right: pd.merge(left, right, on=ID_COL, how="outer"), dfs)
output_df = output_df[output_df[ID_COL] != EMPTY_OUTLIERS]
output_df = output_df.replace([None, np.nan], 0.0)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We do want to set samples who are not outliers in any categories to 0 in the outlier columns, rather than NaN, for interpretability. But I think it makes more sense to do that for just the outlier columns, not the entire dataframe. There may not be other metrics which routinely have NaNs at the moment, but if an NaN did show up in another metric (current or future), it would not necessarily make sense to set it to 0

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated accordingly - thanks for clarifying.

filename: A tab-delimited file containing estimated copy numbers.
Returns:
A pandas DataFrame containing the following columns:
[id, chr1_CopyNumber, ..., chr22_CopyNumber, chrX_CopyNumber, chrY_CopyNumber, chrX_CopyNumber_rounded].
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I assume this documentation was a copy/paste error and should be updated to refer to the sex assignment data?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated accordingly - thanks for catching this.

df_manta_low_outlier, df_melt_low_outlier, df_wham_low_outlier, df_total_low_outliers,
df_melt_insert_size]
for df in dfs:
df[ID_COL] = df[ID_COL].astype(object)
output_df = reduce(lambda left, right: pd.merge(left, right, on=ID_COL, how="outer"), dfs)
output_df = output_df[output_df[ID_COL] != EMPTY_OUTLIERS]
output_df = output_df.replace([None, np.nan], 0.0)
output_df.rename(columns={ID_COL: NEW_ID_COL}, inplace=True)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there a reason for making this change after the fact, rather than updating the value of ID_COL?

Side note that it may be worth asking around if people have strong feelings about using # to indicate the header line in this file. I don't think it's necessary for this file, and I think you have the right idea about updating it for concordance with other files and ease of use with pandas, but it is a common convention in the field, which is why the original header had #ID.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, this was done mainly so that upstream files did not have to be changed to the new ID naming standard.

Now that you mention that this is a common convention though, I'm wondering if we can/should just leave it as is?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated all upstream tables to use sample_id - have hence removed this renaming from the script.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants