Enhance EvidenceQC outputs to better align with sample QC #728
What's the reason for changing |
@mwalker174 Some samples have a value of NaN for certain metrics (e.g. |
Thanks for making these changes! I have a few implementation notes
df_manta_low_outlier, df_melt_low_outlier, df_wham_low_outlier, df_total_low_outliers,
df_melt_insert_size]
for df in dfs:
    df[ID_COL] = df[ID_COL].astype(object)
output_df = reduce(lambda left, right: pd.merge(left, right, on=ID_COL, how="outer"), dfs)
output_df = output_df[output_df[ID_COL] != EMPTY_OUTLIERS]
output_df = output_df.replace([None, np.nan], 0.0)
We do want to set samples that are not outliers in any category to 0 in the outlier columns, rather than NaN, for interpretability. But I think it makes more sense to do that for just the outlier columns, not the entire dataframe. There may not be other metrics that routinely have NaNs at the moment, but if a NaN did show up in another metric (current or future), it would not necessarily make sense to set it to 0.
Updated accordingly - thanks for clarifying.
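For reference, the column-scoped fill suggested above can be sketched as follows. This is a minimal illustration, not the code merged in this PR: the `_outlier` suffix, the helper name `fill_outlier_nans`, and the example column names are assumptions made for the example.

```python
import numpy as np
import pandas as pd

# Hypothetical names for illustration; the real qc_matrix columns may differ.
ID_COL = "sample_id"
OUTLIER_SUFFIX = "_outlier"


def fill_outlier_nans(df: pd.DataFrame) -> pd.DataFrame:
    """Replace NaN with 0.0 only in outlier columns; other metrics keep their NaNs."""
    outlier_cols = [c for c in df.columns if c.endswith(OUTLIER_SUFFIX)]
    result = df.copy()
    if outlier_cols:
        result[outlier_cols] = result[outlier_cols].fillna(0.0)
    return result


example = pd.DataFrame({
    ID_COL: ["s1", "s2"],
    "manta_low_outlier": [1.0, np.nan],   # NaN here means "not an outlier"
    "mean_insert_size": [np.nan, 412.0],  # NaN here means "metric missing"
})
filled = fill_outlier_nans(example)
```

Restricting the fill to the outlier columns keeps a missing value in any other metric visible as NaN instead of silently turning it into 0.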
filename: A tab-delimited file containing estimated copy numbers.
Returns:
    A pandas DataFrame containing the following columns:
    [id, chr1_CopyNumber, ..., chr22_CopyNumber, chrX_CopyNumber, chrY_CopyNumber, chrX_CopyNumber_rounded].
I assume this documentation was a copy/paste error and should be updated to refer to the sex assignment data?
Updated accordingly - thanks for catching this.
df_manta_low_outlier, df_melt_low_outlier, df_wham_low_outlier, df_total_low_outliers,
df_melt_insert_size]
for df in dfs:
    df[ID_COL] = df[ID_COL].astype(object)
output_df = reduce(lambda left, right: pd.merge(left, right, on=ID_COL, how="outer"), dfs)
output_df = output_df[output_df[ID_COL] != EMPTY_OUTLIERS]
output_df = output_df.replace([None, np.nan], 0.0)
output_df.rename(columns={ID_COL: NEW_ID_COL}, inplace=True)
Is there a reason for making this change after the fact, rather than updating the value of ID_COL?
Side note that it may be worth asking around if people have strong feelings about using # to indicate the header line in this file. I don't think it's necessary for this file, and I think you have the right idea about updating it for concordance with other files and ease of use with pandas, but it is a common convention in the field, which is why the original header had #ID.
Yes, this was done mainly so that upstream files did not have to be changed to the new ID naming standard.
Now that you mention that this is a common convention though, I'm wondering if we can/should just leave it as is?
Updated all upstream tables to use sample_id, and have hence removed this renaming from the script.
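For anyone who still has tables written with the older `#ID` header, a read-time normalization along these lines may help. It is only a sketch under stated assumptions: the helper `read_sample_table` and the example rows are hypothetical, and the PR itself updates the upstream tables rather than renaming on read.

```python
import io

import pandas as pd

LEGACY_ID_COL = "#ID"   # header convention discussed in the review
ID_COL = "sample_id"    # column name adopted in this PR


def read_sample_table(handle) -> pd.DataFrame:
    """Read a tab-delimited table and normalize a legacy '#ID' header to 'sample_id'."""
    # read_csv does not treat '#' as a comment character unless `comment` is set,
    # so a '#ID' header is read as an ordinary column name.
    df = pd.read_csv(handle, sep="\t")
    if LEGACY_ID_COL in df.columns:
        df = df.rename(columns={LEGACY_ID_COL: ID_COL})
    return df


# Hypothetical example rows, not real cohort data.
legacy = read_sample_table(io.StringIO("#ID\tchrX_CopyNumber\nsampleA\t2.01\n"))
current = read_sample_table(io.StringIO("sample_id\tchrX_CopyNumber\nsampleA\t2.01\n"))
```

Both calls yield the same column names, so downstream merges on `sample_id` work regardless of which header style the input file used.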
Description
This PR is intended to introduce several enhancements to qc_matrix output as part of the EvidenceQC workflow that enable it to align better with the sample QC process, which is orchestrated by the QC Notebook. These enhancements are currently composed of the following:
- NaN to be 0.
- sample_id.

Testing
- qc_matrix output from the 1KGP cohort.

Pre-Merge Changes Required
- Remove automated Dockstore image sync for development branch.