Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

GVS walker to master [VS-964] #8355

Closed
wants to merge 11 commits into from
Closed

Conversation

mcovarr
Copy link
Collaborator

@mcovarr mcovarr commented Jun 8, 2023

Ready for review!

@mcovarr mcovarr force-pushed the vs_964_gvs_walker_to_master branch from 6678efe to 87a2a12 Compare June 28, 2023 20:07
@gatk-bot
Copy link

Github actions tests reported job failures from actions build 5405388563
Failures in the following jobs:

Test Type JDK Job ID Logs
conda 17.0.6+10 5405388563.3 logs

@mcovarr mcovarr force-pushed the vs_964_gvs_walker_to_master branch from 87a2a12 to ec0ccd8 Compare June 29, 2023 17:37
@mcovarr mcovarr force-pushed the vs_964_gvs_walker_to_master branch from ec0ccd8 to fa74fc7 Compare July 5, 2023 20:51
@mcovarr mcovarr changed the title [DRAFT] GVS walker to master [VS-964] GVS walker to master [VS-964] Jul 6, 2023
@mcovarr mcovarr marked this pull request as ready for review July 6, 2023 14:29
@droazen droazen requested a review from KevinCLydon July 7, 2023 15:50
Copy link
Member

@KevinCLydon KevinCLydon left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks just about good to me. I left a couple comments, mostly questions and some very minor changes. Beyond those, I have another question and a couple more suggestions:

  1. Codecov claims a lot of this code doesn't have tests. Am I correct to assume that is because it mostly relies on BigQuery and is therefore prohibitively difficult to write tests for?
  2. The code is mostly stylistically good, but there are a lot of places where variables, especially method parameters, that should be final are not, so it would be good to do a quick pass and update those if possible.
  3. Most of this code could benefit from more documentation. Some of it is very well documented, but some of the files don't seem to have much at all. It would be good to at least have doc comments for the public classes and any public attributes and methods that are not immediately self-explanatory by their names.

Thank you for doing the work to get this merged into master.

@Argument(
fullName = "dataset-id",
doc = "ID of the Google Cloud dataset to use when executing queries",
optional = true // I guess, but won't it break otherwise or require that a dataset be created with the name temp_tables?
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should this be optional or required?

Comment on lines +41 to +47
@Argument(
fullName = "sample-file",
doc = "Alternative to `sample-table`. Pass in a (sample_id,sample_name) CSV that describes the full list of samples to extract. No header",
optional = true,
mutex={"sample-table"}

)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
@Argument(
fullName = "sample-file",
doc = "Alternative to `sample-table`. Pass in a (sample_id,sample_name) CSV that describes the full list of samples to extract. No header",
optional = true,
mutex={"sample-table"}
)
@Argument(
fullName = "sample-file",
doc = "Alternative to `sample-table`. Pass in a (sample_id,sample_name) CSV that describes the full list of samples to extract. No header",
optional = true,
mutex={"sample-table"}
)

Comment on lines +50 to +53
@Argument(
fullName = "print-debug-information",
doc = "If true, print extra debugging output",
optional = true)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
@Argument(
fullName = "print-debug-information",
doc = "If true, print extra debugging output",
optional = true)
@Argument(
fullName = "print-debug-information",
doc = "If true, print extra debugging output",
optional = true
)

Comment on lines +124 to +125
//TODO verify what we really need here
annotationEngine = new VariantAnnotatorEngine(makeVariantAnnotations(), null, Collections.emptyList(), false, false);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What do we really need here?

Comment on lines +102 to +105
// We want to find the tranche with the smallest target_truth_sensitivity that is
// equal to or greater than our truthSensitivityThreshold.
// e.g. if truthSensitivitySNPThreshold is 99.8 and we have tranches with target_truth_sensitivities
// of 99.5, 99.7, 99.9, and 100.0, we want the 99.9 sensitivity tranche.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would it make sense to move this to a javadoc comment for this method?

Comment on lines +74 to +86
@Argument(
fullName = "vet-avro-file-name",
doc = "Path to data from Vet table in Avro format",
optional = true
)
private GATKPath vetAvroFileName = null;

@Argument(
fullName = "ref-ranges-avro-file-name",
doc = "Path to data from Vet table in Avro format",
optional = true
)
private GATKPath refRangesAvroFileName = null;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These have the same description so is one of them inaccurate?

Comment on lines +244 to +247
/**
* Enforce that if cost information is being recorded to the cost-observability-tablename then *all* recorded
* parameters are set
*/
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This comment seems to not fully describe what this method does.

Comment on lines +372 to +378
// if there is an avro file, the BQ specific parameters are unnecessary,
// but they all are required if there is no avro file
// KCIBUL: revisit!!!
// if ((cohortAvroFileName == null && vetAvroFileName == null && refRangesAvroFileName == null) && (projectID == null || (cohortTable == null && vetRangesFQDataSet == null))) {
// throw new UserException("Project id (--project-id) and cohort table (--cohort-extract-table) are required " +
// "if no avro file (--cohort-avro-file-name or --vet-avro-file-name and --ref-ranges-avro-file-name) is provided.");
// }
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should this be removed?

Comment on lines +29 to +41
// COHORT_FIELDS
// public static final ImmutableSet<String> COHORT_FIELDS = ImmutableSet.of(
// SchemaUtils.LOCATION_FIELD_NAME,
// SchemaUtils.SAMPLE_ID_FIELD_NAME,
// SchemaUtils.STATE_FIELD_NAME,
// SchemaUtils.REF_ALLELE_FIELD_NAME,
// SchemaUtils.ALT_ALLELE_FIELD_NAME,
// SchemaUtils.CALL_GT,
// SchemaUtils.CALL_GQ,
// SchemaUtils.CALL_RGQ,
// SchemaUtils.QUALapprox,
// SchemaUtils.AS_QUALapprox,
// SchemaUtils.CALL_PL);//, AS_VarDP);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does this need to be here?

fullName = StandardArgumentDefinitions.OUTPUT_LONG_NAME,
doc = "Output VCF file to which annotated variants should be written."
)
protected String outputVcfPathString = null;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The preferred type for file arguments in GATK is GATKPath, so can this be switched to that?

Copy link

@koncheto-broad koncheto-broad left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. I am not overwhelmingly happy with needing to drop this many files for our extract process into GATK in order to make this happen, but for now we're going to move these files over and we can consider separating them back out again later when we move to another repo and have things nice and cleanly carved up. This little island of GVS in core GATK should likely be moved back home then

@KevinCLydon
Copy link
Member

LGTM. I am not overwhelmingly happy with needing to drop this many files for our extract process into GATK in order to make this happen, but for now we're going to move these files over and we can consider separating them back out again later when we move to another repo and have things nice and cleanly carved up. This little island of GVS in core GATK should likely be moved back home then

@mcovarr @koncheto-broad Do all these files exist as they are here in the ah_var_store branch currently?

@mcovarr
Copy link
Collaborator Author

mcovarr commented Jul 13, 2023

@mcovarr @koncheto-broad Do all these files exist as they are here in the ah_var_store branch currently?

Yes, delta the refactoring required to separate the VCF writing into a subclass.
Yes!

@KevinCLydon
Copy link
Member

@mcovarr @koncheto-broad Do all these files exist as they are here in the ah_var_store branch currently?

Yes, delta the refactoring required to separate the VCF writing into a subclass.

Would it be easier/preferable for you all to move those changes into the var store branch and I'll work on the PGEN stuff there instead of in master?

@mcovarr
Copy link
Collaborator Author

mcovarr commented Jul 13, 2023

Hi @KevinCLydon, yes actually all of this is on ah_var_store already per an earlier PR. Sorry my earlier comment was incorrect, this was > 1 month ago and I forgot 😅

@droazen
Copy link
Collaborator

droazen commented Jul 13, 2023

@mcovarr @koncheto-broad Had a talk today with @KevinCLydon about this PR, and he made a convincing case that the pgen project is more likely to require additional stuff from the GVS branch than from gatk/master. To avoid a nightmare scenario of constantly having to request additional transplants from the GVS branch, I'm on board with the idea of having Kevin try to work off of the GVS branch for the pgen project for now, provided that we can get timely rebases of ah_var_store onto master if we need them.

@mcovarr
Copy link
Collaborator Author

mcovarr commented Jul 14, 2023

Thanks @droazen, I'll go ahead and close this PR. The equivalent GVS walker code is already on ah_var_store, but the syncing up of ah_var_store to master is still on the TODO list.

@mcovarr mcovarr closed this Jul 14, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants