-
Notifications
You must be signed in to change notification settings - Fork 589
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Adding new argument to control variant output interval filtering. #6388
Conversation
2d46c40
to
7274dcd
Compare
7274dcd
to
20c049d
Compare
@droazen @samuelklee I rebased this. It's ready for review I think. I removed an existing argument in this pr, if we want I can deprecated it instead, it will probably take some extra logic to account for both arguments existing though. |
@droazen any chance we could get this merged? There's been interest again from our MalariaGEN collaborators at Sanger. Happy to rebase and address any comments you might have, if @lbergelson is too busy---just let me know. |
BTW, would be great if we could get this in by the next release. |
@samuelklee Thanks for resurrecting this old PR -- I'll take a look at the state of the branch tomorrow and see whether @lbergelson and I can get it over the finish line. |
@droazen @lbergelson can we get this in soon? We need it for the upcoming MalariaGEN Pf8 release. Thanks! |
6a95b50
to
ceea713
Compare
@droazen @lbergelson could I get an ETA on this? Louis chatted with me in early December and gave me an update, just wanted to make sure everything is on track---or if it's not going to get in in time for Pf8, to let our collaborators know. @kbergin @vruano might also want to keep an eye on this. |
Just chatted with @lbergelson and we’re going to try to get this in over the next week or so, so that @vruano can slot it in between rounds of evaluations that he’s running. Sounded like some tests might need to be added, but I’ll try to do an initial review today—wouldn’t mind if anyone else wants to add comments, either! |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks, @lbergelson! Just a quick, relatively superficial skim. Happy to take a second look after you add the tests you mentioned, but might be better to have at least one more set of eyes from the engine team. I also don't have strong opinions about the breaking changes.
} | ||
|
||
@DataProvider | ||
public Object[][] getIntervalsAndOverlapMode(){ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe better to check the actual intervals retained, rather than just the total number?
*/ | ||
STARTS_IN{ | ||
@Override | ||
boolean test(OverlapDetector<? extends Locatable> detector, final VariantContext query) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
final OverlapDetector...
}, | ||
|
||
/** | ||
* Always matches, may be used to not perform any filtering, alternatively a |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Incomplete comment?
* @param intervals the intervals to compare against, note that these are not merged so if they should be merged than the input list should be preprocessed | ||
* @param mode the matching mode to use | ||
*/ | ||
public IntervalFilteringVcfWriter(final VariantContextWriter writer, List<SimpleInterval> intervals, Mode mode) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
finals
|| variant.getAttributeAsInt(VCFConstants.DEPTH_KEY,0) == 0 | ||
|| (onlyOutputCallsStartingInIntervals && !intervals.stream().anyMatch(interval -> interval.contains(variantStart)))) { | ||
|| variant.getAttributeAsInt(VCFConstants.DEPTH_KEY,0) == 0 ) | ||
// todo this changes is a slight de-optimization since we will now process some sites whihc were previously ignored |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
whihc -> which. Not sure if you think this TODO will be worth addressing, but thanks for making a note for now.
@@ -45,6 +49,11 @@ | |||
import org.broadinstitute.hellbender.utils.reference.ReferenceUtils; | |||
import org.broadinstitute.hellbender.utils.variant.GATKVariantContextUtils; | |||
import org.broadinstitute.hellbender.utils.variant.writers.ShardingVCFWriter; | |||
import org.broadinstitute.hellbender.utils.variant.writers.IntervalFilteringVcfWriter; | |||
|
|||
//TODO: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not sure what this entails, but don't forget it!
…ariants are filtered.
ceea713
to
03cb824
Compare
03cb824
to
578a951
Compare
I've responded to the ancient comments, rebased, and added some test improvements. |
@@ -46,7 +46,7 @@ private StandardArgumentDefinitions(){} | |||
public static final String INVALIDATE_PREVIOUS_FILTERS_LONG_NAME = "invalidate-previous-filters"; | |||
public static final String SORT_ORDER_LONG_NAME = "sort-order"; | |||
public static final String FLOW_ORDER_FOR_ANNOTATIONS = "flow-order-for-annotations"; | |||
|
|||
public static final String VARIANT_OUTPUT_INTERVAL_FILTERING_MODE_LONG_NAME = "variant-output-interval-filtering-mode"; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This name is a mess, does anyone have suggestions?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Perhaps just "variant-interval-filtering"?
Github actions tests reported job failures from actions build 10457568370
|
Github actions tests reported job failures from actions build 10457619605
|
Github actions tests reported job failures from actions build 10458689373
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For the most part this looks good and is nice and extensible to all/most cases. I think we need to hunt down cases where the VCFWriter is being constructed outside of this codepath and try to patch them to include this if possible. Beyond that my primary gripe is that there need to be more explicit tests of the bizzaro edge cases that must exist with this code
@@ -46,7 +46,7 @@ private StandardArgumentDefinitions(){} | |||
public static final String INVALIDATE_PREVIOUS_FILTERS_LONG_NAME = "invalidate-previous-filters"; | |||
public static final String SORT_ORDER_LONG_NAME = "sort-order"; | |||
public static final String FLOW_ORDER_FOR_ANNOTATIONS = "flow-order-for-annotations"; | |||
|
|||
public static final String VARIANT_OUTPUT_INTERVAL_FILTERING_MODE_LONG_NAME = "variant-output-interval-filtering-mode"; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Perhaps just "variant-interval-filtering"?
...ain/java/org/broadinstitute/hellbender/utils/variant/writers/IntervalFilteringVcfWriter.java
Outdated
Show resolved
Hide resolved
.../org/broadinstitute/hellbender/utils/variant/writers/IntervalFilteringVcfWriterUnitTest.java
Show resolved
Hide resolved
* @return Default interval filtering mode for variant output. Subclasses may override this to set a different default. | ||
*/ | ||
public IntervalFilteringVcfWriter.Mode getDefaultVariantOutputFilterMode(){ | ||
return null; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
SHould this default to "ANYWHERE" instead?
validateSequenceDictionaries(); | ||
} | ||
|
||
checkToolRequirements(); | ||
|
||
if (outputVariantIntervalFilteringMode != null && userIntervals == null){ | ||
throw new CommandLineException.MissingArgument("-L or -XL", "Intervals are required if --" + StandardArgumentDefinitions.VARIANT_OUTPUT_INTERVAL_FILTERING_MODE_LONG_NAME + " was specified."); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If a tool overrides the default here its going to spit this message if the user doesn't provide intervals at all which i think is confusing about this message.... We should probably disambiguate the default and user supplied here...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a good point. I'm not sure it's addressed by your changes though?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Well pulling this down into variant walker base makes this a little more obnoxious to fix... I'm just going to add a comment clarifying but this is a head scratcher
...ain/java/org/broadinstitute/hellbender/utils/variant/writers/IntervalFilteringVcfWriter.java
Outdated
Show resolved
Hide resolved
...ain/java/org/broadinstitute/hellbender/utils/variant/writers/IntervalFilteringVcfWriter.java
Show resolved
Hide resolved
...ain/java/org/broadinstitute/hellbender/utils/variant/writers/IntervalFilteringVcfWriter.java
Outdated
Show resolved
Hide resolved
ENDS_IN("ends within any of the given intervals"){ | ||
@Override | ||
boolean test(final OverlapDetector<? extends Locatable> detector, final VariantContext query) { | ||
final SimpleInterval endPosition = new SimpleInterval(query.getContig(), query.getEnd(), query.getEnd()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is the simple interval conversion necessary here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it is because we want to be sure we're only checking the end position for overlap.
doc = "Restrict the output variants to ones that match the specified intervals according to the specified matching mode.", | ||
optional = true) | ||
@Advanced | ||
public IntervalFilteringVcfWriter.Mode outputVariantIntervalFilteringMode = getDefaultVariantOutputFilterMode(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could this actually be moved into the VariantWalker codepaths so its not getting applied to EVERY GATKTool even those that are going to cause confusion if its there?
Github actions tests reported job failures from actions build 10908982699
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@jamesemery
Adding the filter count at the end is definitely a big improvement. I'm not clear on adding a different method at the GATK level and at the VariantWalker level. It seems like just an extra layer of indirection. Also, tests are failing because I believe it needs to be at the VariantWalkerBase level instead.
I like moving the argument down to that level though. Maybe we don't need a way of specifying a default since no one actually wants to, and we just make it so we override the GATKTool method in order to pass back up the argument value. That also removes the problem of "tool specified a filter pattern but no intervals so we output an error message.
I found a few typos of course.
...ain/java/org/broadinstitute/hellbender/utils/variant/writers/IntervalFilteringVcfWriter.java
Outdated
Show resolved
Hide resolved
validateSequenceDictionaries(); | ||
} | ||
|
||
checkToolRequirements(); | ||
|
||
if (outputVariantIntervalFilteringMode != null && userIntervals == null){ | ||
throw new CommandLineException.MissingArgument("-L or -XL", "Intervals are required if --" + StandardArgumentDefinitions.VARIANT_OUTPUT_INTERVAL_FILTERING_MODE_LONG_NAME + " was specified."); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a good point. I'm not sure it's addressed by your changes though?
} | ||
|
||
@Override | ||
public void close() { | ||
writer.close(); | ||
underlyingWriter.close(); | ||
if (filteredCount > 0) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would probably move the output ahead of the close() since if there's an exception in the tool there's often an exception in the closing it but you might still want that information to debub.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fair enough
src/test/java/org/broadinstitute/hellbender/engine/GatkToolIntegrationTest.java
Outdated
Show resolved
Hide resolved
{Arrays.asList(chr1Interval, chr2Interval), new ArrayList<>(), IntervalFilteringVcfWriter.Mode.CONTAINED, List.of(VariantEmitter.INT3, VariantEmitter.INT7)}, | ||
|
||
// Tests specifically aimed at documenting how the --interval-merging-rule argument works in conjunction with the interval filtering | ||
{Arrays.asList(chr1IntervalLeft, chr1IntervalRight), new ArrayList<>(), IntervalFilteringVcfWriter.Mode.CONTAINED, List.of(VariantEmitter.INT2)}, // Default is to merge all |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You might want to use INTERVAL_MERGING_RULE_LONG_NAME
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done but this is about as close to as friviolous as this gets since that argument is bedrock that won't change and it makes this block more unreadable \
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It is, but I really like to be able to find all the uses of an argument in tests by clicking on find uses.
@@ -949,11 +942,11 @@ public VariantContextWriter createVCFWriter(final Path outPath) { | |||
options.toArray(new Options[0])); | |||
} | |||
|
|||
return outputVariantIntervalFilteringMode== null ? | |||
return getVariantFilteringOutputModeIfApplicable() == IntervalFilteringVcfWriter.Mode.ANYWHERE ? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not quite sure this rename helps that much, it just adds another layer of indirection. It's then kind of confusing because tools below variantwalker might not be sure which to override.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
changed to getVariantOutputFilteringMode
I don't really know what IS better here
src/main/java/org/broadinstitute/hellbender/engine/VariantWalker.java
Outdated
Show resolved
Hide resolved
ENDS_IN("ends within any of the given intervals"){ | ||
@Override | ||
boolean test(final OverlapDetector<? extends Locatable> detector, final VariantContext query) { | ||
final SimpleInterval endPosition = new SimpleInterval(query.getContig(), query.getEnd(), query.getEnd()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it is because we want to be sure we're only checking the end position for overlap.
@@ -31,6 +33,28 @@ public abstract class VariantWalker extends VariantWalkerBase { | |||
private FeatureDataSource<VariantContext> drivingVariants; | |||
private FeatureInput<VariantContext> drivingVariantsFeatureInput; | |||
|
|||
@Argument(fullName = StandardArgumentDefinitions.VARIANT_OUTPUT_INTERVAL_FILTERING_MODE_LONG_NAME, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm pretty sure all these changes need to go in VariantWalkerBase, not VariantWalker. I think that's why all the tests exploded.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
responses
@@ -949,11 +942,11 @@ public VariantContextWriter createVCFWriter(final Path outPath) { | |||
options.toArray(new Options[0])); | |||
} | |||
|
|||
return outputVariantIntervalFilteringMode== null ? | |||
return getVariantFilteringOutputModeIfApplicable() == IntervalFilteringVcfWriter.Mode.ANYWHERE ? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
changed to getVariantOutputFilteringMode
I don't really know what IS better here
src/main/java/org/broadinstitute/hellbender/engine/VariantWalker.java
Outdated
Show resolved
Hide resolved
{Arrays.asList(chr1Interval, chr2Interval), new ArrayList<>(), IntervalFilteringVcfWriter.Mode.CONTAINED, List.of(VariantEmitter.INT3, VariantEmitter.INT7)}, | ||
|
||
// Tests specifically aimed at documenting how the --interval-merging-rule argument works in conjunction with the interval filtering | ||
{Arrays.asList(chr1IntervalLeft, chr1IntervalRight), new ArrayList<>(), IntervalFilteringVcfWriter.Mode.CONTAINED, List.of(VariantEmitter.INT2)}, // Default is to merge all |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done but this is about as close to as friviolous as this gets since that argument is bedrock that won't change and it makes this block more unreadable \
} | ||
|
||
@Override | ||
public void close() { | ||
writer.close(); | ||
underlyingWriter.close(); | ||
if (filteredCount > 0) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fair enough
validateSequenceDictionaries(); | ||
} | ||
|
||
checkToolRequirements(); | ||
|
||
if (outputVariantIntervalFilteringMode != null && userIntervals == null){ | ||
throw new CommandLineException.MissingArgument("-L or -XL", "Intervals are required if --" + StandardArgumentDefinitions.VARIANT_OUTPUT_INTERVAL_FILTERING_MODE_LONG_NAME + " was specified."); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Well pulling this down into variant walker base makes this a little more obnoxious to fix... I'm just going to add a comment clarifying but this is a head scratcher
Github actions tests reported job failures from actions build 10910519980
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good to me after our discussion.
/** | ||
* @return Default interval filtering mode for variant output. Subclasses may override this to set a different default. | ||
*/ | ||
public IntervalFilteringVcfWriter.Mode getDefaultVariantOutputFilterMode(){ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is unused now. Should we delete it?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Review done with @lbergelson
--variant-output-interval-filtering-mode
which allows filtering output variants according to the input interval list. This replaces--only-output-calls-starting-in-intervals
which was available in GenotypeGvcfs and GnarlyGenotyper.It works by adding a filtering decorator to the vcf writers created through
GATKTool.createVCFWriter
.There are several different filtering modes:
STARTS_IN
,ENDS_IN
,OVERLAPS
,CONTAINED
, andANYWHERE
The default for tools is not to apply the decorator, but they may optionally change that behavior by overriding the new
getDefaultVariantOutputFilterMode
.--variant-output-interval-filtering-mode STARTS_IN
is equivalent to the previous behavior of--only-output-calls-starting-in-intervals true
MockVcfWriter is now a testUtils class. The naming is a bit awkward so improvements would be helpful.
This doesn't fix the weird behavior in HaplotypeCaller but does allow subsetting unique shards with SelectVariants and other variant outputting tools. We could adapt this to apply to bam outputs as well if that seems useful.