Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

AH - implement changes for mitochondrial pipeline #6399

Merged
merged 85 commits into from
Apr 22, 2020
Merged

Conversation

ahaessly
Copy link
Contributor

@ahaessly ahaessly commented Jan 21, 2020

This PR converts the Mutect2Filtering engine to be allele specific. This required changes to SomaticClusteringModel and ThresholdCalculator as well as ErrorProbabilities and of course the filters themselves. There are some filters which have not yet been converted, but I am prioritizing the ones in this PR for Sarah Calvo and the mitochondria pipeline. This provides the implementation for dsp-spec-ops tickets 166, 168, 169

@gatk-bot
Copy link

gatk-bot commented Jan 21, 2020

Travis reported job failures from build 28724
Failures in the following jobs:

Test Type JDK Job ID Logs
unit openjdk11 28724.13 logs
unit openjdk8 28724.3 logs

@gatk-bot
Copy link

gatk-bot commented Jan 21, 2020

Travis reported job failures from build 28733
Failures in the following jobs:

Test Type JDK Job ID Logs
unit openjdk11 28733.13 logs
unit openjdk8 28733.3 logs

@gatk-bot
Copy link

gatk-bot commented Jan 22, 2020

Travis reported job failures from build 28741
Failures in the following jobs:

Test Type JDK Job ID Logs
unit openjdk11 28741.13 logs
unit openjdk8 28741.3 logs

@gatk-bot
Copy link

gatk-bot commented Jan 22, 2020

Travis reported job failures from build 28743
Failures in the following jobs:

Test Type JDK Job ID Logs
unit openjdk11 28743.13 logs
unit openjdk8 28743.3 logs

@gatk-bot
Copy link

gatk-bot commented Jan 22, 2020

Travis reported job failures from build 28760
Failures in the following jobs:

Test Type JDK Job ID Logs
unit openjdk11 28760.13 logs
unit openjdk8 28760.3 logs

Copy link
Contributor

@davidbenjamin davidbenjamin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Finished reviewing first half of code.

@@ -20,11 +21,11 @@ public DuplicatedAltReadFilter(final int uniqueAltReadCount) {
public ErrorType errorType() { return ErrorType.ARTIFACT; }

@Override
public boolean isArtifact(final VariantContext vc, final Mutect2FilteringEngine filteringEngine) {
return vc.getAttributeAsInt(UniqueAltReadCount.KEY, 1) <= uniqueAltReadCount;
public List<Boolean> areAllelesArtifacts(final VariantContext vc, final Mutect2FilteringEngine filteringEngine, ReferenceContext referenceContext) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This always returns a singleton list, even if there is more than one alt allele. . .

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

good catch. So, UniqueAltReadCount is actually assuming a single alt allele. I guess I should change that class to write an allele specific annotation that can be used in an allele specific way in the filter.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes.

trueProbability *= (1 - errorProb);
// if vc has symbolic allele, remove it
if (vc.hasSymbolicAlleles()) {
// can we assume it's the last allele?
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The VCF spec doesn't require it, so let's be cautious and not assume.

Copy link
Contributor

@davidbenjamin davidbenjamin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Back to you, @ahaessly. It's fundamentally sound.

args -> {
intervals.stream().map(SimpleInterval::new).forEach(args::addInterval);
return args;
});

// add tests for DUPLICATE
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What's the status of this comment?

@davidbenjamin davidbenjamin self-assigned this Jan 23, 2020
@ahaessly ahaessly force-pushed the ah_SO166_morefilters branch from ecdbbe7 to 1946785 Compare January 24, 2020 15:12
@gatk-bot
Copy link

gatk-bot commented Jan 24, 2020

Travis reported job failures from build 28795
Failures in the following jobs:

Test Type JDK Job ID Logs
variantcalling openjdk8 28795.4 logs
integration oraclejdk8 28795.11 logs
unit openjdk8 28795.3 logs
integration openjdk8 28795.2 logs

@gatk-bot
Copy link

Travis reported job failures from build 28802
Failures in the following jobs:

Test Type JDK Job ID Logs
unit openjdk8 28802.3 logs

Copy link
Contributor

@ldgauthier ldgauthier left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I reviewed the annotation aspects and left the Mutect2-specific parts to David. I pushed a commit with some changes I should have made a long time ago and a fix for the AFs in the multi-allelic splitting. Let me know what you think. (There are probably a lot more methods we could move from RefConfMerger to AlleleSubsettingUtils, but I just moved the ones I used here.)

@@ -0,0 +1,44 @@
package org.broadinstitute.hellbender.tools.walkers.annotator;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm on the fence about putting this in the allelespecific subpackage. On the one hand, it is allele-specific, but on the other hand, all those classes have interfaces and/or parent classes that this one doesn't.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hmm, now that I look at that class again, I think I'll add the AlleleSpecificAnnotation interface to it. but, yeah, I wasn't really sure which package it belonged in.


public static Map<String, Object> computeSBAnnotation(VariantContext vc, AlleleLikelihoods<GATKRead, Allele> likelihoods, String key) {
// calculate the annotation from the likelihoods
// likelihoods can come from HaplotypeCaller call to VariantAnnotatorEngine
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Well, sometimes likelihoods come from Mutect2, right?

return vcb.make();
}

/**
* Creates a comma separated string of all the filters that apply to the allele.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does it really create a comma separated string? Also, it looks like the only duplicate filter you're checking for is PASS, so I would call the method "removeExtraPassFilters" or something less general than what's there now.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks - didn't update the doc after a refactor.
the call to distinct removes any duplicate filters (of which the only possibility should be PASS). the first if statement removes the PASS if there are filters that did not pass.

}

@Override
protected List<String> requiredAnnotations() { return Collections.emptyList(); }
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This does have required annotations, they're just not info annotations. How important is this check?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

well the code that uses this sees if the Variant Context has the annotations - so this wouldn't work here because it's on the genotype. If there are no required annotations the check passes and the error probabilities are calculated. I could rename the method to requiredInfoAnnotations or requiredVariantAnnotations?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If the upstream method has a check for INFO annotations, then I'm happy.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok - renaming the method to requiredInfoAnnotations so it's more explicit

@gatk-bot
Copy link

gatk-bot commented Feb 3, 2020

Travis reported job failures from build 28960
Failures in the following jobs:

Test Type JDK Job ID Logs
unit openjdk11 28960.13 logs
unit openjdk8 28960.3 logs

@davidbenjamin davidbenjamin mentioned this pull request Feb 6, 2020
@davidbenjamin
Copy link
Contributor

@ahaessly Here's a representative line in the Mutect2IntegrationTest output:

20      38245275        .       TAC     T,TACAC .       germline;multiallelic;normal_artifact;slippage  AS_FilterStatus=.|weak_evidence|PASS

Comments:

  1. Could we make AS_FilterStatus an alt-only field, since the ref isn't filtered?
  2. I don't feel good about an AS_FilterStatus of PASS when the site fails the variant filters, especially because the next step (which I intend to do soon after your PR) is to make every filter allele-specific so that a filtered variant just means that every allele had a filter in common. I think of the allele filters as additive on top of the variant filters, so that a better representation would be to add "." If you want to leave this change for later that's fine with me.
  3. I think some users have complained about INFO fields with underscores violating the VCF spec. Let's rename this to someone like ASFILTER ("Status" can be omitted).

Otherwise, looks like you have addressed everything.

@ahaessly
Copy link
Contributor Author

@davidbenjamin AS_FilterStatus was already a field used in the INFO section of a VCF (@ldgauthier added it I believe). Also, it seems like there are several other INFO field attributes that use the _ if you take a look at GATKVCFConstants file. I'm fine changing it just wanted to hear from laura as well.

@davidbenjamin
Copy link
Contributor

I double-checked the VCF spec and it turns out that while FORMAT keys must be alphanumeric, INFO keys may have underscores. I still prefer ASFILTER for the sake of brevity and hope the existing constant is not set in stone yet, though there may be compelling reasons not to change it. Let's let Laura weigh in.

@ldgauthier
Copy link
Contributor

@davidbenjamin I'm willing to concede number 3, and for number 2 I grudgingly agree. A dot in the FILTER field means filters haven't been applied, so by extension a dot in the ASFILTER field would mean that filters haven't been applied to that allele, which isn't true. But in the autosomal pipeline, a variant that fails a site filter before VQSR wouldn't be seen or filtered by VQSR, which I suppose does mean that filters have not been applied. I guess my objection boils down to the fact that by outputting . instead of PASS you're throwing away information.

@ahaessly ahaessly force-pushed the ah_SO166_morefilters branch from 3f2937e to f07b809 Compare February 19, 2020 22:37
@gatk-bot
Copy link

Travis reported job failures from build 29219
Failures in the following jobs:

Test Type JDK Job ID Logs
variantcalling openjdk8 29219.4 logs

@gatk-bot
Copy link

gatk-bot commented Feb 20, 2020

Travis reported job failures from build 29224
Failures in the following jobs:

Test Type JDK Job ID Logs
integration oraclejdk8 29224.12 logs
unit openjdk11 29224.14 logs
variantcalling openjdk8 29224.4 logs
integration openjdk11 29224.13 logs
integration openjdk8 29224.2 logs

@gatk-bot
Copy link

gatk-bot commented Feb 21, 2020

Travis reported job failures from build 29232
Failures in the following jobs:

Test Type JDK Job ID Logs
variantcalling openjdk8 29232.4 logs
unit openjdk8 29232.3 logs

@gatk-bot
Copy link

gatk-bot commented Feb 21, 2020

Travis reported job failures from build 29236
Failures in the following jobs:

Test Type JDK Job ID Logs
variantcalling openjdk8 29236.4 logs
unit openjdk8 29236.3 logs

@davidbenjamin
Copy link
Contributor

davidbenjamin commented Feb 23, 2020

I guess my objection boils down to the fact that by outputting . Instead of PASS you're throwing away information.

I agree that we need to address that. What if an allele whose failing filters were all passed to the site got a concise tag like site to indicate that it has no additional filters, is not a passing allele, and that allele-specific filters were applied?

Besides the aesthetics of the word "PASS" being associated in any way with something failing, I'm worried about users shooting themselves in the foot with pipelines that only check the allele filters. You could imagine cases where failing to check the site filters wouldn't be so egregious as to be obvious.

@ldgauthier
Copy link
Contributor

I could be onboard with site. It's not entirely intuitive, but it will get a header line, right?

@davidbenjamin
Copy link
Contributor

It will get a header line.

@ldgauthier
Copy link
Contributor

@ahaessly after reviewing the allele-specific filtering discussion in hts-specs (samtools/hts-specs#115), I think the delimiters for the ASFILTER values should be switched, i.e. data for each allele should be separated by commas (ALLELE_SPECIFIC_PRINT_DELIM?) and the filters for each allele should be separated by pipes (ALLELE_SPECIFIC_RAW_DELIM).

@ahaessly ahaessly requested a review from meganshand March 25, 2020 19:48
Copy link
Contributor

@meganshand meganshand left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The WDLs look good to me (that's all I looked at). I learned some new tricks! Only have a couple of minor comments/questions. Thanks!

@@ -1,6 +1,6 @@
version 1.0

import "AlignmentPipeline.wdl" as AlignAndMarkDuplicates
import "https://api.firecloud.org/ga4gh/v1/tools/mitochondria:AlignmentPipeline/versions/1/plain-WDL/descriptor" as AlignAndMarkDuplicates
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think for the tests to work this needs to be the file name "AlignmentPipeline.wdl". If there's a better solution for moving things back and forth between Terra and the gatk repo though I'm all for it.

scripts/mitochondria_m2_wdl/AlignAndCall.wdl Outdated Show resolved Hide resolved
scripts/mitochondria_m2_wdl/AlignAndCall.wdl Outdated Show resolved Hide resolved
awk '{print $6}' output-data > major_hg.txt
awk '{print $8}' output-data > minor_hg.txt
awk '{print $14}' output-data > mean_het_major.txt
awk '{print $15}' output-data > mean_het_minor.txt
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you think it's worth adding any sanity checks here to make sure this doesn't break in the future since it's relying on column order to extract the values?

scripts/mitochondria_m2_wdl/AlignAndCall.wdl Outdated Show resolved Hide resolved
@gatk-bot
Copy link

gatk-bot commented Apr 1, 2020

Travis reported job failures from build 29837
Failures in the following jobs:

Test Type JDK Job ID Logs
integration oraclejdk8 29837.12 logs
integration openjdk11 29837.13 logs
integration openjdk8 29837.2 logs

@gatk-bot
Copy link

gatk-bot commented Apr 1, 2020

Travis reported job failures from build 29839
Failures in the following jobs:

Test Type JDK Job ID Logs
integration oraclejdk8 29839.12 logs
integration openjdk11 29839.13 logs
integration openjdk8 29839.2 logs

@gatk-bot
Copy link

gatk-bot commented Apr 1, 2020

Travis reported job failures from build 29842
Failures in the following jobs:

Test Type JDK Job ID Logs
integration oraclejdk8 29842.12 logs
integration openjdk11 29842.13 logs
integration openjdk8 29842.2 logs

@@ -470,7 +498,7 @@ task M2 {
--max-mnp-distance 0
>>>
runtime {
docker: "us.gcr.io/broad-gatk/gatk:4.1.1.0"
docker: select_first([gatk_docker_override, "us.gcr.io/broad-gatk/gatk:4.1.1.0"])
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure what the solution here is, but does keeping this as the default version mean that we'll have to keep updating it when there are new releases? That's the way it already was, so it's not necessarily a problem, I just want to remember that once this PR is merged and released we should go back and update this default.

Copy link
Contributor

@meganshand meganshand left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I had one small question/comment but otherwise the WDL and WDL tests look good to me! Thanks for adding them.

Copy link
Contributor

@davidbenjamin davidbenjamin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@ahaessly Looks like you have resolved everything. Don't bother sending it back to me on account my one trivial comment, which you can take or leave at your discretion.

// set tumorAD to 0 for symbolic alleles so it won't contribute to overall AD
List<Allele> symbolicAlleles = vc.getAlternateAlleles().stream().filter(allele -> allele.isSymbolic()).collect(Collectors.toList());
// convert allele index to alt allele index
List<Integer> symIndexes = vc.getAlleleIndices(symbolicAlleles).stream().map(i -> i-1).collect(Collectors.toList());
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you go directly compute symbolic indices via new IndexRange(0, vc.getNAlleles()).filter(n -> vc.getAllele(n).isSymbolic());?

@@ -73,7 +73,7 @@


/**
*
* Returns a list for each alternate allele which is the probability that the allele should be filtered out.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"a list with one element for each allele"?

@ahaessly ahaessly merged commit 3021e69 into master Apr 22, 2020
@ahaessly ahaessly deleted the ah_SO166_morefilters branch April 22, 2020 17:41
jonn-smith pushed a commit that referenced this pull request Jul 14, 2020
new refactor adding allele specific filters in mutect2
fix headers
* Fixed AF and SB splitting; also some javadoc I should have done in the
past and some refactoring I should have done in the past
* make unique alt read count allele specific
* fix genotypes not included in vcf
* changed getRequiredAnnotations to getRequiredInfoAnnotations to be more explicit
* fix split multiallelics to work for all info fields
* change count type for RPA
* update example inputs to use broad public bucket for refs

Co-authored-by: Laura Gauthier <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants