More flexible matching of dbSNP variants #6626

kachulis · 2020-05-29T22:02:14Z

Addresses two user requests to more flexibly and accurately match dbSNP variants.

https://gatk.broadinstitute.org/hc/en-us/community/posts/360066006472-gatk-4-1-4-1-HaplotypeCaller-D-parameter-for-MIXED-type

https://gatk.broadinstitute.org/hc/en-us/community/posts/360062537671-GATK-4-1-7-0-does-not-annotate-ID-using-dbSNP-build-153-VCF

The first change is to add all dbSNP IDs that match a particular variant to the variant's ID, instead of just the first one found in the dbSNP VCF.
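As a rough illustration of the first change, the sketch below collects every matching ID rather than stopping at the first hit. It is not the code in this PR: `DbsnpEntry` and `collectIds` are hypothetical names, alleles are plain strings instead of htsjdk objects, and joining with `;` assumes the separator the VCF specification uses for multiple IDs.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class RsIdCollectSketch {
    /** Hypothetical stand-in for one dbSNP record: ID, 1-based start, ref and alt bases. */
    static final class DbsnpEntry {
        final String id;
        final int start;
        final String ref;
        final String alt;

        DbsnpEntry(final String id, final int start, final String ref, final String alt) {
            this.id = id;
            this.start = start;
            this.ref = ref;
            this.alt = alt;
        }
    }

    /** Returns every distinct matching ID joined with ';', or "." (VCF missing value) if none match. */
    static String collectIds(final List<DbsnpEntry> entries, final int start, final String ref, final String alt) {
        final List<String> ids = new ArrayList<>();
        for (final DbsnpEntry e : entries) {
            final boolean sameVariant = e.start == start && e.ref.equals(ref) && e.alt.equals(alt);
            if (sameVariant && !ids.contains(e.id)) {
                ids.add(e.id); // keep going: later entries may carry additional IDs
            }
        }
        return ids.isEmpty() ? "." : String.join(";", ids);
    }

    public static void main(final String[] args) {
        final List<DbsnpEntry> dbsnp = Arrays.asList(
                new DbsnpEntry("rs1", 100, "A", "T"),
                new DbsnpEntry("rs2", 100, "A", "T"),  // duplicate variant, different ID
                new DbsnpEntry("rs3", 100, "A", "G")); // different alt allele
        System.out.println(collectIds(dbsnp, 100, "A", "T")); // prints rs1;rs2
    }
}
```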

The second change is to be less brittle to variant normalization issues, and to match differing representations of the same underlying variant. This is implemented by splitting multiallelic variants into biallelics and trimming them before checking for a match; I suspect multiallelic representations are the predominant cause of these matching failures.
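The idea behind splitting and trimming can be sketched as follows. This is a simplified stand-in, not the GATK implementation: `splitAndTrim` and `trim` are hypothetical helpers that work on plain strings, and the order of trimming (shared trailing bases first, then shared leading bases, always leaving at least one base per allele) is an assumption made for the sketch.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class SplitTrimSketch {
    /** Trims shared trailing, then shared leading bases, keeping at least one base per allele. */
    static String trim(final int start, final String ref, final String alt) {
        int back = 0; // number of shared trailing bases to drop
        while (ref.length() - back > 1 && alt.length() - back > 1
                && ref.charAt(ref.length() - 1 - back) == alt.charAt(alt.length() - 1 - back)) {
            back++;
        }
        final String r = ref.substring(0, ref.length() - back);
        final String a = alt.substring(0, alt.length() - back);

        int front = 0; // number of shared leading bases to drop; each one moves the start forward
        while (r.length() - front > 1 && a.length() - front > 1 && r.charAt(front) == a.charAt(front)) {
            front++;
        }
        return (start + front) + ":" + r.substring(front) + ">" + a.substring(front);
    }

    /** Splits a multiallelic site into one trimmed biallelic "start:ref>alt" string per alt allele. */
    static List<String> splitAndTrim(final int start, final String ref, final List<String> alts) {
        final List<String> result = new ArrayList<>();
        for (final String alt : alts) {
            result.add(trim(start, ref, alt));
        }
        return result;
    }

    public static void main(final String[] args) {
        // Alleles taken from the mixed-site unit test in this PR; the start position is made up.
        final List<String> split = splitAndTrim(100, "TTCCTCCTCCTCCTCCTCC",
                Arrays.asList("T", "TTCCTCCTCCTCCTCCTCCTCC"));
        System.out.println(split); // prints [100:TTCCTCCTCCTCCTCCTCC>T, 100:T>TTCC]
    }
}
```

After trimming, the insertion allele of the mixed site reduces to T>TTCC, which is the representation a caller would typically emit for that insertion on its own.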

gatk-bot · 2020-05-29T22:49:34Z

Travis reported job failures from build 30450
Failures in the following jobs:

Test Type	JDK	Job ID	Logs
integration	openjdk11	30450.11	logs
integration	openjdk8	30450.2	logs

droazen · 2020-06-03T17:39:15Z

@jamesemery you might be a good reviewer for this one

jamesemery

Can you explain to me what happens for deletions of various lengths? It seems like there might be some more messiness in this class that is not your fault but we should probably address.

jamesemery · 2020-06-03T20:24:38Z

...main/java/org/broadinstitute/hellbender/tools/walkers/annotator/VariantOverlapAnnotator.java

-            if ( ! vcComp.getContig().equals(vcToAnnotate.getContig()) || vcComp.getStart() != vcToAnnotate.getStart() ) {
-                throw new IllegalArgumentException("source rsID VariantContext " + vcComp + " doesn't start at the same position as vcToAnnotate " + vcToAnnotate);
+            if (!vcCompSource.getContig().equals(vcToAnnotate.getContig())) {
+                throw new IllegalArgumentException("source rsID VariantContext " + vcCompSource + " is not on same chromosome as vcToAnnotate " + vcToAnnotate);


Hmm... splitVariantContextToBiallelics() claims that under some circumstances it results in variants getting pushed forwards on the genome. This is all fine and well but it leaves a hole here where there could be a variant in the input (or the DBSNP) that gets moved and would otherwise match with the next variant in the other file. Not that the old code could handle this case any better... This is a very niche circumstance and probably doesn't warrant worrying about. I do think we should probably document this fact in a comment somewhere.

kachulis · 2020-06-04

I agree that there are still cases that this solution will not catch, but I think they are related to left alignment of indels, or to MNP vs. multiple SNP and complex SNP/indel combination issues, not to the allele movement that occurs in splitVariantContextToBiallelics(). Since there is no reference passed to splitVariantContextToBiallelics() beyond what is stored in the reference allele of the variant context, the alleles returned by this method cannot overlap any reference bases that were not overlapped by the original alleles. So I think any match that would be found if the splitting and trimming were performed before grabbing the overlaps will also be found by this code.

But I agree with the general point, so I have added a comment noting that in rare cases we may fail to add annotations that "should" have been added to a variant.

jamesemery · 2020-06-03T20:28:07Z

...main/java/org/broadinstitute/hellbender/tools/walkers/annotator/VariantOverlapAnnotator.java

+            boolean addThisID = false;
+            for (final VariantContext vcComp : vcCompList) {
+                for (final VariantContext vcToAnnotateBi : vcAnnotateList) {
+                    if (vcComp.getStart() == vcToAnnotateBi.getStart() && vcToAnnotateBi.getReference().equals(vcComp.getReference()) && vcComp.getAlternateAlleles().equals(vcToAnnotateBi.getAlternateAlleles())) {


Wait a minute.... is vcToAnnotateBi.getAlternateAlleles() really adequate here? This means that two deletions of different lengths are exactly the same as far as our DBSNP annotations are concerned? That doesn't seem right...

But it would care about insertions being an exact match?

kachulis · 2020-06-03

The reference allele is also checked, so deletions of different lengths shouldn't be considered a match. I will add tests to confirm this though.
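To make the point about deletion lengths concrete, here is a minimal sketch of the three-part comparison in the diff above (start, reference allele, alternate allele), using plain strings in place of htsjdk alleles. `sameVariant` is a hypothetical helper, not a method from this PR.

```java
public class AlleleMatchSketch {
    /** True only if start, reference allele and alternate allele all agree. */
    static boolean sameVariant(final int startA, final String refA, final String altA,
                               final int startB, final String refB, final String altB) {
        return startA == startB && refA.equals(refB) && altA.equals(altB);
    }

    public static void main(final String[] args) {
        // Two deletions at the same position share the alt allele "A",
        // but a 1-base deletion (AT>A) and a 2-base deletion (ATT>A) have different reference alleles.
        System.out.println(sameVariant(100, "AT", "A", 100, "ATT", "A")); // prints false
        System.out.println(sameVariant(100, "AT", "A", 100, "AT", "A"));  // prints true
        // Insertions of different lengths differ in the alt allele instead.
        System.out.println(sameVariant(100, "A", "AT", 100, "A", "ATT")); // prints false
    }
}
```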

jamesemery · 2020-06-03T20:30:00Z

...a/org/broadinstitute/hellbender/tools/walkers/annotator/VariantOverlapAnnotatorUnitTest.java

+        final VariantContext dbSNP_complex_mixed_site = makeVC("DBSNP", "rsID1", Arrays.asList("TTCCTCCTCCTCCTCCTCC", "T", "TTCCTCCTCCTCCTCCTCCTCC"));
+
+        tests.add(new Object[]{callNOID_T_TTCC, Arrays.asList(dbSNP_complex_mixed_site), "rsID1", true});
+


Can you add a test with a DBSNP annotation with a particular length deletion and a variant that is also a deletion but with a different length to the annotation deletion? I have a hunch it will end up adding the annotation erroneously.

Also test for insertions of various lengths.

kachulis · 2020-06-04

Added tests for both; they seem to be behaving correctly.

jamesemery · 2020-06-03T20:33:55Z

...main/java/org/broadinstitute/hellbender/tools/walkers/annotator/VariantOverlapAnnotator.java

@@ -156,26 +158,39 @@ public VariantContext annotateOverlap(final List<VariantContext> overlapTestVCs,
    private static String getRsID(final List<VariantContext> rsIDSourceVCs, final VariantContext vcToAnnotate) {
        Utils.nonNull(rsIDSourceVCs, "rsIDSourceVCs cannot be null");
        Utils.nonNull(vcToAnnotate, "vcToAnnotate cannot be null");
+        final List<String> rsids = new ArrayList<>();


Update the @return line of the comments.

kachulis · 2020-06-04T17:57:33Z

@jamesemery thanks for the review. I don't think indels of various lengths should cause any problems, because the reference alleles are required to match in addition to the alt alleles. I added some tests to confirm this and things appear to be behaving correctly.

back to you!

jamesemery

You are right, I was missing that both the reference and the alt alleles were being checked. Your new tests look like good proof that I was mistaken. Feel free to merge when the tests pass.

gatk-bot · 2020-06-04T18:39:55Z

Travis reported job failures from build 30509
Failures in the following jobs:

Test Type	JDK	Job ID	Logs
integration	openjdk11	30509.11	logs
integration	openjdk8	30509.2	logs

kachulis · 2020-06-04T21:18:35Z

@jamesemery in order to get the tests to pass I had to regenerate one of the expected output VCFs for a GenotypeGVCFs integration test, which makes sense because I'm changing the way we annotate variant IDs. Can I get a quick thumbs up if you are comfortable with this additional change before I merge, assuming the tests now pass?

jamesemery · 2020-06-05T15:01:37Z

@kachulis Interesting, this change has highlighted the fact that dbSNP (or at least the copy we have checked in) is full of duplicate variants with different IDs. Most of the changes in that file append both duplicated IDs to the variant, which certainly sounds more correct than before. I'm a little concerned about the QUAL score changes, though; do you know what caused them? They are probably unrelated, and the tests don't check those scores exactly, since I now see we compare QUAL scores using:
private static final double DEFAULT_FLOAT_TOLERANCE = 1e-1; but there is a consistent 0.04 QUAL score change at almost every site in that file.
I would recommend regenerating that file without your code change; if you see the same QUAL behavior, go ahead and merge (I just want to be sure that the splitting code is not destructive in some unforeseen way).
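For reference, the arithmetic behind why a 0.04 shift goes unnoticed can be sketched like this. It assumes the comparison is a plain absolute-difference check against the tolerance quoted above; `withinTolerance` is a hypothetical helper and the QUAL values are made up, so the real test utility may differ.

```java
public class QualToleranceSketch {
    // Value quoted in the comment above.
    private static final double DEFAULT_FLOAT_TOLERANCE = 1e-1;

    /** Assumed comparison: absolute difference no larger than the tolerance. */
    static boolean withinTolerance(final double expected, final double actual, final double tolerance) {
        return Math.abs(expected - actual) <= tolerance;
    }

    public static void main(final String[] args) {
        // A 0.04 shift is well inside a 0.1 tolerance, so the stale expected file still passes.
        System.out.println(withinTolerance(250.10, 250.14, DEFAULT_FLOAT_TOLERANCE)); // prints true
        // A 0.2 shift would be caught.
        System.out.println(withinTolerance(250.10, 250.30, DEFAULT_FLOAT_TOLERANCE)); // prints false
    }
}
```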

kachulis · 2020-06-09T16:46:27Z

@jamesemery I regenerated the file based on master, and see the same shift in QUAL scores. I managed to track this QUAL score shift back to #6401, which makes sense. If you look at the files changed in that PR, a number of expected test output files changed with this same QUAL score shift. I guess this particular file just got missed, and since the tolerance leaves enough leeway, the tests still passed. So I'm going to go ahead and merge.

Thanks again for the review!

droazen requested a review from jamesemery June 3, 2020 17:39

droazen assigned jamesemery Jun 3, 2020

jamesemery requested changes Jun 3, 2020

jamesemery approved these changes Jun 4, 2020

kachulis added 4 commits June 4, 2020 17:02

multiple dbsnp annotations

0817dea

more flexible dbsnp matching

52aaac6

review response

1f0b73a

updating rsids in expected output to make tests pass

3a7b57a

kachulis force-pushed the ck_dbsnp_annotations branch from 63155f6 to 3a7b57a Compare June 4, 2020 21:10

kachulis merged commit 3ad0eca into master Jun 9, 2020

kachulis deleted the ck_dbsnp_annotations branch June 9, 2020 16:48

droazen mentioned this pull request Jul 2, 2020

Multiple rsID in Haplotype Caller results #6690

Closed

jonn-smith pushed a commit that referenced this pull request Jul 14, 2020

More flexible matching of dbSNP variants in VariantOverlapAnnotator (#6626)

d0ed218
