-
Notifications
You must be signed in to change notification settings - Fork 596
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
More flexible matching of dbSNP variants #6626
Conversation
@jamesemery you might be a good reviewer for this one |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you explain to me what happens for deletions of various lengths? It seems like there might be some more messiness in this class that is not your fault but we should probably address.
if ( ! vcComp.getContig().equals(vcToAnnotate.getContig()) || vcComp.getStart() != vcToAnnotate.getStart() ) { | ||
throw new IllegalArgumentException("source rsID VariantContext " + vcComp + " doesn't start at the same position as vcToAnnotate " + vcToAnnotate); | ||
if (!vcCompSource.getContig().equals(vcToAnnotate.getContig())) { | ||
throw new IllegalArgumentException("source rsID VariantContext " + vcCompSource + " is not on same chromosome as vcToAnnotate " + vcToAnnotate); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm... splitVariantContextToBiallelics() claims that under some circumstances it results in variants getting pushed forwards on the genome. This is all fine and well but it leaves a hole here where there could be a variant in the input (or the DBSNP) that gets moved and would otherwise match with the next variant in the other file. Not that the old code could handle this case any better... This is a very niche circumstance and probably doesn't warrant worrying about. I do think we should probably document this fact in a comment somewhere.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree that there are still cases that this solution will not catch, but I think they are related to left alignment of indels or MNP vs multiple SNP/ complex snp/indel combination issues, not the allele movement that occurs in splitVariantContextToBiallelics. Since there is no reference passed to splitVariantContextToBiallelics() beyond what is stored in the reference allele of the variant context, the alleles that are returned by this method cannot overlap any reference bases that were not overlapped by the original alleles. So I think any match that would be found if the splitting and trimming were performed before grabbing the overlaps will also be found by this code.
But I agree with the general point, so I have added a comment that in rare cases we may fail add annotations to a variant that "should" have been added.
boolean addThisID = false; | ||
for (final VariantContext vcComp : vcCompList) { | ||
for (final VariantContext vcToAnnotateBi : vcAnnotateList) { | ||
if (vcComp.getStart() == vcToAnnotateBi.getStart() && vcToAnnotateBi.getReference().equals(vcComp.getReference()) && vcComp.getAlternateAlleles().equals(vcToAnnotateBi.getAlternateAlleles())) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Wait a minute.... is vcToAnnotateBi.getAlternateAlleles() really adequate here? This means that two deletions of different lengths are exactly the same as far as our DBSNP annotations are concerned? That doesn't seem right...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
But it would care about insertions being an exact match?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The reference allele is also checked, so deletions of different lengths shouldn't be considered a match. I will add tests to confirm this though.
final VariantContext dbSNP_complex_mixed_site = makeVC("DBSNP", "rsID1", Arrays.asList("TTCCTCCTCCTCCTCCTCC", "T", "TTCCTCCTCCTCCTCCTCCTCC")); | ||
|
||
tests.add(new Object[]{callNOID_T_TTCC, Arrays.asList(dbSNP_complex_mixed_site), "rsID1", true}); | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you add a test with a DBSNP annotation with a particular length deletion and a variant that is also a deletion but with a different length to the annotation deletion? I have a hunch it will end up adding the annotation erroneously.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also test for insertions of various lengths.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
added tests for both, they seem to be behaving correctly.
@@ -156,26 +158,39 @@ public VariantContext annotateOverlap(final List<VariantContext> overlapTestVCs, | |||
private static String getRsID(final List<VariantContext> rsIDSourceVCs, final VariantContext vcToAnnotate) { | |||
Utils.nonNull(rsIDSourceVCs, "rsIDSourceVCs cannot be null"); | |||
Utils.nonNull(vcToAnnotate, "vcToAnnotate cannot be null"); | |||
final List<String> rsids = new ArrayList<>(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Update the @return line of the comments.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
@jamesemery thanks for the review. I don't think indels of various lengths should cause any problems, because the reference alleles are required to match in addition to the alt alleles. I added some tests to confirm this and things appear to be behaving correctly. back to you! |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You are right, I was missing that both the reference and the alt alleles were being checked. Your new tests look like good proof that I was mistaken. Feel free to merge when the tests pass.
63155f6
to
3a7b57a
Compare
@jamesemery in order to get the tests to pass I had to regenerate one of the expected output vcf's for a GenotypGvcfs integration test, which makes sense because I'm changing the way we annotate variant id's. Can I just get a quick thumbs up if you are comfortable with this additional change before I merge, assuming tests now pass. |
@kachulis Interesting, this change has highlighted the fact that DBSNP (or the one we have checked in anyway) is full of duplicate variants with different IDs. It looks like most of the changes in that file are appending both duplicated IDs to the variant which certainly sounds more correct than before. I'm a little concerned about the Qual score changes, do you know what caused them? It looks like the tests don't actually check those scores exactly, it is probably unrelated now that I see we compare QUAL scores by: |
@jamesemery I regenerated the file based on master, and see the same shift in QUAL scores. I managed to track this QUAL score shift back to #6401, which makes sense. If you look at the files changed in that PR, there are a number of expected test output files changed with this same QUAL score shift. I guess this particular file just got missed, and since there's enough leeway tests still passed. So I'm going to go ahead and merge. Thanks again for the review! |
Addresses two user requests to more flexibly and accurately match dbSNP variants.
https://gatk.broadinstitute.org/hc/en-us/community/posts/360066006472-gatk-4-1-4-1-HaplotypeCaller-D-parameter-for-MIXED-type
https://gatk.broadinstitute.org/hc/en-us/community/posts/360062537671-GATK-4-1-7-0-does-not-annotate-ID-using-dbSNP-build-153-VCF
The first change is to add all dbsnp id's which match a particular variant to the variant's id, instead of just the first one found in the dbsnp vcf.
The second change is to be less brittle to variant normalization issues, and match differing variant representations of the same underlying variant. This is implemented by splitting and trimming multiallelics before checking for a match, which I suspect are the predominant cause of these types of matching failures.