Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

More flexible matching of dbSNP variants #6626

Merged
merged 4 commits into from
Jun 9, 2020
Merged

Conversation

kachulis
Copy link
Contributor

Addresses two user requests to more flexibly and accurately match dbSNP variants.

https://gatk.broadinstitute.org/hc/en-us/community/posts/360066006472-gatk-4-1-4-1-HaplotypeCaller-D-parameter-for-MIXED-type

https://gatk.broadinstitute.org/hc/en-us/community/posts/360062537671-GATK-4-1-7-0-does-not-annotate-ID-using-dbSNP-build-153-VCF

The first change is to add all dbsnp id's which match a particular variant to the variant's id, instead of just the first one found in the dbsnp vcf.

The second change is to be less brittle to variant normalization issues, and match differing variant representations of the same underlying variant. This is implemented by splitting and trimming multiallelics before checking for a match, which I suspect are the predominant cause of these types of matching failures.

@gatk-bot
Copy link

gatk-bot commented May 29, 2020

Travis reported job failures from build 30450
Failures in the following jobs:

Test Type JDK Job ID Logs
integration openjdk11 30450.11 logs
integration openjdk8 30450.2 logs

@droazen
Copy link
Contributor

droazen commented Jun 3, 2020

@jamesemery you might be a good reviewer for this one

Copy link
Collaborator

@jamesemery jamesemery left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you explain to me what happens for deletions of various lengths? It seems like there might be some more messiness in this class that is not your fault but we should probably address.

if ( ! vcComp.getContig().equals(vcToAnnotate.getContig()) || vcComp.getStart() != vcToAnnotate.getStart() ) {
throw new IllegalArgumentException("source rsID VariantContext " + vcComp + " doesn't start at the same position as vcToAnnotate " + vcToAnnotate);
if (!vcCompSource.getContig().equals(vcToAnnotate.getContig())) {
throw new IllegalArgumentException("source rsID VariantContext " + vcCompSource + " is not on same chromosome as vcToAnnotate " + vcToAnnotate);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm... splitVariantContextToBiallelics() claims that under some circumstances it results in variants getting pushed forwards on the genome. This is all fine and well but it leaves a hole here where there could be a variant in the input (or the DBSNP) that gets moved and would otherwise match with the next variant in the other file. Not that the old code could handle this case any better... This is a very niche circumstance and probably doesn't warrant worrying about. I do think we should probably document this fact in a comment somewhere.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree that there are still cases that this solution will not catch, but I think they are related to left alignment of indels or MNP vs multiple SNP/ complex snp/indel combination issues, not the allele movement that occurs in splitVariantContextToBiallelics. Since there is no reference passed to splitVariantContextToBiallelics() beyond what is stored in the reference allele of the variant context, the alleles that are returned by this method cannot overlap any reference bases that were not overlapped by the original alleles. So I think any match that would be found if the splitting and trimming were performed before grabbing the overlaps will also be found by this code.

But I agree with the general point, so I have added a comment that in rare cases we may fail add annotations to a variant that "should" have been added.

boolean addThisID = false;
for (final VariantContext vcComp : vcCompList) {
for (final VariantContext vcToAnnotateBi : vcAnnotateList) {
if (vcComp.getStart() == vcToAnnotateBi.getStart() && vcToAnnotateBi.getReference().equals(vcComp.getReference()) && vcComp.getAlternateAlleles().equals(vcToAnnotateBi.getAlternateAlleles())) {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wait a minute.... is vcToAnnotateBi.getAlternateAlleles() really adequate here? This means that two deletions of different lengths are exactly the same as far as our DBSNP annotations are concerned? That doesn't seem right...

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But it would care about insertions being an exact match?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The reference allele is also checked, so deletions of different lengths shouldn't be considered a match. I will add tests to confirm this though.

final VariantContext dbSNP_complex_mixed_site = makeVC("DBSNP", "rsID1", Arrays.asList("TTCCTCCTCCTCCTCCTCC", "T", "TTCCTCCTCCTCCTCCTCCTCC"));

tests.add(new Object[]{callNOID_T_TTCC, Arrays.asList(dbSNP_complex_mixed_site), "rsID1", true});

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you add a test with a DBSNP annotation with a particular length deletion and a variant that is also a deletion but with a different length to the annotation deletion? I have a hunch it will end up adding the annotation erroneously.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also test for insertions of various lengths.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

added tests for both, they seem to be behaving correctly.

@@ -156,26 +158,39 @@ public VariantContext annotateOverlap(final List<VariantContext> overlapTestVCs,
private static String getRsID(final List<VariantContext> rsIDSourceVCs, final VariantContext vcToAnnotate) {
Utils.nonNull(rsIDSourceVCs, "rsIDSourceVCs cannot be null");
Utils.nonNull(vcToAnnotate, "vcToAnnotate cannot be null");
final List<String> rsids = new ArrayList<>();
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Update the @return line of the comments.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

@kachulis
Copy link
Contributor Author

kachulis commented Jun 4, 2020

@jamesemery thanks for the review. I don't think indels of various lengths should cause any problems, because the reference alleles are required to match in addition to the alt alleles. I added some tests to confirm this and things appear to be behaving correctly.

back to you!

Copy link
Collaborator

@jamesemery jamesemery left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You are right, I was missing that both the reference and the alt alleles were being checked. Your new tests look like good proof that I was mistaken. Feel free to merge when the tests pass.

@gatk-bot
Copy link

gatk-bot commented Jun 4, 2020

Travis reported job failures from build 30509
Failures in the following jobs:

Test Type JDK Job ID Logs
integration openjdk11 30509.11 logs
integration openjdk8 30509.2 logs

@kachulis kachulis force-pushed the ck_dbsnp_annotations branch from 63155f6 to 3a7b57a Compare June 4, 2020 21:10
@kachulis
Copy link
Contributor Author

kachulis commented Jun 4, 2020

@jamesemery in order to get the tests to pass I had to regenerate one of the expected output vcf's for a GenotypGvcfs integration test, which makes sense because I'm changing the way we annotate variant id's. Can I just get a quick thumbs up if you are comfortable with this additional change before I merge, assuming tests now pass.

@jamesemery
Copy link
Collaborator

jamesemery commented Jun 5, 2020

@kachulis Interesting, this change has highlighted the fact that DBSNP (or the one we have checked in anyway) is full of duplicate variants with different IDs. It looks like most of the changes in that file are appending both duplicated IDs to the variant which certainly sounds more correct than before. I'm a little concerned about the Qual score changes, do you know what caused them? It looks like the tests don't actually check those scores exactly, it is probably unrelated now that I see we compare QUAL scores by:
private static final double DEFAULT_FLOAT_TOLERANCE = 1e-1; but it seems that there is a consistent 0.04 QUAL score change at almost every site in that file....
I would recommend trying to regenerate that file without your code change and if you see the same QUAL behavior then go ahead and merge (I just want to be sure that the splitting code is not destructive in some unforeseen way).

@kachulis
Copy link
Contributor Author

kachulis commented Jun 9, 2020

@jamesemery I regenerated the file based on master, and see the same shift in QUAL scores. I managed to track this QUAL score shift back to #6401, which makes sense. If you look at the files changed in that PR, there are a number of expected test output files changed with this same QUAL score shift. I guess this particular file just got missed, and since there's enough leeway tests still passed. So I'm going to go ahead and merge.

Thanks again for the review!

@kachulis kachulis merged commit 3ad0eca into master Jun 9, 2020
@kachulis kachulis deleted the ck_dbsnp_annotations branch June 9, 2020 16:48
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants