-
Notifications
You must be signed in to change notification settings - Fork 596
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fixed an edge case that could result in negative GQs when in USE_POSTERIOR_PROBABILITIES mode #7120
Conversation
@ldgauthier Would you like to review this one as well, since you reviewed #7148? It's just some added |
@@ -340,22 +341,23 @@ public static void makeGenotypeCall(final int ploidy, | |||
} | |||
} | |||
|
|||
private static double getGQLog10FromPosteriors(final int bestGenotypeIndex, final double[] /**/log10Posteriors) { | |||
@VisibleForTesting | |||
static double getGQLog10FromPosteriors(final int bestGenotypeIndex, final double[] /**/log10Posteriors) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@vruano James and I are both wondering why we're allowing a bestGenotypeIndex
that isn't the array max.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this is just a optimization..... looking where this method is used, is often the case that that max index has already been calculated earlier on the very same probs/lks array for another reason and by taken it as an argument we are saving ourselves the recalculation cost.
Have you seen anywhere where this rule is broken (i.e. the index was calculated from a different array or the array was altered after the index was calculated?)
I can see that it is a code smell and so If you prefer to recalculate i'm ok with that. Perhaps we should consider to encapsulate prob/lk array in an object that would keep track of best and second best but that perhaps is over-kill for now.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I haven't seen any inconsistent cases, but can we at least specify in some javadoc that the index should agree with the minimum of the posterior array being passed in?
log10Posteriors[ bestGenotypeIndex == 0 ? 2 : bestGenotypeIndex - 1], | ||
log10Posteriors[ bestGenotypeIndex == 2 ? 0 : bestGenotypeIndex + 1])); | ||
default: | ||
if (bestGenotypeIndex == 0) { | ||
return MathUtils.log10SumLog10(log10Posteriors, 1, log10Posteriors.length); | ||
return Math.min(0.0, MathUtils.log10SumLog10(log10Posteriors, 1, log10Posteriors.length)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Another issue for @vruano : If we have to throw a min on this, then the normalization isn't happening in the right place. According to my calculations (and as seen in James's unit test) the common case where this comes into play is where there are more than two equally likely genotypes, so the log10 probabilities get normalized to [0, 0, 0, -something, more negative somethings], the best is a zero and then there are two not bests that are zero and the log10SumLog10(0,0) makes a log10(2) ~ 0.3 > 0 which leads to a negative GQ in Phred-space. Elsewhere we use the MathUtils.secondSmallestMinusSmallest, which doesn't have this problem. I don't think the log10SumLog10 is the right thing to do.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
To be honest I don't know why I did it this way... we could do what we do with PL or perhaps even better would be to do the proper calculation that is sum (Z) of all the posteriors (before doing the -max trasformation) and report 1-Prob(GT)/Z
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Based on the definition of GQ in the VCF we are supposed to do that.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Given that we output integer GQ values (as in the spec), the "smallest minus second smallest" will be the same as 1-Prob(GT)/Z unless the PLs are [0,0,0] right?
@ldgauthier @vruano Given that this is still fixing a rather glaring bug in the current behavior and we should probably get a fix of some ilk in before the next release of GATK I'm going to push the decision to rework the GT calculation to an open issue ticket in favor of getting this branch merged. |
I have merged many a PR by opening new issues, so I would be a hypocrite to object. Sounds good to me @jamesemery. |
@jamesemery What's the status of this one as well? Close to a merge? |
…ERIOR_PROBABILITIES mode
83b1c8d
to
464a40d
Compare
@ldgauthier @vruano can i get a passing review on this so it can be merged? the failing tests were transient errors. |
Pushed the primary complaint into an issue for farther discussion.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No objections
Resolves one of the issues encountered when merging the DRAGEN-GATK gvcfs the #7108.