Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fixed an edge case that could result in negative GQs when in USE_POSTERIOR_PROBABILITIES mode #7120

Merged
merged 1 commit into from
Jun 1, 2021
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
Original file line number Diff line number Diff line change
@@ -1,5 +1,6 @@
package org.broadinstitute.hellbender.utils.variant;

import com.google.common.annotations.VisibleForTesting;
import htsjdk.samtools.SAMSequenceDictionary;
import htsjdk.samtools.util.Locatable;
import htsjdk.tribble.TribbleException;
Expand Down Expand Up @@ -340,22 +341,23 @@ public static void makeGenotypeCall(final int ploidy,
}
}

private static double getGQLog10FromPosteriors(final int bestGenotypeIndex, final double[] /**/log10Posteriors) {
@VisibleForTesting
static double getGQLog10FromPosteriors(final int bestGenotypeIndex, final double[] /**/log10Posteriors) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@vruano James and I are both wondering why we're allowing a bestGenotypeIndex that isn't the array max.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is just a optimization..... looking where this method is used, is often the case that that max index has already been calculated earlier on the very same probs/lks array for another reason and by taken it as an argument we are saving ourselves the recalculation cost.

Have you seen anywhere where this rule is broken (i.e. the index was calculated from a different array or the array was altered after the index was calculated?)

I can see that it is a code smell and so If you prefer to recalculate i'm ok with that. Perhaps we should consider to encapsulate prob/lk array in an object that would keep track of best and second best but that perhaps is over-kill for now.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I haven't seen any inconsistent cases, but can we at least specify in some javadoc that the index should agree with the minimum of the posterior array being passed in?

if (bestGenotypeIndex < 0) {
return CommonInfo.NO_LOG10_PERROR;
} else {
switch (log10Posteriors.length) {
case 0:
case 1: return CommonInfo.NO_LOG10_PERROR;
case 2: return bestGenotypeIndex == 0 ? log10Posteriors[1] : log10Posteriors[0];
case 3: return Math.min(0, MathUtils.log10SumLog10(
case 3: return Math.min(0.0, MathUtils.log10SumLog10(
log10Posteriors[ bestGenotypeIndex == 0 ? 2 : bestGenotypeIndex - 1],
log10Posteriors[ bestGenotypeIndex == 2 ? 0 : bestGenotypeIndex + 1]));
default:
if (bestGenotypeIndex == 0) {
return MathUtils.log10SumLog10(log10Posteriors, 1, log10Posteriors.length);
return Math.min(0.0, MathUtils.log10SumLog10(log10Posteriors, 1, log10Posteriors.length));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Another issue for @vruano : If we have to throw a min on this, then the normalization isn't happening in the right place. According to my calculations (and as seen in James's unit test) the common case where this comes into play is where there are more than two equally likely genotypes, so the log10 probabilities get normalized to [0, 0, 0, -something, more negative somethings], the best is a zero and then there are two not bests that are zero and the log10SumLog10(0,0) makes a log10(2) ~ 0.3 > 0 which leads to a negative GQ in Phred-space. Elsewhere we use the MathUtils.secondSmallestMinusSmallest, which doesn't have this problem. I don't think the log10SumLog10 is the right thing to do.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To be honest I don't know why I did it this way... we could do what we do with PL or perhaps even better would be to do the proper calculation that is sum (Z) of all the posteriors (before doing the -max trasformation) and report 1-Prob(GT)/Z

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Based on the definition of GQ in the VCF we are supposed to do that.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Given that we output integer GQ values (as in the spec), the "smallest minus second smallest" will be the same as 1-Prob(GT)/Z unless the PLs are [0,0,0] right?

} else if (bestGenotypeIndex == log10Posteriors.length - 1) {
return MathUtils.log10SumLog10(log10Posteriors, 0, bestGenotypeIndex);
return Math.min(0.0, MathUtils.log10SumLog10(log10Posteriors, 0, bestGenotypeIndex));
} else {
return Math.min(0.0, MathUtils.log10SumLog10(
MathUtils.log10sumLog10(log10Posteriors, 0, bestGenotypeIndex),
Expand Down
Original file line number Diff line number Diff line change
Expand Up @@ -726,6 +726,39 @@ public Object[][] makeSplitBiallelics() throws CloneNotSupportedException {
return tests.toArray(new Object[][]{});
}


@DataProvider(name = "GQLog10PosteriorsTest")
public Object[][] makeGQLog10PosteriorsTest() {
List<Object[]> tests = new ArrayList<>();

// testing the 3 allele case
tests.add(new Object[]{0, new double[]{-1.0, -2.0, -2.2}, -1.787});
tests.add(new Object[]{1, new double[]{-1.0, -2.0, -2.2}, -0.973});
tests.add(new Object[]{2, new double[]{-1.0, -2.0, -2.2}, -0.958});

// testing in the 3 allele case where the choice between two genotypes is ambiguous
tests.add(new Object[]{0, new double[]{0.0, 0.0, -0.2}, 0});
tests.add(new Object[]{1, new double[]{0.0, 0.0, -0.2}, 0});
tests.add(new Object[]{2, new double[]{0.0, 0.0, -0.2}, 0});

// testing in the 4+ allele case where the choice between two genotypes is ambiguous (if not careful this might have resulted in a negative GQ)
tests.add(new Object[]{0, new double[]{0.0, 0.0, -0.2, -0.2, -0.2, 0.0}, 0});
tests.add(new Object[]{1, new double[]{0.0, 0.0, -0.2, -0.2, -0.2, 0.0}, 0});
tests.add(new Object[]{2, new double[]{0.0, 0.0, -0.2, -0.2, -0.2, 0.0}, 0});
tests.add(new Object[]{3, new double[]{0.0, 0.0, -0.2, -0.2, -0.2, 0.0}, 0});
tests.add(new Object[]{4, new double[]{0.0, 0.0, -0.2, -0.2, -0.2, 0.0}, 0});
tests.add(new Object[]{5, new double[]{0.0, 0.0, -0.2, -0.2, -0.2, 0.0}, 0});


return tests.toArray(new Object[][]{});
}

@Test(dataProvider = "GQLog10PosteriorsTest")
public void testGetGQLog10FromPosteriors(final int bestGenotypeIndex, final double[] genotypeArray, final double expectedResult) {
final double actualResult = GATKVariantContextUtils.getGQLog10FromPosteriors(bestGenotypeIndex, genotypeArray);
assertEqualsDoubleSmart(actualResult, expectedResult);
}

// --------------------------------------------------------------------------------
//
// Test repeats
Expand Down