Fixed bugs and simplified AlleleLikelihoods evidence-to-index cache #6593

Merged: 3 commits into master, May 26, 2020

Conversation

davidbenjamin (Contributor)

Closes #6586. @droazen

AlleleLikelihoods caches the evidence-to-index Map. The previous implementation tried to update this map on the fly whenever evidence was removed. The new approach is simply to invalidate the cache and let the existing code that generates it run later, on demand; a minimal sketch of the pattern follows the list below.

I don't expect this to cause performance problems for a few reasons:

  1. It only applies when we're doing contamination downsampling.
  2. It may save time whenever evidence is removed and we don't need the evidence-to-index map later.
  3. Regenerating the cache is O(N), but so is updating it on the fly, even when only one read is removed.
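
For illustration, here is a minimal sketch of the invalidate-and-regenerate pattern. All names are hypothetical and do not match the actual AlleleLikelihoods implementation.

import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Minimal sketch only; names are illustrative, not the real GATK fields.
class EvidenceIndexCacheSketch<E> {
    private final List<E> evidence;
    private Map<E, Integer> evidenceToIndex = null; // null means "cache not built"

    EvidenceIndexCacheSketch(final List<E> initialEvidence) {
        evidence = new ArrayList<>(initialEvidence);
    }

    void removeEvidence(final E e) {
        evidence.remove(e);
        evidenceToIndex = null; // invalidate rather than patch indices in place
    }

    int indexOfEvidence(final E e) {
        if (evidenceToIndex == null) { // regenerate lazily, O(N), only when needed
            evidenceToIndex = new HashMap<>(evidence.size());
            for (int i = 0; i < evidence.size(); i++) {
                evidenceToIndex.put(evidence.get(i), i);
            }
        }
        return evidenceToIndex.getOrDefault(e, -1); // -1 when evidence is absent
    }
}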

droazen (Contributor) commented May 11, 2020

@jamesemery and @vruano , could you two review when you get a chance? Thanks!

droazen requested review from jamesemery and vruano and removed the review request for droazen, May 11, 2020 14:48
droazen assigned jamesemery and vruano and unassigned droazen, May 11, 2020
jamesemery (Collaborator) left a comment

Thank you for the fix. I agree that invalidating the broken cache is probably the simplest and safest thing to do here. I do think we should go a step further and add an invalidateCache() call to all methods that modify the contents of the object.

As for testing, I think we need something. I trust that you have checked the originally broken site, but I suspect it will be difficult to construct a test triggering the exact downsampling case that clued us in to this issue. At the very least we should add a unit test that generates the evidenceIndexBySampleIndex cache, then calls marginalize() (both types) and asserts that we have emptied the cache. I would do the same for appendEvidence() and addMissingAlleles(). We can add a dummy exposed method hasFilledCache() to facilitate this test; a rough shape for such a test is sketched below. This isn't perfect, but this test plus some warning comments about the importance of cache invalidation throughout the class might be enough to prevent issues in the future.
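
For concreteness, here is one possible shape for such a test, written against the illustrative sketch class from the PR description rather than the real AlleleLikelihoods. It assumes the sketch gains the proposed dummy accessor, boolean hasFilledCache() { return evidenceToIndex != null; }.

import java.util.Arrays;
import org.testng.Assert;
import org.testng.annotations.Test;

// Sketch of the proposed test; class and method names are illustrative.
public class EvidenceIndexCacheSketchTest {
    @Test
    public void testMutationInvalidatesEvidenceToIndexCache() {
        final EvidenceIndexCacheSketch<String> sketch =
                new EvidenceIndexCacheSketch<>(Arrays.asList("read1", "read2", "read3"));
        Assert.assertEquals(sketch.indexOfEvidence("read3"), 2); // forces the cache to fill
        Assert.assertTrue(sketch.hasFilledCache());
        sketch.removeEvidence("read1"); // mutating operation under test
        Assert.assertFalse(sketch.hasFilledCache()); // cache was invalidated...
        Assert.assertEquals(sketch.indexOfEvidence("read3"), 1); // ...and regenerates correctly
    }
}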

numberOfEvidences[sampleIndex] = newEvidenceCount;

// invalidate the cached evidence to index map
evidenceIndexBySampleIndex.set(sampleIndex, null);
jamesemery (Collaborator)

Make a method .invalidateCache() that gets called here and everywhere else we edit the evidence lists.

jamesemery (Collaborator)

Furthermore, we should call invalidateCache() for all operations that mutate the sample arrays (so for adding samples and removing samples).

davidbenjamin (Author)

Done, but I didn't call the invalidate method for the case of adding evidence, for which updating the cache on the fly is easy.

And I could just be really tired, but I don't think there are any methods to add or remove samples.

@@ -1129,60 +1129,45 @@ private void removeEvidence(final int sampleIndex, final Collection<EVIDENCE> ev
removeEvidenceByIndex(sampleIndex, indexesToRemove);
}

// remove evidence and unset the {@code evidenceIndexBySampleIndex} Map for this sample
jamesemery (Collaborator)

Add some comments at evidenceIndexBySampleIndex.set(sampleIndex, null); and in the class javadoc explaining how this cache works and what it accomplishes; something like the sketch below.
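
For instance, along these lines (illustrative wording only, not the comment that was actually committed):

/**
 * Cached map from evidence to its index within {@code evidenceBySampleIndex}.
 * A null entry for a sample means its cache has been invalidated and will be
 * regenerated lazily on the next lookup. Every method that mutates a sample's
 * evidence list must either update this cache in place or set it to null.
 */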

davidbenjamin (Author)

done

davidbenjamin (Author)

> At the very least we should add a unit test that generates the evidenceIndexBySampleIndex cache, then calls marginalize() (both types) and asserts that we have emptied the cache. I would do the same for appendEvidence() and addMissingAlleles()

It's simpler than this because allele operations such as marginalize() and addMissingAlleles() don't modify the evidence list. While they require care with the likelihoods arrays, they don't require anything at all from the evidence-to-index caches. As I mentioned above, I left the cache updating in appendEvidence() as it was because it was so simple; a sketch of that constant-time update follows.
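
In sketch form (again with the illustrative names from the PR description, not the real GATK code), the append case can keep the cache valid with a constant-time update instead of an invalidation:

// Hypothetical append that maintains the cache in place; since we append to
// the end of the list, the new element's index is simply the old size.
void addEvidence(final E e) {
    evidence.add(e);
    if (evidenceToIndex != null) {
        evidenceToIndex.put(e, evidence.size() - 1); // O(1) cache maintenance
    }
}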

I will try to write the test for removing evidence tomorrow. Tempting to try tonight, but I'm trying to accept the reality that working until 2 am is a bad idea.

davidbenjamin (Author)

Back to @jamesemery.

vruano (Contributor) left a comment

I suggest a few changes, but I don't see any potential bugs, and I'm not sure the changes I proposed would result in any CPU gain, so I'm happy for you to ignore them and go ahead with the PR.

} else {
sampleEvidenceIndex.put(newEvidence, previousValue); // revert
}
}

numberOfEvidences[sampleIndex] = sampleEvidence.size();
vruano (Contributor)

nextIndex contains the value you need here. Perhaps it would be better to call this variable currentSize.

davidbenjamin (Author)

I agree, but now that you mention it I think it's even better to drop nextIndex entirely and replace it with sampleEvidence.size().

numberOfEvidences[sampleIndex] = sampleEvidence.size();
-    return actuallyAdded;
+    return sampleEvidence.size() - previousEvidenceCount;
vruano (Contributor)

return nextIndex - previousEvidenceCount;

davidbenjamin (Author)

True, but upon further thought I realized this method should be void because its return value is not really needed.

// update the list of evidence and evidence count
final List<EVIDENCE> oldEvidence = evidenceBySampleIndex.get(sampleIndex);
final List<EVIDENCE> newEvidence = new ArrayList<>(newEvidenceCount);
for (int n = 0, numRemoved = 0; n < oldEvidenceCount; n++) {
vruano (Contributor)

You can be a bit more efficient by keeping the next index to skip "on deck":

// keep the next index slated for removal "on deck" so the loop body is one comparison
for (int n = 0, numRemoved = 0, nextToRemove = evidencesToRemove.length > 0 ? evidencesToRemove[0] : -1; n < oldEvidenceCount; n++) {
     if (n == nextToRemove) {
         // advance to the next index to remove, or -1 once all removals are done
         nextToRemove = ++numRemoved < evidencesToRemove.length ? evidencesToRemove[numRemoved] : -1;
     } else {
         newEvidence.add(oldEvidence.get(n));
     }
}

davidbenjamin (Author)

That's pretty neat but I think I'll keep the loop simpler as-is.

final Object2IntMap<EVIDENCE> index = new Object2IntOpenHashMap<>(sampleEvidenceCount);
index.defaultReturnValue(MISSING_INDEX);
evidenceIndexBySampleIndex.set(sampleIndex, index);
for (int r = 0; r < sampleEvidenceCount; r++) {
vruano (Contributor)

What about using a foreach:

int nextIdx = 0;
for( final EVIDENCE evi : sampleEvidence) {
    index.put(evi, nextIdx++);
}

davidbenjamin (Author)

I like the foreach, but my distaste for introducing a variable outside the scope of the for loop is stronger.

}

@VisibleForTesting
boolean evidenceToIndexCacheIsFilled(final int sampleIndex) {
vruano (Contributor)

I like ...IsPresent more than ...IsFilled, but English is not my mother tongue.

@@ -174,11 +174,19 @@ public void testFilterPoorlyModeledReads(final String[] samples, final Allele[]
final AlleleLikelihoods<GATKRead, Allele> original = makeGoodAndBadLikelihoods(samples, alleles, reads);

final AlleleLikelihoods<GATKRead, Allele> result = makeGoodAndBadLikelihoods(samples, alleles, reads);

// fill the evidence-to-index cache now to check that it is invalidated below
vruano (Contributor)

IMO this cache is an implementation detail, and calling code is not supposed to know or care about it. I don't think we need to check the inner state of the likelihoods collection this way; it exposes the class's inner workings for the sake of this one test.

Instead you could focus on checking that the indices are consistent after several mutating operations; a rough sketch of that style of check follows.
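
In that spirit, a consistency-style check against the illustrative sketch class from the PR description would look roughly like this, asserting only on observable behavior:

// Verify index consistency after a mutation without inspecting cache state.
final EvidenceIndexCacheSketch<String> sketch =
        new EvidenceIndexCacheSketch<>(Arrays.asList("r1", "r2", "r3"));
sketch.removeEvidence("r2");
Assert.assertEquals(sketch.indexOfEvidence("r1"), 0);  // indices stay dense
Assert.assertEquals(sketch.indexOfEvidence("r3"), 1);  // and keep their order
Assert.assertEquals(sketch.indexOfEvidence("r2"), -1); // removed evidence is gone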

davidbenjamin (Author)

done

davidbenjamin (Author)

Done with Valentin's comments.

jamesemery (Collaborator) left a comment

These changes look good to me, and I like the sanity-check test being added to the other tests.

davidbenjamin merged commit ee56f27 into master on May 26, 2020
davidbenjamin deleted the db_allele_likelihoods_cache branch on September 3, 2020 15:13
Successfully merging this pull request may close these issues.

Bug in HaplotypeCaller 'evidence provided is not in sample'