Fixed bugs and simplified AlleleLikelihoods evidence-to-index cache #6593

Merged: 3 commits into master, May 26, 2020

Conversation

davidbenjamin (Contributor)

Closes #6586. @droazen

AlleleLikelihoods caches the evidence-to-index Map. The previous implementation tried to update this map on the fly whenever evidence was removed. The new approach is simply to invalidate the cache and let the existing code that generates it run later, on demand; a minimal sketch of the pattern follows the list below.

I don't expect this to cause performance problems for a few reasons:

  1. It only applies when we're doing contamination downsampling.
  2. It may save time whenever evidence is removed and we don't need the evidence-to-index map later.
  3. Regenerating the cache is O(N), but so is updating it on the fly, even when only one read is removed.
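
For illustration, here is a minimal sketch of the invalidate-and-regenerate pattern. All names are hypothetical and do not match the actual AlleleLikelihoods implementation.

import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Minimal sketch only; names are illustrative, not the real GATK fields.
class EvidenceIndexCacheSketch<E> {
    private final List<E> evidence;
    private Map<E, Integer> evidenceToIndex = null; // null means "cache not built"

    EvidenceIndexCacheSketch(final List<E> initialEvidence) {
        evidence = new ArrayList<>(initialEvidence);
    }

    void removeEvidence(final E e) {
        evidence.remove(e);
        evidenceToIndex = null; // invalidate rather than patch indices in place
    }

    int indexOfEvidence(final E e) {
        if (evidenceToIndex == null) { // regenerate lazily, O(N), only when needed
            evidenceToIndex = new HashMap<>(evidence.size());
            for (int i = 0; i < evidence.size(); i++) {
                evidenceToIndex.put(evidence.get(i), i);
            }
        }
        return evidenceToIndex.getOrDefault(e, -1); // -1 when evidence is absent
    }
}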

droazen (Contributor) commented May 11, 2020

@jamesemery and @vruano , could you two review when you get a chance? Thanks!

droazen requested review from jamesemery and vruano and removed the review request for droazen, May 11, 2020 14:48
droazen assigned jamesemery and vruano and unassigned droazen, May 11, 2020
jamesemery (Collaborator) left a comment

Thank you for the fix. I agree that invalidating the broken cache is probably the simplest and safest thing to do here. I do think we should go a step further and add an invalidateCache() call to all methods that modify the contents of the object.

As for testing, I think we need something. I trust that you have checked the originally broken site, but I suspect it will be difficult to construct a test triggering the exact downsampling case that clued us in to this issue. At the very least we should add a unit test that generates the evidenceIndexBySampleIndex cache, then calls marginalize() (both types) and asserts that we have emptied the cache. I would do the same for appendEvidence() and addMissingAlleles(). We can add a dummy exposed method hasFilledCache() to facilitate this test; a rough shape for such a test is sketched below. This isn't perfect, but this test plus some warning comments about the importance of cache invalidation throughout the class might be enough to prevent issues in the future.
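
For concreteness, here is one possible shape for such a test, written against the illustrative sketch class from the PR description rather than the real AlleleLikelihoods. It assumes the sketch gains the proposed dummy accessor, boolean hasFilledCache() { return evidenceToIndex != null; }.

import java.util.Arrays;
import org.testng.Assert;
import org.testng.annotations.Test;

// Sketch of the proposed test; class and method names are illustrative.
public class EvidenceIndexCacheSketchTest {
    @Test
    public void testMutationInvalidatesEvidenceToIndexCache() {
        final EvidenceIndexCacheSketch<String> sketch =
                new EvidenceIndexCacheSketch<>(Arrays.asList("read1", "read2", "read3"));
        Assert.assertEquals(sketch.indexOfEvidence("read3"), 2); // forces the cache to fill
        Assert.assertTrue(sketch.hasFilledCache());
        sketch.removeEvidence("read1"); // mutating operation under test
        Assert.assertFalse(sketch.hasFilledCache()); // cache was invalidated...
        Assert.assertEquals(sketch.indexOfEvidence("read3"), 1); // ...and regenerates correctly
    }
}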

numberOfEvidences[sampleIndex] = newEvidenceCount;

// invalidate the cached evidence to index map
evidenceIndexBySampleIndex.set(sampleIndex, null);
jamesemery (Collaborator)

Make a method .invalidateCache() that gets called here and everywhere else we edit the evidence lists.

jamesemery (Collaborator)

Furthermore, we should call invalidateCache() for all operations that mutate the sample arrays (so for adding samples and removing samples).

davidbenjamin (Author)

Done, but I didn't call the invalidate method for the case of adding evidence, for which updating the cache on the fly is easy.

And I could just be really tired, but I don't think there are any methods to add or remove samples.

@@ -1129,60 +1129,45 @@ private void removeEvidence(final int sampleIndex, final Collection<EVIDENCE> ev
removeEvidenceByIndex(sampleIndex, indexesToRemove);
}

// remove evidence and unset the {@code evidenceIndexBySampleIndex} Map for this sample
jamesemery (Collaborator)

Add some comments at evidenceIndexBySampleIndex.set(sampleIndex, null); and in the class javadoc explaining how this cache works and what it accomplishes; something like the sketch below.
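
For instance, along these lines (illustrative wording only, not the comment that was actually committed):

/**
 * Cached map from evidence to its index within {@code evidenceBySampleIndex}.
 * A null entry for a sample means its cache has been invalidated and will be
 * regenerated lazily on the next lookup. Every method that mutates a sample's
 * evidence list must either update this cache in place or set it to null.
 */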

davidbenjamin (Author)

done

davidbenjamin (Author)

> At the very least we should add a unit test that generates the evidenceIndexBySampleIndex cache, then calls marginalize() (both types) and asserts that we have emptied the cache. I would do the same for appendEvidence() and addMissingAlleles()

It's simpler than this because allele operations such as marginalize() and addMissingAlleles() don't modify the evidence list. While they require care with the likelihoods arrays, they don't require anything at all from the evidence-to-index caches. As I mentioned above, I left the cache updating in appendEvidence() as it was because it was so simple; a sketch of that constant-time update follows.
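
In sketch form (again with the illustrative names from the PR description, not the real GATK code), the append case can keep the cache valid with a constant-time update instead of an invalidation:

// Hypothetical append that maintains the cache in place; since we append to
// the end of the list, the new element's index is simply the old size.
void addEvidence(final E e) {
    evidence.add(e);
    if (evidenceToIndex != null) {
        evidenceToIndex.put(e, evidence.size() - 1); // O(1) cache maintenance
    }
}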

I will try to write the test for removing evidence tomorrow. Tempting to try tonight, but I'm trying to accept the reality that working until 2 am is a bad idea.

davidbenjamin (Author)

Back to @jamesemery.

vruano (Contributor) left a comment

I suggest a few changes, but I don't see any potential bugs, and I'm not sure the changes I proposed would result in any CPU gain, so I'm happy for you to ignore them and go ahead with the PR.

} else {
sampleEvidenceIndex.put(newEvidence, previousValue); // revert
}
}

numberOfEvidences[sampleIndex] = sampleEvidence.size();
vruano (Contributor)

nextIndex contains the value you need here. Perhaps it would be better to call this variable currentSize.

davidbenjamin (Author)

I agree, but now that you mention it I think it's even better to drop nextIndex entirely and replace it with sampleEvidence.size().

numberOfEvidences[sampleIndex] = sampleEvidence.size();
-    return actuallyAdded;
+    return sampleEvidence.size() - previousEvidenceCount;
vruano (Contributor)

return nextIndex - previousEvidenceCount;

davidbenjamin (Author)

True, but upon further thought I realized this method should be void because its return value is not really needed.

// update the list of evidence and evidence count
final List<EVIDENCE> oldEvidence = evidenceBySampleIndex.get(sampleIndex);
final List<EVIDENCE> newEvidence = new ArrayList<>(newEvidenceCount);
for (int n = 0, numRemoved = 0; n < oldEvidenceCount; n++) {
vruano (Contributor)

You can be a bit more efficient by keeping the next index to skip "on deck":

// keep the next index slated for removal "on deck" so the loop body is one comparison
for (int n = 0, numRemoved = 0, nextToRemove = evidencesToRemove.length > 0 ? evidencesToRemove[0] : -1; n < oldEvidenceCount; n++) {
     if (n == nextToRemove) {
         // advance to the next index to remove, or -1 once all removals are done
         nextToRemove = ++numRemoved < evidencesToRemove.length ? evidencesToRemove[numRemoved] : -1;
     } else {
         newEvidence.add(oldEvidence.get(n));
     }
}

davidbenjamin (Author)

That's pretty neat but I think I'll keep the loop simpler as-is.

final Object2IntMap<EVIDENCE> index = new Object2IntOpenHashMap<>(sampleEvidenceCount);
index.defaultReturnValue(MISSING_INDEX);
evidenceIndexBySampleIndex.set(sampleIndex, index);
for (int r = 0; r < sampleEvidenceCount; r++) {
vruano (Contributor)

What about using a foreach:

int nextIdx = 0;
for( final EVIDENCE evi : sampleEvidence) {
    index.put(evi, nextIdx++);
}

davidbenjamin (Author)

I like the foreach, but my distaste for introducing a variable outside the scope of the for loop is stronger.

}

@VisibleForTesting
boolean evidenceToIndexCacheIsFilled(final int sampleIndex) {
vruano (Contributor)

I like ...IsPresent more than ...IsFilled, but English is not my mother tongue.

@@ -174,11 +174,19 @@ public void testFilterPoorlyModeledReads(final String[] samples, final Allele[]
final AlleleLikelihoods<GATKRead, Allele> original = makeGoodAndBadLikelihoods(samples, alleles, reads);

final AlleleLikelihoods<GATKRead, Allele> result = makeGoodAndBadLikelihoods(samples, alleles, reads);

// fill the evidence-to-index cache now to check that it is invalidated below
vruano (Contributor)

IMO this cache is an implementation detail, and calling code is not supposed to know or care about it. I don't think we need to check the inner state of the likelihoods collection this way; it exposes the class's inner workings for the sake of this one test.

Instead you could focus on checking that the indices are consistent after several mutating operations; a rough sketch of that style of check follows.
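
In that spirit, a consistency-style check against the illustrative sketch class from the PR description would look roughly like this, asserting only on observable behavior:

// Verify index consistency after a mutation without inspecting cache state.
final EvidenceIndexCacheSketch<String> sketch =
        new EvidenceIndexCacheSketch<>(Arrays.asList("r1", "r2", "r3"));
sketch.removeEvidence("r2");
Assert.assertEquals(sketch.indexOfEvidence("r1"), 0);  // indices stay dense
Assert.assertEquals(sketch.indexOfEvidence("r3"), 1);  // and keep their order
Assert.assertEquals(sketch.indexOfEvidence("r2"), -1); // removed evidence is gone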

davidbenjamin (Author)

done

davidbenjamin (Author)

Done with Valentin's comments.

jamesemery (Collaborator) left a comment

These changes look good to me, and I like the sanity-check test being added to the other tests.

davidbenjamin merged commit ee56f27 into master on May 26, 2020
davidbenjamin deleted the db_allele_likelihoods_cache branch on September 3, 2020 15:13
Successfully merging this pull request may close these issues.

Bug in HaplotypeCaller 'evidence provided is not in sample'