Fixed bugs and simplified AlleleLikelihoods evidence-to-index cache #6593
Conversation
@jamesemery and @vruano, could you two review when you get a chance? Thanks!
Thank you for the fix. I agree that invalidating the broken cache is probably the simplest and safest thing to do here. I do think we should go a step further and add an invalidateCache() call to all methods that modify the contents of the object.
As for testing, I think we need something. I trust that you have checked the originally broken site, but I suspect it will be difficult to construct a test triggering the exact downsampling case that clued us in to this issue. At the very least we should add a unit test that generates the evidenceIndexBySampleIndex cache, then calls marginalize() (both types) and asserts that we have emptied the cache. I would do the same for appendEvidence() and addMissingAlleles(). We can add a dummy exposed method "hasFilledCache()" to facilitate this test. This isn't perfect, but this test plus some comment warnings about the importance of cache invalidation throughout the class might be enough to prevent issues in the future.
numberOfEvidences[sampleIndex] = newEvidenceCount;

// invalidate the cached evidence to index map
evidenceIndexBySampleIndex.set(sampleIndex, null);
Make a method .invalidateCache() that gets called here and everywhere else we edit the evidence lists.
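A minimal, self-contained sketch of the suggested helper. The names here mirror the real class but are illustrative; a plain HashMap stands in for fastutil's Object2IntMap so the sketch compiles on its own:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Illustrative stand-in for the per-sample evidence-to-index cache in
// AlleleLikelihoods; not the real class, just the invalidation pattern.
public class CacheDemo {
    private final List<Map<String, Integer>> evidenceIndexBySampleIndex = new ArrayList<>();

    public CacheDemo(final int sampleCount) {
        for (int s = 0; s < sampleCount; s++) {
            evidenceIndexBySampleIndex.add(new HashMap<>());
        }
    }

    // Null out the cached map for one sample; it would be lazily rebuilt on next use.
    void invalidateEvidenceToIndexCache(final int sampleIndex) {
        evidenceIndexBySampleIndex.set(sampleIndex, null);
    }

    boolean evidenceToIndexCacheIsFilled(final int sampleIndex) {
        return evidenceIndexBySampleIndex.get(sampleIndex) != null;
    }

    public static void main(String[] args) {
        final CacheDemo demo = new CacheDemo(2);
        demo.invalidateEvidenceToIndexCache(1);
        System.out.println(demo.evidenceToIndexCacheIsFilled(0)); // true
        System.out.println(demo.evidenceToIndexCacheIsFilled(1)); // false
    }
}
```

Every mutating method would call the invalidate helper, so no mutation path can leave a stale map behind.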
Furthermore, we should call invalidateCache() for all operations that mutate the sample arrays (i.e., for adding and removing samples).
Done, but I didn't call the invalidate method for the case of adding evidence, for which updating the cache on the fly is easy.
And I could just be really tired but I don't think there are any methods to add or remove samples.
@@ -1129,60 +1129,45 @@ private void removeEvidence(final int sampleIndex, final Collection<EVIDENCE> ev
    removeEvidenceByIndex(sampleIndex, indexesToRemove);
}

// remove evidence and unset the {@code evidenceIndexBySampleIndex} Map for this sample
Add comments to evidenceIndexBySampleIndex.set(sampleIndex, null); and to the class javadoc explaining how this cache works and what it accomplishes.
done
It's simpler than this because allele operations such as […]. I will try to write the test for removing evidence tomorrow. Tempting to try tonight, but I'm trying to accept the reality that working until 2 am is a bad idea.
Back to @jamesemery.
I suggest a few changes, but I don't see any potential bugs, and I'm not sure the changes I proposed would result in any CPU gain, so I'm happy for you to ignore them and go ahead with the PR.
src/main/java/org/broadinstitute/hellbender/utils/genotyper/AlleleLikelihoods.java (outdated; comment resolved)
    } else {
        sampleEvidenceIndex.put(newEvidence, previousValue); // revert
    }
}

numberOfEvidences[sampleIndex] = sampleEvidence.size();
nextIndex contains the value you need here. Perhaps it would be better to call this variable currentSize.
I agree, but now that you mention it I think it's even better to drop nextIndex entirely and replace it with sampleEvidence.size().
numberOfEvidences[sampleIndex] = sampleEvidence.size();
return actuallyAdded;
return sampleEvidence.size() - previousEvidenceCount;
return nextIndex - previousEvidenceCount;
True, but upon further thought I realized this method should be void because its return value is not really needed.
src/main/java/org/broadinstitute/hellbender/utils/genotyper/AlleleLikelihoods.java (outdated; comment resolved)
// update the list of evidence and evidence count
final List<EVIDENCE> oldEvidence = evidenceBySampleIndex.get(sampleIndex);
final List<EVIDENCE> newEvidence = new ArrayList<>(newEvidenceCount);
for (int n = 0, numRemoved = 0; n < oldEvidenceCount; n++) {
You can be a bit more efficient by putting the next index to skip "on deck":
for (int n = 0, numRemoved = 0, nextToRemove = evidencesToRemove[0]; n < oldEvidenceCount; n++) {
if (n == nextToRemove) {
nextToRemove = ++numRemoved < evidencesToRemove.length ? evidencesToRemove[numRemoved] : -1;
} else {
newEvidence.add(oldEvidence.get(n));
}
}
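For reference, here is a runnable version of the "on-deck" idea above as a standalone helper. The method and variable names are mine, not from the PR; it assumes the indices to remove are sorted, distinct, and non-empty, as in the surrounding code:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class OnDeckRemoval {
    // Copy 'oldEvidence' while skipping the sorted, distinct indices in 'indicesToRemove'.
    // The next index to skip is kept "on deck" so each iteration does one comparison.
    static <T> List<T> removeByIndex(final List<T> oldEvidence, final int[] indicesToRemove) {
        final List<T> newEvidence = new ArrayList<>(oldEvidence.size() - indicesToRemove.length);
        int numRemoved = 0;
        int nextToRemove = indicesToRemove[0]; // assumes a non-empty removal set
        for (int n = 0; n < oldEvidence.size(); n++) {
            if (n == nextToRemove) {
                nextToRemove = ++numRemoved < indicesToRemove.length ? indicesToRemove[numRemoved] : -1;
            } else {
                newEvidence.add(oldEvidence.get(n));
            }
        }
        return newEvidence;
    }

    public static void main(String[] args) {
        final List<String> reads = Arrays.asList("r0", "r1", "r2", "r3", "r4");
        System.out.println(removeByIndex(reads, new int[]{1, 3})); // [r0, r2, r4]
    }
}
```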
That's pretty neat but I think I'll keep the loop simpler as-is.
src/main/java/org/broadinstitute/hellbender/utils/genotyper/AlleleLikelihoods.java (outdated; comment resolved)
src/main/java/org/broadinstitute/hellbender/utils/genotyper/AlleleLikelihoods.java (outdated; comment resolved)
final Object2IntMap<EVIDENCE> index = new Object2IntOpenHashMap<>(sampleEvidenceCount);
index.defaultReturnValue(MISSING_INDEX);
evidenceIndexBySampleIndex.set(sampleIndex, index);
for (int r = 0; r < sampleEvidenceCount; r++) {
What about using a foreach:
int nextIdx = 0;
for( final EVIDENCE evi : sampleEvidence) {
index.put(evi, nextIdx++);
}
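A self-contained version of that foreach suggestion, for anyone reading along. A plain HashMap with getOrDefault stands in for fastutil's Object2IntOpenHashMap and its defaultReturnValue; the class and method names are illustrative:

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class IndexBuild {
    static final int MISSING_INDEX = -1;

    // Build the evidence-to-index map with a foreach, tracking the index
    // in a counter declared outside the loop (the point of contention above).
    static <E> Map<E, Integer> buildIndex(final List<E> sampleEvidence) {
        final Map<E, Integer> index = new HashMap<>(sampleEvidence.size());
        int nextIdx = 0;
        for (final E evi : sampleEvidence) {
            index.put(evi, nextIdx++);
        }
        return index;
    }

    public static void main(String[] args) {
        final Map<String, Integer> index = buildIndex(Arrays.asList("readA", "readB"));
        System.out.println(index.get("readB"));                         // 1
        System.out.println(index.getOrDefault("readC", MISSING_INDEX)); // -1
    }
}
```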
I like the foreach, but my distaste for introducing a variable outside the scope of the for loop is stronger.
}

@VisibleForTesting
boolean evidenceToIndexCacheIsFilled(final int sampleIndex) {
I like ...IsPresent more than ...IsFilled, but my mother tongue is not English.
@@ -174,11 +174,19 @@ public void testFilterPoorlyModeledReads(final String[] samples, final Allele[]
final AlleleLikelihoods<GATKRead, Allele> original = makeGoodAndBadLikelihoods(samples, alleles, reads);

final AlleleLikelihoods<GATKRead, Allele> result = makeGoodAndBadLikelihoods(samples, alleles, reads);

// fill the evidence-to-index cache now to check that it is invalidated below
IMO this cache is an implementation detail, and calling code is not supposed to know or care about it. I don't think we need to check the inner state of the likelihoods collection this way, which exposes its workings for the sake of testing. Instead, you could check that the indices are consistent after several mutating operations.
done
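The kind of consistency check suggested above can be sketched with a toy model of the pattern rather than the real AlleleLikelihoods (all names here are illustrative): mutate the collection, then verify the lazily rebuilt index still agrees with the evidence list.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Toy model of the fix: an evidence list plus a lazily rebuilt index cache.
public class ConsistencyDemo {
    private final List<String> evidence = new ArrayList<>(Arrays.asList("r0", "r1", "r2"));
    private Map<String, Integer> cache = null; // null means invalidated

    int indexOfEvidence(final String e) {
        if (cache == null) { // lazily rebuild after invalidation
            cache = new HashMap<>();
            for (int i = 0; i < evidence.size(); i++) {
                cache.put(evidence.get(i), i);
            }
        }
        return cache.getOrDefault(e, -1);
    }

    void removeEvidence(final String e) {
        evidence.remove(e);
        cache = null; // the fix: invalidate instead of patching the map in place
    }

    public static void main(String[] args) {
        final ConsistencyDemo demo = new ConsistencyDemo();
        System.out.println(demo.indexOfEvidence("r2")); // 2, fills the cache
        demo.removeEvidence("r1");
        System.out.println(demo.indexOfEvidence("r2")); // 1, consistent after the mutation
        System.out.println(demo.indexOfEvidence("r1")); // -1, removed
    }
}
```

A test in this style exercises only public behavior, so it keeps working even if the caching strategy changes.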
Done with Valentin's comments.
These changes look good to me and I like the sanity check test being added to the other tests.
Closes #6586. @droazen

AlleleLikelihoods caches the evidence-to-index Map. The previous implementation tried to update this map on the fly whenever evidence was removed. The new approach is to simply invalidate the cache and let the existing code that generates it run later. I don't expect this to cause performance problems for a few reasons: