-
Notifications
You must be signed in to change notification settings - Fork 596
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Added the ability for Indels to be recovered from dangling heads #6113
Added the ability for Indels to be recovered from dangling heads #6113
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Mostly little stuff, with one comment that might simplify the prefix code.
...institute/hellbender/tools/walkers/haplotypecaller/readthreading/ReadThreadingAssembler.java
Outdated
Show resolved
Hide resolved
...roadinstitute/hellbender/tools/walkers/haplotypecaller/readthreading/ReadThreadingGraph.java
Outdated
Show resolved
Hide resolved
...roadinstitute/hellbender/tools/walkers/haplotypecaller/readthreading/ReadThreadingGraph.java
Outdated
Show resolved
Hide resolved
...roadinstitute/hellbender/tools/walkers/haplotypecaller/readthreading/ReadThreadingGraph.java
Outdated
Show resolved
Hide resolved
src/test/java/org/broadinstitute/hellbender/tools/walkers/mutect/Mutect2IntegrationTest.java
Outdated
Show resolved
Hide resolved
...roadinstitute/hellbender/tools/walkers/haplotypecaller/readthreading/ReadThreadingGraph.java
Outdated
Show resolved
Hide resolved
} | ||
|
||
// /** | ||
// * Determine the maximum number of mismatches permitted on the branch. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
delete?
...roadinstitute/hellbender/tools/walkers/haplotypecaller/readthreading/ReadThreadingGraph.java
Outdated
Show resolved
Hide resolved
...roadinstitute/hellbender/tools/walkers/haplotypecaller/readthreading/ReadThreadingGraph.java
Outdated
Show resolved
Hide resolved
} | ||
// if we got here then we hit the max index | ||
return lastGoodIndex; | ||
if (matchesSinceLastMismatch >= getMinMatchingBases) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Instead of tracking matchesSinceLastMismatch
as you go, why not just find the two offsets and then backtrack from there in both sequences, looking for the required number of exact matches?
890383d
to
39a0d04
Compare
@davidbenjamin I have refactored this branch to account for changes to the codebase adjacent to this code. In the interest of not possibly harming any of the old results I have made this a toggle and I have also made the setting apply symmetrically to tails and heads and added a few simple tests in the existing framework. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I propose a simplification to the code. Do you think it works?
...itute/hellbender/tools/walkers/haplotypecaller/readthreading/AbstractReadThreadingGraph.java
Show resolved
Hide resolved
...itute/hellbender/tools/walkers/haplotypecaller/readthreading/AbstractReadThreadingGraph.java
Outdated
Show resolved
Hide resolved
...itute/hellbender/tools/walkers/haplotypecaller/readthreading/AbstractReadThreadingGraph.java
Outdated
Show resolved
Hide resolved
/** | ||
* Finds the index of the best extent of the prefix match between the provided paths, for dangling head merging. | ||
* Requires that at a minimum there are at least #getMinMatchingBases() matches between the reference and the read | ||
* at the end in order to emit an alignmet offset. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
typo
* @return an integer pair object where the key is the offset into path1 and the value is offset into path2 (both -1 if no path is found) | ||
*/ | ||
private Pair<Integer, Integer> bestPrefixMatch(final List<CigarElement> cigarElements, final byte[] path1, final byte[] path2) { | ||
final int getMinMatchingBases = getMinMatchingBases(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
inline this?
// we want to paste the base before the last read base, thus we need to handle it in the next iteration when we find a mismatch | ||
boolean wasLastIndexAMismatch = false; | ||
|
||
for (final CigarElement ce : cigarElements) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It looks like this can be made much simpler by traversing from the end of the paths. I think the pseudocode in this case becomes:
- go backwards from the right until you hit a mismatch or indel.
- if this covered fewer than
getMinMatchingBases()
return (-1,-1). - otherwise return the read and ref position you got to.
In code this looks something like
refIdx = path1.length - 1;
readIdx = path2.length - 1;
for (final CigarElement ce : Lists.reverse(cigarElements)) {
if (!(ce.consumesReadBases() && ce.consumesRefBases()) {
break;
}
for (int j = 0; j < elementLength; j++, refIdx—; readIndx—) {
if (path1[refIdx] != path2[readIdx]) {
break;
}
}
}
final int matches = path2.length - readIdx;
return matches < minimum ? (-1,-1) : (readIdx, refIdx);
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Right, your previous comment got lost in the rebase. I just checked and there is a problem with this, namely that there is no guarantee that the ref path and read path match in length. Thus in order to match the bases at the end of the dangling path I have at the very least parse through the cigar once forwards and then start walking it backwards... Perhaps that is still cleaner than what I wrote though...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It looks like we excise the trailing deletions that might be useable for quickly calculating the end base this at on this line: return new DanglingChainMergeHelper(altPath, refPath, altBases, refBases, AlignmentUtils.removeTrailingDeletions(alignment.getCigar()));
Back to @jamesemery |
1ee14e0
to
e8c4972
Compare
e8c4972
to
fa3240a
Compare
@@ -93,18 +95,25 @@ public ReadThreadingAssembler(final int maxAllowedPathsForReadThreadingAssembler | |||
chainPruner = useAdaptivePruning ? new AdaptiveChainPruner<>(initialErrorRateForPruning, pruningLogOddsThreshold, pruningSeedingLogOddsThreshold, maxUnprunedVariants) : | |||
new LowWeightChainPruner<>(pruneFactor); | |||
numBestHaplotypesPerGraph = maxAllowedPathsForReadThreadingAssembler; | |||
this.minMachingBasesToDanglngEndRecovery = minMachingBasesToDanglngEndRecovery; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fix spelling. change to minMatchingBasesToDanglingEndRecovery
@davidbenjamin Do you think you can take another look at this? It would be nice to finally merge this code into the next release next week. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@jamesemery Looks good.
In helping @bhanugandham figure out why a particular site was failing it became apparent that merging dangling head code was failing to recover deletions in the dangling head. Furthermore there is some code in the dangling end recovery code that asserts a certain high standard of matching (usually 1 but sometimes dangling branch length/kmersize)
getMaxMismatches(final int lengthOfDanglingBranch)
. Both of these facts seem likely to cause dangling heads to be dropped despite their being still potentially informative, particularly the indel code.I have added the ability for the index recovery code to account for the cigar string when merging dangling ends. Addtionally rather than counting mismatches to reject the branch it simply requires a minimum matching end (which can be changed, I suspect this is where the lionshare of the differences come from). Unfortunately changing the tests is non-trivial (as this happened to change the integration test results for HaplotypeCaller at a few sites) so I wanted to get this branch up to solicit advice a to whether it is worth pursuing this fix. @davidbenjamin @ldgauthier @droazen