Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Long Read Giraffe #3700

Merged
merged 104 commits into from
Jul 29, 2022
Merged
Show file tree
Hide file tree
Changes from 1 commit
Commits
Show all changes
104 commits
Select commit Hold shift + click to select a range
d64ee68
Merge remote-tracking branch 'xian/distance-minimizer-caching' into l…
adamnovak May 23, 2022
d2e7d17
Add MinimizerMapper::chain_extension_group that does half of the chai…
adamnovak May 23, 2022
e3df479
Rearrange and correct test cases and add debugging
adamnovak May 24, 2022
3f6d5ae
Fix bug where overlaps were scored 1 too low
adamnovak May 24, 2022
a93fa0d
Fix more test cases and adjust debugging
adamnovak May 24, 2022
967418d
Note that we don't actually enforce a containment ban
adamnovak May 24, 2022
cc396fd
Implement some measurement and scoring of graph distances for chaining
adamnovak May 25, 2022
603335a
Add some easy unit tests
adamnovak May 25, 2022
a676d61
Allow building individual dynamic unit test binaries
adamnovak May 26, 2022
5d8ffa3
Add another test
adamnovak May 26, 2022
41d93a7
Swap over to traits so we can define the DP once as a template
adamnovak May 27, 2022
17ee4e1
Hook up chaining into single-end Giraffe and get ready for the chain …
adamnovak May 27, 2022
84b2bc6
Get build working with chaining
adamnovak May 27, 2022
021bb4b
Merge Jouni's WFA-over-GBWT algorithm
adamnovak May 27, 2022
614abc5
Merge remote-tracking branch 'upstream/distance-minimizer-caching' in…
adamnovak May 31, 2022
3ba122c
Add some tests for long read Giraffe that don't pass yet
adamnovak May 31, 2022
1f96567
Copy minimizer total limit and overlap detection over to old distance…
adamnovak May 31, 2022
7aa80b5
Start implementing aligning between gapless extensions
adamnovak Jun 1, 2022
b3e1672
Start on actually implementing the helpers and welding
adamnovak Jun 1, 2022
49a95a6
Implement WFAAlignment to vg Alignment conversion
adamnovak Jun 2, 2022
12fd227
Act more like the existing alignment conversion logic
adamnovak Jun 2, 2022
a4e4b93
Get chaining to be attempted
adamnovak Jun 2, 2022
acee183
Add some unit tests for alignment welding
adamnovak Jun 3, 2022
c12d9a6
Make easy join test pass
adamnovak Jun 3, 2022
4ad22af
Make work-showing useful for long reads
adamnovak Jun 3, 2022
45df6d5
Start at the right extension and make dropping overlapping extensions…
adamnovak Jun 3, 2022
2767fcc
Merge remote-tracking branch 'upstream/master' into lr-giraffe
adamnovak Jun 3, 2022
9272e62
Start trimming
adamnovak Jun 6, 2022
4b0e00b
Merge remote-tracking branch 'jouni/master' into lr-giraffe
adamnovak Jun 7, 2022
6f68599
Rename some variables to make making extensions optional make sense
adamnovak Jun 7, 2022
e0cffb1
Template out the extension logic
adamnovak Jun 7, 2022
2e44d7a
Unify seed finding with traits
adamnovak Jun 7, 2022
9258a00
Unify cluster scoring
adamnovak Jun 7, 2022
264e584
Templatize chaining
adamnovak Jun 7, 2022
9dd7f44
Templatize best chain function
adamnovak Jun 7, 2022
404484c
Start trying to chain raw seeds
adamnovak Jun 7, 2022
1297d05
Start factoring chaining into its own file
adamnovak Jun 7, 2022
23ff865
Hook up chaining as a parallel path to gapless extensions
adamnovak Jun 7, 2022
ff5bb5d
Get the new chaining space system to compile
adamnovak Jun 7, 2022
0dd2f60
Split out chaining unit tests
adamnovak Jun 8, 2022
d778476
Add cpp file for chaining code to hold non-templates
adamnovak Jun 8, 2022
d2b5231
Merge remote-tracking branch 'upstream/master' into lr-giraffe
adamnovak Jun 8, 2022
5871d22
Templatize and deduplicate annotate_with_minimizer_statistics
adamnovak Jun 8, 2022
53b87a4
Fix up gap length difference scoring
adamnovak Jun 8, 2022
9ca509c
Flip around tail alignments to go the right way, but they're too slow…
adamnovak Jun 8, 2022
c8daa74
Chain with a finite lookback to previous items
adamnovak Jun 13, 2022
ae7d90f
Get chaining mostly working; splicing is still broken because connect…
adamnovak Jun 13, 2022
be49804
Trim trailing bases so we can splice alignments by abutment, but some…
adamnovak Jun 13, 2022
db59f8e
Get seed to match run logic in the chaining space working properly fo…
adamnovak Jun 14, 2022
e44d782
Merge remote-tracking branch 'upstream/master' into lr-giraffe
adamnovak Jun 17, 2022
6718d0c
Stop popping bases and also pass old_seeds where needed
adamnovak Jun 17, 2022
975bc79
Go back to the original sweep-line code for gapless extension set sco…
adamnovak Jun 17, 2022
16c9602
Handle partial alignments from WFAAligner
adamnovak Jun 17, 2022
d5127c9
Try to debug a busted read
adamnovak Jun 21, 2022
16a27e0
Start brute-forcing path conversion
adamnovak Jun 22, 2022
dcdb1db
Check generated path sequence
adamnovak Jun 22, 2022
fcd0c72
Pass randomized tests for alignment round-trip
adamnovak Jun 22, 2022
bdface7
Simplify conversion function slightly
adamnovak Jun 22, 2022
e868de7
Improve comment and simplify error message
adamnovak Jun 22, 2022
966309a
Don't do pathologically long tail or connection alignments that can m…
adamnovak Jun 23, 2022
32b9988
Shorten max connecting length and get Giraffe Facts to run when there…
adamnovak Jun 23, 2022
17e2b6b
Get Giraffe Facts to report a speed
adamnovak Jun 23, 2022
66776f1
Handle empty tails
adamnovak Jun 23, 2022
8ec6cea
Merge remote-tracking branch 'courtyard/lr-giraffe' into lr-giraffe
adamnovak Jun 27, 2022
3f83f8b
Change unit tests to match the current lenient chaining
adamnovak Jun 27, 2022
a3af8ca
Add tests for distractability
adamnovak Jun 27, 2022
6b8f49f
Don't skip reachable things
adamnovak Jun 27, 2022
84fe01c
Allow tuning connections and tails separately
adamnovak Jun 28, 2022
8e090e6
Use a hashtable instead of a sorted list to find points by exact scor…
adamnovak Jun 29, 2022
082d8de
Start on combining nodes into runs
adamnovak Jul 6, 2022
0f273a4
Add a benchmark for connect()
adamnovak Jul 6, 2022
743bc73
Remember to start with empty alignment in test
adamnovak Jul 6, 2022
3d48df3
Build positions in the right order
adamnovak Jul 6, 2022
76c0696
Benchmark quietly
adamnovak Jul 6, 2022
3162685
Test 10 node sequence
adamnovak Jul 6, 2022
0c3374d
Merge commit '8e090e64fba31bd89f4d1c9bfef6dbe9284273d6' into lr-giraffe
adamnovak Jul 6, 2022
4ef2734
Merge remote-tracking branch 'courtyard/lr-giraffe' into lr-giraffe
adamnovak Jul 6, 2022
fd59d80
Get coalesced WFANodes to build
adamnovak Jul 6, 2022
85d17c7
Add some debugging prints
adamnovak Jul 7, 2022
0644c09
Revert "Add some debugging prints"
adamnovak Jul 7, 2022
153a921
Add single test for working in a multi-node run
adamnovak Jul 7, 2022
5626e82
Revert "Revert "Add some debugging prints""
adamnovak Jul 7, 2022
92fa053
Add more debugging
adamnovak Jul 7, 2022
6202109
Pass single node perfect match test
adamnovak Jul 8, 2022
ec62c3b
Change bound type to handle exact hits
adamnovak Jul 8, 2022
db7cd1c
Protect node_offset_of
adamnovak Jul 8, 2022
807bef6
Fix some incorrect conversion of destination offset
adamnovak Jul 8, 2022
9052dda
Fix expected edit counts
adamnovak Jul 8, 2022
fb0e93d
Remove duplicate test cases
adamnovak Jul 8, 2022
3da9edc
Turn off debug logging
adamnovak Jul 8, 2022
d7014bd
Change to ImmutableList which is also taking all our time
adamnovak Jul 8, 2022
19572da
Use a weird shared-middle list for paths
adamnovak Jul 8, 2022
ba7f449
Try inlining 4 items in the path type
adamnovak Jul 8, 2022
cbb5607
Set length limits to 5 kb
adamnovak Jul 8, 2022
61dcb9c
Add error model and low default limits for WFAExtender score
adamnovak Jul 11, 2022
4a0f240
Merge remote-tracking branch 'origin/lr-giraffe' into lr-giraffe
adamnovak Jul 11, 2022
3f3b5d3
Cap WFA scores aggressively
adamnovak Jul 11, 2022
0e3a8c2
Remove debug prints
adamnovak Jul 11, 2022
453349b
Don't look for unittest support headers that don't exist
adamnovak Jul 11, 2022
a2bb88a
Raise limits for benchmarking so it runs
adamnovak Jul 13, 2022
1bc6d87
Make sure to adopt the node offset when adopting the initial node
adamnovak Jul 15, 2022
45ff47d
Merge remote-tracking branch 'upstream/master' into lr-giraffe
adamnovak Jul 21, 2022
360b0ff
Drop unused print
adamnovak Jul 21, 2022
694f472
Add a test for non-diverging multi-node cycles
adamnovak Jul 21, 2022
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
Prev Previous commit
Next Next commit
Start trimming
  • Loading branch information
adamnovak committed Jun 6, 2022
commit 9272e62b243694b352532734c4e15ce3d921494c
3 changes: 3 additions & 0 deletions src/gbwt_extender.hpp
Original file line number Diff line number Diff line change
Expand Up @@ -78,6 +78,9 @@ struct GaplessExtension

/// Number of shared (read position, graph position) pairs in the extensions.
size_t overlap(const HandleGraph& graph, const GaplessExtension& another) const;

/// Take a prefix of the extension as a new GaplessExtension object.
GaplessExtension prefix(size_t prefix_length) const;

/// Convert the extension into a Path.
Path to_path(const HandleGraph& graph, const std::string& sequence) const;
Expand Down
37 changes: 30 additions & 7 deletions src/minimizer_mapper.cpp
Original file line number Diff line number Diff line change
Expand Up @@ -3948,6 +3948,12 @@ size_t MinimizerMapper::get_graph_overlap(const GaplessExtension& left,

}

size_t MinimizerMapper::get_read_overlap(const GaplessExtension& left,
const GaplessExtension& right) {

return left.read_interval.second - right.read_interval.first;
}

size_t MinimizerMapper::get_graph_distance(const GaplessExtension& left,
const GaplessExtension& right,
const SnarlDistanceIndex* distance_index,
Expand Down Expand Up @@ -4140,7 +4146,7 @@ S MinimizerMapper::chain_extension_group_dp(vector<S>& best_chain_score,
// See whether and how much they overlap in the graph
size_t graph_overlap = get_graph_overlap(extended_seeds[ST::source(overlap_option)], extended_seeds[unentered], graph);
// And also get the overlap in the read.
size_t read_overlap = extended_seeds[ST::source(overlap_option)].read_interval.second - extended_seeds[unentered].read_interval.first;
size_t read_overlap = get_read_overlap(extended_seeds[ST::source(overlap_option)], extended_seeds[unentered]);
if (graph_overlap > read_overlap) {
// They overlap in the graph more than they do in the read.
// We don't want to charge for the overlap in the read, we just want to charge for the additional overlap in the graph.
Expand Down Expand Up @@ -4500,13 +4506,14 @@ Alignment MinimizerMapper::find_chain_alignment(const Alignment& aln, const vect

const GaplessExtension* next = &extended_seeds[*next_it];

// Get the read string between them
while (next_it != chain.end() && here->read_interval.second >= next->read_interval.first) {
// Actually there's an overlap; just skip the overlapping extension.
// How much does this extension overlap into the next one in the read?
size_t overlap = get_read_overlap(*here, *next);
while (next_it != chain.end() && overlap >= std::max(here->length(), next->length())) {
// There's enough overlap for one of these to contain the other. Keep here and skip next.
if (show_work) {
#pragma omp critical (cerr)
{
cerr << log_name() << "Don't try and connect " << *here_it << " to " << *next_it << " because they overlap" << endl;
cerr << log_name() << "Don't try and connect " << *here_it << " to " << *next_it << " because of containment or backtracking over an entire gapless extension" << endl;
}
}
// TODO: If we just drop the whole extension we can get into a situation where we need to do a lot of DP to replace it, and we might not get anything that follows the GBWT.
Expand All @@ -4516,21 +4523,37 @@ Alignment MinimizerMapper::find_chain_alignment(const Alignment& aln, const vect
break;
}
next = &extended_seeds[*next_it];
overlap = get_read_overlap(*here, *next);
}
if (next_it == chain.end()) {
break;
}

// Now they may still overlap, but each has a distinct read base.

if (show_work) {
#pragma omp critical (cerr)
{
cerr << log_name() << "Add extension " << *here_it << " of " << here->length() << " with score of " << here->score << endl;
}
}

if (overlap > 0) {
// We need to trim to avoid overlap in the read.
// If we trim from here it is more efficient because we are
// trimming from the end of its internal vectors.

if (show_work) {
#pragma omp critical (cerr)
{
cerr << log_name() << "Trim " << overlap << " bases from end of extension" << endl;
}
}
}

// Make an alignment for the bases used in this GaplessExtension, and
// concatenate it in on the shared match.
aligned.join_on_shared_match(WFAAlignment::from_extension(*here), aligner.match);
// concatenate it in on the shared match. Make sure to trim the overlap off the end.
aligned.join_on_shared_match(WFAAlignment::from_extension(*here, here->length() - overlap), aligner.match);

// Pull out the intervening string. Leave room for the anchoring matches.
string linking_bases = aln.sequence().substr(here->read_interval.second - 1, next->read_interval.first - here->read_interval.second + 2);
Expand Down
15 changes: 13 additions & 2 deletions src/minimizer_mapper.hpp
Original file line number Diff line number Diff line change
Expand Up @@ -550,8 +550,8 @@ class MinimizerMapper : public AlignerClient {

/**
* Return the amount by which the end of the left gapless extension is
* past the position before the start of the right gapless extension.
* Returns 0 if they do not actually overlap.
* past the position before the start of the right gapless extension, in
* the graph. Returns 0 if they do not actually overlap.
*
* Doesn't actually match the whole graph paths against each other; if
* there's a handle where the left one ends and then the right one starts,
Expand All @@ -563,6 +563,17 @@ class MinimizerMapper : public AlignerClient {
static size_t get_graph_overlap(const GaplessExtension& left,
const GaplessExtension& right,
const HandleGraph* graph);

/**
* Return the amount by which the end of the left gapless extension is
* past the position before the start of the right gapless extension, in
* the read. Returns 0 if they do not actually overlap.
*
* Can return a number larger than the length of one extension if it is
* contained in the other.
*/
static size_t get_read_overlap(const GaplessExtension& left,
const GaplessExtension& right);

/**
* Get the minimum graph distance between the end of the left gapless
Expand Down