Long Read Giraffe #3700

adamnovak · 2022-07-11T21:15:30Z

Changelog Entry

To be copied to the draft changelog by merger:

vg giraffe now accepts --align-from-chains to do long-read alignments.
Makefile now supports make bin/unittest/<test_file_name> to build a dynamically-linked binary for just one file of unit tests, for faster iteration.

Description

This is my sketch for integrating @xchang1's distance index 2 and @StephenHwang's minimizer selection for long reads, with some chaining logic that uses @jltsiren's WFA-against-GBWT aligner to connect the minimizers together.

The chaining logic is somewhat trivial and refuses to skip even a single minimizer hit, and the WFA-against-GBWT alignment is being limited to at most a relatively small score even if the sequence being aligned is quite long.

Using this, I managed to get through 10k reads from the HiFi read set we've been working with in about 15 minutes. Most of the reads took about 0.02 thread-seconds each, but the slowest 50 were:

8.618331251	m64011_190830_220126/461125/ccs
9.396e-05	m64011_190830_220126/460538/ccs
9.514389841	m64011_190830_220126/525540/ccs
9.73168963	m64011_190830_220126/1048647/ccs
10.056107133	m64011_190830_220126/198240/ccs
10.286447607	m64011_190830_220126/1180267/ccs
10.473475374	m64011_190830_220126/2690/ccs
10.756872153	m64011_190830_220126/1114718/ccs
10.821179828	m64011_190830_220126/132701/ccs
10.902801503	m64011_190830_220126/984791/ccs
12.382437693	m64011_190830_220126/329691/ccs
12.406518122	m64011_190830_220126/132379/ccs
12.647579014	m64011_190830_220126/983219/ccs
13.186383601	m64011_190830_220126/1179826/ccs
15.442856546	m64011_190830_220126/983932/ccs
15.598967196	m64011_190830_220126/1180435/ccs
16.846121033	m64011_190830_220126/328422/ccs
17.069679146	m64011_190830_220126/393977/ccs
20.204970769	m64011_190830_220126/788498/ccs
24.339630128	m64011_190830_220126/66301/ccs
27.889559158	m64011_190830_220126/66298/ccs
29.519215479	m64011_190830_220126/1180052/ccs
32.885811606	m64011_190830_220126/1114439/ccs
33.619887132	m64011_190830_220126/524776/ccs
37.721842504	m64011_190830_220126/131912/ccs
38.246688859	m64011_190830_220126/722387/ccs
42.405513372	m64011_190830_220126/918093/ccs
46.032722843	m64011_190830_220126/459694/ccs
47.165207406	m64011_190830_220126/853701/ccs
47.344950028	m64011_190830_220126/262147/ccs
52.195793082	m64011_190830_220126/65908/ccs
55.281145333	m64011_190830_220126/524995/ccs
68.929016647	m64011_190830_220126/525396/ccs
70.967101053	m64011_190830_220126/1115348/ccs
71.961231031	m64011_190830_220126/1181548/ccs
74.598282346	m64011_190830_220126/657570/ccs
82.139838261	m64011_190830_220126/657686/ccs
86.893103747	m64011_190830_220126/984102/ccs
102.310523759	m64011_190830_220126/592123/ccs
104.611061058	m64011_190830_220126/723341/ccs
108.652890817	m64011_190830_220126/264611/ccs
108.756120559	m64011_190830_220126/589970/ccs
115.372498098	m64011_190830_220126/393982/ccs
117.884962444	m64011_190830_220126/1116719/ccs
133.442092941	m64011_190830_220126/461459/ccs
134.376388748	m64011_190830_220126/592493/ccs
140.887197072	m64011_190830_220126/656439/ccs
173.146125327	m64011_190830_220126/854016/ccs
293.673232448	m64011_190830_220126/789175/ccs
293.797770105	m64011_190830_220126/655993/ccs

@xchang1 Does this need to be updated with more of your DI2 code before it gets merged?

@jltsiren I touched a bunch of the data structures in the WFAExtender trying to improve the algorithmics and address where profiling saw all my time going. In addition to just coalescing multiple graph nodes into one WFANode, I changed from a sorted list to a hashtable to hold the wavefronts on each WFANode (for O(1) insert/lookup when it gets huge), and I changed the way WFAPoint stores its path to try and avoid constantly allocating and deallocating deque stuff in the std::stack that was there before (which was taking up almost all the time according to my profile). Have I messed anything up that you want me to try and roll back?
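
To make the sorted-list-to-hashtable change concrete, here is a minimal sketch of a per-node wavefront store with average-case O(1) insert and lookup. It is not the actual WFAExtender code: the names WavefrontKey, WavefrontKeyHash, and WavefrontTable are hypothetical, and keying on a (score, diagonal) pair that maps to the furthest offset reached is an assumption about what a wavefront entry holds.

```cpp
#include <cstddef>
#include <cstdint>
#include <functional>
#include <unordered_map>

// Hypothetical key: one wavefront cell, identified by alignment score and diagonal.
struct WavefrontKey {
    std::int32_t score;
    std::int32_t diagonal;
    bool operator==(const WavefrontKey& other) const {
        return score == other.score && diagonal == other.diagonal;
    }
};

// Pack both 32-bit fields into one 64-bit value and hash that.
struct WavefrontKeyHash {
    std::size_t operator()(const WavefrontKey& key) const noexcept {
        std::uint64_t packed =
            (static_cast<std::uint64_t>(static_cast<std::uint32_t>(key.score)) << 32)
            | static_cast<std::uint32_t>(key.diagonal);
        return std::hash<std::uint64_t>()(packed);
    }
};

// Hypothetical per-node store: (score, diagonal) -> furthest offset reached.
class WavefrontTable {
public:
    // Record an offset; returns true if the entry was created or improved.
    bool update(std::int32_t score, std::int32_t diagonal, std::int32_t offset) {
        auto inserted = furthest_.emplace(WavefrontKey{score, diagonal}, offset);
        if (inserted.second) {
            return true;  // New cell.
        }
        if (inserted.first->second < offset) {
            inserted.first->second = offset;  // Existing cell, further offset.
            return true;
        }
        return false;  // Existing cell already reaches at least this far.
    }

    // Look up a cell; returns false if it has never been reached.
    bool lookup(std::int32_t score, std::int32_t diagonal, std::int32_t& offset_out) const {
        auto found = furthest_.find(WavefrontKey{score, diagonal});
        if (found == furthest_.end()) {
            return false;
        }
        offset_out = found->second;
        return true;
    }

    std::size_t size() const { return furthest_.size(); }

private:
    std::unordered_map<WavefrontKey, std::int32_t, WavefrontKeyHash> furthest_;
};
```

The trade-off in a design like this is that a hash table gives up ordered iteration over scores and diagonals, so anything that relied on the sorted order would need to track the range of live keys separately.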

adamnovak · 2022-07-11T21:23:55Z

This is updating vcflib right now; I have to check if that really makes sense or if I just caught it accidentally.

adamnovak · 2022-07-13T16:16:39Z

The vcflib changes are intentional; I was having trouble with some SVs that had been messed up by a liftover process, and I fixed vcflib to detect this and refuse to try and canonicalize them when they can't be sensibly interpreted.

adamnovak · 2022-07-14T15:26:36Z

I managed to crash this on read pair ERR3239454.221193804 with:

vg giraffe -t 64 --align-from-chains --progress -Z /public/groups/cgl/graph-genomes/xhchang/hprc_graph/GRCh38-f1g-90-mc-aug11-clip.d9.m1000.D10M.m1000.giraffe.gbz -d /public/groups/cgl/graph-genomes/anovak/trash/GRCh38-f1g-90-mc-aug11-clip.d9.m1000.D10M.m1000.dist -m /public/groups/cgl/graph-genomes/anovak/trash/GRCh38-f1g-90-mc-aug11-clip.d9.m1000.D10M.m1000.min -f /nanopore/cgl/data/giraffe/mapping/reads/real/NA19239/novaseq6000-ERR3239454-shuffled-1m.fq.gz -i >/dev/null

It comes up with a WFAAlignment of:

{ path = [ (9849702, 1) ], edits = [ 3I35M1X1M2X9M1D10M2X2M1X1M2X18M63I ], node offset = 1005, sequence range = [0, 150), score = 30 }

but it can't turn it into a path, since the alignment runs off the end of a 1024-bp node and the path doesn't actually have any more nodes in it.
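
The overrun can be checked by hand from the edit string. Below is a small standalone sketch, not vg code: the helpers edit_lengths and overrun_past_node are hypothetical, and it assumes M and X consume both read and graph bases, I consumes only read bases, D consumes only graph bases, and the node offset is measured along the direction of traversal.

```cpp
#include <cctype>
#include <cstddef>
#include <stdexcept>
#include <string>

// Hypothetical helper type: how many read and graph bases an edit string covers.
struct EditLengths {
    std::size_t read_bases = 0;
    std::size_t graph_bases = 0;
};

// Parse a run-length edit string such as "3I35M1X" (assumed format).
inline EditLengths edit_lengths(const std::string& edits) {
    EditLengths result;
    std::size_t count = 0;
    bool have_count = false;
    for (char c : edits) {
        if (std::isdigit(static_cast<unsigned char>(c))) {
            count = count * 10 + static_cast<std::size_t>(c - '0');
            have_count = true;
            continue;
        }
        if (!have_count) {
            throw std::invalid_argument("operation without a length");
        }
        switch (c) {
            case 'M':  // Match: consumes read and graph.
            case 'X':  // Mismatch: consumes read and graph.
                result.read_bases += count;
                result.graph_bases += count;
                break;
            case 'I':  // Insertion: consumes read only.
                result.read_bases += count;
                break;
            case 'D':  // Deletion: consumes graph only.
                result.graph_bases += count;
                break;
            default:
                throw std::invalid_argument("unknown edit operation");
        }
        count = 0;
        have_count = false;
    }
    if (have_count) {
        throw std::invalid_argument("trailing length without an operation");
    }
    return result;
}

// Number of graph bases that fall past the end of a single node (0 if it fits).
inline std::size_t overrun_past_node(std::size_t node_offset, std::size_t node_length,
                                     const std::string& edits) {
    std::size_t end = node_offset + edit_lengths(edits).graph_bases;
    return end > node_length ? end - node_length : 0;
}
```

Under those assumptions, the edits above cover 150 read bases, which matches the sequence range [0, 150), and 85 graph bases. Starting at offset 1005, the alignment would end at 1090, which is 66 bases past the end of the 1024-bp node.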

adamnovak · 2022-07-22T01:26:05Z

The build doesn't work on the MacOS 11 image because of missing Protobuf symbols.

-- Found PkgConfig: /usr/local/bin/pkg-config (found version "0.29.2") 
-- [ /usr/local/Cellar/cmake/3.23.2/share/cmake/Modules/FindProtobuf.cmake:343 ] Protobuf_USE_STATIC_LIBS = OFF
-- [ /usr/local/Cellar/cmake/3.23.2/share/cmake/Modules/FindProtobuf.cmake:479 ] requested version of Google Protobuf is 
-- [ /usr/local/Cellar/cmake/3.23.2/share/cmake/Modules/FindProtobuf.cmake:487 ] location of common.h: /usr/local/include/google/protobuf/stubs/common.h
-- [ /usr/local/Cellar/cmake/3.23.2/share/cmake/Modules/FindProtobuf.cmake:505 ] /usr/local/include/google/protobuf/stubs/common.h reveals protobuf 3.21.2
-- [ /usr/local/Cellar/cmake/3.23.2/share/cmake/Modules/FindProtobuf.cmake:519 ] /usr/local/bin/protoc reveals version 3.21.2
-- Found Protobuf: /usr/local/lib/libprotobuf.dylib (found version "3.21.2") 
Protobuf will be /usr/local/lib/libprotobuf.dylib for PIC dynamic code and /usr/local/lib/libprotobuf.a for non-PIC static code

[ 95%] Linking CXX shared library libvgio.dylib
/usr/local/Cellar/cmake/3.23.2/bin/cmake -E cmake_link_script CMakeFiles/vgio.dir/link.txt --verbose=1
/Applications/Xcode_13.2.1.app/Contents/Developer/usr/bin/g++ -O3 -g -O3 -Werror=return-type -std=c++14 -ggdb -g  -Xpreprocessor -fopenmp -march=native -isysroot /Applications/Xcode_13.2.1.app/Contents/Developer/Platforms/MacOSX.platform/Developer/SDKs/MacOSX12.1.sdk -mmacosx-version-min=11.6 -dynamiclib -Wl,-headerpad_max_install_names -L/Users/runner/work/vg/vg/lib -L/usr/local/lib -o libvgio.dylib -install_name @rpath/libvgio.dylib CMakeFiles/vgio.dir/vg.pb.cc.o CMakeFiles/vgio.dir/src/alignment_emitter.cpp.o CMakeFiles/vgio.dir/src/alignment_io.cpp.o CMakeFiles/vgio.dir/src/basic_stream.cpp.o CMakeFiles/vgio.dir/src/blocked_gzip_input_stream.cpp.o CMakeFiles/vgio.dir/src/blocked_gzip_output_stream.cpp.o CMakeFiles/vgio.dir/src/edit.cpp.o CMakeFiles/vgio.dir/src/hfile_cppstream.cpp.o CMakeFiles/vgio.dir/src/json2pb.cpp.o CMakeFiles/vgio.dir/src/message_emitter.cpp.o CMakeFiles/vgio.dir/src/message_iterator.cpp.o CMakeFiles/vgio.dir/src/registry.cpp.o CMakeFiles/vgio.dir/src/stream.cpp.o CMakeFiles/vgio.dir/src/stream_multiplexer.cpp.o CMakeFiles/vgio.dir/src/vpkg.cpp.o   -L/usr/local/Cellar/jansson/2.14/lib  -Wl,-rpath,/usr/local/Cellar/jansson/2.14/lib /usr/local/lib/libprotobuf.dylib -lhts -ljansson handlegraph-prefix/lib/libhandlegraph.dylib /usr/local/lib/libomp.dylib 
Undefined symbols for architecture x86_64:
  "google::protobuf::internal::InternalMetadata::~InternalMetadata()", referenced from:
      google::protobuf::Message::~Message() in vg.pb.cc.o
      vg::Graph::~Graph() in vg.pb.cc.o
      vg::Graph::~Graph() in vg.pb.cc.o
      vg::Graph::~Graph() in vg.pb.cc.o
      vg::Node::~Node() in vg.pb.cc.o
      vg::Edge::Edge(vg::Edge const&) in vg.pb.cc.o
      vg::Edge::Edge(vg::Edge const&) in vg.pb.cc.o
      ...
ld: symbol(s) not found for architecture x86_64
clang: error: linker command failed with exit code 1 (use -v to see invocation)
make[3]: *** [libvgio.dylib] Error 1

It could be that there's somehow a Protobuf version mismatch, even though we should be using an OS-version-specific cache here. We're supposedly building against Protobuf 3.21.2 (which Brew is calling 21.2), but Protobug 3.21.3 came out yesterday, and the same day someone on StackOverflow started reporting this error.

jeizenga · 2022-07-22T01:28:51Z

Protobug = Freudian slip?

adamnovak · 2022-07-22T01:46:20Z

I think we hit protocolbuffers/protobuf#9947 where this symbol doesn't appear in release (-DNDEBUG) Protobuf libraries, but wants to be linked against by headers included in builds that don't use -DNDEBUG.

I think maybe in the last couple days the bad Protobuf releases hit Homebrew, everybody suddenly cared, and the bug was actually fixed.

The 3.21.3 release is supposed to fix this problem, so we need Protobuf 3.21.3, or else an old ~3.19 one as in tuplex/tuplex#119. 3.21.3 is now in Homebrew according to Homebrew/homebrew-core#106252 so I think I might just need to rerun?

adamnovak · 2022-07-22T01:47:40Z

🐛

adamnovak · 2022-07-25T20:48:38Z

I'm breaking off the Mac CI changes into #3708, since they seem to actually be hard and the 10.15 brownout has been stopped.

adamnovak added 30 commits May 23, 2022 10:25

Merge remote-tracking branch 'xian/distance-minimizer-caching' into lr-giraffe

d64ee68

Add MinimizerMapper::chain_extension_group that does half of the chaining problem

d2e7d17

Rearrange and correct test cases and add debugging

e3df479

Fix bug where overlaps were scored 1 too low

3f6d5ae

Fix more test cases and adjust debugging

a93fa0d

Note that we don't actually enforce a containment ban

967418d

Implement some measurement and scoring of graph distances for chaining

cc396fd

Add some easy unit tests

603335a

Allow building individual dynamic unit test binaries

a676d61

Add another test

5d8ffa3

Swap over to traits so we can define the DP once as a template

41d93a7

Hook up chaining into single-end Giraffe and get ready for the chain to alignment elaboration

17ee4e1

Get build working with chaining

84b2bc6

Merge Jouni's WFA-over-GBWT algorithm

021bb4b

Merge commit 'bffdd27a300a2669df6469025b5660077666def8' into lr-giraffe

Merge remote-tracking branch 'upstream/distance-minimizer-caching' into lr-giraffe

614abc5

Add some tests for long read Giraffe that don't pass yet

3ba122c

Copy minimizer total limit and overlap detection over to old distance index codepath

1f96567

Start implementing aligning between gapless extensions

7aa80b5

Start on actually implementing the helpers and welding

b3e1672

Implement WFAAlignment to vg Alignment conversion

49a95a6

Act more like the existing alignment conversion logic

12fd227

Get chaining to be attempted

a4e4b93

Add some unit tests for alignment welding

acee183

Make easy join test pass

c12d9a6

Make work-showing useful for long reads

4ad22af

Start at the right extension and make dropping overlapping extensions clear

45df6d5

Merge remote-tracking branch 'upstream/master' into lr-giraffe

2767fcc

Start trimming

9272e62

Merge remote-tracking branch 'jouni/master' into lr-giraffe

4b0e00b

Rename some variables to make making extensions optional make sense

6f68599

adamnovak added 14 commits July 8, 2022 13:47

Protect node_offset_of

db7cd1c

Fix some incorrect conversion of destination offset

807bef6

Fix expected edit counts

9052dda

Remove duplicate test cases

fb0e93d

Turn off debug logging

3da9edc

Change to ImmutableList which is also taking all our time

d7014bd

Use a weird shared-middle list for paths

19572da

Try inlining 4 items in the path type

ba7f449

Set length limits to 5 kb

cbb5607

Add error model and low default limits for WFAExtender score

61dcb9c

Merge remote-tracking branch 'origin/lr-giraffe' into lr-giraffe

4a0f240

Cap WFA scores aggressively

3f3b5d3

Remove debug prints

0e3a8c2

Don't look for unittest support headers that don't exist

453349b

Raise limits for benchmarking so it runs

a2bb88a

Make sure to adopt the node offset when adopting the initial node

1bc6d87

adamnovak force-pushed the lr-giraffe branch from 1b51465 to 1bc6d87 July 15, 2022 20:22

adamnovak added 3 commits July 21, 2022 15:11

Merge remote-tracking branch 'upstream/master' into lr-giraffe

45ff47d

Drop unused print

360b0ff

Add a test for non-diverging multi-node cycles

694f472

adamnovak force-pushed the lr-giraffe branch from 46af1e3 to 694f472 July 25, 2022 20:48

adamnovak merged commit fd97f17 into master Jul 29, 2022
