Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Long Read Giraffe #3700

Merged
merged 104 commits into from
Jul 29, 2022
Merged

Long Read Giraffe #3700

merged 104 commits into from
Jul 29, 2022

Conversation

adamnovak
Copy link
Member

@adamnovak adamnovak commented Jul 11, 2022

Changelog Entry

To be copied to the draft changelog by merger:

  • vg giraffe can now --align-from-chains to do long-read alignments
  • Makefile now supports make bin/unittest/<test_file_name> to build a dynamically-linked binary for just one file of unit tests, for faster iteration.

Description

This is my sketch for integrating @xchang1's distance index 2 and @StephenHwang's minimizer selection for long reads, with some chaining logic that uses @jltsiren's WFA-against-GBWT aligner to connect the minimizers together.

The chaining logic is somewhat trivial and refuses to skip even a single minimizer hit, and the WFA-against-GBWT alignment is being limited to at most a relatively small score even if the sequence being aligned is quite long.

Using this, I managed to get through 10k reads from the HiFi read set we've been working with in about 15 minutes. Most of the reads took about 0.02 thread-seconds each, but the slowest 50 were:

8.618331251	m64011_190830_220126/461125/ccs
9.396e-05	m64011_190830_220126/460538/ccs
9.514389841	m64011_190830_220126/525540/ccs
9.73168963	m64011_190830_220126/1048647/ccs
10.056107133	m64011_190830_220126/198240/ccs
10.286447607	m64011_190830_220126/1180267/ccs
10.473475374	m64011_190830_220126/2690/ccs
10.756872153	m64011_190830_220126/1114718/ccs
10.821179828	m64011_190830_220126/132701/ccs
10.902801503	m64011_190830_220126/984791/ccs
12.382437693	m64011_190830_220126/329691/ccs
12.406518122	m64011_190830_220126/132379/ccs
12.647579014	m64011_190830_220126/983219/ccs
13.186383601	m64011_190830_220126/1179826/ccs
15.442856546	m64011_190830_220126/983932/ccs
15.598967196	m64011_190830_220126/1180435/ccs
16.846121033	m64011_190830_220126/328422/ccs
17.069679146	m64011_190830_220126/393977/ccs
20.204970769	m64011_190830_220126/788498/ccs
24.339630128	m64011_190830_220126/66301/ccs
27.889559158	m64011_190830_220126/66298/ccs
29.519215479	m64011_190830_220126/1180052/ccs
32.885811606	m64011_190830_220126/1114439/ccs
33.619887132	m64011_190830_220126/524776/ccs
37.721842504	m64011_190830_220126/131912/ccs
38.246688859	m64011_190830_220126/722387/ccs
42.405513372	m64011_190830_220126/918093/ccs
46.032722843	m64011_190830_220126/459694/ccs
47.165207406	m64011_190830_220126/853701/ccs
47.344950028	m64011_190830_220126/262147/ccs
52.195793082	m64011_190830_220126/65908/ccs
55.281145333	m64011_190830_220126/524995/ccs
68.929016647	m64011_190830_220126/525396/ccs
70.967101053	m64011_190830_220126/1115348/ccs
71.961231031	m64011_190830_220126/1181548/ccs
74.598282346	m64011_190830_220126/657570/ccs
82.139838261	m64011_190830_220126/657686/ccs
86.893103747	m64011_190830_220126/984102/ccs
102.310523759	m64011_190830_220126/592123/ccs
104.611061058	m64011_190830_220126/723341/ccs
108.652890817	m64011_190830_220126/264611/ccs
108.756120559	m64011_190830_220126/589970/ccs
115.372498098	m64011_190830_220126/393982/ccs
117.884962444	m64011_190830_220126/1116719/ccs
133.442092941	m64011_190830_220126/461459/ccs
134.376388748	m64011_190830_220126/592493/ccs
140.887197072	m64011_190830_220126/656439/ccs
173.146125327	m64011_190830_220126/854016/ccs
293.673232448	m64011_190830_220126/789175/ccs
293.797770105	m64011_190830_220126/655993/ccs

@xchang1 Does this need to be updated with more of your DI2 code before it gets merged?

@jltsiren I touched a bunch of the data structures in the WFAExtender trying to improve the algorithmics and address where profiling saw all my time going. In addition to just coalescing multiple graph nodes into one WFANode, I changed from a sorted list to a hashtable to hold the wavefronts on each WFANode (for O(1) insert/lookup when it gets huge), and I changed the way WFAPoint stores its path to try and avoid constantly allocating and deallocating deque stuff in the std::stack that was there before (which was taking up almost all the time according to my profile). Have I messed anything up that you want me to try and roll back?

adamnovak added 30 commits May 23, 2022 10:25
Merge commit 'bffdd27a300a2669df6469025b5660077666def8' into lr-giraffe
@adamnovak
Copy link
Member Author

This is updating vcflib right now; I have to check if that really makes sense or if I just caught it accidentally.

@adamnovak
Copy link
Member Author

The vcflib changes are intentional; I was having trouble with some SVs that had been messed up by a litfover process, and I fixed vcflib to detect this and refuse to try and canonicalize them when they can't be sensibly interpreted.

@adamnovak
Copy link
Member Author

I managed to crash this on read pair ERR3239454.221193804 with vg giraffe -t 64 --align-from-chains --progress -Z /public/groups/cgl/graph-genomes/xhchang/hprc_graph/GRCh38-f1g-90-mc-aug11-clip.d9.m1000.D10M.m1000.giraffe.gbz -d /public/groups/cgl/graph-genomes/anovak/trash/GRCh38-f1g-90-mc-aug11-clip.d9.m1000.D10M.m1000.dist -m /public/groups/cgl/graph-genomes/anovak/trash/GRCh38-f1g-90-mc-aug11-clip.d9.m1000.D10M.m1000.min -f /nanopore/cgl/data/giraffe/mapping/reads/real/NA19239/novaseq6000-ERR3239454-shuffled-1m.fq.gz -i >/dev/null. It comes up with a WFAAlignment of { path = [ (9849702, 1) ], edits = [ 3I35M1X1M2X9M1D10M2X2M1X1M2X18M63I ], node offset = 1005, sequence range = [0, 150), score = 30 } but it can't turn it into a path since it runs off the end of a 1024-bp node and doesn't actually have any more nodes in it.

@adamnovak
Copy link
Member Author

The build doesn't work on the MacOS 11 image because of missing Protobuf symbols.

-- Found PkgConfig: /usr/local/bin/pkg-config (found version "0.29.2") 
-- [ /usr/local/Cellar/cmake/3.23.2/share/cmake/Modules/FindProtobuf.cmake:343 ] Protobuf_USE_STATIC_LIBS = OFF
-- [ /usr/local/Cellar/cmake/3.23.2/share/cmake/Modules/FindProtobuf.cmake:479 ] requested version of Google Protobuf is 
-- [ /usr/local/Cellar/cmake/3.23.2/share/cmake/Modules/FindProtobuf.cmake:487 ] location of common.h: /usr/local/include/google/protobuf/stubs/common.h
-- [ /usr/local/Cellar/cmake/3.23.2/share/cmake/Modules/FindProtobuf.cmake:505 ] /usr/local/include/google/protobuf/stubs/common.h reveals protobuf 3.21.2
-- [ /usr/local/Cellar/cmake/3.23.2/share/cmake/Modules/FindProtobuf.cmake:519 ] /usr/local/bin/protoc reveals version 3.21.2
-- Found Protobuf: /usr/local/lib/libprotobuf.dylib (found version "3.21.2") 
Protobuf will be /usr/local/lib/libprotobuf.dylib for PIC dynamic code and /usr/local/lib/libprotobuf.a for non-PIC static code
[ 95%] Linking CXX shared library libvgio.dylib
/usr/local/Cellar/cmake/3.23.2/bin/cmake -E cmake_link_script CMakeFiles/vgio.dir/link.txt --verbose=1
/Applications/Xcode_13.2.1.app/Contents/Developer/usr/bin/g++ -O3 -g -O3 -Werror=return-type -std=c++14 -ggdb -g  -Xpreprocessor -fopenmp -march=native -isysroot /Applications/Xcode_13.2.1.app/Contents/Developer/Platforms/MacOSX.platform/Developer/SDKs/MacOSX12.1.sdk -mmacosx-version-min=11.6 -dynamiclib -Wl,-headerpad_max_install_names -L/Users/runner/work/vg/vg/lib -L/usr/local/lib -o libvgio.dylib -install_name @rpath/libvgio.dylib CMakeFiles/vgio.dir/vg.pb.cc.o CMakeFiles/vgio.dir/src/alignment_emitter.cpp.o CMakeFiles/vgio.dir/src/alignment_io.cpp.o CMakeFiles/vgio.dir/src/basic_stream.cpp.o CMakeFiles/vgio.dir/src/blocked_gzip_input_stream.cpp.o CMakeFiles/vgio.dir/src/blocked_gzip_output_stream.cpp.o CMakeFiles/vgio.dir/src/edit.cpp.o CMakeFiles/vgio.dir/src/hfile_cppstream.cpp.o CMakeFiles/vgio.dir/src/json2pb.cpp.o CMakeFiles/vgio.dir/src/message_emitter.cpp.o CMakeFiles/vgio.dir/src/message_iterator.cpp.o CMakeFiles/vgio.dir/src/registry.cpp.o CMakeFiles/vgio.dir/src/stream.cpp.o CMakeFiles/vgio.dir/src/stream_multiplexer.cpp.o CMakeFiles/vgio.dir/src/vpkg.cpp.o   -L/usr/local/Cellar/jansson/2.14/lib  -Wl,-rpath,/usr/local/Cellar/jansson/2.14/lib /usr/local/lib/libprotobuf.dylib -lhts -ljansson handlegraph-prefix/lib/libhandlegraph.dylib /usr/local/lib/libomp.dylib 
Undefined symbols for architecture x86_64:
  "google::protobuf::internal::InternalMetadata::~InternalMetadata()", referenced from:
      google::protobuf::Message::~Message() in vg.pb.cc.o
      vg::Graph::~Graph() in vg.pb.cc.o
      vg::Graph::~Graph() in vg.pb.cc.o
      vg::Graph::~Graph() in vg.pb.cc.o
      vg::Node::~Node() in vg.pb.cc.o
      vg::Edge::Edge(vg::Edge const&) in vg.pb.cc.o
      vg::Edge::Edge(vg::Edge const&) in vg.pb.cc.o
      ...
ld: symbol(s) not found for architecture x86_64
clang: error: linker command failed with exit code 1 (use -v to see invocation)
make[3]: *** [libvgio.dylib] Error 1

It could be that there's somehow a Protobuf version mismatch, even though we should be using an OS-version-specific cache here. We're supposedly building against Protobuf 3.21.2 (which Brew is calling 21.2), but Protobug 3.21.3 came out yesterday and the same day someone on StackOverflow started reporting this error

@jeizenga
Copy link
Contributor

Protobug = Freudian slip?

@adamnovak
Copy link
Member Author

I think we hit protocolbuffers/protobuf#9947 where this symbol doesn't appear in release (-DNDEBUG) Protobuf libraries, but wants to be linked against by headers included in builds that don't use -DNDEBUG.

I think maybe in the last couple days the bad Protobuf releases hit Homebrew, everybody suddenly cared, and the bug was actually fixed.

The 3.21.3 release is supposed to fix this problem, so we need Protobuf 3.21.3, or else an old ~3.19 one as in tuplex/tuplex#119. 3.21.3 is now in Homebrew according to Homebrew/homebrew-core#106252 so I think I might just need to rerun?

@adamnovak
Copy link
Member Author

🐛

@adamnovak
Copy link
Member Author

I'm breaking off the Mac CI changes into #3708, since they seem to actually be hard and the 10.15 brownout has been stopped.

@adamnovak adamnovak merged commit fd97f17 into master Jul 29, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants