Make Giraffe Fully Simple #3126

Open
adamnovak opened this issue Dec 14, 2020 · 9 comments

Comments

@adamnovak
Member

@benedictpaten wants Giraffe to be "fully simple". This has two elements, as I understand it:

  • We should have standards and specs for the input formats we want from HPRC, so that we have something well-defined to run on.
  • We should have a pipeline suitable for a BME230 (2nd year graduate course) student to run Giraffe on an input (extended) GFA, without them complaining that they need a giant-memory machine, or that it takes forever to go through all the connected components, or that they can't understand how to build the indexes from the tool help.

So far, we have identified three things we need to do to set this up:

  • Support some form or forms of GFA-with-haplotypes as a single-file input format for indexing for Giraffe. We should be able to go straight from this (blunt but otherwise unprocessed file, possibly with large nodes or string/0 node names) to the indexes Giraffe needs. We want to take this as input for a new tool-oriented indexing approach (maybe vg index --giraffe), with help that points to that and doesn't overwhelm the user with 15 different possible index formats they could make. This would require a bit of workflow work so that the one command can decide the right number of connected components to build indexes for in memory at a time, without running out of memory or taking forever in serial.

  • Combine the GBWT and GBWTGraph into one file. This would reduce the number of files that Giraffe needs to three (GBWT/graph, distance index, minimizer index). @jltsiren thinks this will be straightforward.

  • Abstract away node chopping and node boundaries, with some kind of system and specified format for translating between GFA coordinates and chopped-graph, numbered-node vg coordinates. The new grad students and other new users keep complaining that vg has chopped up their graph and that the node IDs are "wrong". If we have a way to tell them the coordinate translation we are using (or to maybe even output mappings in the original GFA coordinates?), they will hopefully be mollified. Eventually, we might like a way for the HandleGraph API to let you ignore where node boundaries fall.

@ekg What do you think? We also probably would need vg index --map and vg index --mpmap, right? And as for GFA-with-haplotypes, do we have a consensus on what style or styles we should accept for paths with haplotype semantics?

@jeizenga
Contributor

jeizenga commented Dec 14, 2020

In https://github.com/vgteam/GetBlunted Ryan and I output a text-based translation table that identifies each node with the original sequence(s) that contributed to it. The tables look something like this:

#blunted_seq_name	original_subseqs
s1	read1[0:4]+,read2[3:7]+
s2	read1[4:7]+,read2[0:3]+

I think something similar might be useful for managing the chopping, along with some helpful stderr output alerting the user that the node IDs have been changed, why they have been changed, and that the translations can be found in the table.
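A table like the one above is straightforward to consume downstream. The following is a minimal Python sketch of a reader, not GetBlunted's own code: the entry layout (`name[start:end]strand`, comma-separated, tab after the node name) is inferred only from the two example rows, and the names `SubSeq` and `parse_translation_table` are hypothetical.

```python
import re
from typing import NamedTuple


class SubSeq(NamedTuple):
    """One original subsequence that contributed to a blunted node."""
    name: str
    start: int
    end: int
    strand: str


# Matches a single entry such as "read1[0:4]+".
# Assumption: this layout is inferred from the example rows, not from a spec.
_ENTRY = re.compile(r"^(?P<name>.+)\[(?P<start>\d+):(?P<end>\d+)\](?P<strand>[+-])$")


def parse_translation_table(text):
    """Return {blunted node name: [SubSeq, ...]} from tab-separated table text."""
    table = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and the header line
        node, subseqs = line.split("\t")
        entries = []
        for item in subseqs.split(","):
            match = _ENTRY.match(item)
            if match is None:
                raise ValueError(f"unrecognized entry: {item!r}")
            entries.append(
                SubSeq(match["name"], int(match["start"]), int(match["end"]), match["strand"])
            )
        table[node] = entries
    return table


example = (
    "#blunted_seq_name\toriginal_subseqs\n"
    "s1\tread1[0:4]+,read2[3:7]+\n"
    "s2\tread1[4:7]+,read2[0:3]+\n"
)
table = parse_translation_table(example)
for node, entries in table.items():
    print(node, entries)
```

The sketch only parses the fields; it deliberately does not interpret whether the intervals are half-open or how the strand applies, since the example alone does not settle that.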

@jltsiren
Contributor

Do we want a general graph translation format that can handle alignments, different levels of detail, and so on? Or do we just need a way to state that segment s1 in the original GFA corresponds to node sequence 1, 2, 3 in our graph? For the latter, we could use GFA P-lines in a separate file:

P	s1	1+,2+,3+
P	s2	4+,5+

Or we could create a new line type to avoid confusion:

T	s1	1,2,3
T	s2	4,5
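To illustrate how such a table could be used, here is a small Python sketch that reads the proposed T-lines and maps a chopped node back to a position in the original segment. The T-line format is only a proposal in this thread, the node lengths are made up for the example, and `load_translation` and `node_to_segment` are hypothetical helper names. It handles forward-strand offsets only.

```python
def load_translation(lines):
    """Parse 'T<TAB>segment<TAB>id,id,...' lines into {segment: [node ids]}."""
    translation = {}
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) != 3 or fields[0] != "T":
            continue  # ignore anything that is not a T-line
        translation[fields[1]] = [int(node) for node in fields[2].split(",")]
    return translation


def node_to_segment(translation, node_lengths):
    """Invert the table: {node id: (segment name, offset of the node's first base)}."""
    lookup = {}
    for segment, nodes in translation.items():
        offset = 0
        for node in nodes:
            lookup[node] = (segment, offset)
            offset += node_lengths[node]
    return lookup


t_lines = ["T\ts1\t1,2,3", "T\ts2\t4,5"]
# Made-up node lengths; in practice these would come from the chopped graph.
node_lengths = {1: 4, 2: 4, 3: 2, 4: 4, 5: 1}

lookup = node_to_segment(load_translation(t_lines), node_lengths)

# Translate "offset 1 on node 3" into original segment coordinates.
segment, node_start = lookup[3]
print(segment, node_start + 1)
```

The same lookup would work with the P-line variant after stripping the orientation characters, as long as every chopped node is on the forward strand of its segment.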

We need some changes to libvgio to combine GBWT and GBWTGraph into a single file. We want to be able to say something like:

vg::io::VPKG::save(gbwt_index, filename, vg::io::VPKG::OVERWRITE_FILE);
vg::io::VPKG::save(gbwt_graph, filename, vg::io::VPKG::APPEND_TO_FILE);

GBWT/GBWTGraph construction from any GFA-like format is straightforward once we have figured out what to do with segments and path names. The difficult part is parsing GBWT metadata from path names and tags. I assume we are going to need many options and presets for the metadata.

@adamnovak
Member Author

I think if we're using the VPKG encapsulation on the GBWT and GBWTGraph we can just open a stream and VPKG save both of them to the same stream. I was picturing having something in GBWTGraph that serialized and loaded the two together, automatically linking them up, but we could also do it via VPKG with manual link-up.

@ASLeonard

This looks to be resolved now (potentially needing a comment from #3356 regarding the --request XG). It took a fair bit of issue trawling to find this, so it might be useful to update the Giraffe wiki if people end up needing to relate mapping node IDs to GFA node IDs. If I get this working in the end, I'd be happy to set up a pull request with the commands used.

@adamnovak
Member Author

@ASLeonard We'd be delighted to have you update the wiki or make a documentation PR with your findings.

We abandoned the VPKG GBWT/GBWTGraph idea, but now we have GBZ which represents that combined data type.

We do still need the XG for surject, which is hard to get autoindex to make, and that could be easier. But as of this past week GBZ can hold paths, so we will eventually want to build out support for that in vg.

@briannadon

Just want to chime in since my issue #3356 was mentioned:

After changing jobs and coming back to VG and graphs after several months and a few VG updates, I tried again to use vg autoindex with vg giraffe and some other things to investigate alignments to graphs generated from PGGB. It seems the same issue remains: the giraffe GAM alignments do not agree with the original GFA graph because of the reference-vs-haplotype issue. I verified this with vg validate -a, which showed that the graph was valid but the GAM was not. Again, adding --request XG to the vg autoindex command does still entirely fix the issue. Perhaps this could be added as a default behavior for vg autoindex -w giraffe?

Separately, I would like to know if I'm losing some kind of valuable information by doing it this way. I'm working with a system wherein I don't want there to be one authoritative reference, and want every haplotype treated equally, if that makes sense. If this is a purely academic distinction I suppose it doesn't matter.

By the way, giraffe's runtime with the GBZ instead of separate GBWT and gg files is incredibly fast now, so kudos for that.

@adamnovak
Member Author

We're working on a way to store the original GFA's node names and boundaries alongside the re-chopped vg node IDs, so soon you will be able to get GAM or GAF output in the original GFA's coordinates.

Right now, even if you make the XG and use that for downstream analysis of the alignments, you still lose the GFA's coordinate space, so you either have to always do your downstream work in terms of vg's assigned node IDs, or use embedded paths (which could be an authoritative reference, but could also be any GFA P lines) to translate your results into coordinates someone else can understand.

@briannadon

Can you expand a bit on what exactly "embedded paths" are? I've seen it mentioned and I'm not sure what that is. Sorry if that's available somewhere in the wiki, couldn't find it.

Also, what is the specific difference between the graph structure output by vg autoindex [...] --request XG and just chopping the graph yourself with vg mod -X 1000, besides lacking paths? Is there any reason I couldn't just chop the graph myself and the nodes would just match up?

@jltsiren
Contributor

jltsiren commented Jan 7, 2022

There are two kinds of paths in VG: embedded paths (or just paths) that are stored in the graph itself, and lightweight paths (threads) stored in a GBWT index. Embedded paths are typically used for representing reference sequences and variants, while threads are better suited for storing a large collection of haplotypes.

When vg autoindex builds GBWT from GFA, it uses the chopping algorithm from GBWTGraph. That algorithm processes the segments in the order they appear in the GFA file, chops them into <= 1024 bp nodes, and assigns node ids starting from 1.
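The behavior described above can be sketched in a few lines of Python. This is an illustration of the description only, not the GBWTGraph implementation: it assumes each segment is cut greedily into consecutive pieces of at most 1024 bp (the thread does not say how the pieces are sized), and `chop_segments` is a hypothetical name.

```python
def chop_segments(segments, max_length=1024):
    """Chop GFA segments into nodes of at most max_length bp.

    segments: iterable of (name, sequence) pairs in GFA file order.
    Returns (nodes, translation), where nodes maps node id -> sequence and
    translation maps segment name -> list of node ids, numbered from 1.
    """
    nodes = {}
    translation = {}
    next_id = 1
    for name, sequence in segments:
        ids = []
        # Assumption: greedy cuts every max_length bases; the last piece may be shorter.
        for start in range(0, len(sequence), max_length):
            nodes[next_id] = sequence[start:start + max_length]
            ids.append(next_id)
            next_id += 1
        translation[name] = ids
    return nodes, translation


# Toy input: one long segment that must be chopped and one short segment.
segments = [("s1", "A" * 2500), ("s2", "C" * 10)]
nodes, translation = chop_segments(segments)
print(translation)
print({node_id: len(seq) for node_id, seq in nodes.items()})
```

Because the ids depend only on file order and segment lengths, the numbering is reproducible from the GFA alone, which is what makes a segment-to-node translation table easy to emit alongside the graph.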

vg mod uses the chopping algorithm from libhandlegraph. It tries to process the nodes in the order defined by their original ids, and it tries to assign the chopped nodes new ids starting from 1, but neither is guaranteed. The exact behavior depends on graph implementation.
