Make Giraffe Fully Simple #3126
Comments
In https://github.com/vgteam/GetBlunted Ryan and I output a text-based translation table that identifies each node with the original sequence(s) that contributed to them. The tables look something like this:
I think something similar might be useful for managing the chopping, along with some helpful stderr output alerting the user that the node IDs have been changed, why they have been changed, and that the translations can be found in the table.
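As a rough illustration of the idea, here is a minimal Python sketch of chopping named segments into numbered nodes while recording a translation table. The function names (`chop_segments`, `write_table`), the four-column tab-separated layout, and the example segment names are all assumptions made for this sketch; it is not the GetBlunted table format and not what vg does internally.

```python
import sys


def chop_segments(segments, max_len):
    """Chop (name, sequence) segments into numbered nodes of at most max_len bases.

    Returns (nodes, table). nodes maps each new integer node ID to its sequence.
    table lists (segment_name, offset_in_segment, node_id, length) records.
    """
    if max_len <= 0:
        raise ValueError("max_len must be positive")
    nodes = {}
    table = []
    next_id = 1
    for name, seq in segments:
        # Walk the segment in max_len steps; the last piece may be shorter.
        for offset in range(0, len(seq), max_len):
            piece = seq[offset:offset + max_len]
            nodes[next_id] = piece
            table.append((name, offset, next_id, len(piece)))
            next_id += 1
    return nodes, table


def write_table(table, out):
    """Write the translation records as tab-separated text (hypothetical layout)."""
    for name, offset, node_id, length in table:
        out.write(f"{name}\t{offset}\t{node_id}\t{length}\n")


# Made-up input: two segments with string names, chopped to at most 4 bp.
segments = [("s1", "ACGTACGTAC"), ("utg000042", "GGA")]
nodes, table = chop_segments(segments, max_len=4)

# Tell the user on stderr that IDs changed, why, and where the mapping is.
print(
    "note: segments were chopped to at most 4 bp and renumbered; "
    "the translation table maps new node IDs back to the original segments",
    file=sys.stderr,
)
write_table(table, sys.stdout)
```

Keeping the offset and length in each record means a position on a chopped node can later be mapped back to the original segment without re-reading the input file.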
Do we want a general graph translation format that can handle alignments, different levels of detail, and so on? Or do we just need a way to state that segment
Or we could create a new line type to avoid confusion:
We need some changes to libvgio to combine GBWT and GBWTGraph into a single file. We want to be able to say something like:
GBWT/GBWTGraph construction from any GFA-like format is straightforward once we have figured out what to do with segments and path names. The difficult part is parsing GBWT metadata from path names and tags. I assume we are going to need many options and presets for the metadata.
I think if we're using the VPKG encapsulation on the GBWT and GBWTGraph we can just open a stream and VPKG save both of them to the same stream. I was picturing having something in GBWTGraph that serialized and loaded the two together, automatically linking them up, but we could also do it via VPKG with manual link-up.
This looks to be resolved now (potentially needing a comment from #3356 regarding the
@ASLeonard We'd be delighted to have you update the wiki or make a documentation PR with your findings. We abandoned the VPKG GBWT/GBWTGraph idea, but now we have GBZ which represents that combined data type. We do still need the XG for surject, which is hard to get autoindex to make, and could be easier. But GBZ just this past week can hold paths, so we will eventually want to build out support for that in vg.
Just want to chime in since my issue #3356 was mentioned: After changing jobs and coming back to VG and graphs after several months and a few VG updates, I tried again to use Separately, I would like to know if I'm losing some kind of valuable information by doing it this way. I'm working with a system wherein I don't want there to be one authoritative reference, and want every haplotype treated equally, if that makes sense. If this is a purely academic distinction I suppose it doesn't matter. By the way, giraffe's runtime with the GBZ instead of separate GBWT and gg files is incredibly fast now, so kudos for that.
We're working on a way to store the original GFA's node names and boundaries alongside the re-chopped vg node IDs, so soon you will be able to get GAM or GAF output in the original GFA's coordinates. Right now, even if you make the XG and use that for downstream analysis of the alignments, you still lose the GFA's coordinate space, so you either have to always do your downstream work in terms of vg's assigned node IDs, or use embedded paths (which could be an authoritative reference, but could also be any GFA P lines) to translate your results into coordinates someone else can understand.
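To make the coordinate question concrete, here is a small Python sketch of translating a position on a chopped node back into the original segment's coordinates. The record layout, the `TABLE` contents, and the function names are hypothetical and chosen only for illustration; this is not how vg stores or reports the translation, and it handles forward-strand positions only.

```python
# Hypothetical translation records:
# (segment_name, offset_in_segment, node_id, node_length)
TABLE = [
    ("s1", 0, 1, 4),
    ("s1", 4, 2, 4),
    ("s1", 8, 3, 2),
    ("utg000042", 0, 4, 3),
]


def build_lookup(table):
    """Index the records by chopped node ID for constant-time lookup."""
    return {
        node_id: (name, offset, length)
        for name, offset, node_id, length in table
    }


def to_segment_coords(lookup, node_id, node_offset):
    """Map a forward-strand (node_id, offset) to (segment_name, offset)."""
    name, seg_offset, length = lookup[node_id]
    if not 0 <= node_offset < length:
        raise ValueError(f"offset {node_offset} is outside node {node_id}")
    # The node starts seg_offset bases into its original segment.
    return name, seg_offset + node_offset


lookup = build_lookup(TABLE)
segment_name, segment_offset = to_segment_coords(lookup, 2, 1)
print(segment_name, segment_offset)
```

A real implementation would also need to handle reverse-strand positions and whole alignments that span several chopped nodes; both are left out of this sketch.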
Can you expand a bit on what exactly "embedded paths" are? I've seen it mentioned and I'm not sure what that is. Sorry if that's available somewhere in the wiki, couldn't find it. Also, what is the specific difference between the graph structure output by
There are two kinds of paths in VG: embedded paths (or just paths) that are stored in the graph itself, and lightweight paths (threads) stored in a GBWT index. Embedded paths are typically used for representing reference sequences and variants, while threads are better suited for storing a large collection of haplotypes. When
@benedictpaten wants Giraffe to be "fully simple". This has two elements, as I understand it:
So far, we have identified three things we need to do to set this up:
Support some form or forms of GFA-with-haplotypes as a single-file input format for indexing for Giraffe. We should be able to go straight from this (blunt but otherwise unprocessed file, possibly with large nodes or string/0 node names) to the indexes Giraffe needs. We want to take this as input for a new tool-oriented indexing approach (maybe vg index --giraffe), with help that points to that and doesn't overwhelm the user with 15 different possible index formats they could make. This would require a bit of workflow work to make it so the one command can decide the right number of connected components to build indexes for in memory at a time, without running out of memory or taking forever in serial.
Combine the GBWT and GBWTGraph into one file. This would reduce the number of files that Giraffe needs to three (GBWT/graph, distance index, minimizer index). @jltsiren thinks this will be straightforward.
Abstract away node chopping and node boundaries, with some kind of system and specified format for translating between GFA coordinates and chopped-graph, numbered-node vg coordinates. The new grad students and other new users keep complaining that vg has chopped up their graph and that the node IDs are "wrong". If we have a way to tell them the coordinate translation we are using (or to maybe even output mappings in the original GFA coordinates?), they will hopefully be mollified. Eventually, we might like a way for the HandleGraph API to let you ignore where node boundaries fall.
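For the first item above, the decision about how many connected components to index in memory at once could be sketched roughly as follows. This is only an illustration in Python under stated assumptions: the per-component memory estimates, the budget, the component names, and the function `batch_components` are all hypothetical, and the grouping uses a simple first-fit-decreasing heuristic rather than anything vg is known to implement.

```python
def batch_components(estimated_bytes, budget_bytes):
    """Group components into batches whose estimated memory fits a budget.

    estimated_bytes maps a component name to its estimated memory need.
    Components are placed largest first into the first batch with room
    (first-fit decreasing). A component larger than the whole budget gets
    a batch to itself, since it cannot share memory with anything else.
    """
    batches = []  # each entry is [remaining_budget, [component names]]
    ordered = sorted(estimated_bytes.items(), key=lambda kv: (-kv[1], kv[0]))
    for name, need in ordered:
        if need > budget_bytes:
            batches.append([0, [name]])
            continue
        for batch in batches:
            if batch[0] >= need:
                batch[0] -= need
                batch[1].append(name)
                break
        else:
            # No existing batch had room, so start a new one.
            batches.append([budget_bytes - need, [name]])
    return [names for _, names in batches]


# Made-up estimates in arbitrary units, with a budget of 12 units.
estimates = {"chr1": 8, "chr2": 7, "chr3": 4, "chrM": 1}
plan = batch_components(estimates, budget_bytes=12)
print(plan)
```

Each batch would then be built in parallel, with batches run one after another, which trades peak memory against the cost of running everything in serial.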
@ekg What do you think? We also probably would need vg index --map and vg index --mpmap, right? And as for GFA-with-haplotypes, do we have a consensus on what style or styles we should accept for paths with haplotype semantics?