Split up taxonomy annotations into multiple columns, one for each level #305
Comments
Good idea - I do this in my own analysis to alleviate problems like you mentioned. Were you envisioning that the user would input the level "delimiter" or did you have something else in mind? I think GG & Silva both use semicolons but is this standardized enough to assume it for all cases?
I think semicolons are widely accepted enough that just assuming those are the inputs is probably ok (and we can add stuff to the docs that mentions something along the lines of "hey if you're using backslashes or whatever for your taxonomy delimiters please don't"). It'd be possible to add a
One other nice feature that could come out of this is having taxonomy barplots that summarize the number of species, genera, etc. detected in the numerator and the denominator. It's currently tedious to summarize taxonomies; see here for an early attempt at summarizing taxa within a log-ratio -- but this isn't ideal when there are a ton of taxa within a single log-ratio.
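A rough sketch of the kind of per-level summary such a barplot would need, assuming semicolon-delimited taxonomy strings. The function name `count_taxa_at_level`, the `(unspecified)` label, and the example features are all made up for illustration; none of this is existing Qurro code.

```python
from collections import Counter
from typing import Iterable


def count_taxa_at_level(
    taxonomies: Iterable[str], level_index: int, delimiter: str = ";"
) -> Counter:
    """Count how many features fall under each taxon at one taxonomic level.

    level_index is 0-based (0 = first level in the string, 1 = second, ...).
    Features whose taxonomy stops before that level are grouped together
    under the placeholder label "(unspecified)".
    """
    counts: Counter = Counter()
    for taxonomy in taxonomies:
        parts = [p.strip() for p in taxonomy.split(delimiter)]
        if level_index < len(parts) and parts[level_index]:
            counts[parts[level_index]] += 1
        else:
            counts["(unspecified)"] += 1
    return counts


# Illustrative features for the two sides of a log-ratio (not real data).
numerator = [
    "k__Bacteria; p__Proteobacteria; c__Gammaproteobacteria",
    "k__Bacteria; p__Proteobacteria; c__Alphaproteobacteria",
    "k__Bacteria; p__Firmicutes",
]
denominator = [
    "k__Bacteria; p__Firmicutes; c__Bacilli",
]

numerator_phyla = count_taxa_at_level(numerator, 1)
denominator_phyla = count_taxa_at_level(denominator, 1)
numerator_classes = count_taxa_at_level(numerator, 2)
```

The resulting counters (one per side of the log-ratio) could then be handed to whatever plotting layer draws the bars.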
We're doing this for Empress (biocore/empress#130), and I'm realizing this might be useful to have in Qurro as well. This would mean converting the feature metadata from something like
into something like
... We should be able to do this entirely on the Python side of things. The advantage of this is that it'd allow searching by just genera, etc., which saves you from some problems where the same string is used in different levels (e.g. `proteobacteria` being present in both p__Proteobacteria and c__Gammaproteobacteria, not that it makes a huge difference for the above example).
There are ofc some problems with this, for example what happens when features have different numbers of "levels" (which is the case in the MetaPhlAn2 (?) taxonomy information for the Byrd dataset -- e.g. there are `Viruses` with 4 levels and `Bacteria` with 7 in the same dataset). But these problems should be surmountable; for this particular problem we could, say, "pad" missing levels with nulls or whatever.
I'm putting this on the backburner now while we do this in Empress, but at some point in the future it may be nice to port that back over to here.
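For reference, a minimal sketch of the split-and-pad idea, assuming semicolon delimiters and at most seven levels. The level names in `LEVELS` and the helpers `split_taxonomy` and `search_level` are hypothetical, chosen only for illustration; this is not how Empress or Qurro actually implement it.

```python
from typing import Dict, List, Optional

# Hypothetical column names; the thread does not settle on a naming scheme.
LEVELS = ["Kingdom", "Phylum", "Class", "Order", "Family", "Genus", "Species"]


def split_taxonomy(taxonomy: str, delimiter: str = ";") -> Dict[str, Optional[str]]:
    """Split one taxonomy string into one entry per level.

    Levels missing from the string are padded with None, so every feature
    ends up with the same set of columns.
    """
    parts: List[Optional[str]] = [p.strip() for p in taxonomy.split(delimiter)]
    # Drop empty trailing pieces, e.g. those left by a trailing delimiter.
    while parts and parts[-1] == "":
        parts.pop()
    if len(parts) > len(LEVELS):
        raise ValueError(f"More than {len(LEVELS)} levels in {taxonomy!r}")
    padded = parts + [None] * (len(LEVELS) - len(parts))
    return dict(zip(LEVELS, padded))


def search_level(
    rows: List[Dict[str, Optional[str]]], level: str, query: str
) -> List[Dict[str, Optional[str]]]:
    """Return rows whose value at a single level contains query (case-insensitive)."""
    q = query.lower()
    return [r for r in rows if r[level] is not None and q in r[level].lower()]


# Illustrative taxonomy strings with different numbers of levels.
rows = [
    split_taxonomy("k__Bacteria; p__Proteobacteria; c__Gammaproteobacteria"),
    split_taxonomy("k__Bacteria; p__Firmicutes"),
]
phylum_hits = search_level(rows, "Phylum", "proteobacteria")
class_hits = search_level(rows, "Class", "proteobacteria")
```

Searching a single column means `proteobacteria` at the phylum level only ever matches against phylum values, and the shorter taxonomy simply has None in the columns it lacks.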
Edit: also, now that I think of it, having this for biplots in Emperor could be really nice?