Split up taxonomy annotations into multiple columns, one for each level #305

fedarko · 2020-05-25T06:36:39Z

We're doing this for Empress (biocore/empress#130), and I'm realizing this might be useful to have in Qurro as well. This would mean converting the feature metadata from something like

Feature ID	Taxonomy
asdf	k__Bacteria; p__Bacteroidetes; c__Bacteroidia; o__Bacteroidales; f__Bacteroidaceae; g__Bacteroides; s__
ghjk	k__Bacteria; p__Proteobacteria; c__Gammaproteobacteria; o__Pasteurellales; f__Pasteurellaceae; g__; s__

into something like

Feature ID	Level 1	Level 2	Level 3	Level 4	Level 5	Level 6	Level 7
asdf	k__Bacteria	p__Bacteroidetes	c__Bacteroidia	o__Bacteroidales	f__Bacteroidaceae	g__Bacteroides	s__
ghjk	k__Bacteria	p__Proteobacteria	c__Gammaproteobacteria	o__Pasteurellales	f__Pasteurellaceae	g__	s__

We should be able to do this entirely on the Python side of things. The advantage is that it'd allow searching within a single level (e.g. just genera), which saves you from some problems where the same string is used in different levels (e.g. "proteobacteria" being present in both p__Proteobacteria and c__Gammaproteobacteria, not that it makes a huge difference for the above example).
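To make the ambiguity concrete, here is a minimal sketch using the two example features above. It is not how Qurro or Empress actually implement this; the helper names (`split_levels`, `search_full_text`, `search_level`) are hypothetical, and it assumes semicolon-delimited taxonomy strings.

```python
# Hypothetical helpers for illustration only; assumes ";" separates levels.
def split_levels(taxonomy, delimiter=";"):
    """Split a taxonomy string into a list of whitespace-stripped levels."""
    return [level.strip() for level in taxonomy.split(delimiter)]


feature_taxonomy = {
    "asdf": "k__Bacteria; p__Bacteroidetes; c__Bacteroidia; o__Bacteroidales; "
            "f__Bacteroidaceae; g__Bacteroides; s__",
    "ghjk": "k__Bacteria; p__Proteobacteria; c__Gammaproteobacteria; "
            "o__Pasteurellales; f__Pasteurellaceae; g__; s__",
}


def search_full_text(query):
    """Return IDs of features whose whole taxonomy string contains the query."""
    return [
        fid for fid, tax in feature_taxonomy.items()
        if query.lower() in tax.lower()
    ]


def search_level(query, level_index):
    """Return IDs of features whose level at level_index (0-based) contains the query."""
    hits = []
    for fid, tax in feature_taxonomy.items():
        levels = split_levels(tax)
        if level_index < len(levels) and query.lower() in levels[level_index].lower():
            hits.append(fid)
    return hits


# "bacter" matches both features in a full-text search (via k__Bacteria)...
full_hits = search_full_text("bacter")
# ...but only one feature when the search is restricted to the genus level.
genus_hits = search_level("bacter", 5)
```

Restricting the search to one column is what removes the cross-level matches; the same idea applies to "proteobacteria" at the phylum vs. class level.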

There are ofc some problems with this, for example what happens when features have different numbers of "levels" (which is the case in the MetaPhlAn2 (?) taxonomy information for the Byrd dataset -- e.g. there are Viruses with 4 levels and Bacteria with 7 in the same dataset). But these problems should be surmountable; for this particular problem we could, say, "pad" missing levels with nulls or whatever.
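The padding idea could look roughly like the sketch below. This is only one possible approach, not an existing Qurro function: `split_and_pad` is a hypothetical name, the "Level N" column names follow the example table above, and the 4-level virus taxonomy string is made up for illustration (it is not taken from the Byrd dataset).

```python
# Hypothetical helper; assumes ";" separates levels and pads short
# taxonomies on the right with a fill value (None by default).
def split_and_pad(taxonomies, delimiter=";", fill=None):
    """Split each taxonomy string into levels and pad to the deepest one.

    taxonomies: dict mapping feature ID -> taxonomy string.
    Returns (column_names, rows), where rows maps feature ID -> list of
    levels, all of the same length.
    """
    split = {
        fid: [level.strip() for level in tax.split(delimiter)]
        for fid, tax in taxonomies.items()
    }
    max_levels = max((len(levels) for levels in split.values()), default=0)
    column_names = [f"Level {i}" for i in range(1, max_levels + 1)]
    rows = {
        fid: levels + [fill] * (max_levels - len(levels))
        for fid, levels in split.items()
    }
    return column_names, rows


taxonomies = {
    "asdf": "k__Bacteria; p__Bacteroidetes; c__Bacteroidia; o__Bacteroidales; "
            "f__Bacteroidaceae; g__Bacteroides; s__",
    # Made-up 4-level example, just to exercise the padding.
    "virus1": "k__Viruses; p__Viruses_noname; c__Viruses_noname; o__Caudovirales",
}

columns, rows = split_and_pad(taxonomies)
```

Padding on the right assumes that shorter taxonomies are simply missing their most specific levels; if levels could be missing from the middle, the prefixes (k__, p__, ...) would need to be used to line things up instead.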

I'm putting this on the backburner now while we do this in Empress, but at some point in the future it may be nice to port that back over to here.

Edit: also, now that I think of it, having this for biplots in Emperor could be really nice?


gibsramen · 2020-05-25T15:16:22Z

Good idea - I do this in my own analysis to alleviate problems like you mentioned. Were you envisioning that the user would input the level "delimiter" or did you have something else in mind? I think GG & Silva both use semicolons but is this standardized enough to assume it for all cases?

fedarko · 2020-05-25T20:14:35Z

I think semicolons are widely accepted enough that just assuming those are the inputs is probably ok (and we can add stuff to the docs that mentions something along the lines of "hey if you're using backslashes or whatever for your taxonomy delimiters please don't"). It'd be possible to add a --p-taxonomy-level-delimiter command-line argument or something like that, but I'd prefer to avoid introducing extra complexity like that when I'm not sure it'd be useful in the majority of cases.

mortonjt · 2023-05-23T14:53:12Z

One other nice feature that could come out of this is having taxonomy barplots that summarize the number of species, genera, etc. detected in the numerator and the denominator. It's currently tedious to summarize taxonomies; see here for an early attempt at summarizing taxa within a log-ratio -- but this isn't ideal when there are a ton of taxa within a single log-ratio.

fedarko · 2023-05-23T15:55:35Z

That would be a cool idea! I like the idea of taxonomy barplots (like the QIIME 2 ones) that the user can view within a Qurro visualization after selecting a log-ratio. I don't think I will have time to get to actually integrating this into the Qurro visualization interface for quite a while, but there are a few other ways to get similar functionality in the meantime.

One silly idea: we can create a fake "feature table" containing two samples (numerator, denominator) based on a log-ratio exported from Qurro (where, if a feature is used in the [numerator | denominator] of the log-ratio, then it gets a count of 1 for the [numerator | denominator] sample and 0 otherwise). We can then pass this sort of table (in addition to the taxonomy file) into the QIIME 2 taxonomy barplots visualizer.

I spent a few minutes fiddling around and was able to get this to work -- here's the class-level barplot shown for a log-ratio selected from the moving pictures dataset:

[Screenshots: Qurro visualization (left), Q2 barplot visualization (right)]

And here's the Python code I used (after exporting the log-ratio from Qurro using the "Export currently selected features" button to selected_features.tsv):

import pandas as pd
import biom
from biom.util import biom_open

df = pd.read_csv("selected_features.tsv", sep="\t", index_col=0)
df["Numerator"] = 0
df["Denominator"] = 0
# NOTE: this is obscenely slow + ugly and should ideally be replaced with
# vectorization or apply or something
for f in df.index:
    # Assign through .loc: chained indexing like df["Denominator"][f] = 1
    # can end up writing to a temporary copy in newer pandas versions
    classification = df.loc[f, "Log_Ratio_Classification"]
    if classification == "Denominator":
        df.loc[f, "Denominator"] = 1
    elif classification == "Numerator":
        df.loc[f, "Numerator"] = 1
    elif classification == "Both":
        df.loc[f, ["Numerator", "Denominator"]] = 1
    else:
        raise ValueError("call a priest")
df = df.drop(columns=["Log_Ratio_Classification"])
df.to_csv("fake_table.tsv", sep="\t")
with open("fake_table.tsv") as fh:
    tbl = biom.Table.from_tsv(fh, None, None, None)
with biom_open("fake_table.biom", "w") as fh:
    tbl.to_hdf5(fh, "qurro trickery")
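As the NOTE in the code above says, the per-feature loop could be replaced with something vectorized. Here's a rough sketch of what that might look like, covering only the pandas part (not the BIOM conversion). The tiny inline TSV is a made-up stand-in for selected_features.tsv, and the "Feature ID" header is an assumption; only the Log_Ratio_Classification column name and its three values come from the code above.

```python
import io

import pandas as pd

# Made-up stand-in for selected_features.tsv (real exports may look different).
tsv = (
    "Feature ID\tLog_Ratio_Classification\n"
    "f1\tNumerator\n"
    "f2\tDenominator\n"
    "f3\tBoth\n"
)
df = pd.read_csv(io.StringIO(tsv), sep="\t", index_col=0)

classification = df["Log_Ratio_Classification"]

# Fail loudly on anything other than the three expected values,
# mirroring the else branch of the loop.
unexpected = set(classification) - {"Numerator", "Denominator", "Both"}
if unexpected:
    raise ValueError(
        f"Unexpected classification(s): {sorted(map(str, unexpected))}"
    )

# isin() gives a boolean column; casting to int turns it into 1 / 0 counts.
df["Numerator"] = classification.isin(["Numerator", "Both"]).astype(int)
df["Denominator"] = classification.isin(["Denominator", "Both"]).astype(int)
df = df.drop(columns=["Log_Ratio_Classification"])
```

The resulting df has the same Numerator / Denominator columns as the loop version, so the to_csv and BIOM steps shouldn't need to change.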

After doing this, you can then visualize a barplot using the following QIIME 2 commands:

# NOTE: I'm pretty sure FeatureTable[PresenceAbsence] would be a better semantic type,
# but the barplot visualizer requires we give it a FeatureTable[Frequency] artifact.
qiime tools import --type "FeatureTable[Frequency]" \
    --input-path fake_table.biom \
    --output-path fake_table.qza

qiime taxa barplot --i-table fake_table.qza \
    --i-taxonomy [path to your taxonomy.qza file goes here] \
    --o-visualization barplot.qzv

Not sure if this is similar to what you had in mind, but hopefully it's fun to play around with at least ;)

fedarko added the enhancement (New feature or request) and backburner (Low-priority things that are still good to keep track of) labels on May 25, 2020
