Get the best domain for each sequence PICRUSt2‐MPGA database
This script is new for PICRUSt2 v2.6.0 with the new PICRUSt2-MPGA database. You can see full details on the updates made to the PICRUSt2 database here.
This script takes trees and NSTI tables for two domains (by default bacteria and archaea) and determines the best reference fit (lowest Nearest Sequenced Taxon Index [NSTI]) for each study sequence. It then filters the tree and NSTI files for each domain so that they contain only the sequences that fit best within that domain.
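The selection rule described above can be sketched in a few lines of Python. This is only an illustration of the "lowest NSTI wins" idea, not the code used by pick_best_domain.py: the function names, the dictionary inputs, and the tie-breaking rule (ties go to the first domain) are all assumptions made for the example.

```python
# Hypothetical sketch of "lowest NSTI wins"; not the PICRUSt2 implementation.
def pick_best_domain(nsti_dom1, nsti_dom2, name_dom1="bacteria", name_dom2="archaea"):
    """Return {sequence_id: (best_domain_name, nsti)}.

    nsti_dom1 and nsti_dom2 map study sequence IDs to their NSTI value
    in each domain's placement.
    """
    best = {}
    for seq_id in sorted(set(nsti_dom1) | set(nsti_dom2)):
        # A sequence missing from one table is treated as an infinitely poor fit there.
        v1 = nsti_dom1.get(seq_id, float("inf"))
        v2 = nsti_dom2.get(seq_id, float("inf"))
        # Assumption: ties are resolved in favor of the first domain.
        if v1 <= v2:
            best[seq_id] = (name_dom1, v1)
        else:
            best[seq_id] = (name_dom2, v2)
    return best


def filter_to_domain(nsti, best, domain_name):
    """Keep only the sequences whose best domain is domain_name."""
    return {seq_id: value for seq_id, value in nsti.items()
            if best[seq_id][0] == domain_name}


# Made-up NSTI values for three study sequences.
bac = {"seq1": 0.02, "seq2": 1.50, "seq3": 0.30}
arc = {"seq1": 1.10, "seq2": 0.05, "seq3": 0.30}

best = pick_best_domain(bac, arc)
bac_reduced = filter_to_domain(bac, best, "bacteria")
arc_reduced = filter_to_domain(arc, best, "archaea")
```

With these values, seq1 and seq3 stay in the bacterial table and seq2 moves to the archaeal table; the real script applies the same kind of filtering to the trees as well.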
It can be run like this:
pick_best_domain.py \
-n1 bacteria \
-n2 archaea \
--tree_dom1 bac_placed_seqs.tre \
--tree_dom2 arc_placed_seqs.tre \
--tree_out_dom1 bac_reduced_placed_seqs.tre \
--tree_out_dom2 arc_reduced_placed_seqs.tre \
--nsti_table_dom1 bac_marker_nsti_predicted.tsv.gz \
--nsti_table_dom2 arc_marker_nsti_predicted.tsv.gz \
--nsti_table_out_dom1 bac_reduced_marker_nsti_predicted.tsv.gz \
--nsti_table_out_dom2 arc_reduced_marker_nsti_predicted.tsv.gz \
--nsti_table_out_combined combined_marker_nsti_predicted.tsv.gz
The input arguments/options are:
- -n1/--name_dom1 NAME - the name of the domain associated with the first tree and NSTI table, used to label that domain in the output file (default: bacteria)
- -n2/--name_dom2 NAME - as above for -n1 but for the second domain (default: archaea)
- --tree_dom1 TREEFILE - Newick tree with study sequences placed amongst reference sequences for the first domain (if using the default options, this should be for bacteria)
- --tree_dom2 TREEFILE - Newick tree with study sequences placed amongst reference sequences for the second domain (if using the default options, this should be for archaea)
- --tree_out_dom1 TREEFILE - Output tree for the first domain with reference sequences and only the study sequences that fit best into this domain (if using the default options, this should be for bacteria)
- --tree_out_dom2 TREEFILE - as above for --tree_out_dom1 but for the second domain (if using the default options, this should be for archaea)
- --nsti_table_dom1 MARKER_PREDICTED.tsv.gz - Table containing NSTI values for the first domain in tab-delimited format (if using the default options, this should be for bacteria)
- --nsti_table_dom2 MARKER_PREDICTED.tsv.gz - as above for --nsti_table_dom1 but for the second domain (if using the default options, this should be for archaea)
- --nsti_table_out_dom1 MARKER_PREDICTED.tsv.gz - Output table containing NSTI values for the first domain for only the study sequences that fit best into this domain (if using the default options, this should be for bacteria)
- --nsti_table_out_dom2 MARKER_PREDICTED.tsv.gz - as above for --nsti_table_out_dom1 but for the second domain (if using the default options, this should be for archaea)
- --nsti_table_out_combined MARKER_PREDICTED.tsv.gz - Output table containing NSTI values for all study sequences. This also records which domain was the best match for each sequence (the NSTI value corresponds to that domain) and the genome that was the closest match.
- --ref_fasta_dom1 FASTA - The reference FASTA file for the first domain. If this is not given, the FASTA file for bacteria is used by default.
- --ref_fasta_dom2 FASTA - The reference FASTA file for the second domain. If this is not given, the FASTA file for archaea is used by default.
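As a quick check on the file written by --nsti_table_out_combined, the gzipped tab-delimited table can be summarised with the Python standard library. The sketch below counts how many study sequences were assigned to each domain. The helper name count_best_domains and the column name best_domain are hypothetical; check the header line of your own output file and pass the actual column name.

```python
import csv
import gzip
from collections import Counter


def count_best_domains(path, domain_column="best_domain"):
    """Count study sequences per best-matching domain in a gzipped TSV.

    domain_column is a hypothetical default; replace it with the column
    name found in the header of the combined NSTI table.
    """
    with gzip.open(path, "rt", newline="") as handle:
        reader = csv.DictReader(handle, delimiter="\t")
        return Counter(row[domain_column] for row in reader)
```

For example, count_best_domains("combined_marker_nsti_predicted.tsv.gz") returns a Counter keyed by the domain names given with -n1 and -n2, provided the column name matches.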
Please first check our FAQ if you have any questions about PICRUSt2.
For other general questions and comments about PICRUSt2 please search the PICRUSt google group. If the question has not been previously answered then please make a new thread.
To report a bug or to make a feature request please make a new issue at the top of this page.