Full pipeline script
The entire PICRUSt2 pipeline can be run using a single script, picrust2_pipeline.py. This script runs each of the 4 key steps outlined on this wiki: (1) sequence placement, (2) hidden-state prediction of genomes, (3) metagenome prediction, and (4) pathway-level predictions. Descriptions of gene families are added to the output files by default.
The options of this script are largely the same as those of the individual scripts for each step.
The standard pipeline generates metagenome predictions for 16S rRNA gene data. The input files should be a FASTA file of amplicon sequence variants (ASVs; i.e. your representative sequences, not your raw reads), which is study_seqs.fna below, and a BIOM table of the abundance of each ASV across each sample (study_seqs.biom below). Note that a tab-delimited table with ASV ids as the first column and sample abundances as all subsequent columns will also work.
The command below runs the full default pipeline on the two input files. EC number and KO metagenomes are predicted, and MetaCyc pathway abundances and coverages are then predicted from the predicted EC number abundances. The -n option specifies that the nearest-sequenced taxon index (NSTI) will be calculated for each input ASV; by default, any ASVs with NSTI > 2 are excluded from the output. Stratified output is only calculated when the --stratified option is set, which can greatly increase run-time.
picrust2_pipeline.py -s study_seqs.fna -i study_seqs.biom -o picrust2_out_pipeline --threads 10 -n
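If the pipeline is launched from a Python script rather than typed at the shell, the same command can be assembled as an argument list. The sketch below is only an illustration: build_pipeline_command and run_pipeline are hypothetical helpers, not part of PICRUSt2, and they use only flags that appear on this page.

```python
import shutil
import subprocess


def build_pipeline_command(fasta, table, out_dir, threads=1,
                           stratified=False, max_nsti=None):
    """Assemble the argument list for picrust2_pipeline.py (hypothetical helper)."""
    cmd = ["picrust2_pipeline.py", "-s", fasta, "-i", table,
           "-o", out_dir, "--threads", str(threads), "-n"]
    if stratified:
        cmd.append("--stratified")  # stratified output increases run-time
    if max_nsti is not None:
        cmd += ["--max_nsti", str(max_nsti)]
    return cmd


def run_pipeline(cmd):
    """Run the command, failing early if the script is not on the PATH."""
    if shutil.which(cmd[0]) is None:
        raise FileNotFoundError(f"{cmd[0]} was not found on the PATH")
    subprocess.run(cmd, check=True)


cmd = build_pipeline_command("study_seqs.fna", "study_seqs.biom",
                             "picrust2_out_pipeline", threads=10)
print(" ".join(cmd))
```

With these arguments the printed command is the same as the one shown above. Calling run_pipeline(cmd) would then execute it, assuming PICRUSt2 is installed and the input files exist.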
All of the output files produced by the pipeline, including intermediate files that can be useful for troubleshooting, are written to picrust2_out_pipeline. The script accepts the following options:
- -s PATH - FASTA of unaligned study sequences.
- -i PATH - Input table of sequence abundances (BIOM or TSV format).
- -o PATH - Output folder.
- --threads INT - Number of threads to use (default: 1).
- -r PATH - FASTA of aligned reference sequences (default: picrust2/default_files/prokaryotic/reference.fna).
- -t PATH - Input tree based on aligned reference sequences (default: picrust2/default_files/prokaryotic/reference.tre).
- --hmm PATH - Hidden Markov model of reference MSA (default: picrust2/default_files/prokaryotic/reference.hmm).
- --in_traits IN_TRAITS - Comma-delimited list (with no spaces) of which gene families to predict from this set: COG, EC, KO, PFAM, TIGRFAM. Note that E.C. numbers will always be predicted unless --no_pathways is set (default: EC,KO).
- --custom_trait_tables PATH - Optional path to custom trait tables with gene families as columns and genomes as rows (overrides --in_traits setting) to be used for hidden-state prediction. Multiple tables can be specified by delimiting filenames with commas. Importantly, the first custom table specified will be used for inferring pathway abundances. Typically this option would be used together with a custom marker gene table (--marker_gene_table).
- --marker_gene_table PATH - Path to marker gene copy number table (16S copy numbers by default).
- --pathway_map MAP - MinPath mapfile. The default mapfile maps MetaCyc reactions to prokaryotic pathways (default: picrust2/default_files/pathway_mapfiles/metacyc_path2rxn_struc_filt_pro.txt).
- --no_pathways - Flag to indicate that pathways should NOT be inferred (otherwise they will be inferred by default). Predicted E.C. number abundances are used to infer pathways when default reference files are used.
- --regroup_map ID_MAP - Mapfile of ids to regroup gene families to before running MinPath. The default mapfile is for regrouping E.C. numbers to MetaCyc reactions (default: picrust2/default_files/pathway_mapfiles/ec_level4_to_metacyc_rxn.tsv).
- --no_regroup - Do not regroup input gene families to reactions as specified in the regrouping mapfile. This option should only be used if you are using custom reference and/or mapping files.
- --stratified - Flag to indicate that stratified tables should be generated at all steps (will increase run-time).
- -a {hmmalign,papara} - Which program to use for aligning query sequences to the reference MSA prior to the EPA-NG step (default: hmmalign).
- --max_nsti INT - Sequences with NSTI values above this value will be excluded (default: 2).
- --min_reads INT - Minimum number of reads across all samples for each input ASV. ASVs below this cut-off will be counted as part of the "RARE" category in the stratified output (default: 1).
- --min_samples INT - Minimum number of samples that an ASV needs to be identified within. ASVs below this cut-off will be counted as part of the "RARE" category in the stratified output (default: 1).
- -m {mp,emp_prob,pic,scp,subtree_average} - HSP method to use. "mp": predict discrete traits using max parsimony. "emp_prob": predict discrete traits based on empirical state probabilities across tips. "subtree_average": predict continuous traits using subtree averaging. "pic": predict continuous traits with phylogenetic independent contrasts. "scp": reconstruct continuous traits using squared-change parsimony (default: mp).
- -n - Calculate NSTI and add to output file.
- -c - Output 95 percent confidence intervals (only possible for mk_model, emp_prob, and mp settings).
- --seed SEED - Seed to make output reproducible, which is necessary for the emp_prob method (default: 100).
- --no_gap_fill - Do not perform gap filling before predicting pathway abundances (gap filling is performed by default).
- --per_sequence_contrib - Run MinPath on the gene families contributed by each sequence (i.e. a predicted genome) individually. This will only matter if --per_sequence_contrib is set. Note this will GREATLY increase the runtime, but will output the predicted pathway abundance contributed by the predicted gene families in each predicted genome alone (i.e. not the contribution to the community-wide abundance). Pathway coverage stratified by contributing sequence will also be output when this option is set (default: 0).
- --no_descrip - Do not add function descriptions to output tables.
- --verbose - If specified, print out wrapped commands to screen.
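To make the --min_reads and --min_samples cut-offs concrete, here is a minimal Python sketch of how ASVs that fall below them could be pooled into a single "RARE" row. It illustrates the rule as described in the option list and is not PICRUSt2's own code: the collapse_rare function and the example table are hypothetical, and treating an ASV as rare when it fails either cut-off (rather than both) is an assumption.

```python
def collapse_rare(table, min_reads=1, min_samples=1):
    """Sum ASVs that fail either cut-off into a single "RARE" row.

    table maps ASV id -> list of per-sample read counts.
    """
    kept = {}
    rare = None
    for asv_id, counts in table.items():
        total_reads = sum(counts)                     # reads across all samples
        n_samples = sum(1 for c in counts if c > 0)   # samples where the ASV occurs
        if total_reads < min_reads or n_samples < min_samples:
            # Assumption: failing either cut-off is enough to count as rare.
            rare = list(counts) if rare is None else [a + b for a, b in zip(rare, counts)]
        else:
            kept[asv_id] = counts
    if rare is not None:
        kept["RARE"] = rare
    return kept


# Hypothetical abundance table: three ASVs across three samples.
table = {
    "ASV1": [10, 0, 3],
    "ASV2": [1, 0, 0],
    "ASV3": [0, 2, 0],
}
print(collapse_rare(table, min_reads=2, min_samples=2))
# → {'ASV1': [10, 0, 3], 'RARE': [1, 2, 0]}
```

With the default cut-offs of 1, every ASV that has at least one read is kept and no "RARE" row is produced.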
Please first check our FAQ if you have any questions about PICRUSt2.
For other general questions and comments about PICRUSt2, please search the PICRUSt Google group. If your question has not been answered previously, please start a new thread.
To report a bug or to make a feature request please make a new issue at the top of this page.