The `crest4` python package can automatically assign taxonomic names to DNA sequences obtained from environmental sequencing.

xapple/crest4

CREST version 4.3.7

crest4 is a python package for automatically assigning taxonomic names to DNA sequences obtained from environmental sequencing.

More specifically, the acronym CREST stands for "Classification Resources for Environmental Sequence Tags" and is a collection of software and databases for taxonomic classification of environmental marker genes obtained from community sequencing studies. Such studies are also known as "meta-genomics", "meta-transcriptomics", "meta-barcoding", "taxonomic profiling" or "phylogenetic profiling".

Simply put, given the following fragment of an rRNA 16S sequence from an uncultured microbe:

TGGGGAATTTTCCGCAATGGGCGAAAGCCTGACGGAGCAATACCGCGTGAGGGAGGAAGGCCTTAGGGTT
GTAAACCTCTTTTCTCTGGGAAGAAGATCTGACGGTACCAGAGGAATAAGCCTCGGCTAACTCCGTGCCA
GCAGCCGCGGTAAGACGGAGGAGGCAAGCGTTATCCGGAATTATTGGGCGTAAAGCGTCCGTAGGCGGTT
AATTAAGTCTGTTGTTAAAGCCCACAGCTCAACTGTGGATCGGCAATGGAAACTGGTTGACTAGAGTGTG
GTAGGGGTAGAGGGAATTCCCGGTGTAGCGGTGAAATGCGTAGATATCG

crest4 can tell you that this gene most likely originates from the following genus:

Bacteria; Terrabacteria; Cyanobacteria; Cyanobacteriia; Phormidesmiales; Nodosilineaceae; Nodosilinea

To produce this result, the input sequence is compared against a built-in reference database of marker genes (such as the SSU rRNA), using the BLAST or VSEARCH algorithms. All high-similarity hits are recorded and filtered against both a minimum score threshold and a minimum identity threshold. Next, for every surviving hit, the exact position in the phylogenetic tree of life is recorded. Finally, the full name of the lowest common ancestor of this collection of nodes in the tree is determined and reported as a confident taxonomic classification. For instance, if all hits for a given sequence agree only at the order level, the assignment stops at the order level.
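The lowest common ancestor step described above can be pictured with a minimal Python sketch. It is not the crest4 implementation: it assumes each surviving hit is already available as a root-to-leaf list of taxon names, and the function name `lowest_common_ancestor` and the second, disagreeing hit are made up for illustration.

```python
from typing import List, Sequence


def lowest_common_ancestor(paths: Sequence[Sequence[str]]) -> List[str]:
    """Return the longest prefix shared by all root-to-leaf taxonomy paths."""
    if not paths:
        return []
    shared = []
    # zip() walks the paths rank by rank and stops at the shortest one
    for names in zip(*paths):
        # Stop at the first level where the hits disagree
        if len(set(names)) != 1:
            break
        shared.append(names[0])
    return shared


# First path taken from the example above; the second one is hypothetical
hits = [
    ["Bacteria", "Terrabacteria", "Cyanobacteria", "Cyanobacteriia",
     "Phormidesmiales", "Nodosilineaceae", "Nodosilinea"],
    ["Bacteria", "Terrabacteria", "Cyanobacteria", "Cyanobacteriia",
     "Phormidesmiales", "Hypothetical_family", "Hypothetical_genus"],
]
print("; ".join(lowest_common_ancestor(hits)))
# Bacteria; Terrabacteria; Cyanobacteria; Cyanobacteriia; Phormidesmiales
```

Because the two hits only agree down to the order, the reported classification stops there.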

This strategy contrasts with other tools that instead use a naive Bayesian classifier for taxonomic assignment. Often referred to as the Wang method and used, for example, in the RDP software, that approach consists of the following steps: calculate the probability that the query sequence belongs to each reference taxon based on its decomposed k-mer content, then pick the taxon with the highest probability while considering a confidence limit computed by a bootstrapping algorithm.

Citation

If you use CREST in your research, please cite this publication:

CREST - Classification Resources for Environmental Sequence Tags, PLoS ONE, 7:e49334
Lanzén A, Jørgensen SL, Huson D, Gorfer M, Grindhaug SH, Jonassen I, Øvreås L, Urich T (2012)

Installing

Since crest4 is written in python it is compatible with all operating systems: Linux, macOS and Windows. The only prerequisite is python version 3.8 or above which is often installed by default. Simply choose one of the two following methods to install, depending on which package manager you prefer to use.

Installing via conda

$ conda install -c bioconda -c conda-forge -c xapple crest4

Or to create a custom environment named crest which you activate later:

$ conda create -n crest -c bioconda -c conda-forge -c xapple crest4

Installing via pip

$ pip3 install crest4

Notes and extras

Once the installation completes you are ready to use the crest4 executable command from the shell. Please note that the reference databases are downloaded automatically during the first run, so this might take some time depending on your internet connection.

In order to search the reference databases, you will also need either BLAST or VSEARCH installed. These can be installed automatically with the conda package manager.

$ conda install blast
$ conda install vsearch

If you don't use conda, you can obtain them with these commands on Linux, provided you have admin rights:

$ sudo apt update
$ sudo apt install ncbi-blast+
$ sudo apt install vsearch

Or these commands on macOS that work without sudo access:

$ brew install blast
$ brew install vsearch

If you wish to install crest4 from the repository source code you can follow these instructions instead.

Troubleshooting

  • If you do not have conda on your system you can refer to this section.
  • If you do not have pip3 on your system you can refer to this section.
  • If you do not have python3 on your system or have an outdated version, you can refer to this other section.
  • If you can't run the crest4 command after a successful installation, make sure that the python bin directory is in your path. This is usually $HOME/.local/bin/ for Ubuntu.
  • If none of the above has enabled you to install crest4, please open an issue on the bug tracker and we will get back to you shortly.

Database location

To download the databases that are used in the classification algorithm, crest4 needs somewhere to write to on the filesystem. This will default to your home directory at: ~/.crest4/. If you wish to change this, simply set the environment variable $CREST4_DIR to another writable directory path prior to execution.
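As an illustration of the lookup described above, here is a small Python sketch that resolves the database directory the same way: it prefers `$CREST4_DIR` and falls back to `~/.crest4/`. The helper name `resolve_crest4_dir` is hypothetical and this is not the code crest4 itself runs.

```python
import os
from pathlib import Path


def resolve_crest4_dir(environ=None) -> Path:
    """Return the directory where the reference databases would be stored."""
    env = os.environ if environ is None else environ
    # An unset or empty variable falls back to the documented default
    raw = env.get("CREST4_DIR") or "~/.crest4/"
    return Path(raw).expanduser()


print(resolve_crest4_dir())
```

From a python pipeline, the variable can be set with `os.environ["CREST4_DIR"] = "/some/writable/dir"` before crest4 is executed.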

Usage

Below are some examples that illustrate the various ways to use this package.

crest4 -f sequences.fasta

Simply specifying a FASTA file with the sequences to classify is sufficient, and crest4 will choose default values for all the parameters automatically. The results produced will be placed in a subdirectory inside the same directory as the FASTA file.

To change the output directory, specify the following option:

crest4 -f sequences.fasta -o ~/data/results/crest_test/

To parallelize the sequence similarity search with 32 threads use this option:

crest4 -f sequences.fasta -t 32

The default reference database is silvamod138pr2. To use another database, e.g., midori, pass the -d option followed by the database name:

crest4 -f sequences.fasta -d midori248
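If you drive the executable from a script rather than typing these commands by hand, the options shown above can be assembled programmatically. The following sketch only builds the argument list from the documented `-f`, `-o`, `-t` and `-d` flags; the helper name `build_crest4_command` is an assumption made for this example.

```python
import shlex
from typing import List, Optional


def build_crest4_command(fasta: str,
                         output_dir: Optional[str] = None,
                         threads: Optional[int] = None,
                         database: Optional[str] = None) -> List[str]:
    """Assemble a crest4 command line from the options shown above."""
    cmd = ["crest4", "-f", str(fasta)]
    if output_dir is not None:
        cmd += ["-o", str(output_dir)]
    if threads is not None:
        cmd += ["-t", str(threads)]
    if database is not None:
        cmd += ["-d", database]
    return cmd


cmd = build_crest4_command("sequences.fasta", threads=32)
print(shlex.join(cmd))
# crest4 -f sequences.fasta -t 32
# To run it (requires crest4 on the PATH):
# import subprocess; subprocess.run(cmd, check=True)
```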

All options

The full list of options is as follows:

Required arguments:
  --fasta PATH, -f PATH
                        The path to a single FASTA file as a string.
                        These are the sequences that will be taxonomically
                        classified.

Optional arguments:
  --search_algo ALGORITHM, -a ALGORITHM
                        The algorithm used for the sequence similarity search
                        that will be run to match the sequences against the
                        database chosen. Either `blast` or `vsearch`. No
                        other values are currently supported. By default,
                        `blast`.

  --num_threads NUM, -t NUM
                        The number of processors to use for the sequence
                        similarity search. By default, parallelism is turned
                        off and this value is 1. If you pass the value `True`
                        we will run as many processes as there are CPUs but
                        no more than 32.

  --search_db DATABASE, -d DATABASE
                        The database used for the sequence similarity search.
                        Either `midori253darn`, `silvamod138pr2`, `mitofish` or
                        `silvamod128`.
                        By default, `silvamod138pr2`. Optionally, the user can
                        provide a custom database by specifying the full path
                        to a directory containing all required files under
                        `search_db`. See the README for more information.

  --output_dir DIR, -o DIR
                        The directory into which all the classification
                        results will be written. This defaults to a
                        directory with the same name as the original FASTA
                        file and a `.crest4` suffix appended.

  --search_hits PATH, -s PATH
                        The path where the search results will be stored.
                        This defaults to the output directory. However,
                        if the search operation has already been completed
                        beforehand, specify the path here to skip the
                        sequence similarity search step and go directly to
                        the taxonomy step. If a hits file exists in the output
                        directory and this option is not specified, it is
                        deleted and regenerated.

  --min_score MINIMUM, -m MINIMUM
                        The minimum bit-score for a search hit to be considered
                        when using BLAST as the search algorithm. All hits below
                        this score are ignored. When using VSEARCH, this value
                        instead indicates the minimum identity between two
                        sequences for the hit to be considered.
                        The default is `155` for BLAST and `0.75` for VSEARCH.

  --score_drop SCORE_DROP, -c SCORE_DROP
                        Determines the range of hits to retain and the range
                        to discard based on a drop in percentage from the score
                        of the best hit. Any hit below the following value:
                        "(100 - score_drop)/100 * best_hit_score" is ignored.
                        By default `2.0`.

  --min_smlrty MIN_SMLRTY, -i MIN_SMLRTY
                        Determines if the minimum similarity filter is turned
                        on or off. Pass any value like `False` to turn it off.
                        The minimum similarity filter prevents classification
                        to higher ranks when a minimum rank-identity is not met.
                        The default is `True`.

  --otu_table OTU_TABLE, -u OTU_TABLE
                        Optionally, one can specify the path to an existing OTU
                        table in CSV or TSV format when running `crest4`.
                        The sequence names in the OTU table must be rows and
                        have to match the names in the FASTA file. The columns,
                        on the other hand, provide your sample names.
                        When this option is used, then two extra output files
                        are generated. Firstly, a table summarizing the
                        assignment counts per taxa. Secondly, a table
                        propagating the sequence counts upwards
                        in a cumulative fashion.

Other arguments:
  --version, -v         Show program's version number and exit.
  --help, -h            Show this help message and exit.
  --pytest              Run the test suite and exit.
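
To give an idea of the two extra tables described under `--otu_table`, here is a simplified Python sketch for a single sample column. It assumes each sequence already has an assigned lineage (a tuple of names from root to leaf); the function names and the toy data are hypothetical, and the real output files may be organized differently.

```python
from collections import Counter
from typing import Dict, Tuple

Lineage = Tuple[str, ...]


def counts_per_taxon(assignments: Dict[str, Lineage],
                     otu_counts: Dict[str, int]) -> Counter:
    """Sum the sequence counts of all OTUs assigned to exactly the same taxon."""
    direct = Counter()
    for seq_name, lineage in assignments.items():
        direct[lineage] += otu_counts.get(seq_name, 0)
    return direct


def cumulative_counts(direct: Counter) -> Counter:
    """Propagate every count upwards to all parent taxa of the assignment."""
    cumulative = Counter()
    for lineage, count in direct.items():
        for depth in range(1, len(lineage) + 1):
            cumulative[lineage[:depth]] += count
    return cumulative


# Toy data: three OTUs in one sample
assignments = {
    "otu_1": ("Bacteria", "Cyanobacteria"),
    "otu_2": ("Bacteria", "Cyanobacteria"),
    "otu_3": ("Bacteria",),
}
otu_counts = {"otu_1": 10, "otu_2": 5, "otu_3": 2}

direct = counts_per_taxon(assignments, otu_counts)
cumulative = cumulative_counts(direct)
print(direct[("Bacteria",)], cumulative[("Bacteria",)])
# 2 17
```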

Python API

If you want to integrate crest4 directly into your python pipeline, you may do so by accessing the convenient Classify object as follows:

# Import #
from crest4 import Classify
# Create a new instance #
get_tax = Classify('~/data/sequences.fasta', num_threads=16)
# Run the similarity search and classification #
get_tax()
# Print the results #
for name, query in get_tax.queries_by_id.items():
    print(name, query.taxonomy)

The specific arguments accepted are the same as the command line version as specified in the internal API documentation.

Test suite

To test that the installation was successful you can launch the test suite by executing:

crest4 --pytest

Splitting computation

It is possible to run the sequence similarity search yourself without passing through the crest4 executable. This is useful for instance if you want to run BLAST on a dedicated server for increased speed and only want to perform the taxonomic assignment on your local computer.

In such a case you just need to copy the hits file that was generated back to your local computer and specify its location with the following parameter:

crest4 -f sequences.fasta -s ~/results/seq_search.hits

To create the hits file on a different server you should call the blastn executable with the following options:

blastn -query sequences.fasta -db ~/.crest4/silvamod138pr2/silvamod138pr2.fasta -num_alignments 100 -outfmt "7 qseqid sseqid bitscore length nident" -out seq_search.hits

We also recommend that you use -num_threads to enable multi-threading and speed up the alignments.
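
For reference, the `-outfmt` string in the blastn command above requests tab-separated rows with five columns (qseqid, sseqid, bitscore, length, nident), plus comment lines starting with `#`. A hits file in that layout could be read with a sketch like the one below; the `Hit` tuple, the `parse_hits` helper and the sample identifiers are invented for illustration and are not part of crest4.

```python
from typing import Dict, Iterable, List, NamedTuple


class Hit(NamedTuple):
    subject: str
    bitscore: float
    length: int
    nident: int


def parse_hits(lines: Iterable[str]) -> Dict[str, List[Hit]]:
    """Group the rows of a BLAST tabular hits file by query sequence name."""
    hits: Dict[str, List[Hit]] = {}
    for line in lines:
        line = line.strip()
        # Format 7 adds comment lines that start with '#'
        if not line or line.startswith("#"):
            continue
        qseqid, sseqid, bitscore, length, nident = line.split("\t")
        hits.setdefault(qseqid, []).append(
            Hit(sseqid, float(bitscore), int(length), int(nident)))
    return hits


# Made-up sample content
sample = "# example comment line\notu_1\tref_A\t250\t200\t198\notu_1\tref_B\t240\t200\t195\n"
parsed = parse_hits(sample.splitlines())
print(len(parsed["otu_1"]), parsed["otu_1"][0].subject)
# 2 ref_A
```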

The equivalent VSEARCH command is the following:

vsearch --usearch_global sequences.fasta -db ~/.crest4/silvamod138pr2/silvamod138pr2.udb -blast6out seq_search.hits -threads 32 -id 0.75 -maxaccepts 100

More information

Classification databases

The silvamod138pr2 database was derived by manual curation of the SILVA NR SSU Ref v.138 for Bacteria, Archaea, Metazoa and Fungi. For other eukaryotes (protists), the PR2 v4.13 database was used. The SILVA database used was last released in August 2020 and the PR2 database in March 2021.

The silvamod128 database was derived by manual curation of the SILVA NR SSU Ref v.128. It supports SSU sequences from bacteria and archaea (16S) as well as eukaryotes (18S), with a high level of manual curation and defined environmental clades. This database was last released in September 2016.

Classification algorithm

The classification is carried out based on a subset of the best matching alignments using the Lowest Common Ancestor strategy. Briefly, the subset includes sequences that score within x% of the "bit-score" of the best alignment, provided the best score is above a minimum value. Default values are 155 for the minimum bit-score and 2% for the score drop threshold. Based on cross-validation testing using the non-redundant silvamod128 database, this results in relatively few false positives for most datasets. However, the score drop range can be turned up to about 10%, to increase accuracy with short reads and for datasets with many novel sequences.
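
The score-based selection described above can be written down in a few lines of Python. This sketch applies the documented formula `(100 - score_drop)/100 * best_hit_score` together with the minimum bit-score; the function name `filter_hits` and the reference names are assumptions, and the actual crest4 code may differ in its details.

```python
from typing import Dict


def filter_hits(scores: Dict[str, float],
                min_score: float = 155.0,
                score_drop: float = 2.0) -> Dict[str, float]:
    """Keep the hits whose bit-score lies within score_drop percent of the best one."""
    if not scores:
        return {}
    best = max(scores.values())
    # No classification is attempted when even the best hit is too weak
    if best < min_score:
        return {}
    cutoff = (100 - score_drop) / 100 * best
    return {name: score for name, score in scores.items()
            if score >= cutoff and score >= min_score}


# Hypothetical bit-scores for one query sequence
scores = {"ref_A": 300.0, "ref_B": 296.0, "ref_C": 280.0}
print(sorted(filter_hits(scores)))
# ['ref_A', 'ref_B']
print(sorted(filter_hits(scores, score_drop=10.0)))
# ['ref_A', 'ref_B', 'ref_C']
```

Widening the score drop lets more hits into the subset, so the lowest common ancestor tends to sit higher in the tree.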

In addition to the lowest common ancestor classification, a minimum similarity filter is applied, based on a set of taxon-specific requirements that by default depend on taxonomic rank. A sequence must align with at least 99% nucleotide similarity to the best reference sequence in order to be classified to the species rank. For the genus, family, order, class and phylum ranks the respective default cut-offs are 97%, 95%, 90%, 85% and 80%. These cut-offs can be changed manually by editing the .names file of the respective reference database. This filter ensures that classification is made to the taxon of the lowest allowed rank, effectively re-assigning sequences to parent taxa until the requirement is met.
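
The effect of this filter can be sketched as follows. The cut-offs are the defaults listed above, while the function name, the `(rank, name)` representation and the rank labels attached to the example lineage are assumptions made for illustration; ranks without a listed cut-off are treated as having no requirement.

```python
from typing import Dict, List, Optional, Tuple

# Default rank cut-offs quoted in the paragraph above (percent similarity)
DEFAULT_CUTOFFS = {"phylum": 80.0, "class": 85.0, "order": 90.0,
                   "family": 95.0, "genus": 97.0, "species": 99.0}


def apply_min_similarity(lineage: List[Tuple[str, str]],
                         similarity: float,
                         cutoffs: Optional[Dict[str, float]] = None
                         ) -> List[Tuple[str, str]]:
    """Trim a root-to-leaf lineage until the last rank satisfies its cut-off."""
    limits = DEFAULT_CUTOFFS if cutoffs is None else cutoffs
    trimmed = list(lineage)
    while trimmed:
        rank, _name = trimmed[-1]
        # Assumption: ranks absent from the table carry no requirement
        if similarity >= limits.get(rank, 0.0):
            break
        # Re-assign to the parent taxon and check again
        trimmed.pop()
    return trimmed


lineage = [("domain", "Bacteria"), ("phylum", "Cyanobacteria"),
           ("class", "Cyanobacteriia"), ("order", "Phormidesmiales"),
           ("family", "Nodosilineaceae"), ("genus", "Nodosilinea")]
print(apply_min_similarity(lineage, 96.0)[-1])
# ('family', 'Nodosilineaceae')
```

With 96% similarity the genus requirement of 97% is not met, so the sequence is re-assigned to its family.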

When using amplicon sequences, we strongly recommend preparing the sequences by performing a noise reduction step as well as applying chimera removal. This can be achieved with various third party software such as: VSEARCH, UPARSE, DADA2, SWARM, etc.

For amplicon sequencing experiments with many replicates or similar samples (>~10), the unique noise-reduced sequences may be further clustered using a similarity threshold (often 97% although larger thresholds are probably preferable) into operational taxonomic units (OTUs), prior to classification.

Custom databases

It is possible to construct a custom reference database for use with crest4. The scripts necessary to do this, along with some documentation, are available in this other git repository:

https://github.com/xapple/crest4_utils

Continuous testing

The repository for crest4 comes along with five different GitHub actions for CI/CD which are:

  • Pytest master branch
  • Test PyPI release on Ubuntu
  • Test PyPI release on macOS
  • Test conda release on Ubuntu
  • Test conda release on macOS

Only the first action is set to be run automatically on each commit to the master branch. The four other actions can be manually launched and will run the pytest suite for both python 3.8 and python 3.9 on different operating systems.

Distributing the package

  • Instructions for distributing and uploading crest4 on PyPI so that it can be installed by pip can be found here. The current uploaded version is listed here.

  • Instructions for distributing and uploading crest4 on anaconda so that it can be installed by conda can be found here. The current uploaded version is listed here.

Two scripts that automate these processes can be found in the following repository:

https://github.com/xapple/bumphub

Updating the databases

The location of the database files that crest4 will download upon first run can easily be updated by editing this file:

Once that file is updated, all downloads will now point to the new URLs, without even needing to redistribute a new version of crest4. This is possible as the JSON file is checked before initiating any new download.

Developer documentation

The internal documentation of the crest4 python package is available at:

http://xapple.github.io/crest4/crest4

This documentation is simply generated from the source code with this command:

$ pdoc --output-dir docs crest4
