Running recentrifuge for a generic classifier

Jose Manuel Martí edited this page Sep 25, 2024 · 6 revisions

Taxonomic classifiers with direct support

Recentrifuge includes native support for results from Centrifuge, LMAT, CLARK, and Kraken. If you have data from one of these classifiers, please proceed to the corresponding page for that classifier.

If you have results from any other classifier, proceed to the next section.

To compare output from different classifiers in the same Recentrifuge run (whether or not they have the native support listed above), convert the output from all of them to the same format and then process them all with the generic parser, as described in the next section.

Requisites for generic support

One of the cornerstones of Recentrifuge's robust approach is to consider and propagate the confidence level of taxonomic assignments throughout the analysis. A well-known principle in physics says "every measurement with its uncertainty," and Recentrifuge honors this. The confidence level of a taxonomic assignment acts as its score and tells us how reliable the assignment is. Without a score, we lack a direct quantitative measure of how reliably any taxon is really present in our samples.

In practice, this means that Recentrifuge only supports data from taxonomic classifiers that provide some kind of confidence level for their assignments. Beyond that, as Recentrifuge works on a read-by-read basis, it requires the primary information to be available for each read analyzed.

Format for generic support

Overview

Recentrifuge can parse files with comma-separated values (CSV), tab-separated values (TSV), or space-separated values (SSV), with one line per sequenced read processed in a sample of the dataset. For each read (line), Recentrifuge requires at least the following data:

  • The NCBI taxonomic identifier if the assignment is positive, or any code marking the read as unclassified.
  • Any confidence level for the assignment.
  • The length of the read.

When using the generic parser, both the filenames and the expected format must be provided on the command line of the main Recentrifuge script.

Details about the files

To indicate the generic files to parse, use the -g option once per generic sample. If you point the -g option to a directory (instead of to individual samples), Recentrifuge will assume that all the files in the directory are generic samples to process in parallel. Please plan accordingly by removing any non-sample files from the directory: while Recentrifuge will probably reject many files that are not samples, it's not a good idea to rely on that mechanism, as it can occasionally fail and will consume additional computational resources.

Use of compressed files

Recentrifuge supports gzip and bz2 compression for generic samples. For example, to process generic output from samples S1 and S2 (gzip-compressed) and S3 (bzip2-compressed), the command would be:

rcf -g S1.gz -g S2.gz -g S3.bz2

The automatic processing of all the files in a directory also supports compressed files.
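Before passing a compressed sample to Recentrifuge, it can help to peek at its first lines to check the column layout. The following is a minimal Python sketch that is not part of Recentrifuge: the helper name `head` is hypothetical, and it assumes that the compression can be inferred from the file extension (.gz or .bz2).

```python
import bz2
import gzip
from itertools import islice
from pathlib import Path


def head(path, lines=5):
    """Return the first lines of a plain, gzip, or bz2 text file.

    Hypothetical helper: the opener is chosen from the file extension.
    """
    openers = {".gz": gzip.open, ".bz2": bz2.open}
    opener = openers.get(Path(path).suffix, open)
    with opener(path, "rt") as handle:
        return [line.rstrip("\n") for line in islice(handle, lines)]
```

For instance, `head("S1.gz", 3)` would return the first three lines of the gzipped sample as a list of strings.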

Details about the format

The format of the data files is passed to Recentrifuge with the --format command-line option followed by a string, as in --format "TYP:TSV, TID:2, LEN:4, SCO:5, UNC:0". The string consists of comma-separated fields, each made of a three-letter field name and the field value, separated by a colon. The fields are:

  • The TYPe of file (TYP), and the accepted values are CSV, TSV or SSV.
  • The column number for the NCBI Taxonomic IDentifier (TID).
  • The column number for the LENgth (LEN) of the read in nucleotides, as an integer number.
  • The column number for the SCOre (SCO) given to the assignments, which could be an integer or a real number.
  • The code in the TID column that indicates that a read is UNClassified (UNC).

For the column numbers, the accepted values are positive integers, so the first column is number 1. The order of the fields in the format string is arbitrary. If a field is repeated, Recentrifuge uses the last value. Spaces around commas and colons in the format string are tolerated.
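The rules for the format string can be illustrated with a short Python sketch. This is not Recentrifuge's actual parser: the function name `parse_format` is hypothetical, and the sketch assumes that all five fields are required and that the TYP value is case-insensitive.

```python
def parse_format(fmt):
    """Parse a string such as "TYP:TSV, TID:2, LEN:4, SCO:5, UNC:0".

    Illustrative only; Recentrifuge's own parser may behave differently.
    """
    fields = {}
    for item in fmt.split(","):
        name, _, value = item.partition(":")
        # Spaces around commas and colons are tolerated;
        # a repeated field keeps its last value.
        fields[name.strip()] = value.strip()
    missing = {"TYP", "TID", "LEN", "SCO", "UNC"} - fields.keys()
    if missing:  # assumption: every field is mandatory
        raise ValueError(f"missing fields: {sorted(missing)}")
    parsed = {"TYP": fields["TYP"].upper(), "UNC": fields["UNC"]}
    if parsed["TYP"] not in ("CSV", "TSV", "SSV"):
        raise ValueError(f"unknown file type: {fields['TYP']}")
    for name in ("TID", "LEN", "SCO"):
        column = int(fields[name])
        if column < 1:  # column numbers start with 1
            raise ValueError(f"{name} must be a positive integer")
        parsed[name] = column
    return parsed
```

With this sketch, `parse_format("TYP:TSV, TID:2, LEN:4, SCO:5, UNC:0")` yields a dictionary with the file type, the three column numbers as integers, and the unclassified code as a string.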

When parsing the data files, Recentrifuge will remove double-quotes or spaces around the data, which is typically useful for some CSV formats. The values for the LENgth of reads are expected to be integers with the read length in nucleotides, while the SCOres given to the assignments can be integers or real numbers, even negative (as in LMAT). In any case, Recentrifuge assumes that lower values of score indicate lower confidence in the assignments.

Examples

CSV files

Example of TYPe of file CSV with data enclosed in double-quotes and header, Taxonomic IDs in the 3rd col, LENgth of reads in the 2nd col, SCOre of assignments in the 4th col, and the UNClassified code is NA. File head:

#label,length,taxid,score,read-type
"read01","200","9606","30","paired-end"
"read02","200","9606","31","paired-end"
"read03","200","9606","32","paired-end"
"read12","200","NA","-","paired-end"

The corresponding format string is "TYP:CSV,TID:3,LEN:2,SCO:4,UNC:NA" or equivalent (for instance, the string "UNC: NA, TYP: csv, LEN: 2, TID: 3, SCO: 4" is equivalent).
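To make the column mapping concrete, the following Python sketch reads the CSV sample above with the columns given by that format string. It is only an illustration, not Recentrifuge code: the name `read_sample` is hypothetical, and two behaviors are assumptions, namely that the header is skipped because its length column is not an integer, and that the score of an unclassified read is never parsed.

```python
import csv
import io

SAMPLE = '''#label,length,taxid,score,read-type
"read01","200","9606","30","paired-end"
"read02","200","9606","31","paired-end"
"read03","200","9606","32","paired-end"
"read12","200","NA","-","paired-end"
'''

# Values taken from the format string "TYP:CSV,TID:3,LEN:2,SCO:4,UNC:NA"
TID_COL, LEN_COL, SCO_COL, UNC = 3, 2, 4, "NA"


def read_sample(text):
    """Return ([(taxid, length, score), ...], number_of_unclassified_reads)."""
    classified, unclassified = [], 0
    for row in csv.reader(io.StringIO(text)):
        # Remove double quotes or spaces left around the data
        cells = [cell.strip(' "') for cell in row]
        try:
            length = int(cells[LEN_COL - 1])  # columns are 1-based
        except (IndexError, ValueError):
            continue  # assumption: header or malformed line, skip it
        taxid = cells[TID_COL - 1]
        if taxid == UNC:
            unclassified += 1  # assumption: score is ignored here
            continue
        classified.append((taxid, length, float(cells[SCO_COL - 1])))
    return classified, unclassified


classified, unclassified = read_sample(SAMPLE)
```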

TSV files

Example of TYPe of file TSV with no header, Taxonomic IDs in the 2nd col, LENgth of reads in the 4th col, SCOre of assignments in the 5th col, and the UNClassified code is 0. File head:

RR01	9606	SEQEX2	111	50
RR02	9606	SEQEX2	111	51
RR03	0	SEQEX2	150	-
RR04	9606	SEQEX2	111	52

The corresponding format string is "TYP:TSV,TID:2,LEN:4,SCO:5,UNC:0" or equivalent.
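If your classifier does not produce such a file directly, a small conversion script can write its per-read results in this layout. The Python sketch below assumes that the results are already available as dictionaries; the keys `id`, `taxid`, `platform`, `length`, and `score`, as well as the function name, are hypothetical and would need to be adapted to the classifier at hand.

```python
import csv
import io


def write_generic_tsv(reads, handle, unclassified_code="0"):
    """Write one tab-separated line per read: id, taxid, platform, length, score."""
    writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
    for read in reads:
        is_classified = read["taxid"] is not None
        writer.writerow([
            read["id"],
            read["taxid"] if is_classified else unclassified_code,
            read["platform"],
            read["length"],
            read["score"] if is_classified else "-",  # no score to report
        ])


reads = [
    {"id": "RR01", "taxid": 9606, "platform": "SEQEX2", "length": 111, "score": 50},
    {"id": "RR03", "taxid": None, "platform": "SEQEX2", "length": 150, "score": None},
]
buffer = io.StringIO()
write_generic_tsv(reads, buffer)
```

Files written this way match the TSV example, so they could be read with the format string "TYP:TSV,TID:2,LEN:4,SCO:5,UNC:0".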

SSV files

Example of TYPe of file SSV with header, Taxonomic IDs in the 2nd col, LENgth of reads in the 4th col, SCOre of assignments in the 5th col, and the UNClassified code is *. File head:

label tid sequencer len score
READ01 9606 NANOPORE 1589 50
READ02 * - 244 -
READ03 9606 NANOPORE 2106 51
READ04 9606 NANOPORE 849 52

The corresponding format string is "TYP:ssv,TID:2,LEN:4,SCO:5,UNC:*" or equivalent.

Running quick start

Suppose you have installed Recentrifuge with pip and have used retaxdump to populate ./taxdump. Now, you would like to analyze and compare the output from samples S1, S2 and S3 from a classifier that generates data with the format of the TSV example above. In that case, the command would be:

rcf --format "TYP:TSV,TID:2,LEN:4,SCO:5,UNC:0" -g S1.any -g S2.any -g S3.any

Alternatively, you may have placed these samples (or created symbolic links to them) in the directory ./my_generic_samples and just run:

rcf --format "TYP:TSV,TID:2,LEN:4,SCO:5,UNC:0" -g ./my_generic_samples

Running details

Scoring schemes

There are different options to score the reads classified by a generic taxonomic classifier. Recentrifuge supports the following scoring scheme specific to a generic classifier, which can be selected with the -s/--scoring option:

  • GENERIC: This scoring scheme is the default scheme used for a generic classifier and is not available for specific classifiers. It directly uses the scores given to the assignments in the data files, which can be integers or real numbers. Recentrifuge takes for granted that lower values of score indicate lower confidence in the taxonomic assignments.

Recentrifuge also supports the following general scoring schemes, which are especially useful when read lengths span several orders of magnitude, such as in nanopore sequencing:

  • LENGTH: The score of a read will be its length (or the combined length of mate pairs).
  • LOGLENGTH: Logarithm (base 10) of the length score.
  • NORMA: This score is the normalized score GENERIC / LENGTH in percentage, so it takes into account both the assignment quality and the length of the read. It is very useful when both the assigned scores and the lengths vary among the reads.

For all these scoring schemes, the minscore parameter applies to the direct score of the read assigned by the taxonomic classifier. So, for example, a minscore of 35 (indicated with the -y 35 option) will filter the same reads regardless of the scoring scheme selected.
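The scoring schemes and the minscore rule above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not Recentrifuge's implementation: the function names are hypothetical, each read is reduced to a (score, length) pair, and a read whose direct score equals minscore is assumed to be kept.

```python
import math


def scheme_score(scheme, score, length):
    """Score of one read under the named scheme (illustrative only)."""
    if scheme == "GENERIC":
        return score  # direct score from the classifier
    if scheme == "LENGTH":
        return length
    if scheme == "LOGLENGTH":
        return math.log10(length)
    if scheme == "NORMA":
        return 100 * score / length  # GENERIC / LENGTH as a percentage
    raise ValueError(f"unknown scoring scheme: {scheme}")


def filter_and_score(reads, scheme, minscore):
    """Filter on the direct score first, then apply the scoring scheme.

    Assumption: reads with a direct score equal to minscore are kept.
    """
    return [
        scheme_score(scheme, score, length)
        for score, length in reads
        if score >= minscore
    ]


reads = [(50, 200), (30, 1000)]  # (direct score, length) pairs
norma = filter_and_score(reads, "NORMA", 35)
by_length = filter_and_score(reads, "LENGTH", 35)
```

In both calls only the first read survives the filter, since the comparison uses its direct score of 50 and never the value produced by the selected scheme.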

Advanced example

Let's see a more complex example in detail. In order to analyze the generic output:

  • with the taxonomy files downloaded to /my/tax/dir,
  • of a taxonomic classifier of nanopore sequences with the SSV format of the example above,
  • from samples X1 (file X1.nnp), X2 (file X2.nnp) and X3 (file X3.nnp),
  • with two negative controls (files CTRL1.nnp and CTRL2.nnp),
  • saving the output to nanoporeXsamples.rcf.html file,
  • with the scoring referred to the normalized score by length percentage (NORMA),
  • filtering reads with 20 as a minimum value for the score,
  • and excluding the reads assigned to humans (taxid 9606),

the command would be:

rcf -n /my/tax/dir --format "TYP:ssv,TID:2,LEN:4,SCO:5,UNC:*" -g CTRL1.nnp -g CTRL2.nnp -g X1.nnp -g X2.nnp -g X3.nnp -c 2 -o nanoporeXsamples.rcf.html -s NORMA -y 20 -x 9606
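When such a command is launched from a script, building it as a list of arguments avoids shell quoting problems, for instance with the * in the format string. The Python sketch below only assembles the command shown above, using the same options; the function name and its parameters are hypothetical, and it assumes that the control samples are listed first, as in that command.

```python
import shlex


def build_rcf_command(taxdir, fmt, controls, samples, output,
                      scoring, minscore, exclude):
    """Assemble the rcf argument list for generic samples (illustrative)."""
    args = ["rcf", "-n", taxdir, "--format", fmt]
    for path in [*controls, *samples]:  # controls first, then samples
        args += ["-g", path]
    args += ["-c", str(len(controls)), "-o", output,
             "-s", scoring, "-y", str(minscore), "-x", str(exclude)]
    return args


command = build_rcf_command(
    taxdir="/my/tax/dir",
    fmt="TYP:ssv,TID:2,LEN:4,SCO:5,UNC:*",
    controls=["CTRL1.nnp", "CTRL2.nnp"],
    samples=["X1.nnp", "X2.nnp", "X3.nnp"],
    output="nanoporeXsamples.rcf.html",
    scoring="NORMA",
    minscore=20,
    exclude=9606,
)
print(shlex.join(command))  # shell-quoted form, safe to paste in a terminal
```

The resulting list can be passed as is to `subprocess.run(command, check=True)`, which runs the program without a shell and therefore without glob expansion of the * character.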

The complete guide to rcf options and flags is in the Recentrifuge command line page.