Troubleshooting

Jose Manuel Martí edited this page Mar 27, 2024 · 8 revisions

This page covers topics that may help when you run into issues, or suspect a problem, while using the Recentrifuge package.

Mismatch between NCBI database versions

The mismatch of the versions of the NCBI database between Recentrifuge and the taxonomic classification engine is not usually a big problem, as long as the Recentrifuge database is more recent and the one used by the classifier is not very old.

  • If the taxonomic classification engine uses a newer database, Recentrifuge could miss some new taxids (taxonomic identifiers, aka TaxIDs). Please use retaxdump regularly to keep the NCBI taxonomic database used by Recentrifuge up to date.

  • If the taxonomic classifier uses an old NCBI database, Recentrifuge could miss some obsolete or merged taxonomic identifiers. Please keep the NCBI taxonomic database used by your classification engine updated, since the database is evolving rapidly. If your data is old, you may benefit from reanalyzing your dataset with updated database and software releases.

When parsing your samples, if Recentrifuge detects orphan taxids (those without a known parent in the NCBI taxonomic tree), it issues a warning like:

Warning! 4 orphan taxids (rerun with --debug for details)

In this case, or if you are unsure of the effects of the NCBI database mismatch between Recentrifuge and your taxonomic classifier, or of any other problem related to the taxonomic identifiers, we suggest using the debug flag (-d or --debug) in rcf. The debugging mode enables the taxid-loss tests, which report the number of orphan taxids and the number of orphan sequences (those sequences assigned to an orphan taxid). Below is an example of real output for a CLARK-S sample:

  Checking taxid loss (orphans)...
  Warning! Orphan taxid=697046
  Warning! Orphan taxid=758602
  Warning! Orphan taxid=1870930
  Warning! Orphan taxid=2071623
  WARNING! 4 orphan taxids (0.25% of accepted)
    and 6 orphan sequences (0.001% of accepted)

In this particular case, the first three taxonomic identifiers had been merged into other taxids during the preceding months. The last one is currently unknown to the NCBI taxonomic database (as of Dec 30, 2018). Those taxids affected a tiny fraction of the reads, but such taxids can sometimes represent organisms critical to the results.
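
If you want to look up the orphan taxids afterwards (for example, to check whether they were merged), you can collect them from the debug output. The following is a minimal sketch in Python that assumes the log lines look exactly like the example above; the function name `orphan_taxids` is hypothetical and not part of Recentrifuge.

```python
import re

# Pattern for the per-taxid warning lines shown in the example output above.
ORPHAN_RE = re.compile(r"Warning! Orphan taxid=(\d+)")


def orphan_taxids(log_text: str) -> list[int]:
    """Return the orphan taxids found in rcf debug output, in order of appearance."""
    return [int(match.group(1)) for match in ORPHAN_RE.finditer(log_text)]


example_log = """\
  Checking taxid loss (orphans)...
  Warning! Orphan taxid=697046
  Warning! Orphan taxid=758602
  Warning! Orphan taxid=1870930
  Warning! Orphan taxid=2071623
  WARNING! 4 orphan taxids (0.25% of accepted)
    and 6 orphan sequences (0.001% of accepted)
"""

print(orphan_taxids(example_log))  # [697046, 758602, 1870930, 2071623]
```

The summary lines are not matched because the pattern is case-sensitive and requires the `taxid=` form.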

In summary, our recommendation is to keep both databases updated and use the debug flag (-d) in Recentrifuge (rcf) to clear up doubts.

Number of total accumulated reads is less than the accepted reads

The number of total accumulated reads (those counts shown for "root" in a sample by Recentrifuge) can be less than the number of accepted reads. Recentrifuge gives information about this situation:

  Check for more seqs lost ([in/ex]clude affects)... 
  Info: 902451 additional seqs discarded (83.560% of accepted)

On the other hand, if the values are the same, the message is simply:

  Check for more seqs lost ([in/ex]clude affects)... OK!

The main reason for the mismatch between the number of total accumulated reads and accepted reads is the use of exclusion (-x option) or inclusion (-i option) lists in rcf, especially if the number of discarded sequences is high. Another reason could be the (undesirable) presence of orphan taxids (see the previous section). In any case, you can get further details by running Recentrifuge with the debugging flag -d enabled.
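
To put the informational message in context, you can recover the approximate number of accepted reads from the discarded count and the percentage. The sketch below assumes the message format shown above; `discarded_summary` is a hypothetical helper, and the result is only an estimate because the percentage in the log is rounded.

```python
import re

# Matches the informational line shown above, e.g.
# "Info: 902451 additional seqs discarded (83.560% of accepted)"
DISCARDED_RE = re.compile(
    r"Info: (\d+) additional seqs discarded \(([\d.]+)% of accepted\)"
)


def discarded_summary(log_text):
    """Return (discarded, percent, estimated_accepted), or None if no such line."""
    match = DISCARDED_RE.search(log_text)
    if match is None:
        return None
    discarded = int(match.group(1))
    percent = float(match.group(2))
    if percent == 0.0:
        return discarded, percent, None  # cannot estimate from a zero percentage
    # The logged percentage is rounded, so this is an approximation.
    estimated_accepted = round(discarded / (percent / 100.0))
    return discarded, percent, estimated_accepted


line = "  Info: 902451 additional seqs discarded (83.560% of accepted)"
print(discarded_summary(line))
```

For the example above, the estimate is close to 1.08 million accepted reads.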

Where have those taxa with scarce reads assigned gone?

In the taxonomic tree built for each sample, a taxon with a number of assigned reads below the mintaxa threshold is "folded" into its parent taxon, which then combines its counts and scores with those of the child. This algorithm acts as a kind of high-frequency noise filter that suppresses background noise, such as that produced by random sequencing errors or statistical misclassification. This noise appears as an overdispersion of reads across the space of taxids; in other words, as an overestimation of the number of taxa present in the samples, caused by relatively small numbers of reads assigned to many taxids.
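
The folding idea can be illustrated with a toy tree. The sketch below is not Recentrifuge's implementation: the `Taxon` class and the `fold` function are hypothetical, only read counts are tracked (scores are ignored), and it assumes that a leaf is folded when its count is strictly below `mintaxa`.

```python
from dataclasses import dataclass, field


@dataclass
class Taxon:
    """Toy taxonomic node: reads assigned directly to it, plus its children."""
    name: str
    counts: int = 0
    children: list["Taxon"] = field(default_factory=list)


def fold(node: Taxon, mintaxa: int) -> None:
    """Fold leaves with fewer than `mintaxa` reads into their parent, bottom-up."""
    kept = []
    for child in node.children:
        fold(child, mintaxa)  # process children first so folding can cascade
        if not child.children and child.counts < mintaxa:
            node.counts += child.counts  # the parent absorbs the child's reads
        else:
            kept.append(child)
    node.children = kept


def total(node: Taxon) -> int:
    """Total reads in the subtree rooted at `node`."""
    return node.counts + sum(total(child) for child in node.children)


genus_a = Taxon("Genus A", 0, [Taxon("sp. 1", 2), Taxon("sp. 2", 3), Taxon("sp. 3", 40)])
genus_b = Taxon("Genus B", 0, [Taxon("sp. 4", 1), Taxon("sp. 5", 1)])
root = Taxon("root", 0, [genus_a, genus_b])

fold(root, mintaxa=5)
print(total(root), root.counts, genus_a.counts)  # 47 2 5
```

In this toy example, "sp. 1" and "sp. 2" are folded into "Genus A", while "Genus B" first absorbs its two species and is then folded into the root because it still falls below the threshold. The total number of reads is unchanged; only their placement in the tree moves upward.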

If you don't use exclusion (-x option) or inclusion (-i option) lists in rcf, you can get details about the "tree folding" of the taxa placed on the leaves of the taxonomic tree by enabling the debugging flag (-d or --debug). For example, using the same CLARK-S sample as in the sections above:

  Assess accumulation due to "folding the tree"...
  Info: Folded taxid 146923 (Streptomyces parvulus) with 1 original seqs
  Info: Folded taxid 78258 (Parascardovia denticolens) with 2 original seqs
  Info: Folded taxid 28095 (Burkholderia gladioli) with 3 original seqs
  Info: Folded taxid 585 (Proteus vulgaris) with 2 original seqs
  (...)
  Info: Folded TaxID 860 (Fusobacterium periodonticum) with 1 original seqs
  Info: Folded TaxID 2005460 (Chondrocystis sp. NIES-4102) with 1 original seqs
  Info: Folded TaxID 194424 (Sulfurospirillum halorespirans) with 1 original seqs
  INFO: 1313 TaxIDs folded (83.31% of TAF —TaxIDs after filtering—)
  INFO: Final assigned TaxIDs: 579 (reduced to 36.74% of number of TAF)

So, there were many taxids with a low number of assigned reads; in other words, the scarce reads were widely dispersed among taxids. In total, 83.31% of the TaxIDs with reads assigned after filtering were "folded," so their reads were accumulated into their parents, and so on up the tree, until the mintaxa threshold was satisfied. In the end, the number of TaxIDs with reads assigned was reduced to 36.74% of the number of TaxIDs after filtering (a value that could itself be lower than the initial number of TaxIDs in the sample because of the score filter applied when parsing the sample data).
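
As a quick consistency check, the two percentages in the output can be related to the absolute counts. The short Python calculation below uses only the numbers printed in the example; the value of TAF itself is not printed by rcf in this excerpt, so it is inferred here from the rounded percentage.

```python
folded = 1313          # "1313 TaxIDs folded"
final_assigned = 579   # "Final assigned TaxIDs: 579"

# 1313 is reported as 83.31% of TAF, so TAF is roughly 1313 / 0.8331.
taf = round(folded / 0.8331)

print(taf)                                    # 1576
print(f"{100 * folded / taf:.2f}%")           # 83.31%
print(f"{100 * final_assigned / taf:.2f}%")   # 36.74%
```

Both reported percentages are therefore consistent with about 1576 TaxIDs after filtering.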

Issues with LMAT plasmids

If you add the file plasmid.names.txt to the taxdump directory as detailed here in order to include LMAT plasmid classifications in your results, Recentrifuge will detect and parse it. Typically, Recentrifuge performs a sanity check for inconsistencies in the plasmid data and outputs a summary with the relevant information. If you have issues with the LMAT plasmids, please run rcf with the debugging flag (-d) enabled, since Recentrifuge will then give further details. Please see the subsection "More about plasmids" for more information.

HTML or "extra" output files are too large

For large and complex metagenomic/metatranscriptomic studies, the HTML file can become so large that a typical browser is unable to handle it. In some cases, even the "extra" output may be too large for some downstream tools. There are several ways to reduce the size of the rcf output:

  1. Before any other action, please update Recentrifuge to the latest version or, at least, to v1.14. This is essential because recent changes to the code drastically reduce the size of the HTML output.

  2. The first thing to try is the --summary or -u argument in rcf, which controls how Recentrifuge handles summarization. With --summary ONLY (or just -u ONLY), the code outputs only the input samples and the summarized samples. This choice greatly reduces the size of both the HTML and the "extra" files by skipping all the samples generated before the summarization step. The current options for the summary flag are:

    • ADD: to "add" summary samples to other samples (this is the default when no other option is selected),
    • ONLY: to show original and summary samples "only",
    • AVOID: to "avoid" summary samples altogether.

  3. If you would like to further reduce the size of the "extra" output, you can use a different strategy (which you may combine with the previous one): as the "extra" output is not intended for interactive use, you can generate one file per sample by using the --extra MULTICSV or just -e MULTICSV option in rcf.

  4. If you are only interested in the output for the original samples, i.e., you are interested neither in removing control samples nor in the additional samples generated by Recentrifuge's cross analysis, then you can run rcf with the flag --avoidcross or just -a. That will disable Recentrifuge's comparative engine. You can combine this flag with the previous one (-e MULTICSV), but not with the --summary argument, since -a implies --summary AVOID.

  5. Finally, if you are processing Recentrifuge's results with custom downstream code, you may take advantage of the --pickle flag. With it, rcf will pickle (serialize) both the statistics and the data results as pandas DataFrames contained in a compressed pickle file. Be aware that the specific format of the DataFrames is affected by the selection of any relevant options, such as --extra.
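
If you launch rcf from a script, the combination rules for the options in items 2 to 4 above can be encoded in a small helper. This is only a sketch: the function `rcf_size_options` is hypothetical, it uses only the flags named on this page, and it leaves the input and taxonomy arguments of your own rcf command line untouched.

```python
def rcf_size_options(summary=None, multicsv=False, avoidcross=False):
    """Return the extra rcf arguments for the size-reduction options above."""
    if summary is not None and summary not in ("ADD", "ONLY", "AVOID"):
        raise ValueError(f"unknown summary mode: {summary!r}")
    if avoidcross and summary not in (None, "AVOID"):
        # --avoidcross implies --summary AVOID, so other modes conflict with it.
        raise ValueError(f"--avoidcross cannot be combined with --summary {summary}")
    args = []
    if avoidcross:
        args.append("--avoidcross")
    elif summary is not None:
        args += ["--summary", summary]
    if multicsv:
        args += ["--extra", "MULTICSV"]
    return args


print(rcf_size_options(summary="ONLY", multicsv=True))
# ['--summary', 'ONLY', '--extra', 'MULTICSV']
```

The returned list can be appended to the argument list of your usual rcf invocation, for example when calling it through `subprocess.run`.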