The definitive guide to MLST analysis
Hello you little human... So you want to do an MLST analysis of your favorite bacteria? Let me take you through this fascinating journey, by doing an MLST analysis from start to finish using FastMLST. This document will cover (almost) everything you need to know to use MLST in your next paper.
I will assume that you have basic knowledge of how MLST works.
I know, I know... MLST is an old methodology lacking resolution and only takes a few genes to analyze the complete diversity of a species. How will I waste my millions of Illumina reads or my fabulous complete chromosome done with PacBio with that old methodology? Why use only a bunch of genes if I can use the core-genome and have much more resolution?
These are valid questions. However, if you work with classical pathogens with a long history of genomic analysis, such as Haemophilus influenzae, Streptococcus pneumoniae, Staphylococcus aureus, etc., you will inevitably have to rely on MLST to place your results in a context comparable with hundreds of previous reports on your favorite pathogen. Also, no one stops you from performing your fancy pangenome analysis or core-genome phylogenetic analysis of hundreds of genomes. MLST complements all your other analyses.
Be glad if your favorite bacteria has a defined scheme at https://pubmlst.org/organisms because you will have more to discuss in your brand new manuscript.
We will analyze from scratch all available genomes in GenBank of the species H. influenzae (random selection).
To complete this guide, you will need to install the following programs and their dependencies (usually all easy to install):
- FastMLST
- ncbi-genome-download
- rename (optional)
- mafft
- IQ-TREE 2 (the iqtree2 command)
Over the years, I have found that the quickest and easiest way to download the genomes of a particular species is to use ncbi-genome-download. To download all the genomes of the species you are interested in from GenBank, do the following:
ncbi-genome-download --genera "Haemophilus influenzae" --flat-output -o genomes -p 64 -r 64 -F fasta bacteria
If you want more information, I recommend reading its great manual on GitHub.
Then, in your working directory, you will have the 751 publicly available genomes of H. influenzae in the genomes directory (date: 2021-02-02).
Personally, I like to work with more standardized names, so to shorten the file names to just their accession code I run the following in the genomes directory (this step is optional):
rename 's/\.\d+.+_genomic//' *.gz
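If you want to see what that regular expression does before touching your files, here is a minimal Python sketch that applies the same pattern to a single name. The file name used is a made-up placeholder in the usual NCBI layout (an assumption, not a real assembly), and `short_name` is just an illustrative helper.

```python
import re

# Same pattern as the rename one-liner: remove the version number and
# everything up to and including "_genomic".
PATTERN = re.compile(r"\.\d+.+_genomic")


def short_name(filename: str) -> str:
    """Return the file name reduced to accession plus extension."""
    # count=1 mirrors s/// without the g flag (only the first match is replaced)
    return PATTERN.sub("", filename, count=1)


# Placeholder name (hypothetical), following accession.version_assembly_genomic
print(short_name("GCF_000000000.1_EXAMPLEv1_genomic.fna.gz"))
# prints: GCF_000000000.fna.gz
```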
There are many programs to determine the ST from assembled genomes. I want to recommend FastMLST, as it is faster and easier to use than its competitors (Conflicts of interest: Yes. Let's continue!).
Complete usage is on GitHub. But in our case, we will only need to do the following:
fastmlst --scheme hinfluenzae -to mlst.csv -fo mlst.fasta --novel novel_mlst.fasta genomes/*.fna.gz
As a result, you will have three files.
- mlst.csv: This file contains the main output of the analysis consisting of the allelic profile of the genomes analyzed. The file can be visualized in Excel, R, Stata or any program that can visualize comma-separated files (CSV).
- mlst.fasta: This is a magic file, it contains the concatenated alleles of the MLST profile for each genome, and you can use it to make MLST-based phylogenetic trees! Few programs provide this output (Conflicts of interest: Yes. Let's continue!).
- novel_mlst.fasta: Contains the genes that are probably new alleles that you should report.
Recap. We analyzed 751 genomes and the mlst.csv file contains 751 rows, but mlst.fasta has only 718 sequences! This happens when at least one allele could not be found in a genome. Why? Maybe the allele is split across two contigs because of a misassembly, or maybe the gene is deleted (something super rare), but you will have to find out in further analysis. For now, we will continue with the 718 genomes for which all alleles were extracted.
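To find out which genomes dropped out, you can compare the names in the two files. The sketch below is a rough helper, not part of FastMLST. It assumes that mlst.csv has a header row with a `Genome` column (as in the profile table shown later), that each FASTA header starts with the genome name, and that the text before the first dot is the accession.

```python
import csv


def accession(name: str) -> str:
    """Keep the text before the first dot: 'GCF_000636035.fna.gz' -> 'GCF_000636035'."""
    return name.strip().split(".", 1)[0]


def genomes_in_table(csv_lines):
    """Accessions listed in the 'Genome' column of the profile table (assumed column name)."""
    return {accession(row["Genome"]) for row in csv.DictReader(csv_lines)}


def genomes_in_fasta(fasta_lines):
    """Accessions taken from the first word of every FASTA header line."""
    found = set()
    for line in fasta_lines:
        words = line[1:].split() if line.startswith(">") else []
        if words:
            found.add(accession(words[0]))
    return found


def missing_from_fasta(csv_lines, fasta_lines):
    """Genomes present in the table but absent from the concatenated FASTA."""
    return sorted(genomes_in_table(csv_lines) - genomes_in_fasta(fasta_lines))


# Usage (assumes both files are in the working directory):
# with open("mlst.csv") as table, open("mlst.fasta") as fasta:
#     print(missing_from_fasta(table, fasta))
```

With the numbers above, this should list 33 genomes (751 minus 718), provided each genome contributes at most one sequence to mlst.fasta.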
Let's take a closer look at the mlst.csv file. You can see that 718 genomes have a determined ST (either a known or a new ST). Nine of them are new profiles!
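If you prefer not to scroll through the table by eye, the new profiles can be pulled out with a few lines of Python. This is only a sketch: it assumes the columns are named `Genome`, `Scheme` and `ST`, that new profiles are labeled `new_ST` (as in the table further down), and that every remaining column is a locus.

```python
import csv
from collections import defaultdict

NON_LOCUS_COLUMNS = ("Genome", "Scheme", "ST")  # assumed column names


def new_profiles(csv_lines, new_label="new_ST"):
    """Map each novel allelic profile to the list of genomes that carry it."""
    groups = defaultdict(list)
    for row in csv.DictReader(csv_lines):
        if row["ST"] != new_label:
            continue  # skip genomes with a known ST
        # The profile is the ordered (locus, allele) pairs of the scheme.
        profile = tuple(
            (locus, allele)
            for locus, allele in row.items()
            if locus not in NON_LOCUS_COLUMNS
        )
        groups[profile].append(row["Genome"])
    return dict(groups)


# Usage (assumes mlst.csv is in the working directory):
# with open("mlst.csv") as table:
#     for profile, genomes in new_profiles(table).items():
#         print(len(genomes), dict(profile), genomes)
```

Grouping by profile makes it obvious when several genomes share the same undescribed ST.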
This raises a question that is very hard to answer with other programs: which known STs are those nine new profiles most similar to? From the table of allelic profiles alone, it is impossible to answer. Fortunately, FastMLST thought of that and provides the concatenated alleles, ready for a fabulous phylogenetic analysis that answers the question.
To proceed, we need to align the alleles and make the tree. It is as simple as the following:
mafft mlst.fasta > mlst.fasta.aln
iqtree2 -s mlst.fasta.aln -B 1000
For illustration purposes only, we will take three assemblies: GCF_000636035, GCF_000636055 and GCF_000636075. All of them have the following allelic profile.
Genome | Scheme | ST | adk | atpG | frdB | fucK | mdh | pgi | recA |
---|---|---|---|---|---|---|---|---|---|
GCF_000636035.fna.gz | hinfluenzae | new_ST | 26 | 1 | 46 | 15 | 79 | 64 | 29 |
GCF_000636055.fna.gz | hinfluenzae | new_ST | 26 | 1 | 46 | 15 | 79 | 64 | 29 |
GCF_000636075.fna.gz | hinfluenzae | new_ST | 26 | 1 | 46 | 15 | 79 | 64 | 29 |
Evidently they are all from the same ST. But which ST does it look like?
The IQ-TREE 2 output tree, in Newick (nwk) format, is easily visualized in FigTree or Microreact. We searched for these genomes within the tree and found the following:
Figure 1. Phylogenetic tree of the 7-gene MLST scheme of H. influenzae. The tree was reconstructed with IQ-TREE 2 from the seven genes of the H. influenzae MLST scheme. Ultrafast bootstrap support (-B 1000) is shown at the nodes. The interactive tree is online.
You can easily see that the genomes GCF_003414235, GCF_003414515, GCF_003414465 and GCF_003414445 (all from ST519) are the most similar to the three representatives of the new ST.
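As a quick numerical cross-check of what the tree shows, you can also rank the aligned sequences by how much they differ from one of the new genomes. The sketch below computes a plain p-distance (proportion of differing sites) from the mafft alignment. It is a simplification, not what IQ-TREE does, and it assumes the FASTA headers start with the genome name. The function names are illustrative.

```python
def read_fasta(lines):
    """Parse FASTA lines into {first word of header: uppercase sequence}."""
    records, name = {}, None
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            words = line[1:].split()
            name = words[0] if words else ""
            records[name] = []
        elif name is not None:
            records[name].append(line)
    return {n: "".join(parts).upper() for n, parts in records.items()}


def p_distance(a, b):
    """Proportion of differing sites, ignoring columns with a gap or N."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    compared = diffs = 0
    for x, y in zip(a, b):
        if x in "-N" or y in "-N":
            continue
        compared += 1
        diffs += x != y
    # No comparable column: treat as infinitely distant so it sorts last.
    return diffs / compared if compared else float("inf")


def closest(records, query, top=5):
    """Return the `top` nearest sequences to `query` as (name, distance) pairs."""
    ref = records[query]
    scored = [(n, p_distance(ref, s)) for n, s in records.items() if n != query]
    return sorted(scored, key=lambda item: item[1])[:top]


# Usage (the query name must match a header in your alignment):
# with open("mlst.fasta.aln") as handle:
#     print(closest(read_fasta(handle), "GCF_000636035"))
```

The nearest neighbors by p-distance will usually, though not necessarily, be the same genomes that sit next to the query in the tree.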
From here on, the sky is the limit. Here are a few recommendations:
- Attach metadata to your isolates, such as country of origin, year of isolation and host. This way, you can see exciting groupings.
- Check for recombinations, for example, using SplitsTree or RDP.
- Do a core-genome analysis. As mentioned before, MLST is just the beginning.
- Confirm that the phylogenetic analysis of MLST and core-genome have the same topology.
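For the first recommendation, one simple way to attach metadata is to join it onto the MLST table by genome name. The sketch below assumes a metadata CSV of your own that shares a `Genome` column with mlst.csv. The file name metadata.csv and its columns (country, year, host) are hypothetical.

```python
import csv


def attach_metadata(mlst_lines, metadata_lines, key="Genome"):
    """Left-join metadata onto the MLST table using the shared `key` column."""
    metadata = {row[key]: row for row in csv.DictReader(metadata_lines)}
    merged = []
    for row in csv.DictReader(mlst_lines):
        combined = dict(row)  # MLST columns come first
        for column, value in metadata.get(row[key], {}).items():
            # Add metadata columns without overwriting MLST columns.
            combined.setdefault(column, value)
        merged.append(combined)
    return merged


# Usage (metadata.csv is a hypothetical file you prepare yourself):
# with open("mlst.csv") as mlst, open("metadata.csv") as meta:
#     rows = attach_metadata(mlst, meta)
```

Genomes without a metadata entry keep only their MLST columns, so check for those before plotting or uploading the merged table to a viewer.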