Hands on tutorial on Partitioned MLST phylogenetic analysis
A common route for MLST phylogenetic analysis is to concatenate the alignments of the gene fragments in the MLST scheme into a super alignment and then infer the tree from it. This approach assumes that all gene fragments evolve under the same substitution model.
An alternative is to partition the super alignment so that each partition is a gene. A systematic analysis of how this affects phylogenetic inference found that it changes topology, branch lengths and bootstrap support.
FastMLST has an output that can be used for this type of analysis. This is a practical tutorial that goes step by step through how to perform it on your favourite bacteria.
Let's get to it!
To complete this guide, you will need to install the following programs and their dependencies (usually all easy to install):
- ncbi-genome-download
- FastMLST
- MAFFT
- GNU parallel
- IQ-TREE 2
Over the years, I have found that the quickest and easiest way to download the genomes of a particular species is ncbi-genome-download. To download all the genomes of the species you are interested in, do the following (as written, the command uses the tool's default section, RefSeq; add -s genbank to pull from GenBank instead):
ncbi-genome-download --genera "Clostridioides difficile" --flat-output -o genomes -p 64 -r 64 -F fasta bacteria
If you want more information, I recommend reading the great manual on GitHub.
Then, in your working directory, you will have the 2068 publicly available genomes of C. difficile in the genomes directory (date: 2021-12-01).
Personally, I like to work with more standardized names, so to strip the file names down to the accession number I run the following in the genomes directory (this step is optional):
rename 's/\.\d+.+_genomic//' *.gz
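Before renaming two thousand files, it can help to see what the substitution does to one name. The sketch below applies the same pattern with sed to a hypothetical file name in the style ncbi-genome-download produces; the name is an assumption for illustration, and \d is written as [0-9] because sed does not understand the Perl shorthand.

```shell
# Hypothetical file name, used only to illustrate the pattern.
example="GCA_000009205.2_ASM920v2_genomic.fna.gz"

# Same substitution as the rename command above, adapted for sed -E:
# drop everything from the version suffix up to and including "_genomic".
new_name=$(printf '%s\n' "$example" | sed -E 's/\.[0-9]+.+_genomic//')

echo "$example -> $new_name"
```

Note that the rename command above uses the Perl-style syntax; the rename shipped with util-linux on some distributions takes different arguments and will not accept this expression.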
There are many programs to determine the ST from assembled genomes. I recommend FastMLST, as it is faster and easier to use than its competitors (Conflicts of interest: Yes. Let's continue!).
The complete usage is on GitHub, but in our case we only need the following:
fastmlst -sch cdifficile -fo mlst.fasta -to mlst.csv -cov 100 -sp spmlst --fasta2line genomes/*
As a result, you will have two files and one directory.
- mlst.csv: This file contains the main output of the analysis consisting of the allelic profile of the genomes analyzed. The file can be visualized in Excel, R, Stata or any program that can visualize comma-separated files (CSV).
- mlst.fasta: This is a magic file, it contains the concatenated alleles of the MLST profile for each genome, and you can use it to make MLST-based phylogenetic trees! Few programs provide this output (Conflicts of interest: Yes. Let's continue!).
- spmlst: This is the important directory because it contains the MLST gene fragments in individual files.
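A quick sanity check before aligning is to count how many sequences each per-gene file holds. The sketch below does this on a toy stand-in for the spmlst directory (the two files and their contents are made up for illustration); on real data you would point the loop at spmlst/*.fasta instead.

```shell
# Toy stand-in for the spmlst directory (hypothetical content).
demo=$(mktemp -d)
mkdir "$demo/spmlst"
printf '>g1\nACGT\n>g2\nACGA\n' > "$demo/spmlst/adk.fasta"
printf '>g1\nTTGA\n>g2\nTTGC\n' > "$demo/spmlst/tpi.fasta"

# Count FASTA headers (lines starting with ">") in each gene file.
counts=$(for f in "$demo"/spmlst/*.fasta; do
    printf '%s\t%s\n' "$(basename "$f")" "$(grep -c '^>' "$f")"
done)
echo "$counts"

# Number of distinct counts; 1 means every gene file has the same
# number of sequences.
distinct=$(echo "$counts" | cut -f2 | sort -u | wc -l | tr -d ' ')
echo "distinct sequence counts: $distinct"
```

If the counts differ between genes, it is worth finding out which genomes are missing from which file before going further.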
To proceed, we need to align each gene. It is as simple as the following:
mkdir spmlst_aln
find spmlst -name '*.fasta'| parallel 'mafft --thread 1 --auto {} > {}.aln'
mv spmlst/*.fasta.aln spmlst_aln
iqtree2 -p spmlst_aln -T AUTO -B 1000 -nstop 1000 --prefix partition_mlst
That's it! What we did above was create a directory to hold the alignment of each gene, run MAFFT in parallel on each gene, and move the aligned genes into that directory. Then, with IQ-TREE 2, we built a phylogenetic tree in which each gene is a separate partition.
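Passing a directory to -p means no partition file is needed. If you would rather work from a single concatenated alignment, the gene boundaries have to be written down explicitly, for example in a RAxML-style partition file, which to my knowledge IQ-TREE also accepts. The sketch below derives those boundaries from the per-gene alignment lengths, using two toy alignments (made-up content) in place of spmlst_aln. It also checks that every sequence within a gene has the same aligned length. The boundaries are only valid if the genes are concatenated in the same order in which they are listed.

```shell
# Toy stand-in for the spmlst_aln directory (hypothetical content).
demo=$(mktemp -d)
mkdir "$demo/spmlst_aln"
printf '>g1\nACGT-A\n>g2\nACGTTA\n' > "$demo/spmlst_aln/adk.fasta.aln"
printf '>g1\nTTGA\n>g2\nTT-A\n'     > "$demo/spmlst_aln/tpi.fasta.aln"

start=1
partitions=""
for f in "$demo"/spmlst_aln/*.aln; do
    # Sum the sequence lines of each record (alignments may be wrapped),
    # then keep the distinct lengths.
    lens=$(awk '/^>/ {if (n) print n; n=0; next}
                {n += length($0)}
                END {if (n) print n}' "$f" | sort -un)
    # A valid alignment has exactly one distinct length.
    n_lens=$(printf '%s\n' "$lens" | grep -c .)
    if [ "$n_lens" -ne 1 ]; then
        echo "skipping $f: sequences differ in length" >&2
        continue
    fi
    end=$((start + lens - 1))
    gene=$(basename "$f" .fasta.aln)
    partitions="${partitions}DNA, ${gene} = ${start}-${end}
"
    start=$((end + 1))
done

printf '%s' "$partitions"
```

For the toy data this assigns columns 1-6 to adk and 7-10 to tpi.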
The interesting thing is that IQ-TREE 2 treats each gene as evolving differently and selects a model for each one separately (oh, what a surprise!). Let's take a look at the IQ-TREE log:
Selecting individual models for 7 charsets using BIC...
No.  Model        Score     TreeLen  Charset
1    HKY+F+R2     3198.326  0.299    glyA.fasta.aln
2    TPM3+F+R3    2925.176  0.367    sodA.fasta.aln
3    TN+F+R2      2468.152  0.338    dxr.fasta.aln
4    HKY+F+I      2662.677  0.173    recA.fasta.aln
5    HKY+F+I+G4   2529.907  0.185    atpA.fasta.aln
6    HKY+F+I      2516.955  0.205    tpi.fasta.aln
7    HKY+F+I      2078.609  0.118    adk.fasta.aln
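To see at a glance how much the selected models differ, the table can be tallied by model. The sketch below works on the seven rows copied from the log excerpt above. On a real run you would first have to pull the same rows out of the log file yourself; how best to do that is left as an assumption here.

```shell
# Table rows copied from the log excerpt above.
table=$(cat <<'EOF'
1 HKY+F+R2 3198.326 0.299 glyA.fasta.aln
2 TPM3+F+R3 2925.176 0.367 sodA.fasta.aln
3 TN+F+R2 2468.152 0.338 dxr.fasta.aln
4 HKY+F+I 2662.677 0.173 recA.fasta.aln
5 HKY+F+I+G4 2529.907 0.185 atpA.fasta.aln
6 HKY+F+I 2516.955 0.205 tpi.fasta.aln
7 HKY+F+I 2078.609 0.118 adk.fasta.aln
EOF
)

# Count how many partitions were assigned each model (column 2),
# most frequent first.
tally=$(echo "$table" \
    | awk '{count[$2]++} END {for (m in count) print count[m], m}' \
    | sort -k1,1nr)
echo "$tally"
```

Here HKY+F+I is selected for three of the seven genes (recA, tpi and adk), and the remaining four genes each get a different model.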
To visualize the tree you can use Microreact: upload the mlst.csv file as metadata together with the IQ-TREE tree (remember to give the tree file the extension ".nwk"; for some reason it does not work without it).
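Adding the extension can be done with a plain copy. The sketch below assumes the tree you want is the .treefile that IQ-TREE writes under the chosen prefix, and uses a made-up toy tree in a temporary directory so it can be run anywhere; on real data you would copy partition_mlst.treefile in your working directory.

```shell
# Toy stand-in for the IQ-TREE output (hypothetical tree).
demo=$(mktemp -d)
printf '(A:0.1,B:0.2,(C:0.3,D:0.4)95:0.05);\n' > "$demo/partition_mlst.treefile"

# Copy rather than rename, so the original IQ-TREE output stays in place.
cp "$demo/partition_mlst.treefile" "$demo/partition_mlst.nwk"
ls "$demo"
```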
You can see my result here where each taxon is coloured according to the clade it belongs to.
This is the first tutorial since FastMLST was accepted in the journal Bioinformatics and Biology Insights. I am very happy, and I hope this tutorial will be helpful!