- Make sure you have this folder available on your machine. You can get it using git or by downloading the zip file.
- cd into the example/1KP directory. In the rest of the tutorial, we assume you are inside this example directory. To make sure, use ls and you should see:
(base) bash-3.2$ ls
GC.tar.gz README branchAnalysis.tar.gz genetrees.tar.gz occupancy.tar.gz parameters parameters.tar.gz relativeFreq.tar.gz species.tar.gz
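- Unzip the files. For example, to extract the species archive (the other .tar.gz archives are extracted the same way):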
bash-3.2$ tar xvfj species.tar.gz
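If you prefer to extract all archives in one step, the following is a minimal Python sketch and not part of DiscoVista. The function name extract_all is hypothetical, and the sketch assumes every archive sits directly in the example directory. It lets Python's tarfile module detect the compression itself, so you do not have to choose between the gzip and bzip2 flags of tar.

```python
import tarfile
from pathlib import Path


def extract_all(example_dir="."):
    """Extract every *.tar.gz archive found in example_dir into example_dir.

    Mode "r:*" asks tarfile to detect the compression (gzip, bzip2, ...).
    Only use this on archives you trust, such as the ones shipped with
    the example.
    """
    extracted = []
    for archive in sorted(Path(example_dir).glob("*.tar.gz")):
        with tarfile.open(archive, mode="r:*") as tar:
            tar.extractall(path=example_dir)
        extracted.append(archive.name)
    # Return the archive names so the caller can see what was unpacked.
    return extracted
```

Calling extract_all(".") from inside the 1KP directory should unpack each archive next to the parameters folder and return the list of archive names it processed.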
Under the parameters folder, we have all the parameter files:
- annotation-1.txt: Annotation file for the first hypothesis structure
- annotation-2.txt: Annotation file for the second hypothesis structure
- annotation-3.txt: Annotation file for the third hypothesis structure
- annotation-4.txt: Annotation file for the fourth hypothesis structure
- annotation.txt: The main annotation file. It maps every species available in your dataset (species tree) to a clade name. Note that each species is assigned to exactly one clade.
- clade-defs.txt: Clade definition file. In this file you define all the clades, according to the instructions or the code provided previously. Note that the field separator in this file is a tab.
- names.txt: Names file. List the names of the species you have in
- newModel.txt: Model condition definition file. On the first line, list the old model condition names in exactly the order you want them displayed on the x-axis of the species tree analysis. On the second line, list the new names for those model conditions in the same order. Note that the names are separated by tabs, not spaces.
- newOrders.txt: Orders definition file. On the first line, list the old clade names in exactly the order you want them displayed, and on the second line list the new clade names in the same order. Note that the names are separated by tabs, not spaces.
- rooting.txt: Rooting definition file. In this file you specify the outgroup species. You do not have to list them all on one line; you can instead list them according to their distance to the ingroup species. For example, the rooting definition available here has 3 lines. The first line holds the species most distant from the ingroup. The other two lines hold the remaining outgroup species, ordered by their distance to the ingroup after the main set of outgroups.
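To make the two-line, tab-separated layout of newModel.txt and newOrders.txt concrete, here is a minimal Python sketch that reads such a file into an ordered mapping from old names to new names. It is an illustration only, not DiscoVista's own parser. The function name parse_rename_file is hypothetical.

```python
def parse_rename_file(text):
    """Map old names to new names from a two-line, tab-separated file.

    Line 1 holds the old names in display order, line 2 the new names
    in the same order.
    """
    lines = [line for line in text.splitlines() if line.strip()]
    if len(lines) != 2:
        raise ValueError("expected exactly two non-empty lines")
    old_names = lines[0].split("\t")
    new_names = lines[1].split("\t")
    if len(old_names) != len(new_names):
        raise ValueError("both lines must list the same number of names")
    # A dict keeps insertion order, so iterating it follows the display order.
    return dict(zip(old_names, new_names))
```

A check like this catches the most common mistake in these files, namely names separated by spaces instead of tabs, because the two lines then no longer split into the same number of fields.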
Under the species folder we have 31 folders, each named with the structure Model_Condition-DST, where DST defines the type of sequence alignment. For example, astral.trim50genes33taxa.no3rd.final-FNA2AA is a folder under the species folder, where astral.trim50genes33taxa.no3rd.final is the model condition name and FNA2AA is the DST. Each folder contains a species tree named estimated_species_tree.tree. To generate the same figures as in the supplementary materials of the paper, use the following command if you installed DiscoVista on your machine:
$WS_HOME/DiscoVista/src/utils/discoVista.py -c parameters/clade-defs.txt -p species/ -t 95 -y parameters/newModel.txt -w parameters/newOrders.txt -m 0 -o species/results
or, using the docker image, you can run DiscoVista with the following command. Note that <path to example folder> is the absolute path to the directory where the 1KP example folder is placed:
docker run -v <path to example folder>/1KP:/data esayyari/discovista discoVista.py -c parameters/clade-defs.txt -p species/ -t 95 -y parameters/newModel.txt -w parameters/newOrders.txt -m 0 -o species/results
Here are the outputs:
In this figure, rows correspond to major orders and clades, and columns correspond to the results of different methods on the plants dataset. The blue-green spectrum indicates the MLBS values for monophyletic clades. Weakly rejected clades are clades that are not present in the tree, but are compatible with it if low support branches (below 90%) are contracted.
In this figure, rows correspond to major orders and clades, and columns correspond to the results of different methods on the plants dataset. Weakly rejected clades are clades that are not present in the tree, but are compatible with it if low support branches (below 90%) are contracted.
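The Model_Condition-DST folder naming used by the species tree analysis can be checked before running DiscoVista. The sketch below is a minimal illustration and not DiscoVista code. The function names are hypothetical, and it assumes the DST part contains no hyphen, so the name is split at the last hyphen.

```python
from pathlib import Path


def split_folder_name(name):
    """Split 'Model_Condition-DST' at the last hyphen."""
    model_condition, sep, dst = name.rpartition("-")
    if not sep or not model_condition or not dst:
        raise ValueError(f"not of the form Model_Condition-DST: {name!r}")
    return model_condition, dst


def find_species_trees(species_dir):
    """List (model_condition, dst, tree_path) for each folder with a species tree.

    Folders without estimated_species_tree.tree (for example an output
    folder) are skipped.
    """
    found = []
    for folder in sorted(Path(species_dir).iterdir()):
        tree = folder / "estimated_species_tree.tree"
        if folder.is_dir() and tree.is_file():
            found.append((*split_folder_name(folder.name), tree))
    return found
```

For the example folder named above, split_folder_name returns astral.trim50genes33taxa.no3rd.final as the model condition and FNA2AA as the DST. Running find_species_trees("species") on the extracted data should list one entry per model condition folder.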
Under the genetrees/filtered folder we have 852 folders, each with 3 subfolders, and each subfolder contains one gene tree named estimated_gene_trees.tree. Each of these subfolders is named ID-Model_Condition-DST. For example, we have the 4032 (ID) folder, and under this folder we have 3 subfolders: 4032-c1c2_filterlen33-FNA2AA_c1c2_filterlen33, 4032-filterlen33-FAA_filterlen33, and 4032-filterlen33-FNA2AA_filterlen33. In these folder names, c1c2_filterlen33 and filterlen33 are the model conditions, and FAA and FNA2AA are the sequence alignment types. To generate the same figures as in the supplementary materials of the paper, use the following command if you installed DiscoVista on your machine:
$WS_HOME/DiscoVista/src/utils/discoVista.py -c parameters/clade-defs.txt -p genetrees/filtered/ -t 75 -w parameters/newOrders.txt -y parameters/newModel.txt -m 1 -o genetrees/filtered/results
or, using the docker image, you can run DiscoVista with the following command. Note that <path to example folder> is the absolute path to the directory where the 1KP example folder is placed:
docker run -v <path to example folder>/1KP:/data esayyari/discovista discoVista.py -c parameters/clade-defs.txt -p genetrees/filtered/ -t 75 -w parameters/newOrders.txt -y parameters/newModel.txt -m 1 -o genetrees/filtered/results
Here are example outputs of this analysis:
This figure shows the portion of RAxML genes for which important clades (x-axis) are highly (weakly) supported or rejected for three model conditions of the plants dataset. FAA-filterlen33: gene trees on amino acid sequences, with fragmentary sequences (66% gaps or more) removed; FNA2AA-f25: amino acid sequences back translated to DNA, with sequences on long branches (25X median branch length) removed; FNA2AA-filterlen33: amino acid sequences back translated to DNA, with fragmentary sequences (66% gaps or more) removed. Weakly rejected clades are those that are not in the tree but are compatible if low support branches (below 75%) are contracted.
This figure shows the number of RAxML genes for which important clades (x-axis) are highly (weakly) supported or rejected, or are missing, for three model conditions (same as above) of the plants dataset. Weakly rejected clades are those that are not in the tree but are compatible if low support branches (below 75%) are contracted.
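The categories used in these two figures can be sketched as a small decision rule. This is a hedged illustration of the definitions in the captions, not DiscoVista's implementation. The function names are hypothetical, the sketch assumes that "highly supported" means support at or above the threshold passed with -t, and it takes the compatibility check as an input because computing it requires contracting branches in the tree.

```python
from collections import Counter


def classify_clade(present, support=None, compatible_if_contracted=False,
                   missing=False, threshold=75.0):
    """Label one clade in one gene tree.

    present: the clade appears in the tree.
    support: branch support of the clade when present (percent).
    compatible_if_contracted: the clade is absent but becomes compatible
        once branches below the threshold are contracted.
    missing: the clade cannot be evaluated in this gene.
    """
    if missing:
        return "missing"
    if present:
        if support is not None and support >= threshold:
            return "highly supported"
        return "weakly supported"
    if compatible_if_contracted:
        return "weakly rejected"
    return "highly rejected"


def category_portions(labels):
    """Turn a list of labels (one per gene) into the portion of each label."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {label: n / total for label, n in counts.items()}
```

The first figure corresponds to the portions per category, and the second to the raw counts, which also include the missing category.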
We have the GC/unfiltered folder available in the example folder. Under this folder we have 852 folders, one for each gene, and the name of each folder is considered the GENE ID, e.g. gene ID 4032. Under each of these folders we have a fasta file named DST-alignment-noFilter.fasta. For example, FNA2AA-alignment-noFilter.fasta is available under the folder GC/unfiltered/4032. To generate the same figures as in the supplementary materials of the paper, use the following command if you installed DiscoVista on your machine:
$WS_HOME/DiscoVista/src/utils/discoVista.py -m 2 -p GC/unfiltered/ -o GC/unfiltered/results
or, using the docker image, you can run DiscoVista with the following command. Note that <path to example folder> is the absolute path to the directory where the 1KP example folder is placed:
docker run -v <path to example folder>/1KP:/data esayyari/discovista discoVista.py -m 2 -p GC/unfiltered/ -o GC/unfiltered/results
This figure corresponds to the GC content analysis of the 1kp dataset. Each dot shows the average GC content ratio for each species in all (red), first (pink), second (light blue), and third (dark blue) codon positions.
This figure corresponds to the GC content analysis of the 1kp dataset, using boxplots for first, second, third, as well as all three codon positions.
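To illustrate what GC content by codon position means, here is a minimal Python sketch for a single aligned sequence. It is not DiscoVista's implementation. The function name is hypothetical, and the sketch assumes the alignment is in frame, so that the codon position of a site is its column index modulo 3. Gaps and ambiguous characters are ignored.

```python
def gc_content_by_codon_position(sequence):
    """Return the GC fraction for all sites and for codon positions 1, 2, 3.

    The result maps "all", 1, 2 and 3 to a fraction between 0 and 1,
    or to None when no A, C, G or T was seen for that key.
    """
    counts = {"all": [0, 0], 1: [0, 0], 2: [0, 0], 3: [0, 0]}  # [gc, total]
    for index, base in enumerate(sequence.upper()):
        if base not in "ACGT":
            continue  # skip gaps and ambiguity codes
        for key in ("all", index % 3 + 1):
            counts[key][1] += 1
            if base in "GC":
                counts[key][0] += 1
    return {key: (gc / total if total else None)
            for key, (gc, total) in counts.items()}
```

Averaging these fractions over all genes of one species would give one value per species and codon position, which is the kind of quantity the two figures above display.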
We have the occupancy/filtered folder available in the example folder. Under this folder we have 852 folders, one for each gene, and the name of each folder is considered the GENE ID, e.g. gene ID 4032. Under each of these folders we have fasta files named DST-alignment-Model_Condition.fasta. For example, FNA2AA-alignment-f25.fasta and FNA2AA-alignment-filterlen33.fasta are available under the folder occupancy/filtered/4032. To generate the same figures as in the supplementary materials of the paper, use the following command if you installed DiscoVista on your machine:
$WS_HOME/DiscoVista/src/utils/discoVista.py -m 3 -a parameters/annotation.txt -p occupancy/filtered/ -o occupancy/filtered/results
or, using the docker image, you can run DiscoVista with the following command. Note that <path to example folder> is the absolute path to the directory where the 1KP example folder is placed:
docker run -v <path to example folder>/1KP:/data esayyari/discovista discoVista.py -m 3 -a parameters/annotation.txt -p occupancy/filtered/ -o occupancy/filtered/results
Here are example outputs of this analysis:
This figure shows the occupancy analysis on the 1kp dataset over each individual species for two model conditions (described above).
This figure shows the occupancy analysis on the important splits of the 1kp dataset over each individual species for two model conditions (described above).
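As a rough illustration of occupancy, the sketch below computes, for each species, the fraction of gene alignments in which it appears. This is an assumption about what occupancy means here, not DiscoVista's implementation. The function names are hypothetical, and a species counts as present whenever it has a header line in the fasta file.

```python
def fasta_names(fasta_text):
    """Collect the sequence names (header lines starting with '>')."""
    return {line[1:].strip()
            for line in fasta_text.splitlines()
            if line.startswith(">")}


def species_occupancy(gene_to_species):
    """Map each species to the fraction of genes that contain it.

    gene_to_species maps a gene ID to the collection of species names
    found in that gene's alignment.
    """
    total = len(gene_to_species)
    counts = {}
    for species in gene_to_species.values():
        for name in set(species):
            counts[name] = counts.get(name, 0) + 1
    return {name: n / total for name, n in counts.items()}
```

With two genes where one species appears in both and another in only one, the first gets an occupancy of 1.0 and the second 0.5.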
Under the branchAnalysis folder available in the example folder, there are 6 folders: GAMMA.2, c1c2.GAMMA.2, c1c2.f25, c1c2_filterlen33, f25, and filterlen33. Under each of them we have a file named with the structure FNA2AA-estimated_gene_trees.tree, where you can replace FNA2AA with any alignment type or label that you wish. Each file has 852 gene trees (one per line) in the newick format. To generate the same figures as in the supplementary materials of the paper, use the following command if you installed DiscoVista on your machine:
$WS_HOME/DiscoVista/src/utils/discoVista.py -m 4 -p branchAnalysis/ -r parameters/rooting.txt -o branchAnalysis/results
or, using the docker image, you can run DiscoVista with the following command. Note that <path to example folder> is the absolute path to the directory where the 1KP example folder is placed:
docker run -v <path to example folder>/1KP:/data esayyari/discovista discoVista.py -m 4 -p branchAnalysis/ -r parameters/rooting.txt -o branchAnalysis/results
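Before running the branch analysis, it can help to confirm that every input file really holds one tree per line. The following is a minimal Python sketch, not part of DiscoVista. The function names are hypothetical, and the default of 852 trees is taken from the description of the example data above.

```python
from pathlib import Path


def count_trees(tree_file):
    """Count the non-empty lines of a file, assuming one newick tree per line."""
    with open(tree_file) as handle:
        return sum(1 for line in handle if line.strip())


def find_unexpected_tree_counts(root, expected=852):
    """Return {relative path: tree count} for files whose count differs.

    Looks one level down for files ending in -estimated_gene_trees.tree.
    """
    problems = {}
    for tree_file in sorted(Path(root).glob("*/*-estimated_gene_trees.tree")):
        n = count_trees(tree_file)
        if n != expected:
            problems[tree_file.relative_to(root).as_posix()] = n
    return problems
```

An empty result from find_unexpected_tree_counts("branchAnalysis") would mean that every model condition folder has the expected number of gene trees.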
Under the folder relativeFreq/astral.trim50genes33taxa.no3rd.final-FNA2AA, we have two files, estimated_species_tree.tree and estimated_gene_trees.tree, holding the species tree and the set of gene trees (852) in newick format. Let's assume that you want to test the relative frequencies of the first hypothesis (annotation-1.txt), in which there are 5 clades: Base (as outgroup), Charales, Coleochaetales, Landplants, and Zygnematophyceae. To generate the same figures as in the supplementary materials of the paper, use the following command if you installed DiscoVista on your machine:
$WS_HOME/DiscoVista/src/utils/discoVista.py -a parameters/annotation-1.txt -m 5 -p relativeFreq/astral.trim50genes33taxa.no3rd.final-FNA2AA/ -o relativeFreq/astral.trim50genes33taxa.no3rd.final-FNA2AA/results/anot1 -g Base
or, using the docker image, you can run DiscoVista with the following command. Note that <path to example folder> is the absolute path to the directory where the 1KP example folder is placed:
docker run -v <path to example folder>/1KP:/data esayyari/discovista discoVista.py -a parameters/annotation-1.txt -m 5 -p relativeFreq/astral.trim50genes33taxa.no3rd.final-FNA2AA/ -o relativeFreq/astral.trim50genes33taxa.no3rd.final-FNA2AA/results/anot1 -g Base
Here is the example output of this analysis:
This figure corresponds to the DiscoVista relative frequency analysis on the 1kp dataset considering one hypothesis. It shows the frequency of three topologies around focal internal branches of ASTRAL species trees using the trimmed gene trees (removing alignments with more than 66% gap characters) on first and second codon positions of amino acid alignments back translated to DNA in the 1kp dataset. Main topologies are shown in red, and the other two alternative topologies are shown in blue. The dotted lines indicate the 1/3 threshold. The title of each subfigure indicates the label of the corresponding branch on the tree on the right (also generated by DiscoVista). Each internal branch has four neighboring branches, which can be used to represent quartet topologies. On the x-axis, the exact definition of each quartet topology is shown using the neighboring branch labels separated by “#”.
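The bars and the dotted line in this figure can be related with a little arithmetic. The sketch below is a hedged illustration, not DiscoVista's implementation. The function names and the counts in the usage note are hypothetical, and the counts of gene tree quartets supporting each of the three topologies are assumed to be given.

```python
def relative_frequencies(main, alt1, alt2):
    """Normalize the counts of the three topologies so they sum to 1."""
    total = main + alt1 + alt2
    if total == 0:
        raise ValueError("no quartet counts for this branch")
    return tuple(count / total for count in (main, alt1, alt2))


def main_exceeds_one_third(main, alt1, alt2):
    """Check whether the main topology lies above the 1/3 line."""
    return relative_frequencies(main, alt1, alt2)[0] > 1 / 3
```

For example, with hypothetical counts of 6, 3 and 3, the relative frequencies are 0.5, 0.25 and 0.25, so the main topology sits above the 1/3 line. If all three topologies were equally frequent, each bar would sit exactly on that line.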