2018 SFU Invent the Future Summer Scholar Program
In this project, students learn how computational biology lets us 1) infer ancestral strains and 2) predict future strains of viruses, in order to understand the infection pathway of influenza and develop vaccines before outbreaks occur.
Phylogenetic trees are widely used in biology to represent evolutionary relationships between species, such as how wolves are related to domesticated dogs. In these trees, leaves represent currently living species, and the internal branching points indicate speciation events, where new species are thought to have arisen. The overall structure and shape of a phylogenetic tree reveals useful information such as the rate of new species formation and extinction. Tree balance describes how evenly the leaves are split between the two sides of each branching point, and branch lengths show the time or genetic distance between branching or speciation events.
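To make the idea of leaves, branching events, and branch lengths concrete, here is a minimal sketch in base R. The four sequences are made up for illustration, and average-linkage clustering (UPGMA) is used only because it is simple; it is not the maximum likelihood method used for the phylogeny in this repository.

```r
# Toy aligned sequences (made up for illustration; not from data/FASTA.fa).
seqs <- c(
  strainA = "ACGTACGTAC",
  strainB = "ACGTACGTAT",
  strainC = "ACGAACGTTT",
  strainD = "TCGAACGTTT"
)

# Proportion of sites that differ between two equal-length sequences.
p_distance <- function(a, b) {
  x <- strsplit(a, "")[[1]]
  y <- strsplit(b, "")[[1]]
  stopifnot(length(x) == length(y))
  mean(x != y)
}

# Pairwise distance matrix over all sequences.
n <- length(seqs)
dm <- matrix(0, n, n, dimnames = list(names(seqs), names(seqs)))
for (i in seq_len(n)) {
  for (j in seq_len(n)) {
    dm[i, j] <- p_distance(seqs[[i]], seqs[[j]])
  }
}

# Average-linkage clustering (UPGMA) yields a rooted tree over the strains.
tree <- hclust(as.dist(dm), method = "average")

# Leaves are the strains; each merge height is the distance at which
# two groups join, a rough stand-in for how long ago they diverged.
print(tree$labels[tree$order])
print(tree$height)
```

Calling plot(tree) draws the result as a dendrogram, which is a quick way to see which strains group together.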
Phylogenetic trees can also be used to extract information from viral/bacterial/vector speciation events generated by disease outbreaks. This information can be used to analyze the rate and patterns of how new species of virus/bacteria/vectors evolve, providing valuable information for the development of vaccines and other preventative measures.
Homepage: https://sites.google.com/view/ai4all-sfu2018/projects/bioinformatics
Github repository: https://github.com/ai4all-sfu/bio
Slides: https://docs.google.com/presentation/d/1XpPjTZGQP1KkpAeiDjAUfz-uCqQMs2x1_a4lo_aJa3o/edit?usp=sharing
- install mafft (for multiple sequence alignment) @ https://mafft.cbrc.jp/alignment/software/
- download this repository and follow the R code in index.html; remember to change the variable root to the local folder you downloaded this repository to, and if you are starting anew, set result_time_ = ""
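As a rough sketch of the setup step above: root and result_time_ are the variable names mentioned in this README, but the example path, the timestamp format, and the derived variables result_dir and data_dir are assumptions for illustration. The code in index.Rmd may build its paths differently.

```r
# Folder this repository was downloaded to (example path; change to yours).
root <- "~/bio"

# Leave empty to start a fresh run, or set it to an existing run's folder name.
result_time_ <- ""

# Assumed naming scheme for result/<date>_<time>; index.Rmd may differ.
if (result_time_ == "") {
  result_time_ <- format(Sys.time(), "%Y-%m-%d_%H-%M-%S")
}

# Hypothetical helper variables pointing at the two folders described below.
data_dir <- file.path(root, "data")
result_dir <- file.path(root, "result", result_time_)
print(result_dir)
```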
data folder contains processed data
- FASTA.fa: HA gene sequences of influenza subtype A/H3N2 from 1997 to 2017, sorted by date (n = 10370)
- meta.csv: merged metadata, sorted by date in the same order as FASTA.fa
- alignment.fa: multiple sequence alignment of the viral strains, made with mafft from FASTA.fa
- Aux_data.csv: the name and date of every strain in FASTA.fa
- FinalH3N2: maximum likelihood phylogeny built from FASTA.fa
- df.csv: features for each strain/clade used for prediction
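Since the sequence and metadata files are meant to share the same order, it can help to check that after loading them. Below is a minimal base-R sketch that parses a FASTA file and compares its headers against a metadata table. The read_fasta helper, the toy file, and the strain column name are all hypothetical; the real meta.csv may use different column names.

```r
# Parse a FASTA file into a named character vector (names = header lines).
read_fasta <- function(path) {
  lines <- readLines(path)
  lines <- lines[nzchar(lines)]
  stopifnot(length(lines) > 0, startsWith(lines[1], ">"))
  # Every header starts a new record; following lines belong to that record.
  records <- split(lines, cumsum(startsWith(lines, ">")))
  seqs <- vapply(records, function(r) paste(r[-1], collapse = ""), character(1))
  names(seqs) <- vapply(records, function(r) sub("^>", "", r[1]), character(1))
  seqs
}

# Toy file standing in for data/FASTA.fa.
demo_path <- tempfile(fileext = ".fa")
writeLines(c(">strain1", "ACGT", "ACGT", ">strain2", "TTGA"), demo_path)
demo <- read_fasta(demo_path)
print(nchar(demo))

# Hypothetical metadata with a `strain` column in the same order.
demo_meta <- data.frame(strain = c("strain1", "strain2"),
                        stringsAsFactors = FALSE)
stopifnot(identical(names(demo), demo_meta$strain))
```

For the real files, the same check would replace demo_path with the path to data/FASTA.fa and demo_meta with the table read from data/meta.csv.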
Each result/<date>_<time> folder contains the output of one run of index.Rmd, named by the time the run was made
- ind.csv: randomly sampled strain names of 200 recent viral sequences
- FASTA_anc.fa: reconstructed ancestral sequences
- FASTA_all.fa: data/FASTA.fa (minus the 200 sequences) + reconstructed ancestral sequences
- alignment_anc.fa: alignment of sequences in FASTA_all.fa
- dm_anc.Rdata: distance matrix made using alignment_anc.fa
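The hold-out step behind ind.csv can be sketched as follows. This is an illustration under stated assumptions, not the code from index.Rmd: the toy table stands in for Aux_data.csv, the name and date columns are hypothetical, and treating the latest 30% of strains as "recent" is an arbitrary cutoff chosen for the example.

```r
set.seed(1)  # arbitrary seed so the sketch is reproducible

# Toy stand-in for Aux_data.csv; the column names here are assumptions.
aux <- data.frame(
  name = sprintf("strain%04d", 1:1000),
  date = seq(as.Date("1997-01-01"), by = "week", length.out = 1000),
  stringsAsFactors = FALSE
)

# Treat the latest 30% of strains as "recent" (an arbitrary cutoff).
aux <- aux[order(aux$date), ]
recent <- tail(aux$name, ceiling(0.3 * nrow(aux)))

# Randomly hold out 200 recent strains; the rest are kept for tree building.
held_out <- sample(recent, 200)
training <- setdiff(aux$name, held_out)

# Save the sampled names, as ind.csv does for a real run.
ind_path <- file.path(tempdir(), "ind.csv")
write.csv(data.frame(name = held_out), ind_path, row.names = FALSE)
```

Holding out recent strains this way leaves sequences that played no part in building the tree, so predictions can be compared against them.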
If you have issues installing the packages in the script, install miniconda first, then install the following packages in the order listed (courtesy of Raquel)
In a terminal:
source activate py27
conda install -c r r-e1071
conda install -c r r-igraph
conda install -c geraldmc r-phylotop
conda install -c r r-nloptr
conda install -c r r-xml
In R:
install.packages('phangorn')
install.packages('phytools')
install.packages('nloptr')
install.packages('lme4')
install.packages('pbkrtest')
install.packages('car')
install.packages('NHPoisson')
install.packages('RNeXML')
install.packages('phylobase')
install.packages('phyloTop')
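If some of these packages are already installed, a small helper can skip them and install only the missing ones, still in the order listed. This is a convenience sketch; missing_packages and install_in_order are hypothetical helper names, not part of this repository.

```r
# Packages from the list above, in the same order.
pkgs <- c("phangorn", "phytools", "nloptr", "lme4", "pbkrtest", "car",
          "NHPoisson", "RNeXML", "phylobase", "phyloTop")

# Return the packages from `pkgs` that are not yet installed.
missing_packages <- function(pkgs) {
  installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
  pkgs[!installed]
}

# Install one at a time so a failure is easy to attribute to a package.
install_in_order <- function(pkgs) {
  for (p in missing_packages(pkgs)) {
    install.packages(p)
  }
}

# install_in_order(pkgs)  # uncomment to run
```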