-
Notifications
You must be signed in to change notification settings - Fork 3
Tutorial
Welcome! This is a short guide to population genetics analysis using ANGSD-wrapper. For additional information on how each wrapper is run, please refer to the Methods sidebar. We will be using a test data set containing sequences from Zea mays subsp. parviglumis, Zea mays subsp. mays, Zea mays subsp. mexicana, and Tripsacum dactylodies. For help with installing ANGSD-wrapper, please see the installation page. Note that we assume the user is using bash for this tutorial. Using zsh is possible, but you have to modify the path to point to the ANGSD-wrapper executable in your zsh configuration.
To download and set up a directory with test data and example configuration files, we run the following command:
./angsd-wrapper setup data
These data and configuration files are located in the Example_Data
directory. Finally, ANGSD-wrapper will be installed system-wide so that it can be used from any working directory. To make sure ANGSD-wrapper installed correctly, run angsd-wrapper
, without the ./
that we used before.
In the Example_Data
directory, there are five directories: three for the three Zea samples that we will be running with and one for the reference and ancestral sequences. The fifth directory contains example configuration files that have been pre-filled. These will work for the tutorial, but we recommend following along to get practice creating your own configuration files.
Directory | Contents | File names |
---|---|---|
Configuration_Files |
Example configuration files pre-filled for all three Zea samples for the following methods:
Common_Config for each Zea sample |
|
Maize |
|
|
Mexicana |
|
|
Sequences |
|
|
Teosinte |
|
|
Some programs, when generating BAM files, will not include the
@HD
header line. To see if you have this line, useSAMTools
to check the header for your BAM files:
samtools view -H <name of BAM file> | head -1The
@HD
header line should be the first line that pops up; if you don't see it, this Gist will add one for you.
ANGSD-wrapper has many different routines, or wrappers, that it can perform on a given dataset; we will be working with the Site Frequency Spectrum (SFS), Thetas Estimator, Admixture Analysis, and Principal Component Analysis (PCA) routines for this tutorial. To see all available wrappers, run angsd-wrapper
without any arguments.
We will also be graphing our results using a Shiny web app. All analyses should be done using a supercomputer-like device, at least 32 GB of RAM, and all graphing should be done using a computer with a graphical user interface. If you have access to a supercomputer cluster, we recommend setting up ANGSD-wrapper on both the cluster for analysis and local machine for graphing.
Note: There are pre-filled example configuration files in the
Example_Data/Configuration_Files
directory. If you would like to use these files instead of creating your own, you may do so, though we strongly recommend practicing creating your own configuration files
ANGSD-wrapper uses configuration files to figure out where the data is and what options should be passed to ANGSD and other dependencies. There is one configuration file per wrapper included with angsd-wrapper
, as well as a common configuration file (Common_Config
) that can be used by multiple wrappers. For this example, we will configure ANGSD-wrapper to analyze the Zea mays subsp. mays samples. All of these are located in the Configuration_Files
directory; we recommend copying this directory to another directory so that there is always a clean copy of the configuration files available. In this case, starting in the angsd-wrapper
directory, we will copy the Configuration_Files
directory into the Maize
directory inside the Example_Data
directory using the following command:
cp -r Configuration_Files/ Example_Data/Maize/
Each wrapper-specific configuration file is split into three parts: the
COMMON
definition, the 'not-using-common' section, and the wrapper-specific variables section. If a wrapper utilizes theCommon
definition, it will always be on line 10. The 'not-using-common' section is blocked off by 94 hash marks (#
). If you are not using theCommon_Config
file, please fill out the variable definitions in this section. Since we're usingCommon_Config
, we can skip these lines. Finally, the wrapper-specific section includes any other variable definitions as well as parameters for the specific wrapper. If theCommon_Config
files are used, they will overwrite anything specified in the wrapper-specific section. This is to ensure similarity between analyses that make use of theCommon_Config
file.
Now, let's go into the Maize
directory and figure out the full path to this directory using pwd
.
cd Example_Data/Maize
pwd
This will output a string that starts with /home/
; go ahead and copy everything following the forward slash after your user name. For example, if we get /home/user_group/user_name/software/angsd-wrapper/Example_Data/Maize
as our output, we only need /software/angsd-wrapper/Example_Data/Maize
.
Now, we'll go find our configuration files in the Configuration_Files
directory:
cd Configuration_Files/
Because we're using multiple wrappers in this tutorial, we'll use the Common_Config
file to hold variables that will be used across all methods. Open Common_Config
in your favorite text editor, such as Vim or Emacs.
First, we need to define a list of samples. On line 10 of Common_Config
, there's a place to define this sample list. If we remember back in our Maize
directory, our sample list is called Maize_Samples.txt
So, to tell ANGSD-wrapper where our sample list is, we will use our Maize
example with the directory location being /home/user_group/user_name/software/angsd-wrapper/Example_Data/Maize
. We'll make sure line 10 looks like this:
SAMPLE_LIST=${HOME}/software/angsd-wrapper/Example_Data/Maize/Maize_Samples.txt
If you're using the dev
branch, it looks like this:
GROUP_SAMPLES=${HOME}/software/angsd-wrapper/Example_Data/Maize/Maize_Samples.txt
We use ${HOME}
to help ANGSD-wrapper find your files, if we used /home
, some systems will error out, saying that the directory does not exist.
Adjust the /software/angsd-wrapper/Example_Data
part to whatever you copied from your output.
Next, we need our list of inbreeding coefficients. This is called Maize_Inbreeding.indF
, to run this we tell ANGSD-wrapper where this file is on line 13 of our Common_Config
file:
SAMPLE_INBREEDING=${HOME}/software/angsd-wrapper/Example_Data/Maize/Maize_Inbreeding.indF
If you're using the dev
branch, it looks like this:
GROUP_INBREEDING=${HOME}/software/angsd-wrapper/Example_Data/Maize/Maize_Inbreeding.indF
Lines 17 and 20 ask for our ancestral and reference sequences. These are Tripsacum_TDD39103.fa
and Zea_mays.AGPv3.30.dna_sm.chromosome.10.fa
, respectively, found in the Sequences
directory. In the Common_Config
file, we'd enter the following on their respective lines:
ANC_SEQ=${HOME}/software/angsd-wrapper/Example_Data/Sequences/Tripsacum_TDD39103.fa
REF_SEQ=${HOME}/software/angsd-wrapper/Example_Data/Sequences/Zea_mays.AGPv3.30.dna_sm.chromosome.10.fa
Now we need to set up our outdirectory structure. We use two variables to define this: PROJECT
and SCRATCH
. All output files will be placed in $SCRATCH/$PROJECT/<name_of_program>
; for example, if we set SCRATCH
to be "${HOME}/scratch
" and PROJECT
to be "Maize" and calculated a site frequency spectrum, our outdirectory would be ${HOME}/scratch/Maize/SFS
.
If we used the same SCRATCH
and PROJECT
assignments and estimated Thetas, our outdirectory would be ${HOME}/scratch/Maize/Thetas
. The outdirectory structure is generated automatically, making any directory within the structure that doesn't already exist, so it is not necessary to make these directories before hand.
Let's set SCRATCH
to be "${HOME}/scratch
" and PROJECT
to be "Maize"; we define these two variables on lines 23, for PROJECT
, and 28, for SCRATCH
:
PROJECT=Maize
SCRATCH=${HOME}/scratch
Finally, we need to specify a regions file for ANGSD-wrapper. While we can run ANGSD-wrapper without a regions file, it becomes very computationally expensive and takes much longer. If you would like to generate a regions file, taking a random sample of all possible regions, this Gist will create a valid regions file for you.
We have a regions file in our Example_Data
directory called Maize_Regions.txt
, let's tell ANGSD-wrapper where this is on line 32 of Common_Config
REGIONS=${HOME}/software/angsd-wrapper/Example_Data/Maize/Maize_Regions.txt
Now we're ready to run ANGSD-wrapper to calculate a site frequency spectrum, estimate Thetas, perform an analyze admixture, and run a principal component analysis. Close out of Common_Config
, and be sure to save your changes. We're going to stay in the Configuration_Files
directory for now; we're going to need the full path for this directory, which we obtain with the following command:
pwd
Again, we'll get an output starting with /home/
and we only need the part after the second forward slash. Using our directory structure from before, our output would be /home/software/angsd-wrapper/Example_Data/Maize/Configuration_Files
and we need /software/angsd-wrapper/Example_Data/Maize/Configuration_Files
Each wrapper function has its own configuration file associated with it. To run the site frequency spectrum, we need the Site_Frequency_Spectrum_Config
file. Open this up in your favorite text editor.
In Site_Frequency_Spectrum_Config
, we need to tell ANGSD-wrapper where our Common_Config
file is. This definition is on line 10:
COMMON=${HOME}/software/angsd-wrapper/Example_Data/Maize/Configuration_Files/Common_Config
Remember to adjust the /software/angsd-wrapper/Example_Data/Maize/Configuration_Files
part for your own directory structure.
Most of the other variables and parameters are set up to run smoothly. For now, we're going to set OVERRIDE
to be true
, in case we run this another time and want updated results; we change this on line 53:
OVERRIDE=true
ANGSD-wrapper has several functions, or wrappers, built into it. These are predefined and very specific. The syntax for running ANGSD-wrapper is as follows:
angsd-wrapper <wrapper> <configuration file>
Where
<wrapper>
is one of the wrappers that ANGSD-wrapper can perform and<configuration file>
is the full path to the configuration file we set up for it. To see a full list of wrappers that ANGSD-wrapper has and how to call them, run the following command:
angsd-wrapper
This will display a usage message with the wrappers ANGSD-wrapper has and how to call them. Capitalization and spelling are very important with ANGSD-wrapper; you must type out what you see in the usage message to get ANGSD-wrapper to run. Also, you don't have to use our presupplied configuration files with ANGSD-wrapper, but you do need to have all of the variable definitions that we have supplied.
Now, lets calculate a site frequency spectrum using ANGSD-wrapper:
angsd-wrapper SFS ./Site_Frequency_Spectrum_Config
Once this finishes, our output files will be in the outdirectory we specified, ${HOME}/scratch/Maize/SFS
, let's go there and look at our files:
cd ${HOME}/scratch/Maize/SFS/
ls
The following are the output files we should see in the SFS
directory:
Maize_DerivedSFS
Maize_SFSOut.arg
Maize_SFSOut.geno.gz
Maize_SFSOut.mafs.gz
Maize_SFSOut.saf.gz
Maize_SFSOut.saf.idx
Maize_SFSOut.saf.pos.gz
We'll need the Maize_DerivedSFS
file for our Thetas estimation and graphing later on.
Now, we need to go back to our Configuration_Files
directory so we can set up ANGSD-wrapper to estimate Thetas values for us. We use the cd
command to do this:
cd ${HOME}/software/angsd-wrapper/Example_Data/Maize/Configuration_Files/
Open up Thetas_Config
in your favorite text editor. We have three variables we need to define in this configuration file. First, we need to tell ANGSD-wrapper where our Common_Config
file is; this will be the same as what we put in our Site_Frequency_Spectrum_Config
file. On line 10, we'll put:
COMMON=${HOME}/software/angsd-wrapper/Example_Data/Maize/Configuration_Files/Common_Config
Next, we need to specify our pest (prior estimation) file. This file comes from our site frequency spectrum; in this case, our file is Maize_DerivedSFS
. We need to specify this on line 41 of Thetas_Config
:
PEST=${HOME}/scratch/Maize/SFS/Maize_DerivedSFS
Finally, let's set OVERRIDE
to be true
again; we do this on line 56:
OVERRIDE=true
Now, we can estimate Thetas values using ANGSD-wrapper; we do this with the following command:
angsd-wrapper Thetas ./Thetas_Config
Our output files will be in ${HOME}/scratch/Maize/Thetas
, let's go there and look at our files
cd ${HOME}/scratch/Maize/Thetas/
ls
Here we have output files we should see in the Thetas
directory:
Maize_Diversity.arg
Maize_Diversity.mafs.gz
Maize_Diversity.thetas.gz
Maize_Diversity.thetas.gz.bin
Maize_Diversity.thetas.gz.idx
Maize_Diversity.thetas.gz.pestPG
Maize_Diversity.thetas.graph.me
Let's go back out to Configuration_Files
directory to set up our admixture analysis:
First we need to get genotype likelihoods from the Genotypes
command. Open the Genotype_Config
file and add in the path to the Common_Config
file:
COMMON=${HOME}/software/angsd-wrapper/Example_Data/Maize/Configuration_Files/Common_Config
Then run the Genotypes
command to get out the genotype likelihoods (in beagle format).
angsd-wrapper Genotypes ./Genotype_Config
Now that this is done, we can set up the admixture configuration file:
cd ${HOME}/software/angsd-wrapper/Example_Data/Maize/Configuration_Files/
We need to edit variables in Admixture_Config
to tell ANGSD-wrapper where everything is for the admixture analysis. Open up Admixture_Analysis
with your favorite text editor. On line 10, we need to specify where our Common_Config
file is:
COMMON=${HOME}/software/angsd-wrapper/Example_Data/Maize/Configuration_Files/Common_Config
The only other variable we need to specify is our likelihood file. On line 26 of Admixture_Config
, we need to tell ANGSD-wrapper where this likelihood file is:
LIKELIHOOD=${HOME}/scratch/Maize/GenotypeLikelihoods/Maize_SFSOut.beagle.gz
Now, let's run the admixture analysis. This is done with the following command:
angsd-wrapper Admixture ./Admixture_Config
There are several commands that run optimization methods that may end up converging upon a local maxima. In order to make sure the optimization finds the global maximum, we recommend running the program several times from different starting points. For Admixture, this just means running the program several times because the starting point is chosen randomly.
Our output files will be in ${HOME}/scratch/Maize/Admixture
, let's go there an look at our files.
cd ${HOME}/scratch/Maize/Admixture/
ls
Here, we see some more output files in the Admixture
directory:
Maize.2.filter
Maize.2.fopt.gz
Maize.2.log
Maize.2.qopt
Maize.3.filter
Maize.3.fopt.gz
Maize.3.log
Maize.3.qopt
Maize.4.filter
Maize.4.fopt.gz
Maize.4.log
Maize.4.qopt
Maize.5.filter
Maize.5.fopt.gz
Maize.5.log
Maize.5.qopt
Where each number in the filename correlates with the number of K ancestral populations graphed.
Let's go back out to Configuration_Files
directory to set up our principal component analysis (PCA):
cd ${HOME}/software/angsd-wrapper/Example_Data/Maize/Configuration_Files/
We only need to tell ANGSD-wrapper where our Common_Config
file is, everything else for the PCA is taken care of. We do this on line 10 of Principal_Component_Analysis_Config
:
COMMON=${HOME}/software/angsd-wrapper/Example_Data/Maize/Configuration_Files/Common_Config
Now, let's run the PCA. We use the following command to do this:
angsd-wrapper PCA ./Principal_Component_Analysis_Config
Our output files will be in ${HOME}/scratch/Maize/PCA
, let's go there and look at our files.
cd ${HOME}/scratch/Maize/PCA/
ls
Here are the output files we should see in the PCA
directory:
Maize_PCA.arg
Maize_PCA.covar
Maize_PCA.geno
Maize_PCA.mafs.gz
ANGSD-wrapper comes with a visualization package, based off of Rstudio's Shiny platform. To use this, we need to be on a machine that has a graphical user interface (GUI) and a web browser. If you used ANGSD-wrapper on a high performance computing system, please transfer your files to another machine with a GUI so we can utilize the visualization package. You may need to setup ANGSD-wrapper again.
The files we need for graphing are:
Maize_DerivedSFS
Maize.pestPG
- All
.qopt
files Maize_PCA.covar
Use git clone
to clone ANGSD-wrapper in a terminal window on your local machine.
git clone https://github.com/mojaveazure/angsd-wrapper.git
cd angsd-wrapper
Setup ANGSD-wrapper on your local machine the same way you set it up at the beginning of this tutorial. You will not need to setup the data again, only the dependencies.
./angsd-wrapper setup dependencies
source ~/.bash_profile
Once setup is complete, you can now start Shiny.
To start the Shiny graphing interface run:
angsd-wrapper shiny graphing
Additional help text is available on the side panels within Shiny.