Tutorial
Welcome! This is a short guide to population genetics analysis using ANGSD-wrapper. For additional information on how each wrapper is run, please refer to the Methods sidebar. We will be using a test data set containing sequences from Zea mays subsp. parviglumis, Zea mays subsp. mays, Zea mays subsp. mexicana, and Tripsacum dactyloides. First, we need to clone the ANGSD-wrapper repository. You need to have Git installed to do this. Alternatively, you can download a zip file from the releases page or use the download zip button on the home page of the repository.
The basic dependencies for ANGSD-wrapper are SAMTools, Wget, and Git. Most Linux distributions have Wget and Git installed by default, but some do not. Users will need to download SAMTools and its dependency, HTSlib, from their GitHub pages or through a package manager.
You will need to install all three of the basic dependencies, as well as the GNU Scientific Library, to run ANGSD-wrapper. We recommend using Homebrew to manage the installation process. If you use Homebrew, you can install all of the required dependencies from anywhere in your terminal with the following commands:
brew install git
brew install samtools
brew install wget
brew install gsl
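Before moving on, it can help to confirm that the command-line dependencies are actually on your `PATH`. The following is a minimal sketch, not part of ANGSD-wrapper; the function name `list_missing_tools` is made up for this example, and it only checks executables, so the GNU Scientific Library (a library, not a command) is not covered.

```shell
# List the tools from the given names that are not found on PATH.
# Prints one missing name per line; prints nothing if all are present.
# (list_missing_tools is a hypothetical helper, not an ANGSD-wrapper command.)
list_missing_tools() {
    for tool in "$@"; do
        if ! command -v "$tool" >/dev/null 2>&1; then
            printf '%s\n' "$tool"
        fi
    done
}

# The three basic dependencies named in this tutorial.
missing=$(list_missing_tools git samtools wget)
if [ -n "$missing" ]; then
    printf 'Not found on PATH:\n%s\n' "$missing"
else
    printf 'git, samtools, and wget are all available.\n'
fi
```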
---
### Downloading and Installing ANGSD-wrapper
We'll use `Git` to download ANGSD-wrapper. To do this, type the following commands:
```shell
git clone https://github.com/mojaveazure/angsd-wrapper.git
cd angsd-wrapper
```
ANGSD-wrapper comes with its own version of ANGSD, as well as a few other programs, so that compatibility-breaking changes in ANGSD do not affect ANGSD-wrapper. To compile these programs, you must run the following setup routine:
./angsd-wrapper setup dependencies
source ~/.bash_profile
This will download and install ANGSD, ngsAdmix, ngsTools, and ngsF. All of these programs are downloaded to the `dependencies` directory.
To download and set up a directory with test data, we run the following command:
./angsd-wrapper setup data
These data are located in the `Example_Data` directory. Finally, ANGSD-wrapper will be installed system-wide so that it can be used from any working directory. To make sure ANGSD-wrapper installed correctly, run `angsd-wrapper`, without the `./` that we used before.
In the `Example_Data` directory, there are four directories: three for the Zea samples that we will be working with and one for the reference and ancestral sequences.
Directory | Contents | File names |
---|---|---|
Maize | | |
Mexicana | | |
Sequences | | |
Teosinte | | |
Some programs, when generating BAM files, will not include the `@HD` header line. To see if you have this line, use SAMTools to check the header for your BAM files:
samtools view -H <name of BAM file> | head -1
The `@HD` header line should be the first line that pops up; if you don't see it, this Gist will add one for you.
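If you have many BAM files, checking each header by eye gets tedious. The sketch below shows one way the check could be scripted; `first_line_is_hd` is a hypothetical helper that reads header text from standard input, and the two headers in the demonstration are hand-written stand-ins, not real data.

```shell
# Succeeds (exit status 0) when the first line read from standard input
# begins with @HD, as the first line of a SAM/BAM header should.
# (first_line_is_hd is a hypothetical helper written for this example.)
first_line_is_hd() {
    IFS= read -r first_line || [ -n "$first_line" ] || return 1
    case "$first_line" in
        @HD*) return 0 ;;
        *)    return 1 ;;
    esac
}

# Demonstration on two hand-written headers.
if printf '@HD\tVN:1.6\tSO:coordinate\n@SQ\tSN:10\tLN:1000\n' | first_line_is_hd; then
    echo "first header: @HD present"
fi
if ! printf '@SQ\tSN:10\tLN:1000\n' | first_line_is_hd; then
    echo "second header: @HD missing"
fi

# With a real file, the header would come from SAMTools instead:
#   samtools view -H sample.bam | first_line_is_hd
```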
ANGSD-wrapper has many different routines, or wrappers, that it can perform on a given dataset; we will be working with the Site Frequency Spectrum (SFS), Thetas Estimator, Admixture Analysis, and Principal Component Analysis (PCA) routines for this tutorial. To see all available wrappers, run `angsd-wrapper` without any arguments.
We will also be graphing our results using a Shiny web app. All analysis should be done on a high-performance machine with at least 32 GB of RAM, and all graphing should be done on a computer with a graphical user interface. If you have access to a supercomputer cluster, we recommend setting up ANGSD-wrapper on both the cluster, for analysis, and your local machine, for graphing.
ANGSD-wrapper uses configuration files to figure out where the data is and what options should be passed to ANGSD and other dependencies. There is one configuration file per wrapper included with `angsd-wrapper`, as well as a common configuration file (`Common_Config`) that can be used by multiple wrappers. All of these are located in the `Configuration_Files` directory; we recommend copying this directory to another directory so that there is always a clean copy of the configuration files available. For this example, we will configure ANGSD-wrapper to analyse the Zea mays subsp. mays samples. Starting in the `angsd-wrapper` directory, we will copy the `Configuration_Files` directory into the `Maize` directory inside the `Example_Data` directory using the following command:
cp -r Configuration_Files/ Example_Data/Maize/
Each wrapper-specific configuration file is split into three parts: the `COMMON` definition, the 'not-using-common' section, and the wrapper-specific variables section. If a wrapper utilizes the `COMMON` definition, it will always be on line 10. The 'not-using-common' section is blocked off by 94 hash marks (`#`). If you are not using the `Common_Config` file, please fill out the variable definitions in this section. Since we're using `Common_Config`, we can skip these lines. Finally, the wrapper-specific section includes any other variable definitions as well as parameters for the specific wrapper.
Now, let's go into the `Maize` directory and figure out the full path to this directory using `pwd`.
cd Example_Data/Maize
pwd
This will output a string that starts with `/home/`; go ahead and copy everything from the forward slash after your user name onward. For example, if we get `/home/user_group/user_name/software/angsd-wrapper/Example_Data/Maize` as our output, we only need `/software/angsd-wrapper/Example_Data/Maize`.
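Instead of copying the path by hand, the shell can strip the home directory prefix for you. This is a small sketch using standard parameter expansion; `path_after_home` is a name invented for this example, and the paths are the illustrative ones from the text above.

```shell
# Print the part of a path that follows a given home directory.
# Prints the path unchanged if it does not start with that directory.
# (path_after_home is a hypothetical helper written for this example.)
path_after_home() {
    home_dir=$1
    full_path=$2
    rest=${full_path#"$home_dir"}   # remove the prefix, if present
    printf '%s\n' "$rest"
}

# Using the example paths from this tutorial:
path_after_home /home/user_group/user_name \
    /home/user_group/user_name/software/angsd-wrapper/Example_Data/Maize
# prints /software/angsd-wrapper/Example_Data/Maize

# In your own session you would typically run:
#   path_after_home "$HOME" "$(pwd)"
```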
Now, we'll go find our configuration files in the `Configuration_Files` directory:
cd Configuration_Files/
Because we're using multiple wrappers in this tutorial, we'll use the `Common_Config` file to hold variables that will be used across all methods. Open `Common_Config` in your favorite text editor, such as Vim or Emacs.
First, we need to define a list of samples. On line 10 of `Common_Config`, there's a place to define this sample list. If we remember back in our `Maize` directory, our sample list is called `Maize_Samples.txt`. So, to tell ANGSD-wrapper where our sample list is, we will use our `Maize` example with the directory location being `/home/user_group/user_name/software/angsd-wrapper/Example_Data/Maize`. We'll make sure line 10 looks like this:
SAMPLE_LIST=${HOME}/software/angsd-wrapper/Example_Data/Maize/Maize_Samples.txt
If you're using the `dev` branch, it looks like this:
GROUP_SAMPLES=${HOME}/software/angsd-wrapper/Example_Data/Maize/Maize_Samples.txt
We use `${HOME}` to help ANGSD-wrapper find your files; if we used `/home`, some systems would error out, saying that the directory does not exist.
Adjust the `/software/angsd-wrapper/Example_Data` part to whatever you copied from your output.
Next, we need our list of inbreeding coefficients. This file is called `Maize_Inbreeding.indF`; we tell ANGSD-wrapper where it is on line 13 of our `Common_Config` file:
SAMPLE_INBREEDING=${HOME}/software/angsd-wrapper/Example_Data/Maize/Maize_Inbreeding.indF
If you're using the `dev` branch, it looks like this:
GROUP_INBREEDING=${HOME}/software/angsd-wrapper/Example_Data/Maize/Maize_Inbreeding.indF
Lines 17 and 20 ask for our ancestral and reference sequences. These are `Tripsacum_TDD39103.fa` and `Zea_mays.AGPv3.30.dna_sm.chromosome.10.fa`, respectively, found in the `Sequences` directory. In the `Common_Config` file, we'd enter the following on their respective lines:
ANC_SEQ=${HOME}/software/angsd-wrapper/Example_Data/Sequences/Tripsacum_TDD39103.fa
REF_SEQ=${HOME}/software/angsd-wrapper/Example_Data/Sequences/Zea_mays.AGPv3.30.dna_sm.chromosome.10.fa
Now we need to set up our output directory structure. We use two variables to define this: `PROJECT` and `SCRATCH`. All output files will be placed in `$SCRATCH/$PROJECT/<name_of_program>`; for example, if we set `SCRATCH` to be `${HOME}/scratch` and `PROJECT` to be `Maize` and calculated a site frequency spectrum, our output directory would be `${HOME}/scratch/Maize/SFS`.
If we used the same `SCRATCH` and `PROJECT` assignments and estimated Thetas, our output directory would be `${HOME}/scratch/Maize/Thetas`. ANGSD-wrapper generates the output directory structure automatically, creating any directory within the structure that doesn't already exist, so it is not necessary to make these directories beforehand.
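To make the layout concrete, here is a rough sketch of how such a path is composed and created. It illustrates the `$SCRATCH/$PROJECT/<name_of_program>` convention only and is not ANGSD-wrapper's own code; `make_outdir` is a hypothetical name, and the demonstration uses a throwaway temporary directory in place of a real scratch space.

```shell
# Build and create the output directory for one wrapper run, following
# the $SCRATCH/$PROJECT/<name_of_program> layout described above.
# (make_outdir is a hypothetical helper, not ANGSD-wrapper's implementation.)
make_outdir() {
    scratch=$1
    project=$2
    program=$3
    outdir="$scratch/$project/$program"
    mkdir -p "$outdir"    # also creates any missing parent directories
    printf '%s\n' "$outdir"
}

# Demonstration in a temporary directory standing in for SCRATCH.
demo_scratch=$(mktemp -d)
make_outdir "$demo_scratch" Maize SFS
make_outdir "$demo_scratch" Maize Thetas
rm -rf "$demo_scratch"
```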
Let's set `SCRATCH` to be `${HOME}/scratch` and `PROJECT` to be `Maize`; we define these two variables on lines 23, for `PROJECT`, and 28, for `SCRATCH`:
PROJECT=Maize
SCRATCH=${HOME}/scratch
Finally, we need to specify a regions file for ANGSD-wrapper. While we can run ANGSD-wrapper without a regions file, it becomes very computationally expensive and takes much longer. If you would like to generate a regions file, taking a random sample of all possible regions, this Gist will create a valid regions file for you.
We have a regions file in our `Example_Data` directory called `Maize_Regions.txt`; let's tell ANGSD-wrapper where this is on line 32 of `Common_Config`:
REGIONS=${HOME}/software/angsd-wrapper/Example_Data/Maize/Maize_Regions.txt
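A mistyped path in `Common_Config` (paths are case-sensitive) is easy to miss until a wrapper fails. The sketch below checks that each file variable points at an existing file. It assumes the configuration file is plain shell variable assignments, as the lines above suggest, and uses the variable names shown in this tutorial (not the `dev` branch names); `check_config_paths` is a hypothetical helper.

```shell
# Source a Common_Config-style file in a subshell and report every file
# variable that is unset or does not point to an existing file.
# (check_config_paths is a hypothetical helper, not an ANGSD-wrapper command.)
check_config_paths() {
    (
        . "$1" || exit 1
        status=0
        for name in SAMPLE_LIST SAMPLE_INBREEDING ANC_SEQ REF_SEQ REGIONS; do
            eval "value=\${$name:-}"
            if [ ! -f "$value" ]; then
                printf '%s: no such file: %s\n' "$name" "$value"
                status=1
            fi
        done
        exit "$status"
    )
}

# Demonstration with a deliberately incomplete stand-in configuration.
demo=$(mktemp -d)
touch "$demo/Maize_Samples.txt"
cat > "$demo/Common_Config" <<EOF
SAMPLE_LIST=$demo/Maize_Samples.txt
REGIONS=$demo/Maize_Regions.txt
EOF
check_config_paths "$demo/Common_Config" || echo "fix the paths listed above"
rm -rf "$demo"
```

With your own files, you would pass the full path to your edited `Common_Config`.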
Now we're ready to run ANGSD-wrapper to calculate a site frequency spectrum, estimate Thetas, perform an admixture analysis, and run a principal component analysis. Save your changes and close `Common_Config`. We're going to stay in the `Configuration_Files` directory for now; we're going to need the full path for this directory, which we obtain with the following command:
pwd
Again, we'll get an output starting with `/home/`, and we only need the part that follows our user name. Using our directory structure from before, our output would be `/home/user_group/user_name/software/angsd-wrapper/Example_Data/Maize/Configuration_Files`, and we need `/software/angsd-wrapper/Example_Data/Maize/Configuration_Files`.
Each wrapper function has its own configuration file associated with it. To run the site frequency spectrum, we need the `Site_Frequency_Spectrum_Config` file. Open this up in your favorite text editor.
In `Site_Frequency_Spectrum_Config`, we need to tell ANGSD-wrapper where our `Common_Config` file is. This definition is on line 10:
COMMON=${HOME}/software/angsd-wrapper/Example_Data/Maize/Configuration_Files/Common_Config
Remember to adjust the `/software/angsd-wrapper/Example_Data/Maize/Configuration_Files` part for your own directory structure.
Most of the other variables and parameters are set up to run smoothly. For now, we're going to set `OVERRIDE` to be `true`, in case we run this another time and want updated results; we change this on line 53:
OVERRIDE=true
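If you would rather script these edits than make them in a text editor, a variable can be rewritten by name instead of by line number. The following is a sketch under the assumption that each setting sits on its own line as `NAME=value`; `set_config_var` is a hypothetical helper, and the file in the demonstration is a two-line stand-in rather than the shipped configuration file.

```shell
# Replace the value of NAME=... in a shell-style configuration file.
# Lines that do not start with NAME= are copied through unchanged; if no
# line matches, the file is left as it was.
# (set_config_var is a hypothetical helper written for this example.)
set_config_var() {
    file=$1
    tmp=$(mktemp) || return 1
    CFG_NAME=$2 CFG_VALUE=$3 awk '
        index($0, ENVIRON["CFG_NAME"] "=") == 1 {
            print ENVIRON["CFG_NAME"] "=" ENVIRON["CFG_VALUE"]
            next
        }
        { print }
    ' "$file" > "$tmp" && cat "$tmp" > "$file" && rm -f "$tmp"
}

# Demonstration on a stand-in file.
demo=$(mktemp)
printf 'COMMON=\nOVERRIDE=false\n' > "$demo"
set_config_var "$demo" OVERRIDE true
set_config_var "$demo" COMMON '${HOME}/software/angsd-wrapper/Example_Data/Maize/Configuration_Files/Common_Config'
cat "$demo"
rm -f "$demo"
```

Single quotes around the `COMMON` value keep `${HOME}` literal, so it is written to the file exactly as the tutorial shows.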
ANGSD-wrapper has several functions, or wrappers, built into it. These are predefined and very specific. The syntax for running ANGSD-wrapper is as follows:
angsd-wrapper <wrapper> <configuration file>
Where `<wrapper>` is one of the wrappers that ANGSD-wrapper can perform and `<configuration file>` is the full path to the configuration file we set up for it. To see a full list of wrappers that ANGSD-wrapper has and how to call them, run the following command:
angsd-wrapper
This will display a usage message with the wrappers ANGSD-wrapper has and how to call them. Capitalization and spelling are very important with ANGSD-wrapper; you must type out what you see in the usage message to get ANGSD-wrapper to run. Also, you don't have to use our presupplied configuration files with ANGSD-wrapper, but you do need to have all of the variable definitions that we have supplied.
Now, let's calculate a site frequency spectrum using ANGSD-wrapper:
angsd-wrapper SFS ./Site_Frequency_Spectrum_Config
Once this finishes, our output files will be in the output directory we specified, `${HOME}/scratch/Maize/SFS`. Let's go there and look at our files:
cd ${HOME}/scratch/Maize/SFS/
ls
The following are the output files we should see in the `SFS` directory:
Maize_DerivedSFS
Maize_SFSOut.arg
Maize_SFSOut.beagle.gz
Maize_SFSOut.geno.gz
Maize_SFSOut.mafs.gz
Maize_SFSOut.saf.gz
Maize_SFSOut.saf.idx
Maize_SFSOut.saf.pos.gz
We'll need the `Maize_DerivedSFS` file for our Thetas estimation and graphing later on.
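Rather than comparing the `ls` output with the list by eye, you could ask the shell which expected files are absent. This is a small sketch; `list_missing_files` is a hypothetical helper, and the file names are the ones listed above.

```shell
# Print every expected file name that is absent from a directory.
# Prints nothing when all of the files are present.
# (list_missing_files is a hypothetical helper written for this example.)
list_missing_files() {
    dir=$1
    shift
    for name in "$@"; do
        [ -e "$dir/$name" ] || printf '%s\n' "$name"
    done
}

# Check the SFS output directory used in this tutorial.
list_missing_files "${HOME}/scratch/Maize/SFS" \
    Maize_DerivedSFS \
    Maize_SFSOut.arg \
    Maize_SFSOut.beagle.gz \
    Maize_SFSOut.geno.gz \
    Maize_SFSOut.mafs.gz \
    Maize_SFSOut.saf.gz \
    Maize_SFSOut.saf.idx \
    Maize_SFSOut.saf.pos.gz
```

The same function can be reused for the other wrappers by passing their output directory and file names.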
Now, we need to go back to our `Configuration_Files` directory so we can set up ANGSD-wrapper to estimate Thetas values for us. We use the `cd` command to do this:
cd ${HOME}/software/angsd-wrapper/Example_Data/Maize/Configuration_Files/
Open up `Thetas_Config` in your favorite text editor. We have three variables we need to define in this configuration file. First, we need to tell ANGSD-wrapper where our `Common_Config` file is; this will be the same as what we put in our `Site_Frequency_Spectrum_Config` file. On line 10, we'll put:
COMMON=${HOME}/software/angsd-wrapper/Example_Data/Maize/Configuration_Files/Common_Config
Next, we need to specify our pest file. This file comes from our site frequency spectrum; in this case, our file is `Maize_DerivedSFS`. We need to specify this on line 41 of `Thetas_Config`:
PEST=${HOME}/scratch/Maize/SFS/Maize_DerivedSFS
Finally, let's set `OVERRIDE` to be `true` again; we do this on line 56:
OVERRIDE=true
Now, we can estimate Thetas values using ANGSD-wrapper; we do this with the following command:
angsd-wrapper Thetas ./Thetas_Config
Our output files will be in `${HOME}/scratch/Maize/Thetas`. Let's go there and look at our files:
cd ${HOME}/scratch/Maize/Thetas/
ls
Here are the output files we should see in the `Thetas` directory:
Maize_Diversity.arg
Maize_Diversity.mafs.gz
Maize_Diversity.thetas.gz
Maize_Diversity.thetas.gz.bin
Maize_Diversity.thetas.gz.idx
Maize_Diversity.thetas.gz.pestPG
Maize_Diversity.thetas.graph.me
Let's go back to our `Configuration_Files` directory to set up our admixture analysis:
cd ${HOME}/software/angsd-wrapper/Example_Data/Maize/Configuration_Files/
We need to edit variables in `Admixture_Config` to tell ANGSD-wrapper where everything is for the admixture analysis. Open up `Admixture_Config` with your favorite text editor. On line 10, we need to specify where our `Common_Config` file is:
COMMON=${HOME}/software/angsd-wrapper/Example_Data/Maize/Configuration_Files/Common_Config
The only other variable we need to specify is our likelihood file. This comes from the site frequency spectrum and has a name ending in `.beagle.gz`; in our case, it is `Maize_SFSOut.beagle.gz`. Remember, this is located in `${HOME}/scratch/Maize/SFS`. On line 26 of `Admixture_Config`, we need to tell ANGSD-wrapper where this likelihood file is:
LIKELIHOOD=${HOME}/scratch/Maize/SFS/Maize_SFSOut.beagle.gz
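If you are unsure of the exact file name, a glob can locate the likelihood file for you. The sketch below assumes there should be exactly one `.beagle.gz` file in the SFS output directory; `find_likelihood_file` is a hypothetical helper, and the demonstration uses empty placeholder files.

```shell
# Print the single *.beagle.gz file in a directory.
# Fails if there is no such file or if there is more than one.
# (find_likelihood_file is a hypothetical helper written for this example.)
find_likelihood_file() {
    set -- "$1"/*.beagle.gz    # positional parameters now hold the matches
    if [ "$#" -ne 1 ] || [ ! -e "$1" ]; then
        echo "expected exactly one .beagle.gz file" >&2
        return 1
    fi
    printf '%s\n' "$1"
}

# Demonstration with placeholder files in a temporary directory.
demo=$(mktemp -d)
touch "$demo/Maize_SFSOut.beagle.gz" "$demo/Maize_SFSOut.mafs.gz"
find_likelihood_file "$demo"
rm -rf "$demo"
```

On real output you would call it as `find_likelihood_file "${HOME}/scratch/Maize/SFS"` and use the printed path as the value of `LIKELIHOOD`.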
Now, let's run the admixture analysis. This is done with the following command:
angsd-wrapper Admixture ./Admixture_Config
Our output files will be in `${HOME}/scratch/Maize/Admixture`. Let's go there and look at our files:
cd ${HOME}/scratch/Maize/Admixture/
ls
Here, we see some more output files in the `Admixture` directory:
Maize.2.filter
Maize.2.fopt.gz
Maize.2.log
Maize.2.qopt
Maize.3.filter
Maize.3.fopt.gz
Maize.3.log
Maize.3.qopt
Maize.4.filter
Maize.4.fopt.gz
Maize.4.log
Maize.4.qopt
Maize.5.filter
Maize.5.fopt.gz
Maize.5.log
Maize.5.qopt
The number in each file name corresponds to K, the number of ancestral populations for that run.
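To see at a glance which values of K were run, the K values can be pulled out of the `.qopt` file names. This sketch assumes the `<project>.<K>.qopt` naming seen in the listing above; `list_k_values` is a hypothetical helper, and the demonstration uses empty placeholder files.

```shell
# Print the K value embedded in each <project>.<K>.qopt file name,
# one per line, in ascending numeric order.
# (list_k_values is a hypothetical helper written for this example.)
list_k_values() {
    for path in "$1"/*.qopt; do
        [ -e "$path" ] || continue
        name=${path##*/}                # drop the directory
        name=${name%.qopt}              # drop the extension
        printf '%s\n' "${name##*.}"     # keep what follows the last dot
    done | sort -n
}

# Demonstration with placeholder files in a temporary directory.
demo=$(mktemp -d)
touch "$demo/Maize.2.qopt" "$demo/Maize.3.qopt" "$demo/Maize.4.qopt" "$demo/Maize.5.qopt"
list_k_values "$demo"
rm -rf "$demo"
```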
Let's go back to our `Configuration_Files` directory to set up our principal component analysis (PCA):
cd ${HOME}/software/angsd-wrapper/Example_Data/Maize/Configuration_Files/
We only need to tell ANGSD-wrapper where our `Common_Config` file is; everything else for the PCA is taken care of. We do this on line 10 of `Principal_Component_Analysis_Config`:
COMMON=${HOME}/software/angsd-wrapper/Example_Data/Maize/Configuration_Files/Common_Config
Now, let's run the PCA. We use the following command to do this:
angsd-wrapper PCA ./Principal_Component_Analysis_Config
Our output files will be in `${HOME}/scratch/Maize/PCA`. Let's go there and look at our files:
cd ${HOME}/scratch/Maize/PCA/
ls
Here are the output files we should see in the `PCA` directory:
Maize_PCA.arg
Maize_PCA.covar
Maize_PCA.geno
Maize_PCA.mafs.gz
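As a recap, the four runs in this tutorial pair each wrapper with its configuration file, and the SFS run comes first because the Thetas and Admixture steps read its output. The sketch below only prints the commands as a dry run and does not execute anything; `print_run_plan` is a hypothetical helper, and the wrapper and file names are the ones used above.

```shell
# Print, in order, the angsd-wrapper commands used in this tutorial,
# with each configuration file given as a full path.
# (print_run_plan is a hypothetical helper; it prints and runs nothing.)
print_run_plan() {
    config_dir=$1
    while read -r wrapper config; do
        printf 'angsd-wrapper %s %s/%s\n' "$wrapper" "$config_dir" "$config"
    done <<'EOF'
SFS Site_Frequency_Spectrum_Config
Thetas Thetas_Config
Admixture Admixture_Config
PCA Principal_Component_Analysis_Config
EOF
}

print_run_plan "${HOME}/software/angsd-wrapper/Example_Data/Maize/Configuration_Files"
```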
ANGSD-wrapper comes with a visualization package based on RStudio's Shiny platform. To use this, we need to be on a machine that has a graphical user interface (GUI) and a web browser. If you used ANGSD-wrapper on a high performance computing system, please transfer your files to another machine with a GUI so you can use the visualization package. You may need to set up ANGSD-wrapper again.
The files we need for graphing are:
- `Maize_DerivedSFS`
- `Maize_Diversity.thetas.gz.pestPG`
- All `.qopt` files
- `Maize_PCA.covar`
Use `git clone` to clone ANGSD-wrapper in a terminal window on your local machine. Set up ANGSD-wrapper on your local machine the same way you set it up at the beginning of this tutorial. Once setup is complete, you can start Shiny.
To start the Shiny graphing interface run:
angsd-wrapper shiny graphing
Additional help text is available on the side panels within Shiny.