This repository has been archived by the owner on Jan 16, 2019. It is now read-only.

Tutorial

Paul Hoffman edited this page Jan 20, 2016 · 62 revisions

ANGSD-wrapper Tutorial

Installation

Welcome! This is a short guide to population genetics analysis using ANGSD-wrapper. For additional information on how each wrapper is run, please refer to the Methods sidebar. We will be using a test data set containing sequences from Zea mays subsp. parviglumis, Zea mays subsp. mays, Zea mexicana, and Tripsacum dactyloides. First, we need to clone the ANGSD-wrapper repository. You need to have Git installed to do this. Alternatively, you can download a zip file from the releases page or use the download zip button on the home page of the repository.

Dependencies

The basic dependencies for ANGSD-wrapper are SAMTools, Wget, and Git. Most Linux distributions have Wget and Git installed by default, but some do not. Users will need to install SAMTools and its dependency, HTSlib, either from the SAMTools GitHub page or through a package manager.
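Before going further, it can help to confirm which of these tools are already on your PATH. The following is a minimal sketch; the helper name `missing_deps` is our own and is not part of ANGSD-wrapper.

```shell
# Report which of the given commands are missing from PATH.
# Prints one missing command per line; prints nothing if all are found.
missing_deps() {
    for cmd in "$@"; do
        if ! command -v "$cmd" >/dev/null 2>&1; then
            echo "$cmd"
        fi
    done
}

# The three basic dependencies named in this tutorial.
missing=$(missing_deps git wget samtools)
if [ -n "$missing" ]; then
    # $missing is left unquoted on purpose so the names print on one line.
    echo "Not found on PATH:" $missing
else
    echo "All basic dependencies found."
fi
```

Anything reported as missing needs to be installed before running the setup routine.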


Note: Mac Users Have Special Installation Requirements

You will need to install all three of the basic dependencies to run, as well as the GNU Scientific Library. We recommend using Homebrew to manage the installation process. If you use Homebrew, you can install all of the required dependencies from anywhere in your terminal with the following commands:

brew install git
brew install samtools
brew install wget
brew install gsl

---

### Downloading and Installing ANGSD-wrapper

We'll use `Git` to download ANGSD-wrapper. To do this, type the following commands:

```shell
git clone https://github.com/mojaveazure/angsd-wrapper.git
cd angsd-wrapper
```

ANGSD-wrapper comes with its own version of ANGSD, as well as a few other programs, so that compatibility-breaking changes in ANGSD do not affect ANGSD-wrapper. To compile these programs, you must run the following setup routine:

./angsd-wrapper setup please
source ~/.bash_profile

This will download and install ANGSD, ngsAdmix, ngsTools, and ngsF. All of these programs are downloaded to the dependencies directory.

To download and set up a directory with test data, we run the following command:

./angsd-wrapper setup data

These data are located in the Example_Data directory. Finally, ANGSD-wrapper will be installed system-wide so that it can be used from any working directory. To make sure ANGSD-wrapper installed correctly, run angsd-wrapper, without the ./ that we used before.

In the Example_Data directory, there are four directories: three for the Zea samples that we will be working with and one for the reference and ancestral sequences.

| Directory | Contents | File names |
|-----------|----------|------------|
| Maize | BAM files for Zea mays subsp. mays | All .bam files |
| | Indices for BAM files | All .bai files |
| | Inbreeding coefficients for Zea mays subsp. mays | Maize_Inbreeding.indF |
| | A regions file | Maize_Regions.txt |
| | Sample list | Maize_Samples.txt |
| Mexicana | BAM files for Zea mexicana | All .bam files |
| | Indices for BAM files | All .bai files |
| | Inbreeding coefficients for Zea mexicana | Mexicana_Inbreeding.indF |
| | A regions file | Mexicana_Regions.txt |
| | Sample list | Mexicana_Samples.txt |
| Sequences | Ancestral Tripsacum dactyloides sequence | Tripsacum_TDD39103.fa |
| | Reference Zea mays sequence | Zea_mays.AGPv3.30.dna_sm.chromosome.10.fa |
| | FASTA index files | All .fai files |
| Teosinte | BAM files for Zea mays subsp. parviglumis | All .bam files |
| | Indices for BAM files | All .bai files |
| | Inbreeding coefficients for Zea mays subsp. parviglumis | Teosinte_Inbreeding.indF |
| | A regions file | Teosinte_Regions.txt |
| | Sample list | Teosinte_Samples.txt |

Note: BAM Files MUST Have an @HD Header Line

Some programs, when generating BAM files, will not include the @HD header line. To see if you have this line, use SAMTools to check the header for your BAM files:

samtools view -H <name of BAM file> | head -1
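To turn this check into something scriptable, one option is a small helper that reads header text on standard input and reports whether the first line is an @HD line. This is a sketch only: `has_hd_header` is a hypothetical name, and the header text below is made up so that the example runs without a BAM file.

```shell
# Succeeds (exit status 0) if the first line read from standard input
# is an @HD header line.
has_hd_header() {
    head -n 1 | grep -q '^@HD'
}

# Demonstration on literal header text, so that no BAM file is needed here.
if printf '@HD\tVN:1.4\tSO:coordinate\n@SQ\tSN:10\tLN:1000\n' | has_hd_header; then
    echo "@HD line present"
fi
if ! printf '@SQ\tSN:10\tLN:1000\n' | has_hd_header; then
    echo "@HD line missing"
fi
```

With a real file, the header would come from SAMTools instead: samtools view -H <name of BAM file> | has_hd_header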

ANGSD-wrapper has many different routines, or wrappers, that it can perform on a given dataset; we will be working with the Site Frequency Spectrum (SFS), Thetas Estimator, Admixture Analysis, and Principal Component Analysis (PCA) routines for this tutorial. To see all available wrappers, run angsd-wrapper without any arguments.

We will also be graphing our results using a Shiny web app. All analysis should be done on a supercomputer-like machine with at least 32 GB of RAM, and all graphing should be done on a computer with a graphical user interface. If you have access to a supercomputer cluster, we recommend setting up ANGSD-wrapper on both the cluster, for analysis, and a local machine, for graphing.

Configuring ANGSD-wrapper with the Common_Config file

ANGSD-wrapper uses configuration files to figure out where the data is and what options should be passed to ANGSD and other dependencies. There is one configuration file per wrapper included with ANGSD-wrapper, as well as a common configuration file (Common_Config) that can be used by multiple wrappers. All of these are located in the Configuration_Files directory; we recommend copying this directory to another directory so that there is always a clean copy of the configuration files available. For this example, we will configure ANGSD-wrapper to analyse the Zea mays subsp. mays samples. Starting in the angsd-wrapper directory, we will copy the Configuration_Files directory into the Maize directory inside the Example_Data directory using the following command:

cp -r Configuration_Files Example_Data/Maize/

Note: A Word About Configuration Files

Each wrapper-specific configuration file is split into three parts: the COMMON definition, the 'not-using-common' section, and the wrapper-specific variables section. If a wrapper utilizes the COMMON definition, it will always be on line 10. The 'not-using-common' section is blocked off by 94 hash marks (#). If you are not using the Common_Config file, please fill out the variable definitions in this section. Since we're using Common_Config, we can skip these lines. Finally, the wrapper-specific section includes any other variable definitions as well as parameters for the specific wrapper.
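Because the tutorial refers to settings by line number, it can be handy to print a single numbered line of a configuration file without opening an editor. A minimal sketch follows; `print_line` is our own helper name, and the stand-in file is generated here purely for illustration and is not a real ANGSD-wrapper config.

```shell
# Print one numbered line of a file, e.g. line 10 of a wrapper config.
# Arguments: <file> <line number>
print_line() {
    sed -n "${2}p" "$1"
}

# Build a stand-in file with nine comment lines and a COMMON line on line 10.
demo_config=$(mktemp)
i=1
while [ "$i" -le 9 ]; do
    echo "# comment line $i" >> "$demo_config"
    i=$((i + 1))
done
# Single quotes keep ${HOME} literal, as it would appear in the file.
echo 'COMMON=${HOME}/software/angsd-wrapper/Example_Data/Maize/Configuration_Files/Common_Config' >> "$demo_config"

print_line "$demo_config" 10
rm -f "$demo_config"
```

Run against a real file, for example print_line Site_Frequency_Spectrum_Config 10, this shows what is currently on the COMMON line.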


Now, let's go into the Maize directory and figure out the full path to this directory using pwd.

cd Example_Data/Maize
pwd

This will output a string that starts with /home/; go ahead and copy everything after your user name, beginning with the forward slash that follows it. For example, if we get /home/user_group/user_name/software/angsd-wrapper/Example_Data/Maize as our output, we only need /software/angsd-wrapper/Example_Data/Maize.
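Rather than trimming the path by eye, the shell can remove the home-directory prefix for us. The sketch below uses standard parameter expansion; the function name `path_after_home` and its optional second argument are our own additions for illustration.

```shell
# Strip a leading home directory from a path, leaving the part that
# gets appended to ${HOME} in the configuration files.
# Arguments: <path> [home directory, defaults to $HOME]
# A path outside the home directory is printed unchanged.
path_after_home() {
    home_dir=${2:-$HOME}
    printf '%s\n' "${1#"$home_dir"}"
}

# Example with the directory layout used in this tutorial.
path_after_home /home/user_group/user_name/software/angsd-wrapper/Example_Data/Maize /home/user_group/user_name
# prints /software/angsd-wrapper/Example_Data/Maize
```

In practice, path_after_home "$(pwd)" gives the part to paste after ${HOME} for the current directory.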

Now, we'll go find our configuration files in the Configuration_Files directory:

cd Configuration_Files/

Because we're using multiple wrappers in this tutorial, we'll use the Common_Config file to hold variables that will be used across all methods. Open Common_Config in your favorite text editor, such as Vim or Emacs.

First, we need to define a list of samples. On line 10 of Common_Config, there's a place to define this sample list. Recall that the sample list in our Maize directory is called Maize_Samples.txt.

To tell ANGSD-wrapper where our sample list is, we use the location of our Maize directory, which in our example is /home/user_group/user_name/software/angsd-wrapper/Example_Data/Maize. We'll make sure line 10 looks like this:

SAMPLE_LIST=${HOME}/software/angsd-wrapper/Example_Data/Maize/Maize_Samples.txt

We use ${HOME} to help ANGSD-wrapper find your files; if we used /home, some systems would error out, saying that the directory does not exist. Adjust the /software/angsd-wrapper/Example_Data part to whatever you copied from your output.

A reminder for the BAM header check described earlier: the @HD header line should be the first line that pops up in the output of samtools view -H; if you don't see it, this Gist will add one for you.

Next, we need our list of inbreeding coefficients. This file is called Maize_Inbreeding.indF; we tell ANGSD-wrapper where it is on line 13 of our Common_Config file:

SAMPLE_INBREEDING=${HOME}/software/angsd-wrapper/Example_Data/Maize/Maize_Inbreeding.indF

Lines 16 and 19 ask for our ancestral and reference sequences. These are Tripsacum_TDD39103.fa and Zea_mays.AGPv3.30.dna_sm.chromosome.10.fa, respectively, found in the Sequences directory. In the Common_Config file, we'd enter the following on their respective lines:

ANC_SEQ=${HOME}/software/angsd-wrapper/Example_Data/Sequences/Tripsacum_TDD39103.fa
REF_SEQ=${HOME}/software/angsd-wrapper/Example_Data/Sequences/Zea_mays.AGPv3.30.dna_sm.chromosome.10.fa

Now we need to set up our outdirectory structure. We use two variables to define this: PROJECT and SCRATCH. All output files will be placed in $SCRATCH/$PROJECT/<name_of_program>; for example, if we set SCRATCH to be "${HOME}/scratch" and PROJECT to be "Maize" and calculated a site frequency spectrum, our outdirectory would be ${HOME}/scratch/Maize/SFS.

If we used the same SCRATCH and PROJECT assignments and estimated Thetas, our outdirectory would be ${HOME}/scratch/Maize/Thetas. The outdirectory structure is generated automatically, and any directory within the structure that doesn't already exist is created, so it is not necessary to make these directories beforehand.
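The layout described above can be sketched in a few lines of shell. This only illustrates how the path is composed; it is not ANGSD-wrapper's own code. The helper name `outdir_for` is hypothetical, and a temporary directory stands in for ${HOME}/scratch so the example leaves your home directory alone.

```shell
# Compose the output directory for one wrapper from SCRATCH and PROJECT.
outdir_for() {
    printf '%s/%s/%s\n' "$SCRATCH" "$PROJECT" "$1"
}

# Stand-in values: a temporary directory plays the role of ${HOME}/scratch.
SCRATCH=$(mktemp -d)
PROJECT=Maize

for wrapper_dir in SFS Thetas; do
    out=$(outdir_for "$wrapper_dir")
    mkdir -p "$out"   # -p also creates any missing parent directories
    echo "$out"
done
```

Since ANGSD-wrapper builds this structure itself, the sketch is only useful for predicting where results will land.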

Let's set SCRATCH to be "${HOME}/scratch" and PROJECT to be "Maize"; we define these two variables on lines 22, for PROJECT, and 27, for SCRATCH:

PROJECT=Maize
SCRATCH=${HOME}/scratch

Finally, we need to specify a regions file for ANGSD-wrapper. While we can run ANGSD-wrapper without a regions file, it becomes very computationally expensive and takes much longer. If you would like to generate a regions file, taking a random sample of all possible regions, this Gist will create a valid regions file for you.

We have a regions file in our Example_Data directory called Maize_Regions.txt; let's tell ANGSD-wrapper where this is on line 31 of Common_Config:

REGIONS=${HOME}/software/angsd-wrapper/Example_Data/Maize/Maize_Regions.txt
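A mistyped path in Common_Config (for example, a wrong capital letter in a directory name) is easy to miss. As a sanity check, the sketch below reports any path that does not point to an existing file. It assumes that the configuration file consists of plain shell variable assignments and can therefore be read with the shell's `.` command; the helper name `missing_files` and the stand-in config are ours.

```shell
# Print every given path that is not an existing regular file.
missing_files() {
    for file_path in "$@"; do
        [ -f "$file_path" ] || echo "$file_path"
    done
}

# Stand-in setup: a tiny config that names one real and one absent file.
demo_dir=$(mktemp -d)
touch "$demo_dir/Maize_Samples.txt"
cat > "$demo_dir/Common_Config" <<EOF
SAMPLE_LIST=$demo_dir/Maize_Samples.txt
REGIONS=$demo_dir/Maize_Regions.txt
EOF

# Load the assignments, then report paths that do not exist.
. "$demo_dir/Common_Config"
missing_files "$SAMPLE_LIST" "$REGIONS"   # prints only the REGIONS path
```

With the real file loaded, the equivalent check would be missing_files "$SAMPLE_LIST" "$SAMPLE_INBREEDING" "$ANC_SEQ" "$REF_SEQ" "$REGIONS", which prints nothing when every path is correct.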

Now we're ready to run ANGSD-wrapper to calculate a site frequency spectrum, estimate Thetas, perform an admixture analysis, and run a principal component analysis. Save your changes and close Common_Config. We're going to stay in the Configuration_Files directory for now; we're going to need the full path for this directory, which we obtain with the following command:

pwd

Again, we'll get an output starting with /home/ and we only need the part after our user name. Using our directory structure from before, our output would be /home/user_group/user_name/software/angsd-wrapper/Example_Data/Maize/Configuration_Files and we need /software/angsd-wrapper/Example_Data/Maize/Configuration_Files.

Site Frequency Spectrum

Each wrapper function has its own configuration file associated with it. To run the site frequency spectrum, we need the Site_Frequency_Spectrum_Config file. Open this up in your favorite text editor.

In Site_Frequency_Spectrum_Config, we need to tell ANGSD-wrapper where our Common_Config file is. This definition is on line 10:

COMMON=${HOME}/software/angsd-wrapper/Example_Data/Maize/Configuration_Files/Common_Config

Remember to adjust the /software/angsd-wrapper/Example_Data/Maize/Configuration_Files part for your own directory structure.

Most of the other variables and parameters are set up to run smoothly. For now, we're going to set OVERRIDE to be true, in case we run this another time and want updated results; we change this on line 53:

OVERRIDE=true

Note: How ANGSD-wrapper Knows What to do

ANGSD-wrapper has several functions, or wrappers, built into it. These are predefined and very specific. The syntax for running ANGSD-wrapper is as follows:

angsd-wrapper <wrapper> <configuration file>

Where <wrapper> is one of the wrappers that ANGSD-wrapper can perform and <configuration file> is the full path to the configuration file we set up for it. To see a full list of wrappers that ANGSD-wrapper has and how to call them, run the following command:

angsd-wrapper

This will display a usage message with the wrappers ANGSD-wrapper has and how to call them. Capitalization and spelling are very important with ANGSD-wrapper; you must type out what you see in the usage message to get ANGSD-wrapper to run. Also, you don't have to use our presupplied configuration files with ANGSD-wrapper, but you do need to have all of the variable definitions that we have supplied.


Now, let's calculate a site frequency spectrum using ANGSD-wrapper:

angsd-wrapper SFS ./Site_Frequency_Spectrum_Config

Once this finishes, our output files will be in the outdirectory we specified, ${HOME}/scratch/Maize/SFS. Let's go there and look at our files:

cd ${HOME}/scratch/Maize/SFS/
ls

The following are the output files we should see in the SFS directory:

  • Maize_DerivedSFS
  • Maize_SFSOut.arg
  • Maize_SFSOut.beagle.gz
  • Maize_SFSOut.geno.gz
  • Maize_SFSOut.mafs.gz
  • Maize_SFSOut.saf.gz
  • Maize_SFSOut.saf.idx
  • Maize_SFSOut.saf.pos.gz

We'll need the Maize_DerivedSFS file for our Thetas estimation and graphing later on.

Thetas Estimation

Now, we need to go back to our Configuration_Files directory so we can set up ANGSD-wrapper to estimate Thetas values for us. We use the cd command to do this:

cd ${HOME}/software/angsd-wrapper/Example_Data/Maize/Configuration_Files/

Open up Thetas_Config in your favorite text editor. We have three variables we need to define in this configuration file. First, we need to tell ANGSD-wrapper where our Common_Config file is; this will be the same as what we put in our Site_Frequency_Spectrum_Config file. On line 10, we'll put:

COMMON=${HOME}/software/angsd-wrapper/Example_Data/Maize/Configuration_Files/Common_Config

Next, we need to specify our pest file. This file comes from our site frequency spectrum; in this case, our file is Maize_DerivedSFS. We need to specify this on line 41 of Thetas_Config:

PEST=${HOME}/scratch/Maize/SFS/Maize_DerivedSFS

Finally, let's set OVERRIDE to be true again; we do this on line 56:

OVERRIDE=true

Now, we can estimate Thetas values using ANGSD-wrapper; we do this with the following command:

angsd-wrapper Thetas ./Thetas_Config

Our output files will be in ${HOME}/scratch/Maize/Thetas. Let's go there and look at our files:

cd ${HOME}/scratch/Maize/Thetas/
ls

Here are the output files we should see in the Thetas directory:

  • Maize_Diversity.arg
  • Maize_Diversity.mafs.gz
  • Maize_Diversity.thetas.gz
  • Maize_Diversity.thetas.gz.bin
  • Maize_Diversity.thetas.gz.idx
  • Maize_Diversity.thetas.gz.pestPG

Admixture Analysis

Let's go back to our Configuration_Files directory to set up our admixture analysis:

cd ${HOME}/software/angsd-wrapper/Example_Data/Maize/Configuration_Files/

We need to edit variables in Admixture_Config to tell ANGSD-wrapper where everything is for the admixture analysis. Open up Admixture_Config with your favorite text editor. On line 10, we need to specify where our Common_Config file is:

COMMON=${HOME}/software/angsd-wrapper/Example_Data/Maize/Configuration_Files/Common_Config

The only other variable we need to specify is our likelihood file. This comes from the site frequency spectrum and is called Maize_SFSOut.beagle.gz; remember, this is located in ${HOME}/scratch/Maize/SFS. On line 26 of Admixture_Config, we need to tell ANGSD-wrapper where this likelihood file is:

LIKELIHOOD=${HOME}/scratch/Maize/SFS/Maize_SFSOut.beagle.gz

Now, let's run the admixture analysis. This is done with the following command:

angsd-wrapper Admixture ./Admixture_Config

Our output files will be in ${HOME}/scratch/Maize/Admixture. Let's go there and look at our files:

cd ${HOME}/scratch/Maize/Admixture/
ls

Here, we see some more output files in the Admixture directory:

  • Maize.2.filter
  • Maize.2.fopt.gz
  • Maize.2.log
  • Maize.2.qopt
  • Maize.3.filter
  • Maize.3.fopt.gz
  • Maize.3.log
  • Maize.3.qopt
  • Maize.4.filter
  • Maize.4.fopt.gz
  • Maize.4.log
  • Maize.4.qopt
  • Maize.5.filter
  • Maize.5.fopt.gz
  • Maize.5.log
  • Maize.5.qopt

The number in each filename corresponds to K, the number of ancestral populations used for that run.

Principal Component Analysis

Let's go back to our Configuration_Files directory to set up our principal component analysis (PCA):

cd ${HOME}/software/angsd-wrapper/Example_Data/Maize/Configuration_Files/

We only need to tell ANGSD-wrapper where our Common_Config file is; everything else for the PCA is taken care of. We do this on line 10 of Principal_Component_Analysis_Config:

COMMON=${HOME}/software/angsd-wrapper/Example_Data/Maize/Configuration_Files/Common_Config

Now, let's run the PCA. We use the following command to do this:

angsd-wrapper PCA ./Principal_Component_Analysis_Config

Our output files will be in ${HOME}/scratch/Maize/PCA. Let's go there and look at our files:

cd ${HOME}/scratch/Maize/PCA/
ls

Here are the output files we should see in the PCA directory:

  • Maize_PCA.arg
  • Maize_PCA.covar
  • Maize_PCA.geno
  • Maize_PCA.mafs.gz
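The four analyses in this tutorial have an order: the site frequency spectrum has to run first, because the Thetas and Admixture wrappers read files that it produces. As a recap, the sketch below prints the four commands in that order without running anything, so it is safe to try anywhere. The function name `print_pipeline` is our own.

```shell
# Print the four tutorial commands in the order they must run.
# SFS comes first because Thetas and Admixture read files it produces.
# Argument: <directory that holds the configuration files>
print_pipeline() {
    config_dir=$1
    while read -r wrapper config; do
        echo "angsd-wrapper $wrapper $config_dir/$config"
    done <<EOF
SFS Site_Frequency_Spectrum_Config
Thetas Thetas_Config
Admixture Admixture_Config
PCA Principal_Component_Analysis_Config
EOF
}

print_pipeline "${HOME}/software/angsd-wrapper/Example_Data/Maize/Configuration_Files"
```

Printing rather than executing makes it easy to review the commands, and to confirm that each configuration file has been edited, before running them one at a time.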

Graphing

ANGSD-wrapper comes with a visualization package based on RStudio's Shiny platform. To use this, we need to be on a machine that has a graphical user interface (GUI) and a web browser. If you used ANGSD-wrapper on a high performance computing system, please transfer your files to another machine with a GUI so we can utilize the visualization package. You may need to set up ANGSD-wrapper again.

The files we need for graphing are:

  • Maize_DerivedSFS
  • Maize_Diversity.thetas.gz.pestPG
  • All .qopt files
  • Maize_PCA.covar
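If the analysis ran on a cluster, gathering these files into one directory first makes the transfer simpler. The sketch below copies them out of the output tree by file-name pattern, following the output listings shown earlier in this tutorial. The function name `stage_graphing_files` and the demonstration tree of empty stand-in files are assumptions made for illustration.

```shell
# Copy the files needed for graphing from the output tree into one directory.
# Arguments: <output root, i.e. $SCRATCH/$PROJECT> <destination directory>
stage_graphing_files() {
    src=$1
    dest=$2
    mkdir -p "$dest"
    for f in "$src"/SFS/*DerivedSFS "$src"/Thetas/*.pestPG \
             "$src"/Admixture/*.qopt "$src"/PCA/*.covar; do
        # An unmatched pattern stays as literal text, so skip non-files.
        if [ -f "$f" ]; then
            cp "$f" "$dest"/
        fi
    done
}

# Demonstration on a throwaway tree of empty stand-in files.
demo_root=$(mktemp -d)
mkdir -p "$demo_root/out/SFS" "$demo_root/out/Thetas" \
         "$demo_root/out/Admixture" "$demo_root/out/PCA"
touch "$demo_root/out/SFS/Maize_DerivedSFS" \
      "$demo_root/out/Thetas/Maize_Diversity.thetas.gz.pestPG" \
      "$demo_root/out/Admixture/Maize.2.qopt" \
      "$demo_root/out/Admixture/Maize.3.qopt" \
      "$demo_root/out/PCA/Maize_PCA.covar"

stage_graphing_files "$demo_root/out" "$demo_root/graphing"
ls "$demo_root/graphing"
```

For the tutorial data, the call would be stage_graphing_files "${HOME}/scratch/Maize" "${HOME}/scratch/Maize/graphing", after which the graphing directory can be copied to the local machine with whatever transfer tool your cluster supports.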

Starting Shiny

To start the Shiny graphing interface run:

angsd-wrapper shiny graphing