pqgen calculates population and quantitative genetic statistics from genomic data. The focus is on speed and ease of use. Statistics are calculated from the numeric genotype format (GT) used in the Variant Call Format (VCF), making it simple to go from variant calling to population genetic statistics in a few commands. The input format is flexible, but by default uses an extended bed format that makes it trivial to integrate bedtools into the analysis pipeline if, for example, you would like to calculate popgen stats for coding vs non-coding regions, or for synonymous vs non-synonymous sites.
pqgen by default expects an input file in an extended bed format:
chrom start end name GT.1 GT.2 ... GT.n
where GT.1 to GT.n are the genotypes of each individual given in GT format. For example:
Contig0 <start> 1 <name> 0/0 0/0 0/1 0/0 0/1
Contig0 <start> 2 <name> 0/0 0/0 0/0 0/0 0/0
Contig0 <start> 3 <name> 0/1 0/1 0/0 0/0 0/1
Note that the start position is not required by pqgen to calculate stats but is included to maintain the bed format. This format can be generated from a vcf file using bcftools:
bcftools query -f "%CHROM\t<start>\t%POS\t<name>[\t%GT]\n" aln.vcf > aln.gt.bed
Here, we are writing the output to a new file aln.gt.bed.
To calculate some basic site frequency based stats such as the number of segregating sites and Tajima's D use the theta command:
$ pqgen theta aln.gt.bed
Contig0 0 1372378 Contig0 10 1367408 15844 0.004096 0.004072 -0.029052
Contig1 0 865787 Contig1 10 862828 12965 0.005312 0.005468 0.148427
Contig2 0 786857 Contig2 10 783622 15404 0.006949 0.007175 0.163521
By default, statistics will be calculated per chromosome (or contig in this case), but this can be changed (see options below).
All commands follow the same basic usage:
pqgen [--help] [--version] <command> [OPTIONS] [FILE]
where <command>
can be anyone of the following:
het Calculates heterozygosity.
sfs Outputs the folded site frequency spectrum.
theta Calculates site frequency based statistics.
and OPTIONS
can include any of the following:
Options:
-c <int>
Column number (1-indexed) giving the chromosome/contig/scaffold ID.
-d <char>
Delimiter separating each column. Must be enclosed in single quotes, spaces
must be escaped (e.g. -d '\ '). Tab delimited by default.
-f <int>
Column number (1-indexed) of the factor over which the stats should be calculated.
By default, statistics are output per chromosome.
-k <str>
1-indexed column numbers indicating the columns over which the statistic
should be calculated. Columns can be separated by commas, and ranges specified
using dashes. For example, -k 2,5-7 would specify columns 2, 5, 6 and 7.
-p <int>
Column number (1-indexed) giving the 1-indexed reference position.
Download and unpack the latest version of pqgen (replacing X, Y and Z with the version number):
wget https://github.com/mestocks/pqgen/releases/download/vX.Y.Z/pqgen-X.Y.Z.tar.gz
tar -zxvf pqgen-X.Y.Z.tar.gz
Then change into the pqgen directory, compile the source and install it:
cd pqgen-X.Y.Z
./configure
make
make install
The installation path can be changed using './configure --prefix=<path/to/dir>'. Admin permissions may be required for the 'make install' step.
dna2codon - Collapse consecutive dna sequences into codons.
Input format:
```bash
"Contig0 <pos0> 1 <name> 0.65 + A G ..."
```
dna2div - Calculate divergence based statistics. A maximum of two alleles should be given.
Input format:
```bash
"Contig0 <pos0> 1 <name> A G"
```
codon2pnds - Count the number of silent and replacement substitutions and polymorphisms. The first codon is the outgroup/ancestral codon.
Input format:
```bash
"Contig0 <pos0> 1 <name> ATG ATG ..."
```