LIN28A Rare Variant Analysis

Information

DESCRIPTION

Author(s): Sara Bandres-Ciga, Mary B. Makarious, Monica Diez-Fairen
Project: LIN28A Letter (IPDGC)
PI(s): Mike Nalls, Andrew Singleton
Collaborators: Cornelis Blauwendraat
Date Last Updated: 21.04.2020

INSPIRATION

Working Title: Assessment of LIN28A variants in Parkinson's disease
Aim: To scrutinize whether LIN28A (LOF) mutations are causative for Parkinson's disease
Brief Description: A loss-of-function LIN28A mutation has been reported to cause early onset Parkinson's disease. However, replication in a large cohort is needed. So we were inspired to look into the IPDGC + AMP PD cohorts
Original Publication: https://www.ncbi.nlm.nih.gov/pubmed/31750563

PROPOSED WORKFLOW

0. Getting Started

1. Extract the LIN28A region

Generate --mac 3 for statistics, --mac 1 for burdens

2. Running with `--model` and `--assoc` in PLINK

3. Running with `--logistic` and `--assoc` in PLINK

4. Generate VCFs

5. Annotate via ANNOVAR

6. Burden Analysis via RVTests

7. Single variant Wald and score tests via RVTests

8. Analysis in IPDGC Cohort

0. GETTING STARTED

# Initializing some variables 
## REMOVED paths to files 
COV_NAMES="SEX,AGE,FAMILY_HISTORY,EDUCATION,PC1,PC2,PC3,PC4,PC5,PC6,PC7,PC8,PC9,PC10,STUDY_BioFIND,STUDY_PDBP,STUDY_PPMI"

Load the necessary modules (done on Biowulf)

module load plink #v1.9
module load samtools #v1.9
module load annovar #v2018-04-16
module load rvtests #v2.1.0

Information on AMP-PD WGS Used:

BioFind, PDBP, and PPMI cohorts
- --geno 0.05
- European ancestry
28195229 variants
- 2927 people total -
- 1647 cases, 1050 controls (230 missing phenotype)

1. EXTRACT THE LIN28A REGION

# Create a working directory for the LIN28A paper 
# Change into directory
cd $WORKING_DIR

# Extract the LIN28A region
	# These are hg38 co-ordinates 
# --mac 1
plink --bfile $AMP_PLINK \
--chr 1 --from-bp 26410817 --to-bp 26429728 \
--mac 1 \
--make-bed --out $WORKING_DIR/LIN28A_WGS_AMP_PD_mac1
	# Done 21.04.2020
	# PLINK output:
		# Total genotyping rate is 0.999946.
		# 108 variants and 2927 people pass filters and QC.
		# Among remaining phenotypes, 1647 are cases and 1050 are controls. (230 phenotypes are missing.)

# --mac 3 
plink --bfile $AMP_PLINK \
--chr 1 --from-bp 26410817 --to-bp 26429728 \
--mac 3 \
--make-bed --out $WORKING_DIR/LIN28A_WGS_AMP_PD_mac3 

# wc -l LIN28A_WGS_AMP_PD_mac3.bim
	# 83 LIN28A_WGS_AMP_PD_mac3.bim
	# 83 variants total pass --mac 3

2. RUNNING WITH --MODEL AND --ASSOC IN PLINK

We run --model and --assoc to get the frequencies of the SNPs and distribution of alleles of the dataset

Running with `--model`

plink --bfile $WORKING_DIR/LIN28A_WGS_AMP_PD_mac3 --model \
--out LIN28A_WGS_AMP_PD_mac3_noCov
	# File generated @ LIN28A_WGS_AMP_PD_mac3.model
		# --model does not take covariates 
		# Shows distribution of hom/het/hom 
	# Done 21.04.2020

Running --assoc with no covariates

plink --bfile $WORKING_DIR/LIN28A_WGS_AMP_PD_mac3 --assoc \
--pheno $PHENO \
--ci 0.95 \
--out LIN28A_WGS_AMP_PD_mac3_noCov
	# File generated @ LIN28A_WGS_AMP_PD_mac3_noCov.assoc
		# --assoc is a quick way to get frequencies 
		# Does not take covariates 
	# Done 21.04.2020

3. RUNNING WITH `--LOGISTIC` + `--FISHER` IN PLINK

Running `--logistic` with covariates

plink --bfile $WORKING_DIR/LIN28A_WGS_AMP_PD_mac3 --logistic \
--pheno $PHENO \
--covar $COV_FILE \
--covar-name SEX,AGE,FAMILY_HISTORY,EDUCATION,PC1,PC2,PC3,PC4,PC5,PC6,PC7,PC8,PC9,PC10 \
--ci 0.95 \
--hide-covar \
--out LIN28A_WGS_AMP_PD_mac3_wCovs
	# File generated @ LIN28A_WGS_AMP_PD_mac3_wCovs.assoc.logistic
		# --logistic does take covariates but wasn't run here 
		# --hide-covar flag to remove each and every individual covariate test from the final output
		# Adding study covariates with PLINK did not work... so were removed
			# But did work for RVTests, see section 7 
	# Done 21.04.2020

Running `--fisher` with covariates

plink --bfile $WORKING_DIR/LIN28A_WGS_AMP_PD_mac3 --assoc fisher \
--pheno $PHENO \
--ci 0.95 \
--out LIN28A_WGS_AMP_PD_mac3 
	# File generated @ LIN28A_WGS_AMP_PD_mac3.assoc.fisher
		# --assoc fisher does not take covariates 
	# Done 21.04.2020

4. GENERATE VCFS

Generating a VCF file for `--mac 1`

plink --bfile $WORKING_DIR/LIN28A_WGS_AMP_PD_mac1 \
--mac 1 \
--recode 'vcf-fid' --out $WORKING_DIR/LIN28A_WGS_AMP_PD_mac1
	# Done 21.04.2020

Generating a VCF file for `--mac 3`

plink --bfile $WORKING_DIR/LIN28A_WGS_AMP_PD_mac3 \
--mac 3 \
--recode 'vcf-fid' --out $WORKING_DIR/LIN28A_WGS_AMP_PD_mac3
	# Done 21.04.2020

Zipping and tabix-ing the VCF for `--mac 1`

bgzip $WORKING_DIR/LIN28A_WGS_AMP_PD_mac1.vcf
tabix -f -p vcf $WORKING_DIR/LIN28A_WGS_AMP_PD_mac1.vcf.gz
	# Done 21.04.2020

Zipping and tabix-ing the VCF for `--mac 3`

bgzip $WORKING_DIR/LIN28A_WGS_AMP_PD_mac3.vcf
tabix -f -p vcf $WORKING_DIR/LIN28A_WGS_AMP_PD_mac3.vcf.gz
	# Done 21.04.2020

5. ANNOTATE VIA ANNOVAR

Annotate with ANNOVAR

# For burdens, only need --mac 1 

table_annovar.pl $WORKING_DIR/LIN28A_WGS_AMP_PD_mac1.vcf.gz \
$ANNOVAR_DATA/hg38 --thread 20 -buildver hg38 \
-out $WORKING_DIR/LIN28A_WGS_AMP_PD_mac1.annovar \
-arg '-splicing 15',,, \
-remove -protocol refGene,ljb26_all,gnomad211_genome,clinvar_20190305 \
-operation g,f,f,f -nastring . -vcfinput -polish

# Remove the VCF 
rm $WORKING_DIR/LIN28A_WGS_AMP_PD_mac1.annovar.hg38_multianno.vcf

	# Done 21.04.2020

Information:

wc -l LIN28A_WGS_AMP_PD.annovar.hg38_multianno.txt
- 135 LIN28A_WGS_AMP_PD.annovar.hg38_multianno.txt
- Matches variant numbers above

Trim the Annotation File

head -1 LIN28A_WGS_AMP_PD_mac1.annovar.hg38_multianno.txt > header.txt
colct="$(wc -w header.txt| cut -f1 -d' ')"
cut -f1-$colct LIN28A_WGS_AMP_PD_mac1.annovar.hg38_multianno.txt > LIN28A.trimmed.annotation.txt

6. BURDEN ANALYSIS VIA RVTESTS

Generate CODING.txt - the range of exonic regions

For burdens, only need --mac 1

synonymous + non-synonymous
From trimmed file

grep "exonic" $WORKING_DIR/LIN28A.trimmed.annotation.txt > $WORKING_DIR/CODING_regions.txt
cut -f 1,2,3,7 $WORKING_DIR/CODING_regions.txt > CODING.txt 
# Information: 
	# wc -l CODING.txt
	# 4 CODING.txt
	# head CODING.txt
		# 1	26411441	26411441	LIN28A
		# 1	26425455	26425455	LIN28A
		# 1	26426394	26426394	LIN28A
		# 1	26426443	26426443	LIN28A

Prep the files and generate PLINK with only coding variants

plink --bfile $WORKING_DIR/LIN28A_WGS_AMP_PD_mac1 --extract range CODING.txt --make-bed --out $WORKING_DIR/LIN28A_exonic_AMP_PD 
	# Files generated @ LIN28A_exonic_AMP_PD.*
	# Done 21.04.2020

Create the VCF with only coding variants

plink --bfile $WORKING_DIR/LIN28A_exonic_AMP_PD --recode 'vcf-fid' --out $WORKING_DIR/LIN28A_exonic_AMP_PD
	# Done 21.04.2020

# bgzip and tabix
bgzip $WORKING_DIR/LIN28A_exonic_AMP_PD.vcf
tabix -f -p vcf $WORKING_DIR/LIN28A_exonic_AMP_PD.vcf.gz
	# Done 21.04.2020

Run RVTests

NO MAF THRESHOLD on ALL and CODING ONLY variants

## ALL VARIANTS ##
rvtest --noweb --hide-covar --inVcf $WORKING_DIR/LIN28A_WGS_AMP_PD_mac1.vcf.gz \
--pheno $COV_FILE \
--pheno-name PHENO \
--covar $COV_FILE \
--covar-name $COV_NAMES \
--kernel skat,skato --geneFile /data/LNG/makariousmb/refFlat_hg38.txt \
--out AMP_PD_BURDEN.LIN28A.NOFREQTHRESHOLD
	# Files generated @ AMP_PD_BURDEN.LIN28A.NOFREQTHRESHOLD.*.assoc
	# Done 21.04.2020

## CODING VARIANTS ##
rvtest --noweb --hide-covar --inVcf $WORKING_DIR/LIN28A_exonic_AMP_PD.vcf.gz \
--pheno $COV_FILE --covar $COV_FILE \
--pheno-name PHENO \
--covar-name $COV_NAMES \
--kernel skat,skato --geneFile /data/LNG/makariousmb/refFlat_hg38.txt \
--out AMP_PD_BURDEN.LIN28A.NOFREQTHRESHOLD_CODING
	# Files generated @ AMP_PD_BURDEN.LIN28A.NOFREQTHRESHOLD_CODING.*.assoc
	# Done 21.04.2020

MAF < 0.01 on ALL and CODING ONLY variants

## ALL VARIANTS ##
rvtest --noweb --hide-covar --inVcf $WORKING_DIR/LIN28A_WGS_AMP_PD_mac1.vcf.gz \
--pheno $COV_FILE \
--pheno-name PHENO \
--covar $COV_FILE \
--covar-name $COV_NAMES \
--kernel skat,skato --geneFile /data/LNG/makariousmb/refFlat_hg38.txt \
--freqUpper 0.01 --out AMP_PD_BURDEN.LIN28A.maf001
	# Files generated @ AMP_PD_BURDEN.LIN28A.maf001.*.assoc
	# Done 21.04.2020

## CODING VARIANTS ##
rvtest --noweb --hide-covar --inVcf $WORKING_DIR/LIN28A_exonic_AMP_PD.vcf.gz \
--pheno $COV_FILE \
--pheno-name PHENO \
--covar $COV_FILE \
--covar-name $COV_NAMES \
--kernel skat,skato --geneFile /data/LNG/makariousmb/refFlat_hg38.txt \
--freqUpper 0.01 --out AMP_PD_BURDEN.LIN28A.maf001_CODING
	# Files generated @ AMP_PD_BURDEN.LIN28A.maf001_CODING.*.assoc
	# Done 21.04.2020

MAF < 0.03 on ALL and CODING ONLY variants

## ALL VARIANTS ##
rvtest --noweb --hide-covar --inVcf $WORKING_DIR/LIN28A_WGS_AMP_PD_mac1.vcf.gz \
--pheno $COV_FILE \
--pheno-name PHENO \
--covar $COV_FILE \
--covar-name $COV_NAMES \
--kernel skat,skato --geneFile /data/LNG/makariousmb/refFlat_hg38.txt \
--freqUpper 0.03 --out AMP_PD_BURDEN.LIN28A.maf003
	# Files generated @ AMP_PD_BURDEN.LIN28A.maf003.*.assoc
	# Done 21.04.2020

## CODING VARIANTS ##
rvtest --noweb --inVcf $WORKING_DIR/LIN28A_exonic_AMP_PD.vcf.gz \
--pheno $COV_FILE --covar $COV_FILE \
--pheno-name PHENO \
--covar-name $COV_NAMES \
--kernel skat,skato \
--geneFile /data/LNG/makariousmb/refFlat_hg38.txt \
--freqUpper 0.03 --out AMP_PD_BURDEN.LIN28A.maf003_CODING
	# Files generated @ AMP_PD_BURDEN.LIN28A.maf003_CODING.*.assoc
	# Done 21.04.2020

7. SINGLE VARIANT WALD AND SCORE TESTS VIA RVTESTS

For statistics, need --mac 3: Reason being if there is one allele, then not enough information to do stats, it is assigned a 0 and results in an NA

Single Variant Wald + Score Tests RVTests

	# Tests done on single variants done on ALL variants 
		# Wald = same as logistic in PLINK, better for common variants 
		# score = better for rare variants
rvtest --noweb --hide-covar --inVcf $WORKING_DIR/LIN28A_WGS_AMP_PD_mac3.vcf.gz \
--pheno $COV_FILE \
--pheno-name PHENO \
--covar $COV_FILE \
--covar-name $COV_NAMES \
--single wald,score --geneFile /data/LNG/makariousmb/refFlat_hg38.txt \
--out AMP_PD_mac3.LIN28A.LOGISTIC.allPCs
	# Files generated at AMP_PD_mac3.LIN28A.LOGISTIC.allPCs.*.assoc
		# --hide-covar flag to remove each and every individual covariate test from the final output
		# The variant line reported includes all PCs 
	# Done 21.04.2020

8. ANAYSIS IN IPDGC COHORT

A. Getting Started

B. Subset PLINK Binaries + Convert to VCFs

C. Output Frequency with PLINK

D. Logistic Regression with PLINK

E. Annotate VCFs with Annovar + KGGSeq

F. RVTESTS Burden tests

A. Getting Started

mkdir $WORKING_DIR/
cd $WORKING_DIR/

mkdir hardcallsNoNeuroX hardcallsNoNeuroX/bin 
mkdir hardcallsNoNeuroX/freq hardcallsNoNeuroX/score 
mkdir hardcallsNoNeuroX/burden hardcallsNoNeuroX/vcf 
mkdir hardcallsNoNeuroX/annotation hardcallsNoNeuroX/burden 
mkdir hardcallsNoNeuroX/logistic hardcallsNoNeuroX/freq

B. Subset PLINK Binaries + Convert to VCFs

# ---------Remove NeuroX + keep males + genotype quality of 15% or less missingness
plink --bfile $WORKING_DIR/HARDCALLS_PD_september_2018_no_cousins \
--remove-fam $WORKING_DIR/NeuroX.fID.txt \
--chr 1 --geno 0.15 --from-bp 26737269 --to-bp 26756213 \
--make-bed --out hardcallsNoNeuroX/bin/LIN28A.GWAS

plink --bfile hardcallsNoNeuroX/bin/LIN28A.GWAS --recode vcf --out hardcallsNoNeuroX/vcf/LIN28A.GWAS

cd hardcallsNoNeuroX/vcf
bgzip LIN28A.GWAS.vcf
tabix -f -p vcf LIN28A.GWAS.vcf.gz

C. Output Frequency

cd $WORKING_DIR/
## Overall
plink --bfile $WORKING_DIR/HARDCALLS_PD_september_2018_no_cousins --chr 1 --from-bp 26737269 --to-bp 26756213  --remove-fam $WORKING_DIR/NeuroX.fID.txt --freq --geno 0.15 --out hardcallsNoNeuroX/freq/LIN28A

## Case-control

plink --bfile $WORKING_DIR/HARDCALLS_PD_september_2018_no_cousins --chr 1 --from-bp 26737269 --to-bp 26756213  --remove-fam $WORKING_DIR/NeuroX.fID.txt --freq case-control --geno 0.15 --out hardcallsNoNeuroX/freq/LIN28A

D. Output Logistic Regression

plink --bfile $WORKING_DIR/HARDCALLS_PD_september_2018_no_cousins \
--remove-fam $WORKING_DIR/NeuroX.fID.txt \
--chr 1 --from-bp 26737269 --to-bp 26756213 \
--geno 0.15 --covar $WORKING_DIR/IPDGC_all_samples_covariates.tab \
--covar-name sex,AGE,PC1,PC2,PC3,PC4,PC5,PC6,PC7,PC8,PC9,PC10,DUTCH,FINLAND,GERMANY,MCGILL,MF,NIA,OSLO,PROBAND,PROPARK,SHULMAN,SPAIN3,SPAIN4,TUBI,UK_GWAS,VANCE \
--assoc --out hardcallsNoNeuroX/logistic/LIN28A

E. Annotate VCFs with ANNOVAR

mkdir $WORKING_DIR/annovar

# Downloading databases 
# annotate_variation.pl -buildver hg19 -downdb -webfrom annovar refGene humandb/
# annotate_variation.pl -buildver hg19 -downdb cytoBand humandb/
# annotate_variation.pl -buildver hg19 -downdb -webfrom annovar ensGene humandb/
# annotate_variation.pl -buildver hg19 -downdb -webfrom annovar exac03 humandb/ 
# annotate_variation.pl -buildver hg19 -downdb -webfrom annovar avsnp147 humandb/ 
# annotate_variation.pl -buildver hg19 -downdb -webfrom annovar dbnsfp30a humandb/
# annotate_variation.pl -buildver hg19 -downdb -webfrom annovar gnomad211_genome humandb/

table_annovar.pl $WORKING_DIR/hardcallsNoNeuroX/vcf/LIN28A.GWAS.vcf.gz $WORKING_DIR/annovar/humandb/ -buildver hg19 \
--thread 16 \
-out $WORKING_DIR/hardcallsNoNeuroX/annotation/LIN28A.annovar \
-remove -protocol avsnp147,refGene,ensGene,gnomad211_genome \
-operation f,g,g,f \
-nastring . \
-vcfinput

cd $WORKING_DIR/hardcallsNoNeuroX/annotation
head -1 LIN28A.annovar.hg19_multianno.txt > header.txt
colct="$(wc -w header.txt| cut -f1 -d' ')"
cut -f1-$colct LIN28A.annovar.hg19_multianno.txt > LIN28A.trimmed.annotation.txt

F. Burden Analysis

Note that here we do not adjust by AGE since the model does not fit in RVTests

cd $WORKING_DIR/LIN28A

rvtest --noweb --inVcf $WORKING_DIR/hardcallsNoNeuroX/vcf/LIN28A.GWAS.vcf.gz --pheno $WORKING_DIR/IPDGC_all_samples_covariates.vcf.tab --covar $WORKING_DIR/IPDGC_all_samples_covariates.vcf.tab --covar-name sex,PC1,PC2,PC3,PC4,PC5,PC6,PC7,PC8,PC9,PC10,DUTCH,FINLAND,GERMANY,MCGILL,MF,NIA,OSLO,PROBAND,PROPARK,SHULMAN,SPAIN3,SPAIN4,TUBI,UK_GWAS,VANCE --burden cmc,zeggini,mb,fp,cmcWald --kernel skat,skato --geneFile $WORKING_DIR/refFlat_hg19.txt --freqUpper 0.03 --out hardcallsNoNeuroX/burden/BURDEN.LIN28A.maf03

Files

LIN28A.md

Latest commit

History

LIN28A.md

File metadata and controls

LIN28A Rare Variant Analysis

Information

DESCRIPTION

INSPIRATION

PROPOSED WORKFLOW

0. GETTING STARTED

Load the necessary modules (done on Biowulf)

Information on AMP-PD WGS Used:

1. EXTRACT THE LIN28A REGION

2. RUNNING WITH --MODEL AND --ASSOC IN PLINK

Running with --model

Running --assoc with no covariates

3. RUNNING WITH --LOGISTIC + --FISHER IN PLINK

Running --logistic with covariates

Running --fisher with covariates

4. GENERATE VCFS

Generating a VCF file for --mac 1

Generating a VCF file for --mac 3

Zipping and tabix-ing the VCF for --mac 1

Zipping and tabix-ing the VCF for --mac 3

5. ANNOTATE VIA ANNOVAR

Annotate with ANNOVAR

Information:

Trim the Annotation File

6. BURDEN ANALYSIS VIA RVTESTS

Generate CODING.txt - the range of exonic regions

Prep the files and generate PLINK with only coding variants

Create the VCF with only coding variants

Run RVTests

NO MAF THRESHOLD on ALL and CODING ONLY variants

MAF < 0.01 on ALL and CODING ONLY variants

MAF < 0.03 on ALL and CODING ONLY variants

7. SINGLE VARIANT WALD AND SCORE TESTS VIA RVTESTS

Single Variant Wald + Score Tests RVTests

8. ANAYSIS IN IPDGC COHORT

A. Getting Started

B. Subset PLINK Binaries + Convert to VCFs

C. Output Frequency

D. Output Logistic Regression

E. Annotate VCFs with ANNOVAR

F. Burden Analysis

Running with `--model`

3. RUNNING WITH `--LOGISTIC` + `--FISHER` IN PLINK

Running `--logistic` with covariates

Running `--fisher` with covariates

Generating a VCF file for `--mac 1`

Generating a VCF file for `--mac 3`

Zipping and tabix-ing the VCF for `--mac 1`

Zipping and tabix-ing the VCF for `--mac 3`