This workflow computes and visualizes a principal component analysis (PCA) of single nucleotide polymorphism (SNP) genotypes. It uses a curated set of 10,000 SNPs predefined by GRAF as ancestry markers, and it runs PLINK to generate the PCA eigenvectors and eigenvalues. The steps are:
- Fingerprinting SNPs Extraction: Extract GRAF's 10,000 curated SNPs from the dbSNP database.
- Data Cleaning: Ensure the extracted SNPs are exclusively biallelic. (This step is included in the previous notebook.)
- SNPs Retrieval from 1000 Genomes Project: Extract the genotypes at the 10,000 fingerprinting positions from the 1000 Genomes Project's VCF dataset.
- PCA Computation: Generate PCA's eigenvectors and eigenvalues using PLINK.
- PCA Visualization: Plot the PCA results in R to highlight the relationships between samples.
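
The data cleaning and PCA computation steps above can be sketched in Python as follows. This is a minimal illustration, not the notebook's actual code: the helper names (`is_biallelic_snp`, `build_plink_pca_command`) and all file names are hypothetical, and the command assumes PLINK 1.9 and that the SNP list file holds one variant ID per line matching the ID column of the VCF.

```python
import shlex


def is_biallelic_snp(vcf_line: str) -> bool:
    """Return True if a tab-separated VCF data line describes a biallelic SNP.

    A record qualifies when REF and ALT are each exactly one base (A, C, G, T).
    A comma in ALT means several alternate alleles, so those records fail.
    """
    fields = vcf_line.rstrip("\n").split("\t")
    if len(fields) < 5:
        return False
    ref, alt = fields[3].upper(), fields[4].upper()
    bases = {"A", "C", "G", "T"}
    return ref in bases and alt in bases


def build_plink_pca_command(vcf_path, snp_list_path, out_prefix, n_components=10):
    """Build a PLINK 1.9 argument list that runs PCA on the listed variants."""
    return [
        "plink",
        "--vcf", vcf_path,            # genotypes, e.g. from the 1000 Genomes Project
        "--extract", snp_list_path,   # keep only the fingerprinting SNP IDs
        "--pca", str(n_components),   # number of principal components to report
        "--out", out_prefix,          # prefix for the output files
    ]


# Hypothetical file names, for illustration only.
command = build_plink_pca_command(
    "1000genomes.vcf.gz", "graf_snp_ids.txt", "graf_pca"
)
print(shlex.join(command))
```

Passing the resulting list to `subprocess.run(command, check=True)` would execute PLINK, which writes the eigenvectors to `graf_pca.eigenvec` and the eigenvalues to `graf_pca.eigenval`; those two files are the input to the visualization step.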