-
Notifications
You must be signed in to change notification settings - Fork 100
Idefix
The method described here aims to identify sample mix-ups in biobanks using polygenic scores (PGS). Sample mix-ups frequently occur in genetic genomic datasets generated in a research setting (Westra et al, 2011). This novel tool takes advantage of the relationship between PGSs and actual phenotypes to predict which samples are erroneous. Details of this new method are described in our pre-print: Warmerdam & Lanting et al, 2021. The GitHub repository can be cloned from here.
Our method for identifying mix-ups is based on the expected correspondence of the genetic propensity for a particular phenotype and the actual phenotype. Herein the genetic propensity for a particular phenotype is represented by polygenic scores. When this sample mix-up method is performed for enough traits, the predictive power (measured by the area under the ROC, for 25 traits) exceeds that of the regular used check for the correspondence of reported sex and inferred sex (~ 0.8 vs ~ 0.75 respectively). Our method works by:
- Modelling the relationships between phenotypes and polygenic scores,
- calculating the residuals of the provided samples and their permutations,
- employing a likelihood model fitted on the residuals for provided and permuted samples,
- combining log likelihood ratios over traits.
- estimating the mix-up rate in the dataset to apply a threshold, making the remaining set of samples suitable for the intended application.
Here, we will focus on how to use the main scripts to identify mix-ups in a dataset. Other scripts are described here
This tool is developed in the R
programming language
(tested with version 3.6.1). A number of R packages are used in the tool.
These can be installed using the following steps:
- Either clone this repository using the following git command:
git clone https://github.com/molgenis/pgs_based_sample_mix-up_correction.git
or download the source code directly from Github.
- Move directory to the package directory:
cd pgs_based_sample_mix-up_correction
- Install the required packages by executing the
./src/install-packages.R
R-script:
Rscript ./src/install-packages.R
Input requires polygenic scores and phenotypes for a range of traits that have well powered GWASs available like height, BMI and platelet concentration. Traits can be of continuous, ordinal or binary type.
The traits that are most effective for predicting mix-ups are those for which the most accurate polygenic scores can be calculated. Accurate polygenic scores can only be calculated for traits with high SNP-heritability, such as body height or platelet concentration. The heritability of traits in the UK biobank is maintained in the UKB SNP-Heritability browser. This resource also contains links to correlation data. Traits ideally are independently inherited. This can be assessed using these correlations.
There should also be a well-powered GWAS with summary statistics available for the trait in question. While the previous resource also lists summary statistics estimated in the UK biobank, A list of available summary statistics is also maintained in, for instance, the GWAS Catalog.
The PGS Catalog lists published polygenic scores with performance estimates and how they are constructed.
Examples of informative traits include the following:
- Standing height
- Body mass index
- Hand grip strength
- Basal metabolic rate
- Mean platelet (thrombocyte) volume
- Corneal resistance factor
- Heel bone mineral density (BMD)
- IGF-1 (nmol/L)
- Monocyte count
- Reticulocyte count
- Medication for cholesterol, blood pressure or diabetes: Blood pressure medication
- Eosinophill percentage
- Creatinine (umol/L)
- Glycated haemoglobin (mmol/mol)
- Lymphocyte count
- Total protein (g/L)
- Neutrophill count
- Coronary atherosclerosis
- Smoking status
- Red hair colour
The core of the tool expects polygenic scores in a tab separated files with sample identifiers in the first column, and polygenic scores in the second column. The first line is assumed to be a header:
IID SCORESUM
S01 0.042
S02 1.345
... ...
Over recent years a wide range of tools have been published for calculating polygenic scores like LDpred
(Vilhjálmsson et al., 2015)
and PRScs (Ge et al., 2019).
The latter relies on the scoring functionality of PLINK (--score
) to do the final calculation.
The ./src/sum-plink-profiles.R
script has been made to sum the output over all chromosomes for all samples when
having performed PLINK's allelic scoring method on a per chromosome basis.
usage: ./src/sum-plink-profiles.R [-h] --plink-profiles PLINK_PROFILES [PLINK_PROFILES ...] --out
OUT --scoresum-column SCORESUM_COLUMN
optional arguments:
-h, --help show this help message and exit
--plink-profiles PLINK_PROFILES [PLINK_PROFILES ...]
path to plink --score output files
--out OUT output path of summed profiles
--scoresum-column SCORESUM_COLUMN
columnname containing summable polygenic scores
The PLINK2 --score
command requires additional option cols=+scoresums
to obtain a column that can be summed
between chromosomes.
For optimal performance polygenic scores should be able to explain a large part of the variance in phenotypes. Therefore, phenotypes should ideally be processed in similar fashion to the phenotypes used in their corresponding GWAS. In this light our method also incorporates sex and age of samples in modelling the relationship between polygenic scores and actual phenotypes.
The input with regards to phenotypes should be a tab-separated file with 5 required columns (each having a fixed column name):
ID AGE SEX VALUE TRAIT
S01 34 Male 176 Height
S02 45 Female 165 Height
...
S89 19 Male 19 EduYears
S99 52 Female 22 EduYears
...
Herein every row corresponds to the phenotype of a sample for a particular trait, combined with covariates corresponding to the individual and the time of the measurement. Each of the columns represent the following:
-
ID
: Sample identifier column -
AGE
: Age at the time of measurement -
SEX
: Biological sex of the sample. EitherFemale
orMale
-
VALUE
: The phenotype of the sample for the trait in question -
TRAIT
: The trait corresponding to this row
A tab-separated file is required that links the polygenic score input to the phenotypes. This file
should additionally include the data type of the phenotypes,
which affects the type of model that used for modelling the relationship between the polygenic scores and phenotypes.
The supported data types are continuous
, ordinal
and binary
. We apply linear regression,
ordinal logistic regression and regular logistic regression respectively.
The file should look like the following:
trait traitDataType summaryStatistics
Height continuous Meta-analysis_Wood_et_al+UKBiobank_2018.processed
BMI continuous Meta-analysis_Locke_et_al+UKBiobank_2018.processed
Blondeness of hair ordinal ukb-d-1747_1.processed
Red hair colour binary ukb-d-1747_2.processed
...
Herein every row corresponds to a single trait, and the column names must match. The columns represent the following:
-
trait
: Name of a trait; Should correspond to theTRAIT
column in the phenotype input file -
traitDataType
: Data type of the trait in question;continuous
,ordinal
andbinary
-
summaryStatistics
: Name of the folder containing polygenic scores for the trait in question.
The path to the actual polygenic scores is compiled of a base path,
the folder name corresponding to a particular trait,
and the end of the path directing to a particular file present in all of the trait-specific folders.
The base path (--base-pgs-path
) and end of the path (--pgs-file-name
) are provided as an option to the method.
(detailed use in sample swap prediction).
In the case sample identifiers are of different nature for the phenotypes compared to the polygenic scores, a sample coupling file can be supplied with the genotype / PGS sample ids in the first column and the phenotype sample ids in the second column.
Options for the sample swap prediction script are listed below:
usage: src/sample-swap-prediction.R [-h] [--debug]
[--base-fit-model-path BASE_FIT_MODEL_PATH]
--trait-gwas-mapping TRAIT_GWAS_MAPPING
(--base-pgs-path BASE_PGS_PATH | --pgs-file PGS_FILE)
[--pgs-file-name PGS_FILE_NAME]
[--no-adults-only] --phenotypes-file
PHENOTYPES_FILE
[--sample-coupling-file SAMPLE_COUPLING_FILE]
[--split-prediction [SPLIT_PREDICTION]]
[--llr-bayes-method LLR_BAYES_METHOD [LLR_BAYES_METHOD ...]]
(--sex-check-table SEX_CHECK_TABLE | --calc-p-expected-mixUps CALC_P_EXPECTED_MIXUPS CALC_P_EXPECTED_MIXUPS CALC_P_EXPECTED_MIXUPS CALC_P_EXPECTED_MIXUPS | --p-expected-mixUps P_EXPECTED_MIXUPS)
[--p-max-mixUps P_MAX_MIXUPS]
[--bayes-method-sweep-mode] [--out OUT]
[--likelihood-ratio-alpha LIKELIHOOD_RATIO_ALPHA]
[--output-intermediate-statistics]
optional arguments:
-h, --help show this help message and exit
--debug write intermediate llr and residuals files.
--base-fit-model-path BASE_FIT_MODEL_PATH
path to a directory to write fitted model parameters
to, or to load fitted models from.
--trait-gwas-mapping TRAIT_GWAS_MAPPING
path to a tab-delimited file that maps traits to the
polygenic scores.
--base-pgs-path BASE_PGS_PATH
path to a directory containing polygenic scores.
Folder structure should be '<base-pgs-path>/<name-of-
gwas-summary-statistic>/<pgs-file-name>'
--pgs-file PGS_FILE path to a tab-delimited file holding all polygenic
score data. Columns should represent sample IDs and
individual PGSs respectively.
--pgs-file-name PGS_FILE_NAME
name of files that hold polygenic scores. ignored when
a combined table of polygenic scores is supplied
(--pgs-file).
--no-adults-only Turns filtering step (AGE > 18) off
--phenotypes-file PHENOTYPES_FILE
path to a tab-delimited file holding all processed
phenotype data
--sample-coupling-file SAMPLE_COUPLING_FILE
file containing genotype sample ids in the first
column and phenotype sample ids in the second column
--split-prediction [SPLIT_PREDICTION]
receiver operating characteristics using 50/50 split
procedure.
--llr-bayes-method LLR_BAYES_METHOD [LLR_BAYES_METHOD ...]
the naive bayes method to use for calculating log
likelihood ratios. If NA, the method will be based on
the data type for each trait <ewi-discretization |
efi-discretization | gaussian | NA> [(average) number
of samples per bin]
--sex-check-table SEX_CHECK_TABLE
Tab-separated file containing genotype ids in the
first column genotype inferred sex in the second
column, and reported sex in the third column This will
be used to estimate the number of mix-ups present in
the data, and for a sex correspondence check. Use
--p-expected-mixUps or --calc-p-expected-mixUps if sex
correspondence check is already performed
--calc-p-expected-mixUps SWAPS_SEX_CHECK SWAPS_TOTAL N_REMAINING SEX_FREQUENCY
Estimates the number of mix-ups in the cohort.
Number of samples that are identified in a sex-
correspondence check, The total number of mix-ups that
are removed, The total number of remaining samples and
the frequency of females or males.
--p-expected-mixUps P_EXPECTED_MIXUPS
Proportion of expected mix-ups. range between 0 and 1
--p-max-mixUps P_MAX_MIXUPS
Maximum proportion of mix-ups. range between 0 and 1.
Default is 1e-04
--bayes-method-sweep-mode
special mode to perform a sweep over a series of naive
bayes methods
--out OUT path to output directory
--likelihood-ratio-alpha LIKELIHOOD_RATIO_ALPHA
the t-test alpha to use for selecting traits based on
the difference in log likelihood ratios between the
provided and permuted samples
--output-intermediate-statistics
Setting this to false will prevent intermediate AUC
calculations, requiring less memory.
Notes:
-
--base-fit-model-path
can be used to fit parameters on a set of samples which are expected to be correct. The fitted parameters can subsequently be used on a set of samples with a large proportion of introduced fake mix-ups to get a better ROC estimate. -
--llr-bayes-method
: If this is NA, gaussian functions will be fitted to residuals for continuous traits, and an equal width discretization technique (ewi-discretization) will be used for ordinal and binary traits. -
--likelihood-ratio-alpha
: This is 0.05 by default. -
--bayes-method-sweep-mode
: Aborts the method after traits are independently processed; no combined output.
To allow Idéfix to calculate a mix-up rate and perform filtering based op a set maximum, the sex-check table should be present
if a sex-check was not already performed. Otherwise the mix-up rate can be provided using
--p-expected-mixUps
. Idéfix can calculate the mix-up rate when arguments are supplied to
--calc-p-expected-mixUps SWAPS_SEX_CHECK SWAPS_TOTAL N_REMAINING SEX_FREQUENCY
. Herein,
-
SWAPS_SEX_CHECK
denotes the number of samples for which the genotype inferred sex did not match the phenotype inferred sex (sex-check failed samples). -
SWAPS_TOTAL
denotes the total number of samples that are removed after sample mix-ups identificant methods (for instance both a sex-check and a pedigree check) -
N_REMAINING
denotes the total number of samples that remain after removal of sample mix-ups. -
SEX_FREQUENCY
denotes the frequency of males or females. The maximum mix-up rate can be given using--p-max-mixUps
. This is set to 0.01% by default.
A number of output files are written to the directory supplied to --out
. The main output is described below:
-
./idefixPredictions.txt
: tab-separated file with every row representing an input sample. The following columns are included:-
Var1
: the sample identifier from the phenotype input -
Var2
: the sample identifier from the PGS input. -
logLikelihoodRatios
: the raw log likelihood ratio of the sample being mixed-up. -
scaledLlr
: main prediction value; the scaled log likelihood ratio of the sample being mixed-up. -
numberOfTraits
: the number of traits that were available and used for aggregating log likelihood ratios over traits. -
sensitivity
: the sensitivity calculated from the provided and permuted sample mappings. -
specificity
: the specificity calculated from the provided and permuted sample mappings.
When the sex-check is performed in idéfix, or when the mix-up rate is calculated or provided by other means. The following columns are additionally present:
-
nPass
: for each sample the number of samples with a sensitivity larger or equal. -
rPass
: for each sample the updated mix-up rate in the set of samples with sensitivity larger or equal. -
pass
:TRUE
whenrPass
meets the requirement of a maximum mix-up rate,FALSE
whenrPass
is higher than the maximum set mix-up rate.
-
-
./overallOutputStatistics.tsv
: tab-separated file with a series of statistics gathered. Contains the following data:-
matrixWideAuc
: An ROC-AUC calculated on scaled log likelihood ratios, over provided samples and all their permutations. -
confinedAuc
: An ROC-AUC calculated on scaled log likelihood ratios, over provided samples only. Only available with fake mix-ups introduced.
-
-
./outputStatisticsPerTrait.tsv
: tab-separated file with a series of statistics on single traits. The following statistics are always included:-
numberOfSamples
: The number of samples for which the trait was available. -
pValue
: The pValue for a t-test testing the if the log likelihood ratios for permuted samples are greater that those of provided samples.
Optional values when
--output-intermediate-statistics
has not been disabled:-
matrixWideAucOnScaledLlr
: An ROC-AUC calculated on scaled log likelihood ratios, over provided samples and all their permutations. -
confinedAucOnScaledLlr
: An ROC-AUC calculated on scaled log likelihood ratios, over provided samples only. Only available with fake mix-ups introduced.
-
-
./aggregatedLogLikelihoodRatiosDataFrame.tsv
: tab-separated file similar in nature to./providedSampleDataFrame.tsv
, but permuted samples are included.
-
Ge, Tian, Chia-Yen Chen, Yang Ni, Yen-Chen Anne Feng, and Jordan W. Smoller. “Polygenic Prediction via Bayesian Regression and Continuous Shrinkage Priors.” Nature Communications 10, no. 1 (April 16, 2019): 1–10. https://doi.org/10.1038/s41467-019-09718-5.
-
Vilhjálmsson, Bjarni J., Jian Yang, Hilary K. Finucane, Alexander Gusev, Sara Lindström, Stephan Ripke, Giulio Genovese, et al. “Modeling Linkage Disequilibrium Increases Accuracy of Polygenic Risk Scores.” American Journal of Human Genetics 97, no. 4 (October 1, 2015): 576–92. https://doi.org/10.1016/j.ajhg.2015.09.001.
-
Westra, Harm-Jan, Ritsert C. Jansen, Rudolf S. N. Fehrmann, Gerard J. te Meerman, David van Heel, Cisca Wijmenga, and Lude Franke. “MixupMapper: Correcting Sample Mix-Ups in Genome-Wide Datasets Increases Power to Detect Small Genetic Effects.” Bioinformatics 27, no. 15 (August 1, 2011): 2104–11. https://doi.org/10.1093/bioinformatics/btr323.
- QTL mapping pipeline
- Genotype Harmonizer
- Genotype IO
- ASE
- GADO Command line
- Downstreamer
- GeneNetwork Analysis
Analysis plans
Other