From 84009ddaa783edf594938f67f57d05d0e07dc7ae Mon Sep 17 00:00:00 2001
From: smlmbrt
Date: Tue, 21 May 2024 11:43:39 +0100
Subject: [PATCH] Add in documentation about popsimilarity file.

---
 docs/explanation/output.rst | 30 ++++++++++++++++++++++++++++--
 1 file changed, 28 insertions(+), 2 deletions(-)

diff --git a/docs/explanation/output.rst b/docs/explanation/output.rst
index 243a9ce8..4aff03fe 100644
--- a/docs/explanation/output.rst
+++ b/docs/explanation/output.rst
@@ -38,8 +38,12 @@ If you have run the pipeline **without** using ancestry information the followin
   commands; however, the calculation of the PGS is based on the full precision of the effect_weight value in the
   scoring file.
 
-If you have run the pipeline **using ancestry information** (``--run_ancesty``) the following columns may be present
-depending on the ancestry adjustments that were run (see :ref:`norm` for more details):
+``--run_ancestry``-specific outputs
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+If you have run the pipeline **using ancestry information** (``--run_ancestry``) the following columns may be present
+in the ``[sampleset]_pgs.txt.gz`` file depending on the ancestry adjustments that were run (see :ref:`norm` for
+more details):
 
 - ``percentile_MostSimilarPop``: PGS reported as a percentile of the distribution for the Most Similar Population
 - ``Z_MostSimilarPop``: PGS reported as a Z-score in reference to the mean/sd of the Most Similar Population
@@ -47,6 +51,28 @@ depending on the ancestry adjustments that were run (see :ref:`norm` for more de
 - ``Z_norm2``: PGS adjusted to have mean 0 and unit variance across ancestry groups (result of regressing
   *resid(PGS)^2 ~ PCs*)
 
+A second gzipped, space-delimited text file called ``[sampleset]_popsimilarity.txt.gz`` will also be output,
+describing the analysis of the target samples in relation to the reference panel and ancestry labels. The file has the
+following headers:
+
+- ``sampleset``: the name of the input sampleset, or ``reference`` for the panel.
+- ``IID``: the identifier of each sample within the dataset.
+- ``[PC1 ... PCN]``: The projection of the sample within the PCA space defined by the reference panel. There will be as
+  many PC columns as there are PCs calculated (default: 10).
+- ``Unrelated``: True/False flag for whether the reference panel sample is part of the unrelated subset of individuals
+  used for calculating PGS adjustments.
+- ``RF_P_[POP LABEL]`` or ``Mahalanobis_P_[POP LABEL]``: Probability that this sample's PCA projection is consistent
+  with the PCA location of the specified population label defined using either a RandomForest classifier (``RF``,
+  default) or the Chi-square derived probability from a Mahalanobis distance (``Mahalanobis``).
+- ``MostSimilarPop``: Population label with the highest probability across ``RF_P_[POP LABEL]``
+  or ``Mahalanobis_P_[POP LABEL]`` columns.
+- ``MostSimilarPop_LowConfidence``: Whether the probability is below the default QC threshold for the population
+  comparison method.
+- ``REFERENCE``: True/False flag for whether the sample is from the reference panel.
+- ``SuperPop``: Population label from the reference panel used to assign the ``MostSimilarPop`` labels and PGS
+  distributions for empirical adjustments.
+
+
 Report
 ~~~~~~