Confusion about preparing the input genomes when using pgsc_calc #98

xiyasong · 2023-03-29T13:51:58Z

Hi:
Thank you for developing such a great Nextflow pipeline. I tested it and it runs well on the test dataset. Now I want to apply it to my own sample cohort. My genome files are separate VCF files, and each VCF file contains WGS data for only one individual. I tried running one file (converted to plink2 format) with a sample sheet, and it ran successfully, but I got a hint that the sample size is too small. My first question: should I always use pgsc_calc to calculate PRS on a cohort rather than per sample? If so, should I always provide a multi-sample merged VCF file (or another supported format), so that the sample sheet has only one line (assuming I want to use all of the chromosomes)?

My second question: when I calculated PRS from only one sample's genotypes, it always gave errors about the variant match rate being lower than the threshold (around 50%). I got past the error by resetting the threshold, but I wonder whether that is normal.

Thank you in advance. I look forward to your reply.

nebfield · 2023-03-29T14:17:16Z

Maintainer

Hello,

It's definitely best to calculate scores on cohorts in version v1.3.2, so merging your individual VCFs is a good idea. If the merged VCF contains multiple chromosomes, then a single line in the sample sheet is OK.

A 50% match rate isn't good, but it's a common problem when working with WGS data. We assume VCF input to this workflow has been processed by an imputation server like Michigan or TOPMed. The workflow has difficulties with WGS VCFs because variants that are homozygous REF are treated as missing data, which lowers the match rate.
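To make the effect concrete, here is a minimal Python sketch of how absent homozygous-REF sites can drag the match rate down. The variant IDs and the `match_rate` helper are hypothetical and only illustrate the idea; the workflow's real matching logic is more involved than a set intersection.

```python
# Hypothetical variant IDs (chrom:pos:ref:alt); not taken from a real scoring file.
scoring_variants = {"1:1000:A:G", "1:2000:C:T", "2:3000:G:A", "2:4000:T:C"}

# A single-sample WGS VCF often lists only sites where the sample carries a
# non-reference allele, so homozygous-REF sites are simply absent from the file.
single_sample_vcf = {"1:1000:A:G", "2:4000:T:C"}


def match_rate(scoring, target):
    """Fraction of scoring-file variants that are present in the target genotypes."""
    return len(scoring & target) / len(scoring)


rate = match_rate(scoring_variants, single_sample_vcf)
print(f"match rate: {rate:.0%}")  # prints "match rate: 50%"
```

In this toy case the sample is homozygous REF at two of the four scoring variants, so only half of them can be matched even though nothing is wrong with the data.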

Here's an explanation of match rates: #86 (comment)

Another user working with WGS gVCFs described some extra steps they had to take to process the VCF before using our workflow:

I hope I've explained well enough 😄 Please let me know if you have any more questions

4 replies

xiyasong Apr 12, 2023
Author

Hi: Thanks for your reply. I have checked #50, where another user was working with WGS data, and I understand what I should do when starting from BAM files. The problem is that I do not have the original BAM files, so I cannot re-create the VCF files with the reference alleles included. Here is what I tried: I used bcftools to merge the per-sample VCF files directly, which gave a final VCF file like this (I removed some long fields here; the original is a correctly formatted VCF file):

#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT sample 1 sample 2 sample 3 sample 4 ...
chr10 10038 . T TA 8.15 PASS (some info) GT ./. ./. 0/1 ./.

Then I converted the VCF file to plink2 format with:
./plink2 --vcf 207-samples-Merged.vcf.gz \
  --allow-extra-chr \
  --chr 1-22, X, Y, XY \
  --make-pgen \
  --out 207-samples-Merged
I successfully got some PRS scores and reports. Because I merged around 200 samples, a lot of variant sites are present, so the match rate in the report was 97.8%. But I am still not sure how the pipeline treats all of the missing genotype data. The PGS scores I got all fall within a very narrow range (56.1266-56.3283). The density plot is like this:

So I wonder: is this result normal? Or are the scores all confined to around 56 because of the missing-genotype problem?

I look forward to your reply. Thanks!

smlmbrt Apr 12, 2023
Maintainer

Hi @xiyasong, the current pipeline uses the observed allele frequency to impute missing genotypes, as is standard in plink2 (https://www.cog-genomics.org/plink/2.0/score). You could try treating the missing alleles as homozygous reference by running bcftools merge --missing-to-ref [...] (https://samtools.github.io/bcftools/bcftools.html#merge) and see how much this changes the results, as it's probably calculating the allele frequency on non-missing sites only.
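The difference between the two treatments can be sketched in a few lines of Python. This is a toy illustration, not the pipeline's or plink2's actual code: the `allele_freq` and `score_sum` helpers, the genotype matrix, and the weights are all made up, and it assumes the effect allele is always ALT (so homozygous REF means a dosage of 0) and that every variant has at least one observed genotype.

```python
def allele_freq(dosages):
    """Effect-allele frequency estimated from the non-missing dosages only."""
    observed = [d for d in dosages if d is not None]
    return sum(observed) / (2 * len(observed))


def score_sum(genotypes, weights, missing_to_ref=False):
    """Weighted sum of effect-allele dosages for each sample.

    genotypes[v][s] is the dosage (0, 1, 2) of variant v in sample s,
    or None when the genotype is missing (./. in the VCF).
    """
    scores = [0.0] * len(genotypes[0])
    for dosages, weight in zip(genotypes, weights):
        # Mean imputation fills a missing genotype with the expected dosage 2*p;
        # the missing-to-ref treatment fills it with 0 (assumes effect allele = ALT).
        fill = 0.0 if missing_to_ref else 2 * allele_freq(dosages)
        for s, d in enumerate(dosages):
            scores[s] += weight * (fill if d is None else d)
    return scores


# Hypothetical data: 2 variants x 4 samples, mostly missing as in a naive merge.
genotypes = [
    [None, None, 1, None],
    [2, None, None, 1],
]
weights = [0.5, 0.2]

mean_imputed = score_sum(genotypes, weights)
ref_filled = score_sum(genotypes, weights, missing_to_ref=True)
print("mean-imputed:", mean_imputed)
print("ref-filled:  ", ref_filled)
```

When most genotypes are missing, mean imputation gives nearly every sample the same contribution per variant, and because the frequency is estimated only from carriers it is biased upward. That would be consistent with scores that sit high and close together, though the toy data here proves nothing about the real cohort.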

xiyasong Apr 19, 2023
Author

Hi: Thank you for your reply! I tried bcftools merge --missing-to-ref [...] and reran the calculation on the same cohort for the same PGS. The average score decreased, and the scores are still distributed in a very narrow range.

Then I compared the PGS ranks from the two methods: the similarity (Spearman's rank correlation coefficient) is 0.63. So the two settings give rankings that are correlated but not very similar. I am not sure whether this makes sense, but I will share it here.
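For anyone who wants to repeat this comparison, below is a small dependency-free Python sketch of Spearman's rank correlation (the Pearson correlation of the ranks, with tied values given their average rank). The two score lists are invented placeholders, not the values from this cohort, and the function names are only illustrative.

```python
def ranks(values):
    """1-based ranks, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to cover the whole run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = sum((a - mean_x) ** 2 for a in x) ** 0.5
    sd_y = sum((b - mean_y) ** 2 for b in y) ** 0.5
    return cov / (sd_x * sd_y)


def spearman(x, y):
    """Spearman's rho: Pearson correlation computed on the ranks."""
    return pearson(ranks(x), ranks(y))


# Placeholder scores for the same five samples under the two settings.
scores_mean_imputed = [56.13, 56.20, 56.25, 56.31, 56.18]
scores_missing_to_ref = [12.4, 12.9, 12.1, 13.5, 12.6]
print(f"Spearman rho: {spearman(scores_mean_imputed, scores_missing_to_ref):.2f}")
```

With real data the two lists must be in the same sample order, for example by joining the two score tables on the sample ID before comparing.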

smlmbrt Apr 21, 2023
Maintainer

Thanks for sharing @xiyasong - I think that's interesting, and probably makes sense that the shape of the distribution widens because the non-missing alleles were probably contributing to the allele frequency being higher for everyone. Need to think a bit more on this, and perhaps calculate the score on UKB & 1000G to see what distribution I would get there. Sometimes they are quite narrow on the SUM scale.
