MATCH_COMBINE error #72
Comments
@Srividhya-Sainath: this error occurs when a single line (variant) in the scoring file matches multiple variants in the genotyping dataset. The first thing to check is that there are no duplicated variants in your genotyping data. Are you using a single file, or is the data split across chromosomes?
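One way to look for such duplicates is sketched below in Python. This is only an illustration, not part of pgsc_calc: the function name `duplicated_variants` is made up, and it assumes a plink2 `.pvar`-style file that is tab-separated with the columns CHROM, POS, ID, REF, ALT in that order. It counts variants by position and alleles rather than by ID, since two rows with different IDs can still describe the same variant.

```python
from collections import Counter


def duplicated_variants(pvar_lines):
    """Return {(chrom, pos, ref, alt): count} for variants seen more than once.

    Assumes tab-separated rows with columns CHROM, POS, ID, REF, ALT
    (an assumption about the file layout, not something pgsc_calc requires).
    """
    counts = Counter()
    for line in pvar_lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines, ## metadata and the #CHROM header
        chrom, pos, _variant_id, ref, alt = line.rstrip("\n").split("\t")[:5]
        counts[(chrom, pos, ref, alt)] += 1
    return {key: n for key, n in counts.items() if n > 1}


# Tiny in-memory example; with a real file, pass an open file handle instead.
example = [
    "#CHROM\tPOS\tID\tREF\tALT\n",
    "1\t100\trs1\tA\tG\n",
    "1\t100\trs1_dup\tA\tG\n",
    "2\t200\trs2\tC\tT\n",
]
print(duplicated_variants(example))
```

An empty result means no two rows share the same chromosome, position and alleles under these assumptions.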
I am using a single file with data across chromosomes.
Ok, so there's only a header and 1 row in
and then use plink to extract the positions. [We'll also work in parallel on updating the software to give more informative error messages about which variants are causing problems.]
@smlmbrt I've also found this in some of my current runs using the HGDP + KGP dataset recently released (see logfile + sample sheet below). To check for duplicate variants I've also used the following:
It doesn't seem to print anything, which indicates there are no duplicated variants in the genotyping dataset. Do multi-allelic variants need to be filtered out from the dataset based on
@aabiddanda it's possible you could accidentally have duplicates when you look at the union of all variants across all of the files. You could check this by catting all of the files at the same time before piping:
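The same idea of checking the union of variants across all files can be sketched in Python. This is a hypothetical helper, not the shell command from the comment above and not part of pgsc_calc; it assumes tab-separated `.pvar`-style rows with the variant ID in the third column, and the file names in the example are invented.

```python
from collections import defaultdict


def duplicate_ids_across_files(files):
    """files: mapping of file name -> iterable of .pvar-style lines.

    Returns {variant_id: [file names]} for IDs that occur more than once
    in the union of all files. Assumes the ID is the third tab-separated column.
    """
    seen = defaultdict(list)
    for name, lines in files.items():
        for line in lines:
            if not line.strip() or line.startswith("#"):
                continue  # skip blank lines and header/metadata lines
            variant_id = line.rstrip("\n").split("\t")[2]
            seen[variant_id].append(name)
    return {vid: names for vid, names in seen.items() if len(names) > 1}


# Invented example: rs2 appears in both per-chromosome files.
files = {
    "chr1.pvar": ["#CHROM\tPOS\tID\tREF\tALT\n", "1\t100\trs1\tA\tG\n", "1\t150\trs2\tC\tT\n"],
    "chr2.pvar": ["#CHROM\tPOS\tID\tREF\tALT\n", "2\t150\trs2\tC\tT\n", "2\t300\trs3\tG\tA\n"],
}
print(duplicate_ids_across_files(files))
```

Each file can look clean on its own while the combined set still contains a repeated ID, which is the case this check is meant to catch.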
Thanks for the suggestion @nebfield. I've tried what you suggested and I'm still not seeing any duplicate IDs. However, I did get a full run to work by changing all of the
Great 🚀 I think there's a mismatch between the VCF, which has a chr prefix, and the internal plink files we make at the start of the workflow (which have the chr prefix stripped by default). The samplesheet information is used to match against the internal plink files. I'll set up something to check chromosomes in a future release.
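A rough Python sketch of that kind of chromosome check is shown here. It is not the check that was later added to the workflow; the function names are hypothetical, and it simply assumes, as described above, that the internal files store chromosome names without the chr prefix while the samplesheet values are compared against them as written.

```python
def strip_chr_prefix(chrom):
    """Return the chromosome name without a leading 'chr' (case-insensitive)."""
    chrom = str(chrom).strip()
    return chrom[3:] if chrom.lower().startswith("chr") else chrom


def mismatched_chromosomes(samplesheet_chroms, genotype_chroms):
    """List samplesheet values that would not match the prefix-stripped names.

    Assumption: genotype chromosome names are stored with 'chr' removed,
    and samplesheet values are matched against them literally.
    """
    internal = {strip_chr_prefix(c) for c in genotype_chroms}
    return [c for c in samplesheet_chroms if str(c).strip() not in internal]


# 'chr1' in the samplesheet fails to match the stripped name '1'.
print(mismatched_chromosomes(["chr1", "2"], ["chr1", "chr2"]))
```

Under this assumption, any value the function returns would need its chr prefix removed in the samplesheet before it can match.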
I have the same error and I'm wondering if it has to do with multi-allelic variants. In my original bgen I have many multi-allelic variants, for example I'm just using one chunked pgen in my samplesheet to test
I tried the
This assertion error has been changed to be more helpful and descriptive in our latest release. Please update (
Hi,
I am trying to run the following code:
nextflow run pgscatalog/pgsc_calc -profile test,conda --input parkinson_samplesheet.csv --target_build GRCh37 --pgs_id PGS000123
My input genome data is in pfile format, with data from chr 1-22. I am getting this error:
line 33, in combine_matches
_check_duplicate_vars(matches)
File "/home/ssainath/.conda/envs/nextflowenv/lib/python3.11/site-packages/pgscatalog_utils/match/combine_matches.py", line 48, in _check_duplicate_vars
assert max_occurrence == [1], "Duplicate IDs in final matches"
AssertionError: Duplicate IDs in final matches
Not sure where I am going wrong.