MATCH_COMBINE error #72

Srividhya-Sainath · 2022-12-29T18:44:52Z

Hi,

I am trying to run the following code:

nextflow run pgscatalog/pgsc_calc -profile test,conda --input parkinson_samplesheet.csv --target_build GRCh37 --pgs_id PGS000123

My input genome data is in pfile format with data from chr 1-22. I am getting this error:

line 33, in combine_matches
_check_duplicate_vars(matches)
File "/home/ssainath/.conda/envs/nextflowenv/lib/python3.11/site-packages/pgscatalog_utils/match/combine_matches.py", line 48, in _check_duplicate_vars
assert max_occurrence == [1], "Duplicate IDs in final matches"
AssertionError: Duplicate IDs in final matches

Not sure where I am going wrong.

smlmbrt · 2023-01-03T15:57:56Z

@Srividhya-Sainath : this error occurs when a single line/variant in the scoring file matches multiple variants in the genotyping datasets. The first thing to check would be that there aren't duplicated variants in your genotyping data (are you using a single file or is the data split across chromosomes)?
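
To make the failure mode concrete, here is a minimal Python sketch of the kind of check the assertion message suggests: counting how often each variant ID appears among the final matches and reporting any that appear more than once. This is not the pgscatalog_utils implementation; the function name `find_duplicate_ids` and the example IDs are hypothetical.

```python
from collections import Counter


def find_duplicate_ids(matched_ids):
    """Return the IDs that occur more than once in a list of matched variant IDs."""
    counts = Counter(matched_ids)
    return sorted(vid for vid, n in counts.items() if n > 1)


# Hypothetical matched IDs: the same variant ID shows up twice.
matches = ["1:1000:A:G", "2:2000:C:T", "1:1000:A:G"]
print(find_duplicate_ids(matches))  # ['1:1000:A:G']
```

An empty list would correspond to the case where the assertion passes.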

Srividhya-Sainath · 2023-01-04T22:31:55Z

I am using a single file with data across chromosomes.

smlmbrt · 2023-01-05T11:10:43Z

Ok, so there's only a header and 1 row in parkinson_samplesheet.csv? Is it possible then that there are duplicate variants in the genotyping data? One test for that would be to make a 4-column text file of scoring file positions (chrom, start, stop, id):

query.txt
1 1000 1000 1:1000
2 2000 2000 2:2000

and then use plink to extract the positions:

plink --pfile [your pfile] --extract range query.txt --make-just-pvar --out extract_positions

and check whether there are duplicates.

[We'll also try to work in parallel to update the software to have more informative error messages about which variants are causing problems]
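
If the scoring file has many positions, the 4-column file can be generated rather than typed by hand. The following is a minimal Python sketch that formats (chrom, position) pairs in the layout shown in query.txt above. The function name `range_lines` and the input pairs are hypothetical; how you read positions out of your own scoring file is left out.

```python
def range_lines(positions):
    """Format (chrom, pos) pairs as 'chrom start stop id' lines, one per position."""
    return [f"{chrom} {pos} {pos} {chrom}:{pos}" for chrom, pos in positions]


# Hypothetical scoring-file positions, matching the query.txt example.
lines = range_lines([("1", 1000), ("2", 2000)])
print("\n".join(lines))
# 1 1000 1000 1:1000
# 2 2000 2000 2:2000
```

Redirecting the printed output to query.txt gives a file that can be passed to the extract step.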

aabiddanda · 2023-10-11T13:51:49Z

@smlmbrt I've also found this in some of my current runs using the HGDP + KGP dataset recently released (see logfile + sample sheet below).

To check for duplicate variants I've also used the following:

for i in {1..22}; do zstdcat work/genomes/HgdpTgpCancerTest/GRCh38/chr$i/GRCh38_HgdpTgpCancerTest_chr$i.pvar.zst | grep -v "#" | awk '{print $3}' | sort | uniq -d ; done

And it doesn't seem to print anything (indicating no duplicated variants in the genotyping dataset). Do multi-allelic variants need to be filtered out from the dataset based on CHROM:POS explicitly prior to running the workflow? Any suggestions would be quite helpful!
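
The loop above compares the ID column only, so two rows at the same position with different IDs would not be reported. As a rough sketch of a position-based check (it does not answer whether such rows must be filtered), the Python below groups rows by chromosome and position. It assumes the first three whitespace-separated columns are CHROM, POS and ID, as the awk command implies; the function name and example rows are hypothetical.

```python
from collections import defaultdict


def duplicated_positions(pvar_lines):
    """Map (chrom, pos) -> list of IDs for positions found on more than one row."""
    ids_by_position = defaultdict(list)
    for line in pvar_lines:
        if line.startswith("#") or not line.strip():
            continue  # skip header, meta and blank lines
        chrom, pos, variant_id = line.split()[:3]
        ids_by_position[(chrom, pos)].append(variant_id)
    return {key: ids for key, ids in ids_by_position.items() if len(ids) > 1}


# Hypothetical rows: two different IDs share position 1:1000.
rows = [
    "#CHROM\tPOS\tID\tREF\tALT",
    "1\t1000\t1:1000_A/G\tA\tG",
    "1\t1000\t1:1000_A/C\tA\tC",
    "1\t2000\t1:2000_T/C\tT\tC",
]
print(duplicated_positions(rows))
# {('1', '1000'): ['1:1000_A/G', '1:1000_A/C']}
```

Because the function takes any iterable of lines, `sys.stdin` can be passed in to read the decompressed output of zstdcat.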

nextflow.pgs_calc.ovarian_cancer.101123.txt

hgdp_tgp_samplesheet.csv

nebfield · 2023-10-13T09:44:30Z

@aabiddanda it's possible you could accidentally have duplicates when you look at the union of all variants across all of the files. You could check this by catting all of the files at the same time before piping:

$ find genomes/hapnest/GRCh38 -name '*.pvar.zst' -exec zstdcat {} \+ | grep -v '#' | awk '{print $3}' | sort | uniq -d
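
If duplicates do turn up in the union, it helps to know which files they came from. Here is a small Python sketch of the same idea that also records the source of each ID. It assumes the ID is the third whitespace-separated column, as in the awk command; the function name, file names and rows are hypothetical.

```python
from collections import defaultdict


def ids_seen_more_than_once(pvar_sources):
    """Map variant ID -> list of source names, for IDs found on more than one row overall."""
    sources_by_id = defaultdict(list)
    for name, lines in pvar_sources.items():
        for line in lines:
            if line.startswith("#") or not line.strip():
                continue  # skip header, meta and blank lines
            variant_id = line.split()[2]
            sources_by_id[variant_id].append(name)
    return {vid: names for vid, names in sources_by_id.items() if len(names) > 1}


# Hypothetical per-chromosome files where rs1 appears in both.
sources = {
    "chr1.pvar": ["#CHROM\tPOS\tID\tREF\tALT", "1\t1000\trs1\tA\tG"],
    "chr2.pvar": ["2\t2000\trs2\tC\tT", "1\t1000\trs1\tA\tG"],
}
print(ids_seen_more_than_once(sources))
# {'rs1': ['chr1.pvar', 'chr2.pvar']}
```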

aabiddanda · 2023-10-13T16:15:58Z

Thanks for the suggestion @nebfield - I've tried what you suggested and I am not getting any duplicate IDs appearing still.

However, I did get a full run to work by changing all of the chr* chromosome values in my samplesheet to just the number value - chr1 -> 1. This is in spite of my genome build being GRCh38. I'll see if I can understand why this works but just wanted to let you know that this was the fix for me.

nebfield · 2023-10-17T12:51:41Z

Great 🚀 I think there's a mismatch between the VCF with a chr prefix and the internal plink files we make at the start of the workflow (which strips the chr prefix by default). The samplesheet information is used to match against the internal plink files.

I'll set up something to check chromosomes in a future release.
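
Until such a check exists, the workaround described above (chr1 -> 1) can be applied to a samplesheet programmatically. The following Python sketch is one way to do that, assuming the samplesheet has a `chrom` column like the one shown later in this thread. The function name, sampleset name and paths are hypothetical, and this is not part of the workflow itself.

```python
import csv
import io


def strip_chr_prefix(samplesheet_text):
    """Return samplesheet CSV text with a leading 'chr' removed from the chrom column."""
    reader = csv.DictReader(io.StringIO(samplesheet_text))
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=reader.fieldnames, lineterminator="\n")
    writer.writeheader()
    for row in reader:
        chrom = row.get("chrom") or ""
        if chrom.lower().startswith("chr"):
            row["chrom"] = chrom[3:]  # only the chrom column is changed, not the path
        writer.writerow(row)
    return out.getvalue()


# Hypothetical samplesheet with chr-prefixed chromosome values.
sheet = (
    "sampleset,path_prefix,chrom,format\n"
    "hgdp,/data/hgdp_chr1,chr1,vcf\n"
    "hgdp,/data/hgdp_chr2,chr2,vcf\n"
)
print(strip_chr_prefix(sheet), end="")
# sampleset,path_prefix,chrom,format
# hgdp,/data/hgdp_chr1,1,vcf
# hgdp,/data/hgdp_chr2,2,vcf
```

Rows with an empty chrom value or one without the prefix are written back unchanged.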

bnwolford · 2023-10-22T20:09:43Z

I have the same error and I'm wondering if it has to do with multi-allelic variants? In my original bgen I have many multi-allelic variants, for example:
alternate_ids rsid chromosome position number_of_alleles first_allele alternative_alleles
. 21:10968913_G/A 21 10968913 2 A G
. 21:10968913_G/C 21 10968913 2 C G

I'm just using one chunked pgen in my samplesheet to test:

sampleset,path_prefix,chrom,format
test,/home/bwolford/archive/pgen/h234_hrc_chr21_chunk1,21,pfile

I tried the --keep_multiallelic option but I get the same error.

nextflow run pgscatalog/pgscalc -profile conda --input samplesheet.csv --pgs_id PGS000752 --target_build GRCh38 --chrom 21 --keep_multiallelic

ERROR ~ Error executing process > 'PGSCATALOG_PGSCALC:PGSCALC:MATCH:MATCH_COMBINE (test)'

Caused by:
  Process `PGSCATALOG_PGSCALC:PGSCALC:MATCH:MATCH_COMBINE (test)` terminated with an error exit status (1)

Command executed:

  export POLARS_MAX_THREADS=2

  combine_matches                  --dataset test         --scorefile scorefiles.txt.gz         --matches *.ipc.zst         -n 2         --min_overlap 0.75                  --keep_multiallelic                  --outdir $PWD         --split                  -v

  cat <<-END_VERSIONS > versions.yml
  MATCH_COMBINE:
      pgscatalog_utils: $(echo $(python -c 'import pgscatalog_utils; print(pgscatalog_utils.__version__)'))
  END_VERSIONS

Command exit status:
  1

Command output:
  (empty)

Command error:
  root: 2023-10-22 22:02:53 DEBUG    Verbose logging enabled
  pgscatalog_utils.config: 2023-10-22 22:02:53 DEBUG    Using 2 threads to read CSVs
  pgscatalog_utils.config: 2023-10-22 22:02:53 DEBUG    polars threadpool size: 2
  pgscatalog_utils.match.read: 2023-10-22 22:02:53 DEBUG    Reading scorefile
  pgscatalog_utils.match.read: 2023-10-22 22:02:53 DEBUG    --chrom parameter not set, using all variants in scoring file
  pgscatalog_utils.match.preprocess: 2023-10-22 22:02:53 DEBUG    Complementing column effect_allele
  pgscatalog_utils.match.preprocess: 2023-10-22 22:02:53 DEBUG    Complementing column other_allele
  pgscatalog_utils.match.combine_matches: 2023-10-22 22:02:53 DEBUG    Reading matches
  pgscatalog_utils.match.combine_matches: 2023-10-22 22:02:53 DEBUG    Labelling match candidates
  pgscatalog_utils.match.label: 2023-10-22 22:02:53 DEBUG    Labelling best match type (refalt > altref > ...)
  pgscatalog_utils.match.label: 2023-10-22 22:02:53 DEBUG    Labelling duplicated best match: keeping first instance as best_match = True
  pgscatalog_utils.match.label: 2023-10-22 22:02:53 DEBUG    Labelling multiple scoring file lines (accession/row_nr) that best_match to the same variant
  pgscatalog_utils.match.label: 2023-10-22 22:02:53 DEBUG    Labelling all duplicates with exclude flag
  pgscatalog_utils.match.label: 2023-10-22 22:02:53 DEBUG    Labelling ambiguous variants
  pgscatalog_utils.match.preprocess: 2023-10-22 22:02:53 DEBUG    Complementing column REF
  pgscatalog_utils.match.label: 2023-10-22 22:02:53 DEBUG    Labelling ambiguous variants with exclude flag
  pgscatalog_utils.match.label: 2023-10-22 22:02:53 DEBUG    Not excluding multiallelic variants
  pgscatalog_utils.match.label: 2023-10-22 22:02:53 DEBUG    Not excluding flipped matches
  Traceback (most recent call last):
    File "/home/bwolford/pgs_calc/work/conda/pgscatalog_utils-b4f3f611180e4ff75ddd463e7ba86339/bin/combine_matches", line 8, in <module>
      sys.exit(combine_matches())
    File "/home/bwolford/pgs_calc/work/conda/pgscatalog_utils-b4f3f611180e4ff75ddd463e7ba86339/lib/python3.10/site-packages/pgscatalog_utils/match/combine_matches.py", line 37, in combine_matches
      _check_duplicate_vars(matches)
    File "/home/bwolford/pgs_calc/work/conda/pgscatalog_utils-b4f3f611180e4ff75ddd463e7ba86339/lib/python3.10/site-packages/pgscatalog_utils/match/combine_matches.py", line 52, in _check_duplicate_vars
      assert max_occurrence == [1], "Duplicate IDs in final matches"
  AssertionError: Duplicate IDs in final matches

Work dir:
  /home/bwolford/pgs_calc/work/cd/da3fd357a9ab8d0b9d74c011c291ed

Tip: when you have fixed the problem you can continue the execution adding the option `-resume` to the run command line

 -- Check '.nextflow.log' file for details
ERROR ~ ERROR: Matching subworkflow failed

 -- Check '.nextflow.log' file for details
ERROR ~ ERROR: No results report written!

 -- Check '.nextflow.log' file for details
ERROR ~ ERROR: No scores calculated!

 -- Check '.nextflow.log' file for details

nebfield · 2023-12-05T14:18:11Z

This assertion error has been changed to be more helpful and descriptive in our latest release.

Please update (nextflow pull pgscatalog/pgsc_calc -r v2.0.0-alpha.4) and create a new issue or discussion if you still experience problems.

smlmbrt added the user-query User queries & requests label Jan 3, 2023

smlmbrt mentioned this issue Jan 5, 2023

More informative error messages PGScatalog/pgscatalog_utils#36

Open

smlmbrt added the enhancement New feature or request label Jan 24, 2023

nebfield added this to the v2.1.0 milestone Jul 14, 2023

openpaul mentioned this issue Aug 24, 2023

combine_matches fails if no matches matched are found PGScatalog/pgscatalog_utils#52

Closed

nebfield mentioned this issue Oct 17, 2023

Check samplesheet for chr prefix PGScatalog/pgscatalog_utils#59

Closed

nebfield mentioned this issue Oct 23, 2023

MATCH_COMBINE assertion error when match dataframe is empty PGScatalog/pgscatalog_utils#60

Closed

nebfield closed this as completed Dec 5, 2023
