Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[bug] GATK-bundle - duplicate key? #64

Closed
huguesfontenelle opened this issue Apr 30, 2015 · 6 comments
Closed

[bug] GATK-bundle - duplicate key? #64

huguesfontenelle opened this issue Apr 30, 2015 · 6 comments
Assignees
Labels

Comments

@huguesfontenelle
Copy link

I would like to replace pygr with pyfaidx.
Previously I have been using the human_g1k_v37_decoy.fasta file, which is part of the GATK bundle (available through FTP) for getting sequences. I use the same file for alignments.

>>> from pyfaidx import Fasta
>>> genome = Fasta('human_g1k_v37_decoy.fasta')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/lib/python2.7/site-packages/pyfaidx/__init__.py", line 552, in __init__
    filt_function=filt_function)
  File "/usr/local/lib/python2.7/site-packages/pyfaidx/__init__.py", line 209, in __init__
    self.read_fai(split_char)
  File "/usr/local/lib/python2.7/site-packages/pyfaidx/__init__.py", line 244, in read_fai
    raise ValueError('Duplicate key "%s"' % rname)
ValueError: Duplicate key "['2', 'dna:chromosome', 'chromosome:GRCh37:2:1:243199373:1']"

The human_g1k_v37_decoy.fasta.fai file does not seem to contain that key twice.

@huguesfontenelle
Copy link
Author

FYI, tests run fine, good job!

nosetests --no-skip
..........................................................
----------------------------------------------------------------------
Ran 66 tests in 24.230s

OK

A bit more descriptive errors with iPython:

In [1]: import pyfaidx
In [3]: g=pyfaidx.Fasta('human_g1k_v37_decoy.fasta')
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-3-21055aca6690> in <module>()
----> 1 g=pyfaidx.Fasta('human_g1k_v37_decoy.fasta')

/Users/huguesfo/Devel/clones/env-pyfaidx/lib/python2.7/site-packages/pyfaidx/__init__.pyc in __init__(self, filename, default_seq, key_function, as_raw, strict_bounds, read_ahead, mutable, split_char, filt_function)
    550                            default_seq=default_seq, strict_bounds=strict_bounds,
    551                            read_ahead=read_ahead, mutable=mutable, split_char=split_char,
--> 552                            filt_function=filt_function)
    553         self.keys = self.faidx.index.keys
    554         if not self.mutable:

/Users/huguesfo/Devel/clones/env-pyfaidx/lib/python2.7/site-packages/pyfaidx/__init__.pyc in __init__(self, filename, default_seq, key_function, as_raw, strict_bounds, read_ahead, mutable, split_char, filt_function)
    207 
    208         if os.path.exists(self.indexname) and getmtime(self.indexname) > getmtime(self.filename):
--> 209             self.read_fai(split_char)
    210         else:
    211             try:

/Users/huguesfo/Devel/clones/env-pyfaidx/lib/python2.7/site-packages/pyfaidx/__init__.pyc in read_fai(self, split_char)
    242                 for key in rname:
    243                     if key in self.index and not split_char:
--> 244                         raise ValueError('Duplicate key "%s"' % rname)
    245                     elif key in self.index and split_char:
    246                         duplicate_ids.append(key)

ValueError: Duplicate key "['2', 'dna:chromosome', 'chromosome:GRCh37:2:1:243199373:1']"

@mdshw5
Copy link
Owner

mdshw5 commented Apr 30, 2015

I'll take a look at this today. You might also be able to provide some perspective on #21 since I'm not quite sure I've implemented the pygr API completely!

@mdshw5 mdshw5 added the bug label Apr 30, 2015
@mdshw5 mdshw5 self-assigned this Apr 30, 2015
@mdshw5
Copy link
Owner

mdshw5 commented Apr 30, 2015

Could you please confirm that this is fixed in the current master branch?

It seems like you were using the supplied .fai index from the GATK bundle. For whatever reason, that file contains spaces in the sequence name column:

1 dna:chromosome chromosome:GRCh37:1:1:249250621:1      249250621       52      60      61
2 dna:chromosome chromosome:GRCh37:2:1:243199373:1      243199373       253404903       60      61
3 dna:chromosome chromosome:GRCh37:3:1:198022430:1      198022430       500657651       60      61
4 dna:chromosome chromosome:GRCh37:4:1:191154276:1      191154276       701980507       60      61
5 dna:chromosome chromosome:GRCh37:5:1:180915260:1      180915260       896320740       60      61
6 dna:chromosome chromosome:GRCh37:6:1:171115067:1      171115067       1080251307      60      61
7 dna:chromosome chromosome:GRCh37:7:1:159138663:1      159138663       1254218344      60      61
...

I'm not sure what tool the GATK team used to generate this index file, but samtools 0.1.19 creates an index without the whitespace:

1       249250621       52      60      61
2       243199373       253404903       60      61
3       198022430       500657651       60      61
4       191154276       701980507       60      61
5       180915260       896320740       60      61
6       171115067       1080251307      60      61
7       159138663       1254218344      60      61
...

In any case, samtools did not have an issue working with the supplied index, so I've updated pyfaidx to handle it as well. If split_char is not set when calling Faidx or Fasta, the key names will only include the first field in the sequence name column (e.g. "1", "2", ...).

@mdshw5
Copy link
Owner

mdshw5 commented Apr 30, 2015

@huguesfontenelle I've also closed #21 by adding one last bit of pygr.seqdb.SequenceFileDB API compatibility. If you find any incompatibilities when you add pyfaidx to your code, please let me know!

@huguesfontenelle
Copy link
Author

Excellent, works fine after the patch, thank you.
It was important for me that I do not have to generate the .fai since I'm sharing that file with other users on my cluster.

@mdshw5
Copy link
Owner

mdshw5 commented May 4, 2015

Great. Thanks for confirming this resolved!

@mdshw5 mdshw5 closed this as completed May 4, 2015
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

2 participants