False duplicate key error #150

Closed
Benjamin-Lee opened this issue Feb 18, 2019 · 1 comment

@Benjamin-Lee

I was using some relatively simple code I wrote to generate FASTA files containing random sequences:

import random
import sys

number = int(sys.argv[1])
length = int(sys.argv[2])

for i in range(number):
	seq = ""
	for j in range(length):
		seq += random.choice("ATGC")
	with open(f"{number}-{length}bp-random-seqs.fa", "a+") as f:
		print(f"> seq {i}", file=f)
		print(seq, file=f)

The file ends up looking like this:

> seq 0
...
.
.
.
> seq n
...

However, I am getting the following error:

  File "/Users/BenjaminLee/.virtualenvs/squiggle/lib/python3.6/site-packages/pyfaidx/__init__.py", line 481, in read_fai
    raise ValueError('Duplicate key "%s"' % key)
ValueError: Duplicate key "seq"

It seems that it's only picking up on the seq in the description line, not on the identifier.

@mdshw5
Owner

mdshw5 commented Feb 18, 2019

This is expected behavior, which was chosen to mimic the samtools faidx behavior of splitting deflines on whitespace. You can have pyfaidx use the full defline as the key name, rather than the short name stored in the .fai index, by passing the read_long_names argument:

>>> genes = Fasta('10-1000bp-random-seqs.fa', read_long_names=True)
>>> genes
Fasta("10-1000bp-random-seqs.fa")
>>> genes.keys()
odict_keys([' seq 0', ' seq 1', ' seq 2', ' seq 3', ' seq 4', ' seq 5', ' seq 6', ' seq 7', ' seq 8', ' seq 9'])
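To make the whitespace-splitting behavior concrete, here is a minimal sketch of why every record in the file above ends up with the same key. The helper `short_name` is hypothetical and only approximates the idea of taking the first whitespace-delimited token of a defline; it is not pyfaidx's or samtools' actual code.

```python
from collections import Counter


def short_name(defline):
    """Hypothetical helper: first whitespace-delimited token after '>'.

    This approximates "splitting deflines on whitespace"; it is an
    illustration, not the real pyfaidx implementation.
    """
    return defline.lstrip(">").split()[0]


# Headers in the style produced by the generator script in this issue.
deflines = [f"> seq {i}" for i in range(3)]

# Because of the space after '>', the first token is always "seq".
names = [short_name(d) for d in deflines]

# Any name seen more than once would collide as a dictionary key.
duplicates = [name for name, count in Counter(names).items() if count > 1]

print(names)
print(duplicates)
```

Under this assumption, the trailing number never becomes part of the short name, which matches the `Duplicate key "seq"` message in the traceback.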

Since duplicate keys are detected during index reading in pyfaidx, the .fai will appear to contain duplicate sequences (which may be a problem for other tools) but will not contain whitespace, and so I think it is "more correct" with respect to samtools behavior. See #111 for more information about how I arrived at this behavior.
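If changing the input is an option, another way to avoid the collision is to write headers whose first whitespace-delimited token is already unique. The sketch below is one possible rewrite of the generator script from this issue, not something prescribed by pyfaidx; the function names and the `seq_{i}` naming scheme are assumptions made for illustration.

```python
import random


def random_fasta_records(number, length, alphabet="ATGC"):
    """Yield (header, sequence) pairs with unique, whitespace-free names.

    The "seq_{i}" naming scheme is an assumption for this sketch; any
    identifier without whitespace right after '>' would serve.
    """
    for i in range(number):
        seq = "".join(random.choice(alphabet) for _ in range(length))
        yield f">seq_{i}", seq


def write_fasta(path, number, length):
    """Write the random records to `path`, one header and one sequence line each."""
    # Opening once in "w" mode (the original script appended with "a+")
    # avoids accumulating records across repeated runs.
    with open(path, "w") as f:
        for header, seq in random_fasta_records(number, length):
            print(header, file=f)
            print(seq, file=f)
```

Calling `write_fasta("10-1000bp-random-seqs.fa", 10, 1000)` would produce headers such as `>seq_0`, so the short names stay distinct when deflines are split on whitespace.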

The documentation should definitely be clearer about this, as well as other Fasta and Faidx arguments, so feel free to submit a PR if you want to add something to the README.

@mdshw5 mdshw5 closed this as completed Feb 18, 2019