Parser breaks if FASTA file does not contain > #144

Closed
Benjamin-Lee opened this issue Oct 12, 2018 · 6 comments

Comments

@Benjamin-Lee

Ok this is a weird one: I'm at a hackathon and they handed us "mystery genomes" that were FASTA files with the comment line removed. I tried to use pyfaidx (through squiggle) and got this error:

  File "/Users/BenjaminLee/Desktop/Python/Research/hackseq18/env/lib/python3.6/site-packages/pyfaidx/__init__.py", line 990, in __init__
    build_index=build_index)
  File "/Users/BenjaminLee/Desktop/Python/Research/hackseq18/env/lib/python3.6/site-packages/pyfaidx/__init__.py", line 423, in __init__
    self.build_index()
  File "/Users/BenjaminLee/Desktop/Python/Research/hackseq18/env/lib/python3.6/site-packages/pyfaidx/__init__.py", line 573, in build_index
    rname, rlen, thisoffset, clen, blen))
TypeError: unsupported format string passed to NoneType.__format__
@mdshw5
Owner

mdshw5 commented Oct 13, 2018

Well, it's not a FASTA file without the description line. Are we talking about a file that starts with a semicolon (like this example)? In that case I could see adding support for FASTA comments.

If we're talking about files that contain only sequence, with no comments or identifiers, I doubt there's an indexing strategy for them, since a multi-FASTA file would have no record separator between entries.

If you can provide a bit more detail about how you'd like this supported we can go from there. Thanks!
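
To illustrate the record-separator point, here is a minimal sketch of a faidx-style indexer. It is not pyfaidx's implementation: the function name `build_index_entries` and the simplified row layout (name, length, offset, bases per line, bytes per line) are assumptions made for this example. A record only starts when a line beginning with `>` is seen, so a headerless file produces no rows at all.

```python
import io


def build_index_entries(handle):
    """Collect faidx-style (name, length, offset, line_bases, line_bytes) rows.

    Simplified sketch: it does not validate that line lengths are consistent.
    """
    entries = []
    name = None          # stays None until a '>' line is seen
    length = offset = line_bases = line_bytes = 0
    position = 0         # byte offset of the start of the current line

    for raw in handle:
        if raw.startswith(b">"):
            # A new description line closes the previous record, if any.
            if name is not None:
                entries.append((name, length, offset, line_bases, line_bytes))
            fields = raw[1:].split()
            name = fields[0].decode() if fields else ""
            length = line_bases = line_bytes = 0
            offset = position + len(raw)   # sequence starts after this line
        elif name is not None:
            bases = len(raw.rstrip(b"\r\n"))
            if bases and line_bases == 0:
                line_bases, line_bytes = bases, len(raw)
            length += bases
        position += len(raw)

    if name is not None:
        entries.append((name, length, offset, line_bases, line_bytes))
    return entries


with_header = io.BytesIO(b">seq1 demo\nACGT\nAC\n>seq2\nGG\n")
headerless = io.BytesIO(b"ACGT\nAC\n")

print(build_index_entries(with_header))
print(build_index_entries(headerless))   # no '>' line, so no records
```

With no `>` line, the record name is never set, so there is nothing to key an index entry on.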

@Benjamin-Lee
Author

Ideally, it would parse it as normal. That being said, I understand if you don't think that supporting improperly formatted FASTA files is within the scope of this project, or even advisable (we recently realized that Biopython doesn't support it either and fails silently). If so, could we maybe add a specific warning if no > is found rather than a generic error?
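
A rough sketch of what such a specific check could look like, independent of any library. The exception name `MissingDescriptionLineError` and the function `first_description_line` are hypothetical and exist only for this illustration; they are not part of pyfaidx or Biopython.

```python
import io


class MissingDescriptionLineError(ValueError):
    """Hypothetical exception name, used only for this illustration."""


def first_description_line(handle):
    """Return the 1-based number of the first line starting with '>'.

    Raises MissingDescriptionLineError if the binary handle has no such line.
    """
    for line_number, line in enumerate(handle, start=1):
        if line.startswith(b">"):
            return line_number
    raise MissingDescriptionLineError(
        "no line starting with '>' was found; is this a FASTA file?"
    )


try:
    first_description_line(io.BytesIO(b"ATGGACAGTA\nGATAGATACC\n"))
except MissingDescriptionLineError as error:
    print("Cannot index file:", error)
```

Running a check like this before indexing turns the generic `TypeError` into a message that names the actual problem.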

@mdshw5
Owner

mdshw5 commented Oct 14, 2018

Definitely adding better exceptions would be great. Can I have an example of the file format in question?

@Benjamin-Lee
Author

Sure! The exact file in question can be viewed here.

Basically instead of:

> description
ATGGACAGTA...
GATAGATACC...

it was getting passed:

ATGGACAGTA...
GATAGATACC...
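
One possible workaround for a file like the second one is to add a placeholder description line before handing it to a parser. The sketch below assumes the whole file is a single record, which cannot be verified from the file itself; the function `ensure_description_line` and the default name `unnamed_sequence` are made up for this example.

```python
def ensure_description_line(text, name="unnamed_sequence"):
    """Prepend '>name' when no line of `text` starts with '>'.

    Assumption: a headerless file holds exactly one sequence.
    """
    if any(line.startswith(">") for line in text.splitlines()):
        return text  # already has at least one description line
    return ">{}\n{}".format(name, text)


headerless = "ATGGACAGTA\nGATAGATACC\n"
print(ensure_description_line(headerless, name="mystery_genome"), end="")
```

The repaired text can then be written to a new file and indexed as an ordinary single-record FASTA.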

mdshw5 added a commit that referenced this issue Oct 14, 2018
@mdshw5
Owner

mdshw5 commented Oct 14, 2018

I've added a case for handling files with no valid description lines and pushed a new release (https://github.com/mdshw5/pyfaidx/releases/tag/v0.5.5.1) that should be on PyPI in a few minutes.

@mdshw5 mdshw5 closed this as completed Oct 14, 2018
@Benjamin-Lee
Author

Thanks a ton!
