Parser breaks if FASTA file does not contain > #144

Benjamin-Lee · 2018-10-12T18:18:21Z

Ok this is a weird one: I'm at a hackathon and they handed us "mystery genomes" that were FASTA files with the comment line removed. I tried to use pyfaidx (through squiggle) and got this error:

  File "/Users/BenjaminLee/Desktop/Python/Research/hackseq18/env/lib/python3.6/site-packages/pyfaidx/__init__.py", line 990, in __init__
    build_index=build_index)
  File "/Users/BenjaminLee/Desktop/Python/Research/hackseq18/env/lib/python3.6/site-packages/pyfaidx/__init__.py", line 423, in __init__
    self.build_index()
  File "/Users/BenjaminLee/Desktop/Python/Research/hackseq18/env/lib/python3.6/site-packages/pyfaidx/__init__.py", line 573, in build_index
    rname, rlen, thisoffset, clen, blen))
TypeError: unsupported format string passed to NoneType.__format__
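
For reference, the final TypeError is what Python raises whenever None is formatted with a non-empty format spec. The sketch below reproduces the message under one assumption that the traceback does not confirm: that the sequence name is still None because no > line was ever seen. `format_index_row` and `try_format` are hypothetical stand-ins, not pyfaidx functions.

```python
def format_index_row(rname, rlen, offset, clen, blen):
    # Hypothetical stand-in for the index-writing call shown in the traceback.
    # A non-empty format spec such as ":s" fails when the value is None.
    return "{:s}\t{:d}\t{:d}\t{:d}\t{:d}\n".format(rname, rlen, offset, clen, blen)


def try_format(rname):
    # Return the formatted row, or the error text if formatting fails.
    try:
        return format_index_row(rname, 20, 0, 10, 11)
    except TypeError as exc:
        return "TypeError: {}".format(exc)
```

Calling `try_format("seq1")` yields a tab-separated row, while `try_format(None)` yields the same kind of message as the traceback above.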

mdshw5 · 2018-10-13T19:54:47Z

Well, it's not a FASTA file without the description line. Are we talking about a file that starts with a semicolon (like this example)? In that case I could see adding support for FASTA comments.

If we're talking about files that contain only sequence, with no comments or identifiers, I doubt there's an indexing strategy for these, since a multi-FASTA file would have no record separator between entries.

If you can provide a bit more detail about how you'd like this supported we can go from there. Thanks!

Benjamin-Lee · 2018-10-14T04:54:56Z

Ideally, it would parse it as normal. That being said, I understand if you don't think that supporting improperly formatted FASTA files is within the scope or even advisable for this project (we recently realized that Biopython doesn't support it either and fails silently). If so, could we maybe add a specific warning if no > is found rather than a generic error?
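
A rough sketch of the kind of check being requested here, done on the caller's side before handing the file to a parser. `MissingFastaHeaderError` and `require_fasta_header` are hypothetical names for illustration; they are not part of pyfaidx or Biopython, and this is not how the eventual fix was implemented.

```python
class MissingFastaHeaderError(ValueError):
    """Hypothetical exception; not part of pyfaidx or Biopython."""


def require_fasta_header(lines):
    """Raise a specific error if no line in `lines` starts with '>'.

    `lines` can be any iterable of strings, including an open file handle.
    """
    for line in lines:
        if line.startswith(">"):
            return
    raise MissingFastaHeaderError(
        "no '>' description line found; input is not valid FASTA"
    )
```

With a file on disk this would be used as `with open(path) as handle: require_fasta_header(handle)`, where `path` is whatever file is about to be indexed.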

mdshw5 · 2018-10-14T18:58:47Z

Definitely adding better exceptions would be great. Can I have an example of the file format in question?

Benjamin-Lee · 2018-10-14T19:01:18Z

Sure! The exact file in question can be viewed here.

Basically instead of:

> description
ATGGACAGTA...
GATAGATACC...

it was getting passed:

ATGGACAGTA...
GATAGATACC...
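
One possible workaround for a single headerless sequence like this is to prepend a placeholder description line before indexing. This is only a sketch: `add_placeholder_header` and the default name `unnamed_sequence` are made up for illustration, and it assumes the file holds exactly one sequence, since without > lines there is no way to tell where one record ends and the next begins.

```python
def add_placeholder_header(text, name="unnamed_sequence"):
    """Return `text` with a '>name' line prepended if it has no header line."""
    # Leave the input untouched if any line already looks like a FASTA header.
    if any(line.startswith(">") for line in text.splitlines()):
        return text
    return ">{}\n{}".format(name, text)
```

The result can be written to a new file and opened as an ordinary FASTA file.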

mdshw5 · 2018-10-14T19:50:02Z

I've added a case for handling files with no valid description lines and pushed a new release (https://github.com/mdshw5/pyfaidx/releases/tag/v0.5.5.1) that should be on PyPI in a few minutes.

Benjamin-Lee · 2018-10-14T19:58:33Z

Thanks a ton!

mdshw5 added a commit that referenced this issue Oct 14, 2018

Better handling for malformed files (#144)

2b79661

mdshw5 added a commit that referenced this issue Oct 14, 2018

Move variable initialization out of loop (#144)

b5873a3

mdshw5 added a commit that referenced this issue Oct 14, 2018

Add a test for #144

b2f377a

mdshw5 closed this as completed Oct 14, 2018
