Read actual defline from FASTA file index "gap" #54

mdshw5 · 2015-03-03T03:38:53Z

One current design limitation of pyfaidx is that it mirrors the samtools indexing behavior of truncating headers after whitespace. There is a good reason for this: any whitespace in the identifier would break the index file. A side effect is that the "description" part of a header is frequently lost when reading from the file using the index. An option to recover the full header line seems useful, and pretty cheap to implement.

To determine the byte offset and length of the header from the index file, we can compute the byte end of the preceding sequence by adding the unprintable line-terminator bytes to that sequence's offset and length; this should be the byte start of the real header line. For the first record, the header simply starts at byte 0. We can then read from the header byte start to the sequence offset and save this as something like Sequence.long_name.
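The offset arithmetic described above can be sketched roughly as follows. This is only an illustration, not the pyfaidx implementation: `FaiRecord`, `parse_fai`, `sequence_end`, and `read_long_names` are hypothetical names. It assumes the standard five-column `.fai` layout (name, length, offset, bases per line, bytes per line), index entries in file order, and no blank lines between records.

```python
import math
from typing import Dict, List, NamedTuple


class FaiRecord(NamedTuple):
    """One line of a samtools-style .fai index (first five columns)."""
    name: str
    length: int      # number of bases in the sequence
    offset: int      # byte offset of the first base
    linebases: int   # bases per line
    linewidth: int   # bytes per line, including the line terminator


def parse_fai(fai_text: str) -> List[FaiRecord]:
    """Parse the text of a .fai file into records (hypothetical helper)."""
    records = []
    for line in fai_text.splitlines():
        if not line.strip():
            continue
        name, length, offset, linebases, linewidth = line.split("\t")[:5]
        records.append(
            FaiRecord(name, int(length), int(offset), int(linebases), int(linewidth))
        )
    return records


def sequence_end(rec: FaiRecord) -> int:
    """Byte position just past the sequence, counting line terminators."""
    newline_bytes = rec.linewidth - rec.linebases
    n_lines = math.ceil(rec.length / rec.linebases) if rec.linebases else 0
    return rec.offset + rec.length + n_lines * newline_bytes


def read_long_names(fasta_bytes: bytes, records: List[FaiRecord]) -> Dict[str, str]:
    """Recover full header lines from the gaps between indexed sequences."""
    long_names = {}
    header_start = 0  # the first header starts at the beginning of the file
    for rec in records:
        # The header occupies the bytes between the previous sequence's end
        # and this sequence's offset.
        raw = fasta_bytes[header_start:rec.offset].decode().strip()
        long_names[rec.name] = raw[1:] if raw.startswith(">") else raw
        header_start = sequence_end(rec)
    return long_names
```

For a real file, one would presumably `seek()` to the header start and read only `offset - header_start` bytes rather than holding the whole FASTA in memory; the in-memory `bytes` argument here just keeps the sketch short.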


mdshw5 added the enhancement label Mar 3, 2015

mdshw5 self-assigned this Mar 3, 2015

mdshw5 added this to the v0.3.7 milestone Mar 3, 2015

mdshw5 closed this as completed in 2b0cef8 Mar 3, 2015

mdshw5 added a commit that referenced this issue Mar 3, 2015

Bump version number and add documentation for #54.

1ed80af
