Create or load htslib .fai and .gzi index files when using BGZF files #126

Closed
KwatMDPhD opened this issue Sep 13, 2017 · 12 comments

@KwatMDPhD
Contributor

Hi,

When I run samtools faidx file.fa.gz and then open the same file.fa.gz with pyfaidx, I get an error saying that file.fa.gz is not a valid BGZF file. When I delete file.fa.gz.fai and run pyfaidx again, the error disappears. I believe this is because the .fai that pyfaidx creates differs from the one samtools creates.

If this behavior is real, is it possible to unify the .fai of pyfaidx and samtools? Thoughts?

Kind regards,
Kwat

@KwatMDPhD
Contributor Author

Follow up: samtools faidx fails using the index created from pyfaidx as well.
[Screenshot, 2017-09-13 05:04:14]

mdshw5 added a commit that referenced this issue Sep 13, 2017
@mdshw5
Owner

mdshw5 commented Sep 13, 2017

I think this is a good idea, and the work to support this is:

  1. Store compressed file offsets, which is what it looks like samtools is storing in its .fai, instead of virtual offsets, which I'm doing in this library
  2. Write methods to build and load the .gzi index file that htslib creates, which stores the number of gzipped blocks, and (compressed offset, uncompressed offset) for each block start byte (currently described in samtools/htslib#473, "Is the bgzip index format (.gzi) produced by faidx documented anywhere?")
  3. Change Faidx logic to use the .fai and .gzi indices in combination to recreate virtual offsets (I think this is what samtools faidx is doing)
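
To make steps 2 and 3 concrete, here is a minimal sketch, not the pyfaidx implementation. It assumes the .gzi layout is a little-endian uint64 entry count followed by that many (compressed offset, uncompressed offset) uint64 pairs, and that the first block at (0, 0) is implicit rather than stored. The function names `parse_gzi` and `virtual_offset` are hypothetical. The virtual offset packing (block start in the upper 48 bits, within-block offset in the lower 16 bits) is the usual BGZF convention.

```python
import struct


def parse_gzi(data: bytes):
    """Return block starts as a list of (compressed_offset, uncompressed_offset).

    Assumed layout: uint64 entry count, then that many pairs of uint64,
    all little-endian. The first block (0, 0) is assumed to be implicit,
    so it is prepended here.
    """
    (n_entries,) = struct.unpack_from("<Q", data, 0)
    entries = [struct.unpack_from("<QQ", data, 8 + 16 * i) for i in range(n_entries)]
    return [(0, 0)] + entries


def virtual_offset(block_start: int, within_block: int) -> int:
    """Combine a compressed block start and an offset inside the
    uncompressed block into a BGZF virtual offset."""
    if not 0 <= within_block < 1 << 16:
        raise ValueError("within-block offset must fit in 16 bits")
    return (block_start << 16) | within_block
```

With a parsed index, `virtual_offset(c, w)` gives a value of the same form that BGZF readers use for seeking, where `c` comes from the .gzi and `w` is the remaining distance into the uncompressed block.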

@mdshw5 mdshw5 changed the title from "samtools .fai and pyfaidx .fai" to "Create or load htslib .fai and .gzi index files when using BGZF files" Sep 13, 2017
@mdshw5 mdshw5 self-assigned this Sep 13, 2017
@KwatMDPhD
Contributor Author

KwatMDPhD commented Sep 16, 2017

I checked the .fai files created from pyfaidx and samtools, and they are the same. Also, samtools must have .gzi to work. Hope this information helps.

@KwatMDPhD
Contributor Author

Closing this issue on the assumption that #1701 resolves it. Thanks :)

@mdshw5
Owner

mdshw5 commented Oct 24, 2018

Are you able to test whether the issue is fixed? I’ll look into it as well, but I believe our BGZF indices may still be incompatible with samtools.

@mdshw5 mdshw5 reopened this Oct 24, 2018
@mdshw5
Owner

mdshw5 commented Oct 24, 2018

Specifically, the recursion issue in biopython is fixed, but I’d like to implement .gzi creation and more efficient sequence retrieval in pyfaidx for BGZF files. Currently pyfaidx must fetch from the beginning of a record to the user-specified end coordinate and then return the requested subsequence from memory. This isn’t as efficient as samtools; the limiting factor is understanding how samtools generates virtual offsets from the .gzi to seek directly to the start coordinate.
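
One plausible way to seek directly to a start coordinate, sketched under assumptions rather than taken from samtools: compute the uncompressed byte offset of the coordinate from the .fai columns (OFFSET, LINEBASES, LINEWIDTH), then binary-search the .gzi block starts for the block that contains it. The helper names `fai_byte_offset` and `locate_block` are hypothetical, and `blocks` is assumed to be a sorted list of (compressed offset, uncompressed offset) pairs that includes the implicit first block (0, 0).

```python
from bisect import bisect_right


def fai_byte_offset(offset: int, linebases: int, linewidth: int, pos: int) -> int:
    """Uncompressed byte offset of 0-based sequence position `pos`,
    using the OFFSET, LINEBASES and LINEWIDTH columns of a .fai record."""
    return offset + (pos // linebases) * linewidth + pos % linebases


def locate_block(blocks, uoffset: int):
    """Find the block holding uncompressed offset `uoffset`.

    Returns (compressed_block_start, offset_within_uncompressed_block).
    """
    starts = [u for _, u in blocks]
    i = bisect_right(starts, uoffset) - 1  # last block starting at or before uoffset
    compressed_start, uncompressed_start = blocks[i]
    return compressed_start, uoffset - uncompressed_start
```

The returned pair is what a BGZF virtual offset is built from (block start shifted left by 16 bits, combined with the within-block offset), so a reader could decompress only from that block onward instead of from the start of the record.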

@KwatMDPhD
Contributor Author

I see. When this is in place, please let us know. Thanks @mdshw5

@mdshw5
Owner

mdshw5 commented Oct 30, 2019

Re-opening to work on this issue before the end of the year.

@mdshw5 mdshw5 reopened this Oct 30, 2019
@IPetrik

IPetrik commented Jun 25, 2020

Any progress on this?

@mdshw5
Owner

mdshw5 commented Jun 25, 2020

@IPetrik I did some work on this earlier this year, but never got anything working. I believe I pushed the work I had here: db7f140. I'll take a look on my local machine and see if there's anything else. I'd really like to get this feature working properly, so if you've got ideas please share.

@mdshw5
Owner

mdshw5 commented Jun 25, 2020

@IPetrik Forget my previous comment. I have some work on my local machine that's completely different. I'll update the samtools_bgzf_compatibility branch with what I have.

mdshw5 added a commit that referenced this issue Jun 25, 2020
mdshw5 added a commit that referenced this issue Jun 25, 2020
There is a lot here that doesn't work, but mainly I was trying
to figure out the format of the GZI file and provide methods to
unpack and pack the binary on-disk format. There are also methods
for loading the GZI into an object for use by Faidx.
@mdshw5
Owner

mdshw5 commented Jun 25, 2020

I've opened a PR with the work for this issue in #164. If I have some time this summer I'll come back and keep working on it. It doesn't seem like there's much left to do except finish testing the GZI packing/unpacking and implement methods to create and read the on-disk format.
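
For illustration only, a round-trip check for GZI packing and unpacking might look like the sketch below. It is not the code in #164. The `GziIndex` class and its method names are hypothetical, and the on-disk layout is assumed to be a little-endian uint64 entry count followed by (compressed offset, uncompressed offset) uint64 pairs.

```python
import struct
from dataclasses import dataclass, field


@dataclass
class GziIndex:
    """Hypothetical in-memory holder for .gzi entries."""

    # (compressed_offset, uncompressed_offset) for each stored block start
    entries: list = field(default_factory=list)

    def to_bytes(self) -> bytes:
        """Pack the entries into the assumed on-disk layout."""
        out = struct.pack("<Q", len(self.entries))
        for compressed, uncompressed in self.entries:
            out += struct.pack("<QQ", compressed, uncompressed)
        return out

    @classmethod
    def from_bytes(cls, data: bytes) -> "GziIndex":
        """Unpack the assumed on-disk layout, rejecting short or long input."""
        if len(data) < 8:
            raise ValueError("GZI data too short to hold an entry count")
        (n,) = struct.unpack_from("<Q", data, 0)
        if len(data) != 8 + 16 * n:
            raise ValueError("GZI size does not match its entry count")
        return cls([struct.unpack_from("<QQ", data, 8 + 16 * i) for i in range(n)])


# Round trip: packing then unpacking should reproduce the same entries.
index = GziIndex([(100, 65280), (250, 130560)])
restored = GziIndex.from_bytes(index.to_bytes())
```

A test along these lines only checks that packing and unpacking are inverses of each other; comparing the packed bytes against a .gzi written by bgzip would still be needed to confirm compatibility with htslib.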
