Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

storing bgzip-compressed genomes #41

Closed
golobor opened this issue Jun 11, 2019 · 12 comments
Closed

storing bgzip-compressed genomes #41

golobor opened this issue Jun 11, 2019 · 12 comments

Comments

@golobor
Copy link

golobor commented Jun 11, 2019

hi!
Many thanks for writing this amazing library!!
One issue that personally prevents my lab from switching to genomepy is that it seems to store genomes in uncompressed .fa files. Our concern is that, with potentially many genomes that we have to deal with, the library of genomes will take a lot of space; moreover, given that we use network storage, storing data uncompressed will reduce the I/O performance.

Have you considered allowing optional compression of genomes with bgzip? bgzip plays well with faidx/pyfaidx and does not have any downsides, at least as much as we're concerned.

Thank you!
Anton.

@simonvh
Copy link
Member

simonvh commented Jun 12, 2019

This sounds like a great suggestion, thanks! I vaguely remember there was I reason I chose to use uncompressed fasta files, but this might be completely unnecessary. I will look into this!

@golobor
Copy link
Author

golobor commented Jun 12, 2019 via email

@mdshw5
Copy link

mdshw5 commented Jun 14, 2019

@simonvh Please note that the pyfaidx bgzip routines are not as efficient as samtools yet. See mdshw5/pyfaidx#126 (comment). I'd love to fully implement bgzip sequence region retrieval in the next pyfaidx release, and will update this issue at that time.

The work required is to store the list of BGZF blocks, along with their uncompressed offsets in the file (see the initial work I started on this here: mdshw5/pyfaidx@db7f140#diff-9ca6d0b185ffae472a824fcdf50f0f9eR285) and then do a binary search to find the first block left of the required uncompressed offset to start reading from, read and decompress the required sequence, and then trim the resulting string so that it's the correct length before returning.

BGZF retrieval will always have a slight overhead compared to uncompressed FASTA, but I believe the tradeoff is well worth it.

@simonvh
Copy link
Member

simonvh commented Jun 24, 2019

Thanks for pitching in @mdshw5! At the moment I'm leaning towards bgzip-compressed genomes as a configurable option. In that case, I think a slight performance penalty would not be a problem. I'll be sure to document it.

There are currently still some tools that don't accept bgzipped genomes:

  • bedtools getfasta- this can be replaced by faidx, so not a showstopper, however, some people might prefer to be able to use bedtools.
  • Some aligners don't work on compressed FASTA files. This is also not a big deal, it just means that the index plugins need to do some extra work, depend

@mdshw5
Copy link

mdshw5 commented Jun 24, 2019

I've almost completed full implementation of BGZF indexing, and when all test are passing I'll update this thread with a new release of pyfaidx that uses the samtools/tabix .gzi block index.

@simonvh
Copy link
Member

simonvh commented Jun 25, 2019 via email

simonvh added a commit that referenced this issue Jul 11, 2019
@simonvh
Copy link
Member

simonvh commented Sep 11, 2019

@golobor release 0.6.0 of genomepy now has this functionality. Please let me know if there are bugs or if it doesn't work as expected.

I'm leaving this issue open so that @mdshw5 can update when pyfaidx is updated.

@simonvh simonvh reopened this Sep 11, 2019
@golobor
Copy link
Author

golobor commented Sep 11, 2019

wow, thanks a lot! It's going to take some time for me to test it, but I 100% will.
Thank you for your work!!

@mdshw5
Copy link

mdshw5 commented Sep 11, 2019

I've been busy with some other work, but the "correct" bgzf access code is about 90% complete. I'll update here when it's ready.

@peterch405
Copy link

I noticed the GRCh38.p13.fa.sizes and gaps files are empty when using compression. Not sure if the two are linked but removing compression produced the expected sizes file.

@siebrenf
Copy link
Member

Hi @peterch405, I've tried to test this with the latest version (0.9.1) and cannot reproduce this.
Did you spot the .gz extension in the filename?

genomepy install GRCh38.p13 -b
head ~/.local/share/genomes/GRCh38.p13/GRCh38.p13.fa.gz.sizes
1       248956422
10      133797422
11      135086622
12      133275309
13      114364328
14      107043718
15      101991189
16      90338345
17      83257441
18      80373285

If this problem persists we can look at it in a separate issue!

@peterch405
Copy link

I tried it again on my local machine and it works. Must have been something about the cluster environment. I had a separate issue, but I think unrelated. I will post a new issue.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

5 participants