storing bgzip-compressed genomes #41

golobor · 2019-06-11T19:51:01Z

hi!
Many thanks for writing this amazing library!!
One issue that personally prevents my lab from switching to genomepy is that it seems to store genomes in uncompressed .fa files. Our concern is that, with potentially many genomes that we have to deal with, the library of genomes will take a lot of space; moreover, given that we use network storage, storing data uncompressed will reduce the I/O performance.

Have you considered allowing optional compression of genomes with bgzip? bgzip plays well with faidx/pyfaidx and does not have any downsides, at least as far as we're concerned.

Thank you!
Anton.

simonvh · 2019-06-12T07:02:37Z

This sounds like a great suggestion, thanks! I vaguely remember there was a reason I chose to use uncompressed fasta files, but this might be completely unnecessary. I will look into this!

golobor · 2019-06-12T19:46:14Z

awesome! Please let me know if you need any help or would like to bounce ideas! Re: potential issues - agreed, in my experience, most if not all tools accept bgzip-compressed files.


mdshw5 · 2019-06-14T18:16:45Z

@simonvh Please note that the pyfaidx bgzip routines are not as efficient as samtools yet. See mdshw5/pyfaidx#126 (comment). I'd love to fully implement bgzip sequence region retrieval in the next pyfaidx release, and will update this issue at that time.

The work required is to store the list of BGZF blocks along with their uncompressed offsets in the file (see the initial work I started on this here: mdshw5/pyfaidx@db7f140#diff-9ca6d0b185ffae472a824fcdf50f0f9eR285). A binary search then finds the last block that starts at or before the required uncompressed offset. Reading starts from that block: the required sequence is read and decompressed, and the resulting string is trimmed to the correct length before returning.
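
To make the lookup described above concrete, here is a minimal sketch in Python. It is not the pyfaidx implementation. It assumes plain zlib-compressed chunks as stand-ins for real BGZF blocks, a tiny block size for readability, and hypothetical helper names (`compress_blocks`, `fetch`); everything is held in memory rather than read from a file.

```python
import bisect
import zlib

# Assumption: a tiny block size for illustration. Real BGZF blocks hold up to 64 KiB.
BLOCK_SIZE = 16


def compress_blocks(data: bytes, block_size: int = BLOCK_SIZE):
    """Compress data block by block (hypothetical helper).

    Returns the compressed blocks and the uncompressed start offset of each.
    """
    blocks, starts = [], []
    for start in range(0, len(data), block_size):
        blocks.append(zlib.compress(data[start:start + block_size]))
        starts.append(start)
    return blocks, starts


def fetch(blocks, starts, offset: int, length: int) -> bytes:
    """Return `length` bytes beginning at uncompressed `offset` (hypothetical helper)."""
    # Binary search: the last block whose uncompressed start is <= offset.
    i = bisect.bisect_right(starts, offset) - 1
    skip = offset - starts[i]
    out = bytearray()
    # Decompress consecutive blocks until the requested span is covered.
    while i < len(blocks) and len(out) < skip + length:
        out += zlib.decompress(blocks[i])
        i += 1
    # Trim the bytes before `offset` and anything past the requested length.
    return bytes(out[skip:skip + length])


sequence = b"ACGTACGTTTGACCAGTNNNNACGTGGCATTACGGA"
blocks, starts = compress_blocks(sequence)
region = fetch(blocks, starts, offset=14, length=10)
```

The request above starts in the first block and ends in the second, so it exercises both the binary search and the final trim. Only the blocks that overlap the requested region are decompressed, which is where the saving over decompressing the whole file comes from.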

BGZF retrieval will always have a slight overhead compared to uncompressed FASTA, but I believe the tradeoff is well worth it.

simonvh · 2019-06-24T09:40:18Z

Thanks for pitching in @mdshw5! At the moment I'm leaning towards bgzip-compressed genomes as a configurable option. In that case, I think a slight performance penalty would not be a problem. I'll be sure to document it.

There are currently still some tools that don't accept bgzipped genomes:

bedtools getfasta: this can be replaced by faidx, so it is not a showstopper. However, some people might prefer to be able to use bedtools.
Some aligners don't work on compressed FASTA files. This is also not a big deal, it just means that the index plugins need to do some extra work, depend

mdshw5 · 2019-06-24T14:38:38Z

I've almost completed the full implementation of BGZF indexing, and when all tests are passing I'll update this thread with a new release of pyfaidx that uses the samtools/tabix .gzi block index.

simonvh · 2019-06-25T07:12:15Z

Sounds fantastic!

simonvh · 2019-09-11T08:52:30Z

@golobor release 0.6.0 of genomepy now has this functionality. Please let me know if there are bugs or if it doesn't work as expected.

I'm leaving this issue open so that @mdshw5 can update when pyfaidx is updated.

golobor · 2019-09-11T09:03:29Z

wow, thanks a lot! It's going to take some time for me to test it, but I 100% will.
Thank you for your work!!

mdshw5 · 2019-09-11T14:37:42Z

I've been busy with some other work, but the "correct" bgzf access code is about 90% complete. I'll update here when it's ready.

peterch405 · 2020-11-13T02:05:48Z

I noticed the GRCh38.p13.fa.sizes and gaps files are empty when using compression. Not sure if the two are linked, but removing compression produced the expected sizes file.

siebrenf · 2020-11-13T15:51:07Z

Hi @peterch405, I've tried to test this with the latest version (0.9.1) and cannot reproduce this.
Did you spot the .gz extension in the filename?

genomepy install GRCh38.p13 -b

head ~/.local/share/genomes/GRCh38.p13/GRCh38.p13.fa.gz.sizes
1       248956422
10      133797422
11      135086622
12      133275309
13      114364328
14      107043718
15      101991189
16      90338345
17      83257441
18      80373285

If this problem persists we can look at it in a separate issue!

peterch405 · 2020-11-17T06:46:30Z

I tried it again on my local machine and it works. Must have been something about the cluster environment. I had a separate issue, but I think unrelated. I will post a new issue.

simonvh added the enhancement label Jun 12, 2019

simonvh added a commit that referenced this issue Jul 11, 2019

bgzip-compressed genome (#41)

f8f6774

simonvh closed this as completed in 56650ff Sep 11, 2019

simonvh reopened this Sep 11, 2019

siebrenf closed this as completed May 31, 2023
