storing bgzip-compressed genomes #41
Comments
This sounds like a great suggestion, thanks! I vaguely remember there was a reason I chose to use uncompressed FASTA files, but this might be completely unnecessary. I will look into this! |
awesome! Please, let me know if you need any help or would like to bounce
ideas!
Re: potential issues - agreed. In my experience, most if not all tools
accept bgzip-compressed files.
|
@simonvh Please note that the pyfaidx bgzip routines are not as efficient as samtools yet. See mdshw5/pyfaidx#126 (comment). I'd love to fully implement bgzip sequence region retrieval in the next pyfaidx release, and will update this issue at that time. The work required is to store the list of BGZF blocks along with their uncompressed offsets in the file (see the initial work I started on this here: mdshw5/pyfaidx@db7f140#diff-9ca6d0b185ffae472a824fcdf50f0f9eR285). Retrieval then does a binary search to find the first block to the left of the required uncompressed offset, reads and decompresses from that block onward until the required sequence is covered, and trims the resulting string to the correct length before returning. BGZF retrieval will always have a slight overhead compared to uncompressed FASTA, but I believe the tradeoff is well worth it. |
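For anyone following along, the block lookup described above can be sketched roughly as follows. This is only an illustration and not the pyfaidx implementation: it uses plain concatenated gzip members from the standard library as a stand-in for real BGZF blocks, and the names `write_blocks`, `fetch`, and `BLOCK_SIZE` are hypothetical.

```python
import bisect
import gzip
import io

# Hypothetical, deliberately tiny block size for illustration only.
BLOCK_SIZE = 64


def write_blocks(data: bytes, block_size: int = BLOCK_SIZE):
    """Compress `data` as independent gzip members (a stand-in for BGZF blocks).

    Returns (blob, index), where index is a list of
    (uncompressed_offset, compressed_offset) pairs, one per block.
    """
    out = io.BytesIO()
    index = []
    for start in range(0, len(data), block_size):
        index.append((start, out.tell()))
        out.write(gzip.compress(data[start:start + block_size]))
    return out.getvalue(), index


def fetch(blob: bytes, index, start: int, length: int) -> bytes:
    """Return `length` uncompressed bytes beginning at uncompressed offset `start`."""
    if start < 0 or length < 0:
        raise ValueError("start and length must be non-negative")
    if not index:
        return b""
    # Binary search: last block whose uncompressed offset is <= start.
    u_offsets = [u for u, _ in index]
    i = bisect.bisect_right(u_offsets, start) - 1
    skip = start - index[i][0]  # bytes to discard at the front of the first block
    pieces, have = [], 0
    # Decompress consecutive blocks until the requested range is covered.
    while i < len(index) and have < skip + length:
        c_start = index[i][1]
        c_end = index[i + 1][1] if i + 1 < len(index) else len(blob)
        chunk = gzip.decompress(blob[c_start:c_end])
        pieces.append(chunk)
        have += len(chunk)
        i += 1
    # Trim to exactly the requested range.
    return b"".join(pieces)[skip:skip + length]
```

Only the blocks overlapping the requested range are decompressed, which is where the overhead relative to an uncompressed FASTA comes from: at most one partial block at each end of the range.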
Thanks for pitching in @mdshw5! At the moment I'm leaning towards bgzip-compressed genomes as a configurable option. In that case, I think a slight performance penalty would not be a problem. I'll be sure to document it. There are currently still some tools that don't accept bgzipped genomes:
|
I've almost completed the full implementation of BGZF indexing, and when all tests are passing I'll update this thread with a new release of |
Sounds fantastic!
|
wow, thanks a lot! It's going to take some time for me to test it, but I 100% will. |
I've been busy with some other work, but the "correct" bgzf access code is about 90% complete. I'll update here when it's ready. |
I noticed the GRCh38.p13.fa.sizes and gaps files are empty when using compression. Not sure if the two are linked but removing compression produced the expected sizes file. |
Hi @peterch405, I've tried to test this with the latest version (0.9.1) and cannot reproduce this.
If this problem persists we can look at it in a separate issue! |
I tried it again on my local machine and it works. Must have been something about the cluster environment. I had a separate issue, but I think unrelated. I will post a new issue. |
hi!
Many thanks for writing this amazing library!!
One issue that personally prevents my lab from switching to genomepy is that it seems to store genomes in uncompressed .fa files. Our concern is that, with potentially many genomes that we have to deal with, the library of genomes will take a lot of space; moreover, given that we use network storage, storing data uncompressed will reduce the I/O performance.
Have you considered allowing optional compression of genomes with bgzip? bgzip plays well with faidx/pyfaidx and does not have any downsides, at least as far as we're concerned.
Thank you!
Anton.