BGZF recursion issue #125
Comments
Sorry for ignoring this issue - it slipped through the cracks! I'm going to close it and add you to the other similar issue #131.
Just to double-check: did you compress this file with
I'm definitely able to reproduce your issue. (thanks @peterjc for pointing out the source file)
yada, yada, yada ...
It appears that the line lengths are all 61 bytes, so this isn't a long line issue. I'll add some debug flags to
Specifically, I'm indexing a BGZF-compressed file by calling the code at line 476 in 8168b37.
This returns a virtual offset, which I then pass to the BGZF code at lines 570 to 575 in 8168b37.
Is there something obviously wrong with this?
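For readers unfamiliar with the term, a BGZF virtual offset packs two numbers into one 64-bit integer: the file offset of the start of the compressed block (upper 48 bits) and the position inside that block's decompressed data (lower 16 bits). The following is a minimal standalone sketch of that arithmetic, not the code referenced above. Biopython's Bio.bgzf module provides helpers with the same names, but the versions here are written from scratch for illustration.

```python
def make_virtual_offset(block_start: int, within_block: int) -> int:
    """Pack a compressed block start and an in-block offset into one integer."""
    if not 0 <= within_block < 2**16:
        raise ValueError("within_block must fit in 16 bits")
    if not 0 <= block_start < 2**48:
        raise ValueError("block_start must fit in 48 bits")
    return (block_start << 16) | within_block


def split_virtual_offset(virtual_offset: int) -> tuple:
    """Unpack a virtual offset into (block_start, within_block)."""
    return virtual_offset >> 16, virtual_offset & 0xFFFF


# A position 10 bytes into the block that starts at compressed offset 100000.
voffset = make_virtual_offset(100000, 10)
block_start, within_block = split_virtual_offset(voffset)
```

Because the in-block part is only 16 bits, a virtual offset can address at most 64 KiB of decompressed data per block, which is why a long read has to cross many blocks.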
Ah, I think I know what's happening here. I'm calling
It's not trivial, but I could try to come up with a way to be more efficient with my BGZF reads.
@peterjc: maybe I'm missing something, but it seems impossible to efficiently seek to and read a specific substring that may span multiple BGZF blocks unless you store the offset of each block indexed to the string.
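The idea of storing the offset of each block could be sketched roughly as below. This is a hypothetical illustration, not pyfaidx or Biopython code: it assumes you have already collected two parallel lists, data_starts (the uncompressed offset at which each block's data begins, sorted ascending) and block_starts (the compressed file offset of each block). The function name find_block and both list names are made up for this example.

```python
from bisect import bisect_right


def find_block(data_starts, block_starts, uncompressed_offset):
    """Map an uncompressed offset to (block_start, within_block).

    data_starts must be sorted ascending and parallel to block_starts.
    No check is made that the offset lies before the end of the last block.
    """
    # Index of the last block whose data begins at or before the offset.
    i = bisect_right(data_starts, uncompressed_offset) - 1
    if i < 0:
        raise ValueError("offset precedes the first block")
    return block_starts[i], uncompressed_offset - data_starts[i]


# Made-up index for three blocks holding 65280 decompressed bytes each.
data_starts = [0, 65280, 130560]
block_starts = [0, 18000, 36500]
block_start, within_block = find_block(data_starts, block_starts, 70000)
```

With such an index, a substring deep inside a record can be reached with one binary search and one seek instead of reading through every preceding block. The cost is the extra pass needed to build and store the index.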
Does this mean you can print out the problem offset so we can make a tiny test-case using this human FASTA file and just
Note I suspect both
@mdshw5 do you have the length in bytes of the string you want to read, as well as the (BGZF virtual) offset of its start point? That is quite well tested because that's how Biopython's SeqIO get_raw functionality works on BGZF compressed files. However, it did take some careful coding to ensure I could use
FWIW, I also ran into the same problem today, it seems.
When reading many BGZF blocks within a single .read() and/or .readline() call, the old code could fail with a RecursionError. The new test case triggered this with a BGZF file made of many small blocks.
This was the root cause of these issues using Bio/bgzf.py within pyfaidx: mdshw5/pyfaidx#125 and mdshw5/pyfaidx#131.
Closes issue #1701
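To illustrate the failure mode described in that commit message, here is a toy sketch, not the actual Bio/bgzf.py code. ToyBlockReader is a hypothetical class that serves bytes from a list of in-memory blocks. Its read_recursive method calls itself once per block boundary crossed, so a single read spanning more blocks than Python's recursion limit raises RecursionError, while read_iterative does the same work in a loop.

```python
class ToyBlockReader:
    """Hypothetical reader over a list of already-decompressed blocks."""

    def __init__(self, blocks):
        self._blocks = list(blocks)
        self._index = 0
        self._buffer = self._blocks[0] if self._blocks else b""
        self._pos = 0

    def _load_next(self):
        """Advance to the next block; return False once none remain."""
        self._index += 1
        self._pos = 0
        if self._index < len(self._blocks):
            self._buffer = self._blocks[self._index]
            return True
        self._buffer = b""
        return False

    def read_recursive(self, size):
        """One extra stack frame per block crossed: fails on many small blocks."""
        chunk = self._buffer[self._pos:self._pos + size]
        self._pos += len(chunk)
        if len(chunk) == size or not self._load_next():
            return chunk
        return chunk + self.read_recursive(size - len(chunk))

    def read_iterative(self, size):
        """Same result using a loop, so stack depth stays constant."""
        parts = []
        while size > 0:
            chunk = self._buffer[self._pos:self._pos + size]
            self._pos += len(chunk)
            size -= len(chunk)
            parts.append(chunk)
            if size > 0 and not self._load_next():
                break  # Ran out of data before satisfying the request.
        return b"".join(parts)
```

Both methods return identical bytes for small inputs. The difference only shows up when one call has to cross more blocks than the interpreter's recursion limit allows, which matches the "many small blocks" condition in the commit message.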
I can confirm that the solution in biopython/biopython#1701 fixes this issue. Due to my implementation there is still a large performance penalty for fetching small substrings near the end of a record, and I'll open an issue to remind me to explore a solution.
Great. You'll be looking forward to Biopython 1.73 then, which will be the first release to include this fix. Thanks everyone 👍
I'll add a version check around the BGZF code then.
:)
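A version check along these lines might look like the sketch below. It is an assumption about the approach, not the check pyfaidx actually ships. The names FIXED_IN, parse_version, and bgzf_read_is_fixed are hypothetical, and the only fact taken from this thread is that Biopython 1.73 is the first release with the fix.

```python
import re

# First Biopython release containing the fix, per the discussion above.
FIXED_IN = (1, 73)


def parse_version(version: str) -> tuple:
    """Extract (major, minor) from a version string such as '1.73' or '1.73.dev0'."""
    match = re.match(r"(\d+)\.(\d+)", version)
    if match is None:
        raise ValueError("unrecognised version string: %r" % version)
    return int(match.group(1)), int(match.group(2))


def bgzf_read_is_fixed(version: str) -> bool:
    """Return True if the given Biopython version should include the fix.

    Caveat: a development build labelled 1.73.dev0 is treated as fixed here,
    although it could predate the fix.
    """
    return parse_version(version) >= FIXED_IN
```

In practice the string would come from Bio.__version__, for example bgzf_read_is_fixed(Bio.__version__), with BGZF support disabled or a warning raised when it returns False.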
Hi,
I used Homo_sapiens.GRCh38.dna.primary_assembly.fa.gz and queried pyfaidx_handle['12'][111803961:111803962]. But the runtime is very slow and, on further investigation, I learned that I was getting a maximum recursion error. I think using an uncompressed .fa is faster and safer as of now. Thoughts?
Kind regards,
Kwat