Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

BGZF recursion issue #125

Closed
KwatMDPhD opened this issue Sep 13, 2017 · 11 comments
Closed

BGZF recursion issue #125

KwatMDPhD opened this issue Sep 13, 2017 · 11 comments
Assignees

Comments

@KwatMDPhD
Copy link
Contributor

Hi,

I used Homo_sapiens.GRCh38.dna.primary_assembly.fa.gz and quered pyfaidx_handle['12'][111803961:111803962]. But the runtime is very slow and, with further investigation, I learned that I was getting maximum recursion error. I think using uncompressed .fa is faster and safer as of now. Thoughts?

Kind regards,
Kwat

@mdshw5
Copy link
Owner

mdshw5 commented Jan 6, 2018

Sorry for ignoring this issue - it slipped through the cracks! I'm going to close it and add you to the other similar issue #131.

@mdshw5 mdshw5 closed this as completed Jan 6, 2018
@mdshw5
Copy link
Owner

mdshw5 commented Jan 7, 2018

Just to double-check: did you compress this file with bgzip? There should be a check for the BGZF magic and you should get a pyfaidx.UnsupportedCompressionFormat error on a normal gzip file, but I just want to be thorough.

@mdshw5 mdshw5 reopened this Jan 7, 2018
@mdshw5
Copy link
Owner

mdshw5 commented Jan 7, 2018

I'm definitely able to reproduce your issue. (thanks @peterjc for pointing out the source file)

$ wget http://ftp.ensembl.org/pub/current_fasta/homo_sapiens/dna/Homo_sapiens.GRCh38.dna.primary_assembly.fa.gz
--2018-01-06 20:21:07--  http://ftp.ensembl.org/pub/current_fasta/homo_sapiens/dna/Homo_sapiens.GRCh38.dna.primary_assembly.fa.gz
Resolving ftp.ensembl.org... 193.62.193.8
Connecting to ftp.ensembl.org|193.62.193.8|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 881214396 (840M) [application/x-gzip]
Saving to: 'Homo_sapiens.GRCh38.dna.primary_assembly.fa.gz'

Homo_sapiens.GRCh38.dna.primary_assemb 100%[============================================================================>] 840.39M   393KB/s    in 24m 34s 

2018-01-06 20:45:41 (584 KB/s) - 'Homo_sapiens.GRCh38.dna.primary_assembly.fa.gz' saved [881214396/881214396]

$ gunzip Homo_sapiens.GRCh38.dna.primary_assembly.fa.gz 
$ bgzip Homo_sapiens.GRCh38.dna.primary_assembly.fa 
$ python
Python 2.7.10 (default, Feb  7 2017, 00:08:15) 
[GCC 4.2.1 Compatible Apple LLVM 8.0.0 (clang-800.0.34)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> from pyfaidx import Fasta
>>> pyfaidx_handle = Fasta('Homo_sapiens.GRCh38.dna.primary_assembly.fa.gz')
>>> pyfaidx_handle['12'][111803961:111803962]
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Development/pyfaidx/pyfaidx/__init__.py", line 698, in __getitem__
    return self._fa.get_seq(self.name, start + 1, stop)[::step]
  File "/Development/pyfaidx/pyfaidx/__init__.py", line 887, in get_seq
    seq = self.faidx.fetch(name, start, end)
  File "/Development/pyfaidx/pyfaidx/__init__.py", line 532, in fetch
    seq = self.from_file(name, start, end)
  File "/Development/pyfaidx/pyfaidx/__init__.py", line 574, in from_file
    chunk_seq = self.file.read(chunk).decode()
  File "/Development/pyfaidx/venv/lib/python2.7/site-packages/Bio/bgzf.py", line 651, in read
    return data + self.read(size)
  File "/Development/pyfaidx/venv/lib/python2.7/site-packages/Bio/bgzf.py", line 651, in read
    return data + self.read(size)
...
...
...

yada, yada, yada ...

 File "/Development/pyfaidx/venv/lib/python2.7/site-packages/Bio/bgzf.py", line 651, in read
    return data + self.read(size)
  File "/Development/pyfaidx/venv/lib/python2.7/site-packages/Bio/bgzf.py", line 644, in read
    self._load_block()  # will reset offsets
  File "/Development/pyfaidx/venv/lib/python2.7/site-packages/Bio/bgzf.py", line 576, in _load_block
    block_size, self._buffer = _load_bgzf_block(handle, self._text)
  File "/Development/pyfaidx/venv/lib/python2.7/site-packages/Bio/bgzf.py", line 411, in _load_bgzf_block
    if magic != _bgzf_magic:
RuntimeError: maximum recursion depth exceeded in cmp
>>> 

It appears that the line lengths are all 61 bytes, so this isn't a long line issue. I'll add some debug flags to pyfaidx and see if maybe it's a bug in how I'm indexing the file and calculating offsets. @peterjc: what would happen if I tried to read from a virtual offset that doesn't exist (out of bounds read)?

@mdshw5
Copy link
Owner

mdshw5 commented Jan 7, 2018

Specifically I'm indexing a BGZF compressed file by calling the tell method to get the current offset into the beginning of each FASTA record:

thisoffset = fastafile.tell() if self._bgzf else offset

This returns a virtual offset which I then pass to the BGZF seek method, reading from the known virtual offset of the record start, read until the end position, and slice the beginning before returning (can't seek directly to the requested start coordinate because it might be in a different BGZF block).

pyfaidx/pyfaidx/__init__.py

Lines 570 to 575 in 8168b37

with self.lock:
if self._bgzf: # We can't add to virtual offsets, so we need to read from the beginning of the record and trim the beginning if needed
self.file.seek(i.offset)
chunk = start0 + newlines_before + newlines_inside + seq_len
chunk_seq = self.file.read(chunk).decode()
seq = chunk_seq[start0 + newlines_before:]

Is there something obviously wrong with this?

@mdshw5
Copy link
Owner

mdshw5 commented Jan 7, 2018

Ah, I think I know what's happening here. I'm calling BgzfReader.read() and asking for the data between chromosome 12 locations [1: 111803962]. This is due to the way I've decided to avoid potentially adding/subtracting virtual offsets in my Faidx.fetch() method. I think BgzfReader will hit the interpreter recursion limit any time you ask it to read a large string from a file with a large number of newline characters, such as this example.

It's not trivial, but I could try to come up with a way to be more efficient with my BGZF reads. @peterjc: maybe I'm missing something but it seems impossible to seek/read a specific substring efficiently that may span multiple BGZF blocks unless you store the offset of each block indexed to the string.

@peterjc
Copy link

peterjc commented Jan 7, 2018

Does this mean you can print out the problem offset so we can make a tiny test-case using this human FASTA file and just Bio/bgzf.py? That's be very helpful as I right now (continuing our discussion from #131, without having played with the code), I am puzzled as to what might be going wrong here.

Note I suspect both .readline() and .read() may be affected by this recursion issue (I left a TODO comment in both methods).

@mdshw5 do you have the length in bytes of the string you want to read, as well as the (BGZF virtual) offset of its start point? That is quite well tested because that's how Biopython's SeqIO get_raw functionality works on BGZF compressed files. However, it did take some careful coding to ensure I could use .tell() to get the virtual offset needed while building the index (i.e. avoiding any virtual pointer arithmetic because is not supported).

@UnkindPartition
Copy link

FWIW, I also ran into the same problem today, it seems.

% faidx -o "xxx.fa" -b "xxx.narrowPeak" "hg38.fa.gz"
Traceback (most recent call last):
  File "/mnt/fls01-home01/mbmhtrcb/miniconda3/bin/faidx", line 10, in <module>
    sys.exit(main())
  File "/mnt/fls01-home01/mbmhtrcb/miniconda3/lib/python3.6/site-packages/pyfaidx/cli.py", line 197, in main
    write_sequence(args)
  File "/mnt/fls01-home01/mbmhtrcb/miniconda3/lib/python3.6/site-packages/pyfaidx/cli.py", line 52, in write_sequence
    for line in fetch_sequence(args, fasta, name, start, end):
  File "/mnt/fls01-home01/mbmhtrcb/miniconda3/lib/python3.6/site-packages/pyfaidx/cli.py", line 69, in fetch_sequence
    sequence = fasta[name][start:end]
  File "/mnt/fls01-home01/mbmhtrcb/miniconda3/lib/python3.6/site-packages/pyfaidx/__init__.py", line 789, in __getitem__
    return self._fa.get_seq(self.name, start + 1, stop)[::step]
  File "/mnt/fls01-home01/mbmhtrcb/miniconda3/lib/python3.6/site-packages/pyfaidx/__init__.py", line 1015, in get_seq
    seq = self.faidx.fetch(name, start, end)
  File "/mnt/fls01-home01/mbmhtrcb/miniconda3/lib/python3.6/site-packages/pyfaidx/__init__.py", line 613, in fetch
    seq = self.from_file(name, start, end)
  File "/mnt/fls01-home01/mbmhtrcb/miniconda3/lib/python3.6/site-packages/pyfaidx/__init__.py", line 656, in from_file
    chunk_seq = self.file.read(chunk).decode()
  File "/mnt/fls01-home01/mbmhtrcb/miniconda3/lib/python3.6/site-packages/Bio/bgzf.py", line 655, in read
    return data + self.read(size)
  File "/mnt/fls01-home01/mbmhtrcb/miniconda3/lib/python3.6/site-packages/Bio/bgzf.py", line 655, in read
    return data + self.read(size)
  File "/mnt/fls01-home01/mbmhtrcb/miniconda3/lib/python3.6/site-packages/Bio/bgzf.py", line 655, in read
    return data + self.read(size)
  [Previous line repeated 983 more times]
  File "/mnt/fls01-home01/mbmhtrcb/miniconda3/lib/python3.6/site-packages/Bio/bgzf.py", line 648, in read
    self._load_block()  # will reset offsets
  File "/mnt/fls01-home01/mbmhtrcb/miniconda3/lib/python3.6/site-packages/Bio/bgzf.py", line 577, in _load_block
    block_size, self._buffer = _load_bgzf_block(handle, self._text)
  File "/mnt/fls01-home01/mbmhtrcb/miniconda3/lib/python3.6/site-packages/Bio/bgzf.py", line 412, in _load_bgzf_block
    if magic != _bgzf_magic:
RecursionError: maximum recursion depth exceeded in comparison

peterjc pushed a commit to biopython/biopython that referenced this issue Aug 23, 2018
When reading many BGZF blocks within a single .read() and/or
.readline() call the old code could fail with a RecursionError.

The new test case triggered this with a BGZF file made of
many small blocks.

This was the root cause of these issues using Bio/bgzf.py within
pyfaidx:

mdshw5/pyfaidx#125
mdshw5/pyfaidx#131

Closes issue #1701
@mdshw5
Copy link
Owner

mdshw5 commented Aug 23, 2018

I can confirm that the solution in biopython/biopython#1701 fixes this issue. Due to my implementation there is still a large performance penalty for fetching small substrings near the end of a record, and I'll open an issue to remind me to explore a solution.

@peterjc
Copy link

peterjc commented Aug 23, 2018

Great. You'll be looking forward to Biopython 1.73 then which will be the first release to include this fix.

Thanks everyone 👍

@mdshw5
Copy link
Owner

mdshw5 commented Aug 23, 2018

I'll add a version check around the BGZF code then.

@mdshw5 mdshw5 closed this as completed in 7b835e7 Aug 23, 2018
@KwatMDPhD
Copy link
Contributor Author

:)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants