BGZF recursion issue #125

KwatMDPhD · 2017-09-13T11:53:55Z

Hi,

I used Homo_sapiens.GRCh38.dna.primary_assembly.fa.gz and quered pyfaidx_handle['12'][111803961:111803962]. But the runtime is very slow and, with further investigation, I learned that I was getting maximum recursion error. I think using uncompressed .fa is faster and safer as of now. Thoughts?

Kind regards,
Kwat

The text was updated successfully, but these errors were encountered:

mdshw5 · 2018-01-06T17:53:07Z

Sorry for ignoring this issue - it slipped through the cracks! I'm going to close it and add you to the other similar issue #131.

mdshw5 · 2018-01-07T01:16:49Z

Just to double-check: did you compress this file with bgzip? There should be a check for the BGZF magic and you should get a pyfaidx.UnsupportedCompressionFormat error on a normal gzip file, but I just want to be thorough.

mdshw5 · 2018-01-07T03:41:30Z

I'm definitely able to reproduce your issue. (thanks @peterjc for pointing out the source file)

$ wget http://ftp.ensembl.org/pub/current_fasta/homo_sapiens/dna/Homo_sapiens.GRCh38.dna.primary_assembly.fa.gz
--2018-01-06 20:21:07--  http://ftp.ensembl.org/pub/current_fasta/homo_sapiens/dna/Homo_sapiens.GRCh38.dna.primary_assembly.fa.gz
Resolving ftp.ensembl.org... 193.62.193.8
Connecting to ftp.ensembl.org|193.62.193.8|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 881214396 (840M) [application/x-gzip]
Saving to: 'Homo_sapiens.GRCh38.dna.primary_assembly.fa.gz'

Homo_sapiens.GRCh38.dna.primary_assemb 100%[============================================================================>] 840.39M   393KB/s    in 24m 34s 

2018-01-06 20:45:41 (584 KB/s) - 'Homo_sapiens.GRCh38.dna.primary_assembly.fa.gz' saved [881214396/881214396]

$ gunzip Homo_sapiens.GRCh38.dna.primary_assembly.fa.gz 
$ bgzip Homo_sapiens.GRCh38.dna.primary_assembly.fa 
$ python
Python 2.7.10 (default, Feb  7 2017, 00:08:15) 
[GCC 4.2.1 Compatible Apple LLVM 8.0.0 (clang-800.0.34)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> from pyfaidx import Fasta
>>> pyfaidx_handle = Fasta('Homo_sapiens.GRCh38.dna.primary_assembly.fa.gz')
>>> pyfaidx_handle['12'][111803961:111803962]
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Development/pyfaidx/pyfaidx/__init__.py", line 698, in __getitem__
    return self._fa.get_seq(self.name, start + 1, stop)[::step]
  File "/Development/pyfaidx/pyfaidx/__init__.py", line 887, in get_seq
    seq = self.faidx.fetch(name, start, end)
  File "/Development/pyfaidx/pyfaidx/__init__.py", line 532, in fetch
    seq = self.from_file(name, start, end)
  File "/Development/pyfaidx/pyfaidx/__init__.py", line 574, in from_file
    chunk_seq = self.file.read(chunk).decode()
  File "/Development/pyfaidx/venv/lib/python2.7/site-packages/Bio/bgzf.py", line 651, in read
    return data + self.read(size)
  File "/Development/pyfaidx/venv/lib/python2.7/site-packages/Bio/bgzf.py", line 651, in read
    return data + self.read(size)
...
...
...

yada, yada, yada ...

 File "/Development/pyfaidx/venv/lib/python2.7/site-packages/Bio/bgzf.py", line 651, in read
    return data + self.read(size)
  File "/Development/pyfaidx/venv/lib/python2.7/site-packages/Bio/bgzf.py", line 644, in read
    self._load_block()  # will reset offsets
  File "/Development/pyfaidx/venv/lib/python2.7/site-packages/Bio/bgzf.py", line 576, in _load_block
    block_size, self._buffer = _load_bgzf_block(handle, self._text)
  File "/Development/pyfaidx/venv/lib/python2.7/site-packages/Bio/bgzf.py", line 411, in _load_bgzf_block
    if magic != _bgzf_magic:
RuntimeError: maximum recursion depth exceeded in cmp
>>>

It appears that the line lengths are all 61 bytes, so this isn't a long line issue. I'll add some debug flags to pyfaidx and see if maybe it's a bug in how I'm indexing the file and calculating offsets. @peterjc: what would happen if I tried to read from a virtual offset that doesn't exist (out of bounds read)?

mdshw5 · 2018-01-07T03:49:53Z

Specifically I'm indexing a BGZF compressed file by calling the tell method to get the current offset into the beginning of each FASTA record:

pyfaidx/pyfaidx/__init__.py

Line 476 in 8168b37

thisoffset = fastafile.tell() if self._bgzf else offset

This returns a virtual offset which I then pass to the BGZF seek method, reading from the known virtual offset of the record start, read until the end position, and slice the beginning before returning (can't seek directly to the requested start coordinate because it might be in a different BGZF block).

pyfaidx/pyfaidx/__init__.py

Lines 570 to 575 in 8168b37

    
           with self.lock: 
        
               if self._bgzf:  # We can't add to virtual offsets, so we need to read from the beginning of the record and trim the beginning if needed 
        
                   self.file.seek(i.offset) 
        
                   chunk = start0 + newlines_before + newlines_inside + seq_len 
        
                   chunk_seq = self.file.read(chunk).decode() 
        
                   seq = chunk_seq[start0 + newlines_before:]

Is there something obviously wrong with this?

mdshw5 · 2018-01-07T04:23:16Z

Ah, I think I know what's happening here. I'm calling BgzfReader.read() and asking for the data between chromosome 12 locations [1: 111803962]. This is due to the way I've decided to avoid potentially adding/subtracting virtual offsets in my Faidx.fetch() method. I think BgzfReader will hit the interpreter recursion limit any time you ask it to read a large string from a file with a large number of newline characters, such as this example.

It's not trivial, but I could try to come up with a way to be more efficient with my BGZF reads. @peterjc: maybe I'm missing something but it seems impossible to seek/read a specific substring efficiently that may span multiple BGZF blocks unless you store the offset of each block indexed to the string.

peterjc · 2018-01-07T22:55:02Z

Does this mean you can print out the problem offset so we can make a tiny test-case using this human FASTA file and just Bio/bgzf.py? That's be very helpful as I right now (continuing our discussion from #131, without having played with the code), I am puzzled as to what might be going wrong here.

Note I suspect both .readline() and .read() may be affected by this recursion issue (I left a TODO comment in both methods).

@mdshw5 do you have the length in bytes of the string you want to read, as well as the (BGZF virtual) offset of its start point? That is quite well tested because that's how Biopython's SeqIO get_raw functionality works on BGZF compressed files. However, it did take some careful coding to ensure I could use .tell() to get the virtual offset needed while building the index (i.e. avoiding any virtual pointer arithmetic because is not supported).

UnkindPartition · 2018-07-02T08:11:56Z

FWIW, I also ran into the same problem today, it seems.

% faidx -o "xxx.fa" -b "xxx.narrowPeak" "hg38.fa.gz"
Traceback (most recent call last):
  File "/mnt/fls01-home01/mbmhtrcb/miniconda3/bin/faidx", line 10, in <module>
    sys.exit(main())
  File "/mnt/fls01-home01/mbmhtrcb/miniconda3/lib/python3.6/site-packages/pyfaidx/cli.py", line 197, in main
    write_sequence(args)
  File "/mnt/fls01-home01/mbmhtrcb/miniconda3/lib/python3.6/site-packages/pyfaidx/cli.py", line 52, in write_sequence
    for line in fetch_sequence(args, fasta, name, start, end):
  File "/mnt/fls01-home01/mbmhtrcb/miniconda3/lib/python3.6/site-packages/pyfaidx/cli.py", line 69, in fetch_sequence
    sequence = fasta[name][start:end]
  File "/mnt/fls01-home01/mbmhtrcb/miniconda3/lib/python3.6/site-packages/pyfaidx/__init__.py", line 789, in __getitem__
    return self._fa.get_seq(self.name, start + 1, stop)[::step]
  File "/mnt/fls01-home01/mbmhtrcb/miniconda3/lib/python3.6/site-packages/pyfaidx/__init__.py", line 1015, in get_seq
    seq = self.faidx.fetch(name, start, end)
  File "/mnt/fls01-home01/mbmhtrcb/miniconda3/lib/python3.6/site-packages/pyfaidx/__init__.py", line 613, in fetch
    seq = self.from_file(name, start, end)
  File "/mnt/fls01-home01/mbmhtrcb/miniconda3/lib/python3.6/site-packages/pyfaidx/__init__.py", line 656, in from_file
    chunk_seq = self.file.read(chunk).decode()
  File "/mnt/fls01-home01/mbmhtrcb/miniconda3/lib/python3.6/site-packages/Bio/bgzf.py", line 655, in read
    return data + self.read(size)
  File "/mnt/fls01-home01/mbmhtrcb/miniconda3/lib/python3.6/site-packages/Bio/bgzf.py", line 655, in read
    return data + self.read(size)
  File "/mnt/fls01-home01/mbmhtrcb/miniconda3/lib/python3.6/site-packages/Bio/bgzf.py", line 655, in read
    return data + self.read(size)
  [Previous line repeated 983 more times]
  File "/mnt/fls01-home01/mbmhtrcb/miniconda3/lib/python3.6/site-packages/Bio/bgzf.py", line 648, in read
    self._load_block()  # will reset offsets
  File "/mnt/fls01-home01/mbmhtrcb/miniconda3/lib/python3.6/site-packages/Bio/bgzf.py", line 577, in _load_block
    block_size, self._buffer = _load_bgzf_block(handle, self._text)
  File "/mnt/fls01-home01/mbmhtrcb/miniconda3/lib/python3.6/site-packages/Bio/bgzf.py", line 412, in _load_bgzf_block
    if magic != _bgzf_magic:
RecursionError: maximum recursion depth exceeded in comparison

When reading many BGZF blocks within a single .read() and/or .readline() call the old code could fail with a RecursionError. The new test case triggered this with a BGZF file made of many small blocks. This was the root cause of these issues using Bio/bgzf.py within pyfaidx: mdshw5/pyfaidx#125 mdshw5/pyfaidx#131 Closes issue #1701

mdshw5 · 2018-08-23T15:04:50Z

I can confirm that the solution in biopython/biopython#1701 fixes this issue. Due to my implementation there is still a large performance penalty for fetching small substrings near the end of a record, and I'll open an issue to remind me to explore a solution.

peterjc · 2018-08-23T15:07:08Z

Great. You'll be looking forward to Biopython 1.73 then which will be the first release to include this fix.

Thanks everyone 👍

mdshw5 · 2018-08-23T15:11:49Z

I'll add a version check around the BGZF code then.

KwatMDPhD · 2018-10-24T02:23:29Z

:)

mdshw5 self-assigned this Sep 13, 2017

arivers mentioned this issue Jan 5, 2018

Recursion error when with bfgz files #131

Closed

mdshw5 closed this as completed Jan 6, 2018

mdshw5 reopened this Jan 7, 2018

mdshw5 mentioned this issue Feb 9, 2018

Large faidx. Huge memory usage? #137

Closed

peterjc mentioned this issue Jul 2, 2018

BGZF RecursionError biopython/biopython#1701

Open

mdshw5 closed this as completed in 7b835e7 Aug 23, 2018

mdshw5 mentioned this issue Nov 26, 2018

ImportError: BioPython >= 1.73 must be installed to read block gzip files. #147

Closed

davmlaw mentioned this issue Jul 26, 2019

BGZip slow performance near end of chromosomes #153

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

BGZF recursion issue #125

BGZF recursion issue #125

KwatMDPhD commented Sep 13, 2017

mdshw5 commented Jan 6, 2018

mdshw5 commented Jan 7, 2018

mdshw5 commented Jan 7, 2018

mdshw5 commented Jan 7, 2018 •

edited

Loading

mdshw5 commented Jan 7, 2018

peterjc commented Jan 7, 2018

UnkindPartition commented Jul 2, 2018

mdshw5 commented Aug 23, 2018

peterjc commented Aug 23, 2018

mdshw5 commented Aug 23, 2018

KwatMDPhD commented Oct 24, 2018

BGZF recursion issue #125

BGZF recursion issue #125

Comments

KwatMDPhD commented Sep 13, 2017

mdshw5 commented Jan 6, 2018

mdshw5 commented Jan 7, 2018

mdshw5 commented Jan 7, 2018

mdshw5 commented Jan 7, 2018 • edited Loading

mdshw5 commented Jan 7, 2018

peterjc commented Jan 7, 2018

UnkindPartition commented Jul 2, 2018

mdshw5 commented Aug 23, 2018

peterjc commented Aug 23, 2018

mdshw5 commented Aug 23, 2018

KwatMDPhD commented Oct 24, 2018

mdshw5 commented Jan 7, 2018 •

edited

Loading