BufferedInputBase readline reading buffer twice #371

ELind77 · 2019-10-30T22:59:19Z

Hi,

BufferedInputBase may run through the buffer multiple times while running readline:

https://github.com/RaRe-Technologies/smart_open/blob/1770807b0c8da2a34f749afa17ff70a715bc85de/smart_open/s3.py#L307-L315

I think we can remove this inefficiency if we use find instead of index. E.g.:

while not (self._eof and len(self._buffer) == 0):
    remaining_buffer = self._buffer.peek()
    terminator_index = remaining_buffer.find(self._line_terminator)
    if terminator_index > -1:
        the_line.write(self._read_from_buffer(terminator_index + 1))
        break
    else:
        the_line.write(self._read_from_buffer())
        self._fill_buffer()

Does this make sense or am I missing something?

-- Eric

The text was updated successfully, but these errors were encountered:

mpenkov · 2019-10-31T02:49:00Z

Thank you for raising this.

I'm not sure I'm seeing the same problem as you are. The code you posted does the following:

read stuff into a temproary buffer using peek
call index will find the first occurrence of a newline
read up to and including the newline
return

This sounds like intended behavior to me. What am I missing?

Could you write a unit test that demonstrates it?

ELind77 · 2019-10-31T18:35:01Z

It's not a bug, just an inefficiency. The comment in the code excerpt I posted calls it out specifically:

In the worst case, we're reading the unread part of self._buffer twice here, once in the if condition and once when calling index. This is sub-optimal, but better than the alternative: wrapping .index in a try..except, because that is slower.

The call to __contains__ is going to iterate over every element in the buffer until EOL or the end of the buffer and if EOL is found index will iterate over all of the same elements again. Even though these are built-ins, it's still a performance hit, especially for large buffers. Using find removes the extra iteration.

mpenkov · 2019-11-01T01:13:10Z

Oh, I see. Thank you for your explanation.

Are you able to make a PR? My preference would be to extract that logic into a separate function, e.g. _read_next_line.

mpenkov added this to the 1.9.0 milestone Nov 3, 2019

mpenkov mentioned this issue Mar 9, 2020

implemented efficient readline for ByteBuffer #426

Merged

5 tasks

mpenkov self-assigned this Mar 11, 2020

mpenkov closed this as completed in #426 Mar 15, 2020

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

BufferedInputBase readline reading buffer twice #371

BufferedInputBase readline reading buffer twice #371

ELind77 commented Oct 30, 2019

mpenkov commented Oct 31, 2019

ELind77 commented Oct 31, 2019

mpenkov commented Nov 1, 2019

BufferedInputBase readline reading buffer twice #371

BufferedInputBase readline reading buffer twice #371

Comments

ELind77 commented Oct 30, 2019

mpenkov commented Oct 31, 2019

ELind77 commented Oct 31, 2019

mpenkov commented Nov 1, 2019