Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

BufferedInputBase readline reading buffer twice #371

Closed
ELind77 opened this issue Oct 30, 2019 · 3 comments · Fixed by #426
Closed

BufferedInputBase readline reading buffer twice #371

ELind77 opened this issue Oct 30, 2019 · 3 comments · Fixed by #426
Assignees
Milestone

Comments

@ELind77
Copy link

ELind77 commented Oct 30, 2019

Hi,

BufferedInputBase may run through the buffer multiple times while running readline:

https://github.com/RaRe-Technologies/smart_open/blob/1770807b0c8da2a34f749afa17ff70a715bc85de/smart_open/s3.py#L307-L315

I think we can remove this inefficiency if we use find instead of index. E.g.:

while not (self._eof and len(self._buffer) == 0):
    remaining_buffer = self._buffer.peek()
    terminator_index = remaining_buffer.find(self._line_terminator)
    if terminator_index > -1:
        the_line.write(self._read_from_buffer(terminator_index + 1))
        break
    else:
        the_line.write(self._read_from_buffer())
        self._fill_buffer()

Does this make sense or am I missing something?

-- Eric

@mpenkov
Copy link
Collaborator

mpenkov commented Oct 31, 2019

Thank you for raising this.

I'm not sure I'm seeing the same problem as you are. The code you posted does the following:

  • read stuff into a temproary buffer using peek
  • call index will find the first occurrence of a newline
  • read up to and including the newline
  • return

This sounds like intended behavior to me. What am I missing?

Could you write a unit test that demonstrates it?

@ELind77
Copy link
Author

ELind77 commented Oct 31, 2019

It's not a bug, just an inefficiency. The comment in the code excerpt I posted calls it out specifically:

In the worst case, we're reading the unread part of self._buffer twice here, once in the if condition and once when calling index. This is sub-optimal, but better than the alternative: wrapping .index in a try..except, because that is slower.

The call to __contains__ is going to iterate over every element in the buffer until EOL or the end of the buffer and if EOL is found index will iterate over all of the same elements again. Even though these are built-ins, it's still a performance hit, especially for large buffers. Using find removes the extra iteration.

@mpenkov
Copy link
Collaborator

mpenkov commented Nov 1, 2019

Oh, I see. Thank you for your explanation.

Are you able to make a PR? My preference would be to extract that logic into a separate function, e.g. _read_next_line.

@mpenkov mpenkov added this to the 1.9.0 milestone Nov 3, 2019
@mpenkov mpenkov self-assigned this Mar 11, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants