Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

gh-89550: Buffer GzipFile.write to reduce execution time by ~15% #101251

Merged
merged 17 commits into from
May 8, 2023

Conversation

CCLDArjun
Copy link
Contributor

@CCLDArjun CCLDArjun commented Jan 22, 2023

Currently, all of this is done for every write() call:
https://github.com/python/cpython/blob/main/Lib/gzip.py#L266-L288

This pr stores GzipFile.write() data into a buffer to postpone most of the compression work to when the buffer flushes. This way, we can call into zlib and File.write() less frequently.

benchmarking with timeit:

../python -m timeit -s 'import gzip
' 'with gzip.open("./Python-3.11.1.tgz", "rb") as infile:
' '    with gzip.open("./Python-3.11.1new.tgz", "wb") as outfile:
' '        for line in infile:
' '            outfile.write(line)
' '
'
1 loop, best of 5: 6.36 sec per loop

vs

../python -m timeit -s 'import gzipold as gzip
' 'with gzip.open("./Python-3.11.1.tgz", "rb") as infile:
' '    with gzip.open("./Python-3.11.1new.tgz", "wb") as outfile:
' '        for line in infile:
' '            outfile.write(line)
' '
'
1 loop, best of 5: 7.68 sec per loop

@bedevere-bot
Copy link

Most changes to Python require a NEWS entry.

Please add it using the blurb_it web app or the blurb command-line tool.

@CCLDArjun CCLDArjun changed the title Buffer gzip writes to reduce execution time by ~15% gh-89550: Buffer GzipFile.write to reduce execution time by ~15% Jan 22, 2023
@AlexWaygood AlexWaygood added the performance Performance or resource usage label Jan 22, 2023
@rhpvorderman
Copy link
Contributor

Wow @CCLDArjun such a simple and effective solution. My solution was to split out a _GzipWriter (just like _GzipReader) and try to work from there, but I never got that to work easily and it was quite some extra code too. I am so glad you made this. I hope this will get merged soon!

@CCLDArjun
Copy link
Contributor Author

My solution was to split out a _GzipWriter (just like _GzipReader) and try to work from there

Initially that was my idea too, where I subclassed _pyio.BufferedWriter to override its flush methods but that also wasn't working easily and was pretty messy: 943ca9c

@rhpvorderman
Copy link
Contributor

I am glad that you were able to come up with a better idea. I have been hacking at the gzip module pretty thoroughly but I could not come up with this. I am in awe 😀 . I will port this to python-isal and I have recently started on making a zlib-ng module too. That way people will not have to wait for 3.12 to ship to get all this fast compression.

@orsenthil
Copy link
Member

This looks good to me. A core developer in the expert list will review before merging.

@rhpvorderman
Copy link
Contributor

@orsenthil This is still labeled "awaiting review" rather then "awaiting core review" does that matter?

@arhadthedev
Copy link
Member

@rhpvorderman Only review messages from the Python team count.

@Yhg1s, @gpshead as zipfile experts (not quite the same but still).

Copy link
Member

@gpshead gpshead left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall I like this change. It's great how cleanly this can be added. :)

Lib/gzip.py Outdated Show resolved Hide resolved
Lib/gzip.py Outdated Show resolved Hide resolved
Lib/gzip.py Show resolved Hide resolved
@bedevere-bot
Copy link

A Python core developer has requested some changes be made to your pull request before we can consider merging it. If you could please address their requests along with any other requests in other reviews from core developers that would be appreciated.

Once you have made the requested changes, please leave a comment on this pull request containing the phrase I have made the requested changes; please review again. I will then notify any core developers who have left a review that you're ready for them to take another look at this pull request.

@gpshead gpshead self-assigned this Feb 18, 2023
@CCLDArjun
Copy link
Contributor Author

@gpshead tests are passing! Can you re review?

Lib/gzip.py Show resolved Hide resolved
@rhpvorderman
Copy link
Contributor

ping

It would be great if this gets into 3.12. The measured execution difference of 15% is at gzip level 9 which is the default. At a more modest compression level such as 1, the difference can be a couple of hundreds of percents.

@CCLDArjun
Copy link
Contributor Author

pinging @gpshead for a final review

@gpshead gpshead enabled auto-merge (squash) May 8, 2023 17:29
@gpshead gpshead merged commit 9af4854 into python:main May 8, 2023
jbower-fb pushed a commit to jbower-fb/cpython-jbowerfb that referenced this pull request May 8, 2023
python#101251)

Use `io.BufferedWriter` to buffer gzip writes.

---------

Co-authored-by: Alex Waygood <[email protected]>
Co-authored-by: Gregory P. Smith <[email protected]>
carljm added a commit to carljm/cpython that referenced this pull request May 9, 2023
* main: (47 commits)
  pythongh-97696 Remove unnecessary check for eager_start kwarg (python#104188)
  pythonGH-104308: socket.getnameinfo should release the GIL (python#104307)
  pythongh-104310: Add importlib.util.allowing_all_extensions() (pythongh-104311)
  pythongh-99113: A Per-Interpreter GIL! (pythongh-104210)
  pythonGH-104284: Fix documentation gettext build (python#104296)
  pythongh-89550: Buffer GzipFile.write to reduce execution time by ~15% (python#101251)
  pythongh-104223: Fix issues with inheriting from buffer classes (python#104227)
  pythongh-99108: fix typo in Modules/Setup (python#104293)
  pythonGH-104145: Use fully-qualified cross reference types for the bisect module (python#104172)
  pythongh-103193: Improve `getattr_static` test coverage (python#104286)
  Trim trailing whitespace and test on CI (python#104275)
  pythongh-102500: Remove mention of bytes shorthand (python#104281)
  pythongh-97696: Improve and fix documentation for asyncio eager tasks (python#104256)
  pythongh-99108: Replace SHA3 implementation HACL* version (python#103597)
  pythongh-104273: Remove redundant len() calls in argparse function (python#104274)
  pythongh-64660: Don't hardcode Argument Clinic return converter result variable name (python#104200)
  pythongh-104265 Disallow instantiation of `_csv.Reader` and `_csv.Writer` (python#104266)
  pythonGH-102613: Improve performance of `pathlib.Path.rglob()` (pythonGH-104244)
  pythongh-103650: Fix perf maps address format (python#103651)
  pythonGH-89812: Churn `pathlib.Path` methods (pythonGH-104243)
  ...
carljm added a commit to carljm/cpython that referenced this pull request May 9, 2023
* main: (29 commits)
  pythongh-104276: Make `_struct.unpack_iterator` type use type flag instead of custom constructor (python#104277)
  pythongh-97696: Move around and update the whatsnew entry for asyncio eager task factory (python#104298)
  pythongh-103193: Fix refleaks in `test_inspect` and `test_typing` (python#104320)
  require-pr-label.yml: Add missing "permissions:" (python#104309)
  pythongh-90656: Add platform triplets for 64-bit LoongArch (LA64) (python#30939)
  pythongh-104180: Read SOCKS proxies from macOS System Configuration (python#104181)
  pythongh-97696 Remove unnecessary check for eager_start kwarg (python#104188)
  pythonGH-104308: socket.getnameinfo should release the GIL (python#104307)
  pythongh-104310: Add importlib.util.allowing_all_extensions() (pythongh-104311)
  pythongh-99113: A Per-Interpreter GIL! (pythongh-104210)
  pythonGH-104284: Fix documentation gettext build (python#104296)
  pythongh-89550: Buffer GzipFile.write to reduce execution time by ~15% (python#101251)
  pythongh-104223: Fix issues with inheriting from buffer classes (python#104227)
  pythongh-99108: fix typo in Modules/Setup (python#104293)
  pythonGH-104145: Use fully-qualified cross reference types for the bisect module (python#104172)
  pythongh-103193: Improve `getattr_static` test coverage (python#104286)
  Trim trailing whitespace and test on CI (python#104275)
  pythongh-102500: Remove mention of bytes shorthand (python#104281)
  pythongh-97696: Improve and fix documentation for asyncio eager tasks (python#104256)
  pythongh-99108: Replace SHA3 implementation HACL* version (python#103597)
  ...
carljm added a commit to carljm/cpython that referenced this pull request May 9, 2023
* main: (156 commits)
  pythongh-97696 Add documentation for get_coro() behavior with eager tasks (python#104304)
  pythongh-97933: (PEP 709) inline list/dict/set comprehensions (python#101441)
  pythongh-99889: Fix directory traversal security flaw in uu.decode() (python#104096)
  pythongh-104184: fix building --with-pydebug --enable-pystats (python#104217)
  pythongh-104139: Add itms-services to uses_netloc urllib.parse. (python#104312)
  pythongh-104240: return code unit metadata from codegen (python#104300)
  pythongh-104276: Make `_struct.unpack_iterator` type use type flag instead of custom constructor (python#104277)
  pythongh-97696: Move around and update the whatsnew entry for asyncio eager task factory (python#104298)
  pythongh-103193: Fix refleaks in `test_inspect` and `test_typing` (python#104320)
  require-pr-label.yml: Add missing "permissions:" (python#104309)
  pythongh-90656: Add platform triplets for 64-bit LoongArch (LA64) (python#30939)
  pythongh-104180: Read SOCKS proxies from macOS System Configuration (python#104181)
  pythongh-97696 Remove unnecessary check for eager_start kwarg (python#104188)
  pythonGH-104308: socket.getnameinfo should release the GIL (python#104307)
  pythongh-104310: Add importlib.util.allowing_all_extensions() (pythongh-104311)
  pythongh-99113: A Per-Interpreter GIL! (pythongh-104210)
  pythonGH-104284: Fix documentation gettext build (python#104296)
  pythongh-89550: Buffer GzipFile.write to reduce execution time by ~15% (python#101251)
  pythongh-104223: Fix issues with inheriting from buffer classes (python#104227)
  pythongh-99108: fix typo in Modules/Setup (python#104293)
  ...
carljm added a commit to carljm/cpython that referenced this pull request May 9, 2023
* main: (35 commits)
  pythongh-97696 Add documentation for get_coro() behavior with eager tasks (python#104304)
  pythongh-97933: (PEP 709) inline list/dict/set comprehensions (python#101441)
  pythongh-99889: Fix directory traversal security flaw in uu.decode() (python#104096)
  pythongh-104184: fix building --with-pydebug --enable-pystats (python#104217)
  pythongh-104139: Add itms-services to uses_netloc urllib.parse. (python#104312)
  pythongh-104240: return code unit metadata from codegen (python#104300)
  pythongh-104276: Make `_struct.unpack_iterator` type use type flag instead of custom constructor (python#104277)
  pythongh-97696: Move around and update the whatsnew entry for asyncio eager task factory (python#104298)
  pythongh-103193: Fix refleaks in `test_inspect` and `test_typing` (python#104320)
  require-pr-label.yml: Add missing "permissions:" (python#104309)
  pythongh-90656: Add platform triplets for 64-bit LoongArch (LA64) (python#30939)
  pythongh-104180: Read SOCKS proxies from macOS System Configuration (python#104181)
  pythongh-97696 Remove unnecessary check for eager_start kwarg (python#104188)
  pythonGH-104308: socket.getnameinfo should release the GIL (python#104307)
  pythongh-104310: Add importlib.util.allowing_all_extensions() (pythongh-104311)
  pythongh-99113: A Per-Interpreter GIL! (pythongh-104210)
  pythonGH-104284: Fix documentation gettext build (python#104296)
  pythongh-89550: Buffer GzipFile.write to reduce execution time by ~15% (python#101251)
  pythongh-104223: Fix issues with inheriting from buffer classes (python#104227)
  pythongh-99108: fix typo in Modules/Setup (python#104293)
  ...
Yhg1s added a commit to Yhg1s/cpython that referenced this pull request Jun 19, 2023
GzipFile.flush() to not flush the compressor (nor pass along the zip_mode
argument).
Yhg1s added a commit to Yhg1s/cpython that referenced this pull request Jun 19, 2023
…h() to

not flush the compressor (nor pass along the zip_mode argument).
gpshead pushed a commit that referenced this pull request Jun 19, 2023
Fix a regression introduced in GH-101251, causing GzipFile.flush() to
not flush the compressor (nor pass along the zip_mode argument).
miss-islington pushed a commit to miss-islington/cpython that referenced this pull request Jun 19, 2023
…onGH-105910)

Fix a regression introduced in pythonGH-101251, causing GzipFile.flush() to
not flush the compressor (nor pass along the zip_mode argument).
(cherry picked from commit 1858db7)

Co-authored-by: T. Wouters <[email protected]>
Yhg1s added a commit that referenced this pull request Jun 19, 2023
#105920)

GH-105808: Fix a regression introduced in GH-101251 (GH-105910)

Fix a regression introduced in GH-101251, causing GzipFile.flush() to
not flush the compressor (nor pass along the zip_mode argument).
(cherry picked from commit 1858db7)

Co-authored-by: T. Wouters <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
performance Performance or resource usage stdlib Python modules in the Lib dir
Projects
None yet
Development

Successfully merging this pull request may close these issues.

8 participants