Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Why zip is slower than list comprehension when list is very long? #102724

Closed
HydrogenSulfate opened this issue Mar 15, 2023 · 7 comments
Closed
Labels
performance Performance or resource usage

Comments

@HydrogenSulfate
Copy link

HydrogenSulfate commented Mar 15, 2023

Bug report

I have found that zip is much slower than list comprehension when the list is very long.
This issue is same as lululxvi/deepxde#1149

import time
import random

n = 1252319
array = [[random.randint(0, n), random.randint(0, n)] for _ in range(n)]

s = time.perf_counter()
res1 = zip(*array)
e1 = time.perf_counter()
print(f"zip(...) cost: {e1 - s}")

s = time.perf_counter()
res1 = list(res1)
e2 = time.perf_counter()
print(f"list(...) cost: {e2 - s}")

s = time.perf_counter()
res2 = [tuple([p[0] for p in array]), tuple([p[1] for p in array])]
e2 = time.perf_counter()
print(f"list comprehension cost: {e2 - s}")

print(res1 == res2)

"""
python3.7 
zip(...) cost: 0.6884705810807645
list(...) cost: 0.0964503069408238
list comprehension cost: 0.1501797849778086
True

python3.11
zip(...) cost: 0.6185596429277211
list(...) cost: 0.08116732793860137
list comprehension cost: 0.12146345409564674
True
"""

Your environment

  • CPython versions tested on: python3.7 and python3.11
  • Operating system and architecture: Ubuntu 18.04.6 LTS
@HydrogenSulfate HydrogenSulfate added the type-bug An unexpected behavior, bug, or error label Mar 15, 2023
@AlexWaygood AlexWaygood added the performance Performance or resource usage label Mar 15, 2023
@pochmann
Copy link
Contributor

pochmann commented Mar 15, 2023

Mostly because of the garbage collection cost caused by the n list iterators used by zip. If you disable it with import gc; gc.disable(), you'll see the zip becomes about as fast as the list comprehension.

gc enabled:

zip(...) cost: 0.6406553536653519
list(...) cost: 0.0762333800084889
list comprehension cost: 0.15329286875203252
True

gc disabled:

zip(...) cost: 0.10067165410146117
list(...) cost: 0.07565007032826543
list comprehension cost: 0.15406962297856808
True

@picnixz
Copy link
Member

picnixz commented Mar 15, 2023

I would suggest using timeit.timeit instead because garbage collecting is disabled automatically. Here are my results (tested using Python 3.9.7):

from random import random
from timeit import timeit

n = 1_000_000
array = [[random(), random()] for _ in range(n)]

r1 = list(zip(*array))
r2 = [tuple([p[0] for p in array]), tuple([p[1] for p in array])]
assert r1 == r2

s1 = 'zip(*array)'
t1 = timeit(s1, globals={'array': array}, number=100)
print(s1, t1)  # t1 ~ 5.97

s2 = '[tuple([p[0] for p in array]), tuple([p[1] for p in array])]'
t2 = timeit(s2, globals={'array': array}, number=100)
print(s2, t2)  # t2 ~ 9.71

@pochmann
Copy link
Contributor

pochmann commented Mar 15, 2023

@picnixz I think that's improper. The legitimate reason that timeit disables garbage collection is to remove effects of garbage collection caused by unrelated code. But in this case, it's caused by the code in question. The timeit documentation also says:

Note: By default, timeit() temporarily turns off garbage collection during the timing. The advantage of this approach is that it makes independent timings more comparable. The disadvantage is that GC may be an important component of the performance of the function being measured. If so, GC can be re-enabled as the first statement in the setup string.

And if you want to make it faster by disabling garbage collection, then disable it explicitly, not by using a timing method that does it for you. You don't use timeit to run code in production, after all.

@pochmann
Copy link
Contributor

@picnixz Also, you timed only zip(*array). Should time list(zip(*array)).

@picnixz
Copy link
Member

picnixz commented Mar 15, 2023

@pochmann

Thank you for pinpointing this (I actually don't suggest disabling the garbage collection using timeit, but rather suggest to use it to time performances).

Also, you timed only zip(*array). Should time list(zip(*array)).

Well, that's what the OP essentially marked as being slow (consuming the zip iterator does not dominate the complexity in this case).

@pochmann
Copy link
Contributor

but rather suggest to use it to time performances

Ok, though it doesn't really time "performance" but "performance with disabled garbage collection". Which isn't really realistic/meaningful unless their actual usage also is with disabled garbage collection.

@methane methane removed the type-bug An unexpected behavior, bug, or error label Mar 17, 2023
@methane
Copy link
Member

methane commented Mar 17, 2023

Sorry, I removed my wrong observation.
zip(*array) calls PyObject_GetIter() for each item in the array. It triggers GC many times.

I close this issue because it is duplicate of #100403.

@methane methane closed this as not planned Won't fix, can't repro, duplicate, stale Mar 17, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
performance Performance or resource usage
Projects
None yet
Development

No branches or pull requests

5 participants