groupby.count() regression #4344

Open
anmyachev opened this issue Mar 24, 2022 · 4 comments
Labels
  • Blocked ❌: A pull request that is blocked
  • External: Pull requests and issues from people who do not regularly contribute to modin
  • P2: Minor bugs or low-priority feature requests
  • Performance 🚀: Performance related issues and pull requests.
  • Regression ↩️: Something that used to work but doesn't anymore

Comments

anmyachev (Collaborator) commented Mar 24, 2022

System information

  • OS Platform and Distribution (e.g., Linux Ubuntu 16.04): Ubuntu 20.04
  • Modin version (modin.__version__): commit 006ccd0 (slow) vs 0.12.0 (baseline)
  • Python version: 3.8.12
  • Engine: Ray
  • Code we can use to reproduce: MODIN_NPARTITIONS=32 python reproducer.py

reproducer.py:

import string
import random

import modin.pandas as pd
import numpy as np
from time import time

rows = 5000000
# Three integer columns with values in [0, 10**6)
temp = np.random.randint(10**6, size=(rows, 3))
symbols = string.ascii_uppercase + string.digits
# One column of random 16-character strings
string_col = [''.join(random.choices(symbols, k=16)) for _ in range(rows)]
# Concatenating with the string column casts the whole array to a string dtype
res = np.concatenate((temp, np.array([string_col]).T), axis=1)

df = pd.DataFrame(res)

for _ in range(5):
    start = time()
    # group by string column
    result = df.groupby(3, sort=False).count()
    print(f"end: {time()-start}")
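
The timing loop above can also be wrapped in a small reusable helper. This is only a sketch: `time_calls` is a hypothetical name that appears neither in the reproducer nor in Modin, and the warm-up call is an assumption added so that one-time startup cost is kept out of the measured runs.

```python
from statistics import median
from time import perf_counter


def time_calls(fn, repeats=5, warmup=1):
    """Return wall-clock durations in seconds for `repeats` calls of `fn`.

    Hypothetical helper: `warmup` untimed calls run first.
    """
    for _ in range(warmup):
        fn()
    durations = []
    for _ in range(repeats):
        start = perf_counter()
        fn()
        durations.append(perf_counter() - start)
    return durations


# Stand-in workload; with the reproducer's `df` this would be
# lambda: df.groupby(3, sort=False).count()
durations = time_calls(lambda: sum(range(100_000)))
print(f"median: {median(durations):.6f}s over {len(durations)} runs")
```

Reporting the median instead of each run makes two versions easier to compare when a single iteration is an outlier.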

Describe the problem

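df.groupby(3, sort=False).count() is ~2.2x slower at 006ccd0 than in 0.12.0: roughly 17-19 s per iteration versus about 8 s in the logs below.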

Source code / logs

0.12.0
UserWarning: The size of /dev/shm is too small (365270204416 bytes). The required size at least half of RAM (405121079296 bytes). Please, delete files in /dev/shm or increase size of /dev/shm with --shm-size in Docker. Also, you can set the required memory size for each Ray worker in bytes to MODIN_MEMORY environment variable.
UserWarning: Distributing <class 'numpy.ndarray'> object. This may take some time.
end: 8.011138439178467
end: 7.805281162261963
end: 8.223145008087158
end: 8.279038429260254
end: 8.183176755905151


006ccd0
UserWarning: Distributing <class 'numpy.ndarray'> object. This may take some time.
UserWarning: `df.groupby(categorical_by, sort=False)` implementation has mismatches with pandas:
the groupby keys will be sorted anyway, although the 'sort=False' was passed. See the following issue for more details: https://github.com/modin-project/modin/issues/3571.
end: 18.63240075111389
end: 17.186938524246216
end: 16.810182809829712
end: 17.209285974502563
end: 18.92437243461609
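
As a quick check of the "~2.2x" figure, the ratio of mean iteration times can be computed from the two logs above. The variable names are illustrative. The numbers are copied from the logs.

```python
from statistics import mean

# Per-iteration timings in seconds, copied from the logs above.
baseline = [
    8.011138439178467,
    7.805281162261963,
    8.223145008087158,
    8.279038429260254,
    8.183176755905151,
]  # 0.12.0
regressed = [
    18.63240075111389,
    17.186938524246216,
    16.810182809829712,
    17.209285974502563,
    18.92437243461609,
]  # 006ccd0

slowdown = mean(regressed) / mean(baseline)
print(f"0.12.0 mean:  {mean(baseline):.2f}s")
print(f"006ccd0 mean: {mean(regressed):.2f}s")
print(f"slowdown:     {slowdown:.2f}x")
```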
anmyachev added the Performance 🚀, Regression ↩️, and External labels Mar 24, 2022
anmyachev self-assigned this Mar 25, 2022
@anmyachev
Collaborator Author

Regression commit: c67a7c5

@anmyachev
Collaborator Author

Blocked by #3571.
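
For context on the blocker: the warning in the 006ccd0 log says the group keys end up sorted even though sort=False was passed. The sketch below shows what sort=False normally means. It uses plain pandas, not Modin, and ordinary string keys where the warning refers to a categorical `by`. Both choices are assumptions made for brevity. It does not reproduce the Modin code path.

```python
import pandas as pd

df = pd.DataFrame(
    {"key": ["b", "a", "b", "c", "a"], "value": [1, 2, 3, 4, 5]}
)

# sort=False keeps groups in order of first appearance in the data.
unsorted_counts = df.groupby("key", sort=False).count()
# sort=True (the default) orders the groups by key.
sorted_counts = df.groupby("key", sort=True).count()

print(list(unsorted_counts.index))
print(list(sorted_counts.index))
```

The counts per group are identical in both cases. Only the order of the result index differs.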

anmyachev added the Blocked ❌ label Mar 25, 2022
@jeffreykennethli

Hi @anmyachev, it looks like #3571 is merged. Is this still blocked?

@anmyachev
Collaborator Author

Hi @jeffreykennethli, #3571 should stay open until pandas-dev/pandas#44099 is merged.

vnlitvinov added the P2 label Aug 29, 2022
anmyachev removed their assignment Jul 27, 2023