
Severe memory issues with rolling and group_by #18525

Closed
MariusMerkleQC opened this issue Sep 2, 2024 · 14 comments · Fixed by #21403
Labels: accepted (Ready for implementation) · bug (Something isn't working) · needs triage (Awaiting prioritization by a maintainer) · python (Related to Python Polars)

Comments

@MariusMerkleQC

Checks

  • I have checked that this issue has not already been reported.
  • I have confirmed this bug exists on the latest version of Polars.

Reproducible example

import polars as pl
from datetime import datetime

import resource

N = 100_000

df = pl.DataFrame(
    data=[
        [datetime(2021, 1, 1, 20, 0, 0)] * N,
        ["a"] * N,
        [i for i in range(N)],
    ],
    schema={"timestamp": pl.Datetime, "category": pl.Utf8, "value": pl.Float64},
)

df_rolled_group_by = df.rolling(
    index_column="timestamp",
    period="5y",
    # group_by="category",
).agg(pl.col("value").mean().alias("cumulative_mean_by_category"))

peak_memory_gb = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss / (2**30)

print(f"Peak memory in GB: {peak_memory_gb}")

Log output

No response

Issue description

When using group_by in the rolling() operation, memory consumption skyrockets, even for small data frames. When using N=100_000, the memory reaches the following peak:

  • group_by=None: 0.18 GB
  • group_by="category": 18.01 GB
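Since ru_maxrss is a high-water mark for the whole process, measuring both variants in a single run would conflate them; running each variant in a fresh interpreter keeps the measurements independent. A minimal sketch of that approach (the child script here is a stand-in, not the repro itself):

```python
import subprocess
import sys
import textwrap

# Stand-in child script; in practice this would be the repro above with
# group_by toggled. Each run gets its own process, so its ru_maxrss
# reflects only that variant.
child = textwrap.dedent("""
    import resource
    print(resource.getrusage(resource.RUSAGE_SELF).ru_maxrss)
""")

result = subprocess.run(
    [sys.executable, "-c", child], capture_output=True, text=True, check=True
)
print("child peak ru_maxrss:", result.stdout.strip())
```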

Expected behavior

The peak memory when using group_by=None and group_by="category" should be similar.

Installed versions

--------Version info---------
Polars:              1.6.0
Index type:          UInt32
Platform:            macOS-14.6.1-arm64-arm-64bit
Python:              3.12.5 | packaged by conda-forge | (main, Aug  8 2024, 18:32:50) [Clang 16.0.6 ]

----Optional dependencies----
adbc_driver_manager  <not installed>
altair               4.2.2
cloudpickle          3.0.0
connectorx           <not installed>
deltalake            <not installed>
fastexcel            <not installed>
fsspec               2024.6.1
gevent               <not installed>
great_tables         <not installed>
matplotlib           3.9.2
nest_asyncio         1.6.0
numpy                1.26.4
openpyxl             3.1.5
pandas               2.2.2
pyarrow              15.0.2
pydantic             2.8.2
pyiceberg            <not installed>
sqlalchemy           2.0.32
torch                <not installed>
xlsx2csv             <not installed>
xlsxwriter           3.2.0
None
@MariusMerkleQC MariusMerkleQC added bug Something isn't working needs triage Awaiting prioritization by a maintainer python Related to Python Polars labels Sep 2, 2024
@natchi92

Do we know what the underlying issue is here? It would be great to get this resolved :)

@tobiasdekker92

tobiasdekker92 commented Nov 27, 2024

I think exactly the same holds for the join_asof function if you include the by argument. It would be nice to take that implementation into account as well when fixing this bug.

@Chuck321123

I don't get it; I get 0.2 GB. Is this fixed now as of 1.22.0?

@gillan-krishna

I don't get it; I get 0.2 GB. Is this fixed now as of 1.22.0?

Got the same issue, is it actually fixed in 1.22.0?

@MariusMerkleQC
Author

MariusMerkleQC commented Feb 18, 2025

I don't get it; I get 0.2 GB. Is this fixed now as of 1.22.0?

Hm, the memory issues are unchanged for me in 1.22.0. Are you sure you uncommented group_by="category" in the snippet above? @Chuck321123 @gillan-krishna

@gillan-krishna

gillan-krishna commented Feb 18, 2025

@MariusMerkleQC memory issue persists with 1.22.0, can confirm.

import polars as pl
from datetime import datetime

import resource

N = 60_000

df = pl.DataFrame(
    data=[
        [datetime(2021, 1, 1, 20, 0, 0)] * N,
        ["a"] * N,
        [i for i in range(N)],
    ],
    schema={"timestamp": pl.Datetime, "category": pl.Utf8, "value": pl.Float64},
)

df_rolled_group_by = df.rolling(
    index_column="timestamp",
    period="5y",
    # group_by="category",
).agg(pl.col("value").mean().alias("cumulative_mean_by_category"))

peak_memory_gb = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss / (2**30)

print(f"Peak memory in GB: {peak_memory_gb}")


group_by=None: Peak memory in GB: 0.000143393874168396
group_by="category": Peak memory in GB: 0.014384772628545761

Additionally, I monitored the memory usage with htop; the actual usage seems to be much higher than what is reported here.

[htop screenshot]
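One possible explanation for the discrepancy with htop (an assumption on my part, not confirmed in the thread): getrusage reports ru_maxrss in kilobytes on Linux but in bytes on macOS, so dividing by 2**30 undercounts by a factor of about 1024 on Linux. A platform-aware sketch:

```python
import resource
import sys

def peak_memory_gb() -> float:
    """Peak RSS of the current process in GiB."""
    rss = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    # getrusage(2): ru_maxrss is in kilobytes on Linux, in bytes on macOS.
    divisor = 2**30 if sys.platform == "darwin" else 2**20
    return rss / divisor

print(f"Peak memory in GB: {peak_memory_gb()}")
```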

@gillan-krishna

gillan-krishna commented Feb 18, 2025

Right now I'm trying the tip from #21132 (comment) by @ritchie46: install jemalloc, then run the above script with:

LD_PRELOAD=/usr/local/lib/libjemalloc.so MALLOC_CONF="dirty_decay_ms:5000,muzzy_decay_ms:5000,background_thread:true,metadata_thp:auto" python pl_debug.py

It has not worked for me; the peak memory stayed the same. I'm not very convinced I'm doing it right, though.

@cmdlineluser
Contributor

Just forwarding a reply from Ritchie:

That seems a problem with the implementation, not jemalloc. Will put it on my stack.
The dedicated expressions are better, I would recommend using them when you can, until we have rolling in our new streaming engine.

The expressions:

e.g. .rolling_mean_by()

df.with_columns(
    pl.col("value").rolling_mean_by("timestamp", "5y").over("category")
)

@gillan-krishna Perhaps you can try the expressions if they are suitable for your use case.

@ritchie46
Member

This was a pathological memory explosion issue in the implementation. It will be fixed in the next release.

@gillan-krishna

@ritchie46 does the new implementation in v1.23.0 require the columns to be sorted differently than before? I'm getting
ComputeError: input data is not sorted
for the same code, with data sorted by the index column within each group.
I have tried removing duplicate values in the sorted column to make it strictly ascending, but I still get the same error.

@c-peters c-peters added the accepted Ready for implementation label Feb 24, 2025
@c-peters c-peters added this to Backlog Feb 24, 2025
@c-peters c-peters moved this to Done in Backlog Feb 24, 2025
@ritchie46
Member

Is your data sorted within the group? Do you have a repro? (I might know what it is, I just want to be sure.)

@mhhhsn

mhhhsn commented Feb 24, 2025

I am also getting ComputeError: input data is not sorted after this was included in 1.23.0. Here is a minimal example that throws the error for me:

import polars as pl

df = pl.DataFrame(
    {
        "n": [0, 0, 1, 1, 2, 2, 3, 3, 4, 4, 5, 5, 6, 6, 7, 7, 8, 8, 9, 9, 10, 10],
        "col1": ["A", "B"] * 11,
    }
)

print(df.rolling("n", period="1i", group_by="col1").agg())

@cmdlineluser
Contributor

I think #21444 is where it is being fixed.

@ritchie46
Member

Yes, fixed in #21444
