
Severe memory issues with rolling and group_by #18525

Closed
MariusMerkleQC opened this issue Sep 2, 2024 · 14 comments · Fixed by #21403
Labels: accepted (Ready for implementation) · bug (Something isn't working) · needs triage (Awaiting prioritization by a maintainer) · python (Related to Python Polars)

Comments

@MariusMerkleQC

Checks

  • I have checked that this issue has not already been reported.
  • I have confirmed this bug exists on the latest version of Polars.

Reproducible example

import polars as pl
from datetime import datetime

import resource

N = 100_000

df = pl.DataFrame(
    data=[
        [datetime(2021, 1, 1, 20, 0, 0)] * N,
        ["a"] * N,
        [i for i in range(N)],
    ],
    schema={"timestamp": pl.Datetime, "category": pl.Utf8, "value": pl.Float64},
)

df_rolled_group_by = df.rolling(
    index_column="timestamp",
    period="5y",
    # group_by="category",
).agg(pl.col("value").mean().alias("cumulative_mean_by_category"))

peak_memory_gb = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss / (2**30)

print(f"Peak memory in GB: {peak_memory_gb}")

Log output

No response

Issue description

When using group_by in the rolling() operation, memory consumption skyrockets, even for small data frames. When using N=100_000, the memory reaches the following peak:

  • group_by=None: 0.18 GB
  • group_by="category": 18.01 GB
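Since ru_maxrss is a high-water mark for the whole process, measuring both variants in a single run would conflate them; running each variant in a fresh interpreter keeps the measurements independent. A minimal sketch of that approach (the child script here is a stand-in, not the repro itself):

```python
import subprocess
import sys
import textwrap

# Stand-in child script; in practice this would be the repro above with
# group_by toggled. Each run gets its own process, so its ru_maxrss
# reflects only that variant.
child = textwrap.dedent("""
    import resource
    print(resource.getrusage(resource.RUSAGE_SELF).ru_maxrss)
""")

result = subprocess.run(
    [sys.executable, "-c", child], capture_output=True, text=True, check=True
)
print("child peak ru_maxrss:", result.stdout.strip())
```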

Expected behavior

The peak memory when using group_by=None and group_by="category" should be similar.

Installed versions

--------Version info---------
Polars:              1.6.0
Index type:          UInt32
Platform:            macOS-14.6.1-arm64-arm-64bit
Python:              3.12.5 | packaged by conda-forge | (main, Aug  8 2024, 18:32:50) [Clang 16.0.6 ]

----Optional dependencies----
adbc_driver_manager  <not installed>
altair               4.2.2
cloudpickle          3.0.0
connectorx           <not installed>
deltalake            <not installed>
fastexcel            <not installed>
fsspec               2024.6.1
gevent               <not installed>
great_tables         <not installed>
matplotlib           3.9.2
nest_asyncio         1.6.0
numpy                1.26.4
openpyxl             3.1.5
pandas               2.2.2
pyarrow              15.0.2
pydantic             2.8.2
pyiceberg            <not installed>
sqlalchemy           2.0.32
torch                <not installed>
xlsx2csv             <not installed>
xlsxwriter           3.2.0
None
@MariusMerkleQC MariusMerkleQC added bug Something isn't working needs triage Awaiting prioritization by a maintainer python Related to Python Polars labels Sep 2, 2024
@natchi92

Do we know what the underlying issue is here? It would be great to get this resolved :)

@tobiasdekker92

tobiasdekker92 commented Nov 27, 2024

I think exactly the same holds for the join_asof function if you include the by argument. It would be nice to take that implementation into account as well when fixing this bug.

@Chuck321123

I don't get it; I get 0.2 GB. Is this fixed now as of 1.22.0?

@gillan-krishna

I don't get it; I get 0.2 GB. Is this fixed now as of 1.22.0?

Got the same issue, is it actually fixed in 1.22.0?

@MariusMerkleQC
Author

MariusMerkleQC commented Feb 18, 2025

I don't get it; I get 0.2 GB. Is this fixed now as of 1.22.0?

Hm, the memory issues are unchanged for me in 1.22.0. Are you sure you uncommented group_by="category" in the snippet above? @Chuck321123 @gillan-krishna

@gillan-krishna

gillan-krishna commented Feb 18, 2025

@MariusMerkleQC memory issue persists with 1.22.0, can confirm.

import polars as pl
from datetime import datetime

import resource

N = 60_000

df = pl.DataFrame(
    data=[
        [datetime(2021, 1, 1, 20, 0, 0)] * N,
        ["a"] * N,
        [i for i in range(N)],
    ],
    schema={"timestamp": pl.Datetime, "category": pl.Utf8, "value": pl.Float64},
)

df_rolled_group_by = df.rolling(
    index_column="timestamp",
    period="5y",
    # group_by="category",
).agg(pl.col("value").mean().alias("cumulative_mean_by_category"))

peak_memory_gb = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss / (2**30)

print(f"Peak memory in GB: {peak_memory_gb}")


group_by=None: Peak memory in GB: 0.000143393874168396
group_by="category": Peak memory in GB: 0.014384772628545761

Additionally, I monitored the memory usage with htop; the actual usage seems to be much higher than what is reported here.

[htop screenshot]
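One possible explanation for the discrepancy with htop (an assumption on my part, not confirmed in the thread): getrusage reports ru_maxrss in kilobytes on Linux but in bytes on macOS, so dividing by 2**30 undercounts by a factor of about 1024 on Linux. A platform-aware sketch:

```python
import resource
import sys

def peak_memory_gb() -> float:
    """Peak RSS of the current process in GiB."""
    rss = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    # getrusage(2): ru_maxrss is in kilobytes on Linux, in bytes on macOS.
    divisor = 2**30 if sys.platform == "darwin" else 2**20
    return rss / divisor

print(f"Peak memory in GB: {peak_memory_gb()}")
```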

@gillan-krishna

gillan-krishna commented Feb 18, 2025

Right now I'm trying the tip from #21132 (comment) by @ritchie46: install jemalloc, then run the above script with:

LD_PRELOAD=/usr/local/lib/libjemalloc.so MALLOC_CONF="dirty_decay_ms:5000,muzzy_decay_ms:5000,background_thread:true,metadata_thp:auto" python pl_debug.py

It has not worked for me; the peak memory stayed the same. I'm not very convinced I'm doing it right, though.

@cmdlineluser
Contributor

Just forwarding a reply from Ritchie:

That seems a problem with the implementation, not jemalloc. Will put it on my stack.
The dedicated expressions are better, I would recommend using them when you can, until we have rolling in our new streaming engine.

The expressions:

e.g. .rolling_mean_by()

df.with_columns(
    pl.col("value").rolling_mean_by("timestamp", "5y").over("category")
)

@gillan-krishna Perhaps you can try the expressions if they are suitable for your use case.

@ritchie46
Member

This was a pathological memory explosion issue in the implementation. It will be fixed in the next release.

@gillan-krishna

@ritchie46 does the new implementation in v1.23.0 require the columns to be sorted differently than before? I'm getting
ComputeError: input data is not sorted
for the same code, with data sorted by the index column within each group.
I have tried removing duplicate values in the sorted column to make it strictly ascending, but I still get the same error.

@c-peters c-peters added the accepted Ready for implementation label Feb 24, 2025
@c-peters c-peters added this to Backlog Feb 24, 2025
@c-peters c-peters moved this to Done in Backlog Feb 24, 2025
@ritchie46
Member

Is your data sorted within the group? Do you have a repro? (I might know what it is, I just want to be sure.)

@mhhhsn

mhhhsn commented Feb 24, 2025

I am also getting ComputeError: input data is not sorted after this was included in 1.23.0. Here is a minimal example that throws the error for me:

import polars as pl

df = pl.DataFrame(
    {
        "n": [0, 0, 1, 1, 2, 2, 3, 3, 4, 4, 5, 5, 6, 6, 7, 7, 8, 8, 9, 9, 10, 10],
        "col1": ["A", "B"] * 11,
    }
)

print(df.rolling("n", period="1i", group_by="col1").agg())

@cmdlineluser
Contributor

I think #21444 is where it is being fixed.

@ritchie46
Member

Yes, fixed in #21444
