Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

merge_sorted produces incorrect results when merge column is sorted in descending order #21464

Closed
2 tasks done
Matt711 opened this issue Feb 25, 2025 · 5 comments · Fixed by #21501
Closed
2 tasks done
Labels
bug Something isn't working needs triage Awaiting prioritization by a maintainer python Related to Python Polars

Comments

@Matt711
Copy link
Contributor

Matt711 commented Feb 25, 2025

Checks

  • I have checked that this issue has not already been reported.
  • I have confirmed this bug exists on the latest version of Polars.

Reproducible example

When merging two sorted lazyframes or dataframes (descending order), polars produces incorrect results.

In [1]: import polars as pl

In [2]: df0 = pl.LazyFrame(
   ...:     {
   ...:         "name": ["steve", "elise", "bob"],
   ...:         "age": [42, 44, 18],
   ...:         "height": [1, 2, 3]
   ...:     }
   ...: ).sort("age", descending=True)
   ...: df1 = pl.LazyFrame(
   ...:     {
   ...:         "name": ["anna", "megan", "steve", "thomas"],
   ...:         "age": [21, 33, 42, 20],
   ...:         "height": [5,5,5,5]
   ...:     }
   ...: ).sort("age", descending=True)
   ...: 
   ...: print("CPU", df0.merge_sorted(df1, key="age").collect())
CPU shape: (7, 3)
┌────────┬─────┬────────┐
│ nameageheight │
│ ---------    │
│ stri64i64    │
╞════════╪═════╪════════╡
│ steve425      │
│ megan335      │
│ anna215      │
│ thomas205      │
│ elise442      │
│ steve421      │
│ bob183      │
└────────┴─────┴────────┘

Log output

Issue description

Copied the example from https://docs.pola.rs/api/python/dev/reference/lazyframe/api/polars.LazyFrame.merge_sorted.html

Expected behavior

I expect

CPU shape: (7, 3)
┌────────┬─────┬────────┐
│ nameageheight │
│ ---------    │
│ stri64i64    │
╞════════╪═════╪════════╡
│ elise442      │
│ steve421      │
│ steve425      │
│ megan335      │
│ anna215      │
│ thomas205      │
│ bob183      │
└────────┴─────┴────────┘

Installed versions

--------Version info---------
Polars:              1.23.0
Index type:          UInt32
Platform:            Linux-5.4.0-182-generic-x86_64-with-glibc2.35
Python:              3.12.9 | packaged by conda-forge | (main, Feb 14 2025, 08:00:06) [GCC 13.3.0]
LTS CPU:             False

----Optional dependencies----
Azure CLI            <not installed>
adbc_driver_manager  1.4.0
altair               <not installed>
azure.identity       <not installed>
boto3                1.36.23
cloudpickle          3.1.1
connectorx           0.4.2
deltalake            0.24.0
fastexcel            0.12.1
fsspec               2025.2.0
gevent               24.11.1
google.auth          2.38.0
great_tables         0.16.1
matplotlib           3.10.0
numpy                2.1.3
openpyxl             3.1.5
pandas               2.2.3
polars_cloud         <not installed>
pyarrow              19.0.1
pydantic             2.10.6
pyiceberg            <not installed>
sqlalchemy           2.0.38
torch                2.5.1.post306
xlsx2csv             0.8.4
xlsxwriter           3.2.2
@Matt711 Matt711 added bug Something isn't working needs triage Awaiting prioritization by a maintainer python Related to Python Polars labels Feb 25, 2025
@ToukoH
Copy link

ToukoH commented Feb 25, 2025

To me it looks like the issue might arise because get_merge_indicator at polars/crates/polars-ops/src/frame/join/merge_sorted.rs implicitly assumes both iterators are sorted in ascending order. The current implementation uses comparisons like a <= current_b and b >= a, which only work correctly with ascending data. A toy example would be to walk through the implementation with two lists, say a=[5, 3] and b=[4, 2]. It's obvious what this should return, but following the function's logic, you'll actually get [4, 2, 5, 3]. I set the ages in the example to match these numbers, and that's exactly what it returns.

@ritchie46
Copy link
Member

It is "sort of" documented in the docstring:

It is the callers responsibility that the frames are sorted by that key otherwise the output will not make sense.

We should clarify that this means in ascending order.

@Matt711
Copy link
Contributor Author

Matt711 commented Feb 27, 2025

We should clarify that this means in ascending order.

Opened #21501

@coastalwhite
Copy link
Collaborator

I am pretty sure merge_sorted in the new streaming engine accepts both ways. There are also issues in merge_sorted about nulls and whether they are allowed.

@Matt711
Copy link
Contributor Author

Matt711 commented Feb 27, 2025

I am pretty sure merge_sorted in the new streaming engine accepts both ways. There are also issues in merge_sorted about nulls and whether they are allowed.

I get the same results with new streaming engine.

In [8]: df0.merge_sorted(df1, key="age").collect(new_streaming=True)
Out[8]: 
shape: (7, 2)
┌────────┬─────┐
│ name   ┆ age │
│ ---    ┆ --- │
│ str    ┆ i64 │
╞════════╪═════╡
│ steve  ┆ 42  │
│ megan  ┆ 33  │
│ anna   ┆ 21  │
│ thomas ┆ 20  │
│ elise  ┆ 44  │
│ steve  ┆ 42  │
│ bob    ┆ 18  │
└────────┴─────┘

I filed a feature request #21511

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working needs triage Awaiting prioritization by a maintainer python Related to Python Polars
Projects
None yet
4 participants