Lazy/Eager right-join is giving different results #19772

Closed
2 tasks done
david-waterworth opened this issue Nov 14, 2024 · 1 comment · Fixed by #19775 or #19898
Labels: A-optimizer (Area: plan optimization), accepted (Ready for implementation), bug (Something isn't working), P-high (Priority: high), python (Related to Python Polars)

Comments

david-waterworth commented Nov 14, 2024

Checks

  • I have checked that this issue has not already been reported.
  • I have confirmed this bug exists on the latest version of Polars.

Reproducible example

If I perform a right join between two lazy frames and then filter on a column from the left table, I get different results depending on whether I collect before or after the filter:

import polars as pl

t1 = pl.DataFrame({"id": [1, 2, 3, 4], "lvalue": ["a", "b", "c", "d"]}).lazy()
t2 = pl.DataFrame({"id": [1, 2, 3, 4, 5, 6], "rvalue": ["a", "b", "c", "d", "e", "f"]}).lazy()
print(t1.join(t2, on="id", how="right").filter(pl.col("lvalue") == "c").collect())
shape: (6, 3)
┌────────┬─────┬────────┐
│ lvalue ┆ id  ┆ rvalue │
│ ---    ┆ --- ┆ ---    │
│ str    ┆ i64 ┆ str    │
╞════════╪═════╪════════╡
│ null   ┆ 1   ┆ a      │
│ null   ┆ 2   ┆ b      │
│ c      ┆ 3   ┆ c      │
│ null   ┆ 4   ┆ d      │
│ null   ┆ 5   ┆ e      │
│ null   ┆ 6   ┆ f      │
└────────┴─────┴────────┘
print(t1.join(t2, on="id", how="right").collect().filter(pl.col("lvalue") == "c"))
shape: (1, 3)
┌────────┬─────┬────────┐
│ lvalue ┆ id  ┆ rvalue │
│ ---    ┆ --- ┆ ---    │
│ str    ┆ i64 ┆ str    │
╞════════╪═════╪════════╡
│ c      ┆ 3   ┆ c      │
└────────┴─────┴────────┘

Log output

No response

Issue description

NA

Expected behavior

The result should only include rows that match the filter predicate applied after the join, i.e. the rows where lvalue is null should not be returned. If I'm reading the plan correctly, the first version moves the filter to before the join, which I don't think is valid for a right join.

Installed versions

--------Version info---------
Polars:              1.13.1
Index type:          UInt32
Platform:            Linux-5.15.0-125-generic-x86_64-with-glibc2.35
Python:              3.10.15 (main, Oct  8 2024, 00:25:34) [Clang 18.1.8 ]
LTS CPU:             False

----Optional dependencies----
adbc_driver_manager  1.2.0
altair               5.4.1
cloudpickle          <not installed>
connectorx           0.3.3
deltalake            <not installed>
fastexcel            0.12.0
fsspec               <not installed>
gevent               <not installed>
great_tables         0.13.0
matplotlib           3.9.2
nest_asyncio         1.6.0
numpy                2.1.2
openpyxl             <not installed>
pandas               2.2.3
pyarrow              17.0.0
pydantic             2.9.2
pyiceberg            <not installed>
sqlalchemy           <not installed>
torch                <not installed>
xlsx2csv             <not installed>
xlsxwriter           <not installed>

david-waterworth added the bug (Something isn't working), needs triage (Awaiting prioritization by a maintainer), and python (Related to Python Polars) labels on Nov 14, 2024
nameexhaustion added the P-high (Priority: high) and A-optimizer (Area: plan optimization) labels and removed the needs triage label on Nov 14, 2024
github-project-automation (bot) moved this to Ready in Backlog on Nov 14, 2024
nameexhaustion added the accepted (Ready for implementation) label on Nov 14, 2024
nameexhaustion self-assigned this on Nov 14, 2024
github-project-automation (bot) moved this from Ready to Done in Backlog on Nov 14, 2024

@yiteng-guo

Hi, I don't think this issue is fully fixed:

In [26]: t1.join(t2, on="id", how="right").filter(pl.col("lvalue") != "a").collect()
Out[26]:
shape: (6, 3)
┌────────┬─────┬────────┐
│ lvalue ┆ id  ┆ rvalue │
│ ---    ┆ --- ┆ ---    │
│ str    ┆ i64 ┆ str    │
╞════════╪═════╪════════╡
│ null   ┆ 1   ┆ a      │
│ b      ┆ 2   ┆ b      │
│ c      ┆ 3   ┆ c      │
│ d      ┆ 4   ┆ d      │
│ null   ┆ 5   ┆ e      │
│ null   ┆ 6   ┆ f      │
└────────┴─────┴────────┘

In [27]: t1.collect().join(t2.collect(), on="id", how="right").filter(pl.col("lvalue") != "a")
Out[27]:
shape: (3, 3)
┌────────┬─────┬────────┐
│ lvalue ┆ id  ┆ rvalue │
│ ---    ┆ --- ┆ ---    │
│ str    ┆ i64 ┆ str    │
╞════════╪═════╪════════╡
│ b      ┆ 2   ┆ b      │
│ c      ┆ 3   ┆ c      │
│ d      ┆ 4   ┆ d      │
└────────┴─────┴────────┘

3 participants