SELECT 0 from table_name return 1 line instead of expected number_of_lines(table_name) #18404
Comments
The same applies to python API I expected |
@ritchie46 @orlp can a mod delete/ban the malware link above? Edit: I've reported it as malware abuse, not sure how quickly they act. @AlexeyDmitriev for the python API, 2 lines is intended I believe. See the discussion in #17107. |
Actually I saw that issue, but didn't quite get what is expected and what is not |
So--the below is in reference to
In your example, the frame has height 2, so selecting a literal value also gives length 2. In the example in the linked issue, an empty frame has height 0, so selecting a literal from there gives length zero. This is consistent with the definition:
import polars as pl
# In an empty frame, literal returns length 1
pl.DataFrame().with_columns(pl.lit(1))
# shape: (1, 1)
# ┌─────────┐
# │ literal │
# │ --- │
# │ i32 │
# ╞═════════╡
# │ 1 │
# └─────────┘
# As long as we have any columns, literal returns that length
# In this case, that length is 0.
pl.DataFrame({"a": []}).with_columns(pl.lit(1))
# shape: (0, 2)
# ┌──────┬─────────┐
# │ a ┆ literal │
# │ --- ┆ --- │
# │ null ┆ i32 │
# ╞══════╪═════════╡
# └──────┴─────────┘
Now, with pl.DataFrame({"a": []}).select(pl.lit(1))
shape: (1, 1)
┌─────────┐
│ literal │
│ --- │
│ i32 │
╞═════════╡
│ 1 │
└─────────┘
I am not sure what to think about that. |
In Polars (conceptually speaking) So in these two examples
>>> df = pl.DataFrame({"x": []}, schema={"x": pl.Int32})
>>> df.select(pl.lit(0))
shape: (1, 1)
┌─────────┐
│ literal │
│ --- │
│ i32 │
╞═════════╡
│ 0 │
└─────────┘
>>> df.select(s=pl.col.x.sum())
shape: (1, 1)
┌─────┐
│ s │
│ --- │
│ i32 │
╞═════╡
│ 0 │
└─────┘
In these four examples the scalar values are broadcast along with the other column:
>>> df = pl.DataFrame({"x": [1, 2]}, schema={"x": pl.Int32})
>>> df.select(pl.col.x, pl.lit(0))
shape: (2, 2)
┌─────┬─────────┐
│ x ┆ literal │
│ --- ┆ --- │
│ i32 ┆ i32 │
╞═════╪═════════╡
│ 1 ┆ 0 │
│ 2 ┆ 0 │
└─────┴─────────┘
>>> df.select(pl.col.x, s=pl.col.x.sum())
shape: (2, 2)
┌─────┬─────┐
│ x ┆ s │
│ --- ┆ --- │
│ i32 ┆ i32 │
╞═════╪═════╡
│ 1 ┆ 3 │
│ 2 ┆ 3 │
└─────┴─────┘
>>> df = pl.DataFrame({"x": []}, schema={"x": pl.Int32})
>>> df.select(pl.col.x, pl.lit(0))
shape: (0, 2)
┌─────┬─────────┐
│ x ┆ literal │
│ --- ┆ --- │
│ i32 ┆ i32 │
╞═════╪═════════╡
└─────┴─────────┘
>>> df.select(pl.col.x, s=pl.col.x.sum())
shape: (0, 2)
┌─────┬─────┐
│ x ┆ s │
│ --- ┆ --- │
│ i32 ┆ i32 │
╞═════╪═════╡
└─────┴─────┘
This is why Polars has different behavior when you use
As for the original issue, I don't know exactly how to match SQL's behavior here without a lot of work on our end...
† I'm still not 100% sure what we should do in the
>>> df = pl.DataFrame({"x": []}, schema={"x": pl.Int32})
>>> df.with_columns(x=0)
shape: (0, 1)
┌─────┐
│ x │
│ --- │
│ i32 │
╞═════╡
└─────┘
This might, however, be problematic for the new streaming engine, so the jury is still out on this one. |
Thanks for the clarification @orlp, makes perfect sense. |
@orlp thanks for the explanation. I can see now that this behaviour makes sense (in particular, you allow selecting aggregates and non-aggregates with simpler syntax than in classic SQL, as in your
Personally, I'd find it more intuitive if there were two separate functions: one to select something for each line of the df (accepts both, always broadcast) and one to select an expression (accepts only scalars, doesn't broadcast).
But I guess that wouldn't work with the semantics of functions like
Anyway, it would be good if this were explained in some obvious place in the docs, because when I looked for an explanation I couldn't find one. Also, is it possible to create an expression that means "0 distributed for each row even if nothing else is selected"?
BTW, the fact that pl.lit is a scalar also does something strange with joins |
@AlexeyDmitriev you mean this? #9603 |
I added somewhat simpler example there. |
Kind of. You'll have to select a column to broadcast with, then drop that column after the broadcast:
>>> df = pl.DataFrame({"x": [1, 2, 3]})
>>> df.select(pl.first(), pl.lit(0)).drop(pl.first())
shape: (3, 1)
┌─────────┐
│ literal │
│ --- │
│ i32 │
╞═════════╡
│ 0 │
│ 0 │
│ 0 │
└─────────┘
But I don't believe you can literally do it as a stand-alone expression.
EDIT: you can actually make such an expression by broadcasting by abusing
>>> df.select(pl.when(True).then(0).otherwise(pl.first()))
shape: (3, 1)
┌─────────┐
│ literal │
│ --- │
│ i64 │
╞═════════╡
│ 0 │
│ 0 │
│ 0 │
└─────────┘
Still relies on broadcasting with an existing column, but that column never has to show up in your result. |
Checks
Reproducible example
Log output
Issue description
So, when you select a constant from a table, you get the value only once.
Expected behavior
When you do the same in a classical DBMS such as PostgreSQL, you get a number of lines equal to the number of lines in the table.
That was what I expected as well.
I understand that Polars may have intentionally different semantics, but I haven't found a relevant place in the docs that mentions this case.
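For comparison, the classical-DBMS behaviour described above can be reproduced without installing anything. The sketch below uses Python's built-in `sqlite3` module as a stand-in for PostgreSQL (an assumption made for convenience; the issue itself mentions PostgreSQL), with a made-up three-row table.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE table_name (x INTEGER)")
conn.executemany("INSERT INTO table_name VALUES (?)", [(1,), (2,), (3,)])

# The constant is evaluated once per row of the table.
rows = conn.execute("SELECT 0 FROM table_name").fetchall()
print(rows)  # [(0,), (0,), (0,)]

conn.close()
```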
Installed versions