SELECT 0 from table_name return 1 line instead of expected number_of_lines(table_name) #18404

AlexeyDmitriev · 2024-08-27T15:45:54Z

Checks

I have checked that this issue has not already been reported.
I have confirmed this bug exists on the latest version of Polars.

Reproducible example

import polars as pl
L = pl.DataFrame({'a': [1,2]})
pl.sql('SELECT 0 FROM L').collect()

Log output

shape: (1, 1)
literal
i32
0

Issue description

So, when you select constant from a table you get the value 1 times.

Expected behavior

When you do the same in classical DBMS such as postgresql you get the number of lines equal to the number of lines.
That was what I expected as well.

I understand the polars may have expected different semantics, but I haven't find relevant place in the docs which would mention this case.

Installed versions

--------Version info---------
Polars:               1.5.0
Index type:           UInt32
Platform:             Linux-5.15.0-113-generic-x86_64-with-glibc2.31
Python:               3.11.7 (main, Apr 20 2024, 11:23:46) [GCC 13.2.0]

----Optional dependencies----
adbc_driver_manager:  <not installed>
cloudpickle:          2.2.0
connectorx:           <not installed>
deltalake:            <not installed>
fastexcel:            <not installed>
fsspec:               2024.6.1
gevent:               <not installed>
great_tables:         <not installed>
hvplot:               <not installed>
matplotlib:           3.7.1
nest_asyncio:         1.6.0
numpy:                1.23.4
openpyxl:             3.1.2
pandas:               2.0.0rc1
pyarrow:              16.0.0
pydantic:             2.6.0
pyiceberg:            <not installed>
sqlalchemy:           2.0.1
torch:                2.3.1+cu121
xlsx2csv:             <not installed>
xlsxwriter:           1.1.1

The text was updated successfully, but these errors were encountered:

AlexeyDmitriev · 2024-08-27T15:55:47Z

The same applies to python API

I expected L.select(pl.lit(0)) to return 2 lines, not 1 as it's returned.
But that's harder to compare with other DBMSes

mcrumiller · 2024-08-27T16:54:22Z

@ritchie46 @orlp can a mod delete/ban the malware link above? Edit: I've reported it as malware abuse, not sure how quickly they act.

@AlexeyDmitriev for the python API, 2 lines is intended I believe. See the discussion in #17107.

AlexeyDmitriev · 2024-08-28T12:09:52Z

Actually I saw that issue, but didn't quite get what is expected and what is not

mcrumiller · 2024-08-28T12:54:03Z

So--the below is in reference to with_columns, which apparently has different behavior than select.

pl.lit does not have super obvious intended behavior, but in essence it is supposed to mean "a literal value that fills the height of the columns in the DataFrame." This has the confusion that if no columns are present, a height of 1 is returned, but it is not inconsistent.

In your example, the frame has height 2, so selecting a literal value be also length 2. In the example in the linked issue, an empty frame has height 0, so selecting a literal from there will be length zero. This is consistent with the definition:

import polar as pl

# In an empty frame, literal returns length 1
pl.DataFrame().with_columns(pl.lit(1))
shape: (1, 1)
# ┌─────────┐
# │ literal │
# │ ---     │
# │ i32     │
# ╞═════════╡
# │ 1       │
# └─────────┘

# As long as we have any columns, literal returns that length
# In this case, that length is 0.
pl.DataFrame({"a": []}).with_columns(pl.lit(1))
shape: (0, 2)
# ┌──────┬─────────┐
# │ a    ┆ literal │
# │ ---  ┆ ---     │
# │ null ┆ i32     │
# ╞══════╪═════════╡
# └──────┴─────────┘

Now, with select, we do indeed get a literal of length 1, even if our columns are length 0:

pl.DataFrame({"a": []}).select(pl.lit(1))
shape: (1, 1)
┌─────────┐
│ literal │
│ ---     │
│ i32     │
╞═════════╡
│ 1       │
└─────────┘

I am not sure what to think about that.

orlp · 2024-08-28T14:28:43Z

In Polars (conceptually speaking) pl.lit(0) is a scalar expression, just like pl.col.x.sum() is. Scalar expressions are returned as a new row by themselves if all columns are scalar, and are broadcasted to the other columns' lengths otherwise.

So in these two examples pl.lit(0) and pl.col.x.sum() are returned as a new row:

>>> df = pl.DataFrame({"x": []}, schema={"x": pl.Int32})
>>> df.select(pl.lit(0))
shape: (1, 1)
┌─────────┐
│ literal │
│ ---     │
│ i32     │
╞═════════╡
│ 0       │
└─────────┘
>>> df.select(s=pl.col.x.sum())
shape: (1, 1)
┌─────┐
│ s   │
│ --- │
│ i32 │
╞═════╡
│ 0   │
└─────┘

In these four examples the scalar values are broadcasted along with the other column, x, which in this case means the result has zero rows when x is empty:

>>> df = pl.DataFrame({"x": [1, 2]}, schema={"x": pl.Int32})
>>> df.select(pl.col.x, pl.lit(0))
shape: (2, 2)
┌─────┬─────────┐
│ x   ┆ literal │
│ --- ┆ ---     │
│ i32 ┆ i32     │
╞═════╪═════════╡
│ 1   ┆ 0       │
│ 2   ┆ 0       │
└─────┴─────────┘
>>> df.select(pl.col.x, s=pl.col.x.sum())
shape: (2, 2)
┌─────┬─────┐
│ x   ┆ s   │
│ --- ┆ --- │
│ i32 ┆ i32 │
╞═════╪═════╡
│ 1   ┆ 3   │
│ 2   ┆ 3   │
└─────┴─────┘

>>> df = pl.DataFrame({"x": []}, schema={"x": pl.Int32})
>>> df.select(pl.col.x, pl.lit(0))
shape: (0, 2)
┌─────┬─────────┐
│ x   ┆ literal │
│ --- ┆ ---     │
│ i32 ┆ i32     │
╞═════╪═════════╡
└─────┴─────────┘
>>> df.select(pl.col.x, s=pl.col.x.sum())
shape: (0, 2)
┌─────┬─────┐
│ x   ┆ s   │
│ --- ┆ --- │
│ i32 ┆ i32 │
╞═════╪═════╡
└─────┴─────┘

This is why Polars has different behavior when you use with_columns †.

As for the original issue, I don't know exactly how to match SQL's behavior here without a lot of work on our end...

† I'm still not 100% sure what we should do in the with_columns case if we end up overwriting every original column in the with_columns expression. You could imagine that there's always a hidden null column for with_columns to broadcast along, which is what the current engine does:

>>> df = pl.DataFrame({"x": []}, schema={"x": pl.Int32})
>>> df.with_columns(x=0)
shape: (0, 1)
┌─────┐
│ x   │
│ --- │
│ i32 │
╞═════╡
└─────┘

This might however be problematic for the new streaming engine, so the jury isn't out on this one yet.

mcrumiller · 2024-08-28T14:31:26Z

Thanks for the clarification @orlp, makes perfect sense.

AlexeyDmitriev · 2024-08-28T15:33:42Z

@orlp thanks for explanation, I can see now that this behaviour make sense (in particular you allow selecting aggregates and non-aggregates with simpler syntax then in classic sql as in your df.select(pl.col.x, s=pl.col.x.sum()) example)

Personally, I'd find more intuitive if there were to separate functions to select something for each line of the df (accept both, always broadcasted) and select an expression (accepts only scalars, doesn't)

This way it's easier to understand the shape of what is returned when you read the code (even if you don't know the specific things user passed)

def f(expr):
    ...
    some_df.select(expr) # you know this is always the same shape as some_df, even if user passed literal
    ...

You can always remove unnecessary column safely, e.g. df.select(x=x, y=y).select('y') is the same as df.select(y=y)

But I guess, that wouldn't work with the semantics of functions like Expr.top_k (because it's neither of my options)
And also, that would be completely changing the API people got used to

Anyway, it would be good if this was explained in some obvious place in the docs, because when I tried to see explanation I failed.

Also, is it possible to create expression that means "0 distributed for each row even if nothing else is selected"?

As for the original issue, I don't know exactly how to match SQL's behavior here without a lot of work on our end...
For me personally I only care for python API and showed sql behaviour only because with sql there's a system to compare to

BTW the fact that pl.lit is scalar also does something strange with joins

mcrumiller · 2024-08-28T15:37:42Z

BTW the fact that pl.lit is scalar also does something strange with joins

@AlexeyDmitriev you mean this? #9603

AlexeyDmitriev · 2024-08-28T15:45:27Z

@AlexeyDmitriev you mean this? #9603

I added somewhat simpler example there.

orlp · 2024-08-28T16:02:16Z

Also, is it possible to create expression that means "0 distributed for each row even if nothing else is selected"?

Kind of. You'll have to select a column to broadcast with, then drop that column after the broadcast:

>>> df = pl.DataFrame({"x": [1, 2, 3]})
>>> df.select(pl.first(), pl.lit(0)).drop(pl.first())
shape: (3, 1)
┌─────────┐
│ literal │
│ ---     │
│ i32     │
╞═════════╡
│ 0       │
│ 0       │
│ 0       │
└─────────┘

But I don't believe you can literally do it as a stand-alone expression.

EDIT: you can actually make such an expression by broadcasting by abusing pl.when:

>>> df.select(pl.when(True).then(0).otherwise(pl.first()))
shape: (3, 1)
┌─────────┐
│ literal │
│ ---     │
│ i64     │
╞═════════╡
│ 0       │
│ 0       │
│ 0       │
└─────────┘

Still relies on broadcasting with an existing column but that column never has to show up in your result.

AlexeyDmitriev added bug Something isn't working needs triage Awaiting prioritization by a maintainer python Related to Python Polars labels Aug 27, 2024

AlexeyDmitriev changed the title ~~SELECT * from table_name return 1 line instead of expected number_of_lines(table_name)~~ SELECT 0 from table_name return 1 line instead of expected number_of_lines(table_name) Aug 27, 2024

AlexeyDmitriev mentioned this issue Sep 11, 2024

Empty dataframe with "cross-row" operation contaminates sum and product #18689

Open

2 tasks

AlexeyDmitriev mentioned this issue Dec 6, 2024

[SQL] SELECT literal from dataframe return only 1 row instead of the dataframe height #20058

Closed

2 tasks

nameexhaustion mentioned this issue Dec 10, 2024

fix: Ensure height is maintained in SQL SELECT 1 FROM #20241

Merged

ritchie46 closed this as completed in #20241 Dec 10, 2024

c-peters added the accepted Ready for implementation label Dec 14, 2024

c-peters assigned nameexhaustion Dec 14, 2024

c-peters moved this to Done in Backlog Dec 14, 2024

c-peters added this to Backlog Dec 14, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

SELECT 0 from table_name return 1 line instead of expected number_of_lines(table_name) #18404

SELECT 0 from table_name return 1 line instead of expected number_of_lines(table_name) #18404

AlexeyDmitriev commented Aug 27, 2024 •

edited

Loading

AlexeyDmitriev commented Aug 27, 2024

mcrumiller commented Aug 27, 2024 •

edited

Loading

AlexeyDmitriev commented Aug 28, 2024

mcrumiller commented Aug 28, 2024

orlp commented Aug 28, 2024 •

edited

Loading

mcrumiller commented Aug 28, 2024

AlexeyDmitriev commented Aug 28, 2024 •

edited

Loading

mcrumiller commented Aug 28, 2024

AlexeyDmitriev commented Aug 28, 2024

orlp commented Aug 28, 2024 •

edited

Loading

SELECT 0 from table_name return 1 line instead of expected number_of_lines(table_name) #18404

SELECT 0 from table_name return 1 line instead of expected number_of_lines(table_name) #18404

Comments

AlexeyDmitriev commented Aug 27, 2024 • edited Loading

Checks

Reproducible example

Log output

Issue description

Expected behavior

Installed versions

AlexeyDmitriev commented Aug 27, 2024

mcrumiller commented Aug 27, 2024 • edited Loading

AlexeyDmitriev commented Aug 28, 2024

mcrumiller commented Aug 28, 2024

orlp commented Aug 28, 2024 • edited Loading

mcrumiller commented Aug 28, 2024

AlexeyDmitriev commented Aug 28, 2024 • edited Loading

mcrumiller commented Aug 28, 2024

AlexeyDmitriev commented Aug 28, 2024

orlp commented Aug 28, 2024 • edited Loading

AlexeyDmitriev commented Aug 27, 2024 •

edited

Loading

mcrumiller commented Aug 27, 2024 •

edited

Loading

orlp commented Aug 28, 2024 •

edited

Loading

AlexeyDmitriev commented Aug 28, 2024 •

edited

Loading

orlp commented Aug 28, 2024 •

edited

Loading