
pl.duration(days=pl.all()) also passes values to hours= #19007

Closed
2 tasks done
cmdlineluser opened this issue Sep 29, 2024 · 11 comments · Fixed by #19449
Labels
accepted Ready for implementation bug Something isn't working P-medium Priority: medium python Related to Python Polars

Comments

@cmdlineluser
Contributor

cmdlineluser commented Sep 29, 2024

Checks

  • I have checked that this issue has not already been reported.
  • I have confirmed this bug exists on the latest version of Polars.

Reproducible example

import polars as pl

df = pl.DataFrame({"a": [1], "b": [2]})

df.select(pl.duration(days=pl.all()))
# shape: (1, 1)
# ┌──────────────┐
# │ literal      │
# │ ---          │
# │ duration[μs] │
# ╞══════════════╡
# │ 1d 2h        │
# └──────────────┘

Log output

No response

Issue description

I think this may actually be another form of:

It seems to be doing

pl.duration(col(a), col(b))

instead of

pl.duration(col(a))
pl.duration(col(b))
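A plain-Python sketch of the suspected miswiring (not Polars internals; `duration` here is a stand-in that just totals hours, reduced to two components for illustration):

```python
# Stand-in for pl.duration, reduced to days/hours for illustration.
def duration(days=0, hours=0):
    return days * 24 + hours  # total hours

# pl.all() expanded over columns a=1, b=2 before the call is built:
expanded_columns = [1, 2]

# Buggy behavior: the expanded columns fill successive parameters,
# so days=a and hours=b -> "1d 2h".
buggy = duration(*expanded_columns)

# Expected behavior: one duration() per column, days only.
expected = [duration(days=c) for c in expanded_columns]

print(buggy)     # 26  (1 day + 2 hours)
print(expected)  # [24, 48]
```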

Expected behavior

I was expecting a duplicate `literal` name error, but thought adding .name.keep() would make it the same as:

df.select(
    pl.duration(days="a").name.keep(),
    pl.duration(days="b").name.keep()
)

# shape: (1, 2)
# ┌──────────────┬──────────────┐
# │ a            ┆ b            │
# │ ---          ┆ ---          │
# │ duration[μs] ┆ duration[μs] │
# ╞══════════════╪══════════════╡
# │ 1d           ┆ 2d           │
# └──────────────┴──────────────┘

Installed versions

--------Version info---------
Polars:               1.8.2
Index type:           UInt32
Platform:             macOS-13.6.1-arm64-arm-64bit
Python:               3.12.2 (main, Feb  6 2024, 20:19:44) [Clang 15.0.0 (clang-1500.1.0.2.5)]

----Optional dependencies----
adbc_driver_manager:  <not installed>
cloudpickle:          <not installed>
connectorx:           <not installed>
deltalake:            <not installed>
fastexcel:            <not installed>
fsspec:               <not installed>
gevent:               <not installed>
great_tables:         <not installed>
hvplot:               <not installed>
matplotlib:           <not installed>
nest_asyncio:         <not installed>
numpy:                1.26.4
openpyxl:             <not installed>
pandas:               2.2.1
pyarrow:              15.0.2
pydantic:             <not installed>
pyiceberg:            <not installed>
sqlalchemy:           <not installed>
torch:                <not installed>
xlsx2csv:             <not installed>
xlsxwriter:           <not installed>
@cmdlineluser cmdlineluser added bug Something isn't working needs triage Awaiting prioritization by a maintainer python Related to Python Polars labels Sep 29, 2024
@mcrumiller
Contributor

mcrumiller commented Sep 30, 2024

Looks like all the args get consumed:

df = pl.DataFrame(
    {
        "a": [1, 2],
        "b": [3, 4],
        "c": [5, 6],
        "d": [7, 8],
        "e": [9, 10],
        "f": [11, 12],
        "g": [13, 14],
    }
)
df.select(pl.duration(days=pl.all()))
# shape: (2, 1)
# ┌─────────────────────┐
# │ literal             │
# │ ---                 │
# │ duration[μs]        │
# ╞═════════════════════╡
# │ 1d 3h 5m 7s 9011µs  │
# │ 2d 4h 6m 8s 10012µs │
# └─────────────────────┘

@cmdlineluser
Contributor Author

@mcrumiller Yeah, it looks like it is the same thing as the issue with some of the .list.* methods.

This should output 7 columns, right?

df = pl.DataFrame(
    {
        "a": [[1, 2]],
        "b": [[3, 4]],
        "c": [[5, 6]],
        "d": [[7, 8]],
        "e": [[9, 10]],
        "f": [[11, 12]],
        "g": [[13, 14]],
    }
)

df.select(pl.all().list.concat(539)) 
# shape: (1, 1)
# ┌──────────────────────────────────────────────────────┐
# │ a                                                    │
# │ ---                                                  │
# │ list[i64]                                            │
# ╞══════════════════════════════════════════════════════╡
# │ [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 539] │ 
# └──────────────────────────────────────────────────────┘

It seems to be related to the INPUT_WILDCARD_EXPANSION flag during expr_expansion.

@orlp orlp added accepted Ready for implementation P-medium Priority: medium and removed needs triage Awaiting prioritization by a maintainer labels Oct 2, 2024
@github-project-automation github-project-automation bot moved this to Ready in Backlog Oct 2, 2024
@cmdlineluser
Contributor Author

Not sure if I'm on the right track here - but this also seems to be the cause of the linked struct issue:

df = pl.DataFrame({"X": {"a": 1}, "Y": {"a": 3}})

df.select(pl.all().struct.with_fields())
# shape: (1, 1)
# ┌───────────┐
# │ X         │
# │ ---       │
# │ struct[2] │
# ╞═══════════╡
# │ {1,{3}}   │
# └───────────┘

From what I can tell, if INPUT_WILDCARD_EXPANSION is enabled, it ends up with a recursive call in rewrite_projections which seems to "flatten" everything into a single expression.

pub(crate) fn rewrite_projections(

rewrite_projections()
  expand_function_inputs()
    rewrite_projections()

It's a bit tricky trying to debug what's happening.

@siddharth-vi
Contributor

siddharth-vi commented Oct 8, 2024

I was looking into this and other issues with pl.all() (#18968). What is happening is that for these functions INPUT_WILDCARD_EXPANSION is set to true, as @cmdlineluser has mentioned.
The definition of this flag is here:

/// There can be two ways of expanding wildcards:
///
/// Say the schema is 'a', 'b' and there is a function `f`. In this case, `f('*')` can expand
/// to:
/// 1. `f('a', 'b')`
/// 2. `f('a'), f('b')`
///
/// Setting this to true, will lead to behavior 1.
///
/// This also accounts for regex expansion.
const INPUT_WILDCARD_EXPANSION = 1 << 4;

For these functions I think the second expansion is correct, i.e. this flag should be disabled.

It's a bit tricky trying to debug what's happening.

Below is my understanding.
There are two places where wildcard expansion happens:

expr = expand_function_inputs(expr, schema, opt_flags)?;

and
replace_and_add_to_results(expr, flags, &mut result, schema, keys, opt_flags)?;

The first expansion is conditional on the flag being true.
With the flag set to true, the inputs to the function get expanded: F(*, other) becomes F(a, b, c, other). When it reaches the second line there are no wildcards left, hence nothing happens.
With the flag disabled, the first expansion does nothing: F(*, other) stays F(*, other). In the second expansion F(*, other) becomes F(a, other), F(b, other), F(c, other).
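The two passes described above can be modeled with a small plain-Python toy (not the real expr_expansion code): pass 1 (expand_function_inputs) runs only when the flag is set and splats `*` inside the call; pass 2 (replace_and_add_to_results) duplicates the whole call once per column for any wildcard still left.

```python
def expand_function_inputs(args, columns, flag):
    """Pass 1: if the flag is set, splat '*' into the argument list."""
    if not flag:
        return args
    out = []
    for a in args:
        out.extend(columns) if a == "*" else out.append(a)
    return out

def replace_and_add_to_results(args, columns):
    """Pass 2: duplicate the call once per column for a remaining '*'."""
    if "*" not in args:
        return [args]
    return [[c if a == "*" else a for a in args] for c in columns]

cols = ["a", "b", "c"]
# Flag enabled: F(*, other) -> F(a, b, c, other); pass 2 is a no-op.
print(replace_and_add_to_results(expand_function_inputs(["*", "other"], cols, True), cols))
# [['a', 'b', 'c', 'other']]
# Flag disabled: pass 1 is a no-op; pass 2 yields F(a, other), F(b, other), F(c, other).
print(replace_and_add_to_results(expand_function_inputs(["*", "other"], cols, False), cols))
# [['a', 'other'], ['b', 'other'], ['c', 'other']]
```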

I tried removing this flag and it seems to be working correctly. All tests are also passing.

df = pl.DataFrame({"a": [1], "b": [2]})

df.select(pl.duration(days=pl.all()).name.keep())


# shape: (1, 2)
# ┌──────────────┬──────────────┐
# │ a            ┆ b            │
# │ ---          ┆ ---          │
# │ duration[μs] ┆ duration[μs] │
# ╞══════════════╪══════════════╡
# │ 1d           ┆ 2d           │
# └──────────────┴──────────────┘



df = pl.DataFrame({"a": [[1, 2]], "b": [[3, 4]]})
df.select(pl.all().list.count_matches(3))


# shape: (1, 2)
# ┌─────┬─────┐
# │ a   ┆ b   │
# │ --- ┆ --- │
# │ u32 ┆ u32 │
# ╞═════╪═════╡
# │ 0   ┆ 1   │
# └─────┴─────┘

If this sounds sensible I can raise a PR for this.

@barak1412
Contributor

I am not sure what the proper logic is.
For instance, this is a code from one of the unit tests:

df = pl.DataFrame([[{"a": 1, "b": 2}], [{"c": 3, "d": None}]])
result = df.select(pl.concat_list(pl.all()).alias("as_list"))
assert result.to_dict(as_series=False) == {
    "as_list": [
        [
            {"a": 1, "b": 2, "c": None, "d": None},
            {"a": None, "b": None, "c": 3, "d": None},
        ]
    ]
}

So pl.concat_list(pl.all()) should expand, but pl.duration(pl.all()) shouldn't?
I mean, API-wise, as a user, how should I guess when it is going to expand?

I think it is only clear in the scope of a namespace - pl.all().list.set_intersection(...).

@mcrumiller
Contributor

mcrumiller commented Oct 22, 2024

So pl.concat_list(pl.all()) should expand

Yes, the input arguments are columns that all get used in the same operation.

but pl.duration(pl.all())

We have different types of arguments here, which causes a problem. The issue is that using df.function(x=pl.all(), y=None) causes what one might call an "argument overflow" error, in that the y parameter is overwritten by the pl.all() expansion.

@barak1412
Contributor

barak1412 commented Oct 22, 2024

I get the "argument overflow" thing, but I still don't get:

When you see pl.func(pl.all()), as a user, what do you expect to see as a result?

  • [pl.func(col1), pl.func(col2), ...]
    or
  • pl.func(col1, col2, ..)

@siddharth-vi
Contributor

We can make a distinction on how pl.func(pl.all()) should behave:

  • If it is a function which conceptually can take infinite arguments, such as max_horizontal or concat_list, the result should be pl.func(col1, col2, ..)
  • If it is a function which has a fixed number of arguments, such as pl.duration, the result should be [pl.func(col1), pl.func(col2), ...]

I think this is how the functions have been designed till now, based on the docs for these functions. If required, we can make the docs clearer there so that users know how the wildcard expansion will behave.
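The distinction could be sketched in plain Python as a per-function property (names here are made up for illustration, not the Polars implementation):

```python
# Hypothetical registry: functions that conceptually take unbounded inputs.
VARIADIC = {"max_horizontal", "concat_list"}

def expand_wildcard(func, columns):
    if func in VARIADIC:
        # One call consuming every column: f(a, b, ...)
        return [(func, columns)]
    # Fixed-arity function: one call per column: f(a), f(b), ...
    return [(func, [c]) for c in columns]

cols = ["a", "b"]
print(expand_wildcard("concat_list", cols))  # [('concat_list', ['a', 'b'])]
print(expand_wildcard("duration", cols))     # [('duration', ['a']), ('duration', ['b'])]
```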

@mcrumiller
Contributor

mcrumiller commented Oct 22, 2024

In general, the first one, but the Polars API often allows for (and even prefers) the second when we're dealing with variadic arguments. For example, pl.col(...).function().over(a, b, c, d, ...). In that case, calling .over(pl.all()) is more or less a stand-in for .over([a, b, c, d, ...]), noting that in the lazy engine these columns may not even exist yet. The issue here is that pl.all() is itself a singular expression, and the expansion is not happening in Python.

Most polars functions that accept variadic columns do something like the following (but not always):

def foo(a, *extra_args):
    if isinstance(a, list):
        args = a
        if extra_args:
            raise ValueError("Must supply single list or args individually")
    else:
        args = [a, *extra_args]
    return sum(args)

my_list = [1, 2, 3]

foo(1, 2, 3)  # 6
foo(my_list)  # 6
foo(*my_list)  # 6

This is fine, because foo expects all of the inputs to be the same type, used as if they were a single list. But if instead we had:

def foo(a, b=None, c=None):
    ...

then when calling foo(pl.all()), one would not expect b and c to be overwritten, as pl.all() is a single expression. But it's overflowing.
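The overflow can be made concrete with a plain-Python sketch (not Polars; `foo` is just the toy signature above with a body added for illustration):

```python
# Toy version of the fixed-arity signature above.
def foo(a, b=None, c=None):
    return {"a": a, "b": b, "c": c}

expanded = [1, 2]  # stand-in for pl.all() expanded over two columns

# The caller wrote foo(wildcard) intending only a=wildcard, but splatting
# the expansion into positional arguments overwrites b as well:
print(foo(*expanded))  # {'a': 1, 'b': 2, 'c': None}

# One call per column leaves b and c untouched:
print([foo(col) for col in expanded])
# [{'a': 1, 'b': None, 'c': None}, {'a': 2, 'b': None, 'c': None}]
```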


My guess is that the decision to include variadics is to allow for generators as inputs to functions, as in:

my_func(x for x in my_list)

where instead of simply iterating a list we might be doing something more complex. But this ends up with some of these side-effects.

@barak1412
Contributor

@mcrumiller @siddharth-vi thanks.

In my opinion, generally, for consistency, we need to make a clear distinction:

  1. pl.all().<namespace>.<function>(), should be translated to [col('c1').<namespace>.<function>(), col('c2').<namespace>.<function>(), ..].
  2. pl.<function>(pl.all()), should be translated to pl.<function>(col('c1'), col('c2'), ...).
  3. over(pl.all()), should be translated to over(col('c1'), col('c2'), ...).

The issue about python keyword arguments should be fixed with the above in mind, in my opinion.
For instance, pl.duration(pl.all()) may be disallowed, while pl.all().dt.duration() would be allowed.

@ritchie46 @orlp Do you have any clear statement about this subject? thanks.

@mcrumiller
Contributor

pl.all() is just an expression that means "all columns" and its usage depends on the context, but it is not meant to interact at all with the way Python parses arguments, variadic or not. Due to an implementation detail, right now the internal expansion of pl.all() spills over into other arguments when it shouldn't. The distinction you spelled out is already the intended usage and doesn't require any modification of behavior, aside from the bug fix.
