
pl.duration(days=pl.all()) also passes values to hours= #19007

Closed
2 tasks done
cmdlineluser opened this issue Sep 29, 2024 · 11 comments · Fixed by #19449
Labels
accepted Ready for implementation bug Something isn't working P-medium Priority: medium python Related to Python Polars

Comments

@cmdlineluser
Contributor

cmdlineluser commented Sep 29, 2024

Checks

  • I have checked that this issue has not already been reported.
  • I have confirmed this bug exists on the latest version of Polars.

Reproducible example

import polars as pl

df = pl.DataFrame({"a": [1], "b": [2]})

df.select(pl.duration(days=pl.all()))
# shape: (1, 1)
# ┌──────────────┐
# │ literal      │
# │ ---          │
# │ duration[μs] │
# ╞══════════════╡
# │ 1d 2h        │
# └──────────────┘

Log output

No response

Issue description

I think this may actually be another form of:

It seems to be doing

pl.duration(col(a), col(b))

instead of

pl.duration(col(a))
pl.duration(col(b))
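A plain-Python sketch of the suspected miswiring (not Polars internals; `duration` here is a stand-in that just totals hours, reduced to two components for illustration):

```python
# Stand-in for pl.duration, reduced to days/hours for illustration.
def duration(days=0, hours=0):
    return days * 24 + hours  # total hours

# pl.all() expanded over columns a=1, b=2 before the call is built:
expanded_columns = [1, 2]

# Buggy behavior: the expanded columns fill successive parameters,
# so days=a and hours=b -> "1d 2h".
buggy = duration(*expanded_columns)

# Expected behavior: one duration() per column, days only.
expected = [duration(days=c) for c in expanded_columns]

print(buggy)     # 26  (1 day + 2 hours)
print(expected)  # [24, 48]
```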

Expected behavior

I was expecting a duplicate `literal` name error, but thought adding .name.keep() would make it the same as:

df.select(
    pl.duration(days="a").name.keep(),
    pl.duration(days="b").name.keep()
)

# shape: (1, 2)
# ┌──────────────┬──────────────┐
# │ a            ┆ b            │
# │ ---          ┆ ---          │
# │ duration[μs] ┆ duration[μs] │
# ╞══════════════╪══════════════╡
# │ 1d           ┆ 2d           │
# └──────────────┴──────────────┘

Installed versions

--------Version info---------
Polars:               1.8.2
Index type:           UInt32
Platform:             macOS-13.6.1-arm64-arm-64bit
Python:               3.12.2 (main, Feb  6 2024, 20:19:44) [Clang 15.0.0 (clang-1500.1.0.2.5)]

----Optional dependencies----
adbc_driver_manager:  <not installed>
cloudpickle:          <not installed>
connectorx:           <not installed>
deltalake:            <not installed>
fastexcel:            <not installed>
fsspec:               <not installed>
gevent:               <not installed>
great_tables:         <not installed>
hvplot:               <not installed>
matplotlib:           <not installed>
nest_asyncio:         <not installed>
numpy:                1.26.4
openpyxl:             <not installed>
pandas:               2.2.1
pyarrow:              15.0.2
pydantic:             <not installed>
pyiceberg:            <not installed>
sqlalchemy:           <not installed>
torch:                <not installed>
xlsx2csv:             <not installed>
xlsxwriter:           <not installed>
@cmdlineluser cmdlineluser added bug Something isn't working needs triage Awaiting prioritization by a maintainer python Related to Python Polars labels Sep 29, 2024
@mcrumiller
Contributor

mcrumiller commented Sep 30, 2024

Looks like all the args get consumed:

df = pl.DataFrame(
    {
        "a": [1, 2],
        "b": [3, 4],
        "c": [5, 6],
        "d": [7, 8],
        "e": [9, 10],
        "f": [11, 12],
        "g": [13, 14],
    }
)
df.select(pl.duration(days=pl.all()))
# shape: (2, 1)
# ┌─────────────────────┐
# │ literal             │
# │ ---                 │
# │ duration[μs]        │
# ╞═════════════════════╡
# │ 1d 3h 5m 7s 9011µs  │
# │ 2d 4h 6m 8s 10012µs │
# └─────────────────────┘

@cmdlineluser
Contributor Author

@mcrumiller Yeah, it looks like it is the same thing as the issue with some of the .list.* methods.

This should output 7 columns, right?

df = pl.DataFrame(
    {
        "a": [[1, 2]],
        "b": [[3, 4]],
        "c": [[5, 6]],
        "d": [[7, 8]],
        "e": [[9, 10]],
        "f": [[11, 12]],
        "g": [[13, 14]],
    }
)

df.select(pl.all().list.concat(539)) 
# shape: (1, 1)
# ┌──────────────────────────────────────────────────────┐
# │ a                                                    │
# │ ---                                                  │
# │ list[i64]                                            │
# ╞══════════════════════════════════════════════════════╡
# │ [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 539] │ 
# └──────────────────────────────────────────────────────┘

It seems to be related to the INPUT_WILDCARD_EXPANSION flag during expr_expansion.

@orlp orlp added accepted Ready for implementation P-medium Priority: medium and removed needs triage Awaiting prioritization by a maintainer labels Oct 2, 2024
@github-project-automation github-project-automation bot moved this to Ready in Backlog Oct 2, 2024
@cmdlineluser
Contributor Author

Not sure if I'm on the right track here - but this also seems to be the cause of the linked struct issue:

df = pl.DataFrame({"X": {"a": 1}, "Y": {"a": 3}})

df.select(pl.all().struct.with_fields())
# shape: (1, 1)
# ┌───────────┐
# │ X         │
# │ ---       │
# │ struct[2] │
# ╞═══════════╡
# │ {1,{3}}   │
# └───────────┘

From what I can tell, if INPUT_WILDCARD_EXPANSION is enabled, it ends up with a recursive call in rewrite_projections which seems to "flatten" everything into a single expression.

pub(crate) fn rewrite_projections(

rewrite_projections()
  expand_function_inputs()
    rewrite_projections()

It's a bit tricky trying to debug what's happening.

@siddharth-vi
Contributor

siddharth-vi commented Oct 8, 2024

I was looking into this and other issues with pl.all() (#18968). What is happening is that for these functions INPUT_WILDCARD_EXPANSION is set to true, as @cmdlineluser has mentioned.
The definition of this flag is here:

/// There can be two ways of expanding wildcards:
///
/// Say the schema is 'a', 'b' and there is a function `f`. In this case, `f('*')` can expand
/// to:
/// 1. `f('a', 'b')`
/// 2. `f('a'), f('b')`
///
/// Setting this to true, will lead to behavior 1.
///
/// This also accounts for regex expansion.
const INPUT_WILDCARD_EXPANSION = 1 << 4;

For these functions I think the second expansion is correct, i.e. this flag should be disabled.

It's a bit tricky trying to debug what's happening.

Below is my understanding.
There are two places where wildcard expansion happens:

expr = expand_function_inputs(expr, schema, opt_flags)?;

and
replace_and_add_to_results(expr, flags, &mut result, schema, keys, opt_flags)?;

The first expansion is conditional on the flag being true.
With the flag set to true, the inputs to the function get expanded: F(*, other) becomes F(a, b, c, other). When it reaches the second line there are no wildcards left, hence nothing happens.
With the flag disabled, the first expansion does nothing: F(*, other) stays F(*, other). In the second expansion F(*, other) becomes F(a, other), F(b, other), F(c, other).
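The two passes described above can be modeled with a small plain-Python toy (not the real expr_expansion code): pass 1 (expand_function_inputs) runs only when the flag is set and splats `*` inside the call; pass 2 (replace_and_add_to_results) duplicates the whole call once per column for any wildcard still left.

```python
def expand_function_inputs(args, columns, flag):
    """Pass 1: if the flag is set, splat '*' into the argument list."""
    if not flag:
        return args
    out = []
    for a in args:
        out.extend(columns) if a == "*" else out.append(a)
    return out

def replace_and_add_to_results(args, columns):
    """Pass 2: duplicate the call once per column for a remaining '*'."""
    if "*" not in args:
        return [args]
    return [[c if a == "*" else a for a in args] for c in columns]

cols = ["a", "b", "c"]
# Flag enabled: F(*, other) -> F(a, b, c, other); pass 2 is a no-op.
print(replace_and_add_to_results(expand_function_inputs(["*", "other"], cols, True), cols))
# [['a', 'b', 'c', 'other']]
# Flag disabled: pass 1 is a no-op; pass 2 yields F(a, other), F(b, other), F(c, other).
print(replace_and_add_to_results(expand_function_inputs(["*", "other"], cols, False), cols))
# [['a', 'other'], ['b', 'other'], ['c', 'other']]
```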

I tried removing this flag and it seems to be working correctly. All tests are also passing.

df = pl.DataFrame({"a": [1], "b": [2]})

df.select(pl.duration(days=pl.all()).name.keep())


# shape: (1, 2)
# ┌──────────────┬──────────────┐
# │ a            ┆ b            │
# │ ---          ┆ ---          │
# │ duration[μs] ┆ duration[μs] │
# ╞══════════════╪══════════════╡
# │ 1d           ┆ 2d           │
# └──────────────┴──────────────┘



df = pl.DataFrame({"a": [[1, 2]], "b": [[3, 4]]})
df.select(pl.all().list.count_matches(3))


# shape: (1, 2)
# ┌─────┬─────┐
# │ a   ┆ b   │
# │ --- ┆ --- │
# │ u32 ┆ u32 │
# ╞═════╪═════╡
# │ 0   ┆ 1   │
# └─────┴─────┘

If this sounds sensible I can raise a PR for this.

@barak1412
Contributor

I am not sure what the proper logic is.
For instance, this is a code from one of the unit tests:

df = pl.DataFrame([[{"a": 1, "b": 2}], [{"c": 3, "d": None}]])
result = df.select(pl.concat_list(pl.all()).alias("as_list"))
assert result.to_dict(as_series=False) == {
    "as_list": [
        [
            {"a": 1, "b": 2, "c": None, "d": None},
            {"a": None, "b": None, "c": 3, "d": None},
        ]
    ]
}

So pl.concat_list(pl.all()) should expand, but pl.duration(pl.all()) shouldn't?
I mean, API-wise, as a user, how should I guess when it is going to expand?

I think it is only clear in the scope of a namespace - pl.all().list.set_intersection(...).

@mcrumiller
Contributor

mcrumiller commented Oct 22, 2024

So pl.concat_list(pl.all()) should expand

Yes, the input arguments are columns that all get used in the same operation.

but pl.duration(pl.all())

We have different types of arguments here, which causes a problem. The issue is that using df.function(x=pl.all(), y=None) causes what one might call an "argument overflow" error, in that the y parameter is overwritten by the pl.all() expansion.

@barak1412
Contributor

barak1412 commented Oct 22, 2024

I get the "argument overflow" thing, but I still don't get:

When you see pl.func(pl.all()), as a user, what do you expect to see as a result?

  • [pl.func(col1), pl.func(col2), ...]
    or
  • pl.func(col1, col2, ..)

@siddharth-vi
Contributor

We can make a distinction on how pl.func(pl.all()) should behave:

  • If it is a function which conceptually can take infinite arguments, such as max_horizontal or concat_list, the result should be pl.func(col1, col2, ..)
  • If it is a function which has a fixed number of arguments, such as pl.duration, the result should be [pl.func(col1), pl.func(col2), ...]

I think this is how the functions have been designed till now, based on the docs for these functions. If required, we can make the docs clearer there so that users know how the wildcard expansion will behave.
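The distinction could be sketched in plain Python as a per-function property (names here are made up for illustration, not the Polars implementation):

```python
# Hypothetical registry: functions that conceptually take unbounded inputs.
VARIADIC = {"max_horizontal", "concat_list"}

def expand_wildcard(func, columns):
    if func in VARIADIC:
        # One call consuming every column: f(a, b, ...)
        return [(func, columns)]
    # Fixed-arity function: one call per column: f(a), f(b), ...
    return [(func, [c]) for c in columns]

cols = ["a", "b"]
print(expand_wildcard("concat_list", cols))  # [('concat_list', ['a', 'b'])]
print(expand_wildcard("duration", cols))     # [('duration', ['a']), ('duration', ['b'])]
```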

@mcrumiller
Contributor

mcrumiller commented Oct 22, 2024

In general, the first one, but the Polars API often allows for (and even prefers) the second when we're dealing with variadic arguments. For example, pl.col(...).function().over(a, b, c, d, ...). In that case, calling .over(pl.all()) is more or less a stand-in for .over([a, b, c, d, ...]), noting that in the lazy engine these columns may not even exist yet. The issue here is that pl.all() is itself a singular expression, and the expansion is not happening in Python.

Most polars functions that accept variadic columns do something like the following (but not always):

def foo(a, *extra_args):
    if isinstance(a, list):
        args = a
        if extra_args:
            raise ValueError("Must supply single list or args individually")
    else:
        args = [a, *extra_args]
    return sum(args)

my_list = [1, 2, 3]

foo(1, 2, 3)  # 6
foo(my_list)  # 6
foo(*my_list)  # 6

This is fine, because foo expects all of the inputs to be the same type, used as if they were a single list. But if instead we had:

def foo(a, b=None, c=None):
    ...

then when calling foo(pl.all()), one would not expect b and c to be overwritten, as pl.all() is a single expression. But it's overflowing.
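The overflow can be made concrete with a plain-Python sketch (not Polars; `foo` is just the toy signature above with a body added for illustration):

```python
# Toy version of the fixed-arity signature above.
def foo(a, b=None, c=None):
    return {"a": a, "b": b, "c": c}

expanded = [1, 2]  # stand-in for pl.all() expanded over two columns

# The caller wrote foo(wildcard) intending only a=wildcard, but splatting
# the expansion into positional arguments overwrites b as well:
print(foo(*expanded))  # {'a': 1, 'b': 2, 'c': None}

# One call per column leaves b and c untouched:
print([foo(col) for col in expanded])
# [{'a': 1, 'b': None, 'c': None}, {'a': 2, 'b': None, 'c': None}]
```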


My guess is that the decision to include variadics is to allow for generators as inputs to functions, as in:

my_func(x for x in my_list)

where instead of simply iterating a list we might be doing something more complex. But this ends up with some of these side-effects.

@barak1412
Contributor

@mcrumiller @siddharth-vi thanks.

In my opinion, generally, for consistency, we need to make a clear distinction:

  1. pl.all().<namespace>.<function>(), should be translated to [col('c1').<namespace>.<function>(), col('c2').<namespace>.<function>(), ..].
  2. pl.<function>(pl.all()), should be translated to pl.<function>(col('c1'), col('c2'), ...).
  3. over(pl.all()), should be translated to over(col('c1'), col('c2'), ...).

The issue about python keyword arguments should be fixed with the above in mind, in my opinion.
For instance, pl.duration(pl.all()) may be disallowed, while pl.all().dt.duration() would be allowed.

@ritchie46 @orlp Do you have any clear statement about this subject? thanks.

@mcrumiller
Contributor

pl.all() is just an expression that means "all columns" and its usage depends on the context, but it is not meant to interact at all with the way Python parses arguments, variadic or not. Due to an implementation detail, right now the internal expansion of pl.all() spills over into other arguments when it shouldn't. The distinction you spelled out is already the intended usage and doesn't require any modification of behavior, aside from the bug fix.
