Add pl.concat_arr: similar to pl.concat_list but returns an array data type #13846

Closed
mkleinbort-ic opened this issue Jan 19, 2024 · 3 comments · Fixed by #20999
Labels
A-dtype Area: data types in general A-dtype-list/array Area: list/array data type enhancement New feature or an improvement of an existing feature

Comments

@mkleinbort-ic

Description

Since the pl.Array data type was introduced, it has been a good way to represent "list" columns where all the lists have equal lengths.

The most common way I see this happening is when one does pl.concat_list('col_1', 'col_2', ..., 'col_n') to get a column where all lists have n elements.

One can do pl.concat_list('col_1', 'col_2', ..., 'col_n').list.to_array(width=n), but that requires knowledge of n (which might be dynamic when using selectors instead of explicit column names).

Moreover, the list format is (probably) not as optimized as the array format for this use case.

So... how about a pl.concat_arr (or pl.concat_array) to leverage these benefits?

@mkleinbort-ic mkleinbort-ic added the enhancement New feature or an improvement of an existing feature label Jan 19, 2024
@mkleinbort-ic
Author

It came to mind because some code I'm working on had the issue that

    .select(y_pred_samples = pl.struct(pl.all())) # Combine into a struct column

was 1min faster than

    .select(y_pred_samples = pl.concat_list(pl.all()).list.to_array(width=100)) # Combine into an array column

but at the end of the day I'm just going to be working with min/max/mean of the array values

@deanm0000
Collaborator

I think having a means to create an array is good, but I don't think the timing is relevant. Creating a struct is as close to free as you can get, since it doesn't have to check anything, whereas making a list or array requires ensuring the data types are compatible for each row.

@mkleinbort-ic
Author

Tbh I'm still waiting to see the benefits of the pl.Array data type. I love structs, but it's not easy to do operations on the values (.min(), .max(), .mean(), etc...)

@deanm0000 deanm0000 added A-dtype-list/array Area: list/array data type A-dtype Area: data types in general labels Feb 9, 2024
2 participants