Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: add Series|Expr.rank #1342

Merged
merged 25 commits into from
Jan 7, 2025
Merged

feat: add Series|Expr.rank #1342

merged 25 commits into from
Jan 7, 2025

Conversation

FBruzzesi
Copy link
Member

@FBruzzesi FBruzzesi commented Nov 9, 2024

What type of PR is this? (check all applicable)

  • 💾 Refactor
  • ✨ Feature
  • 🐛 Bug Fix
  • 🔧 Optimization
  • 📝 Documentation
  • ✅ Test
  • 🐳 Other

Related issues

Checklist

  • Code follows style guide (ruff)
  • Tests added
  • Documented the changes

If you have comments or can explain your changes, please do so below.

This PR provides initial support for rank method. I will start it as a draft due to a bunch of shortcomings:

  • pandas:
  • pyarrow:
    • does not support polars default method (namely, "average"), therefore if rank is called without specifying another method, it will end up raising an error
    • does not implement rank in a group by context
    • I am using pyarrow.compute.rank which is available but not exposed/documented (?)
  • dask: it just does not support ranking
  • polars:
    • group by context always returns an aggregate, in this case a list of ranks - which is fairly useless as it is a list of increasing/descreasing range until the size of the group

@github-actions github-actions bot added the enhancement New feature or request label Nov 9, 2024
Comment on lines +747 to +749
# crazy workaround for the case of `na_option="keep"` and nullable
# integer dtypes. This should be supported in pandas > 3.0
# https://github.com/pandas-dev/pandas/issues/56976
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here is the workaround.

@MarcoGorelli I was not able to properly use the pandas like util function get_dtype_backend to figure out the nullable backend. It should not really matter as the non-nullable backend would not result in integer type if the series contains nulls anyway

@FBruzzesi
Copy link
Member Author

Hey @adamblake, this is an initial implementation to support rank. In the description I tried to explain all the shortcomings and the challenges I am facing.

@FBruzzesi FBruzzesi changed the title feat: ass Series|Expr.rank feat: add Series|Expr.rank Nov 10, 2024
@FBruzzesi FBruzzesi marked this pull request as ready for review December 26, 2024 19:26
@FBruzzesi
Copy link
Member Author

I will need to come back to the failing tests: it seems that rank for pandas version 2.0.x with pyarrow backend fails for integer dtype but it works fine for floats

@MarcoGorelli
Copy link
Member

Nice, thanks for sticking with this!

Does it risk being annoying if the default ('average') isn't available in several backends?

Could we not set any default, so the user is 'forced' to specify a method (like we do in quantile)?

@FBruzzesi
Copy link
Member Author

Nice, thanks for sticking with this!

Of course! Adding the possibility to using it in over context is a great addition for the use case reported in the issue in the first place!

Does it risk being annoying if the default ('average') isn't available in several backends?

For now this is just pyarrow, right?

Could we not set any default, so the user is 'forced' to specify a method (like we do in quantile)?

I don't have a strong opinion on this, I am fine to give more responsibility to the user

@MarcoGorelli
Copy link
Member

For now this is just pyarrow, right?

you're right sorry, i thought it would be the same for pyarrow-backed pandas. thinking ahead, I think duckdb doesn't support it either

Copy link
Member

@MarcoGorelli MarcoGorelli left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks @FBruzzesi !

@FBruzzesi FBruzzesi merged commit 5c0a33a into main Jan 7, 2025
24 checks passed
@FBruzzesi FBruzzesi deleted the feat/expr-rank branch January 7, 2025 09:19
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[Enh]: Support polars.Expr.rank
2 participants