chore: move scalar_funcs into spark-expr #712
Conversation
@@ -0,0 +1,186 @@
// Licensed to the Apache Software Foundation (ASF) under one
This file is an extract of what used to be scalar_funcs.rs. The create_comet_physical_expr isn't easily reusable for others, so it seems reasonable to keep it here.
It could be nice to provide ready-made ScalarUDFs for these in spark-expr, along with a function to register all of them into the session context, like DataFusion's default functions do. However, the way these take in the output data_type makes that a tad challenging, so I didn't do it here.
This looks good. I wonder if having comet in the file name is a bit redundant, though.
It kinda is, but I had it there just to distinguish it from spark-expr/src/scalar_funcs.rs, and because what this file does is related to CometScalarUDFs 🤷
@@ -1413,6 +1413,14 @@ impl RecordBatchStream for EmptyStream {
}
}

fn pmod(hash: u32, n: usize) -> usize {
This was not used anywhere else, so I moved it here.
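For context on what a helper like this does: the diff only shows pmod's signature, so the body below is a hedged sketch based on Spark's partitioning semantics, where a (signed) 32-bit hash must map to a non-negative partition index — not necessarily the exact code being moved.

```rust
// Hedged sketch: only pmod's signature appears in the diff above, so this
// body is an assumption modeled on Spark's positive-modulus partitioning,
// not a copy of the Comet implementation.
fn pmod(hash: u32, n: usize) -> usize {
    // Spark hashes are signed i32s, so reinterpret the bits before the modulus.
    let hash = hash as i32;
    let n = n as i32;
    let r = hash % n;
    // Rust's % keeps the sign of the dividend; shift negatives into [0, n).
    let result = if r < 0 { (r + n) % n } else { r };
    result as usize
}

fn main() {
    // u32::MAX reinterprets to -1; -1 mod 8 should land on 7, not -1.
    println!("{}", pmod(u32::MAX, 8)); // prints 7
    println!("{}", pmod(10, 8)); // prints 2
}
```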
}

- pub fn chr(args: &[ArrayRef]) -> Result<ArrayRef> {
+ fn chr(args: &[ArrayRef]) -> Result<ArrayRef> {
These changes are not necessary, so I can revert them if that's preferable. However, given we already have the ScalarUDFImpl here, it seems like a waste not to use it (it also means the CometScalarUDF wraps a function that wraps a ScalarUDFImpl, which may have some nanoseconds of perf impact).
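The chr kernel in the diff operates on Arrow ArrayRefs, but its per-value semantics can be illustrated without Arrow. The sketch below is an assumption based on Spark's documented chr behavior (code point taken modulo 256, empty string for negative input) — it is not the Comet kernel itself, and spark_chr_scalar is a hypothetical name.

```rust
// Illustrative per-value sketch of chr (the real kernel above works on Arrow
// arrays). Assumes Spark's documented semantics: chr(n) returns the character
// for n % 256, and an empty string when n is negative.
fn spark_chr_scalar(n: i64) -> String {
    if n < 0 {
        return String::new();
    }
    let code = (n % 256) as u8;
    (code as char).to_string()
}

fn main() {
    println!("{}", spark_chr_scalar(65));   // prints A
    println!("{}", spark_chr_scalar(321));  // 321 % 256 == 65, prints A
    println!("{:?}", spark_chr_scalar(-1)); // prints ""
}
```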
FYI @andygrove - thanks for doing the pre-work to split out the spark-exprs package!
Force-pushed from 34fdec0 to 5feb777
#[inline]
- pub(crate) fn spark_compatible_murmur3_hash<T: AsRef<[u8]>>(data: T, seed: u32) -> u32 {
+ pub fn spark_compatible_murmur3_hash<T: AsRef<[u8]>>(data: T, seed: u32) -> u32 {
For consistency with other spark functions, we should probably rename this to spark_murmur3_hash
done! 6780bb6
Actually, I reverted the rename since this is different from spark_murmur3_hash, which we also have (that one operates on ColumnarValues, while this handles a single value). Preferably this wouldn't be pub, but it's used in the core crate for non-expression stuff (like shuffles), so I think it has to be.
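Since the conversation distinguishes the single-value spark_compatible_murmur3_hash from the ColumnarValue-level spark_murmur3_hash, here is a self-contained sketch of what the single-value function computes. It is modeled on Spark's Murmur3_x86_32.hashUnsafeBytes (an assumption, not Comet's actual source): 4-byte little-endian words get the standard mix rounds, but each trailing byte is sign-extended and given a full mix round of its own, which is where Spark diverges from reference Murmur3.

```rust
// Sketch of a Spark-style Murmur3 x86_32 over a byte slice, modeled on
// Spark's Murmur3_x86_32.hashUnsafeBytes (assumed behavior, not Comet's code).
fn mix_k1(mut k1: i32) -> i32 {
    k1 = k1.wrapping_mul(0xcc9e_2d51_u32 as i32);
    k1 = k1.rotate_left(15);
    k1.wrapping_mul(0x1b87_3593_u32 as i32)
}

fn mix_h1(mut h1: i32, k1: i32) -> i32 {
    h1 ^= k1;
    h1 = h1.rotate_left(13);
    h1.wrapping_mul(5).wrapping_add(0xe654_6b64_u32 as i32)
}

// Final avalanche: fold the length in, then three xor-shift/multiply rounds.
fn fmix(mut h1: i32, len: i32) -> i32 {
    h1 ^= len;
    h1 ^= ((h1 as u32) >> 16) as i32;
    h1 = h1.wrapping_mul(0x85eb_ca6b_u32 as i32);
    h1 ^= ((h1 as u32) >> 13) as i32;
    h1 = h1.wrapping_mul(0xc2b2_ae35_u32 as i32);
    h1 ^ (((h1 as u32) >> 16) as i32)
}

pub fn spark_compatible_murmur3_hash<T: AsRef<[u8]>>(data: T, seed: u32) -> u32 {
    let data = data.as_ref();
    let mut h1 = seed as i32;
    let mut chunks = data.chunks_exact(4);
    for chunk in &mut chunks {
        let k1 = i32::from_le_bytes(chunk.try_into().unwrap());
        h1 = mix_h1(h1, mix_k1(k1));
    }
    // Spark's quirk: each leftover byte is sign-extended to i32 and mixed
    // as a full round, unlike the reference tail handling.
    for &b in chunks.remainder() {
        h1 = mix_h1(h1, mix_k1(b as i8 as i32));
    }
    fmix(h1, data.len() as i32) as u32
}

fn main() {
    let h = spark_compatible_murmur3_hash(b"hello", 42);
    println!("{h:08x}");
}
```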
LGTM. I left a couple comments on naming, and it would be nice if all the public functions had rustdocs, but I would be fine with merging this without those changes.
spark_compatible_xxhash64 -> spark_xxhash64: we already have separate spark_xxhash64 and spark_murmur3_hash functions that align with those names, so renaming would make these collide.
Codecov Report: All modified and coverable lines are covered by tests ✅

Additional details and impacted files:

@@ Coverage Diff @@
##               main     #712      +/-   ##
============================================
+ Coverage     33.57%   33.60%   +0.03%
  Complexity      830      830
============================================
  Files           110      110
  Lines         42608    42564      -44
  Branches       9352     9361       +9
============================================
- Hits          14306    14304       -2
+ Misses        25347    25300      -47
- Partials       2955     2960       +5
============================================

View full report in Codecov by Sentry.
(cherry picked from commit b04baa5)
Which issue does this PR close?
Part of #659
Rationale for this change
Moves scalar_funcs defined for Comet into spark-expr crate to facilitate use in other projects.
What changes are included in this PR?
How are these changes tested?
Existing CI