chore: move scalar_funcs into spark-expr #712

Merged
merged 12 commits into from
Jul 28, 2024

Conversation

Blizzara
Contributor

Which issue does this PR close?

Part of #659

Rationale for this change

Moves scalar_funcs defined for Comet into spark-expr crate to facilitate use in other projects.

What changes are included in this PR?

How are these changes tested?

Existing CI

@Blizzara Blizzara changed the title from "chore: move scalar_funcs and some hashing stuff into spark-expr" to "chore: move scalar_funcs into spark-expr" Jul 24, 2024
@@ -0,0 +1,186 @@
// Licensed to the Apache Software Foundation (ASF) under one
Contributor Author

@Blizzara Blizzara Jul 24, 2024

This file is an extract of what used to be scalar_funcs.rs. The create_comet_physical_expr isn't easily reusable for others so it seems reasonable to keep it here.

It could be nice to provide ready-made ScalarUDFs for these in spark-expr and a function to register all of them into the session context, like DF's default functions do. However the way these take in the output data_type makes that a tad challenging, so I didn't do it here.
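To make the difficulty above concrete, here is a toy, std-only sketch. It assumes nothing about the real DataFusion or Comet APIs: `OutType`, `Kernel`, `BoundFn`, `toy_increment`, `bind` and `register_all` are all hypothetical names invented for illustration. The point it tries to show is that a kernel taking the output type as an argument has to be bound to a concrete type before a name-keyed registry can hold it.

```rust
use std::collections::HashMap;

/// Toy stand-in for an output data type (NOT the real Arrow `DataType`).
#[derive(Clone, Copy, Debug, PartialEq, Eq, Hash)]
enum OutType {
    Int32,
    Int64,
}

/// Hypothetical kernel shape: the caller passes the output type explicitly.
type Kernel = fn(&[i64], OutType) -> Vec<i64>;

/// A kernel with its output type already bound, callable with arguments only.
type BoundFn = Box<dyn Fn(&[i64]) -> Vec<i64>>;

/// Toy kernel: adds one, wrapping at 32 bits when asked for `Int32`.
fn toy_increment(args: &[i64], out: OutType) -> Vec<i64> {
    args.iter()
        .map(|&v| match out {
            OutType::Int32 => (v as i32).wrapping_add(1) as i64,
            OutType::Int64 => v.wrapping_add(1),
        })
        .collect()
}

/// Binds a kernel to one concrete output type.
fn bind(kernel: Kernel, out: OutType) -> BoundFn {
    Box::new(move |args: &[i64]| kernel(args, out))
}

/// Registers one entry per (name, output type) pair. A registry keyed by
/// name alone could hold only one output type per function.
fn register_all() -> HashMap<(&'static str, OutType), BoundFn> {
    let mut registry: HashMap<(&'static str, OutType), BoundFn> = HashMap::new();
    for out in [OutType::Int32, OutType::Int64] {
        registry.insert(("toy_increment", out), bind(toy_increment, out));
    }
    registry
}
```

In this toy model the same function name resolves to different behaviour depending on the bound output type, which is roughly why a single "register everything by name" helper does not fall out for free.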

Member

This looks good. I wonder if having comet in the file name is a bit redundant though

Contributor Author

It kinda is, but I had it there to distinguish it from spark-expr/src/scalar_funcs.rs, and because what this file does is related to CometScalarUDFs 🤷

@@ -1413,6 +1413,14 @@ impl RecordBatchStream for EmptyStream {
}
}

fn pmod(hash: u32, n: usize) -> usize {
Contributor Author

@Blizzara Blizzara Jul 24, 2024

this was not used anywhere else so I moved it here
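For readers who have not met `pmod` before, here is a minimal sketch of a positive modulus with the same signature as the function in the hunk above. The body is an assumption (the diff only shows the signature): it reinterprets the hash as a signed 32-bit integer, as a JVM `int` would be, and then folds a negative remainder back into `0..n`. The name `pmod_sketch` is hypothetical, and the sketch assumes `0 < n <= i32::MAX`.

```rust
/// Positive modulus: maps a 32-bit hash to an index in `0..n`.
/// Assumes `0 < n <= i32::MAX`; the body is illustrative, not copied from the PR.
fn pmod_sketch(hash: u32, n: usize) -> usize {
    // Reinterpret the bits as signed, since the hash is a signed int on the JVM side.
    let hash = hash as i32;
    let n = n as i32;
    // Rust's `%` keeps the sign of the dividend, so `r` may be negative.
    let r = hash % n;
    let result = if r < 0 { (r + n) % n } else { r };
    result as usize
}
```

A typical use would be picking a shuffle partition, for example `pmod_sketch(hash, num_partitions)`.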

@Blizzara Blizzara marked this pull request as ready for review July 24, 2024 12:09
}

- pub fn chr(args: &[ArrayRef]) -> Result<ArrayRef> {
+ fn chr(args: &[ArrayRef]) -> Result<ArrayRef> {
Contributor Author

these changes are not necessary, so I can revert them if that's preferable. However, given we already have the ScalarUDFImpl here, it seems like a waste not to use it (not using it also means the CometScalarUDF wraps a function that wraps a ScalarUDFImpl, which may have some nanoseconds of perf impact)

@Blizzara
Contributor Author

FYI @andygrove - thanks for doing the pre-work to split out the spark-exprs package!

@Blizzara Blizzara force-pushed the avo/move-scalar-funcs-to-spark-expr branch from 34fdec0 to 5feb777 Compare July 24, 2024 12:33

#[inline]
- pub(crate) fn spark_compatible_murmur3_hash<T: AsRef<[u8]>>(data: T, seed: u32) -> u32 {
+ pub fn spark_compatible_murmur3_hash<T: AsRef<[u8]>>(data: T, seed: u32) -> u32 {
Member

For consistency with other spark functions, we should probably rename this to spark_murmur3_hash

Contributor Author

done! 6780bb6

Contributor Author

actually I reverted the rename, since this is different from spark_murmur3_hash, which we also have (that one operates on ColumnarValues, while this one handles a single value).

Preferably this wouldn't be pub but it's used in the core crate for non-expression stuff (like shuffles) so I think it has to be.
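To illustrate what a single-value hash of this kind looks like, here is a std-only sketch of a 32-bit Murmur3 in the Spark style. It is an assumption-laden illustration, not the crate's code: the name `murmur3_spark_sketch` and its helpers are hypothetical, and the handling of trailing bytes (each leftover byte is sign-extended and mixed on its own) follows my understanding of Spark's `Murmur3_x86_32.hashUnsafeBytes` rather than anything shown in this PR.

```rust
const C1: u32 = 0xcc9e_2d51;
const C2: u32 = 0x1b87_3593;

/// Scrambles one 32-bit block before it is folded into the state.
fn mix_k1(k1: u32) -> u32 {
    k1.wrapping_mul(C1).rotate_left(15).wrapping_mul(C2)
}

/// Folds a scrambled block into the running hash state.
fn mix_h1(h1: u32, k1: u32) -> u32 {
    (h1 ^ k1)
        .rotate_left(13)
        .wrapping_mul(5)
        .wrapping_add(0xe654_6b64)
}

/// Final avalanche step; `len` is the input length in bytes.
fn fmix(mut h: u32, len: u32) -> u32 {
    h ^= len;
    h ^= h >> 16;
    h = h.wrapping_mul(0x85eb_ca6b);
    h ^= h >> 13;
    h = h.wrapping_mul(0xc2b2_ae35);
    h ^= h >> 16;
    h
}

/// Hashes one value's bytes with the given seed (hypothetical sketch).
fn murmur3_spark_sketch(data: &[u8], seed: u32) -> u32 {
    let mut h1 = seed;
    let mut chunks = data.chunks_exact(4);
    // Full 4-byte blocks are read as little-endian words.
    for chunk in &mut chunks {
        let k1 = u32::from_le_bytes([chunk[0], chunk[1], chunk[2], chunk[3]]);
        h1 = mix_h1(h1, mix_k1(k1));
    }
    // Assumed Spark quirk: each trailing byte is sign-extended and mixed separately.
    for &b in chunks.remainder() {
        let k1 = b as i8 as i32 as u32;
        h1 = mix_h1(h1, mix_k1(k1));
    }
    fmix(h1, data.len() as u32)
}
```

For example, `murmur3_spark_sketch(&0i32.to_le_bytes(), 42)` evaluates to `0x379fae8f`. A columnar variant would apply the same per-value routine to every row of an array, which is the distinction drawn in the comment above.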

Member

@andygrove andygrove left a comment

LGTM. I left a couple comments on naming, and it would be nice if all the public functions had rustdocs, but I would be fine with merging this without those changes.

@Blizzara
Contributor Author

> LGTM. I left a couple comments on naming, and it would be nice if all the public functions had rustdocs, but I would be fine with merging this without those changes.

Thanks! I did the rename in 6780bb6, and added docs in ac29169 and 8bb99ea

Blizzara added 3 commits July 25, 2024 09:33
we have separate spark_xxhash64 and spark_murmur3_hash functions which align with the name, these should collide
@codecov-commenter

codecov-commenter commented Jul 27, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 33.60%. Comparing base (ded3dd6) to head (abf05cd).
Report is 5 commits behind head on main.

Additional details and impacted files
@@             Coverage Diff              @@
##               main     #712      +/-   ##
============================================
+ Coverage     33.57%   33.60%   +0.03%     
  Complexity      830      830              
============================================
  Files           110      110              
  Lines         42608    42564      -44     
  Branches       9352     9361       +9     
============================================
- Hits          14306    14304       -2     
+ Misses        25347    25300      -47     
- Partials       2955     2960       +5     

@andygrove andygrove merged commit b04baa5 into apache:main Jul 28, 2024
74 checks passed
@Blizzara Blizzara deleted the avo/move-scalar-funcs-to-spark-expr branch July 28, 2024 16:06
himadripal pushed a commit to himadripal/datafusion-comet that referenced this pull request Sep 7, 2024
3 participants