-
Notifications
You must be signed in to change notification settings - Fork 2.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Spark: Support singular form of years, months, days, and hours functions #12117
base: main
Are you sure you want to change the base?
Conversation
.put("years", new YearsFunction()) | ||
.put("year", new YearsFunction()) | ||
.put("months", new MonthsFunction()) | ||
.put("month", new MonthsFunction()) | ||
.put("days", new DaysFunction()) | ||
.put("day", new DaysFunction()) | ||
.put("hours", new HoursFunction()) | ||
.put("hour", new HoursFunction()) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These function classes contain examples with plural styles in javadoc. Can we add another example or just update them?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's a good point. Let me update the javadoc of those functions to add examples with the singular form.
The functions also have name()
methods that return the plural form. I'm not sure if we should change the name though. For now, my thinking is to leave the name alone, but just to support using the singular form (and documenting the use in javadoc). If there is a doc page that should be updated, I'll be happy to update it if you let me know.
@@ -39,6 +39,9 @@ public void testDates() { | |||
assertThat(scalarSql("SELECT system.days(date('2017-12-01'))")) | |||
.as("Expected to produce 2017-12-01") | |||
.isEqualTo(Date.valueOf("2017-12-01")); | |||
assertThat(scalarSql("SELECT system.day(date('2017-12-01'))")) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we also update negative test cases? e.g. testWrongNumberOfArguments
The error message always uses plural names because function name is hard-coded in name()
and canonicalName()
methods in each function class. We may want to add a constructor taking the function name to report the correct function name.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I didn't repeat all the test cases, as it does not seem necessary. I just added a sample of each (posititive) case with the singular form to demonstrate that the form can be used.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It hides the fact that these functions can throw different function names which users specified. By the way, I'm not requesting repeating all the test cases.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok, I see what you're getting at.
I decided to add constructors to the functions so that we know what form (singular or plural) is being used; plural is the default to preserve existing behavior.
.put("years", new YearsFunction()) | ||
.put("year", new YearsFunction()) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We could extract a constant instead of initializing two instances.
(This comment might be conflicted with my another comment in TestSparkDaysFunction)
Also, use constants for the functions in SparkFunctions.
@@ -31,6 +31,8 @@ | |||
* A Spark function implementation for the Iceberg day transform. | |||
* | |||
* <p>Example usage: {@code SELECT system.days('source_col')}. | |||
* | |||
* <p>Alternate form: {@code SELECT system.day('source_col')}. | |||
*/ | |||
public class DaysFunction extends UnaryUnboundFunction { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@ebyhr if I understand you correctly, you also suggest introducing a constructor that takes a String name
and then have name()
return this name, right?
So new DaysFunction("day")
would return "day" when its name()
is called, while new DaysFunction("days")
would return "days".
I think that introduces more complexity than it's worth.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We don't want the function to be constructed with an arbitrary name. We really just want to name to be either "day" or "days" in this case. So we can have a constructor with a boolean.
…forms. Singular is the default.
Add tests for using the functions in queries.
@ebyhr thank you for your reviews. |
YearsFunction.class, new YearsFunction(), | ||
MonthsFunction.class, new MonthsFunction(), | ||
DaysFunction.class, new DaysFunction(), | ||
HoursFunction.class, new HoursFunction(), | ||
BucketFunction.class, new BucketFunction(), | ||
TruncateFunction.class, new TruncateFunction()); | ||
YearsFunction.class, YEARS_FUNCTION, | ||
MonthsFunction.class, MONTHS_FUNCTION, | ||
DaysFunction.class, DAYS_FUNCTION, | ||
HoursFunction.class, HOURS_FUNCTION, | ||
BucketFunction.class, BUCKET_FUNCTION, | ||
TruncateFunction.class, TRUNCATE_FUNCTION); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I had to make a choice here. CLASS_TO_FUNCTIONS
is used by loadFunctionsByClass(Class)
and that is called by the ReplaceStaticInvoke
rule. The StaticInvoke
contains a class and that is all we have to go by; we do not know if the class belongs to an instance with singular
equal to true or false. So I map the classes to function instances of the plural form (to preserve existing behavior).
I didn't feel it worthwhile to have two classes instead of one to distinguish between singular and plural forms.
The tradeoff is that if we inspect the internals of a Spark LogicalPlan
and we look at an ApplyFunctionExpression
corresponding to e.g., applying YearsFunction.TimestampToYearsFunction
, and asks it for its name, we will get "years" (which comes from YearsFunction.TimestampToYearsFunction::name()
) regardless of whether we called year
or years
in our SQL query. (This happens in the verification part of some tests.)
Since Iceberg 1.4 (#8192), Spark has supported using the singular form (
year
,month
,day
,hour
) of partition transforms to be consistent with the spec, while still supporting the plural form of those transforms for backward compatibility.However,
SparkFunctions
still supported only the plural form (years
,months
,days
,hours
) for functions that can be used in queries.For consistency, we add the singular form of those functions so the singular form can be used in queries as well.