Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Spark: Support singular form of years, months, days, and hours functions #12117

Open
wants to merge 4 commits into
base: main
Choose a base branch
from

Conversation

wypoon
Copy link
Contributor

@wypoon wypoon commented Jan 27, 2025

Since Iceberg 1.4 (#8192), Spark has supported using the singular form (year, month, day, hour) of partition transforms to be consistent with the spec, while still supporting the plural form of those transforms for backward compatibility.
However, SparkFunctions still supported only the plural form (years, months, days, hours) for functions that can be used in queries.
For consistency, we add the singular form of those functions so the singular form can be used in queries as well.

@github-actions github-actions bot added the spark label Jan 27, 2025
Comment on lines 35 to 42
.put("years", new YearsFunction())
.put("year", new YearsFunction())
.put("months", new MonthsFunction())
.put("month", new MonthsFunction())
.put("days", new DaysFunction())
.put("day", new DaysFunction())
.put("hours", new HoursFunction())
.put("hour", new HoursFunction())
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These function classes contain examples with plural styles in javadoc. Can we add another example or just update them?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's a good point. Let me update the javadoc of those functions to add examples with the singular form.
The functions also have name() methods that return the plural form. I'm not sure if we should change the name though. For now, my thinking is to leave the name alone, but just to support using the singular form (and documenting the use in javadoc). If there is a doc page that should be updated, I'll be happy to update it if you let me know.

@@ -39,6 +39,9 @@ public void testDates() {
assertThat(scalarSql("SELECT system.days(date('2017-12-01'))"))
.as("Expected to produce 2017-12-01")
.isEqualTo(Date.valueOf("2017-12-01"));
assertThat(scalarSql("SELECT system.day(date('2017-12-01'))"))
Copy link
Contributor

@ebyhr ebyhr Jan 28, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we also update negative test cases? e.g. testWrongNumberOfArguments
The error message always uses plural names because function name is hard-coded in name() and canonicalName() methods in each function class. We may want to add a constructor taking the function name to report the correct function name.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I didn't repeat all the test cases, as it does not seem necessary. I just added a sample of each (posititive) case with the singular form to demonstrate that the form can be used.

Copy link
Contributor

@ebyhr ebyhr Jan 30, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It hides the fact that these functions can throw different function names which users specified. By the way, I'm not requesting repeating all the test cases.

Copy link
Contributor Author

@wypoon wypoon Jan 30, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, I see what you're getting at.
I decided to add constructors to the functions so that we know what form (singular or plural) is being used; plural is the default to preserve existing behavior.

Comment on lines 35 to 36
.put("years", new YearsFunction())
.put("year", new YearsFunction())
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We could extract a constant instead of initializing two instances.
(This comment might be conflicted with my another comment in TestSparkDaysFunction)

Also, use constants for the functions in SparkFunctions.
@@ -31,6 +31,8 @@
* A Spark function implementation for the Iceberg day transform.
*
* <p>Example usage: {@code SELECT system.days('source_col')}.
*
* <p>Alternate form: {@code SELECT system.day('source_col')}.
*/
public class DaysFunction extends UnaryUnboundFunction {
Copy link
Contributor Author

@wypoon wypoon Jan 28, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@ebyhr if I understand you correctly, you also suggest introducing a constructor that takes a String name and then have name() return this name, right?
So new DaysFunction("day") would return "day" when its name() is called, while new DaysFunction("days") would return "days".
I think that introduces more complexity than it's worth.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We don't want the function to be constructed with an arbitrary name. We really just want to name to be either "day" or "days" in this case. So we can have a constructor with a boolean.

@wypoon
Copy link
Contributor Author

wypoon commented Jan 28, 2025

@Fokko @nastra would you mind reviewing this?

Add tests for using the functions in queries.
@wypoon
Copy link
Contributor Author

wypoon commented Jan 31, 2025

@ebyhr thank you for your reviews.
@szehon-ho @amogh-jahagirdar would you mind taking a look? It would be nice to get this into 1.8.0.

Comment on lines -44 to +66
YearsFunction.class, new YearsFunction(),
MonthsFunction.class, new MonthsFunction(),
DaysFunction.class, new DaysFunction(),
HoursFunction.class, new HoursFunction(),
BucketFunction.class, new BucketFunction(),
TruncateFunction.class, new TruncateFunction());
YearsFunction.class, YEARS_FUNCTION,
MonthsFunction.class, MONTHS_FUNCTION,
DaysFunction.class, DAYS_FUNCTION,
HoursFunction.class, HOURS_FUNCTION,
BucketFunction.class, BUCKET_FUNCTION,
TruncateFunction.class, TRUNCATE_FUNCTION);
Copy link
Contributor Author

@wypoon wypoon Jan 31, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I had to make a choice here. CLASS_TO_FUNCTIONS is used by loadFunctionsByClass(Class) and that is called by the ReplaceStaticInvoke rule. The StaticInvoke contains a class and that is all we have to go by; we do not know if the class belongs to an instance with singular equal to true or false. So I map the classes to function instances of the plural form (to preserve existing behavior).
I didn't feel it worthwhile to have two classes instead of one to distinguish between singular and plural forms.
The tradeoff is that if we inspect the internals of a Spark LogicalPlan and we look at an ApplyFunctionExpression corresponding to e.g., applying YearsFunction.TimestampToYearsFunction, and asks it for its name, we will get "years" (which comes from YearsFunction.TimestampToYearsFunction::name()) regardless of whether we called year or years in our SQL query. (This happens in the verification part of some tests.)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants