Support polymorphic scalar comparison functions in the multi-stage query engine #13711

Merged

Conversation

yashmayya
Collaborator

@yashmayya yashmayya commented Jul 30, 2024

  • Fixes pinot 1.1.0 - sql parsing error when filtering constant value in multi-stage engine #13699
  • The function registry was recently refactored in #13573 (Refactor function registry for multi-stage engine), which also added support for polymorphic scalar functions.
  • This patch adds polymorphic scalar function implementations for comparison functions - =, !=, >, >=, <, <= which are probably the most useful ones.
  • Currently, the function registry's lookup by argument type is used only in the v2 engine. Eventually, all usages of the lookup by argument number will be moved to the lookup by argument type (see #13673, Use argument type to lookup function for literal only query, for instance), at which point these polymorphic scalar functions will automatically be supported by the v1 engine as well. Until then, for backward compatibility, they default to the double-based scalar comparison functions when looked up by argument number.
  • Note that the polymorphic scalar comparison functions only support comparing two arguments of the same type. In case arguments of two different types need to be compared, either an implicit cast will be added by Calcite or else an explicit cast should be added by the user. This behavior is similar to Postgres.
  • All the scalar comparison functions are marked as null intolerant because they should return null if any argument is null. Again, this matches Postgres behavior.
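The points above can be illustrated with a minimal sketch. It is not Pinot's code: `PolymorphicEqualsSketch`, `ArgType`, `lookup`, and `invoke` are hypothetical names, and rejecting mixed types with an exception is an assumption made here to mirror the "same type only" rule as described. It shows one implementation per argument type, a count-based lookup that defaults to the double implementation, and null-intolerant invocation.

```java
import java.util.EnumMap;
import java.util.Map;
import java.util.function.BiFunction;

// Hypothetical, simplified stand-in for a polymorphic "equals" scalar function.
// None of these names are Pinot's actual classes.
class PolymorphicEqualsSketch {
  enum ArgType { INT, LONG, DOUBLE, STRING }

  // One implementation per argument type; both arguments share that type.
  private static final Map<ArgType, BiFunction<Object, Object, Boolean>> BY_TYPE =
      new EnumMap<>(ArgType.class);

  static {
    BY_TYPE.put(ArgType.INT, (a, b) -> ((Integer) a).intValue() == ((Integer) b).intValue());
    BY_TYPE.put(ArgType.LONG, (a, b) -> ((Long) a).longValue() == ((Long) b).longValue());
    BY_TYPE.put(ArgType.DOUBLE, (a, b) -> ((Double) a).doubleValue() == ((Double) b).doubleValue());
    BY_TYPE.put(ArgType.STRING, (a, b) -> a.equals(b));
  }

  // Lookup by argument type: pick the implementation for the shared type.
  // Assumption: mixed types are rejected so that the caller must add a cast.
  static BiFunction<Object, Object, Boolean> lookup(ArgType left, ArgType right) {
    if (left != right) {
      throw new IllegalArgumentException("Arguments must share a type: " + left + ", " + right);
    }
    return BY_TYPE.get(left);
  }

  // Lookup by argument count: no type information is available,
  // so default to the double-based implementation.
  static BiFunction<Object, Object, Boolean> lookup(int numArguments) {
    if (numArguments != 2) {
      throw new IllegalArgumentException("Comparison takes exactly 2 arguments");
    }
    return BY_TYPE.get(ArgType.DOUBLE);
  }

  // Null-intolerant invocation: any null argument yields a null result.
  static Boolean invoke(BiFunction<Object, Object, Boolean> fn, Object a, Object b) {
    return (a == null || b == null) ? null : fn.apply(a, b);
  }
}
```

In this sketch, `invoke(lookup(2), null, 1.0)` returns `null` rather than `false`, which is the behavior the description attributes to Postgres.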

@yashmayya yashmayya added enhancement bugfix multi-stage Related to the multi-stage query engine labels Jul 30, 2024
@codecov-commenter

codecov-commenter commented Jul 30, 2024

Codecov Report

Attention: Patch coverage is 64.37768% with 83 lines in your changes missing coverage. Please review.

Project coverage is 61.97%. Comparing base (59551e4) to head (fe04917).
Report is 895 commits behind head on master.

Files Patch % Lines
...ion/scalar/comparison/NotEqualsScalarFunction.java 64.28% 18 Missing and 2 partials ⚠️
...nction/scalar/comparison/EqualsScalarFunction.java 66.07% 18 Missing and 1 partial ⚠️
...r/comparison/GreaterThanOrEqualScalarFunction.java 65.38% 9 Missing ⚠️
...n/scalar/comparison/GreaterThanScalarFunction.java 65.38% 9 Missing ⚠️
...alar/comparison/LessThanOrEqualScalarFunction.java 65.38% 9 Missing ⚠️
...tion/scalar/comparison/LessThanScalarFunction.java 65.38% 9 Missing ⚠️
...omparison/PolymorphicComparisonScalarFunction.java 11.11% 8 Missing ⚠️
Additional details and impacted files
@@             Coverage Diff              @@
##             master   #13711      +/-   ##
============================================
+ Coverage     61.75%   61.97%   +0.22%     
+ Complexity      207      198       -9     
============================================
  Files          2436     2567     +131     
  Lines        133233   141691    +8458     
  Branches      20636    22015    +1379     
============================================
+ Hits          82274    87814    +5540     
- Misses        44911    47197    +2286     
- Partials       6048     6680     +632     
Flag Coverage Δ
custom-integration1 <0.01% <0.00%> (-0.01%) ⬇️
integration <0.01% <0.00%> (-0.01%) ⬇️
integration1 <0.01% <0.00%> (-0.01%) ⬇️
integration2 0.00% <0.00%> (ø)
java-11 61.92% <64.37%> (+0.21%) ⬆️
java-21 61.86% <64.37%> (+0.23%) ⬆️
skip-bytebuffers-false 61.96% <64.37%> (+0.21%) ⬆️
skip-bytebuffers-true 61.83% <64.37%> (+34.10%) ⬆️
temurin 61.97% <64.37%> (+0.22%) ⬆️
unittests 61.97% <64.37%> (+0.22%) ⬆️
unittests1 46.33% <64.37%> (-0.56%) ⬇️
unittests2 27.83% <0.00%> (+0.09%) ⬆️

@yashmayya yashmayya marked this pull request as ready for review July 30, 2024 09:56
@yashmayya yashmayya requested review from gortiz and Jackie-Jiang July 30, 2024 09:56
@gortiz
Contributor

gortiz commented Jul 30, 2024

> Note that the polymorphic scalar comparison functions only support comparing two arguments of the same type. In case arguments of two different types need to be compared, an explicit cast should be added by the user. This behavior is similar to Postgres.

AFAIU that is not 100% correct. Calcite will add implicit casts in case they can be used.

gortiz
gortiz previously approved these changes Jul 30, 2024
@yashmayya
Collaborator Author

yashmayya commented Jul 30, 2024

> AFAIU that is not 100% correct. Calcite will add implicit casts in case they can be used.

Yeah, that's a valid point, but in this case it'll still mean that the scalar comparison functions should only be called with arguments of the same type. Also, I just checked and even Postgres does use some implicit casts (integer / real literal comparisons for instance). I'll update the wording, thanks!

@yashmayya
Collaborator Author

Based on our offline discussion, I've also pushed a commit adding polymorphic support for the >, >=, <, <= operators in this PR itself.

@yashmayya
Collaborator Author

yashmayya commented Jul 30, 2024

There's a bug in PostAggregationHandler that is causing the query SELECT AirlineID, CASE WHEN Sum(ArrDelay) < 0 THEN 0 WHEN SUM(ArrDelay) > 0 THEN SUM(ArrDelay) END AS SumArrDelay FROM mytable GROUP BY AirlineID to fail in testHardcodedQueries with the following error:

java.lang.IllegalArgumentException: Unsupported function: less_than with argument types: [DOUBLE, STRING]
	at org.apache.pinot.core.query.postaggregation.PostAggregationFunction.<init>(PostAggregationFunction.java:45) ~[classes/:?]
	at org.apache.pinot.core.query.reduce.PostAggregationHandler$PostAggregationValueExtractor.<init>(PostAggregationHandler.java:164) ~[classes/:?]
	at org.apache.pinot.core.query.reduce.PostAggregationHandler.getValueExtractor(PostAggregationHandler.java:136) ~[classes/:?]
	at org.apache.pinot.core.query.reduce.PostAggregationHandler$PostAggregationValueExtractor.<init>(PostAggregationHandler.java:160) ~[classes/:?]
	at org.apache.pinot.core.query.reduce.PostAggregationHandler.getValueExtractor(PostAggregationHandler.java:136) ~[classes/:?]
	at org.apache.pinot.core.query.reduce.PostAggregationHandler.<init>(PostAggregationHandler.java:77) ~[classes/:?]
	at org.apache.pinot.core.query.reduce.GroupByDataTableReducer.processSingleFinalResult(GroupByDataTableReducer.java:427) ~[classes/:?]
	at org.apache.pinot.core.query.reduce.GroupByDataTableReducer.reduceAndSetResults(GroupByDataTableReducer.java:121) ~[classes/:?]
	at org.apache.pinot.core.query.reduce.BrokerReduceService.reduceOnDataTable(BrokerReduceService.java:155) ~[classes/:?]
	at org.apache.pinot.broker.requesthandler.SingleConnectionBrokerRequestHandler.processBrokerRequest(SingleConnectionBrokerRequestHandler.java:144) ~[classes/:?]
	at org.apache.pinot.broker.requesthandler.BaseSingleStageBrokerRequestHandler.handleRequest(BaseSingleStageBrokerRequestHandler.java:733) ~[classes/:?]
	at org.apache.pinot.broker.requesthandler.BaseBrokerRequestHandler.handleRequest(BaseBrokerRequestHandler.java:133) ~[classes/:?]
	at org.apache.pinot.broker.requesthandler.BrokerRequestHandlerDelegate.handleRequest(BrokerRequestHandlerDelegate.java:96) ~[classes/:?]
	at org.apache.pinot.broker.api.resources.PinotClientRequest.executeSqlQuery(PinotClientRequest.java:321) ~[classes/:?]

This is because all literals are being parsed into string literals here -

return new LiteralValueExtractor(expression.getLiteral().getStringValue());

and this wasn't taken into account when #13573 updated PostAggregationFunction to use argument type based function lookup instead of argument count based function lookup. Since this is a v1 engine entity, I've updated it to go back to argument count based function lookup for now, until the literal handling is updated either in #13673 (which adds polymorphic scalar function support to some other v1 engine entities and also has util methods to convert between the Thrift literals and the Pinot types appropriately) or in a separate followup PR.
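A rough sketch of why the lookup sees `[DOUBLE, STRING]`: once a numeric literal is extracted through its string form, its original type is no longer visible to a type-based lookup. `LiteralTypeSketch` and its methods are hypothetical and only imitate the effect; they are not the `PostAggregationHandler` code.

```java
// Hypothetical illustration of type loss when literals are extracted as strings.
class LiteralTypeSketch {
  enum ArgType { LONG, DOUBLE, STRING }

  // Infers an argument type from the runtime value, as a type-based lookup would need.
  static ArgType typeOf(Object value) {
    if (value instanceof Long || value instanceof Integer) {
      return ArgType.LONG;
    }
    if (value instanceof Double) {
      return ArgType.DOUBLE;
    }
    return ArgType.STRING;
  }

  // Mimics extracting every literal via its string value: the literal 0 becomes "0".
  static ArgType typeOfStringifiedLiteral(Object literal) {
    Object extracted = String.valueOf(literal);
    return typeOf(extracted);
  }

  // Argument types a type-based lookup would see for `SUM(ArrDelay) < 0`.
  static ArgType[] argumentTypesForLessThan(double sum, Object literal) {
    return new ArgType[] {typeOf(sum), typeOfStringifiedLiteral(literal)};
  }
}
```

A count-based lookup never inspects these types, which is presumably why the same query worked before the switch to type-based lookup.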

@yashmayya
Collaborator Author

yashmayya commented Jul 30, 2024

> Since this is a v1 engine entity, I've updated it to go back to argument count based function lookup for now, until the literal handling is updated either in #13673 (which adds polymorphic scalar function support to some other v1 engine entities and also has util methods to convert between the Thrift literals and the Pinot types appropriately) or in a separate followup PR.

The alternative would be to fall back to the double based scalar comparison function in case of heterogeneous argument types in the polymorphic scalar comparison functions and rely on the FunctionInvoker's type conversion.

I've also raised this broader point up for discussion here - #13673 (comment).

Comment on lines 40 to 41
int numArguments = argumentTypes.length;
FunctionInfo functionInfo = FunctionRegistry.lookupFunctionInfo(canonicalName, numArguments);
Contributor

Can we be sure this is not going to happen in other cases?

Collaborator Author

This doesn't seem to be an issue in the v2 engine where Calcite will ensure the matching argument types during query compilation itself. And this PostAggregationFunction seems to be the only v1 engine specific piece that was updated in #13573 to use the function lookup by argument type. In case we want to be extra safe though, we could do this - #13711 (comment).

Collaborator Author

> In case we want to be extra safe though, we could do this - #13711 (comment).

To clarify, this will ensure that argument type based function lookup will function the same as the earlier argument count based function lookup in case of heterogeneous argument types in all cases (and we can un-revert the PostAggregationFunction changes). Actually, on thinking about this a bit more, I think we should definitely do so because even if we fix the literal handling in PostAggregationHandler, we'll still run into issues with DOUBLE / INT comparisons and the like in the v1 engine.

@gortiz gortiz dismissed their stale review July 30, 2024 13:25

We found some issues. I'm not sure whether we should rush this change.

Comment on lines 50 to 45
// In case of heterogeneous argument types, fall back to double based comparison and allow FunctionInvoker to
// convert argument types for v1 engine support.
if (argumentTypes[0] != argumentTypes[1]) {
  return functionInfoForType(DataSchema.ColumnDataType.DOUBLE);
}
Collaborator Author

Updated this to maintain behavior parity.
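As a self-contained illustration of the fallback in the snippet above, the following sketch resolves mixed argument types to a double comparison and converts both arguments before comparing. `HeterogeneousFallbackSketch`, `resolveComparisonType`, and `toDouble` are hypothetical names; in particular, `toDouble` only approximates whatever conversion the `FunctionInvoker` performs.

```java
// Hypothetical sketch of falling back to a double-based comparison for mixed types.
class HeterogeneousFallbackSketch {
  enum ArgType { INT, LONG, DOUBLE, STRING }

  // Same type on both sides: use that type. Otherwise fall back to DOUBLE.
  static ArgType resolveComparisonType(ArgType left, ArgType right) {
    return left == right ? left : ArgType.DOUBLE;
  }

  // Stand-in for the invoker's argument conversion (an assumption, not Pinot's logic).
  static double toDouble(Object value) {
    if (value instanceof Number) {
      return ((Number) value).doubleValue();
    }
    return Double.parseDouble(value.toString());
  }

  static boolean lessThan(Object a, ArgType aType, Object b, ArgType bType) {
    switch (resolveComparisonType(aType, bType)) {
      case INT:
        return ((Integer) a) < ((Integer) b);
      case LONG:
        return ((Long) a) < ((Long) b);
      case STRING:
        return ((String) a).compareTo((String) b) < 0;
      default:
        // DOUBLE, including every heterogeneous combination.
        return toDouble(a) < toDouble(b);
    }
  }
}
```

With this shape, a call such as `lessThan(-3.5, ArgType.DOUBLE, "0", ArgType.STRING)` compares numerically instead of failing the lookup, while two strings are still compared lexicographically.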

Contributor

@Jackie-Jiang Jackie-Jiang left a comment

This is a big step towards full polymorphism support!

@Nullable
@Override
public PinotSqlFunction toPinotSqlFunction() {
  return new PinotSqlFunction(getName(), ReturnTypes.BOOLEAN_FORCE_NULLABLE, new SameOperandTypeChecker(2));
Contributor

I don't think we need to register a PinotSqlFunction because all the comparison functions are standard functions.

Collaborator Author

That makes sense, I'd implemented this since the PinotScalarFunction interface doesn't have a default implementation for the toPinotSqlFunction method (and I hadn't noticed that it is annotated as Nullable with null handled appropriately at the call-site in PinotOperatorTable). I've added a default implementation of the method in the interface that returns null and removed this overridden implementation.

Collaborator Author

@yashmayya yashmayya Aug 15, 2024

@Jackie-Jiang this change caused a test failure in NullHandlingIntegrationTest::testNullLiteralSelectionInV2. The query SELECT greater_than(null, 1) FROM mytable fails with:

No match found for function signature greater_than(<NULL>, <NUMERIC>)

So I guess the standard function definition doesn't tolerate nulls. I've reverted the change because in Postgres for instance, SELECT null > 1 FROM mytable is a valid query and does return null - our function implementations can also handle nulls through the FunctionInvoker's null intolerant function handling.

Contributor

I think I know the reason. greater_than cannot be used explicitly. Can you try SELECT null > 1 FROM mytable and see if it works? The test is testing some non-standard SQL behavior, and I don't know if we want to keep the old behavior. Basically when querying SELECT null > 1 FROM mytable it will match the Calcite standard SqlOperator; when querying SELECT greater_than(null, 1) FROM mytable it will match this custom SqlOperator which might have different behavior thus causing confusion.
If we want to keep the support of explicit greater_than, we can add it under PinotOperatorTable.STANDARD_OPERATORS_WITH_ALIASES. But I doubt if anyone is using it this way.

Collaborator Author

Ah, you're right, this actually isn't about nullability. Even a query like SELECT greater_than(2, 1) FROM mytable fails with the error: No match found for function signature greater_than(<NUMERIC>, <NUMERIC>) if we remove this overridden method because we wouldn't be registering the scalar function in the operator table anymore. And yeah, SELECT null > 1 FROM mytable works just fine.

> If we want to keep the support of explicit greater_than, we can add it under PinotOperatorTable.STANDARD_OPERATORS_WITH_ALIASES. But I doubt if anyone is using it this way.

Makes sense, I don't see any harm in retaining support for this explicit syntax, so I've added aliases for all these comparison operators.

@@ -704,6 +704,97 @@ public void testVariadicFunction() throws Exception {
assertEquals(jsonNode.get("numRowsResultSet").asInt(), 3);
}

@Test
public void testPolymorphicScalarComparisonFunctions() throws Exception {
// Queries written this way will trigger the PinotEvaluateLiteralRule which will call the scalar equals function
Contributor

Will SELECT ... WHERE 'test' = 'test' trigger the same rule?

Collaborator Author

Nope, in such queries the filter is simply removed even before any optimization rules are applied in this part of the SqlToRelConverter -

rootNode = converter.trimUnusedFields(false, rootNode);

Contributor

Calcite is being smart lol. Let's add a comment explaining this

*/
public abstract class PolymorphicComparisonScalarFunction implements PinotScalarFunction {

protected static final double DOUBLE_COMPARISON_TOLERANCE = 1e-7d;
Contributor

Not introduced in this PR, but I don't think this is the standard SQL behavior. I guess it might have been introduced to work around the float - double casting problem.
Should we consider fixing it to follow the standard behavior? We can keep the existing behavior in getFunctionInfo(int numArguments) for backward compatibility.

Collaborator Author

Are you suggesting using the = and != operators directly instead of doing a "fuzzy" comparison?

> We can keep the existing behavior in getFunctionInfo(int numArguments) for backward compatibility

Currently, we're simply returning the double based comparison function in this case for backward compatibility, so we'll need to create a new "fuzzy" equals / notEquals implementation and return that instead? Also, I thought we're planning to move all usages of getFunctionInfo(int numArguments) to getFunctionInfo(ColumnDataType[] argumentTypes).

Contributor

Yeah. We need this for now so that the single-stage engine keeps its existing behavior. Can you also check whether standard SQL uses exact match or fuzzy match?

Collaborator Author

@yashmayya yashmayya Aug 15, 2024

Looks like it is exact match which is why it isn't recommended to compare float / double values directly. In other standard databases users should either use some delta value explicitly themselves during comparison or else use the decimal / numeric types for exact precision comparison. I've pushed a commit with the suggested changes.
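The difference between the two behaviors discussed in this thread can be shown with a small sketch. The tolerance value matches the `DOUBLE_COMPARISON_TOLERANCE` constant quoted above, but `DoubleComparisonSketch` and the exact form of the fuzzy check (`Math.abs(a - b) < TOLERANCE`) are assumptions made for illustration.

```java
// Hypothetical sketch contrasting exact and tolerance-based double equality.
class DoubleComparisonSketch {
  // Same value as the constant quoted in the review thread.
  static final double TOLERANCE = 1e-7d;

  // Exact comparison, as in standard SQL.
  static boolean exactEquals(double a, double b) {
    return a == b;
  }

  // "Fuzzy" comparison; the formula here is an assumed form of the backward compatible behavior.
  static boolean fuzzyEquals(double a, double b) {
    return Math.abs(a - b) < TOLERANCE;
  }
}
```

For example, `exactEquals(0.1 + 0.2, 0.3)` is false while `fuzzyEquals(0.1 + 0.2, 0.3)` is true, and the same split occurs for `0.1f` widened to double versus `0.1`, which is the float / double casting case mentioned earlier in the thread.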

// Set nullable parameters to false for each function because the return value should be null if any argument
// is null
TYPE_FUNCTION_INFO_MAP.put(DataSchema.ColumnDataType.INT, new FunctionInfo(
NotEqualsScalarFunction.class.getMethod("intNotEquals", int.class, int.class),
Contributor

Is there a way to directly access a method without reflection?

Collaborator Author

Hmm, the FunctionInfo class is using java.lang.reflect.Method to store the method reference and this is invoked directly in FunctionInvoker via Method::invoke, so I don't think it's possible to avoid using reflection here. Although since this is in a static initializer block that will only be executed once in a JVM when each of these classes is initialized, it shouldn't be much of a performance concern, right?

Contributor

Yes, performance is not a concern. I also didn't find a better way, but I do want to see if it is possible to improve readability by directly referencing the method :-P
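For readers unfamiliar with the pattern under discussion, here is a minimal sketch of resolving `java.lang.reflect.Method` objects once in a static initializer and invoking them later. `ReflectiveRegistrySketch`, its string keys, and its `invoke` helper are hypothetical; only `Class.getMethod` and `Method.invoke` are real JDK calls.

```java
import java.lang.reflect.Method;
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch: methods are looked up by name once, then invoked reflectively.
class ReflectiveRegistrySketch {
  private static final Map<String, Method> METHODS = new HashMap<>();

  static {
    try {
      // The lookup cost is paid once, when the class is initialized.
      METHODS.put("INT", ReflectiveRegistrySketch.class.getMethod("intNotEquals", int.class, int.class));
      METHODS.put("DOUBLE",
          ReflectiveRegistrySketch.class.getMethod("doubleNotEquals", double.class, double.class));
    } catch (NoSuchMethodException e) {
      // A typo in the method name only surfaces here, at class initialization.
      throw new IllegalStateException(e);
    }
  }

  public static boolean intNotEquals(int a, int b) {
    return a != b;
  }

  public static boolean doubleNotEquals(double a, double b) {
    return a != b;
  }

  // Static methods are invoked with a null receiver; boxed arguments are unboxed by invoke.
  static Object invoke(String type, Object... args) {
    try {
      return METHODS.get(type).invoke(null, args);
    } catch (ReflectiveOperationException e) {
      throw new IllegalStateException(e);
    }
  }
}
```

A Java method reference such as `ReflectiveRegistrySketch::intNotEquals` is checked by the compiler, but it produces a functional interface instance rather than a `java.lang.reflect.Method`, so it would not fit an API that stores `Method` objects without further changes.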

@yashmayya yashmayya force-pushed the polymorphic-scalar-comparison-functions branch from dd9f138 to a605b3b Compare August 14, 2024 15:50
Collaborator Author

@yashmayya yashmayya left a comment

Thanks for the review Jackie!

…Function; remove overridden method in PolymorphicComparisonScalarFunction"

This reverts commit 2a35d83.
…with tolerance in getFunctionInfo(int numArguments) for backward compatibility
…; remove overridden method in PolymorphicComparisonScalarFunction; add aliases for standard comparison operators in operator table

This reverts commit 0ca3257.
Contributor

@Jackie-Jiang Jackie-Jiang left a comment

Great job!

@Jackie-Jiang Jackie-Jiang merged commit d0e041c into apache:master Aug 19, 2024
19 of 20 checks passed

// In case of heterogeneous argument types, fall back to double based comparison and allow FunctionInvoker to
// convert argument types for v1 engine support.
if (argumentTypes[0] != argumentTypes[1]) {
Contributor

Should we compare the stored type here?

Collaborator Author

I had initially decided against that because I wasn't sure we wanted to allow for things like direct timestamp <-> long comparisons. But thinking about it a little more, that should anyway be handled at a different (higher) layer than this. So we can definitely update this check to compare the stored types here.
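To make the "stored type" suggestion concrete, the sketch below compares the underlying storage types rather than the logical types, so a timestamp and a long resolve to the same comparison. `StoredTypeSketch` is hypothetical, and the mapping shown (TIMESTAMP stored as LONG, BOOLEAN stored as INT) is an assumption about how logical types are stored, used only for illustration.

```java
// Hypothetical sketch of comparing stored types instead of logical types.
class StoredTypeSketch {
  enum ArgType {
    INT, LONG, DOUBLE, STRING,
    // Logical types backed by another stored type (assumed mapping).
    BOOLEAN(INT), TIMESTAMP(LONG);

    private final ArgType _storedType;

    ArgType() {
      _storedType = null;
    }

    ArgType(ArgType storedType) {
      _storedType = storedType;
    }

    // A type without an explicit backing type is stored as itself.
    ArgType getStoredType() {
      return _storedType == null ? this : _storedType;
    }
  }

  // Matching stored types use that type's comparison; anything else falls back to DOUBLE.
  static ArgType resolveComparisonType(ArgType left, ArgType right) {
    ArgType leftStored = left.getStoredType();
    ArgType rightStored = right.getStoredType();
    return leftStored == rightStored ? leftStored : ArgType.DOUBLE;
  }
}
```

Under this check, `resolveComparisonType(ArgType.TIMESTAMP, ArgType.LONG)` yields `LONG` instead of the double fallback; whether such a comparison should be allowed at all is left to a higher layer, as the reply above notes.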

// In case of heterogeneous argument types, fall back to double based comparison and allow FunctionInvoker to
// convert argument types for v1 engine support.
if (argumentTypes[0] != argumentTypes[1]) {
  return functionInfoForType(ColumnDataType.DOUBLE);
Contributor

Should we use the backward compatible method here (with tolerance)?

Collaborator Author

Yeah, I think that makes sense since the point of this was to maintain backward compatibility.
