-
Notifications
You must be signed in to change notification settings - Fork 1.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Support polymorphic scalar comparison functions in the multi-stage query engine #13711
Support polymorphic scalar comparison functions in the multi-stage query engine #13711
Conversation
Codecov ReportAttention: Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## master #13711 +/- ##
============================================
+ Coverage 61.75% 61.97% +0.22%
+ Complexity 207 198 -9
============================================
Files 2436 2567 +131
Lines 133233 141691 +8458
Branches 20636 22015 +1379
============================================
+ Hits 82274 87814 +5540
- Misses 44911 47197 +2286
- Partials 6048 6680 +632
Flags with carried forward coverage won't be shown. Click here to find out more. ☔ View full report in Codecov by Sentry. |
AFAIU that is not 100% correct. Calcite will add implicit casts in case they can be used. |
Yeah, that's a valid point, but in this case it'll still mean that the scalar comparison functions should only be called with arguments of the same type. Also, I just checked and even Postgres does use some implicit casts (integer / real literal comparisons for instance). I'll update the wording, thanks! |
Based on our offline discussion, I've also pushed a commit adding polymorphic support for the |
There's a bug in
This is because all literals are being parsed into string literals here - pinot/pinot-core/src/main/java/org/apache/pinot/core/query/reduce/PostAggregationHandler.java Line 111 in 835949b
and this wasn't taken into account when #13573 updated |
The alternative would be to fall back to the I've also raised this broader point up for discussion here - #13673 (comment). |
...-core/src/main/java/org/apache/pinot/core/query/postaggregation/PostAggregationFunction.java
Outdated
Show resolved
Hide resolved
int numArguments = argumentTypes.length; | ||
FunctionInfo functionInfo = FunctionRegistry.lookupFunctionInfo(canonicalName, numArguments); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we be sure this is not going to happen in other cases?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This doesn't seem to be an issue in the v2 engine where Calcite will ensure the matching argument types during query compilation itself. And this PostAggregationFunction
seems to be the only v1 engine specific piece that was updated in #13573 to use the function lookup by argument type. In case we want to be extra safe though, we could do this - #13711 (comment).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In case we want to be extra safe though, we could do this - #13711 (comment).
To clarify, this will ensure that argument type based function lookup will function the same as the earlier argument count based function lookup in case of heterogeneous argument types in all cases (and we can un-revert the PostAggregationFunction
changes). Actually, on thinking about this a bit more, I think we should definitely do so because even if we fix the literal handling in PostAggregationHandler
, we'll still run into issues with DOUBLE
/ INT
comparisons and the like in the v1 engine.
We found some issues. I'm not sure whether we should rush this change
// In case of heterogeneous argument types, fall back to double based comparison and allow FunctionInvoker to | ||
// convert argument types for v1 engine support. | ||
if (argumentTypes[0] != argumentTypes[1]) { | ||
return functionInfoForType(DataSchema.ColumnDataType.DOUBLE); | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Updated this to maintain behavior parity.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a big step towards full polymorphism support!
.../org/apache/pinot/common/function/scalar/comparison/PolymorphicComparisonScalarFunction.java
Outdated
Show resolved
Hide resolved
@Nullable | ||
@Override | ||
public PinotSqlFunction toPinotSqlFunction() { | ||
return new PinotSqlFunction(getName(), ReturnTypes.BOOLEAN_FORCE_NULLABLE, new SameOperandTypeChecker(2)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think we need to register a PinotSqlFunction
because all the comparison functions are standard function
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That makes sense, I'd implemented this since the PinotScalarFunction
interface doesn't have a default implementation for the toPinotSqlFunction
method (and I hadn't noticed that it is annotated as Nullable
with null
handled appropriately at the call-site in PinotOperatorTable
). I've added a default implementation of the method in the interface that returns null
and removed this overridden implementation.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@Jackie-Jiang this change caused a test failure in NullHandlingIntegrationTest::testNullLiteralSelectionInV2
. The query SELECT greater_than(null, 1) FROM mytable
fails with:
No match found for function signature greater_than(<NULL>, <NUMERIC>)
So I guess the standard function definition doesn't tolerate nulls. I've reverted the change because in Postgres for instance, SELECT null > 1 FROM mytable
is a valid query and does return null
- our function implementations can also handle nulls through the FunctionInvoker
's null intolerant function handling.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think I know the reason. greater_than
cannot be used explicitly. Can you try SELECT null > 1 FROM mytable
and see if it works? The test is testing some non-standard SQL behavior, and I don't know if we want to keep the old behavior. Basically when querying SELECT null > 1 FROM mytable
it will match the Calcite standard SqlOperator
; when querying SELECT greater_than(null, 1) FROM mytable
it will match this custom SqlOperator
which might have different behavior thus causing confusion.
If we want to keep the support of explicit greater_than
, we can add it under PinotOperatorTable.STANDARD_OPERATORS_WITH_ALIASES
. But I doubt if anyone is using it this way.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah, you're right, this actually isn't about nullability. Even a query like SELECT greater_than(2, 1) FROM mytable
fails with the error: No match found for function signature greater_than(<NUMERIC>, <NUMERIC>)
if we remove this overridden method because we wouldn't be registering the scalar function in the operator table anymore. And yeah, SELECT null > 1 FROM mytable
works just fine.
If we want to keep the support of explicit greater_than, we can add it under PinotOperatorTable.STANDARD_OPERATORS_WITH_ALIASES. But I doubt if anyone is using it this way.
Makes sense, I don't see any harm in retaining support for this explicit syntax, so I've added aliases for all these comparison operators.
@@ -704,6 +704,97 @@ public void testVariadicFunction() throws Exception { | |||
assertEquals(jsonNode.get("numRowsResultSet").asInt(), 3); | |||
} | |||
|
|||
@Test | |||
public void testPolymorphicScalarComparisonFunctions() throws Exception { | |||
// Queries written this way will trigger the PinotEvaluateLiteralRule which will call the scalar equals function |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Will SELECT ... WHERE 'test' = 'test'
trigger the same rule?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nope, in such queries the filter is simply removed even before any optimization rules are applied in this part of the SqlToRelConverter
-
pinot/pinot-query-planner/src/main/java/org/apache/pinot/query/QueryEnvironment.java
Line 294 in 14e51e8
rootNode = converter.trimUnusedFields(false, rootNode); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Calcite is being smart lol. Let's add a comment explaining this
...-tests/src/test/java/org/apache/pinot/integration/tests/MultiStageEngineIntegrationTest.java
Outdated
Show resolved
Hide resolved
*/ | ||
public abstract class PolymorphicComparisonScalarFunction implements PinotScalarFunction { | ||
|
||
protected static final double DOUBLE_COMPARISON_TOLERANCE = 1e-7d; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not introduced in this PR, but I don't think this is the standard SQL behavior. I guess it might be introduced to workaround the float - double casting problem.
Should we consider fixing it to follow the standard behavior? We can keep the existing behavior in getFunctionInfo(int numArguments)
for backward compatibility.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Are you suggesting using the =
and !=
operators directly instead of doing a "fuzzy" comparison?
We can keep the existing behavior in
getFunctionInfo(int numArguments)
for backward compatibility
Currently, we're simply returning the double
based comparison function in this case for backward compatibility, so we'll need to create a new "fuzzy" equals / notEquals implementation and return that instead? Also I thought we're planning to move all usages of getFunctionInfo(int numArguments)
to getFunctionInfo(ColumnDataType[] argumentTypes)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah. We need this for now so that single-stage engine keeps existing behavior. Can you also check if standard SQL uses exact match or fuzzy match?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks like it is exact match which is why it isn't recommended to compare float / double values directly. In other standard databases users should either use some delta value explicitly themselves during comparison or else use the decimal / numeric types for exact precision comparison. I've pushed a commit with the suggested changes.
// Set nullable parameters to false for each function because the return value should be null if any argument | ||
// is null | ||
TYPE_FUNCTION_INFO_MAP.put(DataSchema.ColumnDataType.INT, new FunctionInfo( | ||
NotEqualsScalarFunction.class.getMethod("intNotEquals", int.class, int.class), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is there a way to directly access a method without reflection?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm, the FunctionInfo
class is using java.lang.reflect.Method
to store the method reference and this is invoked directly in FunctionInvoker
via Method::invoke
, so I don't think it's possible to avoid using reflection here. Although since this is in a static initializer block that will only be executed once in a JVM when each of these classes is initialized, it shouldn't be much of a performance concern right?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, performance is not a concern. I also didn't find a better way, but I do want to see if it is possible to improve readability by directly referencing the method :-P
…ction lookup rather than argument type based function lookup
…parison in case of heterogeneous argument types
…; remove overridden method in PolymorphicComparisonScalarFunction
dd9f138
to
a605b3b
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the review Jackie!
@@ -704,6 +704,97 @@ public void testVariadicFunction() throws Exception { | |||
assertEquals(jsonNode.get("numRowsResultSet").asInt(), 3); | |||
} | |||
|
|||
@Test | |||
public void testPolymorphicScalarComparisonFunctions() throws Exception { | |||
// Queries written this way will trigger the PinotEvaluateLiteralRule which will call the scalar equals function |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nope, in such queries the filter is simply removed even before any optimization rules are applied in this part of the SqlToRelConverter
-
pinot/pinot-query-planner/src/main/java/org/apache/pinot/query/QueryEnvironment.java
Line 294 in 14e51e8
rootNode = converter.trimUnusedFields(false, rootNode); |
@Nullable | ||
@Override | ||
public PinotSqlFunction toPinotSqlFunction() { | ||
return new PinotSqlFunction(getName(), ReturnTypes.BOOLEAN_FORCE_NULLABLE, new SameOperandTypeChecker(2)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That makes sense, I'd implemented this since the PinotScalarFunction
interface doesn't have a default implementation for the toPinotSqlFunction
method (and I hadn't noticed that it is annotated as Nullable
with null
handled appropriately at the call-site in PinotOperatorTable
). I've added a default implementation of the method in the interface that returns null
and removed this overridden implementation.
.../org/apache/pinot/common/function/scalar/comparison/PolymorphicComparisonScalarFunction.java
Outdated
Show resolved
Hide resolved
...-tests/src/test/java/org/apache/pinot/integration/tests/MultiStageEngineIntegrationTest.java
Outdated
Show resolved
Hide resolved
// Set nullable parameters to false for each function because the return value should be null if any argument | ||
// is null | ||
TYPE_FUNCTION_INFO_MAP.put(DataSchema.ColumnDataType.INT, new FunctionInfo( | ||
NotEqualsScalarFunction.class.getMethod("intNotEquals", int.class, int.class), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm, the FunctionInfo
class is using java.lang.reflect.Method
to store the method reference and this is invoked directly in FunctionInvoker
via Method::invoke
, so I don't think it's possible to avoid using reflection here. Although since this is in a static initializer block that will only be executed once in a JVM when each of these classes is initialized, it shouldn't be much of a performance concern right?
*/ | ||
public abstract class PolymorphicComparisonScalarFunction implements PinotScalarFunction { | ||
|
||
protected static final double DOUBLE_COMPARISON_TOLERANCE = 1e-7d; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Are you suggesting using the =
and !=
operators directly instead of doing a "fuzzy" comparison?
We can keep the existing behavior in
getFunctionInfo(int numArguments)
for backward compatibility
Currently, we're simply returning the double
based comparison function in this case for backward compatibility, so we'll need to create a new "fuzzy" equals / notEquals implementation and return that instead? Also I thought we're planning to move all usages of getFunctionInfo(int numArguments)
to getFunctionInfo(ColumnDataType[] argumentTypes)
…Function; remove overridden method in PolymorphicComparisonScalarFunction" This reverts commit 2a35d83.
…y a simpler query can't be used
…with tolerance in getFunctionInfo(int numArguments) for backward compatibility
…; remove overridden method in PolymorphicComparisonScalarFunction; add aliases for standard comparison operators in operator table This reverts commit 0ca3257.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Great job!
|
||
// In case of heterogeneous argument types, fall back to double based comparison and allow FunctionInvoker to | ||
// convert argument types for v1 engine support. | ||
if (argumentTypes[0] != argumentTypes[1]) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we compare the stored type here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I had initially decided against that because I wasn't sure we wanted to allow for things like direct timestamp <-> long comparisons. But thinking about it a little more, that should anyway be handled at a different (higher) layer than this. So we can definitely update this check to compare the stored types here.
// In case of heterogeneous argument types, fall back to double based comparison and allow FunctionInvoker to | ||
// convert argument types for v1 engine support. | ||
if (argumentTypes[0] != argumentTypes[1]) { | ||
return functionInfoForType(ColumnDataType.DOUBLE); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we use the backward compatible method here (with tolerance)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, I think that makes sense since the point of this was to maintain backward compatibility.
=
,!=
,>
,>=
,<
,<=
which are probably the most useful ones.double
based scalar comparison functions when looking up by argument number for backward compatibility.null
if any argument isnull
. Again, this matches Postgres behavior.