-
Notifications
You must be signed in to change notification settings - Fork 1.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Funnel Count - Multiple Strategies (no partitioning requisites) #11092
Conversation
Codecov Report
@@ Coverage Diff @@
## master #11092 +/- ##
==========================================
Coverage 61.46% 61.46%
+ Complexity 6514 6513 -1
==========================================
Files 2233 2249 +16
Lines 120144 120369 +225
Branches 18234 18253 +19
==========================================
+ Hits 73848 73990 +142
- Misses 40882 40950 +68
- Partials 5414 5429 +15
Flags with carried forward coverage won't be shown. Click here to find out more.
... and 11 files with indirect coverage changes 📣 We’re building smart automated test selection to slash your CI/CD build times. Learn more |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Cool feature!
Left a couple minor comments, lgtm otherwise.
_sortedAggregationStrategy = new SortedAggregationStrategy(); | ||
|
||
final List<String> settings = Option.SETTINGS.getLiterals(expressions); | ||
Setting.validate(settings); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we validate only one of the strategies are selected and return an error otherwise?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Some combinations are valid. For example, one could ask for partitioned
together with sorted
or together with tetha_sketch
, or actually ask for all three (would fall-back to theta_sketch
instead of the default bitmap
strategy when a segment is not sorted).
But we could indeed check for invalid combinations instead of just prioritising them.
if (_partitionSetting) { | ||
return _partitionedMergeStrategy; | ||
} | ||
if (_thetaSketchSetting) { | ||
return _thetaSketchMergeStrategy; | ||
} | ||
if (_setSetting) { | ||
return _setMergeStrategy; | ||
} | ||
// default | ||
return _bitmapMergeStrategy; | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This selection logic seems to be duplicated in multiple places. Can we centralize it to calculate merge/result/agg strategy in one place?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The main challenge is that the strategy is somewhat dynamic, the first two strategies depend on whether the segment is actually sorted or not. For example the current open segment will not be sorted, only closed segments will.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Some tests are failing -- please check
*/ | ||
public class FunnelCountAggregationFunction implements AggregationFunction<List<Long>, LongArrayList> { | ||
public class FunnelCountAggregationFunction implements AggregationFunction<Object, LongArrayList> { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We are losing some type specification here by moving to Object. Is it possible to be creating an abstract type specific to our functions, and use it here>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, unfortunately java has no support for union types (well, only in exception catch clauses). I can create a wrapper if you think that helps, as each strategy is effectively using a different underlying aggregation type.
final AggregationStrategy _bitmapAggregationStrategy; | ||
final AggregationStrategy _sortedAggregationStrategy; | ||
|
||
final MergeStrategy _thetaSketchMergeStrategy; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit: Can this be moved to a child class for better readability?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You mean the strategy building/selection logic? I can move it out, yes.
int length = a.size(); | ||
Preconditions.checkState(length == b.size(), "The two operand arrays are not of the same size! provided %s, %s", | ||
length, b.size()); | ||
public Object merge(Object a, Object b) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can this be an abstract type, known to this class and MergeStrategy?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It can be a wrapper type, but it would not be known to the underlying merge strategy, I don't think, as each uses a different aggregation type.
I personally think that the abstraction is unnecessary and will create unnecessary garbage collection burden.
Note also that although the AggregationFunction
interface has a type parameter for the intermediate result, everything outside just uses Object
, see for example IndexedTable
.
I considered to propagate the type further up to avoid the use of Object
here, using a generic type instead, moving some of the strategy selection to a separate aggregation function factory class.
There are two main challenges in that approach: (1) segments might be unsorted, making strategy selection dynamic. (2) currently it supports quite a few combinations of strategies (but I can probably make it more dumb and support specific combinations for the sake of readability/maintainability).
private Dictionary getDictionary(Map<ExpressionContext, BlockValSet> blockValSetMap) { | ||
final Dictionary primaryCorrelationDictionary = blockValSetMap.get(_primaryCorrelationCol).getDictionary(); | ||
Preconditions.checkArgument(primaryCorrelationDictionary != null, | ||
"CORRELATE_BY column in FUNNELCOUNT aggregation function not supported, please use a dictionary encoded " |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this a temporary limitation?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It is mentioned as a limitation in the documentation for the function (linked above in this PR). I personally have no plans to remove this limitation, as it would be rare to correlate by something other than an actual column. Someone else in the community could obviously contribute the necessary changes, but I think it is a fair limitation for a funnel analytics function. I might work in the future on supporting a secondary set of correlations though, in addition to a primary correlation (eg. correlate by user id + order id). Depending on how that is implemented perhaps I could remove the limitation.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It is mentioned as a limitation in the documentation for the function (linked above in this PR). I personally have no plans to remove this limitation, as it would be rare to correlate by something other than an actual column. Someone else in the community could obviously contribute the necessary changes, but I think it is a fair limitation for a funnel analytics function. I might work in the future on supporting a secondary set of correlations though, in addition to a primary correlation (eg. correlate by user id + order id). Depending on how that is implemented perhaps I could remove the limitation.
return _bitmapPartitionedResultExtractionStrategy; | ||
} | ||
if (_thetaSketchSetting) { | ||
return _thetaSketchResultExtractionStrategy; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This IMO looks a bit scary. Are the existing tests exercising the code?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There are 5 tests included with this PR exercising for each strategy both aggregation only and aggregation with group-by:
- FunnelCountQueriesBitmapTest
- FunnelCountQueriesPartitionedSortedTest
- FunnelCountQueriesPartitionedTest
- FunnelCountQueriesSetTest
- FunnelCountQueriesThetaSketchTest
|
||
enum Setting { | ||
SET("set"), | ||
BITMAP("bitmap"), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We could use this upstream to also store the MergeStrategy instance instead of having all present in the class?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As I say above, a challenge is that of sorted vs unsorted segments. But I can probably reduce the runtime choice to just two strategies: sorted and unsorted strategy; with the unsorted one being resolved at construction time depending on the strategy settings given.
I believe these are flaky tests unrelated to this PR, is there a way to re-run the failed tests? |
7463654
to
527eb4f
Compare
@atris : I have refactored the PR as discussed
All previous tests for the different strategies remain untouched and continue to pass after the refactoring. Please re-review the PR. |
@atris - Did you have a chance to review the PR after the refactoring? |
Thanks for the contribution! Can you help update the pinot doc for this feature? |
…he#11092) * FUNNEL_COUNT - aggregation strategies * FUNNEL_COUNT - Aggregation Strategies Tests * FUNNEL_COUNT - Aggregation Strategies Tests * Refactor: Move strategy greation into a factory, make funnel count aggregation function parametric * Add license headers * Simplify factory by postponing strategy construction and templetizing sorted split * Fix linter errors --------- Co-authored-by: Dario Liberman <[email protected]>
PR for #10866
This PR adds the remaining funnel count aggregation strategies documented in docs.
In particular, introduces strategies that do not require or make assumptions regarding partitioning configuration:
These have the same characteristics as the respective distinct count aggregation functions:
These complement the already present aggregation strategies:
The first corresponding to SEGMENTPARTITIONEDDISTINCTCOUNT. The latter has no equivalent (tho GAPFILL has a somewhat similar aggregation optimisation for sorted rows).
In order to select the strategy, the user just needs to indicate the desired strategy as a setting parameter, as documented in the link above, for example: