Funnel Count - Multiple Strategies (no partitioning requisites) #11092

Merged
merged 8 commits into from
Aug 30, 2023

Conversation

dario-liberman
Contributor

@dario-liberman dario-liberman commented Jul 12, 2023

PR for #10866

This PR adds the remaining funnel count aggregation strategies documented in docs.

In particular, it introduces strategies that neither require nor make assumptions about the partitioning configuration:

  • theta_sketch
  • bitmap
  • set

These have the same characteristics as the respective distinct count aggregation functions.

These complement the already present aggregation strategies:

  • partitioned
  • partitioned, sorted

The first corresponds to SEGMENTPARTITIONEDDISTINCTCOUNT. The latter has no equivalent (though GAPFILL has a somewhat similar aggregation optimisation for sorted rows).

In order to select the strategy, the user just needs to indicate the desired strategy as a setting parameter, as documented in the link above, for example:

select 
  FUNNEL_COUNT(
    STEPS(
      url = '/cart/add', 
      url = '/checkout/start', 
      url = '/checkout/confirmation'),
    CORRELATE_BY(user_id),
    SETTINGS('theta_sketch', 'nominalEntries=4096')
  ) AS counts
from user_log 
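
To make the semantics of the query above concrete, here is a minimal sketch of what a set-based funnel count could compute, assuming each step's count is the number of distinct correlation ids that matched that step and every earlier step. The class and method names are hypothetical; this is an illustration, not Pinot's implementation, and it ignores event order.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Hypothetical illustration of a set-based funnel count; not Pinot code. */
public class FunnelCountSketch {

  /**
   * Returns one count per step: the number of distinct correlation ids that
   * matched this step and every preceding step. Event order is ignored here.
   */
  public static long[] funnelCount(long[] userIds, String[] urls, List<String> stepUrls) {
    // Aggregation: collect one set of correlation ids per step.
    List<Set<Long>> stepSets = new ArrayList<>();
    for (int i = 0; i < stepUrls.size(); i++) {
      stepSets.add(new HashSet<>());
    }
    for (int row = 0; row < userIds.length; row++) {
      for (int step = 0; step < stepUrls.size(); step++) {
        if (stepUrls.get(step).equals(urls[row])) {
          stepSets.get(step).add(userIds[row]);
        }
      }
    }
    // Result extraction: intersect each step with the running intersection.
    long[] counts = new long[stepUrls.size()];
    Set<Long> running = new HashSet<>();
    for (int step = 0; step < stepSets.size(); step++) {
      if (step == 0) {
        running.addAll(stepSets.get(0));
      } else {
        running.retainAll(stepSets.get(step));
      }
      counts[step] = running.size();
    }
    return counts;
  }
}
```

With rows for users 1 and 2 adding to the cart, users 1 and 3 starting checkout, and only user 2 confirming, the sketch returns one count per step, and each count can only shrink or stay the same as the funnel narrows.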

@codecov-commenter

codecov-commenter commented Jul 12, 2023

Codecov Report

Merging #11092 (7e3ae78) into master (575398d) will increase coverage by 0.00%.
The diff coverage is 75.25%.

@@            Coverage Diff             @@
##             master   #11092    +/-   ##
==========================================
  Coverage     61.46%   61.46%            
+ Complexity     6514     6513     -1     
==========================================
  Files          2233     2249    +16     
  Lines        120144   120369   +225     
  Branches      18234    18253    +19     
==========================================
+ Hits          73848    73990   +142     
- Misses        40882    40950    +68     
- Partials       5414     5429    +15     
Flag Coverage Δ
integration1 0.00% <0.00%> (ø)
integration2 0.00% <0.00%> (ø)
java-11 61.45% <75.25%> (+<0.01%) ⬆️
java-17 61.32% <75.25%> (+0.02%) ⬆️
java-20 61.32% <75.25%> (-0.01%) ⬇️
temurin 61.46% <75.25%> (+<0.01%) ⬆️
unittests1 66.97% <75.25%> (-0.02%) ⬇️
unittests2 14.56% <0.00%> (-0.03%) ⬇️

Files Changed Coverage Δ
...n/function/funnel/SetResultExtractionStrategy.java 45.45% <45.45%> (ø)
...unction/funnel/ThetaSketchAggregationStrategy.java 50.00% <50.00%> (ø)
...n/funnel/FunnelCountSortedAggregationFunction.java 53.57% <53.57%> (ø)
...unction/funnel/BitmapResultExtractionStrategy.java 58.33% <58.33%> (ø)
...unction/funnel/FunnelCountAggregationFunction.java 65.11% <65.11%> (ø)
...on/funnel/ThetaSketchResultExtractionStrategy.java 71.42% <71.42%> (ø)
...gregation/function/funnel/AggregationStrategy.java 77.77% <77.77%> (ø)
.../funnel/FunnelCountAggregationFunctionFactory.java 82.60% <82.60%> (ø)
...gregation/function/AggregationFunctionFactory.java 81.35% <100.00%> (ø)
...ion/function/funnel/BitmapAggregationStrategy.java 100.00% <100.00%> (ø)
... and 8 more

... and 11 files with indirect coverage changes

Contributor

@cbalci cbalci left a comment

Cool feature!
Left a couple minor comments, lgtm otherwise.

_sortedAggregationStrategy = new SortedAggregationStrategy();

final List<String> settings = Option.SETTINGS.getLiterals(expressions);
Setting.validate(settings);
Contributor

Should we validate that only one of the strategies is selected, and return an error otherwise?

Contributor Author

@dario-liberman dario-liberman Jul 24, 2023

Some combinations are valid. For example, one could ask for partitioned together with sorted or together with theta_sketch, or actually ask for all three (which would fall back to theta_sketch instead of the default bitmap strategy when a segment is not sorted).
But we could indeed check for invalid combinations instead of just prioritising them.
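
A check for invalid combinations could look like the following sketch. It assumes, based on the comment above, that partitioned, sorted and theta_sketch may be combined, and it treats asking for more than one of theta_sketch, bitmap and set as a conflict. That conflict rule and all names here are illustrative assumptions, not the PR's actual `Setting.validate`.

```java
import java.util.EnumSet;
import java.util.List;
import java.util.Locale;

/** Hypothetical settings validator; not the PR's implementation. */
public class FunnelSettingsValidator {

  public enum Strategy { PARTITIONED, SORTED, THETA_SKETCH, BITMAP, SET }

  // Assumption: these three each pick the underlying distinct-count structure,
  // so at most one of them may be requested in a single query.
  private static final EnumSet<Strategy> EXCLUSIVE =
      EnumSet.of(Strategy.THETA_SKETCH, Strategy.BITMAP, Strategy.SET);

  public static EnumSet<Strategy> parse(List<String> settings) {
    EnumSet<Strategy> selected = EnumSet.noneOf(Strategy.class);
    for (String setting : settings) {
      if (setting.contains("=")) {
        continue; // strategy parameter such as nominalEntries=4096
      }
      // valueOf throws IllegalArgumentException for unknown strategy names.
      selected.add(Strategy.valueOf(setting.trim().toUpperCase(Locale.ROOT)));
    }
    EnumSet<Strategy> conflicting = EnumSet.copyOf(selected);
    conflicting.retainAll(EXCLUSIVE);
    if (conflicting.size() > 1) {
      throw new IllegalArgumentException("Conflicting funnel count strategies: " + conflicting);
    }
    return selected;
  }
}
```

Rejecting conflicts up front, rather than silently prioritising one strategy, would surface a mistyped or contradictory `SETTINGS(...)` clause as an error at query planning time.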

Comment on lines 316 to 217
  if (_partitionSetting) {
    return _partitionedMergeStrategy;
  }
  if (_thetaSketchSetting) {
    return _thetaSketchMergeStrategy;
  }
  if (_setSetting) {
    return _setMergeStrategy;
  }
  // default
  return _bitmapMergeStrategy;
}
Contributor

@cbalci cbalci Jul 24, 2023

This selection logic seems to be duplicated in multiple places. Can we centralize it to calculate merge/result/agg strategy in one place?

Contributor Author

The main challenge is that the strategy is somewhat dynamic: the first two strategies depend on whether the segment is actually sorted or not. For example, the current open segment will not be sorted; only closed segments will.

@atris atris self-assigned this Jul 24, 2023
Contributor

@atris atris left a comment

Some tests are failing -- please check

 */
-public class FunnelCountAggregationFunction implements AggregationFunction<List<Long>, LongArrayList> {
+public class FunnelCountAggregationFunction implements AggregationFunction<Object, LongArrayList> {
Contributor

We are losing some type specification here by moving to Object. Is it possible to create an abstract type specific to our functions and use it here?

Contributor Author

Yes, unfortunately Java has no support for union types (well, only in exception catch clauses). I can create a wrapper if you think that helps, as each strategy effectively uses a different underlying aggregation type.

final AggregationStrategy _bitmapAggregationStrategy;
final AggregationStrategy _sortedAggregationStrategy;

final MergeStrategy _thetaSketchMergeStrategy;
Contributor

Nit: Can this be moved to a child class for better readability?

Contributor Author

You mean the strategy building/selection logic? I can move it out, yes.

int length = a.size();
Preconditions.checkState(length == b.size(), "The two operand arrays are not of the same size! provided %s, %s",
length, b.size());
public Object merge(Object a, Object b) {
Contributor

Can this be an abstract type, known to this class and MergeStrategy?

Contributor Author

It can be a wrapper type, but I don't think it would be known to the underlying merge strategy, as each uses a different aggregation type.
I personally think that the abstraction is unnecessary and would add garbage collection overhead.

Note also that although the AggregationFunction interface has a type parameter for the intermediate result, everything outside just uses Object, see for example IndexedTable.

I considered propagating the type further up to avoid the use of Object here, using a generic type instead and moving some of the strategy selection to a separate aggregation function factory class.
There are two main challenges in that approach: (1) segments might be unsorted, making strategy selection dynamic; (2) it currently supports quite a few combinations of strategies (though I could probably simplify it to support only specific combinations for the sake of readability and maintainability).
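
The generic-type alternative mentioned above could be sketched as follows: the function is parameterized by its intermediate type, and a factory is the only place that binds the concrete type. The classes and names are hypothetical simplifications (Pinot's real `AggregationFunction` interface has many more methods), and the single summed count for the partitioned case is an assumption made to keep the example short.

```java
import java.util.HashSet;
import java.util.Set;
import java.util.function.BinaryOperator;
import java.util.function.ToLongFunction;

/** Hypothetical sketch of a funnel function typed by its intermediate result. */
public class TypedFunnelSketch {

  /** Heavily reduced stand-in for an aggregation function with intermediate type I. */
  public static final class TypedFunction<I> {
    private final BinaryOperator<I> _merge;
    private final ToLongFunction<I> _extract;

    TypedFunction(BinaryOperator<I> merge, ToLongFunction<I> extract) {
      _merge = merge;
      _extract = extract;
    }

    public I merge(I a, I b) {
      return _merge.apply(a, b);
    }

    public long extract(I intermediate) {
      return _extract.applyAsLong(intermediate);
    }
  }

  /** Set strategy: intermediate results are id sets, merged by union. */
  public static TypedFunction<Set<Long>> setBased() {
    return new TypedFunction<>((a, b) -> {
      Set<Long> union = new HashSet<>(a);
      union.addAll(b);
      return union;
    }, set -> set.size());
  }

  /** Partitioned strategy (assumed): ids never span segments, so counts are summed. */
  public static TypedFunction<Long> partitioned() {
    return new TypedFunction<>(Long::sum, Long::longValue);
  }
}
```

Callers that hold a `TypedFunction<Set<Long>>` cannot accidentally merge it with a `Long` intermediate, which is the type safety that is lost when every strategy is exposed as `Object`.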

private Dictionary getDictionary(Map<ExpressionContext, BlockValSet> blockValSetMap) {
final Dictionary primaryCorrelationDictionary = blockValSetMap.get(_primaryCorrelationCol).getDictionary();
Preconditions.checkArgument(primaryCorrelationDictionary != null,
"CORRELATE_BY column in FUNNELCOUNT aggregation function not supported, please use a dictionary encoded "
Contributor

Is this a temporary limitation?

Contributor Author

It is mentioned as a limitation in the documentation for the function (linked above in this PR). I personally have no plans to remove this limitation, as it would be rare to correlate by something other than an actual column. Someone else in the community could obviously contribute the necessary changes, but I think it is a fair limitation for a funnel analytics function. I might work in the future on supporting a secondary set of correlations though, in addition to a primary correlation (e.g. correlate by user id + order id). Depending on how that is implemented, perhaps I could remove the limitation.

return _bitmapPartitionedResultExtractionStrategy;
}
if (_thetaSketchSetting) {
return _thetaSketchResultExtractionStrategy;
Contributor

This IMO looks a bit scary. Are the existing tests exercising the code?

Contributor Author

This PR includes 5 tests, one per strategy, each exercising both aggregation-only queries and aggregation with group-by:

  • FunnelCountQueriesBitmapTest
  • FunnelCountQueriesPartitionedSortedTest
  • FunnelCountQueriesPartitionedTest
  • FunnelCountQueriesSetTest
  • FunnelCountQueriesThetaSketchTest


enum Setting {
SET("set"),
BITMAP("bitmap"),
Contributor

We could use this upstream to also store the MergeStrategy instance instead of having them all present in the class?

Contributor Author

As I said above, one challenge is sorted vs. unsorted segments. But I can probably reduce the runtime choice to just two strategies, sorted and unsorted, with the unsorted one resolved at construction time from the strategy settings given.
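
That reduction could look roughly like the sketch below: the unsorted strategy is resolved once from the settings, and only the sorted/unsorted decision is left to aggregation time. All names are hypothetical, and the `segmentIsSorted` flag stands in for whatever the real code inspects on the segment being processed.

```java
import java.util.Set;

/** Hypothetical sketch of a two-way runtime choice; not the PR's implementation. */
public class SortedSplitSketch {

  public enum Strategy { SORTED, THETA_SKETCH, SET, BITMAP }

  private final boolean _sortedRequested;
  private final Strategy _unsorted;

  public SortedSplitSketch(Set<String> settings) {
    _sortedRequested = settings.contains("sorted");
    // Resolved once, at construction time, from the settings given.
    if (settings.contains("theta_sketch")) {
      _unsorted = Strategy.THETA_SKETCH;
    } else if (settings.contains("set")) {
      _unsorted = Strategy.SET;
    } else {
      _unsorted = Strategy.BITMAP; // default when no other strategy is requested
    }
  }

  /** Decided per segment: open segments may not be sorted even if "sorted" was requested. */
  public Strategy select(boolean segmentIsSorted) {
    return _sortedRequested && segmentIsSorted ? Strategy.SORTED : _unsorted;
  }
}
```

For example, with settings `sorted` and `theta_sketch`, a sorted segment would use the sorted strategy while an unsorted one would fall back to theta_sketch rather than the default bitmap.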

@dario-liberman
Contributor Author

> Some tests are failing -- please check

I believe these are flaky tests unrelated to this PR. Is there a way to re-run the failed tests?

@dario-liberman
Contributor Author

@atris: I have refactored the PR as discussed:

  1. Moved all inner classes to the top level within a dedicated sub-package.
  2. Moved input validation, as well as strategy creation and selection logic, to a factory class.
  3. Strengthened type-checking by making the main aggregation function parametric.
  4. Separated the sorted strategy selection logic into a dedicated sub-class due to the added complexity: unlike other strategies, which can be selected at construction time, sorting can be decided only at aggregation time, based on whether the segment being processed is actually sorted (e.g. open segments may not be sorted).

All previous tests for the different strategies remain untouched and continue to pass after the refactoring.

Please re-review the PR.

@dario-liberman
Contributor Author

@atris - Did you have a chance to review the PR after the refactoring?

@atris atris merged commit d7aed2e into apache:master Aug 30, 2023
@Jackie-Jiang Jackie-Jiang added the feature and release-notes labels Aug 30, 2023
@Jackie-Jiang
Contributor

Thanks for the contribution! Can you help update the pinot doc for this feature?

KKcorps pushed a commit to KKcorps/incubator-pinot that referenced this pull request Sep 5, 2023
…he#11092)

* FUNNEL_COUNT - aggregation strategies

* FUNNEL_COUNT - Aggregation Strategies Tests

* FUNNEL_COUNT - Aggregation Strategies Tests

* Refactor: Move strategy creation into a factory, make funnel count aggregation function parametric

* Add license headers

* Simplify factory by postponing strategy construction and templatizing sorted split

* Fix linter errors

---------

Co-authored-by: Dario Liberman <[email protected]>