Funnel Count - Multiple Strategies (no partitioning requisites) #11092

dario-liberman · 2023-07-12T13:33:16Z

This PR adds the remaining funnel count aggregation strategies described in the documentation.

In particular, it introduces strategies that neither require nor make assumptions about the partitioning configuration:

theta_sketch
bitmap
set

These have the same characteristics as the respective distinct count aggregation functions.

These complement the already present aggregation strategies:

partitioned
partitioned, sorted

The first corresponds to SEGMENTPARTITIONEDDISTINCTCOUNT. The latter has no equivalent (though GAPFILL has a somewhat similar aggregation optimisation for sorted rows).

In order to select the strategy, the user just needs to indicate the desired strategy as a setting parameter, as documented in the link above, for example:

select 
  FUNNEL_COUNT(
    STEPS(
      url = '/cart/add', 
      url = '/checkout/start', 
      url = '/checkout/confirmation'),
    CORRELATE_BY(user_id),
    SETTINGS('theta_sketch', 'nominalEntries=4096')
  ) AS counts
from user_log
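To make the semantics concrete, here is a minimal sketch in plain Java, independent of Pinot's classes, of what a set-based funnel count might compute. It assumes that the count for each step is the number of distinct correlation ids that matched that step and every step before it. The class and method names are hypothetical and do not appear in this PR.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical illustration only; not Pinot's implementation.
final class SetFunnelSketch {
  private SetFunnelSketch() { }

  // stepSets.get(i) holds the correlation ids (e.g. user ids) whose rows matched step i.
  static List<Long> funnelCounts(List<Set<String>> stepSets) {
    List<Long> counts = new ArrayList<>();
    Set<String> reached = null;
    for (Set<String> step : stepSets) {
      if (reached == null) {
        reached = new HashSet<>(step);  // first step: everyone who matched it
      } else {
        reached.retainAll(step);        // later steps: keep only ids that also matched this step
      }
      counts.add((long) reached.size());
    }
    return counts;
  }
}
```

A bitmap or theta sketch variant would presumably follow the same shape, replacing the `HashSet` intersection with the corresponding bitmap or sketch intersection.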

codecov-commenter · 2023-07-12T23:07:08Z

Codecov Report

Merging #11092 (7e3ae78) into master (575398d) will increase coverage by 0.00%.
The diff coverage is 75.25%.

@@            Coverage Diff             @@
##             master   #11092    +/-   ##
==========================================
  Coverage     61.46%   61.46%            
+ Complexity     6514     6513     -1     
==========================================
  Files          2233     2249    +16     
  Lines        120144   120369   +225     
  Branches      18234    18253    +19     
==========================================
+ Hits          73848    73990   +142     
- Misses        40882    40950    +68     
- Partials       5414     5429    +15

Flag	Coverage Δ
integration1	`0.00% <0.00%> (ø)`
integration2	`0.00% <0.00%> (ø)`
java-11	`61.45% <75.25%> (+<0.01%)`	⬆️
java-17	`61.32% <75.25%> (+0.02%)`	⬆️
java-20	`61.32% <75.25%> (-0.01%)`	⬇️
temurin	`61.46% <75.25%> (+<0.01%)`	⬆️
unittests1	`66.97% <75.25%> (-0.02%)`	⬇️
unittests2	`14.56% <0.00%> (-0.03%)`	⬇️

Files Changed	Coverage Δ
...n/function/funnel/SetResultExtractionStrategy.java	`45.45% <45.45%> (ø)`
...unction/funnel/ThetaSketchAggregationStrategy.java	`50.00% <50.00%> (ø)`
...n/funnel/FunnelCountSortedAggregationFunction.java	`53.57% <53.57%> (ø)`
...unction/funnel/BitmapResultExtractionStrategy.java	`58.33% <58.33%> (ø)`
...unction/funnel/FunnelCountAggregationFunction.java	`65.11% <65.11%> (ø)`
...on/funnel/ThetaSketchResultExtractionStrategy.java	`71.42% <71.42%> (ø)`
...gregation/function/funnel/AggregationStrategy.java	`77.77% <77.77%> (ø)`
.../funnel/FunnelCountAggregationFunctionFactory.java	`82.60% <82.60%> (ø)`
...gregation/function/AggregationFunctionFactory.java	`81.35% <100.00%> (ø)`
...ion/function/funnel/BitmapAggregationStrategy.java	`100.00% <100.00%> (ø)`
... and 8 more

... and 11 files with indirect coverage changes

cbalci

Cool feature!
Left a couple minor comments, lgtm otherwise.

cbalci · 2023-07-24T16:16:52Z

...in/java/org/apache/pinot/core/query/aggregation/function/FunnelCountAggregationFunction.java

-    _sortedAggregationStrategy = new SortedAggregationStrategy();
+
+    final List<String> settings = Option.SETTINGS.getLiterals(expressions);
+    Setting.validate(settings);


Should we validate that only one of the strategies is selected and return an error otherwise?

Some combinations are valid. For example, one could ask for partitioned together with sorted or together with theta_sketch, or actually ask for all three (which would fall back to theta_sketch instead of the default bitmap strategy when a segment is not sorted).
But we could indeed check for invalid combinations instead of just prioritising them.
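As a rough sketch of what checking for invalid combinations could look like: the code below assumes that `set`, `bitmap` and `theta_sketch` are mutually exclusive while `partitioned` and `sorted` may be combined with any of them. That rule, the enum, and the validator class are all hypothetical illustrations, not the code in this PR.

```java
import java.util.EnumSet;
import java.util.List;
import java.util.Locale;

// Hypothetical names; a sketch of rejecting invalid combinations, not the PR's code.
enum FunnelSetting { SET, BITMAP, THETA_SKETCH, PARTITIONED, SORTED }

final class FunnelSettingsValidator {
  // Assumption for illustration: at most one of these may be requested at a time.
  private static final EnumSet<FunnelSetting> EXCLUSIVE =
      EnumSet.of(FunnelSetting.SET, FunnelSetting.BITMAP, FunnelSetting.THETA_SKETCH);

  private FunnelSettingsValidator() { }

  static EnumSet<FunnelSetting> parse(List<String> literals) {
    EnumSet<FunnelSetting> selected = EnumSet.noneOf(FunnelSetting.class);
    for (String literal : literals) {
      if (literal.contains("=")) {
        continue;  // key=value parameters such as nominalEntries=4096 are not strategy names
      }
      // valueOf throws IllegalArgumentException for an unknown strategy name.
      selected.add(FunnelSetting.valueOf(literal.trim().toUpperCase(Locale.ROOT)));
    }
    EnumSet<FunnelSetting> conflicting = EnumSet.copyOf(selected);
    conflicting.retainAll(EXCLUSIVE);
    if (conflicting.size() > 1) {
      throw new IllegalArgumentException("Conflicting FUNNEL_COUNT strategies: " + conflicting);
    }
    return selected;
  }
}
```

With this rule, `SETTINGS('partitioned', 'sorted', 'theta_sketch')` would be accepted, while `SETTINGS('set', 'bitmap')` would be rejected with an error rather than silently prioritised.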

cbalci · 2023-07-24T16:24:10Z

...in/java/org/apache/pinot/core/query/aggregation/function/FunnelCountAggregationFunction.java

+    if (_partitionSetting) {
+      return _partitionedMergeStrategy;
+    }
+    if (_thetaSketchSetting) {
+      return _thetaSketchMergeStrategy;
+    }
+    if (_setSetting) {
+      return _setMergeStrategy;
+    }
+    // default
+    return _bitmapMergeStrategy;
  }


This selection logic seems to be duplicated in multiple places. Can we centralize it to calculate merge/result/agg strategy in one place?

The main challenge is that the strategy is somewhat dynamic: the first two strategies depend on whether the segment is actually sorted. For example, the current open segment will not be sorted; only closed segments will be.

atris

Some tests are failing -- please check

atris · 2023-07-24T18:35:47Z

...in/java/org/apache/pinot/core/query/aggregation/function/FunnelCountAggregationFunction.java

 */
-public class FunnelCountAggregationFunction implements AggregationFunction<List<Long>, LongArrayList> {
+public class FunnelCountAggregationFunction implements AggregationFunction<Object, LongArrayList> {


We are losing some type specification here by moving to Object. Is it possible to create an abstract type specific to our functions and use it here?

Yes, unfortunately Java has no support for union types (well, only in exception catch clauses). I can create a wrapper if you think that helps, as each strategy is effectively using a different underlying aggregation type.

atris · 2023-07-24T18:37:39Z

...in/java/org/apache/pinot/core/query/aggregation/function/FunnelCountAggregationFunction.java

+  final AggregationStrategy _bitmapAggregationStrategy;
+  final AggregationStrategy _sortedAggregationStrategy;
+
+  final MergeStrategy _thetaSketchMergeStrategy;


Nit: Can this be moved to a child class for better readability?

You mean the strategy building/selection logic? I can move it out, yes.

atris · 2023-07-24T18:38:21Z

...in/java/org/apache/pinot/core/query/aggregation/function/FunnelCountAggregationFunction.java

-    int length = a.size();
-    Preconditions.checkState(length == b.size(), "The two operand arrays are not of the same size! provided %s, %s",
-        length, b.size());
+  public Object merge(Object a, Object b) {


Can this be an abstract type, known to this class and MergeStrategy?

It can be a wrapper type, but I don't think it would be known to the underlying merge strategy, as each uses a different aggregation type.
I personally think that the abstraction is unnecessary and would add garbage collection burden.

Note also that although the AggregationFunction interface has a type parameter for the intermediate result, everything outside just uses Object, see for example IndexedTable.

I considered propagating the type further up to avoid the use of Object here, using a generic type instead and moving some of the strategy selection to a separate aggregation function factory class.
There are two main challenges with that approach: (1) segments might be unsorted, making strategy selection dynamic; (2) it currently supports quite a few combinations of strategies (though I can probably simplify it to support only specific combinations for the sake of readability/maintainability).
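The generic-type idea can be sketched roughly as follows. This is an illustration under stated assumptions, not the refactored code: the interface and class names are hypothetical, and the partitioned merge is assumed to add per-step counts element-wise, in line with the size check on the two operands in the removed code above.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical names; illustrates typing the intermediate result with a generic parameter.
interface MergeStrategySketch<I> {
  I merge(I a, I b);
}

// Assumption: partitioned results are per-step counts, so merging adds them element-wise.
final class PartitionedMergeSketch implements MergeStrategySketch<List<Long>> {
  @Override
  public List<Long> merge(List<Long> a, List<Long> b) {
    if (a.size() != b.size()) {
      throw new IllegalStateException("Operands differ in size: " + a.size() + ", " + b.size());
    }
    List<Long> merged = new ArrayList<>(a.size());
    for (int i = 0; i < a.size(); i++) {
      merged.add(a.get(i) + b.get(i));
    }
    return merged;
  }
}

// The function is parametric in I, so merge() no longer takes or returns Object.
final class FunnelFunctionSketch<I> {
  private final MergeStrategySketch<I> _mergeStrategy;

  FunnelFunctionSketch(MergeStrategySketch<I> mergeStrategy) {
    _mergeStrategy = mergeStrategy;
  }

  I merge(I a, I b) {
    return _mergeStrategy.merge(a, b);
  }
}
```

A factory would then pick the concrete `I` once (for example `List<Long>` for partitioned, or a list of sets for the set strategy), and the compiler would check that the function and its strategies agree on it.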

atris · 2023-07-24T18:39:59Z

...in/java/org/apache/pinot/core/query/aggregation/function/FunnelCountAggregationFunction.java

+  private Dictionary getDictionary(Map<ExpressionContext, BlockValSet> blockValSetMap) {
+    final Dictionary primaryCorrelationDictionary = blockValSetMap.get(_primaryCorrelationCol).getDictionary();
+    Preconditions.checkArgument(primaryCorrelationDictionary != null,
+        "CORRELATE_BY column in FUNNELCOUNT aggregation function not supported, please use a dictionary encoded "


Is this a temporary limitation?

It is mentioned as a limitation in the documentation for the function (linked above in this PR). I personally have no plans to remove this limitation, as it would be rare to correlate by something other than an actual column. Someone else in the community could of course contribute the necessary changes, but I think it is a fair limitation for a funnel analytics function. In the future I might work on supporting a secondary set of correlations in addition to the primary correlation (e.g. correlate by user id + order id). Depending on how that is implemented, I could perhaps remove the limitation.

atris · 2023-07-24T18:40:37Z

...in/java/org/apache/pinot/core/query/aggregation/function/FunnelCountAggregationFunction.java

+        return _bitmapPartitionedResultExtractionStrategy;
+    }
+    if (_thetaSketchSetting) {
+      return _thetaSketchResultExtractionStrategy;


This IMO looks a bit scary. Are the existing tests exercising the code?

This PR includes 5 tests, one per strategy, each exercising both aggregation only and aggregation with group-by:

FunnelCountQueriesBitmapTest

FunnelCountQueriesPartitionedSortedTest

FunnelCountQueriesPartitionedTest

FunnelCountQueriesSetTest

FunnelCountQueriesThetaSketchTest

atris · 2023-07-24T18:46:25Z

...in/java/org/apache/pinot/core/query/aggregation/function/FunnelCountAggregationFunction.java

+
+  enum Setting {
+    SET("set"),
+    BITMAP("bitmap"),


We could use this upstream to also store the MergeStrategy instance instead of having all present in the class?

As I said above, one challenge is that of sorted vs unsorted segments. But I can probably reduce the runtime choice to just two strategies, sorted and unsorted, with the unsorted one resolved at construction time depending on the strategy settings given.
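A minimal sketch of that two-way split, with hypothetical names and with `partitioned` left out for brevity: the unsorted strategy is resolved once from the settings, and the only per-segment decision is whether the segment is actually sorted.

```java
import java.util.Set;

// Hypothetical names; "partitioned" is left out to keep the sketch short.
final class StrategySelectorSketch {
  enum Strategy { SORTED, THETA_SKETCH, SET, BITMAP }

  private final boolean _sortedRequested;
  private final Strategy _unsortedStrategy;  // resolved once, at construction time

  StrategySelectorSketch(Set<String> settings) {
    _sortedRequested = settings.contains("sorted");
    if (settings.contains("theta_sketch")) {
      _unsortedStrategy = Strategy.THETA_SKETCH;
    } else if (settings.contains("set")) {
      _unsortedStrategy = Strategy.SET;
    } else {
      _unsortedStrategy = Strategy.BITMAP;  // default when nothing else is requested
    }
  }

  // The only runtime decision: use the sorted strategy only if it was requested
  // and the segment being processed is actually sorted (open segments may not be).
  Strategy forSegment(boolean segmentIsSorted) {
    return _sortedRequested && segmentIsSorted ? Strategy.SORTED : _unsortedStrategy;
  }
}
```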

dario-liberman · 2023-07-24T19:11:25Z

Some tests are failing -- please check

I believe these are flaky tests unrelated to this PR. Is there a way to re-run the failed tests?

dario-liberman · 2023-08-11T12:19:36Z

@atris: I have refactored the PR as discussed:

Moving all inner classes to the top level within a dedicated sub-package.
Moving input validation, as well as strategy creation and selection logic, to a factory class.
Stronger type-checking by making the main aggregation function parametric.
Separating out the sorted strategy selection logic into a dedicated sub-class due to the added complexity: unlike other strategies, which can be selected at construction time, sorting can be decided only at aggregation time, based on whether the segment being processed is actually sorted (e.g. open segments may not be sorted).

All previous tests for the different strategies remain untouched and continue to pass after the refactoring.

Please re-review the PR.

dario-liberman · 2023-08-18T06:09:46Z

@atris - Did you have a chance to review the PR after the refactoring?

Jackie-Jiang · 2023-08-30T18:55:00Z

Thanks for the contribution! Can you help update the Pinot docs for this feature?

…he#11092) * FUNNEL_COUNT - aggregation strategies * FUNNEL_COUNT - Aggregation Strategies Tests * FUNNEL_COUNT - Aggregation Strategies Tests * Refactor: Move strategy greation into a factory, make funnel count aggregation function parametric * Add license headers * Simplify factory by postponing strategy construction and templetizing sorted split * Fix linter errors --------- Co-authored-by: Dario Liberman <[email protected]>

dario-liberman mentioned this pull request Jul 12, 2023

FUNNEL_COUNT Aggregation Function #10867

Merged

cbalci approved these changes Jul 24, 2023

atris self-assigned this Jul 24, 2023

atris requested changes Jul 24, 2023

darioliberman and others added 4 commits August 11, 2023 13:17

FUNNEL_COUNT - aggregation strategies

ff0db6f

FUNNEL_COUNT - Aggregation Strategies Tests

6d3f9fa

FUNNEL_COUNT - Aggregation Strategies Tests

43c6de4

Refactor: Move strategy greation into a factory, make funnel count ag…

8b5d282

…gregation function parametric

dario-liberman closed this Aug 11, 2023

dario-liberman force-pushed the funnel-strategies branch from 7463654 to 527eb4f Compare August 11, 2023 11:33

dario-liberman reopened this Aug 11, 2023

Add license headers

70c6214

dario-liberman and others added 3 commits August 11, 2023 18:30

Simplify factory by postponing strategy construction and templetizing…

f2b7046

… sorted split

Fix linter errors

7c06a72

Merge branch 'apache:master' into funnel-strategies

7e3ae78

atris approved these changes Aug 30, 2023

atris merged commit d7aed2e into apache:master Aug 30, 2023

Jackie-Jiang added feature release-notes Referenced by PRs that need attention when compiling the next release notes labels Aug 30, 2023

ankitsultana mentioned this pull request Oct 2, 2023

[Feature] Add Support for SQL Formatting in Query Editor #11725

Merged
