feat(stream,agg): enable distinct agg support in backend #8100

stdrc · 2023-02-21T10:42:49Z

I hereby agree to the terms of the RisingWave Labs, Inc. Contributor License Agreement.

What's changed and what's your intention?

Previously, in #7797, distinct agg support was added (without a cache) but not enabled. This PR enables it by disabling the 2-phase rewrite rule for streaming distinct agg calls, and also adds an LRU cache to the deduplicater.

This will close #7682, and possibly resolve or at least mitigate the performance issues in #7350 and #7271.
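
To illustrate the idea of a deduplicater fronted by an LRU cache, here is a minimal sketch. It is not the code in this PR: the names `LruCounts` and `Deduplicater`, the `i64` key type, and the `HashMap` standing in for the persistent dedup table are all assumptions made for illustration. The sketch keeps an occurrence count per distinct value and forwards a row to the aggregator only when the count moves between 0 and 1.

```rust
use std::collections::HashMap;

/// Tiny LRU cache of occurrence counts (hypothetical; eviction is O(n) for brevity).
struct LruCounts {
    capacity: usize,
    tick: u64,
    entries: HashMap<i64, (u64, u64)>, // value -> (count, last-used tick)
}

impl LruCounts {
    fn new(capacity: usize) -> Self {
        Self { capacity, tick: 0, entries: HashMap::new() }
    }

    /// Returns the cached count and marks the entry as recently used.
    fn get(&mut self, key: i64) -> Option<u64> {
        self.tick += 1;
        let tick = self.tick;
        self.entries.get_mut(&key).map(|e| {
            e.1 = tick;
            e.0
        })
    }

    /// Inserts or updates a count, evicting the least recently used entry if full.
    fn put(&mut self, key: i64, count: u64) {
        self.tick += 1;
        if !self.entries.contains_key(&key) && self.entries.len() >= self.capacity {
            let victim = self.entries.iter().min_by_key(|(_, e)| e.1).map(|(k, _)| *k);
            if let Some(victim) = victim {
                self.entries.remove(&victim);
            }
        }
        self.entries.insert(key, (count, self.tick));
    }
}

/// Hypothetical deduplicater: `table` stands in for the persistent dedup state table.
struct Deduplicater {
    cache: LruCounts,
    table: HashMap<i64, u64>,
}

impl Deduplicater {
    fn new(cache_capacity: usize) -> Self {
        Self { cache: LruCounts::new(cache_capacity), table: HashMap::new() }
    }

    /// Reads the count from the cache, falling back to the table on a miss.
    fn load(&mut self, value: i64) -> u64 {
        match self.cache.get(value) {
            Some(count) => count,
            None => self.table.get(&value).copied().unwrap_or(0),
        }
    }

    /// Write-through update of both the table and the cache.
    fn store(&mut self, value: i64, count: u64) {
        if count == 0 {
            self.table.remove(&value);
        } else {
            self.table.insert(value, count);
        }
        self.cache.put(value, count);
    }

    /// Returns true if this insert is the first occurrence and should reach the aggregator.
    fn insert(&mut self, value: i64) -> bool {
        let count = self.load(value) + 1;
        self.store(value, count);
        count == 1
    }

    /// Returns true if this delete removes the last occurrence and should reach the aggregator.
    fn delete(&mut self, value: i64) -> bool {
        let count = self.load(value);
        if count == 0 {
            return false; // deleting an unseen value: nothing to retract
        }
        self.store(value, count - 1);
        count == 1
    }
}
```

Because the cache is write-through in this sketch, evicting an entry never loses state; a later access simply reloads the count from the table.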

Checklist For Contributors

I have written necessary rustdoc comments
I have added necessary unit tests and integration tests
I have added fuzzing tests or opened an issue to track them. (Optional, recommended for new SQL features).
I have demonstrated that backward compatibility is not broken by breaking changes and created issues to track deprecated features to be removed in the future. (Please refer to the issue)
All checks passed in ./risedev check (or alias, ./risedev c)

Checklist For Reviewers

I have requested macro/micro-benchmarks as this PR can affect performance substantially, and the results are shown.

Documentation

My PR DOES NOT contain user-facing changes.

Click here for Documentation

Types of user-facing changes

Please keep the types that apply to your changes, and remove the others.

Installation and deployment
Connector (sources & sinks)
SQL commands, functions, and operators
RisingWave cluster configuration changes
Other (please specify in the release note below)

Release note

stdrc · 2023-02-22T06:52:47Z

Ran a basic benchmark on an AWS c5.4xlarge x86-64 machine, with streaming_parallelism set to 4.

Nexmark Q15

Throughput appears to have increased by ~50%, but it is less smooth.

main:

this pr:

stdrc · 2023-02-22T07:59:16Z

Also tried Q15 without group by, i.e.:

CREATE MATERIALIZED VIEW nexmark_q15 AS
SELECT
     count(*) AS total_bids,
     count(*) filter (where price < 10000) AS rank1_bids,
     count(*) filter (where price >= 10000 and price < 1000000) AS rank2_bids,
     count(*) filter (where price >= 1000000) AS rank3_bids,
     count(distinct bidder) AS total_bidders,
     count(distinct bidder) filter (where price < 10000) AS rank1_bidders,
     count(distinct bidder) filter (where price >= 10000 and price < 1000000) AS rank2_bidders,
     count(distinct bidder) filter (where price >= 1000000) AS rank3_bidders,
     count(distinct auction) AS total_auctions,
     count(distinct auction) filter (where price < 10000) AS rank1_auctions,
     count(distinct auction) filter (where price >= 10000 and price < 1000000) AS rank2_auctions,
     count(distinct auction) filter (where price >= 1000000) AS rank3_auctions
FROM bid;

Results also showed a ~22% throughput increase:

main:

this pr:

st1page · 2023-02-22T08:12:55Z

Maybe you can add the benchmark environment information to the comment.
Since the expand-based distinct agg rewrite costs more on multiple nodes because of the exchange in 2-phase agg, the improvement could be even larger in the real world.

lmatz · 2023-02-22T08:41:27Z

Are main/this pr running on the same machine? The CPU usage is greatly improved, wow!

stdrc · 2023-02-22T08:42:21Z

Are main/this pr running on the same machine?

Yep

BugenZhao

Generally LGTM. It's great to see the improvement with such an elegant implementation!

src/frontend/src/optimizer/mod.rs

src/stream/src/executor/global_simple_agg.rs

st1page

LGTM! good work!
btw, maybe we can add an iter_tables method on the ExecutorInner to replace iter_table_storage(&mut this.storages).chain(this.distinct_dedup_tables.values_mut())
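
A rough sketch of what such a helper could look like follows. The `StateTable`, `AggStateStorage`, and `ExecutorInner` definitions here are simplified, hypothetical stand-ins (the real types are generic over the state store and carry more fields), and `commit_all` is only an illustrative caller.

```rust
use std::collections::HashMap;

/// Simplified, hypothetical stand-in for a state table.
struct StateTable {
    name: &'static str,
    committed_epoch: u64,
}

/// Simplified, hypothetical agg state storage: only some variants own a table.
enum AggStateStorage {
    ResultValue,
    Table(StateTable),
}

/// Yields the tables owned by the agg state storages, skipping table-less variants.
fn iter_table_storage<'a>(
    storages: &'a mut [AggStateStorage],
) -> impl Iterator<Item = &'a mut StateTable> + 'a {
    storages.iter_mut().filter_map(|s| match s {
        AggStateStorage::Table(table) => Some(table),
        AggStateStorage::ResultValue => None,
    })
}

struct ExecutorInner {
    storages: Vec<AggStateStorage>,
    distinct_dedup_tables: HashMap<usize, StateTable>,
}

impl ExecutorInner {
    /// One place that enumerates every state table, including the dedup tables.
    fn iter_tables(&mut self) -> impl Iterator<Item = &mut StateTable> + '_ {
        iter_table_storage(&mut self.storages).chain(self.distinct_dedup_tables.values_mut())
    }

    /// Illustrative caller: apply the same operation to all tables.
    fn commit_all(&mut self, epoch: u64) {
        for table in self.iter_tables() {
            table.committed_epoch = epoch;
        }
    }
}
```

With a helper like this, call sites that need to touch every table no longer have to remember to chain the dedup tables by hand.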

src/stream/src/executor/global_simple_agg.rs

src/frontend/src/optimizer/rule/distinct_agg_rule.rs

stdrc · 2023-02-23T09:06:46Z

Changes have been made according to suggestions you gave. PTAL🥰

st1page

I'd prefer to make this strategy simpler:

with group by: use the executor's implementation
without group by: rewrite the plan

The only scalability concern here is whether we can process the data in a distributed way, and the group by key is enough for us to shuffle and scale the processing. But currently, the simple agg can only be processed in parallel with vnode-based 2-phase agg, which conflicts with the distinct aggregators.
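
The strategy proposed in this comment can be sketched as a small decision function. This is only an illustration of the suggestion, not the rule as merged; `AggCall`, `DistinctStrategy`, and `choose_distinct_strategy` are hypothetical names.

```rust
/// Hypothetical description of one aggregate call in a streaming plan.
struct AggCall {
    distinct: bool,
}

/// How distinct agg calls are handled (names are illustrative only).
#[derive(Debug, PartialEq)]
enum DistinctStrategy {
    /// No distinct calls, so nothing special is required.
    NotNeeded,
    /// With group by: rely on the executor's deduplication.
    ExecutorDedup,
    /// Without group by: rewrite the plan in the optimizer.
    RewritePlan,
}

fn choose_distinct_strategy(group_key: &[usize], agg_calls: &[AggCall]) -> DistinctStrategy {
    if !agg_calls.iter().any(|call| call.distinct) {
        DistinctStrategy::NotNeeded
    } else if !group_key.is_empty() {
        // The group by key is enough to shuffle and scale the processing.
        DistinctStrategy::ExecutorDedup
    } else {
        // A simple agg needs 2-phase agg to run in parallel, which conflicts
        // with distinct aggregators, so fall back to the rewrite.
        DistinctStrategy::RewritePlan
    }
}
```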

src/frontend/src/optimizer/rule/distinct_agg_rule.rs

Signed-off-by: Richard Chien <[email protected]>

st1page

LGTM

github-actions bot added the type/feature label Feb 21, 2023

stdrc marked this pull request as ready for review February 21, 2023 11:12

stdrc changed the title ~~feat(stream,agg): add distinct agg support in streaming backend~~ feat(stream,agg): enable distinct agg support in backend Feb 21, 2023

stdrc requested review from st1page, BugenZhao and chenzl25 February 21, 2023 11:15

stdrc requested a review from lmatz February 21, 2023 11:36

TennyZhuang requested a review from xxchan February 21, 2023 16:16

BugenZhao reviewed Feb 22, 2023

src/frontend/src/optimizer/mod.rs (resolved)

src/stream/src/executor/global_simple_agg.rs (outdated, resolved)

st1page reviewed Feb 22, 2023

src/stream/src/executor/global_simple_agg.rs (outdated, resolved)

chenzl25 reviewed Feb 22, 2023

src/frontend/src/optimizer/rule/distinct_agg_rule.rs (outdated, resolved)

stdrc force-pushed the rc/enable-distinct-agg branch 3 times, most recently from 5e5c129 to 2aa03e4 February 23, 2023 08:58

stdrc commented Feb 23, 2023

src/frontend/src/optimizer/rule/distinct_agg_rule.rs (resolved)

st1page reviewed Feb 23, 2023

src/frontend/src/optimizer/rule/distinct_agg_rule.rs (outdated, resolved)

stdrc added 2 commits February 24, 2023 15:29

enable backend distinct agg impl

b14bcfa

Signed-off-by: Richard Chien <[email protected]>

update planner tests

e45bc92

Signed-off-by: Richard Chien <[email protected]>

stdrc force-pushed the rc/enable-distinct-agg branch from 88c8581 to e45bc92 February 24, 2023 07:33

stdrc requested a review from st1page February 24, 2023 07:53

st1page approved these changes Feb 24, 2023

stdrc added the mergify/can-merge label Feb 24, 2023

Merge branch 'main' into rc/enable-distinct-agg

5901fa4

mergify bot merged commit 3c5bf28 into main Feb 24, 2023

mergify bot deleted the rc/enable-distinct-agg branch February 24, 2023 09:47

stdrc mentioned this pull request Mar 2, 2023

opt(agg): reuse existing count(*) while generating stream plan #8197

Closed

stdrc mentioned this pull request Mar 17, 2023

feat: support any combination of distinct and ordered agg calls #8614

Closed
