Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat(stream,agg): enable distinct agg support in backend #8100

Merged
merged 3 commits into from
Feb 24, 2023

Conversation

stdrc
Copy link
Member

@stdrc stdrc commented Feb 21, 2023

I hereby agree to the terms of the RisingWave Labs, Inc. Contributor License Agreement.

What's changed and what's your intention?

Previously in #7797, distinct agg support is added (without cache) but not enabled. This PR enables it by disable 2-phase rewrite rule for streaming distinct agg calls, and also adds an LRU cache in the deduplicater.

This will close #7682, and possibly resolve or at least mitigate the performance issue in #7350 and #7271.

Checklist For Contributors

  • I have written necessary rustdoc comments
  • I have added necessary unit tests and integration tests
  • I have added fuzzing tests or opened an issue to track them. (Optional, recommended for new SQL features).
  • I have demonstrated that backward compatibility is not broken by breaking changes and created issues to track deprecated features to be removed in the future. (Please refer to the issue)
  • All checks passed in ./risedev check (or alias, ./risedev c)

Checklist For Reviewers

  • I have requested macro/micro-benchmarks as this PR can affect performance substantially, and the results are shown.

Documentation

  • My PR DOES NOT contain user-facing changes.
Click here for Documentation

Types of user-facing changes

Please keep the types that apply to your changes, and remove the others.

  • Installation and deployment
  • Connector (sources & sinks)
  • SQL commands, functions, and operators
  • RisingWave cluster configuration changes
  • Other (please specify in the release note below)

Release note

@stdrc stdrc marked this pull request as ready for review February 21, 2023 11:12
@stdrc stdrc changed the title feat(stream,agg): add distinct agg support in streaming backend feat(stream,agg): enable distinct agg support in backend Feb 21, 2023
@stdrc

This comment was marked as resolved.

@stdrc stdrc requested a review from lmatz February 21, 2023 11:36
@TennyZhuang TennyZhuang requested a review from xxchan February 21, 2023 16:16
@codecov

This comment was marked as resolved.

@stdrc
Copy link
Member Author

stdrc commented Feb 22, 2023

Did some basic benchmark on AWS c5.4xlarge x86-64 machine, with streaming_parallelism set to 4.

Nexmark Q15

It seems that the throughput gained a ~50% increase, but less smooth.

main:

q15 aws main

this pr:

q15 aws pr

@stdrc
Copy link
Member Author

stdrc commented Feb 22, 2023

Also tried Q15 without group by, i.e.:

CREATE MATERIALIZED VIEW nexmark_q15 AS
SELECT
     count(*) AS total_bids,
     count(*) filter (where price < 10000) AS rank1_bids,
     count(*) filter (where price >= 10000 and price < 1000000) AS rank2_bids,
     count(*) filter (where price >= 1000000) AS rank3_bids,
     count(distinct bidder) AS total_bidders,
     count(distinct bidder) filter (where price < 10000) AS rank1_bidders,
     count(distinct bidder) filter (where price >= 10000 and price < 1000000) AS rank2_bidders,
     count(distinct bidder) filter (where price >= 1000000) AS rank3_bidders,
     count(distinct auction) AS total_auctions,
     count(distinct auction) filter (where price < 10000) AS rank1_auctions,
     count(distinct auction) filter (where price >= 10000 and price < 1000000) AS rank2_auctions,
     count(distinct auction) filter (where price >= 1000000) AS rank3_auctions
FROM bid;

Results also showed ~22% throughput increase:

main:

q15simple aws main

this pr:

q15simple aws pr

@st1page
Copy link
Contributor

st1page commented Feb 22, 2023

maybe you can add the environment information of the benchmark result in the comment.
Considering the expand-based distinct agg rewriting plan takes more cost in multiple nodes because of the exchange of 2-phase agg, it could be better in real world.

@lmatz
Copy link
Contributor

lmatz commented Feb 22, 2023

Are main/this pr running on the same machine? seeing the CPU usage is greatly improved, wow!

@stdrc
Copy link
Member Author

stdrc commented Feb 22, 2023

Are main/this pr running on the same machine?

Yep

Copy link
Member

@BugenZhao BugenZhao left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Generally LGTM. It's great to see the improvement with such elegant implementation!

Copy link
Contributor

@st1page st1page left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM! good work!
btw, maybe we can add a iter_tables method on the ExecutorInner to repleace iter_table_storage(&mut this.storages).chain(this.distinct_dedup_tables.values_mut())

@stdrc stdrc force-pushed the rc/enable-distinct-agg branch 3 times, most recently from 5e5c129 to 2aa03e4 Compare February 23, 2023 08:58
@stdrc
Copy link
Member Author

stdrc commented Feb 23, 2023

Changes have been made according to suggestions you gave. PTAL🥰

Copy link
Contributor

@st1page st1page left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

prefer making this strategy easier:

  1. with group by: use executor's implementation
  2. without group by: rewrite the plan
    because the only concern about scalability here is that if we can process distributed and the group by key is enough for us to shuffle and scale the processing. But currently, the simple join can only be parallel processed with vnode-based 2-phase agg which is in conflict with the distinct aggregators.

Signed-off-by: Richard Chien <[email protected]>
@stdrc stdrc force-pushed the rc/enable-distinct-agg branch from 88c8581 to e45bc92 Compare February 24, 2023 07:33
@stdrc stdrc requested a review from st1page February 24, 2023 07:53
Copy link
Contributor

@st1page st1page left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

feat: add distinct agg support to replace 2-phase rewriting
5 participants