feat(stream,agg): add distinct deduplicater #7797

Merged
merged 33 commits into from
Feb 14, 2023
@stdrc stdrc commented Feb 8, 2023

Signed-off-by: Richard Chien [email protected]
I hereby agree to the terms of the RisingWave Labs, Inc. Contributor License Agreement.

What's changed and what's your intention?

This PR adds a DistinctDeduplicater to the streaming backend to support distinct aggregation in HashAgg and GlobalSimpleAgg. It relies on the state tables inferred by the frontend, one state table per distinct column. Each dedup table has the following schema:

group key | distinct key | count for agg call 1 | count for agg call 2 | ...

Let me explain with an example:

select
    count(*), -- count star, no need for a dedup table
    count(distinct a), -- agg call `W`, share a dedup table for distinct column `a`
    count(distinct a) filter (where c > 1000), -- agg call `X`, share a dedup table for distinct column `a`
    count(distinct b), -- agg call `Y`, share a dedup table for distinct column `b`
    count(distinct b) filter (where c > 1000) -- agg call `Z`, share a dedup table for distinct column `b`
from t group by d;

There'll be two dedup tables:

  • Dedup table for column a:
    d | a | count_for_W | count_for_X
    
  • Dedup table for column b:
    d | b | count_for_Y | count_for_Z
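
The mapping from agg calls to dedup table schemas can be sketched as follows. This is only an illustration of the layout described above, not the frontend's actual inference code: `AggCallSketch` and `dedup_table_schemas` are hypothetical names, and columns are represented as plain strings.

```rust
use std::collections::BTreeMap;

/// Hypothetical, simplified agg call: a label plus the column it is
/// distinct on (`None` for non-distinct calls such as `count(*)`).
pub struct AggCallSketch {
    pub label: &'static str,
    pub distinct_col: Option<&'static str>,
}

/// Builds one schema per distinct column:
/// group key | distinct key | count for each agg call on that column.
pub fn dedup_table_schemas(
    group_key: &[&str],
    calls: &[AggCallSketch],
) -> BTreeMap<String, Vec<String>> {
    let mut tables: BTreeMap<String, Vec<String>> = BTreeMap::new();
    for call in calls {
        // Non-distinct calls need no dedup table.
        if let Some(col) = call.distinct_col {
            let schema = tables.entry(col.to_string()).or_insert_with(|| {
                // First call on this column: start with group key + distinct key.
                let mut cols: Vec<String> =
                    group_key.iter().map(|c| c.to_string()).collect();
                cols.push(col.to_string());
                cols
            });
            // Every call sharing the column gets its own count column.
            schema.push(format!("count_for_{}", call.label));
        }
    }
    tables
}
```

Feeding it the five agg calls from the query above with group key `d` yields two schemas, one for `a` and one for `b`, and nothing for `count(*)`.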
    

Each aggregation group has a DistinctDeduplicater, which counts the occurrences of each distinct key for each agg call according to the visibility (with the agg filter and group filter already applied). For every duplicate row, DistinctDeduplicater hides it in the returned visibility.
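
A minimal in-memory sketch of this counting scheme is shown below. It is an assumption-laden illustration, not the code in this PR: `DedupSketch`, `Op`, and the `dedup` signature are hypothetical, distinct keys are plain `i64` values, and a `HashMap` stands in for the dedup state table of one group and one distinct column.

```rust
use std::collections::HashMap;

/// Row operation in a stream chunk (simplified).
#[derive(Clone, Copy, Debug, PartialEq)]
pub enum Op {
    Insert,
    Delete,
}

/// Hypothetical stand-in for one dedup table of one aggregation group:
/// distinct key -> one occurrence count per agg call sharing the column.
pub struct DedupSketch {
    n_calls: usize,
    counts: HashMap<i64, Vec<u64>>,
}

impl DedupSketch {
    pub fn new(n_calls: usize) -> Self {
        Self { n_calls, counts: HashMap::new() }
    }

    /// `visibilities[c][r]` tells whether row `r` is visible to agg call `c`
    /// (filters already applied). Returns the same shape with duplicates hidden.
    pub fn dedup(
        &mut self,
        ops: &[Op],
        keys: &[i64],
        visibilities: &[Vec<bool>],
    ) -> Vec<Vec<bool>> {
        let n_calls = self.n_calls;
        let mut out = visibilities.to_vec();
        for (r, (op, key)) in ops.iter().zip(keys).enumerate() {
            let counts = self.counts.entry(*key).or_insert_with(|| vec![0; n_calls]);
            for c in 0..n_calls {
                if !visibilities[c][r] {
                    continue; // row is filtered out for this agg call
                }
                match op {
                    Op::Insert => {
                        counts[c] += 1;
                        // Key was already present: this insert is a duplicate.
                        if counts[c] > 1 {
                            out[c][r] = false;
                        }
                    }
                    Op::Delete => {
                        // Assumes a well-formed stream that never deletes
                        // a key it has not inserted.
                        counts[c] = counts[c].saturating_sub(1);
                        // Other occurrences remain: the key must stay.
                        if counts[c] > 0 {
                            out[c][r] = false;
                        }
                    }
                }
            }
            // Drop keys that no agg call counts anymore.
            let unused = counts.iter().all(|&n| n == 0);
            if unused {
                self.counts.remove(key);
            }
        }
        out
    }
}
```

In this sketch, only the first insert and the last delete of a distinct key stay visible to a given agg call, so the downstream aggregator sees each distinct key at most once.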


A cache for the dedup state table is not supported yet because of concerns about memory consumption; it may be introduced in a later PR.

Distinct agg support is not enabled yet (DistinctAggRule still rewrites distinct agg calls into 2-phase agg); it may be enabled in a later PR.

Checklist

  • I have written necessary rustdoc comments
  • I have added necessary unit tests and integration tests
  • I have added fuzzing tests or opened an issue to track them. (Optional, recommended for new SQL features).
  • I have demonstrated that backward compatibility is not broken by breaking changes and created issues to track deprecated features to be removed in the future. (Please refer the issue)
  • All checks passed in ./risedev check (or alias, ./risedev c)

Refer to a related PR or issue link (optional)

#7682

stdrc added 3 commits February 8, 2023 20:32
Signed-off-by: Richard Chien <[email protected]>
Signed-off-by: Richard Chien <[email protected]>
@kwannoel kwannoel self-requested a review February 8, 2023 13:58
stdrc added 24 commits February 9, 2023 15:21
Signed-off-by: Richard Chien <[email protected]>
Signed-off-by: Richard Chien <[email protected]>
Signed-off-by: Richard Chien <[email protected]>
Signed-off-by: Richard Chien <[email protected]>
Signed-off-by: Richard Chien <[email protected]>
Signed-off-by: Richard Chien <[email protected]>
Signed-off-by: Richard Chien <[email protected]>
Signed-off-by: Richard Chien <[email protected]>
Signed-off-by: Richard Chien <[email protected]>
Signed-off-by: Richard Chien <[email protected]>
Signed-off-by: Richard Chien <[email protected]>
Signed-off-by: Richard Chien <[email protected]>
Signed-off-by: Richard Chien <[email protected]>
@stdrc stdrc changed the title from "feat(stream,agg): distinct aggregator" to "feat(stream,agg): add distinct deduplicater" Feb 13, 2023
@stdrc stdrc marked this pull request as ready for review February 13, 2023 15:09
Signed-off-by: Richard Chien <[email protected]>
Signed-off-by: Richard Chien <[email protected]>
codecov bot commented Feb 13, 2023

Codecov Report

Merging #7797 (7210cad) into main (0de2cca) will increase coverage by 0.07%.
The diff coverage is 86.50%.

@@            Coverage Diff             @@
##             main    #7797      +/-   ##
==========================================
+ Coverage   71.71%   71.78%   +0.07%     
==========================================
  Files        1113     1114       +1     
  Lines      177694   178366     +672     
==========================================
+ Hits       127425   128033     +608     
- Misses      50269    50333      +64     
Flag Coverage Δ
rust 71.78% <86.50%> (+0.07%) ⬆️


Impacted Files Coverage Δ
src/frontend/src/optimizer/plan_node/stream.rs 13.99% <0.00%> (-0.32%) ⬇️
src/stream/src/executor/aggregation/agg_call.rs 81.25% <ø> (ø)
src/stream/src/executor/aggregation/mod.rs 88.13% <ø> (ø)
src/stream/src/from_proto/agg_common.rs 0.00% <0.00%> (ø)
src/stream/src/from_proto/global_simple_agg.rs 0.00% <0.00%> (ø)
src/stream/src/from_proto/hash_agg.rs 0.00% <0.00%> (ø)
...rc/frontend/src/optimizer/plan_node/generic/agg.rs 72.17% <36.73%> (-3.89%) ⬇️
src/stream/src/executor/hash_agg.rs 95.94% <81.08%> (-0.02%) ⬇️
src/stream/src/executor/global_simple_agg.rs 95.08% <90.32%> (-0.35%) ⬇️
src/stream/src/executor/aggregation/agg_group.rs 86.33% <93.33%> (+0.53%) ⬆️
... and 17 more


@soundOfDestiny soundOfDestiny left a comment

LGTM

Signed-off-by: Richard Chien <[email protected]>
Signed-off-by: Richard Chien <[email protected]>
@mergify mergify bot merged commit 4424382 into main Feb 14, 2023
@mergify mergify bot deleted the rc/distinct-agg branch February 14, 2023 09:39
mergify bot pushed a commit that referenced this pull request Feb 24, 2023
Previously, in #7797, distinct agg support was added (without a cache) but not enabled. This PR enables it by disabling the 2-phase rewrite rule for streaming distinct agg calls, and also adds an LRU cache to the deduplicater.

This will close #7682, and possibly resolve or at least mitigate the performance issue in #7350 and #7271.

Approved-By: st1page