
revert #7957 to validate the performance regression of q7 and try to improve if true #8510

Open · Tracked by #7289
lmatz opened this issue Mar 13, 2023 · 9 comments

lmatz (Contributor) commented Mar 13, 2023

#7957

[screenshot: SCR-20230313-paf]

Meanwhile, q7-rewrite, which uses group top-n, does not change much in performance. I didn't see other queries affected.

@github-actions github-actions bot added this to the release-0.1.18 milestone Mar 13, 2023
lmatz (Contributor, Author) commented Mar 13, 2023


Throughput doubled after reverting the PR:

https://github.com/risingwavelabs/risingwave/tree/lz/revert_7957

lmatz (Contributor, Author) commented Mar 14, 2023

It probably still depends on the real workload whether this improves or worsens performance.
Then how about making it a per-query option?
Query hints, or session variables (the latter are hard to control if there are multiple joins in a single query).

Or is may_exist just not good enough?
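
For what it's worth, here is a rough sketch of what such a per-query knob could look like: a session-level default plus an optional per-join hint override. None of these names (`CacheRefillPolicy`, `streaming_join_cache_refill`, `JoinHint`) exist in RisingWave; this only illustrates the resolution order, with a hint scoping better than a session variable when a single query has several joins.

```rust
// Hypothetical sketch only: none of these names are real RisingWave identifiers.
// It shows a session-level default plus a per-join hint override for whether
// the join cache is refilled on write.

#[derive(Clone, Copy, Debug, PartialEq)]
enum CacheRefillPolicy {
    /// Check storage / refill the join cache when a side is written to.
    RefillOnWrite,
    /// Only populate the cache on probe (read) misses.
    RefillOnReadOnly,
}

/// Session-level default, imagined as something like
/// `SET streaming_join_cache_refill = ...` (hypothetical variable).
struct SessionConfig {
    join_cache_refill: CacheRefillPolicy,
}

/// Optional per-join override, imagined as a query hint attached to one
/// specific join.
struct JoinHint {
    cache_refill: Option<CacheRefillPolicy>,
}

fn effective_policy(session: &SessionConfig, hint: &JoinHint) -> CacheRefillPolicy {
    // Hint wins over the session default; fall back otherwise.
    hint.cache_refill.unwrap_or(session.join_cache_refill)
}

fn main() {
    let session = SessionConfig {
        join_cache_refill: CacheRefillPolicy::RefillOnWrite,
    };
    let q7_join = JoinHint {
        cache_refill: Some(CacheRefillPolicy::RefillOnReadOnly),
    };
    let other_join = JoinHint { cache_refill: None };

    // The hinted join opts out of refill-on-write; the rest keep the default.
    assert_eq!(
        effective_policy(&session, &q7_join),
        CacheRefillPolicy::RefillOnReadOnly
    );
    assert_eq!(
        effective_policy(&session, &other_join),
        CacheRefillPolicy::RefillOnWrite
    );
    println!("per-join override resolves before the session default");
}
```
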

fuyufjh (Member) commented Mar 15, 2023

Interesting 🤔 cc. @hzxa21 Any ideas about the reason?

hzxa21 (Collaborator) commented Mar 15, 2023

TL;DR: I think that in the perf benchmark the locality between the two streams is worse than expected, so refilling the cache on write in the join executor evicts old entries that are still needed, leading to a higher operator cache miss rate (a toy model of this effect is sketched after the findings below). cc @KeXiangWang, this is different from what we have seen before. Any thoughts?

Findings after looking at grafana (02/23 build: without may_exist and join cache refill on write; 02/24 build: with them):

  1. Aggregation Cached Keys is way smaller in 02/24

  2. Join Cached Entries is also way smaller in 02/24

  3. The join cache miss rate is higher in 02/24: the miss rate when inserting entries into the left table is high, and the missed keys are most likely existing keys (the may_exist true rate is high).

  4. Read durations of the join left and right tables are way higher in 02/24.
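
To make the suspected mechanism concrete, here is a minimal, self-contained toy model (plain Rust with a tiny LRU; stream shapes and cache size are made up, none of this is RisingWave code): when the dense side of the join refills the cache on every write, it keeps evicting the entries the sparse side still needs, so the sparse side's miss rate climbs.

```rust
// Toy model, not RisingWave code: a tiny LRU join cache shared by a dense
// "price" write stream and a sparse "max price" lookup stream.
use std::collections::VecDeque;

/// Minimal LRU cache over u64 keys (front = least recently used).
struct LruCache {
    cap: usize,
    keys: VecDeque<u64>,
}

impl LruCache {
    fn new(cap: usize) -> Self {
        Self { cap, keys: VecDeque::new() }
    }

    /// Returns true on a hit; on a miss, optionally inserts (refills) the key.
    fn access(&mut self, key: u64, refill_on_miss: bool) -> bool {
        if let Some(pos) = self.keys.iter().position(|&k| k == key) {
            let k = self.keys.remove(pos).unwrap();
            self.keys.push_back(k); // move to most-recently-used position
            return true;
        }
        if refill_on_miss {
            if self.keys.len() == self.cap {
                self.keys.pop_front(); // evict the least recently used entry
            }
            self.keys.push_back(key);
        }
        false
    }
}

fn main() {
    for refill_on_write in [true, false] {
        let mut cache = LruCache::new(1_000);
        let (mut lookups, mut misses) = (0u64, 0u64);

        for i in 0..100_000u64 {
            // Dense side: every event is a write. With refill-on-write enabled,
            // the write path also populates the cache with its (mostly fresh) key.
            cache.access(i % 10_000, refill_on_write);

            // Sparse side: one lookup per 100 writes, cycling over 50 hot keys
            // (offset so they never collide with the dense key space).
            if i % 100 == 0 {
                lookups += 1;
                let hot_key = 1_000_000 + (i / 100) % 50;
                if !cache.access(hot_key, true) {
                    misses += 1;
                }
            }
        }

        println!(
            "refill_on_write = {refill_on_write}: sparse-side miss rate {:.1}%",
            100.0 * misses as f64 / lookups as f64
        );
    }
}
```

In this toy setup the sparse side misses almost every lookup with refill-on-write enabled and only the first touch of each hot key without it, which is the direction findings 3 and 4 point in.
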

KeXiangWang (Contributor) commented:

I agree with Patrick. Nexmark Q7 joins the price of bids with the max price of bids within an interval. With refilling, every bid will be cached in the join cache. But as you can imagine, the maxprice stream's events are sparse and only read a few price keys. So I believe the throughput decreases for two reasons:

  1. Unnecessary calls to may_exist when inserting a price event introduce overhead (a toy sketch of this is below).
  2. Price events occupy too much cache, causing cache misses and thus higher latency.

Like @lmatz mentioned, we probably have to make this decision depending on the task and data patterns 😕.
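
To illustrate reason 1, here is a toy model (not the real Hummock API) of why a `may_exist`-style pre-check before every insert is mostly overhead once nearly all inserted keys already exist: the filter almost never answers "no", so each write still pays the point read on top of the extra check.

```rust
// Toy model, not the real Hummock API: a `may_exist`-style pre-check saves a
// storage read only when it answers "no". In a workload where nearly every
// inserted key already exists, it filters out almost nothing.
use std::collections::HashSet;

struct ToyStorage {
    keys: HashSet<u64>,
}

impl ToyStorage {
    /// Cheap, approximate membership check (stand-in for a bloom filter).
    fn may_exist(&self, key: u64) -> bool {
        self.keys.contains(&key) // no false positives in this toy model
    }

    /// "Expensive" point lookup that the join cache refill would issue.
    fn get(&self, key: u64) -> Option<u64> {
        self.keys.get(&key).copied()
    }
}

fn main() {
    // Price-like workload: the inserted keys overwhelmingly exist already.
    let storage = ToyStorage { keys: (0..100_000).collect() };
    let (mut checks, mut reads_saved, mut reads_done) = (0u64, 0u64, 0u64);

    for key in 0..100_000u64 {
        checks += 1;
        if storage.may_exist(key) {
            // Filter said "maybe": the read before refilling still happens.
            reads_done += 1;
            let _ = storage.get(key);
        } else {
            // Only this branch saves a read; it almost never fires here,
            // matching the high may_exist-true rate seen on 02/24.
            reads_saved += 1;
        }
    }

    println!("may_exist checks: {checks}, reads saved: {reads_saved}, reads done: {reads_done}");
}
```
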

@lmatz lmatz modified the milestones: release-0.18, release-0.19 Mar 22, 2023
@lmatz lmatz removed this from the release-0.19 milestone Apr 4, 2023
github-actions bot commented Jun 4, 2023

This issue has been open for 60 days with no activity. Could you please update the status? Feel free to continue discussion or close as not planned.

lmatz (Contributor, Author) commented Jun 5, 2023

Later, we found that the 8-kafka-partition nexmark data is generated by 16 threads, which can mess up the locality.
Now I suppose this factor may affect the effectiveness of refilling (a toy illustration is sketched below).

I will try to re-enable it later and make it a configurable choice.
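
A toy illustration of the locality point above (the interleaving scheme and numbers are made up; this is not the nexmark generator): interleaving one ordered event stream across 16 producer threads makes consecutive arrivals almost never fall in the same time window, so entries refilled on write are rarely reused before being evicted.

```rust
// Toy illustration, not the nexmark generator: interleave one ordered event
// stream across 16 "producer threads" and count how often two consecutive
// arrivals fall into different time windows.

/// Number of adjacent arrival pairs whose timestamps land in different windows.
fn window_switches(arrivals: &[u64], window: u64) -> usize {
    arrivals
        .windows(2)
        .filter(|pair| pair[0] / window != pair[1] / window)
        .count()
}

fn main() {
    let events: Vec<u64> = (0..16_000).collect(); // event timestamps, in order
    let threads = 16;
    let chunk = events.len() / threads;

    // Single producer: arrival order equals timestamp order.
    let in_order = events.clone();

    // 16 producers, each owning a contiguous chunk, interleaved round-robin
    // on arrival (a crude stand-in for concurrent writes to the same topic).
    let mut interleaved = Vec::with_capacity(events.len());
    for i in 0..chunk {
        for t in 0..threads {
            interleaved.push(events[t * chunk + i]);
        }
    }

    let w = 100; // window size in timestamp units
    println!("window switches, 1 producer:   {}", window_switches(&in_order, w));
    println!("window switches, 16 producers: {}", window_switches(&interleaved, w));
    // The interleaved order switches windows on almost every step, so a cache
    // entry refilled for one window is rarely touched again before eviction.
}
```
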

@lmatz lmatz self-assigned this Jun 5, 2023
github-actions bot commented Jul 3, 2024

This issue has been open for 60 days with no activity.

If you think it is still relevant today, and needs to be done in the near future, you can comment to update the status, or just manually remove the no-issue-activity label.

You can also confidently close this issue as not planned to keep our backlog clean.
Don't worry if you think the issue is still valuable to continue in the future.
It's searchable and can be reopened when it's time. 😄
