privilege, domain: reduce the memory jitter of privilege reload activity for 2M users (#59487) #59812

ti-chi-bot
Member

This is an automated cherry-pick of #59487

What problem does this PR solve?

Issue Number: close #59403, ref #55563

Problem Summary:

I created 2M users and made a fraction of them, for example 10% or 50%, active (loaded in memory).

I then observed that even after the workload is gone, the tidb-server memory usage jitters periodically.
For example:

image

What changed and how does it work?

There are several changes.

  1. Before this PR, loadAll() is used when the active user count > 1024. That is the direct root cause of the jitter.

That's because loadSomeUsers() does not support too many filter conditions.
The SQL "select * from user where user = 'a' or user = 'b' or user = 'c' or ..." works poorly when there are too many OR conditions. This is a known issue, #43885: the code handles the conditions recursively, which causes a stack overflow.

So the first change is to enhance loadSomeUsers() to support an unlimited user count.

It works like this:

  • if the user count is at most 1024, use the 'or user = xx' filter conditions to construct the SQL
  • otherwise, use the load-all SQL but apply the 'user = xx' filter in user space.
  2. Use the streaming API SQLExecutor.ExecuteInternal() to replace RestrictedSQLExec.ExecRestrictedSQL()

The problem with ExecRestrictedSQL is that its API design does not fit here.
Its drainRecordSet returns []chunk.Row as the result, which here can be a huge array of 2M rows.
What we need is a streaming API that applies the filter while reading rows, rather than materializing the whole data set and filtering afterwards.
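
To illustrate the difference, here is a small sketch of filtering while streaming. It does not use the TiDB APIs named above; `row`, `newSource`, and `streamFiltered` are hypothetical stand-ins, with `newSource` playing the role of a record set that produces rows on demand.

```go
package main

import "fmt"

// row is a hypothetical stand-in for one decoded user record.
type row struct{ user, host string }

// newSource returns an iterator that generates n rows lazily, so no slice
// holding all n rows exists unless the caller builds one.
func newSource(n int) func() (row, bool) {
	i := 0
	return func() (row, bool) {
		if i >= n {
			return row{}, false
		}
		r := row{user: fmt.Sprintf("u%d", i), host: "%"}
		i++
		return r, true
	}
}

// streamFiltered reads rows one at a time and keeps only the wanted users,
// so the retained memory grows with the number of matches, not with n.
func streamFiltered(next func() (row, bool), keep map[string]struct{}) []row {
	var out []row
	for {
		r, ok := next()
		if !ok {
			return out
		}
		if _, wanted := keep[r.user]; wanted {
			out = append(out, r)
		}
	}
}

func main() {
	keep := map[string]struct{}{"u3": {}, "u7": {}}
	got := streamFiltered(newSource(2_000_000), keep)
	fmt.Println(len(got), got)
}
```

A drain-style helper would instead append every row to a slice before any filtering, which is the 2M-element array the description wants to avoid.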

  3. Use deep copy in the loadTable decode function.

I suspect there is a leak like #59403: the decode function may be using a shallow copy that references the chunk data, so the whole chunk cannot be freed.
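
The suspected mechanism can be shown with plain Go slices. This is only an illustration under assumptions: `chunk`, `shallowField`, and `deepField` are hypothetical names, not the TiDB chunk types or the actual decode function.

```go
package main

import "fmt"

// chunk is a hypothetical stand-in for a batch of rows stored in one buffer.
type chunk struct{ buf []byte }

// shallowField returns a sub-slice that aliases the chunk buffer. While the
// result is reachable, the garbage collector must keep the whole buffer alive.
func shallowField(c *chunk, start, end int) []byte {
	return c.buf[start:end]
}

// deepField copies the bytes into a fresh allocation, so the result holds no
// reference to the chunk buffer.
func deepField(c *chunk, start, end int) []byte {
	return append([]byte(nil), c.buf[start:end]...)
}

func main() {
	c := &chunk{buf: []byte("root%localhost")}
	shallow := shallowField(c, 0, 4)
	deep := deepField(c, 0, 4)

	// Overwriting the buffer shows which value still points into the chunk.
	c.buf[0] = 'X'
	fmt.Println(string(shallow), string(deep))
}
```

The same aliasing that makes the shallow value change here is what would keep a large chunk reachable from a small cached field.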

Check List

Tests

  • Unit test
  • Integration test
  • Manual test (add detailed scripts or steps below)

The memory usage now:

image

The privilege reload runs every 10 minutes, and you can see that the peak memory usage is much lower than before:

image

  • No need to test
    • I checked and no code files have been changed.

Side effects

  • Performance regression: Consumes more CPU
  • Performance regression: Consumes more Memory
  • Breaking backward compatibility

Documentation

  • Affects user behaviors
  • Contains syntax changes
  • Contains variable changes
  • Contains experimental features
  • Changes MySQL compatibility

Release note

Please refer to Release Notes Language Style Guide to write a quality release note.

None

@ti-chi-bot added the do-not-merge/hold, release-note-none, size/XXL, and type/cherry-pick-for-release-8.5 labels on Feb 27, 2025
@ti-chi-bot
Member Author

@tiancaiamao This PR has conflicts, so I have put it on hold.
Please resolve them or ask others to resolve them, then comment /unhold to remove the hold label.

ti-chi-bot bot commented Feb 27, 2025

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign wjhuang2016 for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

ti-chi-bot bot commented Feb 27, 2025

@ti-chi-bot: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

| Test name | Commit | Details | Required | Rerun command |
|---|---|---|---|---|
| idc-jenkins-ci-tidb/unit-test | ad6455a | link | true | /test unit-test |
| idc-jenkins-ci-tidb/check_dev_2 | ad6455a | link | true | /test check-dev2 |
| idc-jenkins-ci-tidb/check_dev | ad6455a | link | true | /test check-dev |
| idc-jenkins-ci-tidb/mysql-test | ad6455a | link | true | /test mysql-test |
| idc-jenkins-ci-tidb/build | ad6455a | link | true | /test build |
| pull-br-integration-test | ad6455a | link | true | /test pull-br-integration-test |

@tiancaiamao
Contributor

This should not be cherry-picked to 8.5.

@ti-chi-bot bot added the do-not-merge/cherry-pick-not-approved and cherry-pick-approved labels and removed the cherry-pick-approved and do-not-merge/cherry-pick-not-approved labels on Mar 4, 2025
Labels

  • do-not-merge/cherry-pick-not-approved
  • do-not-merge/hold: Indicates that a PR should not merge because someone has issued a /hold command.
  • release-note-none: Denotes a PR that doesn't merit a release note.
  • size/XXL: Denotes a PR that changes 1000+ lines, ignoring generated files.
  • type/cherry-pick-for-release-8.5: This PR is cherry-picked to release-8.5 from a source PR.