
TiDB Server crashing w/ OOM despite the tidb_mem_quota_query limit #32287

Closed
coderplay opened this issue Feb 11, 2022 · 7 comments
Labels
type/bug The issue is confirmed as a bug.

Comments

@coderplay
Contributor

coderplay commented Feb 11, 2022

Bug Report

1. Minimal reproduce step (Required)

huge hash join

2. What did you expect to see? (Required)

We are replaying our production workloads on a dev TiDB cluster. One of the biggest queries is a huge hash join, which can consume ~50 GB of memory. We reduced tidb_mem_quota_query to 1 GB to avoid the system's OOM killer, but the setting didn't work.

After diving deeper into the hash join executor implementation, we figured out that this is because the following memory usages are not counted in the memory tracker:

  1. The hash table used for joining: https://github.com/pingcap/tidb/blob/master/executor/hash_table.go#L187
  2. The entryStore for the above hash table: https://github.com/pingcap/tidb/blob/master/executor/hash_table.go#L324
  3. The newly created chunks: https://github.com/pingcap/tidb/blob/master/executor/join.go#L281

Besides the above, func (worker *copIteratorWorker) handleCopResponse can take another 2~4 GB in our case.

I am attaching a heap profile to this ticket; please check it out. We took the profile while the TiDB server was still alive, using around 15.8 GB of heap memory, but eventually it can consume ~50 GB.
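For illustration, here is a minimal sketch of the estimate-and-Consume pattern this report is asking for: every executor-side allocation is followed by a Consume call on a per-query tracker, and exceeding the quota triggers an action such as cancelling the query. This is not TiDB's actual tracker; the Tracker type, the chunk size, and the limit are assumptions chosen to mirror a 1 GB tidb_mem_quota_query.

```go
// Toy sketch of per-query memory accounting; not TiDB's implementation.
package main

import (
	"errors"
	"fmt"
)

var errMemQuotaExceeded = errors.New("memory quota exceeded")

// Tracker is a simplified stand-in for a per-query memory tracker.
type Tracker struct {
	consumed int64
	limit    int64 // e.g. the value of tidb_mem_quota_query, in bytes
}

// Consume records delta bytes and reports whether the quota was exceeded.
func (t *Tracker) Consume(delta int64) error {
	t.consumed += delta
	if t.limit > 0 && t.consumed > t.limit {
		return errMemQuotaExceeded
	}
	return nil
}

func main() {
	tracker := &Tracker{limit: 1 << 30} // 1 GB quota

	const chunkSize = 4 << 20 // assume each build-side chunk is ~4 MB

	// If the hash join reported every chunk / hash table / entryStore
	// allocation like this, the quota would fire long before the OS
	// OOM killer does.
	for i := 0; ; i++ {
		if err := tracker.Consume(chunkSize); err != nil {
			fmt.Printf("cancel query after %d chunks (%d bytes tracked): %v\n",
				i, tracker.consumed, err)
			return
		}
	}
}
```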

3. What did you see instead (Required)

The TiDB server gets killed by the system's OOM killer.

4. What is your TiDB version? (Required)

5.2.2 on Kubernetes

@coderplay added the type/bug label Feb 11, 2022
@XuHuaiyu
Contributor

Hi, @coderplay
The largest part of the HashJoin memory usage is the underlying data of the hashtable, which has been counted in the memory tracker, and can be spilled to disk. But as you mentioned in the description, the entryStore for the hash table, which is actually a pointer to the underlying data, has not been tracked.
We are still working on improving the coverage of the memory tracker.

@XuHuaiyu
Contributor

We can try setting tidb_enable_rate_limit_action to false to see whether the SQL can be cancelled successfully.
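For anyone who wants to try this suggestion, here is a hedged example of applying the settings from a Go client with database/sql and go-sql-driver/mysql. The DSN is a placeholder, and the variables are set on a pinned connection because session variables only affect the connection they are set on.

```go
// Example client applying the suggested session settings; DSN is a placeholder.
package main

import (
	"context"
	"database/sql"
	"log"

	_ "github.com/go-sql-driver/mysql"
)

func main() {
	db, err := sql.Open("mysql", "root:@tcp(127.0.0.1:4000)/test")
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	ctx := context.Background()

	// Pin one connection from the pool so the SET statements and the big
	// query run in the same session.
	conn, err := db.Conn(ctx)
	if err != nil {
		log.Fatal(err)
	}
	defer conn.Close()

	// Disable the rate-limit action so the memory-quota action can take effect.
	if _, err := conn.ExecContext(ctx, "SET tidb_enable_rate_limit_action = 0"); err != nil {
		log.Fatal(err)
	}
	// Keep the per-query memory quota at 1 GB (value is in bytes).
	if _, err := conn.ExecContext(ctx, "SET tidb_mem_quota_query = 1073741824"); err != nil {
		log.Fatal(err)
	}
	// ... run the big hash join on the same connection ...
}
```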

@coderplay
Contributor Author

@XuHuaiyu

The largest part of the HashJoin memory usage is the underlying data of the hashtable, which has been counted in the memory tracker,

IIUC, the hashtable isn't accounted; see https://github.com/pingcap/tidb/blob/master/executor/hash_table.go#L187. Could you please point me to the lines of code that do the accounting?

Please be aware that there are 4 big memory consumers in this issue.

@XuHuaiyu
Contributor

  1. The underlying data of the hash table is accounted in

     err := c.rowContainer.Add(chk)

     The hashtable itself stores the hash key and RowPtr; its memory is not tracked:

     // hashTable stores the map of hashKey and RowPtr
     hashTable baseHashTable

  2. The newly created chunk is not accounted:

     chk := chunk.NewChunkWithCapacity(e.buildSideExec.base().retFieldTypes, e.ctx.GetSessionVars().MaxChunkSize)

  3. The copResponse is accounted in

     worker.memTracker.Consume(consumed)
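To make the untracked hashTable part concrete, here is a hedged sketch of how a map from hash key to row pointers could report an estimated per-entry cost to the tracker. The rowPtr and trackedHashTable types and the per-entry byte estimates are illustrative assumptions, not TiDB's actual structures.

```go
// Sketch of estimating the bytes added by each hash table insertion so the
// caller can feed them into a per-query memory tracker. Types and constants
// here are assumptions for illustration.
package main

import (
	"fmt"
	"unsafe"
)

// rowPtr mirrors the idea of a RowPtr: it points at a row in a chunk, so the
// map is small relative to the row data, but with hundreds of millions of
// build-side rows it still adds up.
type rowPtr struct {
	chkIdx uint32
	rowIdx uint32
}

type trackedHashTable struct {
	entries map[uint64][]rowPtr
}

// put inserts a pointer and returns the estimated number of bytes the
// insertion added, so the caller can pass it to Tracker.Consume.
func (h *trackedHashTable) put(key uint64, ptr rowPtr) int64 {
	delta := int64(unsafe.Sizeof(key) + unsafe.Sizeof(ptr)) // rough per-entry cost
	if _, ok := h.entries[key]; !ok {
		delta += 48 // rough map-bucket overhead per new key (assumption)
	}
	h.entries[key] = append(h.entries[key], ptr)
	return delta
}

func main() {
	ht := &trackedHashTable{entries: make(map[uint64][]rowPtr)}
	var tracked int64
	for i := uint32(0); i < 1000; i++ {
		tracked += ht.put(uint64(i%128), rowPtr{chkIdx: i / 1024, rowIdx: i % 1024})
	}
	fmt.Printf("estimated hash table bytes to report to the tracker: %d\n", tracked)
}
```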

@coderplay
Contributor Author

@XuHuaiyu yep, we are on the same page now. Looking forward to the fix.

@wshwsh12
Contributor

wshwsh12 commented May 6, 2022

There are three untracked-memory issues mentioned in this issue:

  1. The memory usage of the hash table and the entryStore is now tracked in PR #33918 (executor: add some memory tracker in HashJoin).
  2. The memory usage of NewChunkWithCapacity in func buildHashTableForList is tracked when the chunk is added to the rowContainer in HashJoinExec.buildHashTable (https://github.com/pingcap/tidb/blob/master/executor/join.go#L783).
  3. The memory usage of handleCopResponse is currently controlled through tidb_enable_rate_limit_action. But it does not seem to work well at the moment; we will refactor the logic in the near future.

@XuHuaiyu
Contributor

XuHuaiyu commented Jul 7, 2022

Closing this issue, because we fixed this problem in the duplicate issue #35627.

@XuHuaiyu XuHuaiyu closed this as completed Jul 7, 2022