
TiDB Server crashing w/ OOM despite the tidb_mem_quota_query limit #32287

Closed
coderplay opened this issue Feb 11, 2022 · 7 comments
Labels
type/bug The issue is confirmed as a bug.

Comments

@coderplay
Contributor

coderplay commented Feb 11, 2022

Bug Report

1. Minimal reproduce step (Required)

huge hash join

2. What did you expect to see? (Required)

We are replaying our production workloads on a dev TiDB cluster. One of the biggest queries is a huge hash join, which can consume ~50 GB of memory. We reduced tidb_mem_quota_query to 1 GB to avoid the system's OOM killer, but the setting didn't work.

After diving deeper into the hash join executor implementation, we figured out that this is because the following memory usages are not counted in the memory tracker:

  1. The hash table used for joining: https://github.com/pingcap/tidb/blob/master/executor/hash_table.go#L187
  2. The entryStore for the above hash table: https://github.com/pingcap/tidb/blob/master/executor/hash_table.go#L324
  3. The newly created chunks: https://github.com/pingcap/tidb/blob/master/executor/join.go#L281

Besides the above, func (worker *copIteratorWorker) handleCopResponse can take another 2~4 GB in our case.

I am attaching a heap profile to this ticket; please check it out. We took the profile while the TiDB server was still alive, using around 15.8 GB of heap memory, but eventually it can consume ~50 GB.
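For illustration, here is a minimal sketch of the estimate-and-Consume pattern this report is asking for: every executor-side allocation is followed by a Consume call on a per-query tracker, and exceeding the quota triggers an action such as cancelling the query. This is not TiDB's actual tracker; the Tracker type, the chunk size, and the limit are assumptions chosen to mirror a 1 GB tidb_mem_quota_query.

```go
// Toy sketch of per-query memory accounting; not TiDB's implementation.
package main

import (
	"errors"
	"fmt"
)

var errMemQuotaExceeded = errors.New("memory quota exceeded")

// Tracker is a simplified stand-in for a per-query memory tracker.
type Tracker struct {
	consumed int64
	limit    int64 // e.g. the value of tidb_mem_quota_query, in bytes
}

// Consume records delta bytes and reports whether the quota was exceeded.
func (t *Tracker) Consume(delta int64) error {
	t.consumed += delta
	if t.limit > 0 && t.consumed > t.limit {
		return errMemQuotaExceeded
	}
	return nil
}

func main() {
	tracker := &Tracker{limit: 1 << 30} // 1 GB quota

	const chunkSize = 4 << 20 // assume each build-side chunk is ~4 MB

	// If the hash join reported every chunk / hash table / entryStore
	// allocation like this, the quota would fire long before the OS
	// OOM killer does.
	for i := 0; ; i++ {
		if err := tracker.Consume(chunkSize); err != nil {
			fmt.Printf("cancel query after %d chunks (%d bytes tracked): %v\n",
				i, tracker.consumed, err)
			return
		}
	}
}
```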

3. What did you see instead (Required)

The TiDB server gets killed by the system's OOM killer.

4. What is your TiDB version? (Required)

5.2.2 on Kubernetes

@coderplay added the type/bug label Feb 11, 2022
@XuHuaiyu
Contributor

Hi, @coderplay
The largest part of the HashJoin memory usage is the underlying data of the hashtable, which has been counted in the memory tracker, and can be spilled to disk. But as you mentioned in the description, the entryStore for the hash table, which is actually a pointer to the underlying data, has not been tracked.
We are still working on improving the coverage of the memory tracker.

@XuHuaiyu
Contributor

We can try setting tidb_enable_rate_limit_action to false to see whether the SQL can be cancelled successfully.
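For anyone who wants to try this suggestion, here is a hedged example of applying the settings from a Go client with database/sql and go-sql-driver/mysql. The DSN is a placeholder, and the variables are set on a pinned connection because session variables only affect the connection they are set on.

```go
// Example client applying the suggested session settings; DSN is a placeholder.
package main

import (
	"context"
	"database/sql"
	"log"

	_ "github.com/go-sql-driver/mysql"
)

func main() {
	db, err := sql.Open("mysql", "root:@tcp(127.0.0.1:4000)/test")
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	ctx := context.Background()

	// Pin one connection from the pool so the SET statements and the big
	// query run in the same session.
	conn, err := db.Conn(ctx)
	if err != nil {
		log.Fatal(err)
	}
	defer conn.Close()

	// Disable the rate-limit action so the memory-quota action can take effect.
	if _, err := conn.ExecContext(ctx, "SET tidb_enable_rate_limit_action = 0"); err != nil {
		log.Fatal(err)
	}
	// Keep the per-query memory quota at 1 GB (value is in bytes).
	if _, err := conn.ExecContext(ctx, "SET tidb_mem_quota_query = 1073741824"); err != nil {
		log.Fatal(err)
	}
	// ... run the big hash join on the same connection ...
}
```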

@coderplay
Contributor Author

@XuHuaiyu

The largest part of the HashJoin memory usage is the underlying data of the hashtable, which has been counted in the memory tracker,

IIUC, the hashtable isn't accounted; see https://github.com/pingcap/tidb/blob/master/executor/hash_table.go#L187. Could you please point me to the lines of code that do the accounting?

Please be aware that there are 4 big memory consumers in this issue.

@XuHuaiyu
Contributor

  1. The underlying data of the hash table is accounted in

     err := c.rowContainer.Add(chk)

     The hashtable itself stores the hash key and RowPtr; its memory is not tracked:

     // hashTable stores the map of hashKey and RowPtr
     hashTable baseHashTable

  2. The newly created chunk is not accounted:

     chk := chunk.NewChunkWithCapacity(e.buildSideExec.base().retFieldTypes, e.ctx.GetSessionVars().MaxChunkSize)

  3. The copResponse is accounted in

     worker.memTracker.Consume(consumed)
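To make the untracked hashTable part concrete, here is a hedged sketch of how a map from hash key to row pointers could report an estimated per-entry cost to the tracker. The rowPtr and trackedHashTable types and the per-entry byte estimates are illustrative assumptions, not TiDB's actual structures.

```go
// Sketch of estimating the bytes added by each hash table insertion so the
// caller can feed them into a per-query memory tracker. Types and constants
// here are assumptions for illustration.
package main

import (
	"fmt"
	"unsafe"
)

// rowPtr mirrors the idea of a RowPtr: it points at a row in a chunk, so the
// map is small relative to the row data, but with hundreds of millions of
// build-side rows it still adds up.
type rowPtr struct {
	chkIdx uint32
	rowIdx uint32
}

type trackedHashTable struct {
	entries map[uint64][]rowPtr
}

// put inserts a pointer and returns the estimated number of bytes the
// insertion added, so the caller can pass it to Tracker.Consume.
func (h *trackedHashTable) put(key uint64, ptr rowPtr) int64 {
	delta := int64(unsafe.Sizeof(key) + unsafe.Sizeof(ptr)) // rough per-entry cost
	if _, ok := h.entries[key]; !ok {
		delta += 48 // rough map-bucket overhead per new key (assumption)
	}
	h.entries[key] = append(h.entries[key], ptr)
	return delta
}

func main() {
	ht := &trackedHashTable{entries: make(map[uint64][]rowPtr)}
	var tracked int64
	for i := uint32(0); i < 1000; i++ {
		tracked += ht.put(uint64(i%128), rowPtr{chkIdx: i / 1024, rowIdx: i % 1024})
	}
	fmt.Printf("estimated hash table bytes to report to the tracker: %d\n", tracked)
}
```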

@coderplay
Contributor Author

@XuHuaiyu yep, we are on the same page now. Looking forward to the fix.

@wshwsh12
Contributor

wshwsh12 commented May 6, 2022

There are three untracked-memory issues mentioned in this issue:

  1. The memory usage of the hash table and the entryStore is now tracked in PR #33918 (executor: add some memory tracker in HashJoin).
  2. The memory usage of NewChunkWithCapacity in func buildHashTableForList is tracked when the chunk is added to the rowContainer in HashJoinExec.buildHashTable (https://github.com/pingcap/tidb/blob/master/executor/join.go#L783).
  3. The memory usage of handleCopResponse is currently controlled through tidb_enable_rate_limit_action. But it does not seem to work well at the moment; we will refactor the logic in the near future.

@XuHuaiyu
Contributor

XuHuaiyu commented Jul 7, 2022

Closing this issue, because we fixed this problem in the duplicate issue #35627.

@XuHuaiyu XuHuaiyu closed this as completed Jul 7, 2022