store/tikv: fix a concurrency bug that may cause the batchClient timeout #22239

tiancaiamao · 2021-01-06T12:59:13Z

What problem does this PR solve?

Problem Summary:

The recycleIdleConnArray() logic has a bug: when one goroutine calls getConnArray() while another goroutine recycles the idle connection, the first goroutine may get a stale batchConn that has already been closed.

A sendBatchRequest() call that uses that stale batchConn blocks until it times out.

g1: connArray := getConnArray(addr, enableBatch)
                                        g2: c.Lock()
                                        g2: conn := c.conns[addr]
                                        g2: c.Unlock()
                                        g2: conn.Close()
g1: sendBatchRequest(connArray)

What is changed and how it works?

What's Changed:

Previously, the read lock covered only the lookup:

RLock()
conn := getConnArray()
RUnlock()
sendBatchRequest(conn)

This is not enough to protect the conn from being recycled and closed before it is used.
Now the whole sending process is protected by the read lock, and modifying the conn map must obtain the write lock.

How it Works:

As long as a sending operation holds the read lock, the connection recycling operation must wait to obtain the write lock.

Related changes

Need to cherry-pick to the release branch

Maybe we can cherry-pick it to 5.0; this bug is rarely seen in production environments.

Check List

Tests

Unit test
Integration test
Manual test (add detailed scripts or steps below)
No code

Release note

Fix a concurrency bug that may cause the batch client to time out

sre-bot · 2021-01-06T12:59:27Z

No release note, Please follow https://github.com/pingcap/community/blob/master/contributors/release-note-checker.md

lysu · 2021-01-07T06:13:30Z

/bench

lysu · 2021-01-07T07:55:06Z

LGTM

hicqu · 2021-01-11T05:31:49Z

LGTM

tiancaiamao · 2021-01-11T05:33:10Z

/merge

ti-srebot · 2021-01-11T05:35:27Z

/run-all-tests

Signed-off-by: ti-srebot <[email protected]>

ti-srebot · 2021-01-11T05:48:45Z

cherry pick to release-3.0 in PR #22335

Signed-off-by: ti-srebot <[email protected]>

ti-srebot · 2021-01-11T05:49:37Z

cherry pick to release-4.0 in PR #22336

Signed-off-by: ti-srebot <[email protected]>

ti-srebot · 2021-01-11T05:50:49Z

cherry pick to release-5.0-rc in PR #22337

zhangjinpeng87 · 2021-01-11T06:43:11Z

store/tikv/client.go

 	}

 	// TiDB will not send batch commands to TiFlash, to resolve the conflict with Batch Cop Request.
 	enableBatch := req.StoreTp != kv.TiDB && req.StoreTp != kv.TiFlash
+	c.recycleMu.RLock()


Will this impact the send speed?

store/tikv: fix a concurrency bug that may cause the batchClient timeout

937733e

tiancaiamao added type/bugfix This PR fixes a bug. component/tikv labels Jan 6, 2021

fix CI

d89984d

tiancaiamao marked this pull request as ready for review January 6, 2021 14:25

tiancaiamao requested a review from lysu January 6, 2021 14:25

ti-srebot added the status/LGT1 Indicates that a PR has LGTM 1. label Jan 7, 2021

lysu requested a review from hicqu January 11, 2021 03:56

lysu added needs-cherry-pick-3.0 labels Jan 11, 2021

lysu added this to the v5.0.0-rc milestone Jan 11, 2021

ti-srebot removed the status/LGT1 Indicates that a PR has LGTM 1. label Jan 11, 2021

ti-srebot approved these changes Jan 11, 2021

ti-srebot added the status/LGT2 Indicates that a PR has LGTM 2. label Jan 11, 2021

ti-srebot added the status/can-merge Indicates a PR has been approved by a committer. label Jan 11, 2021

Merge branch 'master' into recycle-bug

8eeeea2

ti-srebot merged commit ae7e432 into pingcap:master Jan 11, 2021

ti-srebot pushed a commit to ti-srebot/tidb that referenced this pull request Jan 11, 2021

cherry pick pingcap#22239 to release-3.0

b1922df

Signed-off-by: ti-srebot <[email protected]>

ti-srebot mentioned this pull request Jan 11, 2021

store/tikv: fix a concurrency bug that may cause the batchClient timeout (#22239) #22335

Closed

ti-srebot pushed a commit to ti-srebot/tidb that referenced this pull request Jan 11, 2021

cherry pick pingcap#22239 to release-4.0

9534cca

Signed-off-by: ti-srebot <[email protected]>

ti-srebot mentioned this pull request Jan 11, 2021

store/tikv: fix a concurrency bug that may cause the batchClient timeout (#22239) #22336

Merged

ti-srebot pushed a commit to ti-srebot/tidb that referenced this pull request Jan 11, 2021

cherry pick pingcap#22239 to release-5.0-rc

5ed771c

Signed-off-by: ti-srebot <[email protected]>

ti-srebot mentioned this pull request Jan 11, 2021

store/tikv: fix a concurrency bug that may cause the batchClient timeout (#22239) #22337

Merged

tiancaiamao deleted the recycle-bug branch January 11, 2021 06:10

zhangjinpeng87 reviewed Jan 11, 2021

ti-srebot added a commit that referenced this pull request Jan 11, 2021

store/tikv: fix a concurrency bug that may cause the batchClient time…

fd4437d

…out (#22239) (#22337) Signed-off-by: ti-srebot <[email protected]>

ti-srebot added a commit that referenced this pull request Jan 11, 2021

store/tikv: fix a concurrency bug that may cause the batchClient time…

1edefab

…out (#22239) (#22336) Signed-off-by: ti-srebot <[email protected]>
