store/tikv: fix a concurrency bug that may cause the batchClient timeout #22239

Merged
merged 3 commits into from
Jan 11, 2021

Conversation

tiancaiamao
Contributor

@tiancaiamao tiancaiamao commented Jan 6, 2021

What problem does this PR solve?

closes #22334

Problem Summary:

The recycleIdleConnArray() logic has a race: while one goroutine calls getConnArray(), another goroutine may recycle the idle connection, so the first goroutine can end up holding a stale batchConn that has already been closed.

Calling sendBatchRequest() with that stale batchConn then blocks until it times out.

++++++++++++++++++++++++++++++++++++++++++++++++++++++
g1: connArray := getConnArray(addr, enableBatch)
++++++++++++++++++++++++++++++++++++++++++++++++++++++
                                        g2: c.Lock()
                                        g2: conn := c.conns[addr]
                                        g2: c.Unlock()
                                        g2: conn.Close()
++++++++++++++++++++++++++++++++++++++++++++++++++++++
g1: sendBatchRequest(connArray)
++++++++++++++++++++++++++++++++++++++++++++++++++++++
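
To make the interleaving concrete, here is a minimal Go sketch that replays it sequentially. It is not the TiDB code: the `client`, `connArray`, `recycle`, `send`, and `staleSend` definitions are hypothetical stand-ins, removing the entry from the map during recycling is an assumption, and the stale send returns an error immediately instead of blocking until a timeout so that the example terminates.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

// errClosed stands in for the timeout seen in the real bug (assumption).
var errClosed = errors.New("connArray already closed")

// connArray is a hypothetical stand-in for the real connection array.
type connArray struct {
	mu     sync.Mutex
	closed bool
}

func (a *connArray) Close() {
	a.mu.Lock()
	a.closed = true
	a.mu.Unlock()
}

// send fails fast on a closed connArray; the real code would block.
func (a *connArray) send(msg string) error {
	a.mu.Lock()
	defer a.mu.Unlock()
	if a.closed {
		return errClosed
	}
	return nil
}

// client is a simplified stand-in holding the addr -> connArray map.
type client struct {
	sync.RWMutex
	conns map[string]*connArray
}

// getConnArray releases the lock before the caller uses the result.
func (c *client) getConnArray(addr string) *connArray {
	c.RLock()
	defer c.RUnlock()
	return c.conns[addr]
}

// recycle mirrors the g2 steps: look up under the lock, close outside it.
func (c *client) recycle(addr string) {
	c.Lock()
	conn := c.conns[addr]
	delete(c.conns, addr) // assumed: recycling also drops the map entry
	c.Unlock()
	if conn != nil {
		conn.Close()
	}
}

// staleSend replays the g1/g2 interleaving in a fixed order.
func staleSend() error {
	c := &client{conns: map[string]*connArray{"store1": {}}}
	conn := c.getConnArray("store1") // g1: lookup
	c.recycle("store1")              // g2: recycle and close in between
	return conn.send("req")          // g1: send on the stale connArray
}

func main() {
	fmt.Println("send on stale connArray:", staleSend())
}
```

The point of the sketch is that nothing ties the lifetime of the value returned by the lookup to the lock that protected the lookup.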

What is changed and how it works?

What's Changed:

Previously, the read lock covered only the lookup:

RLock()
conn := getConnArray()
RUnlock()
sendBatchRequest(conn)

This is not enough to protect the conn from being recycled and closed before it is used.
Now the whole sending process is protected by the read lock, and modifying the conn map must obtain the write lock.

How it Works:

As long as the sending operation holds the read lock, the recycle operation has to wait for the write lock, so it cannot close a connection that is still in use.
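
The locking scheme can be sketched roughly as follows. This is an illustration under assumptions, not the merged code: only the `recycleMu` name comes from the diff, while `client`, `connArray`, `sendRequest`, `recycle`, `protectedSend`, and the `during` hook (used to start the recycler while the read lock is held) are hypothetical.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

var (
	errClosed = errors.New("connArray already closed")
	errNoConn = errors.New("no connArray for address")
)

// connArray is a hypothetical stand-in; closed is guarded by recycleMu.
type connArray struct{ closed bool }

type client struct {
	recycleMu sync.RWMutex
	conns     map[string]*connArray
}

// sendRequest holds the read lock across both the lookup and the send.
// during is a test hook that runs while the lock is held (assumption).
func (c *client) sendRequest(addr string, during func()) error {
	c.recycleMu.RLock()
	defer c.recycleMu.RUnlock()
	conn, ok := c.conns[addr]
	if !ok {
		// The real client would create a new connection here.
		return errNoConn
	}
	if during != nil {
		during()
	}
	if conn.closed {
		return errClosed
	}
	return nil
}

// recycle needs the write lock, so it waits for in-flight senders.
func (c *client) recycle(addr string) {
	c.recycleMu.Lock()
	defer c.recycleMu.Unlock()
	if conn, ok := c.conns[addr]; ok {
		conn.closed = true
		delete(c.conns, addr)
	}
}

// protectedSend starts a recycler in the middle of a send and reports
// the result of that send and of a second send after recycling is done.
func protectedSend() (first, second error) {
	c := &client{conns: map[string]*connArray{"store1": {}}}
	done := make(chan struct{})
	first = c.sendRequest("store1", func() {
		go func() {
			c.recycle("store1") // blocks until the read lock is released
			close(done)
		}()
	})
	<-done
	second = c.sendRequest("store1", nil)
	return first, second
}

func main() {
	first, second := protectedSend()
	fmt.Println("send during recycle attempt:", first)
	fmt.Println("send after recycle:", second)
}
```

The trade-off is that recycling can now be delayed by slow senders, while concurrent senders still share the read lock and do not block each other.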

Related changes

  • Need to cherry-pick to the release branch

Maybe we can cherry-pick it to 5.0; it's rare to see this bug in production environments.

Check List

Tests

  • Unit test
  • Integration test
  • Manual test (add detailed scripts or steps below)
  • No code

Release note

  • Fix a concurrency bug that may cause the batch client to time out

@sre-bot
Contributor

sre-bot commented Jan 6, 2021

@tiancaiamao tiancaiamao added type/bugfix This PR fixes a bug. component/tikv labels Jan 6, 2021
@tiancaiamao tiancaiamao marked this pull request as ready for review January 6, 2021 14:25
@tiancaiamao tiancaiamao requested a review from lysu January 6, 2021 14:25
@lysu
Contributor

lysu commented Jan 7, 2021

/bench

@lysu
Contributor

lysu commented Jan 7, 2021

LGTM

@ti-srebot ti-srebot added the status/LGT1 Indicates that a PR has LGTM 1. label Jan 7, 2021
@lysu lysu requested a review from hicqu January 11, 2021 03:56
@lysu lysu added this to the v5.0.0-rc milestone Jan 11, 2021
@hicqu
Contributor

hicqu commented Jan 11, 2021

LGTM

@ti-srebot ti-srebot removed the status/LGT1 Indicates that a PR has LGTM 1. label Jan 11, 2021
@ti-srebot ti-srebot added the status/LGT2 Indicates that a PR has LGTM 2. label Jan 11, 2021
@tiancaiamao
Contributor Author

/merge

@ti-srebot ti-srebot added the status/can-merge Indicates a PR has been approved by a committer. label Jan 11, 2021
@ti-srebot
Contributor

/run-all-tests

@ti-srebot ti-srebot merged commit ae7e432 into pingcap:master Jan 11, 2021
ti-srebot pushed a commit to ti-srebot/tidb that referenced this pull request Jan 11, 2021
@ti-srebot
Contributor

cherry pick to release-3.0 in PR #22335

ti-srebot pushed a commit to ti-srebot/tidb that referenced this pull request Jan 11, 2021
@ti-srebot
Contributor

cherry pick to release-4.0 in PR #22336

ti-srebot pushed a commit to ti-srebot/tidb that referenced this pull request Jan 11, 2021
@ti-srebot
Contributor

cherry pick to release-5.0-rc in PR #22337

@tiancaiamao tiancaiamao deleted the recycle-bug branch January 11, 2021 06:10
}

// TiDB will not send batch commands to TiFlash, to resolve the conflict with Batch Cop Request.
enableBatch := req.StoreTp != kv.TiDB && req.StoreTp != kv.TiFlash
c.recycleMu.RLock()

Will this impact the send speed?

ti-srebot added a commit that referenced this pull request Jan 11, 2021
ti-srebot added a commit that referenced this pull request Jan 11, 2021
Development

Successfully merging this pull request may close these issues.

batchClient may timeout due to use recycled Idle connArray
6 participants