store/tikv: fix a concurrency bug that may cause the batchClient timeout (#22239) #22335

Closed

Conversation

ti-srebot
Contributor

@ti-srebot ti-srebot commented Jan 11, 2021

cherry-pick #22239 to release-3.0
You can switch your code base to this Pull Request by using git-extras:

# In tidb repo:
git pr 22335

After applying your modifications, you can push your changes to this PR via:

git push git@github.com:ti-srebot/tidb.git pr/22335:ti-srebot:release-3.0-ae7e43249a35

What problem does this PR solve?

closes #22334

Problem Summary:

The recycleIdleConnArray() logic has a race: while one goroutine calls getConnArray(), another goroutine may recycle the idle connection, so the first goroutine can end up holding a stale batchConn that has already been closed.

sendBatchRequest() on that stale batchConn then blocks until it times out.

++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
g1: connArray := getConnArray(addr, enableBatch)
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
                                        g2: c.Lock()
                                        g2: conn := c.conns[addr]
                                        g2: c.Unlock()
                                        g2: conn.Close()
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
g1: sendBatchRequest(connArray)
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
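The interleaving above can be replayed sequentially in a small Go sketch. Everything here is hypothetical and only mirrors the shape of the bug: the `pool` and `conn` types, the `get`, `recycle`, and `send` helpers, and the `"tikv-1"` address are illustrative names, not TiDB's actual code. The `delete` from the map is also an assumption, and `send` fails fast on a closed connection, whereas the PR says the real sendBatchRequest() blocks until timeout.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

// conn is a hypothetical stand-in for a batch connection.
type conn struct {
	mu     sync.Mutex
	closed bool
}

func (c *conn) Close() {
	c.mu.Lock()
	c.closed = true
	c.mu.Unlock()
}

var errClosed = errors.New("send on closed connection")

// send returns an error on a closed connection. This is a simplification:
// the real sendBatchRequest() is described as blocking until timeout.
func (c *conn) send(msg string) error {
	c.mu.Lock()
	defer c.mu.Unlock()
	if c.closed {
		return errClosed
	}
	return nil
}

// pool mimics the pre-fix locking: the lock covers only the map access.
type pool struct {
	mu    sync.RWMutex
	conns map[string]*conn
}

// get releases the read lock before the caller uses the connection.
func (p *pool) get(addr string) *conn {
	p.mu.RLock()
	defer p.mu.RUnlock()
	return p.conns[addr]
}

// recycle takes the connection out of the map under the lock and closes it
// after unlocking, as in the g2 column of the diagram.
func (p *pool) recycle(addr string) {
	p.mu.Lock()
	c := p.conns[addr]
	delete(p.conns, addr) // assumption: the recycled entry is dropped from the map
	p.mu.Unlock()
	if c != nil {
		c.Close()
	}
}

// staleSend replays the g1/g2 schedule in a single goroutine.
func staleSend() error {
	p := &pool{conns: map[string]*conn{"tikv-1": &conn{}}}
	c := p.get("tikv-1") // g1: look up the connection
	p.recycle("tikv-1")  // g2: recycle and close it
	return c.send("req") // g1: send on the stale handle
}

func main() {
	fmt.Println(staleSend())
}
```

Because the lookup and the send are separate critical sections, nothing stops the recycle step from running between them.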

What is changed and how it works?

What's Changed:

Previously, only the lookup was done under the read lock:

RLock()
conn := getConnArray()
RUnlock()
sendBatchRequest(conn)

This is not enough to protect the conn from being recycled and closed between the lookup and the send.
Now the whole sending process is protected by the read lock, and modifying the conn map must obtain the write lock.

How it Works:

As long as the sending operation holds the read lock, the recycle operation has to wait to obtain the write lock, so it cannot close a connection that is still in use.
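A minimal Go sketch of this locking scheme, under stated assumptions: the `pool` and `conn` types and the `send`, `recycle`, `add`, and `countStaleSends` helpers are hypothetical and are not taken from the TiDB source. The lookup and the send share one read-locked section, while recycling removes and closes the connection under the write lock.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

// conn is a hypothetical stand-in for a batch connection.
type conn struct{ closed int32 }

func (c *conn) Close()         { atomic.StoreInt32(&c.closed, 1) }
func (c *conn) isClosed() bool { return atomic.LoadInt32(&c.closed) == 1 }

type pool struct {
	mu    sync.RWMutex
	conns map[string]*conn
}

// add registers a connection for addr if none exists (write lock).
func (p *pool) add(addr string) {
	p.mu.Lock()
	defer p.mu.Unlock()
	if _, ok := p.conns[addr]; !ok {
		p.conns[addr] = &conn{}
	}
}

// send holds the read lock across both the lookup and the send.
// stale reports whether the connection it found was already closed.
func (p *pool) send(addr string) (sent, stale bool) {
	p.mu.RLock()
	defer p.mu.RUnlock()
	c, ok := p.conns[addr]
	if !ok {
		return false, false
	}
	if c.isClosed() {
		return false, true
	}
	return true, false
}

// recycle must wait for all in-flight sends before it can take the write lock.
func (p *pool) recycle(addr string) {
	p.mu.Lock()
	defer p.mu.Unlock()
	if c, ok := p.conns[addr]; ok {
		delete(p.conns, addr)
		c.Close()
	}
}

// countStaleSends runs senders concurrently with a recycler and returns how
// many sends observed a closed connection.
func countStaleSends(senders, rounds int) int64 {
	p := &pool{conns: map[string]*conn{}}
	var stale int64
	var wg sync.WaitGroup
	for i := 0; i < senders; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for r := 0; r < rounds; r++ {
				p.add("tikv-1")
				if _, s := p.send("tikv-1"); s {
					atomic.AddInt64(&stale, 1)
				}
			}
		}()
	}
	wg.Add(1)
	go func() {
		defer wg.Done()
		for r := 0; r < rounds; r++ {
			p.recycle("tikv-1")
		}
	}()
	wg.Wait()
	return atomic.LoadInt64(&stale)
}

func main() {
	fmt.Println("stale sends:", countStaleSends(4, 1000))
}
```

In this sketch a closed connection is never visible through the map, so the count of stale sends stays at zero. The cost is that a recycle has to wait for every in-flight send to release its read lock.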

Related changes

  • Need to cherry-pick to the release branch

Maybe we can cherry-pick it to 5.0; this bug is rarely seen in production environments.

Check List

Tests

  • Unit test
  • Integration test
  • Manual test (add detailed scripts or steps below)
  • No code

Release note

  • fix a concurrency bug that may cause the batch client timeout

@ti-srebot
Contributor Author

/run-all-tests

@ti-srebot
Contributor Author

@tiancaiamao you're already a collaborator in bot's repo.

@lysu
Contributor

lysu commented Feb 3, 2021

LGTM

@ti-srebot ti-srebot added the status/LGT1 Indicates that a PR has LGTM 1. label Feb 3, 2021
@tiancaiamao
Contributor

PTAL @hicqu

@xiongjiwei xiongjiwei closed this Aug 29, 2022
Labels
component/tikv status/LGT1 Indicates that a PR has LGTM 1. type/bugfix This PR fixes a bug. type/3.0-cherry-pick

4 participants