fix: Reconstruct slave sync thread model #2638
Merged
AlexStocks
merged 7 commits into
OpenAtomFoundation:unstable
from
cheniujh:revise_slave_worker_model
May 22, 2024
662083e
to
b9a4fa7
wangshao1
approved these changes
May 20, 2024
baixin01
approved these changes
May 20, 2024
AlexStocks
reviewed
May 20, 2024
QlQlqiqi
pushed a commit
to QlQlqiqi/pika
that referenced
this pull request
May 22, 2024
* Reconstruct the slave consuming thread model. New model:
  1. each DB has one exclusive thread to write binlog
  2. all DBs share the same thread pool to write to the DB
* 1. make write_binlog_thread_num configurable
  2. ensure the TrySync response is handled after binlog tasks
* 1. add an extra 10s sleep in the randomSpopstore test to avoid the sporadic failure of this test case
  2. revise some comments about write-binlog-worker-num in pika.conf
* 1. use a global constexpr to replace the fixed number used for max_db_num
  2. do some formatting work
---------
Co-authored-by: cjh <[email protected]>
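The thread model named in this commit message can be pictured with a small routing sketch. This is only an illustration under stated assumptions, not Pika's actual code: `SlaveThreadLayout`, `BinlogWorkerFor`, and `WriteDbWorkerFor` are hypothetical names, and routing DB writes by key hash is an assumption made here for the example (the commit message only says the write-DB pool is shared by all DBs).

```cpp
#include <cstddef>
#include <functional>
#include <string>

// Illustrative only: these names are assumptions, not Pika's real API.
struct SlaveThreadLayout {
  std::size_t write_db_pool_size;  // threads in the shared write-DB pool

  // Binlog writes: DB i always goes to binlog worker i, so the binlog of one
  // DB is appended by a single thread and keeps its order.
  std::size_t BinlogWorkerFor(std::size_t db_index) const { return db_index; }

  // DB writes: all DBs share one pool. Routing by key hash (an assumption
  // in this sketch) keeps commands on the same key on the same thread.
  std::size_t WriteDbWorkerFor(const std::string& key) const {
    return std::hash<std::string>{}(key) % write_db_pool_size;
  }
};
```

With a layout such as `SlaveThreadLayout{4}`, binlog tasks for db0 and db1 land on different dedicated workers, while write-DB tasks from either DB are spread over the same four pool threads.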
What does this PR do:
1. Refactors the Slave-side master-slave sync thread model (fix #2637).
2. Fixes a Sync Win crash that occurred when the Slave sent two consecutive TrySync requests during a master-slave timeout reconnection (fix #2655).
Direct cause: When the Slave reconnects after a timeout, it sends two identical TrySync requests within a short period (both carry the same Binlog offset). The Master handles both requests in the same way: each time it clears the WriteQueue and SyncWin, then starts sending Binlog from the offset carried in the request. As a result, some Binlogs around that point are sent twice and consumed twice by the Slave, and the Master then treats the BinlogACK returned by the Slave as invalid.
Why the Slave sends two consecutive TrySync requests: When the first TrySync is sent, the task queue of the Slave's Binlog-consuming Worker thread still holds write-Binlog tasks that piled up during the previous master-slave connection. (When the Slave times out, disconnects, and enters the TryConnect state, it discards the backlog in the WorkerThread one task at a time; the problem is that this discarding is too slow, or equivalently that the next TrySync is sent too soon.) When the Slave receives the response to the first TrySync request, it enters the Connected state and starts consuming the backlog left over from the previous connection. The SessionIDs of these Binlogs do not match the new session, which triggers the error-handling branch and moves the Slave back to the TrySync state, so it immediately sends a second TrySync request.
3. Fixes the issue where a Binlog task blocks the Slave for a long time, causing the Master to resume Binlog transmission from the wrong starting position after a timeout reconnection (fix #2659).
4. After this PR is merged, the Slave handles the TrySync response synchronously with respect to Binlog consumption, that is, after the Binlog tasks queued ahead of it. In some extreme cases (when the Slave is severely blocked), establishing the master-slave connection may therefore be delayed, and the Slave stays in the WaitReply state longer. During this time master_link_status is also down, so a separate PR, #2656, adds a finer-grained monitoring metric for operators, repl_connect_status, to help diagnose the situation further when master_link_status is down. For a detailed explanation of repl_connect_status and how to use it, see Discussion #2689.
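The double-TrySync scenario in item 2, and the ordering change in item 4, can be reproduced with a toy single-threaded model. This is a simplified sketch, not Pika's implementation: `Task`, `SlaveSim`, `CountResyncs`, and the enum values are hypothetical names, and the real Slave runs these steps on separate threads. The sketch only shows why handling the TrySync response behind the queued Binlog tasks avoids the session mismatch.

```cpp
#include <cstdint>
#include <deque>

// Simplified replication states; the names are assumptions for this sketch.
enum class ReplState { kTrySync, kWaitReply, kConnected };

struct Task {
  bool is_trysync_resp;      // true: TrySync response, false: write-Binlog task
  std::uint32_t session_id;  // session the task belongs to
};

struct SlaveSim {
  ReplState state = ReplState::kWaitReply;  // the first TrySync was already sent
  std::uint32_t session_id = 0;
  int resync_count = 0;  // times a session mismatch forced another TrySync

  void Handle(const Task& t) {
    if (t.is_trysync_resp) {
      session_id = t.session_id;
      state = ReplState::kConnected;
      return;
    }
    if (state != ReplState::kConnected) return;  // not connected: discard backlog
    if (t.session_id != session_id) {            // stale session: error branch
      state = ReplState::kTrySync;
      ++resync_count;
    }
  }
};

// Processes a backlog of stale Binlog tasks plus one TrySync response.
// resp_first = true models the response overtaking the backlog;
// resp_first = false models the response being handled after the backlog.
inline int CountResyncs(std::deque<Task> backlog, Task resp, bool resp_first) {
  SlaveSim slave;
  if (resp_first) {
    backlog.push_front(resp);
  } else {
    backlog.push_back(resp);
  }
  for (const Task& t : backlog) slave.Handle(t);
  return slave.resync_count;
}
```

In this model, a backlog of two tasks from session 1 followed by a response for session 2 causes one extra TrySync when the response overtakes the backlog, and none when the response is handled last. The cost of the ordered variant is the one item 4 describes: the Slave stays in WaitReply until the backlog has drained.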