fix: Reconstruct slave sync thread model #2638

cheniujh · 2024-05-07T15:29:53Z

What this PR does:

1. Refactoring of the Slave-side sync thread model (fix #2637):

1.1 On the Slave side, WriteBinlogWorker and WriteDBWorker instances are now stored in two separate vectors, so users can configure the number of write_binlog_worker and write_db_worker threads independently. Specifically, the configuration item "sync-thread-num" directly controls the number of write_db_worker threads used for WriteDB when the slave node consumes binlog, while "sync-binlog-thread-num" determines the number of write_binlog_worker threads.
1.2 Giving each DB its own write_binlog_worker is recommended, but the configured number of write_binlog_worker threads may be smaller than the number of DBs; in that case each DB picks its binlog worker by taking a modulus. If the configured number is greater than the number of DBs, Pika uses db_num as the final write_binlog_worker count.
1.3 All DBs share the same WriteDBWorker pool for WriteDB (the worker is still selected by hashing the key).
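The worker-selection rules in 1.1 to 1.3 can be sketched as follows. This is a minimal illustration in Python, not Pika's C++ code: the function names are hypothetical, the use of the DB index as the modulus operand is an assumption (the description only says "modulus"), and `zlib.crc32` merely stands in for whatever key hash Pika actually uses.

```python
import zlib


def effective_binlog_worker_num(configured: int, db_num: int) -> int:
    """Cap the configured write_binlog_worker count at the number of DBs (rule 1.2)."""
    return min(configured, db_num)


def binlog_worker_index(db_index: int, binlog_worker_num: int) -> int:
    """Pick the binlog worker for a DB by modulus.

    Assumption: the operand is the DB's index; the PR text does not say.
    """
    return db_index % binlog_worker_num


def db_worker_index(key: bytes, db_worker_num: int) -> int:
    """Pick a worker from the shared WriteDB pool by hashing the key (rule 1.3).

    crc32 is a placeholder hash chosen only because it is deterministic.
    """
    return zlib.crc32(key) % db_worker_num


if __name__ == "__main__":
    db_num = 3
    # More binlog workers configured than DBs: the count is capped at db_num.
    print(effective_binlog_worker_num(8, db_num))
    # Fewer binlog workers than DBs: several DBs share one binlog worker.
    print([binlog_worker_index(i, 2) for i in range(db_num)])
    # The same key always lands on the same WriteDB worker.
    print(db_worker_index(b"user:1", 6) == db_worker_index(b"user:1", 6))
```

Because a given DB always maps to the same binlog worker, binlog writes for that DB stay ordered within a single thread, while WriteDB work is spread over the shared pool per key.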

2. Fixed the Sync Win corruption caused by the Slave sending two consecutive TrySync Req messages during a master-slave timeout reconnection (fix #2655):

2.1 The final fix for Issue #2655 (Sync Win corruption that may occur when the slave tries to reconnect to the master after a timeout-caused replication disconnection) is to change the Slave-side handling of TrySync Resp from asynchronous to synchronous. The response is now processed by the BinlogWorker that belongs to the DB, which ensures that, during a timeout reconnection, all expired Binlog tasks have been discarded before the Slave processes the TrySync Resp. The Slave therefore never consumes expired Binlog tasks with mismatched SessionIDs, which is what used to trigger a second TrySync Req and corrupt the Sync Win.
2.2 Brief description of the cause of Issue #2655:

Direct cause: when the Slave times out and reconnects, it sends two identical TrySync requests within a short period (both carry the same BinlogOffset). The Master handles the two requests in the same way: each time it clears the WriteQueue and the SyncWin, then starts sending Binlog from the offset carried in the TrySync request. Binlog entries sent around that moment are therefore transmitted twice and consumed twice by the Slave, so the Master treats the BinlogACK returned by the Slave as invalid.
Why the Slave sends two consecutive TrySync requests: when the first TrySync is sent, the task queue of the Slave's Binlog-consuming Worker thread still holds write-Binlog tasks accumulated during the previous master-slave connection. After a timeout disconnection the Slave enters the TryConnect state and discards these accumulated tasks one by one, but the discarding is too slow (or, put differently, the next TrySync is sent too soon). When the Slave receives the response to the first TrySync request it enters the Connected state and starts consuming the leftover write-Binlog tasks from the previous connection. Their SessionIDs do not match the new session, which triggers the error-handling branch and moves the Slave back to the TrySync state, so it immediately sends a second TrySync request.
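The ordering argument above can be made concrete with a toy model. The sketch below is a deliberately simplified Python simulation, not Pika's implementation: the `reconnect` function, the task tuples, and the session numbers are all hypothetical. It only shows why routing the TrySync Resp through the same FIFO queue as the Binlog tasks guarantees that stale tasks are drained first.

```python
from collections import deque

OLD_SESSION = 1  # session id of the connection that timed out (hypothetical value)
NEW_SESSION = 2  # session id granted by the new TrySync Resp (hypothetical value)


def reconnect(stale_task_count: int, resp_via_worker_queue: bool) -> int:
    """Return how many TrySync requests the slave sends in one timeout reconnection."""
    # Binlog tasks left over from the old connection are still queued on the worker.
    queue = deque(("binlog", OLD_SESSION) for _ in range(stale_task_count))
    state, session = "WaitReply", None
    try_sync_sent = 1  # the first TrySync after the timeout

    if resp_via_worker_queue:
        # New behaviour: the response waits its turn behind the stale tasks.
        queue.append(("trysync_resp", NEW_SESSION))
    else:
        # Old behaviour: the response is handled immediately on another thread,
        # before the worker has drained its queue.
        state, session = "Connected", NEW_SESSION

    while queue:
        kind, sid = queue.popleft()
        if kind == "trysync_resp":
            state, session = "Connected", sid
        elif state != "Connected":
            continue  # not connected: the stale binlog task is simply discarded
        elif sid != session:
            # SessionID mismatch: error branch, fall back and send TrySync again.
            state = "WaitReply"
            try_sync_sent += 1
    return try_sync_sent


if __name__ == "__main__":
    print(reconnect(3, resp_via_worker_queue=False))  # old, asynchronous handling
    print(reconnect(3, resp_via_worker_queue=True))   # new, queued handling
```

In this model the asynchronous path sends a second TrySync as soon as one stale task is consumed, whereas the queued path sends only one, at the cost of the response waiting behind whatever is still in the worker's queue.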

3. Fixed the issue where a single Binlog task blocks the Slave for a long time and, after the resulting timeout reconnection, the Master resumes Binlog transmission from the wrong starting position (fix #2659):

3.1 Changing the Slave-side handling of TrySync Resp from asynchronous to synchronous (the fix for Issue #2655) also fixes Issue #2659 (if a binlog task blocks the slave for a long period, the master may resume incremental replication from an incorrect start position after a timeout reconnection).

4. After this PR is merged, the Slave handles TrySync Resp synchronously with respect to Binlog consumption. In some extreme cases (when the Slave is severely blocked), establishing the master-slave connection may therefore be delayed and the Slave may stay in the WaitReply state longer. During this time master_link_status is also down, so a separate PR, #2656, adds a finer-grained monitoring metric for operators, repl_connect_status, to help diagnose the situation further when master_link_status is down. For a detailed explanation of repl_connect_status and how to use it, see Discussion #2689.

src/pika_conf.cc

src/pika_repl_client.cc

* reconstruct slave comsuming thread model, new model:
  1 each db has one exclusive thread to write binlog
  2 every db share the same thread pool to write db
* 1 make write_binlog_thread_num configurable
  2 ensure TrySync resp is handled after binlog tasks
* 1 add extra 10s sleep in randomSpopstore test to avoid the sporadic failure of this test case
  2 revised some comments about write-binlog-worker-num in pika.conf
* 1 use global constexpr to replace fixed num in terms of max_db_num
  2 done some format work

---------

Co-authored-by: cjh <[email protected]>

reconstruct slave comsuming thread model, new model:

070e0b0

1 each db has one exclusive thread to write binlog 2 every db share the same thread pool to write db

github-actions bot added the ☢️ Bug Something isn't working label May 7, 2024

baerwang requested review from Mixficsol and AlexStocks May 8, 2024 12:55

cheniujh added 3.5.4 4.0.0 labels May 10, 2024

1 make write_binlog_thread_num configurable

b9a4fa7

2 ensure TrySync resp is handled after binlog tasks

cheniujh force-pushed the revise_slave_worker_model branch from 662083e to b9a4fa7 Compare May 16, 2024 07:04

Merge branch 'OpenAtomFoundation:unstable' into revise_slave_worker_m…

0f699a8

…odel

cheniujh mentioned this pull request May 17, 2024

fix: add a user-friendly repl metric "repl_connect_status" in the resp of info command #2656

Merged

AlexStocks added 3.5.5 and removed 3.5.4 labels May 17, 2024

Merge branch 'OpenAtomFoundation:unstable' into revise_slave_worker_m…

13847d8

…odel

cheniujh requested review from wangshao1, chejinge and baixin01 and removed request for Mixficsol May 20, 2024 06:29

wangshao1 approved these changes May 20, 2024

Merge branch 'OpenAtomFoundation:unstable' into revise_slave_worker_m…

3232576

…odel

baixin01 approved these changes May 20, 2024

1 add extra 10s sleep in randomSpopstore test to avoid the sporadic f…

f25fb89

…ailure of this test case 2 revised some comments about write-binlog-worker-num in pika.conf