replication is slow and many “checkpoint has no change, skip sync flush checkpoint” in DM log #4619
Comments
**Root cause**

When DM meets the row changes of a big transaction in the upstream and the locations of those row changes are the same (the same GTID, or all belonging to one big binlog event), the checkpoint cannot be updated and thus cannot be flushed to the downstream. After the checkpoint flush interval (30s by default) has elapsed, DM checks the checkpoint status for every row change. In v5.4.0, this check also considers all table checkpoints. If the checkpoint is not updated, DM prints a log at Line 1114 in 14db59b, which involves disk IO. Disk IO is slow, so replication becomes slower. A sketch of this hot path follows below.

**Workaround**

Users can change the log level of DM-worker to a level above INFO. See the worker configuration file.
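To make the hot path concrete, here is a minimal Go sketch of the pattern described above. All names (`checkpoint`, `onRowChange`, `flushInterval`) are hypothetical stand-ins for illustration, not DM's actual code:

```go
package main

import (
	"log"
	"time"
)

// checkpoint is a hypothetical stand-in for DM's syncer checkpoint state.
type checkpoint struct {
	lastFlush time.Time
	dirty     bool // true only when the binlog location has advanced
}

const flushInterval = 30 * time.Second // DM's default checkpoint flush interval

// onRowChange sketches the hot path: once the flush interval has elapsed,
// every subsequent row change re-enters the slow branch. If a big
// transaction never advances the checkpoint, dirty stays false and the
// INFO log below fires for every single row, each call paying log IO.
func (cp *checkpoint) onRowChange() {
	if time.Since(cp.lastFlush) < flushInterval {
		return // fast path: nothing to do between flushes
	}
	if !cp.dirty {
		// In v5.4.0 this branch is also preceded by scanning every
		// table checkpoint (and comparing GTID sets), per row change.
		log.Println("checkpoint has no change, skip sync flush checkpoint")
		return
	}
	// flush the checkpoint to the downstream, then reset
	cp.lastFlush = time.Now()
	cp.dirty = false
}

func main() {
	cp := &checkpoint{lastFlush: time.Now().Add(-time.Minute)}
	for i := 0; i < 3; i++ { // imagine 2M rows in one DELETE transaction
		cp.onRowChange()
	}
}
```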
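As a concrete example of the workaround, raising the DM-worker log level would look roughly like this, assuming the standard TOML worker configuration file (verify the field against your deployment's config):

```toml
# dm-worker.toml (excerpt)
# "warn" is above "info", so the per-row
# "checkpoint has no change, skip sync flush checkpoint"
# message is suppressed along with its IO cost.
log-level = "warn"
```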
Possible solution after discussing with @lance6716.
warn level is higher than info
Oh, I misunderstood it as saying the user's workaround.
@glorv has reminded us that writing to disk may not be that slow, so the extra time cost may come from checking whether any table checkpoint is outdated. Also, considering that the GTID set in the user's scenario is complex, GTID set comparison may contribute a non-negligible factor to the time cost (in v5.3.0 we only checked the global checkpoint). A sketch of this comparison cost follows the benchmark list below.
- If the task has only a few tables to replicate, the speed does not drop even when a lot of logs are printed. Reproduced with 4 tables in the task, a simple GTID set, a transaction deleting 2M rows, and a checkpoint-flush-interval of 5 seconds.
- If the task has 500 tables and a simple GTID set, DELETE QPS drops from 16.8k to 15.0k (a ~10% drop). This is the effect of checking table checkpoints.
- If the task has 500 tables and a complex GTID set (10 UUID parts), DELETE QPS drops from 16.8k to 3.7k (a ~78% drop).
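To illustrate why a complex GTID set makes the per-row check expensive, here is a simplified Go model of GTID set containment. Real implementations (e.g. in the go-mysql library) are more elaborate; all names here are illustrative:

```go
package main

import "fmt"

// gtidSet is a simplified model of a GTID set: a list of
// [start, end] transaction-ID intervals per source UUID.
type gtidSet map[string][][2]int64

// contains reports whether a covers b. The nested loops over every
// UUID and every interval are the point: with 10 UUID parts this work
// is repeated for each of the millions of rows in a big transaction.
func contains(a, b gtidSet) bool {
	for uuid, intervals := range b {
		owned, ok := a[uuid]
		if !ok {
			return false
		}
		for _, iv := range intervals {
			covered := false
			for _, o := range owned {
				if o[0] <= iv[0] && iv[1] <= o[1] {
					covered = true
					break
				}
			}
			if !covered {
				return false
			}
		}
	}
	return true
}

func main() {
	global := gtidSet{"uuid-1": {{1, 100}}, "uuid-2": {{1, 50}}}
	table := gtidSet{"uuid-1": {{1, 90}}}
	fmt.Println(contains(global, table)) // true
}
```

The cost of this comparison scales with the number of UUID parts and intervals, which is consistent with the much larger QPS drop observed in the complex-GTID benchmark above.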
Reproduced this issue in upgrade testing.
What did you do?
This issue may be triggered when DM uses GTID-based replication with one of the scenarios below:

or when DM uses position-based replication with:
https://asktug.com/t/topic/573236
What did you expect to see?
No response
What did you see instead?
The symptom is that DM replicates data very slowly and there are many “checkpoint has no change, skip sync flush checkpoint” messages in the DM log.
Versions of the cluster

DM version (run `dmctl -V` or `dm-worker -V` or `dm-master -V`): v5.4.0

Current status of DM cluster (execute `query-status <task-name>` in dmctl):

(paste current status of DM cluster here)