
replication is slow and many “checkpoint has no change, skip sync flush checkpoint” in DM log #4619

Closed
lance6716 opened this issue Feb 17, 2022 · 10 comments · Fixed by #4744
Labels
affects-5.4 This bug affects the 5.4.x(LTS) versions. area/dm Issues or PRs related to DM. severity/major type/bug The issue is confirmed as a bug.

Comments

@lance6716
Contributor

lance6716 commented Feb 17, 2022

What did you do?

This issue may be triggered when DM uses GTID-based replication in one of the following scenarios:

  • the upstream has a huge DML binlog event containing millions of row changes (normally this should not happen, since the default maximum size of a ROWS_EVENT is 8192 bytes)
  • a single extra-large transaction contains millions of row changes in total

or when DM uses position-based replication and:

  • the upstream has a huge DML binlog event containing millions of row changes (again, this normally should not happen)

https://asktug.com/t/topic/573236

What did you expect to see?

No response

What did you see instead?

The symptom is that DM replicates data very slowly and the DM log contains many “checkpoint has no change, skip sync flush checkpoint” messages.

Versions of the cluster

DM version (run dmctl -V or dm-worker -V or dm-master -V):

v5.4.0

current status of DM cluster (execute query-status <task-name> in dmctl)

(paste current status of DM cluster here)
@lance6716 lance6716 added type/bug The issue is confirmed as a bug. area/dm Issues or PRs related to DM. labels Feb 17, 2022
@lance6716
Contributor Author

lance6716 commented Feb 17, 2022

Root cause

When DM processes the row changes of a big upstream transaction and those row changes all share the same location (the same GTID, or membership in one big binlog event), the checkpoint cannot be updated and therefore cannot be flushed to the downstream.

After the checkpoint flush interval (30s by default) has elapsed, DM checks the checkpoint status for every row change. In v5.4.0, this check also walks all table checkpoints. If the checkpoint has not been updated, DM prints the following log line:

```go
log.L().Info("checkpoint has no change, skip sync flush checkpoint")
```

which involves disk IO. Disk IO is slow, so replication becomes slower.
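
To make the hot path concrete, here is a minimal, self-contained Go sketch of the behaviour described above. The type and function names (`checkPoint`, `checkAndMaybeFlush`, ...) are illustrative stand-ins, not DM's actual code:

```go
package main

import (
	"log"
	"time"
)

// Simplified stand-in for DM's checkpoint state; not the real implementation.
type checkPoint struct {
	globalPoint   string            // last flushed global location
	tablePoints   map[string]string // per-table locations (v5.4.0 checks all of them)
	lastFlush     time.Time
	flushInterval time.Duration
}

// checkAndMaybeFlush models what happens for every row change once the flush
// interval has elapsed: compare the current location against the global point
// and every table point, and log at INFO level when nothing changed, which
// costs one log write (disk IO) per row change.
func (cp *checkPoint) checkAndMaybeFlush(currentLoc string) {
	if time.Since(cp.lastFlush) < cp.flushInterval {
		return
	}
	changed := currentLoc != cp.globalPoint
	for _, loc := range cp.tablePoints {
		if loc != currentLoc {
			changed = true
		}
	}
	if !changed {
		// the line reported in this issue, emitted once per row change
		log.Println("checkpoint has no change, skip sync flush checkpoint")
	}
}

func main() {
	cp := &checkPoint{
		globalPoint:   "uuid:1-100",
		tablePoints:   map[string]string{"db.tbl": "uuid:1-100"},
		lastFlush:     time.Now().Add(-time.Minute), // interval already elapsed
		flushInterval: 30 * time.Second,
	}
	// inside a 2M-row transaction the location never advances, so every
	// row change repeats the table scan and the log write
	for i := 0; i < 3; i++ {
		cp.checkAndMaybeFlush("uuid:1-100")
	}
}
```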

Workaround

Users can change the DM-worker log level to a level above INFO. See the worker configuration file.
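
For example, a minimal dm-worker configuration fragment (assuming the standard `log-level` option of the worker configuration file):

```toml
# dm-worker.toml fragment; "warn" (or higher) filters out the INFO-level
# "checkpoint has no change, skip sync flush checkpoint" message
log-level = "warn"
```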

@D3Hunter
Contributor

Possible solution after discussion with @lance6716:

  • change this log's level to debug
  • on DML events, if there is no checkpoint change after the flush interval, set a flag marking that we have entered a large transaction, disable the checking while this flag is on, and reset the flag when we meet an XID event, to avoid checking the checkpoint change repeatedly (see the sketch below)
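
A rough Go sketch of the second bullet, only to illustrate the idea; the `syncer` type and its fields are hypothetical, not DM's real structures:

```go
package main

import "fmt"

// Hypothetical, minimal model of a syncer handling one replication stream.
type syncer struct {
	inBigTransaction bool   // set once a row change shows no checkpoint movement
	checkpointLoc    string // last flushed location
}

// onRowChange is called for every row change after the flush interval elapsed.
func (s *syncer) onRowChange(currentLoc string) {
	if s.inBigTransaction {
		// flag is on: skip the expensive checkpoint comparison entirely
		return
	}
	if currentLoc == s.checkpointLoc {
		// no checkpoint change: we are inside a large transaction, so do not
		// repeat this check for the remaining row changes
		s.inBigTransaction = true
	}
}

// onXIDEvent ends the transaction and re-enables the checkpoint check.
func (s *syncer) onXIDEvent() {
	s.inBigTransaction = false
}

func main() {
	s := &syncer{checkpointLoc: "uuid:1-100"}
	for i := 0; i < 1000; i++ {
		s.onRowChange("uuid:1-100") // only the first call pays the comparison
	}
	s.onXIDEvent()
	fmt.Println("flag reset after XID event:", !s.inBigTransaction)
}
```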

@GMHDBJD
Contributor

GMHDBJD commented Feb 17, 2022

change this log's level to debug

to WARN ?

@Ehco1996
Contributor

change this log's level to debug

to WARN ?

WARN level is higher than INFO, so the message would still be printed by default.

@GMHDBJD
Contributor

GMHDBJD commented Feb 17, 2022

Oh, I misunderstood it as referring to the user's workaround.

@lance6716
Contributor Author

lance6716 commented Feb 17, 2022

@glorv has reminded us that writing to disk may not be that slow, so the extra time cost may come from checking whether any table checkpoint is outdated. Also, since the GTID set in the user's scenario is complex, the GTID set comparison may contribute a non-negligible share of the time cost.

(In v5.3.0 we only checked the global checkpoint.)
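
A back-of-the-envelope Go sketch of why this adds up (a simplified cost model, not go-mysql's actual GTID set implementation): with 500 table checkpoints and a 10-UUID-part GTID set, every single row change pays thousands of interval comparisons.

```go
package main

import "fmt"

// Simplified GTID set: uuid -> [start, end] interval. Real GTID sets (and
// their comparison code) are more complex; this is only a cost model.
type gtidSet map[string][2]int64

// contains reports whether set a covers set b; the work grows with the
// number of UUID parts in b.
func contains(a, b gtidSet) bool {
	for uuid, iv := range b {
		cover, ok := a[uuid]
		if !ok || cover[0] > iv[0] || cover[1] < iv[1] {
			return false
		}
	}
	return true
}

func main() {
	const tables, uuidParts = 500, 10

	// a "complex" GTID set with 10 UUID parts, as in the user's scenario
	point := make(gtidSet, uuidParts)
	for i := 0; i < uuidParts; i++ {
		point[fmt.Sprintf("server-uuid-%d", i)] = [2]int64{1, 1000}
	}

	// v5.4.0: every row change compares the current location against every
	// table checkpoint; v5.3.0 only compared the single global checkpoint.
	intervalChecks := 0
	for t := 0; t < tables; t++ {
		if contains(point, point) {
			intervalChecks += uuidParts
		}
	}
	fmt.Println("interval checks per row change:", intervalChecks) // 5000
}
```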

@lance6716
Contributor Author

lance6716 commented Mar 2, 2022

If the task only has a few tables to replicate, the speed does not drop even though a lot of log lines are printed.

Reproduced with 4 tables in the task, a simple GTID set, one transaction deleting 2M rows, and checkpoint-flush-interval set to 5 seconds.

@lance6716
Contributor Author

lance6716 commented Mar 2, 2022

If the task has 500 tables and a simple GTID set, DELETE QPS drops from 16.8k to 15.0k (a drop of about 10%). This is the effect of checking table checkpoints.

@lance6716
Contributor Author

If the task has 500 tables and a complex GTID set (10 UUID parts), DELETE QPS drops from 16.8k to 3.7k (a drop of about 78%).

@XuJianxu

XuJianxu commented Mar 3, 2022

Reproduced this issue in upgrade testing:

  • two upstreams with relay log enabled, on DM 2.0.1 with 3 masters and 3 workers
  • execute CREATE INDEX on upstream 1 while continuing to load data into upstream 2
  • upgrade to v5.4.0
  • execute the same DDL on upstream 2
  • continue loading data into upstream 1 and 2
  • this issue happened
