
replication is slow and many “checkpoint has no change, skip sync flush checkpoint” in DM log #4619

Closed
lance6716 opened this issue Feb 17, 2022 · 10 comments · Fixed by #4744
Labels
affects-5.4 This bug affects the 5.4.x(LTS) versions. area/dm Issues or PRs related to DM. severity/major type/bug The issue is confirmed as a bug.

Comments

@lance6716
Contributor

lance6716 commented Feb 17, 2022

What did you do?

This issue may be triggered when DM uses GTID-based replication in one of the following scenarios:

  • the upstream has a huge DML binlog event containing millions of row changes (normally this should not happen, since the default maximum size of a ROWS_EVENT is 8192 bytes)
  • a single extra-large transaction contains millions of row changes in total

or when DM uses position-based replication and:

  • the upstream has a huge DML binlog event containing millions of row changes (again, this normally should not happen)

https://asktug.com/t/topic/573236

What did you expect to see?

No response

What did you see instead?

The symptom is that DM replicates data very slowly and the DM log contains many “checkpoint has no change, skip sync flush checkpoint” messages.

Versions of the cluster

DM version (run dmctl -V or dm-worker -V or dm-master -V):

v5.4.0

current status of DM cluster (execute query-status <task-name> in dmctl)

(paste current status of DM cluster here)
@lance6716 lance6716 added type/bug The issue is confirmed as a bug. area/dm Issues or PRs related to DM. labels Feb 17, 2022
@lance6716
Contributor Author

lance6716 commented Feb 17, 2022

Root cause

When DM processes the row changes of a big upstream transaction and those row changes all share the same location (the same GTID, or membership in one big binlog event), the checkpoint cannot be updated and therefore cannot be flushed to the downstream.

After the checkpoint flush interval (30s by default) has elapsed, DM checks the checkpoint status for every row change. In v5.4.0, this check also walks all table checkpoints. If the checkpoint has not been updated, DM prints the following log line:

```go
log.L().Info("checkpoint has no change, skip sync flush checkpoint")
```

which involves disk IO. Disk IO is slow, so replication becomes slower.
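
To make the hot path concrete, here is a minimal, self-contained Go sketch of the behaviour described above. The type and function names (`checkPoint`, `checkAndMaybeFlush`, ...) are illustrative stand-ins, not DM's actual code:

```go
package main

import (
	"log"
	"time"
)

// Simplified stand-in for DM's checkpoint state; not the real implementation.
type checkPoint struct {
	globalPoint   string            // last flushed global location
	tablePoints   map[string]string // per-table locations (v5.4.0 checks all of them)
	lastFlush     time.Time
	flushInterval time.Duration
}

// checkAndMaybeFlush models what happens for every row change once the flush
// interval has elapsed: compare the current location against the global point
// and every table point, and log at INFO level when nothing changed, which
// costs one log write (disk IO) per row change.
func (cp *checkPoint) checkAndMaybeFlush(currentLoc string) {
	if time.Since(cp.lastFlush) < cp.flushInterval {
		return
	}
	changed := currentLoc != cp.globalPoint
	for _, loc := range cp.tablePoints {
		if loc != currentLoc {
			changed = true
		}
	}
	if !changed {
		// the line reported in this issue, emitted once per row change
		log.Println("checkpoint has no change, skip sync flush checkpoint")
	}
}

func main() {
	cp := &checkPoint{
		globalPoint:   "uuid:1-100",
		tablePoints:   map[string]string{"db.tbl": "uuid:1-100"},
		lastFlush:     time.Now().Add(-time.Minute), // interval already elapsed
		flushInterval: 30 * time.Second,
	}
	// inside a 2M-row transaction the location never advances, so every
	// row change repeats the table scan and the log write
	for i := 0; i < 3; i++ {
		cp.checkAndMaybeFlush("uuid:1-100")
	}
}
```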

Workaround

Users can change the DM-worker log level to a level above INFO. See the worker configuration file.
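
For example, a minimal dm-worker configuration fragment (assuming the standard `log-level` option of the worker configuration file):

```toml
# dm-worker.toml fragment; "warn" (or higher) filters out the INFO-level
# "checkpoint has no change, skip sync flush checkpoint" message
log-level = "warn"
```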

@D3Hunter
Contributor

Possible solution after discussion with @lance6716:

  • change this log's level to debug
  • on DML events, if there is no checkpoint change after the flush interval, set a flag marking that we have entered a large transaction, disable the checking while this flag is on, and reset the flag when we meet an XID event, to avoid checking the checkpoint change repeatedly (see the sketch below)
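
A rough Go sketch of the second bullet, only to illustrate the idea; the `syncer` type and its fields are hypothetical, not DM's real structures:

```go
package main

import "fmt"

// Hypothetical, minimal model of a syncer handling one replication stream.
type syncer struct {
	inBigTransaction bool   // set once a row change shows no checkpoint movement
	checkpointLoc    string // last flushed location
}

// onRowChange is called for every row change after the flush interval elapsed.
func (s *syncer) onRowChange(currentLoc string) {
	if s.inBigTransaction {
		// flag is on: skip the expensive checkpoint comparison entirely
		return
	}
	if currentLoc == s.checkpointLoc {
		// no checkpoint change: we are inside a large transaction, so do not
		// repeat this check for the remaining row changes
		s.inBigTransaction = true
	}
}

// onXIDEvent ends the transaction and re-enables the checkpoint check.
func (s *syncer) onXIDEvent() {
	s.inBigTransaction = false
}

func main() {
	s := &syncer{checkpointLoc: "uuid:1-100"}
	for i := 0; i < 1000; i++ {
		s.onRowChange("uuid:1-100") // only the first call pays the comparison
	}
	s.onXIDEvent()
	fmt.Println("flag reset after XID event:", !s.inBigTransaction)
}
```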

@GMHDBJD
Contributor

GMHDBJD commented Feb 17, 2022

change this log's level to debug

to WARN ?

@Ehco1996
Contributor

change this log's level to debug

to WARN ?

WARN level is higher than INFO, so the message would still be printed by default.

@GMHDBJD
Contributor

GMHDBJD commented Feb 17, 2022

Oh, I misunderstood it as referring to the user's workaround.

@lance6716
Contributor Author

lance6716 commented Feb 17, 2022

@glorv has reminded us that writing to disk may not be that slow, so the extra time cost may come from checking whether any table checkpoint is outdated. Also, since the GTID set in the user's scenario is complex, the GTID set comparison may contribute a non-negligible share of the time cost.

(In v5.3.0 we only checked the global checkpoint.)
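
A back-of-the-envelope Go sketch of why this adds up (a simplified cost model, not go-mysql's actual GTID set implementation): with 500 table checkpoints and a 10-UUID-part GTID set, every single row change pays thousands of interval comparisons.

```go
package main

import "fmt"

// Simplified GTID set: uuid -> [start, end] interval. Real GTID sets (and
// their comparison code) are more complex; this is only a cost model.
type gtidSet map[string][2]int64

// contains reports whether set a covers set b; the work grows with the
// number of UUID parts in b.
func contains(a, b gtidSet) bool {
	for uuid, iv := range b {
		cover, ok := a[uuid]
		if !ok || cover[0] > iv[0] || cover[1] < iv[1] {
			return false
		}
	}
	return true
}

func main() {
	const tables, uuidParts = 500, 10

	// a "complex" GTID set with 10 UUID parts, as in the user's scenario
	point := make(gtidSet, uuidParts)
	for i := 0; i < uuidParts; i++ {
		point[fmt.Sprintf("server-uuid-%d", i)] = [2]int64{1, 1000}
	}

	// v5.4.0: every row change compares the current location against every
	// table checkpoint; v5.3.0 only compared the single global checkpoint.
	intervalChecks := 0
	for t := 0; t < tables; t++ {
		if contains(point, point) {
			intervalChecks += uuidParts
		}
	}
	fmt.Println("interval checks per row change:", intervalChecks) // 5000
}
```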

@lance6716
Contributor Author

lance6716 commented Mar 2, 2022

If the task only has a few tables to replicate, the speed does not drop even though a lot of log lines are printed.

Reproduced with 4 tables in the task, a simple GTID set, one transaction deleting 2M rows, and checkpoint-flush-interval set to 5 seconds.

@lance6716
Contributor Author

lance6716 commented Mar 2, 2022

If the task has 500 tables and a simple GTID set, DELETE QPS drops from 16.8k to 15.0k (a drop of about 10%). This is the effect of checking table checkpoints.

@lance6716
Contributor Author

If the task has 500 tables and a complex GTID set (10 UUID parts), DELETE QPS drops from 16.8k to 3.7k (a drop of about 78%).

@XuJianxu

XuJianxu commented Mar 3, 2022

Reproduced this issue in upgrade testing:

  • two upstreams with relay log enabled, on DM 2.0.1 with 3 masters and 3 workers
  • execute CREATE INDEX on upstream 1 while continuing to load data into upstream 2
  • upgrade to v5.4.0
  • execute the same DDL on upstream 2
  • continue loading data into upstream 1 and 2
  • this issue happened
