
syncer(dm): fix log flood and performance when frequently call Snapshot #4744

Merged
merged 5 commits into pingcap:master on Mar 8, 2022

Conversation

Contributor

@lance6716 lance6716 commented Mar 2, 2022

Signed-off-by: lance6716 [email protected]

What problem does this PR solve?

Issue Number: close #4619

What is changed and how it works?

Always update lastSnapshotCreationTime when Snapshot is called. Now checkShouldFlush returns false after checking lastSnapshotCreationTime, so we will not call Snapshot too frequently.
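A minimal sketch of that mechanism (method and field names follow the PR description; the surrounding struct is a simplified assumption, not the actual tiflow code):

package checkpoint

import "time"

// snapshotter is a simplified stand-in for DM's checkpoint object.
type snapshotter struct {
    flushInterval            time.Duration
    lastSnapshotCreationTime time.Time
}

// Snapshot always refreshes lastSnapshotCreationTime, even when the
// resulting snapshot is empty, so repeated calls stay cheap.
func (s *snapshotter) Snapshot() {
    s.lastSnapshotCreationTime = time.Now()
    // ... build the (possibly empty) checkpoint snapshot ...
}

// checkShouldFlush returns false until flushInterval has elapsed since the
// last Snapshot call, so a big transaction's row changes no longer trigger
// a Snapshot (and a "no change" log line) for every row.
func (s *snapshotter) checkShouldFlush() bool {
    return time.Since(s.lastSnapshotCreationTime) >= s.flushInterval
}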

Check List

Tests

  • Manual test (add detailed scripts or steps below)
    generate two big transactions and set checkpoint-flush-interval to 5s. Use a complex GTID set consisting of 10 UUID parts to start the task, replicate the first big transaction, create 500 tables with one DML per table to record 500 table checkpoints, then replicate the second big transaction
    before this PR, the QPS of the first big transaction is 16.8k and of the second is 3.7k
    after this PR, the QPS of both is 16.5k

Code changes

Side effects

Related changes

  • Need to cherry-pick to the release branch

Release note

Fix the issue that the log is flooded with "checkpoint has no change, skip sync flush checkpoint" messages and that replication performance may drop

@ti-chi-bot
Member

ti-chi-bot commented Mar 2, 2022

[REVIEW NOTIFICATION]

This pull request has been approved by:

  • D3Hunter
  • GMHDBJD

To complete the pull request process, please ask the reviewers in the list to review by filling /cc @reviewer in the comment.
After your PR has acquired the required number of LGTMs, you can assign this pull request to the committer in the list by filling /assign @committer in the comment to help you merge this pull request.

The full list of commands accepted by this bot can be found here.

Reviewer can indicate their review by submitting an approval review.
Reviewer can cancel approval by submitting a request changes review.

@ti-chi-bot ti-chi-bot added needs-cherry-pick-release-5.4 Should cherry pick this PR to release-5.4 branch. release-note Denotes a PR that will be considered when it comes time to generate release notes. size/S Denotes a PR that changes 10-29 lines, ignoring generated files. labels Mar 2, 2022
@lance6716 lance6716 added the area/dm Issues or PRs related to DM. label Mar 2, 2022
@lance6716
Copy link
Contributor Author

/cc @db-will @Ehco1996 @D3Hunter @niubell

// When there's a big transaction in GTID-based replication, after the checkpoint-flush-interval every row change
// will trigger a check and thus call Snapshot, so Snapshot should be lightweight. We only check the global point
// so that we can return quickly.
if !flushGlobalPoint {
Contributor

so we don't need a table-point-level snapshot now?

Contributor Author

we save the table checkpoints just 8 lines below

Contributor

when we receive a sharded DDL for multiple tables on the upstream database, using pessimistic mode, will we have a case where the global point is not updated but a table checkpoint is updated?

will that cause any issues?

Contributor Author

@GMHDBJD PTAL. Maybe we can add an IsDirty/NeedFlush method to the checkpoint interface?

Contributor Author

in v5.3.0, we always flush the checkpoint every 30s no matter whether the content has changed.

func (cp *RemoteCheckPoint) CheckGlobalPoint() bool {
    cp.RLock()
    defer cp.RUnlock()
    return time.Since(cp.globalPointSaveTime) >= time.Duration(cp.cfg.CheckpointFlushInterval)*time.Second
}

The new async checkpoint flush feature added another check of both the global checkpoint and the table checkpoints. At the moment, this PR only wants to add another check of the global checkpoint, but that can't handle the pessimistic sharding case. For now I think an atomic boolean IsDirty can represent whether the global or a table checkpoint has been updated.

I also want to confirm: why did we add another check compared with v5.3.0? @db-will

Contributor Author

@GMHDBJD what's the proposed behaviour?

  • flush every 30 seconds no matter whether the checkpoint has changed
  • after 30 seconds, check whether the checkpoint is dirty, where the dirty flag is turned on when the global or a table checkpoint is set

Contributor

Keep it the same as 2.x: only flush after 30 seconds and when cp.globalPoint.outOfDate(), which is your current implementation.

Contributor

IsDirty/NeedFlush in checkpoint interface?

would be more intuitive

Contributor Author

Keep it the same as 2.x: only flush after 30 seconds and when cp.globalPoint.outOfDate(), which is your current implementation.

In v2.0.7, I think the behaviour is

  1. in addJob/checkWait, we mostly flush checkpoints every checkpoint-flush-interval
  2. in FlushPointsExcept, we generate the flush SQL by inspecting outOfDate() for each table point, and outOfDate()/SaveTimeIsZero/NeedFlushSafeModeExitPoint for the global point.

My current code adds a "step 1.5": after every checkpoint-flush-interval, we check outOfDate()/SaveTimeIsZero/NeedFlushSafeModeExitPoint of the global point to decide whether to go to FlushPointsExcept in step 2. That causes no flushing under pessimistic sharding, and if the DM-worker also fails to flush on exit, the progress is replayed after restart. Not a big problem, but I prefer to avoid it.

Currently I prefer to add IsDirty/NeedFlush to the checkpoint interface.
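
For illustration, a minimal sketch of what such a dirty flag could look like (hypothetical type and method names; the real DM checkpoint interface is larger):

package checkpoint

import "sync/atomic"

// dirtyCheckPoint sketches the IsDirty idea: saving the global point or any
// table point marks the checkpoint dirty, and the interval-based flush check
// consults the flag instead of only the global point's save time, so the
// pessimistic-sharding case (table points advance while the global point
// stays put) still gets flushed.
type dirtyCheckPoint struct {
    dirty atomic.Bool
}

func (cp *dirtyCheckPoint) saveGlobalPoint() {
    // ... update the in-memory global point ...
    cp.dirty.Store(true)
}

func (cp *dirtyCheckPoint) saveTablePoint() {
    // ... update an in-memory table point ...
    cp.dirty.Store(true)
}

// needFlush is what the periodic check would call.
func (cp *dirtyCheckPoint) needFlush() bool {
    return cp.dirty.Load()
}

func (cp *dirtyCheckPoint) flush() {
    // ... persist the points downstream, then clear the flag ...
    cp.dirty.Store(false)
}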

Contributor

in v5.3.0, we always flush the checkpoint every 30s no matter whether the content has changed.

func (cp *RemoteCheckPoint) CheckGlobalPoint() bool {
    cp.RLock()
    defer cp.RUnlock()
    return time.Since(cp.globalPointSaveTime) >= time.Duration(cp.cfg.CheckpointFlushInterval)*time.Second
}

The new async checkpoint flush feature added another check of both the global checkpoint and the table checkpoints. At the moment, this PR only wants to add another check of the global checkpoint, but that can't handle the pessimistic sharding case. For now I think an atomic boolean IsDirty can represent whether the global or a table checkpoint has been updated.

I also want to confirm: why did we add another check compared with v5.3.0? @db-will

the extra check was added to prevent continuously calling the async flush when flushing the checkpoint is slow.
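
A rough sketch of the failure mode this guards against (hypothetical code, not DM's implementation): when the flush is asynchronous and slower than the event rate, recording the decision time up front keeps every subsequent row change from re-triggering the flush:

package checkpoint

import (
    "sync/atomic"
    "time"
)

type asyncFlusher struct {
    interval  time.Duration
    lastCheck atomic.Int64 // unix nanoseconds of the last flush decision
}

// onRowEvent is called for every replicated row change.
func (f *asyncFlusher) onRowEvent() {
    if time.Since(time.Unix(0, f.lastCheck.Load())) < f.interval {
        return // decided recently; skip, even if the flush is still running
    }
    // Record the decision time before the slow flush finishes, so the next
    // row changes do not all re-enter this branch and flood the log.
    f.lastCheck.Store(time.Now().UnixNano())
    go func() {
        // ... slow asynchronous checkpoint flush ...
    }()
}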

Signed-off-by: lance6716 <[email protected]>
@ti-chi-bot ti-chi-bot added size/M Denotes a PR that changes 30-99 lines, ignoring generated files. and removed size/S Denotes a PR that changes 10-29 lines, ignoring generated files. labels Mar 3, 2022
Signed-off-by: lance6716 <[email protected]>
@ti-chi-bot ti-chi-bot added size/XS Denotes a PR that changes 0-9 lines, ignoring generated files. and removed size/M Denotes a PR that changes 30-99 lines, ignoring generated files. labels Mar 7, 2022
@codecov-commenter

codecov-commenter commented Mar 7, 2022

Codecov Report

Merging #4744 (3860ae5) into master (9607554) will increase coverage by 0.0318%.
The diff coverage is 54.1884%.

Flag   Coverage Δ
cdc    59.8303% <54.1884%> (-0.0919%) ⬇️
dm     52.2372% <ø> (+0.2084%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

@@               Coverage Diff                @@
##             master      #4744        +/-   ##
================================================
+ Coverage   55.6402%   55.6721%   +0.0318%     
================================================
  Files           494        520        +26     
  Lines         61283      64641      +3358     
================================================
+ Hits          34098      35987      +1889     
- Misses        23750      25132      +1382     
- Partials       3435       3522        +87     

@lance6716
Contributor Author

manual test passed and PR description updated, PTAL @GMHDBJD @D3Hunter

Contributor

@GMHDBJD GMHDBJD left a comment

LGTM

@ti-chi-bot ti-chi-bot added the status/LGT1 Indicates that a PR has LGTM 1. label Mar 7, 2022
Contributor

@D3Hunter D3Hunter left a comment

LGTM

@ti-chi-bot ti-chi-bot added status/LGT2 Indicates that a PR has LGTM 2. and removed status/LGT1 Indicates that a PR has LGTM 1. labels Mar 7, 2022
@D3Hunter
Contributor

D3Hunter commented Mar 7, 2022

/run-all-tests

Contributor

@db-will db-will left a comment

Looks good!

@ti-chi-bot
Member

@db-will: Thanks for your review. The bot only counts approvals from reviewers and higher roles in list, but you're still welcome to leave your comments.

In response to this:

Looks good!

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the ti-community-infra/tichi repository.

@lance6716
Contributor Author

/merge

@ti-chi-bot
Member

This pull request has been accepted and is ready to merge.

Commit hash: 3860ae5

@ti-chi-bot ti-chi-bot added the status/can-merge Indicates a PR has been approved by a committer. label Mar 8, 2022
@ti-chi-bot ti-chi-bot merged commit aacbbcf into pingcap:master Mar 8, 2022
ti-chi-bot pushed a commit to ti-chi-bot/tiflow that referenced this pull request Mar 8, 2022
@ti-chi-bot
Member

In response to a cherrypick label: new pull request created: #4801.

@lance6716 lance6716 deleted the fix-log-flood branch October 13, 2022 08:38
Successfully merging this pull request may close these issues:

replication is slow and many “checkpoint has no change, skip sync flush checkpoint” in DM log