
syncer(dm): fix log flood and performance when frequently call Snapshot #4744

Merged
merged 5 commits into pingcap:master on Mar 8, 2022

Conversation

Contributor

@lance6716 lance6716 commented Mar 2, 2022

Signed-off-by: lance6716 [email protected]

What problem does this PR solve?

Issue Number: close #4619

What is changed and how it works?

Always update lastSnapshotCreationTime when Snapshot is called. Now checkShouldFlush returns false after checking lastSnapshotCreationTime, so we will not call Snapshot too frequently.
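A minimal sketch of that mechanism (method and field names follow the PR description; the surrounding struct is a simplified assumption, not the actual tiflow code):

package checkpoint

import "time"

// snapshotter is a simplified stand-in for DM's checkpoint object.
type snapshotter struct {
    flushInterval            time.Duration
    lastSnapshotCreationTime time.Time
}

// Snapshot always refreshes lastSnapshotCreationTime, even when the
// resulting snapshot is empty, so repeated calls stay cheap.
func (s *snapshotter) Snapshot() {
    s.lastSnapshotCreationTime = time.Now()
    // ... build the (possibly empty) checkpoint snapshot ...
}

// checkShouldFlush returns false until flushInterval has elapsed since the
// last Snapshot call, so a big transaction's row changes no longer trigger
// a Snapshot (and a "no change" log line) for every row.
func (s *snapshotter) checkShouldFlush() bool {
    return time.Since(s.lastSnapshotCreationTime) >= s.flushInterval
}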

Check List

Tests

  • Manual test (add detailed scripts or steps below)
    generate two big transactions and set checkpoint-flush-interval to 5s. Use a complex GTID set consisting of 10 UUID parts to start the task, replicate the first big transaction, create 500 tables with one DML per table to record 500 table checkpoints, then replicate the second big transaction
    before this PR, the QPS of the first big transaction is 16.8k and of the second is 3.7k
    after this PR, the QPS of both is 16.5k

Code changes

Side effects

Related changes

  • Need to cherry-pick to the release branch

Release note

Fix the issue that the log is flooded with "checkpoint has no change, skip sync flush checkpoint" messages and that replication performance may drop

@ti-chi-bot
Member

ti-chi-bot commented Mar 2, 2022

[REVIEW NOTIFICATION]

This pull request has been approved by:

  • D3Hunter
  • GMHDBJD

To complete the pull request process, please ask the reviewers in the list to review by filling /cc @reviewer in the comment.
After your PR has acquired the required number of LGTMs, you can assign this pull request to the committer in the list by filling /assign @committer in the comment to help you merge this pull request.

The full list of commands accepted by this bot can be found here.

Reviewer can indicate their review by submitting an approval review.
Reviewer can cancel approval by submitting a request changes review.

@ti-chi-bot ti-chi-bot added needs-cherry-pick-release-5.4 Should cherry pick this PR to release-5.4 branch. release-note Denotes a PR that will be considered when it comes time to generate release notes. size/S Denotes a PR that changes 10-29 lines, ignoring generated files. labels Mar 2, 2022
@lance6716 lance6716 added the area/dm Issues or PRs related to DM. label Mar 2, 2022
@lance6716
Copy link
Contributor Author

/cc @db-will @Ehco1996 @D3Hunter @niubell

// When there's a big transaction in GTID-based replication, after the checkpoint-flush-interval every row change
// will trigger a check and thus call Snapshot, so Snapshot should be lightweight. We only check the global point
// so that we can return quickly.
if !flushGlobalPoint {
Contributor

so we don't need a table-point-level snapshot now?

Contributor Author

we save the table checkpoints just 8 lines below

Contributor

when we receive a sharded DDL for multiple tables on the upstream database, using pessimistic mode, will we have a case where the global point is not updated but a table checkpoint is updated?

will that cause any issues?

Contributor Author

@GMHDBJD PTAL. Maybe we can add an IsDirty/NeedFlush method to the checkpoint interface?

Contributor Author

in v5.3.0, we always flush the checkpoint every 30s no matter whether the content has changed.

func (cp *RemoteCheckPoint) CheckGlobalPoint() bool {
    cp.RLock()
    defer cp.RUnlock()
    return time.Since(cp.globalPointSaveTime) >= time.Duration(cp.cfg.CheckpointFlushInterval)*time.Second
}

The new async checkpoint flush feature added another check of both the global checkpoint and the table checkpoints. At the moment, this PR only wants to add another check of the global checkpoint, but that can't handle the pessimistic sharding case. For now I think an atomic boolean IsDirty can represent whether the global or a table checkpoint has been updated.

I also want to confirm: why did we add another check compared with v5.3.0? @db-will

Contributor Author

@GMHDBJD what's the proposed behaviour?

  • flush every 30 seconds no matter whether the checkpoint has changed
  • after 30 seconds, check whether the checkpoint is dirty, where the dirty flag is turned on when the global or a table checkpoint is set

Contributor

Keep it the same as 2.x: only flush after 30 seconds and when cp.globalPoint.outOfDate(), which is your current implementation.

Contributor

IsDirty/NeedFlush in checkpoint interface?

would be more intuitive

Contributor Author

Keep it the same as 2.x: only flush after 30 seconds and when cp.globalPoint.outOfDate(), which is your current implementation.

In v2.0.7, I think the behaviour is

  1. in addJob/checkWait, we mostly flush checkpoints every checkpoint-flush-interval
  2. in FlushPointsExcept, we generate the flush SQL by inspecting outOfDate() for each table point, and outOfDate()/SaveTimeIsZero/NeedFlushSafeModeExitPoint for the global point.

My current code adds a "step 1.5": after every checkpoint-flush-interval, we check outOfDate()/SaveTimeIsZero/NeedFlushSafeModeExitPoint of the global point to decide whether to go to FlushPointsExcept in step 2. That causes no flushing under pessimistic sharding, and if the DM-worker also fails to flush on exit, the progress is replayed after restart. Not a big problem, but I prefer to avoid it.

Currently I prefer to add IsDirty/NeedFlush to the checkpoint interface.
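
For illustration, a minimal sketch of what such a dirty flag could look like (hypothetical type and method names; the real DM checkpoint interface is larger):

package checkpoint

import "sync/atomic"

// dirtyCheckPoint sketches the IsDirty idea: saving the global point or any
// table point marks the checkpoint dirty, and the interval-based flush check
// consults the flag instead of only the global point's save time, so the
// pessimistic-sharding case (table points advance while the global point
// stays put) still gets flushed.
type dirtyCheckPoint struct {
    dirty atomic.Bool
}

func (cp *dirtyCheckPoint) saveGlobalPoint() {
    // ... update the in-memory global point ...
    cp.dirty.Store(true)
}

func (cp *dirtyCheckPoint) saveTablePoint() {
    // ... update an in-memory table point ...
    cp.dirty.Store(true)
}

// needFlush is what the periodic check would call.
func (cp *dirtyCheckPoint) needFlush() bool {
    return cp.dirty.Load()
}

func (cp *dirtyCheckPoint) flush() {
    // ... persist the points downstream, then clear the flag ...
    cp.dirty.Store(false)
}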

Contributor

in v5.3.0, we always flush the checkpoint every 30s no matter whether the content has changed.

func (cp *RemoteCheckPoint) CheckGlobalPoint() bool {
    cp.RLock()
    defer cp.RUnlock()
    return time.Since(cp.globalPointSaveTime) >= time.Duration(cp.cfg.CheckpointFlushInterval)*time.Second
}

The new async checkpoint flush feature added another check of both the global checkpoint and the table checkpoints. At the moment, this PR only wants to add another check of the global checkpoint, but that can't handle the pessimistic sharding case. For now I think an atomic boolean IsDirty can represent whether the global or a table checkpoint has been updated.

I also want to confirm: why did we add another check compared with v5.3.0? @db-will

the extra check was added to prevent continuously calling the async flush when flushing the checkpoint is slow.
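
A rough sketch of the failure mode this guards against (hypothetical code, not DM's implementation): when the flush is asynchronous and slower than the event rate, recording the decision time up front keeps every subsequent row change from re-triggering the flush:

package checkpoint

import (
    "sync/atomic"
    "time"
)

type asyncFlusher struct {
    interval  time.Duration
    lastCheck atomic.Int64 // unix nanoseconds of the last flush decision
}

// onRowEvent is called for every replicated row change.
func (f *asyncFlusher) onRowEvent() {
    if time.Since(time.Unix(0, f.lastCheck.Load())) < f.interval {
        return // decided recently; skip, even if the flush is still running
    }
    // Record the decision time before the slow flush finishes, so the next
    // row changes do not all re-enter this branch and flood the log.
    f.lastCheck.Store(time.Now().UnixNano())
    go func() {
        // ... slow asynchronous checkpoint flush ...
    }()
}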

Signed-off-by: lance6716 <[email protected]>
@ti-chi-bot ti-chi-bot added size/M Denotes a PR that changes 30-99 lines, ignoring generated files. and removed size/S Denotes a PR that changes 10-29 lines, ignoring generated files. labels Mar 3, 2022
Signed-off-by: lance6716 <[email protected]>
@ti-chi-bot ti-chi-bot added size/XS Denotes a PR that changes 0-9 lines, ignoring generated files. and removed size/M Denotes a PR that changes 30-99 lines, ignoring generated files. labels Mar 7, 2022
@codecov-commenter

codecov-commenter commented Mar 7, 2022

Codecov Report

Merging #4744 (3860ae5) into master (9607554) will increase coverage by 0.0318%.
The diff coverage is 54.1884%.

Flag   Coverage Δ
cdc    59.8303% <54.1884%> (-0.0919%) ⬇️
dm     52.2372% <ø> (+0.2084%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

@@               Coverage Diff                @@
##             master      #4744        +/-   ##
================================================
+ Coverage   55.6402%   55.6721%   +0.0318%     
================================================
  Files           494        520        +26     
  Lines         61283      64641      +3358     
================================================
+ Hits          34098      35987      +1889     
- Misses        23750      25132      +1382     
- Partials       3435       3522        +87     

@lance6716
Contributor Author

manual test passed and PR description updated, PTAL @GMHDBJD @D3Hunter

Contributor

@GMHDBJD GMHDBJD left a comment

LGTM

@ti-chi-bot ti-chi-bot added the status/LGT1 Indicates that a PR has LGTM 1. label Mar 7, 2022
Contributor

@D3Hunter D3Hunter left a comment

LGTM

@ti-chi-bot ti-chi-bot added status/LGT2 Indicates that a PR has LGTM 2. and removed status/LGT1 Indicates that a PR has LGTM 1. labels Mar 7, 2022
@D3Hunter
Contributor

D3Hunter commented Mar 7, 2022

/run-all-tests

Contributor

@db-will db-will left a comment

Looks good!

@ti-chi-bot
Member

@db-will: Thanks for your review. The bot only counts approvals from reviewers and higher roles in list, but you're still welcome to leave your comments.

In response to this:

Looks good!

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the ti-community-infra/tichi repository.

@lance6716
Contributor Author

/merge

@ti-chi-bot
Member

This pull request has been accepted and is ready to merge.

Commit hash: 3860ae5

@ti-chi-bot ti-chi-bot added the status/can-merge Indicates a PR has been approved by a committer. label Mar 8, 2022
@ti-chi-bot ti-chi-bot merged commit aacbbcf into pingcap:master Mar 8, 2022
ti-chi-bot pushed a commit to ti-chi-bot/tiflow that referenced this pull request Mar 8, 2022
@ti-chi-bot
Member

In response to a cherrypick label: new pull request created: #4801.

@lance6716 lance6716 deleted the fix-log-flood branch October 13, 2022 08:38
Successfully merging this pull request may close these issues:

replication is slow and many “checkpoint has no change, skip sync flush checkpoint” in DM log