syncer(dm): fix log flood and performance when frequently call Snapshot #4744
Conversation
Signed-off-by: lance6716 <[email protected]>
[REVIEW NOTIFICATION] This pull request has been approved by:
To complete the pull request process, please ask the reviewers in the list to review. The full list of commands accepted by this bot can be found here. Reviewers can indicate their review by submitting an approval review.
dm/syncer/checkpoint.go (outdated diff)
// When there's a big transaction in GTID-based replication, after the checkpoint-flush-interval every row change
// will trigger a check thus call Snapshot, the Snapshot should be light-weighted. We only check the global point to
// return quickly.
if !flushGlobalPoint {
so we don't need table-point-level snapshots now?
we save the table checkpoint just 8 lines below
When we receive a sharded DDL for multiple tables on the upstream database in pessimistic mode, will we have a case where the global point is not updated but the table checkpoint is updated?
Will we have any issues?
@GMHDBJD PTAL. Maybe we can add an IsDirty/NeedFlush in the checkpoint interface?
In v5.3.0, we always flush the checkpoint every 30s no matter whether the content has changed.
tiflow/dm/syncer/checkpoint.go
Lines 680 to 684 in 20626ba

func (cp *RemoteCheckPoint) CheckGlobalPoint() bool {
	cp.RLock()
	defer cp.RUnlock()
	return time.Since(cp.globalPointSaveTime) >= time.Duration(cp.cfg.CheckpointFlushInterval)*time.Second
}
The new async checkpoint flush feature added another check of the global checkpoint and the table checkpoints. At the moment, this PR only wants to add another check of the global checkpoint, but that can't handle the pessimistic sharding case. For now I think an atomic boolean IsDirty can represent whether the global or a table checkpoint has been updated.
I also want to confirm: why did we add another check compared with v5.3.0? @db-will
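Not the actual DM code, just a minimal sketch of what an atomic dirty flag could look like; dirtyTracker, markDirty and clearDirty are made-up names for illustration:

package syncer

import "sync/atomic"

// dirtyTracker is a hypothetical helper: it records whether any global or
// table checkpoint was updated since the last flush, using an atomic flag
// so the hot path (per-row replication) stays cheap.
type dirtyTracker struct {
	dirty int32
}

// markDirty would be called wherever a global or table point is saved.
func (d *dirtyTracker) markDirty() {
	atomic.StoreInt32(&d.dirty, 1)
}

// IsDirty would be consulted by the periodic flush check.
func (d *dirtyTracker) IsDirty() bool {
	return atomic.LoadInt32(&d.dirty) == 1
}

// clearDirty would be called after a successful flush.
func (d *dirtyTracker) clearDirty() {
	atomic.StoreInt32(&d.dirty, 0)
}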
@GMHDBJD what's the proposed behaviour?
- flush every 30 seconds no matter whether the checkpoint has changed
- after 30 seconds, check whether the checkpoint is dirty, where the dirty flag is turned on when global or table checkpoints are set
Keep it the same as 2.x: only flush after 30 seconds and when cp.globalPoint.outOfDate(), which is your current implementation.
An IsDirty/NeedFlush in the checkpoint interface would be more intuitive.
Keep it the same as 2.x: only flush after 30 seconds and when cp.globalPoint.outOfDate(), which is your current implementation.

In v2.0.7, I think the behaviour is:
- in addJob/checkWait, mostly we flush checkpoints every checkpoint-flush-interval
- in FlushPointsExcept, we generate the flush SQL by inspecting outOfDate() for each table point, and outOfDate()/SaveTimeIsZero/NeedFlushSafeModeExitPoint for the global point.
My current code adds a "step 1.5": after every checkpoint-flush-interval, we check outOfDate()/SaveTimeIsZero/NeedFlushSafeModeExitPoint of the global point to decide whether to go on to FlushPointsExcept in step 2. That will cause no flushing in pessimistic sharding; if the DM-worker also fails to flush on exit, the progress is replayed after restart. Not a big problem, but I prefer to avoid it.
Currently I prefer to add IsDirty/NeedFlush to the checkpoint interface.
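As a rough sketch of that "step 1.5" gating (not the real syncer code; flushGate, needFlush and lastFlushTime are placeholder names), it could look like this:

package syncer

import "time"

// flushGate sketches the idea: after every checkpoint-flush-interval,
// flush only if some checkpoint (global or table) actually changed.
type flushGate struct {
	interval      time.Duration
	lastFlushTime time.Time
	needFlush     bool // set wherever a global or table point is saved
}

func (g *flushGate) shouldFlush(now time.Time) bool {
	if now.Sub(g.lastFlushTime) < g.interval {
		return false // interval not reached yet
	}
	if !g.needFlush {
		return false // nothing changed, skip FlushPointsExcept entirely
	}
	g.lastFlushTime = now
	g.needFlush = false
	return true
}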
In v5.3.0, we always flush the checkpoint every 30s no matter whether the content has changed.
tiflow/dm/syncer/checkpoint.go
Lines 680 to 684 in 20626ba
func (cp *RemoteCheckPoint) CheckGlobalPoint() bool {
	cp.RLock()
	defer cp.RUnlock()
	return time.Since(cp.globalPointSaveTime) >= time.Duration(cp.cfg.CheckpointFlushInterval)*time.Second
}
The new async checkpoint flush feature added another check of the global checkpoint and the table checkpoints. At the moment, this PR only wants to add another check of the global checkpoint, but that can't handle the pessimistic sharding case. For now I think an atomic boolean IsDirty can represent whether the global or a table checkpoint has been updated.
I also want to confirm: why did we add another check compared with v5.3.0? @db-will
The extra check was added to prevent continuously calling async flush when flushing the checkpoint is slow.
Signed-off-by: lance6716 <[email protected]>
Signed-off-by: lance6716 <[email protected]>
Codecov Report
Flags with carried forward coverage won't be shown.

@@            Coverage Diff             @@
##           master      #4744      +/- ##
================================================
+ Coverage   55.6402%   55.6721%   +0.0318%
================================================
  Files           494        520        +26
  Lines         61283      64641      +3358
================================================
+ Hits          34098      35987      +1889
- Misses        23750      25132      +1382
- Partials       3435       3522        +87
LGTM
LGTM
/run-all-tests
Looks good!
@db-will: Thanks for your review. The bot only counts approvals from reviewers and higher roles in the list, but you're still welcome to leave your comments. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the ti-community-infra/tichi repository.
/merge
This pull request has been accepted and is ready to merge. Commit hash: 3860ae5
Signed-off-by: ti-chi-bot <[email protected]>
In response to a cherrypick label: new pull request created: #4801.
Signed-off-by: lance6716 [email protected]
What problem does this PR solve?
Issue Number: close #4619
What is changed and how it works?
Always update lastSnapshotCreationTime when calling Snapshot. Now checkShouldFlush will return false after seeing lastSnapshotCreationTime, so we will not call Snapshot too frequently.
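A simplified sketch of this behaviour; the struct, locking and field layout below are illustrative placeholders, not the actual checkpoint implementation:

package syncer

import (
	"sync"
	"time"
)

// snapshotThrottle illustrates the fix: every Snapshot call, even one that
// produces an empty snapshot, records lastSnapshotCreationTime, so
// checkShouldFlush stays false until checkpoint-flush-interval has passed,
// and a big transaction no longer triggers a snapshot per row change.
type snapshotThrottle struct {
	mu                       sync.Mutex
	lastSnapshotCreationTime time.Time
	flushInterval            time.Duration
}

func (s *snapshotThrottle) Snapshot() {
	s.mu.Lock()
	defer s.mu.Unlock()
	// ... build the (possibly empty) snapshot here ...
	s.lastSnapshotCreationTime = time.Now() // always updated now
}

func (s *snapshotThrottle) checkShouldFlush() bool {
	s.mu.Lock()
	defer s.mu.Unlock()
	return time.Since(s.lastSnapshotCreationTime) >= s.flushInterval
}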
Check List
Tests
Generate two big transactions and set checkpoint-flush-interval to 5s. Start the task with a complex GTID set consisting of 10 UUID parts, meet the first big transaction, create 500 tables with a DML for each table to record 500 table checkpoints, then meet the second big transaction.
Before this PR, the QPS of the first big transaction is 16.8k and the second one is 3.7k.
After this PR, the QPS is 16.5k for both.
Code changes
Side effects
Related changes
Release note