-
Notifications
You must be signed in to change notification settings - Fork 101
lightning: check and restore pd scheduler even if our task failed #1336
Conversation
|
||
err := exec.Transact(ctx, "check and init task status", func(ctx context.Context, tx *sql.Tx) error { | ||
// avoid override existing metadata if the meta is already inserted. | ||
stmt := fmt.Sprintf(`INSERT IGNORE INTO %s (task_id, status) values (?, ?)`, m.tableName) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why insert be fore select......
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ensure there is exact one row represent current lightning. Need the reset the task status if current task's status is exit_unfinished
@@ -1435,6 +1439,7 @@ func (rc *Controller) restoreTables(ctx context.Context) error { | |||
// finishSchedulers() | |||
// cancelFunc(switchBack) | |||
// finishFuncCalled = true | |||
taskFinished = true |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What is it used for?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If current task is exit before lightning finished (maybe met error or by user terminating), we should not clean up the task/table meta tables if all other lightning are finished
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh, I see. finishSchedulers
will be called before this method ending...... Emmmmmm, it seems to be difficult to understand
) | ||
|
||
func (m taskMetaStatus) realStatus() taskMetaStatus { | ||
return m & (taskMetaStatusExitUnfinished - 1) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why define a method realstatus
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
[REVIEW NOTIFICATION] This pull request has been approved by:
To complete the pull request process, please ask the reviewers in the list to review by filling The full list of commands accepted by this bot can be found here. Reviewer can indicate their review by submitting an approval review. |
// finishSchedulers() | ||
// cancelFunc(switchBack) | ||
// finishFuncCalled = true |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Need clean up?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We originally want to restore pd schedulers and switch back tikv to normal mode after data import finished. Then the cluster can do possible rebalance during checksum and analyze. But In our test, these rebalance will bring non-trivial impact to checksum and analyze. So we need to investigate further to determine whether we can still do this. So I think we can keep these before coming up with a clear conclusion.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
rest LGTM
|
||
err := exec.Transact(ctx, "check and init task status", func(ctx context.Context, tx *sql.Tx) error { | ||
// avoid override existing metadata if the meta is already inserted. | ||
stmt := fmt.Sprintf(`INSERT INTO %s (task_id, status) values (?, ?) ON DUPLICATE KEY UPDATE state = ?`, m.tableName) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the comment feels outdated.
@@ -105,6 +105,7 @@ const ( | |||
task_id BIGINT(20) UNSIGNED NOT NULL, | |||
pd_cfgs VARCHAR(2048) NOT NULL DEFAULT '', | |||
status VARCHAR(32) NOT NULL, | |||
state TINYINT(1) NOT NULL DEFAULT 0 COMMENT '0: normal, 1: exited before finish', |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
should the taskMetaTableName
be changed?
otherwise if we used Lightning before, and this CREATE TABLE IF NOT EXISTS means the task_meta
with the old table structure will be used
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not sure. But since we don't GA this feature and if lightning exited before finished, the current logic may still not be recover except manually drop the meta schema. We also don't recommend change lightning binary during one import task.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM after we have a consensus on #1336 (comment)
/run-integration_test |
1 similar comment
/run-integration_test |
Unstable configlist test will be fixed by #1393. |
/run-integration_test |
/merge |
This pull request has been accepted and is ready to merge. Commit hash: 799f4e0
|
/cherry-pick release-5.1 |
Signed-off-by: ti-chi-bot <[email protected]>
In response to a cherrypick label: new pull request created: #1422. |
@glorv: new pull request could not be created: failed to create pull request against pingcap/br#release-5.1 from head ti-chi-bot:cherry-pick-1336-to-release-5.1: status code 422 not one of [201], body: {"message":"Validation Failed","errors":[{"resource":"PullRequest","code":"custom","message":"A pull request already exists for ti-chi-bot:cherry-pick-1336-to-release-5.1."}],"documentation_url":"https://docs.github.com/rest/reference/pulls#create-a-pull-request"} In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the ti-community-infra/tichi repository. |
Signed-off-by: ti-chi-bot <[email protected]>
What problem does this PR solve?
Fix the bug that in concurrency import mode, if one or more lightning failed before import all tables finished, pd schedulers won't be restored.
What is changed and how it works?
This PR adds a new status to mark that one task is exit before finishing. A lightning instance will set its status to one of (checksum_skipping, checksumming, unfinished), other lightinng can check these statuses to make sure whether it is the last running instance. The last instance should recover the pd schedulers configs.
Check List
Tests
Code changes
Side effects
Related changes
Release note