dm validator should retry on transient error and there should be no deadlock when bad thing happens #9257
Labels
affects-6.5
This bug affects the 6.5.x(LTS) versions.
affects-6.6
affects-7.0
affects-7.1
This bug affects the 7.1.x(LTS) versions.
area/dm
Issues or PRs related to DM.
severity/moderate
type/bug
The issue is confirmed as a bug.
Comments
another example of a retryable error is
the deadlock issue is similar to the one described in #7241 (comment), but it's on
This was referenced Aug 11, 2023
3AceShowHand pushed a commit to 3AceShowHand/tiflow that referenced this issue Aug 29, 2023
What did you do?
there are 2 issues:
1. `Error 9005: Region is unavailable` is not retryable even though it could/should be a transient error: https://github.com/pingcap/tiflow/blob/master/dm/syncer/validate_worker.go#L170
2. `dmctl` operations on the task can get stuck; in particular, one cannot restart the task with `dmctl stop-task`/`start-task` because of a deadlock:
   - in the data validator thread, `stopInner` waits on `wg` for all workers to be done (https://github.com/pingcap/tiflow/blob/v7.1.0/dm/syncer/data_validator.go#L743), and nothing drains `errChan` in the meantime
   - meanwhile, the worker threads send errors to `errChan` (https://github.com/pingcap/tiflow/blob/v7.1.0/dm/syncer/validate_worker.go#L170). `errChan` fills up because the validator doesn't drain the errors, so a worker blocks on `errChan` while holding `wg` unreleased, which keeps `stopInner` from finishing
What did you expect to see?
`stop-task`/`start-task` should work in any case
What did you see instead?
Versions of the cluster
master
current status of DM cluster (execute `query-status <task-name>` in dmctl)