DM validator should retry on transient errors, and there should be no deadlock when errors occur #9257

Closed
hihihuhu opened this issue Jun 16, 2023 · 2 comments · Fixed by #9522
Labels
affects-6.5 This bug affects the 6.5.x(LTS) versions. affects-6.6 affects-7.0 affects-7.1 This bug affects the 7.1.x(LTS) versions. area/dm Issues or PRs related to DM. severity/moderate type/bug The issue is confirmed as a bug.

Comments

@hihihuhu
Contributor

What did you do?

There are two issues:

  1. Error 9005: Region is unavailable is not retried, even though it is most likely a transient error: https://github.com/pingcap/tiflow/blob/master/dm/syncer/validate_worker.go#L170
  2. A deadlock can occur when errors happen. It blocks every dmctl operation on the task; in particular, the task cannot be restarted with dmctl stop-task/start-task.

In the data validator thread:

Meanwhile, in the worker threads:

What did you expect to see?

  1. "Region is unavailable" is retried.
  2. No deadlock occurs, and the task can always be restarted with stop-task/start-task.

What did you see instead?

  1. "Region is unavailable" stops the validator.
  2. A deadlock occurs when many workers report errors.
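
To illustrate the retry behavior expected in point 1, here is a minimal Go sketch. It is not the tiflow implementation: isTransient, retryTransient, and the message-matching approach are hypothetical names and choices made for illustration, and the error values are stand-ins built from the RawCause strings reported in this issue.

```go
package main

import (
	"errors"
	"fmt"
	"strings"
	"time"
)

// Stand-in errors for this sketch. The first two mirror RawCause strings
// from this issue; errPermanent is an invented non-transient failure.
var (
	errRegionUnavailable = errors.New("Error 9005: Region is unavailable")
	errNoConnections     = errors.New("Error 1105: no available connections")
	errPermanent         = errors.New("some non-transient failure")
)

// transientFragments lists message fragments this sketch treats as transient.
var transientFragments = []string{
	"Error 9005: Region is unavailable",
	"Error 1105: no available connections",
}

// isTransient (hypothetical helper) reports whether err looks retryable.
func isTransient(err error) bool {
	if err == nil {
		return false
	}
	for _, fragment := range transientFragments {
		if strings.Contains(err.Error(), fragment) {
			return true
		}
	}
	return false
}

// retryTransient runs op up to maxAttempts times (maxAttempts must be at
// least 1), sleeping backoff between attempts. It stops early on success or
// on a non-transient error, and returns the number of attempts made together
// with the last error.
func retryTransient(maxAttempts int, backoff time.Duration, op func() error) (int, error) {
	var err error
	for attempt := 1; attempt <= maxAttempts; attempt++ {
		err = op()
		if err == nil || !isTransient(err) {
			return attempt, err
		}
		if attempt < maxAttempts {
			time.Sleep(backoff)
		}
	}
	return maxAttempts, err
}

func main() {
	calls := 0
	attempts, err := retryTransient(5, 10*time.Millisecond, func() error {
		calls++
		if calls < 3 {
			return errRegionUnavailable // fails twice, then recovers
		}
		return nil
	})
	fmt.Println(attempts, err) // prints "3 <nil>"
}
```

Matching on message text keeps the sketch free of dependencies; a real fix would more likely inspect the driver's typed error and its numeric code.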

Versions of the cluster

master

current status of DM cluster (execute query-status <task-name> in dmctl)

{
    "result": true,
    "msg": "",
    "sources": [
        {
            "result": true,
            "msg": "",
            "sourceStatus": {
                "source": "tidbmigration-sample-src",
                "worker": "test-single-cell-dm-worker-0",
                "result": null,
                "relayStatus": null
            },
            "subTaskStatus": [
                {
                    "name": "tidbmigration-sample-task",
                    "stage": "Running",
                    "unit": "Sync",
                    "result": null,
                    "unresolvedDDLLockID": "",
                    "sync": {
                        "totalEvents": "19635499",
                        "totalTps": "41072",
                        "recentTps": "12483",
                        "masterBinlog": "(mysql-bin-changelog.061817, 57798824)",
                        "masterBinlogGtid": "",
                        "syncerBinlog": "(mysql-bin-changelog.061817, 55744546)",
                        "syncerBinlogGtid": "00000000-0000-0000-0000-000000000000:0",
                        "blockingDDLs": [
                        ],
                        "unresolvedGroups": [
                        ],
                        "synced": false,
                        "binlogType": "remote",
                        "secondsBehindMaster": "0",
                        "blockDDLOwner": "",
                        "conflictMsg": "",
                        "totalRows": "19635499",
                        "totalRps": "41072",
                        "recentRps": "12483"
                    },
                    "validation": {
                        "task": "tidbmigration-sample-task",
                        "source": "tidbmigration-sample-src",
                        "mode": "full",
                        "stage": "Running",
                        "validatorBinlog": "(mysql-bin-changelog.061762, 77247898)",
                        "validatorBinlogGtid": "00000000-0000-0000-0000-000000000000:0",
                        "result": {
                            "isCanceled": false,
                            "errors": [
                                {
                                    "ErrCode": 43005,
                                    "ErrClass": "validator",
                                    "ErrScope": "internal",
                                    "ErrLevel": "high",
                                    "Message": "failed to validate row change",
                                    "RawCause": "Error 9005: Region is unavailable",
                                    "Workaround": ""
                                }
                            ],
                            "detail": null
                        },
                        "processedRowsStatus": "insert/update/delete: 1673826/1673561/371420",
                        "pendingRowsStatus": "insert/update/delete: 17259/15936/3015",
                        "errorRowsStatus": "new/ignored/resolved: 0/0/0"
                    }
                }
            ]
        }
    ]
}
hihihuhu added the area/dm and type/bug labels on Jun 16, 2023
GMHDBJD added the severity/moderate, affects-6.5, affects-6.6, affects-7.0, and affects-7.1 labels on Jun 19, 2023
@hihihuhu
Contributor Author

hihihuhu commented Jul 5, 2023

Another example of a retryable error:

...
                    "validation": {
                        "task": "tidbmigration-sample-task",
                        "source": "tidbmigration-sample-src",
                        "mode": "full",
                        "stage": "Stopped",
                        "validatorBinlog": "(mysql-bin-changelog.113098, 21486296)",
                        "validatorBinlogGtid": "00000000-0000-0000-0000-000000000000:0",
                        "result": {
                            "isCanceled": false,
                            "errors": [
                                {
                                    "ErrCode": 43005,
                                    "ErrClass": "validator",
                                    "ErrScope": "internal",
                                    "ErrLevel": "high",
                                    "Message": "failed to validate row change",
                                    "RawCause": "Error 1105: no available connections",
                                    "Workaround": ""
                                }
                            ],
                            "detail": null
                        },
                        "processedRowsStatus": "insert/update/delete: 2455061263/2454996366/545577884",
                        "pendingRowsStatus": "insert/update/delete: 690/668/160",
                        "errorRowsStatus": "new/ignored/resolved: 0/0/0"
                    }

@D3Hunter
Contributor

D3Hunter commented Aug 8, 2023

The deadlock is similar to the one described in #7241 (comment), but it is on validator.lock, not wg.
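
The stack traces are not reproduced in this thread, so the following Go sketch only shows one common shape of such a deadlock and a way to avoid it. It is an assumption about the mechanism, not the actual tiflow code: many workers report errors on a small channel, and the stopping goroutine waits for them. If the stopper held the lock while waiting, or never signalled cancellation, workers blocked on the channel or on the lock would never finish. The names reportError, stopOnFirstError, and errValidate are hypothetical, and lock merely stands in for validator.lock.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"sync"
)

// errValidate is a stand-in for the error each worker reports.
var errValidate = errors.New("failed to validate row change")

// reportError delivers err unless ctx is cancelled first, so a worker can
// never block forever on a full channel.
func reportError(ctx context.Context, errCh chan<- error, err error) bool {
	select {
	case errCh <- err:
		return true
	case <-ctx.Done():
		return false
	}
}

// stopOnFirstError starts workers goroutines (workers must be at least 1)
// that all fail, handles the first error, then stops. It returns how many
// reports were abandoned because the stop was already in progress, plus the
// first error received.
func stopOnFirstError(workers int) (int, error) {
	ctx, cancel := context.WithCancel(context.Background())
	defer cancel()

	errCh := make(chan error, 1) // small on purpose: most senders must wait
	var (
		lock      sync.Mutex // stand-in for validator.lock
		wg        sync.WaitGroup
		abandoned int
	)

	for i := 0; i < workers; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			if !reportError(ctx, errCh, errValidate) {
				lock.Lock() // workers need the lock to record the outcome
				abandoned++
				lock.Unlock()
			}
		}()
	}

	first := <-errCh

	// Cancel before waiting so blocked senders are released, and wait
	// WITHOUT holding lock. Holding lock across wg.Wait would leave the
	// workers above stuck in lock.Lock, and skipping cancel would leave
	// them stuck on the channel send.
	cancel()
	wg.Wait()

	return abandoned, first
}

func main() {
	abandoned, first := stopOnFirstError(8)
	fmt.Println("first error:", first, "abandoned reports:", abandoned)
}
```

The design choice is that every blocking operation in a worker has an escape hatch tied to the same cancellation signal the stop path raises, so stopping always completes regardless of how many workers fail at once.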
