Skip to content
This repository has been archived by the owner on Jul 24, 2024. It is now read-only.

err_msg: clarify some ambiguous error message #1076

Merged
merged 24 commits into from
May 21, 2021
Merged

err_msg: clarify some ambiguous error message #1076

merged 24 commits into from
May 21, 2021

Conversation

ZipFast
Copy link
Contributor

@ZipFast ZipFast commented Apr 29, 2021

What problem does this PR solve?

Keeping improve BR errors and logs #1055

Some error message in BR is ambiguous and misleading.
i.e.

  • If tikv has no permission to read from & write to backup storage, error message cannot clarify whether error occur on br side or tikv side. Users may think why I got "permission denied"? I'm running BR as root!
  • If user want to restore tables or databases that have not been backup, error should occur.

Clarify the summary info including:

  • data-size => total kv size
  • size => backup data size(after compressed)

What is changed and how it works?

  1. Adjust the backup and restore summary:
  • the data-size field has been changed to total-kv-size
  • the size field in backup summary has been changed to backup data size(after compressed)
  • the size field in restore summary has been changed to restore data size(after decompressed)
  1. Some error messages are changed, and will clarify what is going on when error occur.
  • Original error message is as following
[2021/05/18 16:37:28.517 +08:00] [ERROR] [push.go:147] ["backup occur unknown error"] [error="Io(Os { code: 13, kind: PermissionDenied...})"] [stack="github.com/pingcap/br/pkg/backup.(*pushDown).pushBackup\n\t/home/zwj/log_improve/br/pkg/backup/push.go:147\ngithub.jparrowsec.cn/pingcap/br/pkg/backup.(*Client).BackupRange\n\t/home/zwj/log_improve/br/pkg/backup/client.go:540\ngithub.jparrowsec.cn/pingcap/br/pkg/backup.(*Client).BackupRanges.func2.1\n\t/home/zwj/log_improve/br/pkg/backup/client.go:476\ngithub.jparrowsec.cn/pingcap/br/pkg/utils.(*WorkerPool).ApplyOnErrorGroup.func1\n\t/home/zwj/log_improve/br/pkg/utils/worker.go:63\ngolang.org/x/sync/errgroup.(*Group).Go.func1\n\t/root/go/pkg/mod/golang.org/x/[email protected]/errgroup/errgroup.go:57"]
  • For now, the error message output is as following:
[2021/05/19 15:43:43.305 +08:00] [ERROR] [push.go:175] [error="[BR:KV:ErrKVStorage]tikv storage occur I/O error: I/O permission denied error occurs on TiKV Node(store id: 1; Address: 127.0.0.1:20162)"] ["work around"="please ensure tikv has permission to read from & write to the storage."] [stack="github.com/pingcap/br/pkg/backup.(*pushDown).pushBackup\n\t/home/zwj/log_improve/br/pkg/backup/push.go:175\ngithub.jparrowsec.cn/pingcap/br/pkg/backup.(*Client).BackupRange\n\t/home/zwj/log_improve/br/pkg/backup/client.go:540\ngithub.jparrowsec.cn/pingcap/br/pkg/backup.(*Client).BackupRanges.func2.1\n\t/home/zwj/log_improve/br/pkg/backup/client.go:476\ngithub.jparrowsec.cn/pingcap/br/pkg/utils.(*WorkerPool).ApplyOnErrorGroup.func1\n\t/home/zwj/log_improve/br/pkg/utils/worker.go:63\ngolang.org/x/sync/errgroup.(*Group).Go.func1\n\t/root/go/pkg/mod/golang.org/x/[email protected]/errgroup/errgroup.go:57"]
  1. When user want to restore tables or databases that have not been backup, error should occur.

Check List

Tests

  • Manual test (add detailed scripts or steps below)
  1. If user want to restore a table or database that has not been backup, an error should occur to notice the user
    image
    image
  2. If user restore a empty table, the summary should notice the user there is nothing to restore.
    image
  3. Summary message have improve, data-size => total-kv-size and size => backup data size(after compressed)
    image
  • Integration test

Code changes

  • Has exported variable/fields change

Release Note

  • No realease note

@ZipFast
Copy link
Contributor Author

ZipFast commented Apr 29, 2021

/cc @3pointer /cc @YuJuncen

@ti-chi-bot ti-chi-bot requested a review from YuJuncen April 29, 2021 10:06
@ti-chi-bot
Copy link
Member

@Zwj-coder: GitHub didn't allow me to request PR reviews from the following users: /cc.

Note that only pingcap members and repo collaborators can review this PR, and authors cannot review their own PRs.

In response to this:

/cc @3pointer /cc @YuJuncen

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@YuJuncen
Copy link
Collaborator

/lgtm

@ti-chi-bot ti-chi-bot added the status/LGT1 LGTM1 label Apr 30, 2021
@YuJuncen
Copy link
Collaborator

YuJuncen commented Apr 30, 2021

Eh, by the way, I have an idea about how to handle this, which might make this PR better. (also cc @3pointer, PTAL)

At here:

br/pkg/backup/push.go

Lines 140 to 141 in b1ed365

log.Error("backup occur unknown error", zap.String("error", errPb.GetMsg()))
return res, errors.Annotatef(berrors.ErrKVUnknown, "%v", errPb)

The 'unknown error' omits many context, which make it confusing -- users may think this error occurs in the internal of PR. How about add the store into the context? like:

 return res, errors.Annotatef(berrors.ErrKVUnknown, "error occurs in the TiKV Node %s(@%s): %v", 
    store.GetID(), 
    redact.String(store.GetAddress()), // The net address might be sensitive, so we might need to redact them.
    errPb)

After this, I think, users can notice that this error occurs in TiKV, and we can keep the description of ErrKVUnknown untouched, in case of misleading in non-permission relative cases.

NOTEs:

Unfortunately, currently, our pushDown make multi workers and receive response from them by a channel(respCh), which omits the store info:

br/pkg/backup/push.go

Lines 25 to 29 in b1ed365

type pushDown struct {
mgr ClientMgr
respCh chan *backuppb.BackupResponse
errCh chan error
}

You may would like to create a type that contains both response and store info, and replace this type to respCh, then send the store info like this:

push.respCh <- RespAndStore{
	Resp:  resp,
	Store: store,
}

... at here:

br/pkg/backup/push.go

Lines 79 to 83 in b1ed365

func(resp *backuppb.BackupResponse) error {
// Forward all responses (including error).
push.respCh <- resp
return nil
},

@ZipFast
Copy link
Contributor Author

ZipFast commented Apr 30, 2021

Eh, by the way, I have an idea about how to handle this, which might make this PR better. (also cc @3pointer, PTAL)

At here:

br/pkg/backup/push.go

Lines 140 to 141 in b1ed365

log.Error("backup occur unknown error", zap.String("error", errPb.GetMsg()))
return res, errors.Annotatef(berrors.ErrKVUnknown, "%v", errPb)

The 'unknown error' omits many context, which make it confusing -- users may think this error occurs in the internal of PR. How about add the store into the context? like:

 return res, errors.Annotatef(berrors.ErrKVUnknown, "error occurs in the TiKV Node %s(@%s): %v", 
    store.GetID(), 
    redact.String(store.GetAddress()), // The net address might be sensitive, so we might need to redact them.
    errPb)

I think after this, users can notice that this error occurs in TiKV, and we can keep the description of ErrKVUnknown untouched, in case of misleading in non-permission relative cases.

NOTEs:

Unfortunately, currently, our pushDown make multi workers and receive response from them by a channel(respCh), which omits the store info:

br/pkg/backup/push.go

Lines 25 to 29 in b1ed365

type pushDown struct {
mgr ClientMgr
respCh chan *backuppb.BackupResponse
errCh chan error
}

You may would like to create a type that contains both response and store info, and replace this type to respCh, then send the store info like this:

push.respCh <- RespAndStore{
	Resp:  resp,
	Store: store,
}

... at here:

br/pkg/backup/push.go

Lines 79 to 83 in b1ed365

func(resp *backuppb.BackupResponse) error {
// Forward all responses (including error).
push.respCh <- resp
return nil
},

I think since the error stack contains both error message from tikv and br, if the error occur at the tikv side, it's tikv's responsibility to report detailed error message.

@YuJuncen
Copy link
Collaborator

@Zwj-coder Yep, but I think BR can figure out on WHICH TiKV error happens , IMO this could be helpful for debugging. 🤔

@ZipFast
Copy link
Contributor Author

ZipFast commented Apr 30, 2021

@YuJuncen I agree, but if the tikv cluster is configured in the same way, one of them have permission problem probably means rest of them also have same problem, so we could add this store help information for debugging usage.

@YuJuncen
Copy link
Collaborator

YuJuncen commented Apr 30, 2021

@Zwj-coder Well, that is true. Even I'm still afraid this message being misleading in some conditions... 😂

@ZipFast ZipFast marked this pull request as draft April 30, 2021 09:30
@ZipFast ZipFast marked this pull request as ready for review May 6, 2021 02:10
@ZipFast
Copy link
Contributor Author

ZipFast commented May 6, 2021

/cc @YuJuncen @3pointer

@3pointer
Copy link
Collaborator

/lgtm

BTW Is internal error better than unknown error here?🤔️

@ti-chi-bot
Copy link
Member

[REVIEW NOTIFICATION]

This pull request has been approved by:

  • 3pointer
  • YuJuncen

To complete the pull request process, please ask the reviewers in the list to review by filling /cc @reviewer in the comment.
After your PR has acquired the required number of LGTMs, you can assign this pull request to the committer in the list by filling /assign @committer in the comment to help you merge this pull request.

The full list of commands accepted by this bot can be found here.

Reviewer can indicate their review by writing /lgtm in a comment.
Reviewer can cancel approval by writing /lgtm cancel in a comment.

@ti-chi-bot ti-chi-bot added status/LGT2 LGTM2 and removed status/LGT1 LGTM1 labels May 11, 2021
@ZipFast ZipFast added this to the v4.0.13 milestone May 11, 2021
pkg/utils/permission.go Outdated Show resolved Hide resolved
@3pointer
Copy link
Collaborator

/run-integration-tests

if tblFlag := flags.Lookup(flagTable); tblFlag != nil {
tbl := tblFlag.Value.String()
if len(tbl) == 0 {
return errors.Annotate(berrors.ErrInvalidArgument, "empty table name is not allowed")
}
cfg.Tables[tbl] = struct{}{}
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we could have the same tables in different dbs.
It's better to change the key from tbl to db.tbl.

pkg/task/restore.go Outdated Show resolved Hide resolved
@IANTHEREAL
Copy link
Collaborator

/lgtm

@ZipFast ZipFast merged commit 3c85312 into pingcap:master May 21, 2021
ZipFast pushed a commit that referenced this pull request May 21, 2021
@ZipFast ZipFast self-assigned this May 21, 2021
ti-chi-bot pushed a commit to ti-chi-bot/br that referenced this pull request May 21, 2021
@ti-chi-bot
Copy link
Member

In response to a cherrypick label: new pull request created: #1131.

@ti-chi-bot
Copy link
Member

In response to a cherrypick label: new pull request could not be created: failed to create pull request against pingcap/br#release-4.0 from head ti-chi-bot:cherry-pick-1076-to-release-4.0: status code 422 not one of [201], body: {"message":"Validation Failed","errors":[{"resource":"PullRequest","code":"custom","message":"A pull request already exists for ti-chi-bot:cherry-pick-1076-to-release-4.0."}],"documentation_url":"https://docs.github.com/rest/reference/pulls#create-a-pull-request"}

ti-chi-bot pushed a commit to ti-chi-bot/br that referenced this pull request May 21, 2021
ti-chi-bot pushed a commit to ti-chi-bot/br that referenced this pull request May 21, 2021
@ti-chi-bot
Copy link
Member

In response to a cherrypick label: new pull request created: #1132.

ti-chi-bot pushed a commit to ti-chi-bot/br that referenced this pull request May 21, 2021
@ti-chi-bot
Copy link
Member

In response to a cherrypick label: new pull request could not be created: failed to create pull request against pingcap/br#release-5.0 from head ti-chi-bot:cherry-pick-1076-to-release-5.0: status code 422 not one of [201], body: {"message":"Validation Failed","errors":[{"resource":"PullRequest","code":"custom","message":"A pull request already exists for ti-chi-bot:cherry-pick-1076-to-release-5.0."}],"documentation_url":"https://docs.github.com/rest/reference/pulls#create-a-pull-request"}

ZipFast pushed a commit that referenced this pull request May 24, 2021
* This is an automated cherry-pick of #1076

Signed-off-by: ti-chi-bot <[email protected]>

* resolve conflict

* resolve conflict

Co-authored-by: zhangwenjum <[email protected]>
3pointer added a commit to 3pointer/br that referenced this pull request Jun 4, 2021
* improve error message

* add debug info

* improve tikv unknown error

* refomat the error message

* add error message test

* remove tikv cluster health check

* add err

* separate storage io error from unknown error

* update push.go

* update backup/restore summary

* add nothing to restore info to summary

* update comments

* improve error message

* add restore table check

* add restore table check

* update glue

* update

* using db.table to identify

Co-authored-by: 3pointer <[email protected]>
3pointer added a commit to 3pointer/br that referenced this pull request Jun 4, 2021
* improve error message

* add debug info

* improve tikv unknown error

* refomat the error message

* add error message test

* remove tikv cluster health check

* add err

* separate storage io error from unknown error

* update push.go

* update backup/restore summary

* add nothing to restore info to summary

* update comments

* improve error message

* add restore table check

* add restore table check

* update glue

* update

* using db.table to identify

Co-authored-by: 3pointer <[email protected]>
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants