This repository has been archived by the owner on Jul 24, 2024. It is now read-only.

BR should retry RegionError on BatchSplitRegions #219

Closed
overvenus opened this issue Mar 30, 2020 · 5 comments · Fixed by #247
Labels
type/bug Something isn't working

Comments

@overvenus
Member

Integration test fails #214 (comment)

pretty printed backtrace

[2020-03-30T09:59:53.436Z] [2020/03/30 17:59:53.217 +08:00] [ERROR] [restore.go:238] ["split regions failed"] [error="split region failed: region=id:3828 start_key:\"t\\200\\000\\000\\000\\000\\000\\003\\377\\275_r\\000\\000\\000\\000\\000\\372\" end_key:\"t\\200\\000\\000\\000\\000\\000\\003\\377\\277\\000\\000\\000\\000\\000\\000\\000\\370\" region_epoch:<conf_ver:20 version:1049 > peers:<id:3829 store_id:6 > peers:<id:3830 store_id:1 > peers:<id:3831 store_id:5 > , err=message:\"peer is not leader for region 3828, leader may Some(id: 3831 store_id: 5)\" not_leader:<region_id:3828 leader:<id:3831 store_id:5 > > "] [errorVerbose="split region failed: region=id:3828 start_key:\"t\\200\\000\\000\\000\\000\\000\\003\\377\\275_r\\000\\000\\000\\000\\000\\372\" end_key:\"t\\200\\000\\000\\000\\000\\000\\003\\377\\277\\000\\000\\000\\000\\000\\000\\000\\370\" region_epoch:<conf_ver:20 version:1049 > peers:<id:3829 store_id:6 > peers:<id:3830 store_id:1 > peers:<id:3831 store_id:5 > , err=message:\"peer is not leader for region 3828, leader may Some(id: 3831 store_id: 5)\" not_leader:<region_id:3828 leader:<id:3831 store_id:5 > > 
github.com/pingcap/br/pkg/restore.(*pdClient).BatchSplitRegions
	/home/jenkins/agent/workspace/br_ghpr_unit_and_integration_test/go/src/github.com/pingcap/br/pkg/restore/split_client.go:230
github.com/pingcap/br/pkg/restore.(*RegionSplitter).splitAndScatterRegions
	/home/jenkins/agent/workspace/br_ghpr_unit_and_integration_test/go/src/github.com/pingcap/br/pkg/restore/split.go:316
github.com/pingcap/br/pkg/restore.(*RegionSplitter).Split
	/home/jenkins/agent/workspace/br_ghpr_unit_and_integration_test/go/src/github.com/pingcap/br/pkg/restore/split.go:118
github.com/pingcap/br/pkg/restore.SplitRanges
	/home/jenkins/agent/workspace/br_ghpr_unit_and_integration_test/go/src/github.com/pingcap/br/pkg/restore/util.go:344
github.com/pingcap/br/pkg/task.RunRestore
	/home/jenkins/agent/workspace/br_ghpr_unit_and_integration_test/go/src/github.com/pingcap/br/pkg/task/restore.go:236
github.com/pingcap/br/cmd.runRestoreCommand
	/home/jenkins/agent/workspace/br_ghpr_unit_and_integration_test/go/src/github.com/pingcap/br/cmd/restore.go:21
github.com/pingcap/br/cmd.newDbRestoreCommand.func1
	/home/jenkins/agent/workspace/br_ghpr_unit_and_integration_test/go/src/github.com/pingcap/br/cmd/restore.go:93
github.com/spf13/cobra.(*Command).execute
	/go/pkg/mod/github.com/spf13/[email protected]/command.go:826
github.com/spf13/cobra.(*Command).ExecuteC
	/go/pkg/mod/github.com/spf13/[email protected]/command.go:914
github.com/spf13/cobra.(*Command).Execute
	/go/pkg/mod/github.com/spf13/[email protected]/command.go:864
github.com/pingcap/br.main
	/home/jenkins/agent/workspace/br_ghpr_unit_and_integration_test/go/src/github.com/pingcap/br/main.go:54
github.com/pingcap/br.TestRunMain.func1
	/home/jenkins/agent/workspace/br_ghpr_unit_and_integration_test/go/src/github.com/pingcap/br/main_test.go:39
runtime.goexit
	/usr/local/go/src/runtime/asm_amd64.s:1357"] [stack="github.com/pingcap/log.Error
	/go/pkg/mod/github.com/pingcap/[email protected]/global.go:42
github.com/pingcap/br/pkg/task.RunRestore
	/home/jenkins/agent/workspace/br_ghpr_unit_and_integration_test/go/src/github.com/pingcap/br/pkg/task/restore.go:238
github.com/pingcap/br/cmd.runRestoreCommand
	/home/jenkins/agent/workspace/br_ghpr_unit_and_integration_test/go/src/github.com/pingcap/br/cmd/restore.go:21
github.com/pingcap/br/cmd.newDbRestoreCommand.func1
	/home/jenkins/agent/workspace/br_ghpr_unit_and_integration_test/go/src/github.com/pingcap/br/cmd/restore.go:93
github.com/spf13/cobra.(*Command).execute
	/go/pkg/mod/github.com/spf13/[email protected]/command.go:826
github.com/spf13/cobra.(*Command).ExecuteC
	/go/pkg/mod/github.com/spf13/[email protected]/command.go:914
github.com/spf13/cobra.(*Command).Execute
	/go/pkg/mod/github.com/spf13/[email protected]/command.go:864
github.com/pingcap/br.main
	/home/jenkins/agent/workspace/br_ghpr_unit_and_integration_test/go/src/github.com/pingcap/br/main.go:54
github.com/pingcap/br.TestRunMain.func1
	/home/jenkins/agent/workspace/br_ghpr_unit_and_integration_test/go/src/github.com/pingcap/br/main_test.go:39"]

resp, err := client.SplitRegion(ctx, &kvrpcpb.SplitRegionRequest{
	Context: &kvrpcpb.Context{
		RegionId:    regionInfo.Region.Id,
		RegionEpoch: regionInfo.Region.RegionEpoch,
		Peer:        peer,
	},
	SplitKeys: keys,
})
if err != nil {
	return nil, err
}
if resp.RegionError != nil {
	return nil, errors.Errorf("split region failed: region=%v, err=%v", regionInfo.Region, resp.RegionError)
}

BR should retry on RegionError:

  • NotLeader
  • RegionNotFound
  • EpochNotMatch
  • ServerIsBusy
  • StaleCommand
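As a rough illustration of the list above, the check could look something like the sketch below. The types are simplified, hypothetical stand-ins for the kvproto region error messages (not the real generated structs), and `isRetryableRegionError` is a made-up helper name, not an existing BR function.

```go
package main

// Hypothetical, simplified stand-ins for the kvproto region error messages.
// Only the presence of each field matters for this sketch.
type NotLeader struct{ RegionID, LeaderStoreID uint64 }
type RegionNotFound struct{ RegionID uint64 }
type EpochNotMatch struct{}
type ServerIsBusy struct{ Reason string }
type StaleCommand struct{}
type KeyNotInRegion struct{}

// RegionError mirrors the "at most one field set" shape of a region error.
type RegionError struct {
	Message        string
	NotLeader      *NotLeader
	RegionNotFound *RegionNotFound
	EpochNotMatch  *EpochNotMatch
	ServerIsBusy   *ServerIsBusy
	StaleCommand   *StaleCommand
	KeyNotInRegion *KeyNotInRegion
}

// isRetryableRegionError reports whether e is one of the five errors listed
// in this issue as worth retrying. A nil error is not retryable because
// there is nothing to retry.
func isRetryableRegionError(e *RegionError) bool {
	if e == nil {
		return false
	}
	return e.NotLeader != nil ||
		e.RegionNotFound != nil ||
		e.EpochNotMatch != nil ||
		e.ServerIsBusy != nil ||
		e.StaleCommand != nil
}
```

With the error from the log above (`not_leader` for region 3828), this helper would return true, while an error outside the list, such as the `KeyNotInRegion` stand-in here, would be returned to the caller unchanged.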
@overvenus overvenus added the type/bug Something isn't working label Mar 30, 2020
@3pointer
Collaborator

The caller already retries up to 32 times on some region errors, except for the "no valid key" error.
NotLeader may just be the last error we got. Would adding more retries work?

@kennytm
Collaborator

kennytm commented Mar 31, 2020

perhaps at least record all errors, in a multierr or something.

@3pointer
Collaborator

> perhaps at least record all errors, in a multierr or something.

It logs all errors as warnings after 3 retries. I'll take a look at multierr.
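For reference, a minimal sketch of the "record all errors" idea, so that the final error shows every attempt rather than only the last one. It uses the standard library's `errors.Join` (Go 1.20+) as a stand-in for a multierr package; `retryCollect` and the two sentinel errors are hypothetical names for illustration only.

```go
package main

import (
	"errors"
	"fmt"
)

// Hypothetical sentinel errors standing in for region errors.
var (
	errNotLeader     = errors.New("not leader")
	errEpochNotMatch = errors.New("epoch not match")
)

// retryCollect calls op up to attempts times and stops at the first success.
// If every attempt fails, it returns all attempt errors joined together, so
// the caller sees the full history instead of just the last failure.
// With attempts <= 0, op is never called and the result is nil.
func retryCollect(attempts int, op func(attempt int) error) error {
	var errs []error
	for i := 0; i < attempts; i++ {
		err := op(i)
		if err == nil {
			return nil
		}
		errs = append(errs, fmt.Errorf("attempt %d: %w", i+1, err))
	}
	return errors.Join(errs...)
}
```

Because each attempt is wrapped with `%w`, `errors.Is(err, errNotLeader)` still works on the joined result, which keeps the option of inspecting individual causes later.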

@kennytm
Collaborator

kennytm commented Mar 31, 2020

@YuJuncen
Collaborator

This problem isn't specific to #214; any workload with small tables may hit this error, because RegionSplitter::Split retries the whole split process when BatchSplitRegions fails, and it's common for the leader to move while regions are being scattered.

Modifying BatchSplitRegions to retry on some RegionErrors would probably work; I will give it a try.
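One possible shape for that retry, sketched under heavy assumptions: on NotLeader, resend the split to the leader named in the error instead of restarting the whole split process. `Peer`, `NotLeader`, `RegionError`, `splitFunc`, and `splitWithLeaderRetry` are all hypothetical stand-ins and not the actual BR or kvproto definitions; `splitFunc` plays the role of the `client.SplitRegion` call quoted earlier in this issue. Backoff and the other retryable errors are left out to keep the sketch short.

```go
package main

import "fmt"

// Hypothetical, simplified stand-ins for the kvproto messages.
type Peer struct{ ID, StoreID uint64 }

type NotLeader struct{ Leader *Peer }

type RegionError struct {
	Message   string
	NotLeader *NotLeader
}

// splitFunc sends one split request to the given peer. It stands in for the
// real RPC and returns the region error (if any) and the transport error.
type splitFunc func(peer Peer) (*RegionError, error)

// splitWithLeaderRetry tries the split at most maxRetries times. On NotLeader
// with a leader hint it switches to the hinted peer and tries again; any
// other region error, or a NotLeader without a hint, is returned right away.
// It returns the peer that was used for the last attempt.
func splitWithLeaderRetry(peer Peer, maxRetries int, split splitFunc) (Peer, error) {
	lastMsg := "no attempt made"
	for i := 0; i < maxRetries; i++ {
		regionErr, err := split(peer)
		if err != nil {
			return peer, err
		}
		if regionErr == nil {
			return peer, nil
		}
		lastMsg = regionErr.Message
		if nl := regionErr.NotLeader; nl != nil && nl.Leader != nil {
			peer = *nl.Leader // follow the hint and retry on the new leader
			continue
		}
		return peer, fmt.Errorf("split region failed: %s", regionErr.Message)
	}
	return peer, fmt.Errorf("split region failed after %d attempts: %s", maxRetries, lastMsg)
}
```

In the failure logged above, the request went to a non-leader peer while the error already named peer 3831 on store 5 as the probable leader, so a loop like this would presumably succeed on the second attempt, provided the hint is still accurate by then.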


4 participants