Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

NotLeader error may cause unnecessary backoff by causing pseudo region error #1112

Closed
MyonKeminta opened this issue Jan 12, 2024 · 0 comments · Fixed by #1115
Closed

NotLeader error may cause unnecessary backoff by causing pseudo region error #1112

MyonKeminta opened this issue Jan 12, 2024 · 0 comments · Fixed by #1115

Comments

@MyonKeminta
Copy link
Contributor

MyonKeminta commented Jan 12, 2024

When a NotLeader error carrying a new leader info is received from TiKV, we update the leader info of the region by updating the struct Region 's store field which is in type RegionStore). As the Region is a type of shared data managed by RegionCache, the store is updated by replacing atomic pointers, and thus RegionStore is used in copy-on-write pattern.

newRegionStore := oldRegionStore.clone()
newRegionStore.workTiKVIdx = leaderIdx
newRegionStore.storeEpochs[leaderIdx] = atomic.LoadUint32(&newRegionStore.stores[leaderIdx].epoch)
if !r.compareAndSwapStore(oldRegionStore, newRegionStore) {

Then, #264 introduces this which is executed every time attempting to send a request:

if &oldRegionStore.stores != &newRegionStore.stores {
s.state = &invalidStore{}
return
}

This forces the replicaSelector to be rebuilt when retrying on NotLeader error, by return nil rpcCtx:

func (state *invalidStore) next(_ *retry.Backoffer, _ *replicaSelector) (*RPCContext, error) {
metrics.TiKVReplicaSelectorFailureCounter.WithLabelValues("invalidStore").Inc()
return nil, nil
}

and then triggering pseudo region error:

if rpcCtx == nil {
// TODO(youjiali1995): remove it when using the replica selector for all requests.
// If the region is not found in cache, it must be out
// of date and already be cleaned up. We can skip the
// RPC by returning RegionError directly.
// TODO: Change the returned error to something like "region missing in cache",
// and handle this error like EpochNotMatch, which means to re-split the request and retry.
s.logSendReqError(bo, "throwing pseudo region error due to no replica available", regionID, retryTimes, req, totalErrors)
resp, err = tikvrpc.GenRegionErrorResp(req, &errorpb.Error{EpochNotMatch: &errorpb.EpochNotMatch{}})
return resp, nil, retryTimes, err
}

This will cause backoff to the caller of SendReqCtx, which is unnecessary in a NotLeader case where the new leader is immediately known.

The usage of invalidStore introduced by #264 mentioned above looks expected to compare whether the store field of the old and new regionStore instances points to the same block of memory, but mistakenly compared the address of the store field itself.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
1 participant