Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

kv: don't return ReplicaCorruptionError from commit trigger evaluation due to context cancellation #114442

Closed
nvanbenschoten opened this issue Nov 14, 2023 · 2 comments · Fixed by #115946
Labels
A-kv-distribution Relating to rebalancing and leasing. C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. T-kv KV Team

Comments

@nvanbenschoten
Copy link
Member

nvanbenschoten commented Nov 14, 2023

We currently wrap any error from splitTrigger or mergeTrigger evaluation with a ReplicaCorruptionError. This is incorrect in the case of a context cancellation error, which is possible because request evaluation runs before raft, in the client's context.

We should return the errors directly in these cases, without wrapping them in corruption errors. Something like:

diff --git a/pkg/kv/kvserver/batcheval/cmd_end_transaction.go b/pkg/kv/kvserver/batcheval/cmd_end_transaction.go
index e21903fe631..20b5521fcf2 100644
--- a/pkg/kv/kvserver/batcheval/cmd_end_transaction.go
+++ b/pkg/kv/kvserver/batcheval/cmd_end_transaction.go
@@ -779,7 +779,10 @@ func RunCommitTrigger(
                        ctx, rec, batch, *ms, ct.SplitTrigger, txn.WriteTimestamp,
                )
                if err != nil {
-                       return result.Result{}, kvpb.NewReplicaCorruptionError(err)
+                       if !errors.Is(err, ctx.Err()) {
+                               err = kvpb.NewReplicaCorruptionError(err)
+                       }
+                       return result.Result{}, err
                }
                *ms = newMS
                return res, nil
@@ -787,7 +790,10 @@ func RunCommitTrigger(
        if mt := ct.GetMergeTrigger(); mt != nil {
                res, err := mergeTrigger(ctx, rec, batch, ms, mt, txn.WriteTimestamp)
                if err != nil {
-                       return result.Result{}, kvpb.NewReplicaCorruptionError(err)
+                       if !errors.Is(err, ctx.Err()) {
+                               err = kvpb.NewReplicaCorruptionError(err)
+                       }
+                       return result.Result{}, err
                }
                return res, nil
        }

Jira issue: CRDB-33515

@nvanbenschoten nvanbenschoten added C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. A-kv-distribution Relating to rebalancing and leasing. T-kv KV Team labels Nov 14, 2023
@nvanbenschoten
Copy link
Member Author

@dt found another case where we make the same mistake:

return kvpb.NewError(kvpb.NewReplicaCorruptionError(
errors.Wrap(err, "could not read from AbortSpan")))

We should audit all of these calls to NewReplicaCorruptionError. We can also add some assertions in that construct that the error does not have a context timeout/cancellation cause.

@dt
Copy link
Member

dt commented Nov 16, 2023

Should we rename NewReplicaCorruptionError to MaybeWrapReplicaCorruptionError and just make it take a ctx and do the check itself?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
A-kv-distribution Relating to rebalancing and leasing. C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. T-kv KV Team
Projects
No open projects
Archived in project
Development

Successfully merging a pull request may close this issue.

2 participants