
kv: spurious ambiguous errors can be returned #129427

Open
andrewbaptist opened this issue Aug 21, 2024 · 4 comments
Labels
C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. P-2 Issues/test failures with a fix SLA of 3 months T-kv KV Team

Comments


andrewbaptist commented Aug 21, 2024

Describe the problem

The test kv/gracefuldraining fails without disabling the max gossip frequency.

Customers expect not to hit ambiguous errors during clean drains and shutdowns. Ambiguous errors are rare and somewhat expected in general, and we expect customers to retry them, but returning them during a clean drain is still a regression in behavior.

To Reproduce

Remove the --tolerate-errors line from the test. It will fail ~50% of the time.

Expected behavior
No ambiguous errors are returned during clean drains and shutdowns.

Jira issue: CRDB-41538

Epic CRDB-39956

@andrewbaptist andrewbaptist added the C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. label Aug 21, 2024

blathers-crl bot commented Aug 21, 2024

Hi @andrewbaptist, please add branch-* labels to identify which branch(es) this C-bug affects.

🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is dev-inf.

@andrewbaptist andrewbaptist added T-kv KV Team P-2 Issues/test failures with a fix SLA of 3 months backport-24.2.x and removed backport-24.2.x labels Aug 21, 2024
@github-project-automation github-project-automation bot moved this to Incoming in KV Aug 28, 2024
@andrewbaptist

There are a few different failure modes. I'll put them in different comments.


andrewbaptist commented Aug 28, 2024

Failure mode 1:

  1. DistSender has a stale cache (s1, s2, s3). The current range is on s1, s4, s5.
  2. sendToReplicas attempts the three replicas in the cache for a Put / EndTxn, correctly receives “replica not found” from s2 and s3, and creates an AmbiguousError for s1 (since it doesn't respond).
  3. sendPartialBatch doesn’t retry the batch request since it wasn’t a SendError.

There are a few ways to handle this, but the fundamental problem is the “contract” between sendToReplicas and sendPartialBatch. Specifically, sendToReplicas stops retrying if the descriptor changes, but still returns the “worst” error, which in this case was an ambiguous error. sendPartialBatch will never retry ambiguous errors.

The current contract between these methods is too weak, specifically in the case where both a RangeMismatchError and an AmbiguousError are returned.

Some options:

  • If we get both types of errors, return the RangeMismatchError, and set CanForwardReadTimestamp on the “original” BatchRequest passed to sendToReplicas rather than on the cloned copy.
  • Return “all” errors from sendToReplicas and move the selectBestError processing up a level (see the sketch below). Currently we are mixing the “best error” decision with the “should retry” decision and losing information.
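
A minimal sketch of what the second option could look like. Everything here is hypothetical (the `replicaResult` type, `shouldRetryWithFreshDescriptor`, and the sentinel errors are illustrations, not the actual DistSender API); the point is just that surfacing every per-replica outcome lets the caller see the descriptor-change signal even when one replica also produced an ambiguous result.

```go
package main

import (
	"errors"
	"fmt"
)

// Hypothetical stand-ins for the real KV error types.
var (
	errRangeNotFound = errors.New("RangeNotFoundError")   // replica no longer has the range
	errAmbiguous     = errors.New("AmbiguousResultError") // replica never responded
)

// replicaResult records the outcome from one replica (hypothetical type).
type replicaResult struct {
	replica string
	err     error
}

// Sketch of option two: sendToReplicas returns every per-replica outcome
// instead of collapsing them into a single "best" error, so the caller
// keeps both the "should retry" signal and the candidate errors.
func sendToReplicas() []replicaResult {
	// Failure mode 1: stale cache (s1, s2, s3), range actually on s1, s4, s5.
	return []replicaResult{
		{"s1", errAmbiguous},     // no response: the write may have applied
		{"s2", errRangeNotFound}, // replica was removed
		{"s3", errRangeNotFound}, // replica was removed
	}
}

// Caller-side decision (would live in sendPartialBatch): if any replica
// told us our routing info is stale, refresh the descriptor and retry
// rather than returning the ambiguous error to the client.
func shouldRetryWithFreshDescriptor(results []replicaResult) bool {
	for _, r := range results {
		if errors.Is(r.err, errRangeNotFound) {
			return true
		}
	}
	return false
}

func main() {
	results := sendToReplicas()
	if shouldRetryWithFreshDescriptor(results) {
		fmt.Println("refresh range descriptor and retry the batch")
	} else {
		fmt.Println("return the best error to the client")
	}
}
```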


andrewbaptist commented Aug 28, 2024

Failure mode 2:

  1. DistSender has a stale cache (s1, s2, s3). The current range is on s1, s2, s5, with s5 as the leaseholder.
  2. sendToReplicas attempts the three replicas in the cache for a Put / EndTxn, receives a RangeMismatchError from s3 and a NotLeaseHolderError from s2, and creates an AmbiguousError for s1 (since it doesn't respond).
  3. sendPartialBatch doesn’t retry the batch request since it wasn’t a SendError.

The rest of the analysis is basically the same.
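
To make the masking concrete, here is a toy illustration; the `severity` ranking and `selectBestError` below are simplified stand-ins for the real prioritization logic, not the actual code. When the per-replica errors are reduced to a single “worst” error, the ambiguous result wins and the retryable routing errors are discarded.

```go
package main

import (
	"errors"
	"fmt"
)

// Hypothetical stand-ins for the real KV error types.
var (
	errRangeMismatch  = errors.New("RangeKeyMismatchError")
	errNotLeaseHolder = errors.New("NotLeaseHolderError")
	errAmbiguous      = errors.New("AmbiguousResultError")
)

// severity is a toy stand-in for the "worst error wins" selection: an
// ambiguous result outranks errors that merely say "your routing info
// is stale", because the write may have been applied.
func severity(err error) int {
	switch {
	case errors.Is(err, errAmbiguous):
		return 3
	case errors.Is(err, errNotLeaseHolder):
		return 2
	case errors.Is(err, errRangeMismatch):
		return 1
	}
	return 0
}

// selectBestError keeps only the highest-severity error, discarding the
// information that a descriptor refresh + retry would likely succeed.
func selectBestError(errs []error) error {
	var best error
	for _, err := range errs {
		if severity(err) > severity(best) {
			best = err
		}
	}
	return best
}

func main() {
	// Failure mode 2: s3 reports a range mismatch, s2 a stale lease,
	// and s1 never responds (ambiguous).
	errs := []error{errRangeMismatch, errNotLeaseHolder, errAmbiguous}
	fmt.Println(selectBestError(errs)) // prints AmbiguousResultError
}
```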

@andrewbaptist andrewbaptist changed the title kv: gracefuldraining test fails with default cluster settings kv: spurious ambiguous errors can be returned Aug 28, 2024
andrewbaptist added a commit to andrewbaptist/cockroach that referenced this issue Aug 28, 2024
Previously the test could return an ambiguous error if there was a
combination of a `RangeMismatchError` and an `AmbiguousError`. This
request would be successful if the range info was updated and the
request retried, but we don't do that today.

Informs: cockroachdb#129427
Epic: none

Release note: None
andrewbaptist added a commit to andrewbaptist/cockroach that referenced this issue Aug 29, 2024
This test can fail with an ambiguous error if there is a combination of
a `RangeMismatchError` and an `AmbiguousError` from different nodes.
The request would succeed if the range descriptor were updated and the
request retried, but we incorrectly don't retry this today.

Informs: cockroachdb#129427
Epic: none

Release note: None
craig bot pushed a commit that referenced this issue Aug 29, 2024
129829: roachtest: tolerate errors on gracefuldraining test r=kvoli a=andrewbaptist

Co-authored-by: Andrew Baptist <[email protected]>
andrewbaptist added a commit that referenced this issue Sep 4, 2024
andrewbaptist added a commit to andrewbaptist/cockroach that referenced this issue Sep 4, 2024
andrewbaptist added a commit to andrewbaptist/cockroach that referenced this issue Sep 4, 2024