Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

kvnemesis: add non-blocking transactions #59062

Closed
nvanbenschoten opened this issue Jan 15, 2021 · 0 comments · Fixed by #63747
Closed

kvnemesis: add non-blocking transactions #59062

nvanbenschoten opened this issue Jan 15, 2021 · 0 comments · Fixed by #63747
Assignees
Labels
A-testing Testing tools and infrastructure C-enhancement Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception) GA-blocker

Comments

@nvanbenschoten
Copy link
Member

Once they exist.

@nvanbenschoten nvanbenschoten added C-enhancement Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception) A-testing Testing tools and infrastructure labels Jan 15, 2021
@nvanbenschoten nvanbenschoten self-assigned this Jan 15, 2021
@cockroachdb cockroachdb deleted a comment from blathers-crl bot Mar 30, 2021
craig bot pushed a commit that referenced this issue Apr 8, 2021
62580: kv: deflake multi-node kvnemesis r=nvanbenschoten a=nvanbenschoten

Fixes #61322.
Prep to address #59062.

#### kv: don't consider Subsume error as assertion failures

These were added in b15e3dd, but that was incorrect. It is valid for a merge that races with another merge or split to hit these case, as the Subsume request is non-transactional.

Also, allow `RHS range bounds do not match` errors in kvnemesis. This is a valid error for the reasons mentioned above.

#### kv: mark error from Replica.GetSnapshot with errMarkSnapshotError

Without it, we see the following errors in kvnemesis:
```
Wraps: (2) error applying x.AdminChangeReplicas(ctx, /Table/50/"244956da", [{ADD_VOTER n2,s2}]) // [n1,s1,r36/1:/Table/{40-50/"ba86b…}]: failed to generate LEARNER_INITIAL snapshot: couldn't find range descriptor
```

#### kv: allow two more errors with TransferLeaseOperation

This deflakes TestKVNemesisMultiNode. The test had recently gotten flakier, which I bisected back to 3fe1992. It appears that the changes in that commit made it more likely to hit this `replica not found in RangeDescriptor`, which is returned after a TransferLease request has acquired latches, in addition to the existing `unable to find store \d+ in range` error, which is returned before a TransferLease request has acquired latches.

Co-authored-by: Nathan VanBenschoten <[email protected]>
craig bot pushed a commit that referenced this issue Apr 27, 2021
63747: kvnemesis: toggle global_reads attribute in zone configs r=nvanbenschoten a=nvanbenschoten

Closes #59062.

This commit adds a new `ChangeZoneOperation` with the option to toggle
the `global_reads` attribute to KV nemesis. This testing ensures that
global read ranges are stressed with a variety of different conditions,
including range splits, range merges, rebalancing, lease transfers,
rangefeeds, and more.

So far, there has been one bug revealed by this testing where a range
merge watcher incorrectly considered a merge to have succeeded due to a
"stale read" when the merge actually failed. This will be fixed in a
separate PR that I'll merge before this one. In addition, I reverted a
few key changes to demonstrate that this testing would have revealed
other issues had they not been addressed proactively.

```go
// 1. if in TransferLease, reads above new lease start time are ignored
priorReadSum = rspb.FromTimestamp(newLease.Start.ToTimestamp())

// then after 507 runs over 1m11s
committed txn non-atomic timestamps: [w]/Table/50/"0d0e019b":1617159805.036363000,2?->v-11 [r]/Table/50/"22ebd28f":[<min>, <max>)-><nil> [s]/Table/50/"{3059eb81"-b379f625"}:{0:[1617159804.712970000,1, <max>), 1:[1617159804.712970000,1, <max>), 2:[1617159804.712970000,1, <max>), gap:[<min>, 1617159804.748590000,0)}->[/Table/50/"380e4125":v-10, /Table/50/"3812daf0":v-9, /Table/50/"3c11a3d5":v-7] [r]/Table/50/"68767751":[<min>, <max>)-><nil> [w]/Table/50/"d05c9e5f":1617159805.036363000,2?->v-12

// 2. if in Subsume, reads above freeze time are ignored
priorReadSum = rspb.FromTimestamp(reply.FreezeStart.ToTimestamp())

// then after many runs it would have hit something similar

// 3. if in (*txnwait.Queue).forcePushAbort we did not forward the PushRequest
// timestamp to req.PushTo (i.e. if we reverted 8ba492c)
// b.Header.Timestamp.Forward(req.PushTo)

// then after 29 runs over 8s
error applying x.ScanForUpdate(ctx, /Table/50/"9e80b889", /Table/50/"e2985e21", 0) // (nil, request timestamp 1618333326.405544000,0 less than PushTo timestamp 1618333326.631766000,3?): request timestamp 1618333326.405544000,0 less than PushTo timestamp 1618333326.631766000,3?
```

Co-authored-by: Nathan VanBenschoten <[email protected]>
@craig craig bot closed this as completed in b025216 Apr 27, 2021
nvanbenschoten added a commit to nvanbenschoten/cockroach that referenced this issue May 20, 2021
Closes cockroachdb#59062.

This commit adds a new `ChangeZoneOperation` with the option to toggle
the `global_reads` attribute to KV nemesis. This testing ensures that
global read ranges are stressed with a variety of different conditions,
including range splits, range merges, rebalancing, lease transfers,
rangefeeds, and more.

So far, there has been one bug revealed by this testing where a range
merge watcher incorrectly considered a merge to have succeeded due to a
"stale read" when the merge actually failed. This will be fixed in the
following commit. In addition, I reverted a few key changes to
demonstrate that this testing would have revealed other issues had they
not been addressed proactively.

```go
// 1. if in TransferLease, reads above new lease start time are ignored
priorReadSum = rspb.FromTimestamp(newLease.Start.ToTimestamp())

// then after 507 runs over 1m11s
committed txn non-atomic timestamps: [w]/Table/50/"0d0e019b":1617159805.036363000,2?->v-11 [r]/Table/50/"22ebd28f":[<min>, <max>)-><nil> [s]/Table/50/"{3059eb81"-b379f625"}:{0:[1617159804.712970000,1, <max>), 1:[1617159804.712970000,1, <max>), 2:[1617159804.712970000,1, <max>), gap:[<min>, 1617159804.748590000,0)}->[/Table/50/"380e4125":v-10, /Table/50/"3812daf0":v-9, /Table/50/"3c11a3d5":v-7] [r]/Table/50/"68767751":[<min>, <max>)-><nil> [w]/Table/50/"d05c9e5f":1617159805.036363000,2?->v-12

// 2. if in Subsume, reads above freeze time are ignored
priorReadSum = rspb.FromTimestamp(reply.FreezeStart.ToTimestamp())

// then after many runs it would have hit something similar

// 3. if in (*txnwait.Queue).forcePushAbort we did not forward the PushRequest
// timestamp to req.PushTo (i.e. if we reverted 8ba492c)
// b.Header.Timestamp.Forward(req.PushTo)

// then after 29 runs over 8s
error applying x.ScanForUpdate(ctx, /Table/50/"9e80b889", /Table/50/"e2985e21", 0) // (nil, request timestamp 1618333326.405544000,0 less than PushTo timestamp 1618333326.631766000,3?): request timestamp 1618333326.405544000,0 less than PushTo timestamp 1618333326.631766000,3?
```
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
A-testing Testing tools and infrastructure C-enhancement Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception) GA-blocker
Projects
None yet
Development

Successfully merging a pull request may close this issue.

1 participant