Frequent errors when running tpc-c on six node cluster #34228

Closed
awoods187 opened this issue Jan 24, 2019 · 2 comments · Fixed by #34548
Labels
S-3-ux-surprise Issue leaves users wondering whether CRDB is behaving properly. Likely to hurt reputation/adoption.

Comments

awoods187 commented Jan 24, 2019

Describe the problem

While running TPC-C on six-node clusters, I see repeated failures of:

Error: error in newOrder: ERROR: duplicate key value (o_w_id,o_d_id,o_id)=(1794,10,3002) violates unique constraint "primary" (SQLSTATE 23505)
Error:  exit status 1

Note, I've observed this on multiple separate nodes.
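
For context on where the duplicate comes from: TPC-C's newOrder transaction assigns the order ID by reading the district's d_next_o_id counter, incrementing it, and inserting the order under that ID. The sketch below is a minimal illustration of that read-modify-write using the standard TPC-C column names (it is not the workload's actual code, and the connection string is a placeholder). If two such transactions are ever allowed to read the same d_next_o_id, both insert the same (o_w_id, o_d_id, o_id) and the loser fails with SQLSTATE 23505, which is the error above.

```go
// Illustrative sketch only: the read-modify-write that assigns o_id in
// TPC-C's newOrder. Standard TPC-C schema names; not the workload's code.
package main

import (
	"database/sql"
	"fmt"
	"log"

	_ "github.com/lib/pq" // any Postgres-wire driver works; lib/pq is an assumption
)

func newOrderSketch(db *sql.DB, wID, dID int) error {
	tx, err := db.Begin()
	if err != nil {
		return err
	}
	defer tx.Rollback() // no-op if the transaction commits

	// Read the district's next order ID ...
	var oID int
	if err := tx.QueryRow(
		`SELECT d_next_o_id FROM district WHERE d_w_id = $1 AND d_id = $2`,
		wID, dID,
	).Scan(&oID); err != nil {
		return err
	}

	// ... bump the counter ...
	if _, err := tx.Exec(
		`UPDATE district SET d_next_o_id = $1 WHERE d_w_id = $2 AND d_id = $3`,
		oID+1, wID, dID,
	); err != nil {
		return err
	}

	// ... and insert the order under that ID. If another transaction was
	// allowed to read the same d_next_o_id (write skew), this INSERT fails
	// with the duplicate key on (o_w_id, o_d_id, o_id), SQLSTATE 23505.
	if _, err := tx.Exec(
		`INSERT INTO "order" (o_w_id, o_d_id, o_id, o_entry_d) VALUES ($1, $2, $3, now())`,
		wID, dID, oID,
	); err != nil {
		return err
	}
	return tx.Commit()
}

func main() {
	// Placeholder connection string; point it at your own cluster.
	db, err := sql.Open("postgres", "postgresql://root@localhost:26257/tpcc?sslmode=disable")
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()
	fmt.Println(newOrderSketch(db, 1794, 10))
}
```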

To Reproduce

  1. export CLUSTER=andy-base
  2. roachprod create $CLUSTER -n 7 --clouds=aws --aws-machine-type-ssd=c5d.4xlarge
  3. roachprod run $CLUSTER:1-6 -- 'sudo umount /mnt/data1; sudo mount -o discard,defaults,nobarrier /dev/nvme1n1 /mnt/data1/; mount | grep /mnt/data1'
  4. roachprod stage $CLUSTER:1-6 cockroach
  5. roachprod stage $CLUSTER:7 workload
  6. roachprod start $CLUSTER:1-6
  7. roachprod adminurl --open $CLUSTER:1
  8. roachprod run $CLUSTER:1 -- "./cockroach workload fixtures import tpcc --warehouses=5000 --db=tpcc"
  9. roachprod run $CLUSTER:7 "./workload run tpcc --ramp=5m --warehouses=4000 --duration=15m --split --scatter {pgurl:1-3}"

Expected behavior
TPC-C completes without errors.

Environment:
v2.2.0-alpha.20181217-820-g645c0c9

awoods187 commented Jan 24, 2019

I'm also frequently seeing:

Error: error in delivery: ERROR: TransactionStatusError: transaction deadline exceeded (REASON_UNKNOWN) (SQLSTATE XX000)
Error:  exit status 1


awoods187 commented Jan 25, 2019

Ran into this as well:

Error: error in payment: EOF
Error:  exit status 1

Update: this killed a node. #34241

nvanbenschoten added a commit to nvanbenschoten/cockroach that referenced this issue Feb 5, 2019
Fixes cockroachdb#34025.
Fixes cockroachdb#33624.
Fixes cockroachdb#33335.
Fixes cockroachdb#33151.
Fixes cockroachdb#33149.
Fixes cockroachdb#34159.
Fixes cockroachdb#34293.
Fixes cockroachdb#32813.
Fixes cockroachdb#30886.
Fixes cockroachdb#34228.
Fixes cockroachdb#34321.

It is rare but possible for a replica to become a leaseholder but not
learn about this until it applies a snapshot. Immediately upon the
snapshot application's `ReplicaState` update, the replica will begin
operating as a standard leaseholder.

Before this change, leases acquired in this way would not trigger
in-memory side-effects to be performed. This could result in a regression
in the new leaseholder's timestamp cache compared to the previous
leaseholder, allowing write-skew like we saw in cockroachdb#34025. This could
presumably result in other anomalies as well, because all of the
steps in `leasePostApply` were skipped.

This PR fixes this bug by detecting lease updates when applying
snapshots and making sure to react correctly to them. It also likely
fixes the referenced issue. The new test demonstrated that without
this fix, the serializable violation speculated about in the issue
was possible.

Release note (bug fix): Fix bug where lease transfers passed through
Snapshots could forget to update in-memory state on the new leaseholder,
allowing write-skew between read-modify-write operations.
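
To make the fix concrete, here is a conceptual sketch of "detecting lease updates when applying snapshots and reacting to them." The types, method names, and the timestamp-cache stand-in below are simplified assumptions for illustration only, not CockroachDB's actual Replica or leasePostApply code.

```go
// Conceptual sketch of the fix described in the commit message above, using
// simplified, hypothetical types.
package main

import "fmt"

// Lease identifies the current leaseholder (simplified).
type Lease struct {
	Sequence int
	Holder   string
}

// ReplicaState is the subset of state a snapshot carries (simplified).
type ReplicaState struct {
	Lease Lease
}

// Replica holds the in-memory view that must stay in sync with applied state.
type Replica struct {
	state           ReplicaState
	tsCacheLowWater int64 // stand-in for the timestamp cache
}

// leasePostApply mirrors the side effects a replica performs when it learns of
// a new lease: here, ratcheting the timestamp cache forward so the new
// leaseholder cannot regress below reads served by the previous leaseholder.
func (r *Replica) leasePostApply(newLease Lease, appliedAt int64) {
	if appliedAt > r.tsCacheLowWater {
		r.tsCacheLowWater = appliedAt
	}
	fmt.Printf("lease seq %d -> %d: side effects applied\n",
		r.state.Lease.Sequence, newLease.Sequence)
}

// applySnapshot is where the bug lived: before the fix, the lease carried in
// the snapshot's ReplicaState was installed without running leasePostApply.
func (r *Replica) applySnapshot(snap ReplicaState, appliedAt int64) {
	if snap.Lease != r.state.Lease {
		// The fix: treat a lease change observed via snapshot like any other
		// lease application and run its side effects.
		r.leasePostApply(snap.Lease, appliedAt)
	}
	r.state = snap
}

func main() {
	r := &Replica{state: ReplicaState{Lease: Lease{Sequence: 1, Holder: "n2"}}}
	// A snapshot arrives that, among other things, makes this replica the
	// leaseholder.
	r.applySnapshot(ReplicaState{Lease: Lease{Sequence: 2, Holder: "n1"}}, 100)
	fmt.Println("timestamp cache low-water mark:", r.tsCacheLowWater)
}
```
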
craig bot pushed a commit that referenced this issue Feb 5, 2019
34548: storage: apply lease change side-effects on snapshot recipients r=nvanbenschoten a=nvanbenschoten

Fixes #34025.
Fixes #33624.
Fixes #33335.
Fixes #33151.
Fixes #33149.
Fixes #34159.
Fixes #34293.
Fixes #32813.
Fixes #30886.
Fixes #34228.
Fixes #34321.

It is rare but possible for a replica to become a leaseholder but not learn about this until it applies a snapshot. Immediately upon the snapshot application's `ReplicaState` update, the replica will begin operating as a standard leaseholder.

Before this change, leases acquired in this way would not trigger in-memory side-effects to be performed. This could result in a regression in the new leaseholder's timestamp cache compared to the previous leaseholder's cache, allowing write-skew like we saw in #34025. This could presumably result in other anomalies as well, because all of the steps in `leasePostApply` were skipped (as theorized by #34025 (comment)).

This PR fixes this bug by detecting lease updates when applying snapshots and making sure to react correctly to them. It also likely fixes the referenced issue. The new test demonstrates that without this fix, the serializable violation speculated about in the issue was possible.

Co-authored-by: Nathan VanBenschoten <[email protected]>
craig bot closed this as completed in #34548 on Feb 5, 2019
awoods187 added the S-3-ux-surprise label on Mar 8, 2019