Frequent errors when running tpc-c on six node cluster #34228

Closed
awoods187 opened this issue Jan 24, 2019 · 2 comments · Fixed by #34548
Labels
S-3-ux-surprise Issue leaves users wondering whether CRDB is behaving properly. Likely to hurt reputation/adoption.

Comments

awoods187 commented Jan 24, 2019

Describe the problem

While running TPC-C on six-node clusters, I see repeated failures of:

Error: error in newOrder: ERROR: duplicate key value (o_w_id,o_d_id,o_id)=(1794,10,3002) violates unique constraint "primary" (SQLSTATE 23505)
Error:  exit status 1

Note, I've observed this on multiple separate nodes.
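
For context on where the duplicate comes from: TPC-C's newOrder transaction assigns the order ID by reading the district's d_next_o_id counter, incrementing it, and inserting the order under that ID. The sketch below is a minimal illustration of that read-modify-write using the standard TPC-C column names (it is not the workload's actual code, and the connection string is a placeholder). If two such transactions are ever allowed to read the same d_next_o_id, both insert the same (o_w_id, o_d_id, o_id) and the loser fails with SQLSTATE 23505, which is the error above.

```go
// Illustrative sketch only: the read-modify-write that assigns o_id in
// TPC-C's newOrder. Standard TPC-C schema names; not the workload's code.
package main

import (
	"database/sql"
	"fmt"
	"log"

	_ "github.com/lib/pq" // any Postgres-wire driver works; lib/pq is an assumption
)

func newOrderSketch(db *sql.DB, wID, dID int) error {
	tx, err := db.Begin()
	if err != nil {
		return err
	}
	defer tx.Rollback() // no-op if the transaction commits

	// Read the district's next order ID ...
	var oID int
	if err := tx.QueryRow(
		`SELECT d_next_o_id FROM district WHERE d_w_id = $1 AND d_id = $2`,
		wID, dID,
	).Scan(&oID); err != nil {
		return err
	}

	// ... bump the counter ...
	if _, err := tx.Exec(
		`UPDATE district SET d_next_o_id = $1 WHERE d_w_id = $2 AND d_id = $3`,
		oID+1, wID, dID,
	); err != nil {
		return err
	}

	// ... and insert the order under that ID. If another transaction was
	// allowed to read the same d_next_o_id (write skew), this INSERT fails
	// with the duplicate key on (o_w_id, o_d_id, o_id), SQLSTATE 23505.
	if _, err := tx.Exec(
		`INSERT INTO "order" (o_w_id, o_d_id, o_id, o_entry_d) VALUES ($1, $2, $3, now())`,
		wID, dID, oID,
	); err != nil {
		return err
	}
	return tx.Commit()
}

func main() {
	// Placeholder connection string; point it at your own cluster.
	db, err := sql.Open("postgres", "postgresql://root@localhost:26257/tpcc?sslmode=disable")
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()
	fmt.Println(newOrderSketch(db, 1794, 10))
}
```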

To Reproduce

  1. export CLUSTER=andy-base
  2. roachprod create $CLUSTER -n 7 --clouds=aws --aws-machine-type-ssd=c5d.4xlarge
  3. roachprod run $CLUSTER:1-6 -- 'sudo umount /mnt/data1; sudo mount -o discard,defaults,nobarrier /dev/nvme1n1 /mnt/data1/; mount | grep /mnt/data1'
  4. roachprod stage $CLUSTER:1-6 cockroach
  5. roachprod stage $CLUSTER:7 workload
  6. roachprod start $CLUSTER:1-6
  7. roachprod adminurl --open $CLUSTER:1
  8. roachprod run $CLUSTER:1 -- "./cockroach workload fixtures import tpcc --warehouses=5000 --db=tpcc"
  9. roachprod run $CLUSTER:7 "./workload run tpcc --ramp=5m --warehouses=4000 --duration=15m --split --scatter {pgurl:1-3}"

Expected behavior
TPC-C completes without errors.

Environment:
v2.2.0-alpha.20181217-820-g645c0c9

awoods187 commented Jan 24, 2019

I'm also frequently seeing:

Error: error in delivery: ERROR: TransactionStatusError: transaction deadline exceeded (REASON_UNKNOWN) (SQLSTATE XX000)
Error:  exit status 1


awoods187 commented Jan 25, 2019

Ran into this as well:

Error: error in payment: EOF
Error:  exit status 1

Update: this killed a node. #34241

nvanbenschoten added a commit to nvanbenschoten/cockroach that referenced this issue Feb 5, 2019
Fixes cockroachdb#34025.
Fixes cockroachdb#33624.
Fixes cockroachdb#33335.
Fixes cockroachdb#33151.
Fixes cockroachdb#33149.
Fixes cockroachdb#34159.
Fixes cockroachdb#34293.
Fixes cockroachdb#32813.
Fixes cockroachdb#30886.
Fixes cockroachdb#34228.
Fixes cockroachdb#34321.

It is rare but possible for a replica to become a leaseholder but not
learn about this until it applies a snapshot. Immediately upon the
snapshot application's `ReplicaState` update, the replica will begin
operating as a standard leaseholder.

Before this change, leases acquired in this way would not trigger
in-memory side-effects to be performed. This could result in a regression
in the new leaseholder's timestamp cache compared to the previous
leaseholder, allowing write-skew like we saw in cockroachdb#34025. This could
presumably result in other anomalies as well, because all of the
steps in `leasePostApply` were skipped.

This PR fixes this bug by detecting lease updates when applying
snapshots and making sure to react correctly to them. It also likely
fixes the referenced issue. The new test demonstrated that without
this fix, the serializable violation speculated about in the issue
was possible.

Release note (bug fix): Fix bug where lease transfers passed through
Snapshots could forget to update in-memory state on the new leaseholder,
allowing write-skew between read-modify-write operations.
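
To make the fix concrete, here is a conceptual sketch of "detecting lease updates when applying snapshots and reacting to them." The types, method names, and the timestamp-cache stand-in below are simplified assumptions for illustration only, not CockroachDB's actual Replica or leasePostApply code.

```go
// Conceptual sketch of the fix described in the commit message above, using
// simplified, hypothetical types.
package main

import "fmt"

// Lease identifies the current leaseholder (simplified).
type Lease struct {
	Sequence int
	Holder   string
}

// ReplicaState is the subset of state a snapshot carries (simplified).
type ReplicaState struct {
	Lease Lease
}

// Replica holds the in-memory view that must stay in sync with applied state.
type Replica struct {
	state           ReplicaState
	tsCacheLowWater int64 // stand-in for the timestamp cache
}

// leasePostApply mirrors the side effects a replica performs when it learns of
// a new lease: here, ratcheting the timestamp cache forward so the new
// leaseholder cannot regress below reads served by the previous leaseholder.
func (r *Replica) leasePostApply(newLease Lease, appliedAt int64) {
	if appliedAt > r.tsCacheLowWater {
		r.tsCacheLowWater = appliedAt
	}
	fmt.Printf("lease seq %d -> %d: side effects applied\n",
		r.state.Lease.Sequence, newLease.Sequence)
}

// applySnapshot is where the bug lived: before the fix, the lease carried in
// the snapshot's ReplicaState was installed without running leasePostApply.
func (r *Replica) applySnapshot(snap ReplicaState, appliedAt int64) {
	if snap.Lease != r.state.Lease {
		// The fix: treat a lease change observed via snapshot like any other
		// lease application and run its side effects.
		r.leasePostApply(snap.Lease, appliedAt)
	}
	r.state = snap
}

func main() {
	r := &Replica{state: ReplicaState{Lease: Lease{Sequence: 1, Holder: "n2"}}}
	// A snapshot arrives that, among other things, makes this replica the
	// leaseholder.
	r.applySnapshot(ReplicaState{Lease: Lease{Sequence: 2, Holder: "n1"}}, 100)
	fmt.Println("timestamp cache low-water mark:", r.tsCacheLowWater)
}
```
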
craig bot pushed a commit that referenced this issue Feb 5, 2019
34548: storage: apply lease change side-effects on snapshot recipients r=nvanbenschoten a=nvanbenschoten

Fixes #34025.
Fixes #33624.
Fixes #33335.
Fixes #33151.
Fixes #33149.
Fixes #34159.
Fixes #34293.
Fixes #32813.
Fixes #30886.
Fixes #34228.
Fixes #34321.

It is rare but possible for a replica to become a leaseholder but not learn about this until it applies a snapshot. Immediately upon the snapshot application's `ReplicaState` update, the replica will begin operating as a standard leaseholder.

Before this change, leases acquired in this way would not trigger in-memory side-effects to be performed. This could result in a regression in the new leaseholder's timestamp cache compared to the previous leaseholder's cache, allowing write-skew like we saw in #34025. This could presumably result in other anomalies as well, because all of the steps in `leasePostApply` were skipped (as theorized by #34025 (comment)).

This PR fixes this bug by detecting lease updates when applying snapshots and making sure to react correctly to them. It also likely fixes the referenced issue. The new test demonstrates that without this fix, the serializable violation speculated about in the issue was possible.

Co-authored-by: Nathan VanBenschoten <[email protected]>
craig bot closed this as completed in #34548 on Feb 5, 2019
awoods187 added the S-3-ux-surprise label on Mar 8, 2019