
roachtest: tpccbench failures on 6/12 nodes #31460

Closed
awoods187 opened this issue Oct 16, 2018 · 13 comments
awoods187 (Contributor) commented Oct 16, 2018

Describe the problem
I saw unexpected tpcc results:
_elapsed_______tpmC____efc__avg(ms)__p50(ms)__p90(ms)__p95(ms)__p99(ms)_pMax(ms)
  600.0s    16634.3  78.4%   4894.6   3623.9   9126.8  12884.9  30064.8 103079.2
--- FAIL: tpcc 1650 resulted in 16634.3 tpmC (80.0% of max tpmC)

This should be closer to 3,000 warehouses.

I'm trying to review the results for a 6 node run of tpcc and I can't connect to the cluster.

To Reproduce

Using a roachtest build from 2 days ago, with the test modified to use partitioning and 6 nodes:

@@ -675,11 +675,12 @@ func registerTPCCBench(r *registry) {
                        // StoreDirVersion: "2.0-5",
                },
                {
-                       Nodes: 3,
+                       Nodes: 6,
                        CPUs:  16,
 
-                       LoadWarehouses: 2000,
-                       EstimatedMax:   1300,
+                       LoadWarehouses: 5000,
+                       EstimatedMax:   3000,
+                       LoadConfig:     singlePartitionedLoadgen,
                },

Ran:

bin/roachtest bench '^tpccbench/nodes=6/cpu=16/partition$$' --wipe=false --user=andy
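For reference, the patched entry corresponds to a spec roughly like the following. This is a hedged sketch: the struct type, the string constant, and its value are hypothetical stand-ins mirroring the field names in the diff, not the actual roachtest definitions (where LoadConfig and singlePartitionedLoadgen are roachtest-internal types).

```go
package main

import "fmt"

// tpccBenchSpec is a hypothetical stand-in for the roachtest config entry
// shown in the diff above; only the field names are taken from the diff.
type tpccBenchSpec struct {
	Nodes          int
	CPUs           int
	LoadWarehouses int // warehouses loaded into the cluster up front
	EstimatedMax   int // starting guess for the sustainable warehouse count
	LoadConfig     string
}

// Placeholder value; the real identifier is a roachtest-internal constant.
const singlePartitionedLoadgen = "single-partitioned-loadgen"

// spec mirrors the 6-node entry from the modified test.
var spec = tpccBenchSpec{
	Nodes:          6,
	CPUs:           16,
	LoadWarehouses: 5000,
	EstimatedMax:   3000,
	LoadConfig:     singlePartitionedLoadgen,
}

func main() {
	fmt.Printf("%+v\n", spec)
}
```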

Expected behavior
No dead nodes

Additional data / screenshots
Lost connection in the webui
[screenshot omitted]

I can't use roachprod adminurl --open for any of the nodes.

Environment:

  • CockroachDB version 2.1 beta 1008

Additional context
andy-1539646440-tpccbench-nodes-6-cpu-16-partition

awoods187 (Contributor Author) commented:

Logs.zip
Logs 2.zip

awoods187 (Contributor Author) commented:

The logs are full of:

E181016 01:48:34.622220 440 storage/replica.go:6485 [n1,s1,r1/1:/{Min-System/}] failed to gossip cluster ID: info not fresh

tbg (Member) commented Oct 16, 2018

The cluster doesn't seem to be running. Did you accidentally shut it down?

tbg (Member) commented Oct 16, 2018

@awoods187 it's unclear to me what the bug is here. The whole cluster is down without oom killer messages or fatals in the logs, so I assume there was a roachprod stop that killed everything. Now I don't know if that's the reason you wouldn't have been able to access the UI, but that is one cause.

There's probably other stuff that went wrong before. I'll send a PR to run this config in roachtest bench so that we can take a look at this in action.

tbg (Member) commented Oct 16, 2018

@awoods187 your copy-pasta above doesn't seem to line up. What values should I use for a 6 node cluster? This looks like you copied your 24 node one.

awoods187 (Contributor Author) commented:

Just updated to put the 6-node info in.

awoods187 (Contributor Author) commented:

The cluster is still up (you can see it on roachprod list), but I can't connect to the webui. It happened overnight while I was sleeping, so no action was taken on my part.

awoods187 (Contributor Author) commented:

There was definitely no roachprod stop.

tbg (Member) commented Oct 16, 2018

I'm 99% sure there was one. Perhaps your test runner failed and automatically shut down the cluster. The logs of all nodes end in synchrony on the same second, with the telltale (absence of) signs of a kill -9. Anyway, there's nothing to look at there except the logs, which don't look too happy. On all nodes, they basically switch from what I would call moderately overloaded to something like this (in the same moment that all processes exit):

E181016 06:51:45.321449 6640468 sql/distsqlrun/server.go:581  [n6] query execution canceled
I181016 06:51:45.330607 6662140 internal/client/txn.go:629  [n6] async rollback failed: TransactionStatusError: already committed (REASON_UNKNOWN): "sql txn" id=a8cc2498 key=/Table/61/1/539/0 rw=true pri=0.01090169 iso=SERIALIZABLE stat=COMMITTED epo=0 ts=1539672703.571343382,0 orig=1539672703.571343382,0 max=1539672703.572306559,0 wto=false rop=false seq=14 int=5
I181016 06:51:45.343086 6662293 internal/client/txn.go:629  [n6] async rollback failed: TransactionStatusError: already committed (REASON_UNKNOWN): "sql txn" id=d2ba9a01 key=/Table/61/1/2/0 rw=true pri=0.00333437 iso=SERIALIZABLE stat=COMMITTED epo=0 ts=1539672703.646022368,0 orig=1539672703.646022368,0 max=1539672703.654022687,0 wto=false rop=false seq=13 int=5
I181016 06:51:45.343298 6662502 internal/client/txn.go:629  [n6] async rollback failed: TransactionStatusError: already committed (REASON_UNKNOWN): "sql txn" id=4a1cfaf6 key=/Table/61/1/794/0 rw=true pri=0.02044230 iso=SERIALIZABLE stat=COMMITTED epo=0 ts=1539672704.455839841,0 orig=1539672704.455839841,0 max=1539672704.457616654,0 wto=false rop=false seq=13 int=5
I181016 06:51:45.343700 6662397 internal/client/txn.go:629  [n6] async rollback failed: TransactionStatusError: already committed (REASON_UNKNOWN): "sql txn" id=04983c72 key=/Table/55/1/47/1/-2256/0 rw=true pri=0.04565371 iso=SERIALIZABLE stat=COMMITTED epo=0 ts=1539672701.980489849,0 orig=1539672701.980489849,0 max=1539672701.990284129,0 wto=false rop=false seq=202 int=140
I181016 06:51:45.350620 6662217 internal/client/txn.go:629  [n6] async rollback failed: TransactionStatusError: already committed (REASON_UNKNOWN): "sql txn" id=8589735e key=/Table/58/1/623/2/0 rw=true pri=0.02422049 iso=SERIALIZABLE stat=COMMITTED epo=0 ts=1539672702.843608674,0 orig=1539672702.843608674,0 max=1539672702.845380896,0 wto=false rop=false seq=91 int=34
I181016 06:51:45.351606 6662289 internal/client/txn.go:629  [n6] async rollback failed: TransactionStatusError: already committed (REASON_UNKNOWN): "sql txn" id=bf1c0c40 key=/Table/58/1/644/10/0 rw=true pri=0.05155760 iso=SERIALIZABLE stat=COMMITTED epo=0 ts=1539672703.879932632,0 orig=1539672703.879932632,0 max=1539672703.881474264,0 wto=false rop=false seq=60 int=25
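When triaging a burst of these across several node logs, it can help to pull out just the transaction id, key, and status from each line. A minimal Go sketch, assuming the ad-hoc field layout of the quoted lines above (this is not a stable CockroachDB log format, so the pattern is illustrative only):

```go
package main

import (
	"fmt"
	"regexp"
)

// txnRe matches the id=, key=, and stat= fields as they appear in the
// quoted "async rollback failed" lines; the layout is taken from those
// lines, not from any documented log format.
var txnRe = regexp.MustCompile(`id=([0-9a-f]+) key=(\S+).*?stat=(\w+)`)

func main() {
	line := `I181016 06:51:45.343086 6662293 internal/client/txn.go:629  [n6] async rollback failed: TransactionStatusError: already committed (REASON_UNKNOWN): "sql txn" id=d2ba9a01 key=/Table/61/1/2/0 rw=true pri=0.00333437 iso=SERIALIZABLE stat=COMMITTED epo=0`
	if m := txnRe.FindStringSubmatch(line); m != nil {
		// prints: txn d2ba9a01 on /Table/61/1/2/0 is COMMITTED
		fmt.Printf("txn %s on %s is %s\n", m[1], m[2], m[3])
	}
}
```

Run over all node logs, this would show that every rollback in the burst targets an already-COMMITTED transaction, consistent with clients being torn down mid-flight.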

tbg changed the title from "Can't connect to 6 node cluster" to "roachtest: tpccbench failures on 6/12 nodes" on Oct 16, 2018
awoods187 (Contributor Author) commented Oct 16, 2018

Here are the CLI messages:

https://gist.github.com/tschottdorf/2a6ac9664ad58d7a13e7d825a42a5bfd

(edit by @tschottdorf: put the wall of text into a gist)

tbg (Member) commented Oct 16, 2018

There are various roachprod stops in there. The relevant one appears to be this one:

[screenshot of the relevant roachprod stop omitted]

awoods187 (Contributor Author) commented:

Right, but those stops are part of the normal tpccbench process. It stops a test after it runs and then starts a new one right after. It ran several more times after 3,000 warehouses. The test actually worked as normal: it got a pass at 1625, then successive failures down to 1650, then reported the average of the two (1637) and considered the test passed.

Normally, after doing this, the cluster is easily accessible via the webui as long as the --wipe=false flag is passed (which it was). In this case, we can't access the webui at all even though the cluster is still up.

tbg (Member) commented Oct 16, 2018

Your log shows that the test runner issued a roachprod stop at the end. The --wipe flag shouldn't have anything to do with this, I thought that only affects the beginning of the next test (i.e. whether to wipe the existing cluster before starting the test or not):

--wipe                 wipe existing cluster before starting test (for use with --cluster) (default true)

Maybe you mean --debug? But I think this test nominally passed. Finally, there's a chance that @andreimatei's refactors changed something. I don't run these perf tests a lot, but it seems expected to me that the cluster would be shut down at the end of it. Either way, I'll soon see for myself!

Folding this into #31409 which has more activity.

tbg closed this as completed Oct 16, 2018