
roachtest: tpccbench failures on 6/12 nodes #31460

Closed
awoods187 opened this issue Oct 16, 2018 · 13 comments
awoods187 (Contributor) commented Oct 16, 2018

Describe the problem
I saw unexpected tpcc results:
_elapsed_______tpmC____efc__avg(ms)__p50(ms)__p90(ms)__p95(ms)__p99(ms)_pMax(ms)
  600.0s    16634.3  78.4%   4894.6   3623.9   9126.8  12884.9  30064.8 103079.2
--- FAIL: tpcc 1650 resulted in 16634.3 tpmC (80.0% of max tpmC)

This should be closer to 3,000 warehouses.

I'm trying to review the results for a 6 node run of tpcc and I can't connect to the cluster.

To Reproduce

Using a roachtest build from 2 days ago, with the test modified to use partitioning and 6 nodes:

@@ -675,11 +675,12 @@ func registerTPCCBench(r *registry) {
                        // StoreDirVersion: "2.0-5",
                },
                {
-                       Nodes: 3,
+                       Nodes: 6,
                        CPUs:  16,
 
-                       LoadWarehouses: 2000,
-                       EstimatedMax:   1300,
+                       LoadWarehouses: 5000,
+                       EstimatedMax:   3000,
+                       LoadConfig:     singlePartitionedLoadgen,
                },

Ran:

bin/roachtest bench '^tpccbench/nodes=6/cpu=16/partition$$' --wipe=false --user=andy
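For reference, the patched entry corresponds to a spec roughly like the following. This is a hedged sketch: the struct type, the string constant, and its value are hypothetical stand-ins mirroring the field names in the diff, not the actual roachtest definitions (where LoadConfig and singlePartitionedLoadgen are roachtest-internal types).

```go
package main

import "fmt"

// tpccBenchSpec is a hypothetical stand-in for the roachtest config entry
// shown in the diff above; only the field names are taken from the diff.
type tpccBenchSpec struct {
	Nodes          int
	CPUs           int
	LoadWarehouses int // warehouses loaded into the cluster up front
	EstimatedMax   int // starting guess for the sustainable warehouse count
	LoadConfig     string
}

// Placeholder value; the real identifier is a roachtest-internal constant.
const singlePartitionedLoadgen = "single-partitioned-loadgen"

// spec mirrors the 6-node entry from the modified test.
var spec = tpccBenchSpec{
	Nodes:          6,
	CPUs:           16,
	LoadWarehouses: 5000,
	EstimatedMax:   3000,
	LoadConfig:     singlePartitionedLoadgen,
}

func main() {
	fmt.Printf("%+v\n", spec)
}
```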

Expected behavior
No dead nodes

Additional data / screenshots
Lost connection in the webui
[screenshot omitted]

I can't use roachprod adminurl --open for any of the nodes.

Environment:

  • CockroachDB version 2.1 beta 1008

Additional context
andy-1539646440-tpccbench-nodes-6-cpu-16-partition

awoods187 (Contributor Author) commented:

Logs.zip
Logs 2.zip

awoods187 (Contributor Author) commented:

The logs are full of:

E181016 01:48:34.622220 440 storage/replica.go:6485 [n1,s1,r1/1:/{Min-System/}] failed to gossip cluster ID: info not fresh

tbg (Member) commented Oct 16, 2018

The cluster doesn't seem to be running. Did you accidentally shut it down?

tbg (Member) commented Oct 16, 2018

@awoods187 it's unclear to me what the bug is here. The whole cluster is down without oom killer messages or fatals in the logs, so I assume there was a roachprod stop that killed everything. Now I don't know if that's the reason you wouldn't have been able to access the UI, but that is one cause.

There's probably other stuff that went wrong before. I'll send a PR to run this config in roachtest bench so that we can take a look at this in action.

tbg (Member) commented Oct 16, 2018

@awoods187 your copy-pasta above doesn't seem to line up. What values should I use for a 6 node cluster? This looks like you copied your 24 node one.

awoods187 (Contributor Author) commented:

Just updated to put the 6-node info in.

awoods187 (Contributor Author) commented:

The cluster is still up (you can see it on roachprod list), but I can't connect to the webui. It happened overnight while I was sleeping, so no action was taken on my part.

awoods187 (Contributor Author) commented:

There was definitely no roachprod stop.

tbg (Member) commented Oct 16, 2018

I'm 99% sure there was one. Perhaps your test runner failed and automatically shut down the cluster. The logs of all nodes end in synchrony on the same second, with the telltale (absence of) signs of a kill -9. Anyway, there's nothing to look at there except the logs, which don't look too happy. On all nodes, they basically switch from what I would call moderately overloaded to something like this (in the same moment that all processes exit):

E181016 06:51:45.321449 6640468 sql/distsqlrun/server.go:581  [n6] query execution canceled
I181016 06:51:45.330607 6662140 internal/client/txn.go:629  [n6] async rollback failed: TransactionStatusError: already committed (REASON_UNKNOWN): "sql txn" id=a8cc2498 key=/Table/61/1/539/0 rw=true pri=0.01090169 iso=SERIALIZABLE stat=COMMITTED epo=0 ts=1539672703.571343382,0 orig=1539672703.571343382,0 max=1539672703.572306559,0 wto=false rop=false seq=14 int=5
I181016 06:51:45.343086 6662293 internal/client/txn.go:629  [n6] async rollback failed: TransactionStatusError: already committed (REASON_UNKNOWN): "sql txn" id=d2ba9a01 key=/Table/61/1/2/0 rw=true pri=0.00333437 iso=SERIALIZABLE stat=COMMITTED epo=0 ts=1539672703.646022368,0 orig=1539672703.646022368,0 max=1539672703.654022687,0 wto=false rop=false seq=13 int=5
I181016 06:51:45.343298 6662502 internal/client/txn.go:629  [n6] async rollback failed: TransactionStatusError: already committed (REASON_UNKNOWN): "sql txn" id=4a1cfaf6 key=/Table/61/1/794/0 rw=true pri=0.02044230 iso=SERIALIZABLE stat=COMMITTED epo=0 ts=1539672704.455839841,0 orig=1539672704.455839841,0 max=1539672704.457616654,0 wto=false rop=false seq=13 int=5
I181016 06:51:45.343700 6662397 internal/client/txn.go:629  [n6] async rollback failed: TransactionStatusError: already committed (REASON_UNKNOWN): "sql txn" id=04983c72 key=/Table/55/1/47/1/-2256/0 rw=true pri=0.04565371 iso=SERIALIZABLE stat=COMMITTED epo=0 ts=1539672701.980489849,0 orig=1539672701.980489849,0 max=1539672701.990284129,0 wto=false rop=false seq=202 int=140
I181016 06:51:45.350620 6662217 internal/client/txn.go:629  [n6] async rollback failed: TransactionStatusError: already committed (REASON_UNKNOWN): "sql txn" id=8589735e key=/Table/58/1/623/2/0 rw=true pri=0.02422049 iso=SERIALIZABLE stat=COMMITTED epo=0 ts=1539672702.843608674,0 orig=1539672702.843608674,0 max=1539672702.845380896,0 wto=false rop=false seq=91 int=34
I181016 06:51:45.351606 6662289 internal/client/txn.go:629  [n6] async rollback failed: TransactionStatusError: already committed (REASON_UNKNOWN): "sql txn" id=bf1c0c40 key=/Table/58/1/644/10/0 rw=true pri=0.05155760 iso=SERIALIZABLE stat=COMMITTED epo=0 ts=1539672703.879932632,0 orig=1539672703.879932632,0 max=1539672703.881474264,0 wto=false rop=false seq=60 int=25
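When triaging a burst of these across several node logs, it can help to pull out just the transaction id, key, and status from each line. A minimal Go sketch, assuming the ad-hoc field layout of the quoted lines above (this is not a stable CockroachDB log format, so the pattern is illustrative only):

```go
package main

import (
	"fmt"
	"regexp"
)

// txnRe matches the id=, key=, and stat= fields as they appear in the
// quoted "async rollback failed" lines; the layout is taken from those
// lines, not from any documented log format.
var txnRe = regexp.MustCompile(`id=([0-9a-f]+) key=(\S+).*?stat=(\w+)`)

func main() {
	line := `I181016 06:51:45.343086 6662293 internal/client/txn.go:629  [n6] async rollback failed: TransactionStatusError: already committed (REASON_UNKNOWN): "sql txn" id=d2ba9a01 key=/Table/61/1/2/0 rw=true pri=0.00333437 iso=SERIALIZABLE stat=COMMITTED epo=0`
	if m := txnRe.FindStringSubmatch(line); m != nil {
		// prints: txn d2ba9a01 on /Table/61/1/2/0 is COMMITTED
		fmt.Printf("txn %s on %s is %s\n", m[1], m[2], m[3])
	}
}
```

Run over all node logs, this would show that every rollback in the burst targets an already-COMMITTED transaction, consistent with clients being torn down mid-flight.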

tbg changed the title from "Can't connect to 6 node cluster" to "roachtest: tpccbench failures on 6/12 nodes" on Oct 16, 2018
awoods187 (Contributor Author) commented Oct 16, 2018

Here are the CLI messages:

https://gist.github.com/tschottdorf/2a6ac9664ad58d7a13e7d825a42a5bfd

(edit by @tschottdorf: put the wall of text into a gist)

tbg (Member) commented Oct 16, 2018

There are various roachprod stops in there. The relevant one appears to be this one:

[screenshot of the relevant roachprod stop omitted]

awoods187 (Contributor Author) commented:

Right, but those stops are part of the normal tpccbench process. It stops a test after it runs and then starts a new one right after. It ran several more times after 3,000 warehouses. The test actually worked as normal: it got a pass at 1625, then successive failures down to 1650, then reported the average of the two (1637) and considered the test passed.

Normally, after doing this, the cluster is easily accessible via the webui as long as the --wipe=false flag is passed (which it was). In this case, we can't access the webui at all even though the cluster is still up.

tbg (Member) commented Oct 16, 2018

Your log shows that the test runner issued a roachprod stop at the end. The --wipe flag shouldn't have anything to do with this, I thought that only affects the beginning of the next test (i.e. whether to wipe the existing cluster before starting the test or not):

--wipe                 wipe existing cluster before starting test (for use with --cluster) (default true)

Maybe you mean --debug? But I think this test nominally passed. Finally, there's a chance that @andreimatei's refactors changed something. I don't run these perf tests a lot, but it seems expected to me that the cluster would be shut down at the end of it. Either way, I'll soon see for myself!

Folding this into #31409 which has more activity.

tbg closed this as completed Oct 16, 2018