roachtest: tpccbench failures on 6/12 nodes #31460
Comments
Full of
The cluster doesn't seem to be running. Did you accidentally shut it down?
@awoods187 it's unclear to me what the bug is here. The whole cluster is down without oom-killer messages or fatals in the logs, so I assume there was a roachprod stop. There's probably other stuff that went wrong before. I'll send a PR to run this config in
@awoods187 your copy-pasta above doesn't seem to line up. What values should I use for a 6-node cluster? This looks like you copied your 24-node one.
Just updated to put the 6-node info in.
The cluster is still up (you can see it on roachprod list) but I can't connect to the webui. It happened while I was sleeping overnight, so no action was taken on my part.
There was definitely no roachprod stop |
I'm 99% sure there was one. Perhaps your test runner failed and automatically shut down the cluster. The logs of all nodes end in synchrony on the same second, with the telltale (absence of) signs of a roachprod stop.
Here are the cl messages: https://gist.github.com/tschottdorf/2a6ac9664ad58d7a13e7d825a42a5bfd (edit by @tschottdorf: put the wall of text into a gist)
Right, but those stops are part of the normal tpccbench process. It stops a test after it runs and then starts a new one right after. It ran several more times after 3,000 warehouses. The test actually worked as normal: it got a pass at 1,625, then a failure at 1,650, then reported the average of the two (1,637) and considered the test passed. Normally, after doing this the cluster is still easily accessible via the webui as long as the --wipe=false flag is passed (which it was). In this case, we can't access the webui at all even though the cluster is still up.
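(For anyone skimming: a rough sketch of that search-and-average behavior. The real logic lives in roachtest's tpccbench code; the function name, step size, and starting point here are made up for illustration.)

```go
package main

import "fmt"

// runLoad stands in for a single tpccbench run at a given warehouse count.
// In the real test each run is bracketed by a cluster stop/start, which is
// where the "stops" in the log come from. The 1625 boundary mirrors the
// numbers observed in this issue.
func runLoad(warehouses int) bool {
	return warehouses <= 1625
}

func main() {
	// Search upward in steps of 25 warehouses until a run fails, then
	// report the midpoint of the last pass and the first failure.
	lastPass := 0
	for w := 1600; ; w += 25 {
		if runLoad(w) {
			lastPass = w
			continue
		}
		fmt.Printf("max warehouses ≈ %d\n", (lastPass+w)/2) // (1625+1650)/2 = 1637
		return
	}
}
```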
Your log shows that the test runner issued a roachprod stop.
Maybe you mean
Folding this into #31409, which has more activity.
Describe the problem
I saw unexpected tpcc results:
_elapsed_______tpmC____efc__avg(ms)__p50(ms)__p90(ms)__p95(ms)__p99(ms)_pMax(ms)
  600.0s    16634.3  78.4%   4894.6   3623.9   9126.8  12884.9  30064.8 103079.2
--- FAIL: tpcc 1650 resulted in 16634.3 tpmC (80.0% of max tpmC)
This should be closer to 3,000 warehouses.
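(For reference, the efc column appears to be tpmC divided by the theoretical TPC-C ceiling of 12.86 tpmC per warehouse: 16634.3 / (12.86 × 1650) ≈ 78.4%, which matches the output above. Sustaining ~3,000 warehouses would instead require on the order of 12.86 × 3000 ≈ 38,600 tpmC.)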
I'm trying to review the results for a 6 node run of tpcc and I can't connect to the cluster.
To Reproduce
Modified to use roachtest from 2 days ago:
Modified the test to use partitioning and 6 nodes (rough sketch below):
Ran:
bin/roachtest bench '^tpccbench/nodes=6/cpu=16/partition$$' --wipe=false --user=andy
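For illustration only, the "partitioning and 6 nodes" modification referenced above might look roughly like this as a bench spec; the struct and field names are hypothetical stand-ins, not the actual roachtest API:

```go
package main

import "fmt"

// tpccBenchSpec is a hypothetical stand-in for roachtest's bench
// configuration; the real type and field names may differ.
type tpccBenchSpec struct {
	Nodes            int  // number of CockroachDB nodes
	CPUs             int  // vCPUs per node
	EnablePartitions bool // partition the workload across nodes
}

func main() {
	spec := tpccBenchSpec{Nodes: 6, CPUs: 16, EnablePartitions: true}
	// The spec maps onto the test name matched on the command line.
	name := fmt.Sprintf("tpccbench/nodes=%d/cpu=%d", spec.Nodes, spec.CPUs)
	if spec.EnablePartitions {
		name += "/partition"
	}
	fmt.Println(name) // tpccbench/nodes=6/cpu=16/partition
}
```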
Expected behavior
No dead nodes
Additional data / screenshots

Lost connection in the webui
I can't use roachprod adminurl --open for any of the nodes.
Environment:
Additional context
andy-1539646440-tpccbench-nodes-6-cpu-16-partition