-
Notifications
You must be signed in to change notification settings - Fork 3.9k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
roachtest: tpccbench: handle overload vm crash in last search iter #64205
Conversation
`tpccbench` is set up to "handle" (ignore) crashes during its line search on the assumption that these are due to pushing CRDB into overload territory, which at the time of writing it does not handle gracefully. There was a special case in which this was broken, namely that of the line search terminating in a final step with a crash. In that case, the cluster would be left running with one node down, which roachtest checks and emits as an error. Unconditionally restart the cluster after the line search (assuming it found a passing warehouse count, i.e. didn't error out itself) to make roachtest happy. Closes cockroachdb#64187. Release note: None
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewed 1 of 1 files at r1.
Reviewable status:complete! 0 of 0 LGTMs obtained (waiting on @tbg)
pkg/cmd/roachtest/tpcc.go, line 844 at r1 (raw file):
restart() time.Sleep(restartWait)
Will we need to sleep after the final restart()
call? If so, maybe consider moving this into restart()
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status:
complete! 0 of 0 LGTMs obtained (waiting on @erikgrinaker)
pkg/cmd/roachtest/tpcc.go, line 844 at r1 (raw file):
Previously, erikgrinaker (Erik Grinaker) wrote…
Will we need to sleep after the final
restart()
call? If so, maybe consider moving this intorestart()
.
No, there shouldn't be any reason to - we usually don't randomly sleep after starting the cluster. I guess there also shouldn't be any reason for this to be necessary here, but I don't want to ruffle any feathers :-)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status:
complete! 1 of 0 LGTMs obtained (waiting on @tbg)
bors r=erikgrinaker |
Build failed (retrying...): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status:
complete! 1 of 0 LGTMs obtained (waiting on @tbg)
pkg/cmd/roachtest/tpcc.go, line 812 at r1 (raw file):
} shortCtx, cancel := context.WithTimeout(ctx, 2*time.Minute) if err := c.StopE(shortCtx, roachNodes); err != nil {
Tobi, what's the point of this Stop
call after the previous c.Reset()
? What's there to stop? Are we simply using it as a sanity check that we can talk to the machines?
@andreimatei yes pretty much. I hope all of this bors r=erikgrinaker |
tpccbench
is set up to "handle" (ignore) crashes during its linesearch on the assumption that these are due to pushing CRDB into
overload territory, which at the time of writing it does not handle
gracefully.
There was a special case in which this was broken, namely that of
the line search terminating in a final step with a crash. In that
case, the cluster would be left running with one node down, which
roachtest checks and emits as an error.
Unconditionally restart the cluster after the line search (assuming
it found a passing warehouse count, i.e. didn't error out itself)
to make roachtest happy.
cc @cockroachdb/kv
Closes #64187.
Release note: None