-
Notifications
You must be signed in to change notification settings - Fork 3.9k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
roachtest: introduce new mixed-version testing API #92431
Conversation
For reference, this is what the output looks like when running Another note: I'm in the process of migrating |
80cdeb5
to
a11010d
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewed 4 of 31 files at r1, all commit messages.
Reviewable status:complete! 0 of 0 LGTMs obtained (waiting on @renatolabs, @smg260, and @srosenberg)
pkg/cmd/roachtest/roachtestutil/clusterupgrade/clusterupgrade.go
line 28 at r2 (raw file):
"github.com/cockroachdb/cockroach/pkg/roachprod/logger" "github.com/cockroachdb/cockroach/pkg/util/retry" "github.com/cockroachdb/cockroach/pkg/util/version"
Nit: this clashes with some of the variable names, can we give this import an alias rather?
pkg/cmd/roachtest/roachtestutil/clusterupgrade/clusterupgrade.go
line 41 at r2 (raw file):
// NB: version means major.minor[-internal]; the patch level isn't // returned. For example, a binary of version 19.2.4 will return 19.2. func BinaryVersion(ctx context.Context, db *gosql.DB) (roachpb.Version, error) {
Unsure if there is a reason for keeping ctx
as it is unused. I know most of the code originates from somewhere else, but might be worth doing some clean-up now.
pkg/cmd/roachtest/roachtestutil/clusterupgrade/clusterupgrade.go
line 52 at r2 (raw file):
} bv, err := roachpb.ParseVersion(sv)
Nit: I think we can return this as is, without the extra code below. roachpb.ParseVersion
also seems to return an empty version on error.
pkg/cmd/roachtest/roachtestutil/clusterupgrade/clusterupgrade.go
line 72 at r2 (raw file):
} cv, err := roachpb.ParseVersion(sv)
Nit: I think we can return this as is, without the extra code below. roachpb.ParseVersion
also seems to return an empty version on error.
pkg/cmd/roachtest/roachtestutil/clusterupgrade/clusterupgrade.go
line 106 at r2 (raw file):
v := "v" + newVersion dir := v binaryName = filepath.Join(dir, "cockroach")
Nit: From a readability (not performance) perspective I would rephrase these 3 lines as:
binaryName := BinaryPathFromVersion(newVersion)
dir := filepath.Dir(binaryName)
pkg/cmd/roachtest/roachtestutil/clusterupgrade/clusterupgrade.go
line 180 at r2 (raw file):
c cluster.Cluster, nodes option.NodeListOption, startOpts option.StartOpts,
startOpts
is unused.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status:
complete! 0 of 0 LGTMs obtained (waiting on @renatolabs, @smg260, and @srosenberg)
pkg/cmd/roachtest/roachtestutil/mixedversion/mixedversion.go
line 51 at r2 (raw file):
// mvt.Run() // // Fuctions passed to `InMixedVersion` will be called at arbitrary
Small typo here Fuctions
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewed 5 of 31 files at r1.
Reviewable status:complete! 0 of 0 LGTMs obtained (waiting on @renatolabs and @srosenberg)
a discussion (no related file):
Nice work! It's easy to see how much the test can be simplified with the mechanics of a mixed version environment being abstracted away.
Is it easy enough for developers to retain fine grained ordering of steps as they were previous able to do (cumbersomely)?
pkg/cmd/roachtest/roachtestutil/clusterupgrade/clusterupgrade.go
line 65 at r2 (raw file):
// NB: cluster versions are always major.minor[-internal]; there isn't // a patch level. func ClusterVersion(ctx context.Context, db *gosql.DB) (roachpb.Version, error) {
Are utility functions like this (and BinaryVersion
) better exposed somewhere more generic in roachtest, and perhaps implemented in roachprod?. Although relating to upgrades, they may be useful elsewhere.
pkg/cmd/roachtest/roachtestutil/clusterupgrade/clusterupgrade.go
line 127 at r2 (raw file):
ctx context.Context, l *logger.Logger, c cluster.Cluster, nodes option.NodeListOption, v string, ) error { c.Run(ctx, nodes, "mkdir", "-p", "{store-dir}")
nit: Non interpolated commands read better as just a single string. "mkdir -p {store-dir}"
pkg/cmd/roachtest/roachtestutil/mixedversion/mixedversion.go
line 236 at r2 (raw file):
predicate := func(testContext Context) bool { // If the cluster is finalizing an upgrade, run this hook // according with the probability defined in the package.
nit: according to
pkg/cmd/roachtest/roachtestutil/mixedversion/test_planner.go
line 71 at r2 (raw file):
// // TODO(renato): further opportunities for random exploration: going // back multiple releases instad of just one; picking a patch release
*instead
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewed 6 of 31 files at r1.
Reviewable status:complete! 0 of 0 LGTMs obtained (waiting on @herkolategan, @renatolabs, and @smg260)
pkg/cmd/roachtest/roachtestutil/clusterupgrade/clusterupgrade.go
line 32 at r2 (raw file):
const ( // MainVersion is the default used to represent that the binary
Nit: sentinel
is more precise than default
. Proposed edits,
// MainVersion is the sentinel used to represent that the binary
// passed to roachtest should be uploaded when version is left unspecified.
pkg/cmd/roachtest/roachtestutil/clusterupgrade/clusterupgrade.go
line 92 at r2 (raw file):
) (string, error) { binaryName := "./cockroach" if newVersion == "" {
newVersion == MainVersion
pkg/cmd/roachtest/roachtestutil/clusterupgrade/clusterupgrade.go
line 112 at r2 (raw file):
return "", err } if err := c.Stage(ctx, l, "release", v, dir, nodes); err != nil {
Might be worth adding a TODO since we'll likely want to support staging non-release binaries; see [1].
[1] #93229
pkg/cmd/roachtest/roachtestutil/clusterupgrade/clusterupgrade.go
line 143 at r2 (raw file):
} // Extract fixture. Fail if there's already an LSM in the store dir. c.Run(ctx, nodes, "cd {store-dir} && [ ! -f {store-dir}/CURRENT ] && tar -xf fixture.tgz")
FYI, I stumbled upon this recently but forgot to create an issue. As of [1], pebble is no longer creating CURRENT
. So, this non-existence check is no longer valid. I believe the same check is used elsewhere in roachtest in conjunction with roachtest monitor --ignore-empty-nodes
.
pkg/cmd/roachtest/roachtestutil/clusterupgrade/clusterupgrade.go
line 229 at r2 (raw file):
// WaitForClusterUpgrade waits for the cluster version to reach the // first node's binary version (which is assumed to be every node's
It seems error-prone to expect that only the first node's binary version denotes the cluster upgrade version to wait for. It also prevents the use-case where the caller is oblivious to the nodes' binary versions and just wants to wait for a specific cluster version. Why not pass the cluster version as an explicit parameter?
pkg/cmd/roachtest/roachtestutil/mixedversion/mixedversion.go
line 15 at r2 (raw file):
// // Mixed-version is defined as points in time when some (potentially // not all) nodes have been upgraded to a certain binary version, or
I think we need to make this more precise. The sufficient condition for "Mixed-version" is that a pair of nodes are running different binary versions. The migrations don't necessarily have to be enabled to be in the mixed version state.
pkg/cmd/roachtest/roachtestutil/mixedversion/mixedversion.go
line 32 at r2 (raw file):
// want to test in a mixed-version state. The framework will take care // of exercising that functionality in different mixed-version states, // randomizing the order of events in each run. A detailed test plan
A detailed test plan is logged _before_ the test runs; it facilitates debugging a test failure by summarizing all executed steps.
pkg/cmd/roachtest/roachtestutil/mixedversion/mixedversion.go
line 53 at r2 (raw file):
// Fuctions passed to `InMixedVersion` will be called at arbitrary // points during an upgrade/downgrade process. They may also be called // separately or concurrently. This information will be included in
separately
-> sequentially
pkg/cmd/roachtest/roachtestutil/mixedversion/mixedversion.go
line 91 at r2 (raw file):
// mixed-version hook will also run while all nodes in the cluster // have been upgraded and the cluster is allowed to upgrade. runWhileMigratingProbability = 0.5
The comment conflates upgrading of the binary and the cluster versions. How about the following,
runWhileMigratingProbability is the probability that a mixed-version hook will run after all nodes in the cluster
have been upgraded to the same binary version, and the cluster version is allowed to be upgraded; this typically
involves the execution of migration steps before the new cluster version can be finalized.
pkg/cmd/roachtest/roachtestutil/mixedversion/mixedversion.go
line 93 at r2 (raw file):
runWhileMigratingProbability = 0.5 noDBNeeded = -1
It wasn't immediately clear what this sentinel is for. In fact, it can be removed since it's used only by the pretty printer, and in all the cases with the exception of runHookStep
.
pkg/cmd/roachtest/roachtestutil/mixedversion/mixedversion.go
line 119 at r2 (raw file):
} hookFunc func(*logger.Logger, *gosql.DB) error
userFunc
seems more precise unless the plan is to allow internal hooks which a user of this API cannot specify.
pkg/cmd/roachtest/roachtestutil/mixedversion/mixedversion.go
line 127 at r2 (raw file):
versionUpgradeHook struct { name string predicate predicateFunc
predicate
doesn't say much on how this is intended to be used. In fact, later I'll argue that we don't even need it. In case we do keep it, maybe rename to isEnabled
?
pkg/cmd/roachtest/roachtestutil/mixedversion/mixedversion.go
line 134 at r2 (raw file):
// test. It can be a runnableStep (see below), or a "meta-step", // meaning that it combines other steps in some way (for instance, a // series of steps to run in sequence or concurrently).
sequence
--> sequentially
pkg/cmd/roachtest/roachtestutil/mixedversion/mixedversion.go
line 139 at r2 (raw file):
// runnableStep represents steps that implement the pieces on top of // which a mixed-version test is built. runnableStep interface {
In light of the above comment, I'd suggest renaming to singleStep
; runnable
is ambiguous because it typically denotes an intermediate state. Here, we mean that it's a step that doesn't compose sub-steps and hence can be directly executed.
pkg/cmd/roachtest/roachtestutil/mixedversion/mixedversion.go
line 147 at r2 (raw file):
// connection, this function should return the `noDBNeeded` // constant DBNode() int
As noted above, I think this can be removed. Instead, the pretty-printer can special-case runHookStep
.
pkg/cmd/roachtest/roachtestutil/mixedversion/mixedversion.go
line 163 at r2 (raw file):
startup hooks mixedVersion hooks afterTest hooks
The name seems awkward. There isn't anything to execute after a test completed, in the literal sense. How about renaming to afterUpgradeFinalized
?
pkg/cmd/roachtest/roachtestutil/mixedversion/mixedversion.go
line 194 at r2 (raw file):
// a mixed-version test. Most of the parameters come from the // `test.Test` struct, which can't be directly passed here due to // import cycles.
I couldn't immediately see where the import cycle shows up.
pkg/cmd/roachtest/roachtestutil/mixedversion/mixedversion.go
line 234 at r2 (raw file):
lastFromVersion := "invalid-version" var upgradedNodes int predicate := func(testContext Context) bool {
Since the predicate
is used exclusively with the mixed-version step(s), it could be inlined into Filter
below. It seems an unnecessary complication since all the other cases set the predicate to nil
.
pkg/cmd/roachtest/roachtestutil/mixedversion/mixedversion.go
line 247 at r2 (raw file):
if testContext.FromVersion != lastFromVersion { lastFromVersion = testContext.FromVersion upgradedNodes = t.prng.Intn(len(t.opts.crdbNodes)) + 1
upgradedNodes
--> numUpgradedNodes
pkg/cmd/roachtest/roachtestutil/mixedversion/mixedversion.go
line 269 at r2 (raw file):
// test has started from a previous version, performed an upgrade, and // waited for the cluster version to upgrade on all nodes. func (t *Test) AfterTest(desc string, fn hookFunc) {
AfterUpgradeFinalized
pkg/cmd/roachtest/roachtestutil/mixedversion/mixedversion.go
line 273 at r2 (raw file):
} // Run runs the mixed-version test. It should be called once all
Run runs
reminded of the silly linter we have to appease :)
pkg/cmd/roachtest/roachtestutil/mixedversion/mixedversion.go
line 478 at r2 (raw file):
} // sequentialRunStep is a "meta-step" that indicates that a sequence
Nit:
sequentialRunStep is a "meta-step" which indicates enclosed steps are to be executed sequentially.
The default test runner already runs steps sequentially. This meta-step exists primarily as a way
to group related steps so that a test plan is easier to understand for a human.
pkg/cmd/roachtest/roachtestutil/mixedversion/mixedversion.go
line 495 at r2 (raw file):
// with a random delay that should be honored before the step starts // executing. This is useful in steps that are to be executed // concurrently to avoid biases on a certain execution orders. While
`...avoid biases of certain execution orders."
pkg/cmd/roachtest/roachtestutil/mixedversion/mixedversion.go
line 497 at r2 (raw file):
// concurrently to avoid biases on a certain execution orders. While // the Go scheduler is non-deterministic, in practice schedules are not // uniformly distributed (as that's not the point of the scheduler),
I'd remove the parenthetical since it doesn't add information.
pkg/cmd/roachtest/roachtestutil/mixedversion/mixedversion.go
line 499 at r2 (raw file):
// uniformly distributed (as that's not the point of the scheduler), // and the delay is inteded to expose scenarios where a specific // ordering of events (unlikely happen if all concurrent steps were
I'd also remove the parenthetical since it doesn't add information.
pkg/cmd/roachtest/roachtestutil/mixedversion/mixedversion.go
line 517 at r2 (raw file):
delayedSteps := make([]testStep, 0, len(steps)) for _, step := range steps { delay := time.Duration(rng.Intn(1000)+1) * time.Millisecond // delays vary from 1ms to 1s
That's a pretty huge space--1000^numConcurrentSteps. Realistically, a random walk isn't going to capture this state space in a meaningful way. Since many user-specified steps are expected to take seconds to execute, why don't we reduce the state space by choosing from [100, 200]
? The added latency doesn't seem concerning. Plus, it's closer to emulating user behavior since an operation would incur network RTT latency, etc.
pkg/cmd/roachtest/roachtestutil/mixedversion/mixedversion.go
line 542 at r2 (raw file):
} // Filter filters out hooks where the predicate does not return `true`
Another silly linter appeasement exercise :)
pkg/cmd/roachtest/roachtestutil/mixedversion/mixedversion.go
line 557 at r2 (raw file):
// AsSteps transforms the sequence of hooks into a corresponding test // step. If there is only one hook, the corresponding `runHookStep` is // returned. Otherwise, a `concurrentRunStep` is returned, where every
What if we want to have multiple steps to run sequentially at some later time? E.g., step k
could depend on result/side-effect of step k-1
which forces them to be executed sequentially.
pkg/cmd/roachtest/roachtestutil/mixedversion/test_planner.go
line 62 at r2 (raw file):
// mixed-version hooks at arbitrary points). // - downgrade all nodes back to the predecessor version (running // mixed-version hooks at arbitrary points again).
They are not entirely arbitrary points; this is determined by the test planner and the implicit state machine. Maybe paraphrase to at times determined by the test planner
?
pkg/cmd/roachtest/roachtestutil/mixedversion/test_planner.go
line 180 at r2 (raw file):
// PrettyPrint displays a tree-like view of the mixed-version test // plan, useful when debugging mixed-version test failures. Each step // is assigned an ID, making it easy to correspond the step that
Nit: ...making it easy to correlate the step...
pkg/cmd/roachtest/roachtestutil/mixedversion/test_planner.go
line 199 at r2 (raw file):
out.WriteString(fmt.Sprintf("%s %s\n", prefix, label)) for i, subStep := range steps { nestedPrefix := strings.ReplaceAll(prefix, branchString, nestedBranchString)
The use of strings.ReplaceAll
and the output prefixes is a neat way to encode the (stack) pop
!
pkg/cmd/roachtest/roachtestutil/mixedversion/test_runner.go
line 31 at r2 (raw file):
type ( testRunner struct {
Nit: source file name duplicates roachtest's test_runner.go
. Any chance we could rename the source to runner.go
and the struct to Runner
? The test prefix seems redundant since it's already under roachtest
. (The same could be done for all other source file names, planner.go
, etc.) This is a rather selfish ask since before I could just type test_runner
in GoLand, and it wouldn't clash with any other test runners :)
pkg/cmd/roachtest/roachtestutil/mixedversion/test_runner.go
line 69 at r2 (raw file):
defer tr.closeConnections() tr.logger.Printf(tr.plan.PrettyPrint())
Seems like this belongs to the caller (Test.run
)? Strictly speaking it's not a concern of executing the test plan.
pkg/cmd/roachtest/roachtestutil/mixedversion/test_runner.go
line 82 at r2 (raw file):
// recursively in the case of sequentialRunStep and concurrentRunStep. func (tr *testRunner) runStep(step testStep) error { if rs, isRunnable := step.(runnableStep); isRunnable && rs.ID() > 1 {
I had to read this a couple of times due to isRunnable
. Typically, runnable
denotes a certain attribute (and behavior) of an entity. E.g., a thread is either runnable or waiting, or executing. I think a less ambiguous name for an individual step is singleStep
. Here, perhaps we could use ok
instead of isRunnable
since we're merely checking that the type cast holds?
pkg/cmd/roachtest/roachtestutil/mixedversion/test_runner.go
line 102 at r2 (raw file):
case sequentialRunStep: for _, ss := range s.steps { if err := tr.runStep(ss); err != nil {
Methinks randomized delays might also be useful for sequential steps. Why not reuse the same idea as below?
pkg/cmd/roachtest/roachtestutil/mixedversion/test_runner.go
line 123 at r2 (raw file):
default: rs := s.(runnableStep)
Nit: singleStep
seems less ambiguous.
pkg/cmd/roachtest/roachtestutil/mixedversion/test_runner.go
line 145 at r2 (raw file):
// reportError augments the error passed with extra // information. Specifically, the error message will include the ID of // the step that failed, the random seed used, the binary version on
The binary version was cached during step 1. So, it's possible that it would be stale at the time of this call. It should be refreshed in the same way that the cluster version is refreshed.
pkg/cmd/roachtest/roachtestutil/mixedversion/test_runner.go
line 227 at r2 (raw file):
conn, err := tr.cluster.ConnE(tr.ctx, tr.logger, node) if err != nil { return fmt.Errorf("failed to connect to node %d: %w", node, err)
Note, when this errors out, the cached field (here and in the above helpers) would be only partially initialized. So, it's important to ensure the test bails out; otherwise, you have inconsistent (internal) state which may result in index out of bounds.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice work! This API is a huge improvement over the existing way of mixed-version testing via roachtest.
Reviewable status:
complete! 0 of 0 LGTMs obtained (waiting on @herkolategan, @renatolabs, and @smg260)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status:
complete! 0 of 0 LGTMs obtained (waiting on @renatolabs, @smg260, and @srosenberg)
a discussion (no related file):
Great work! I like how the execution plan is calculated beforehand. This leaves the opportunity to extend the framework in the future.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Only gave this a high level look, but this looks great! One question I have is what the artifacts on failure look like. You mention in the commit messages that steps are assigned numerical IDs and logs are made using those IDs; any chance steps could also have something more human readable attached to them which would make it in the files? For roachprod run
, roachtest
currently generates files like these:
and when a command exits nonzero, we also touch <filename>.failed
so that you can visually pick out failing commands right away.
which isn't pretty but at least for me helps me find what I need pretty quickly (including doing things like tail -n 3 artifacts/something/else/run_*/*_run_kv.log
.
Going through one failure as a group might highlight a few UX improvements like that.
Another thought - and not suggesting this needs to be in this PR - is perhaps requiring that the provided hooks are idempotent. This will, down the line, make it easier to inject chaos while hooks are running because you can replay any in-progress hooks. Rather than going back at that point and possibly retrofitting many many hooks, you could "just" update the scheduler today to sometimes (or maybe even always!) invoke hooks twice (weeding out obvious non-idempotency).
Reviewed 1 of 31 files at r1.
Reviewable status:complete! 0 of 0 LGTMs obtained (waiting on @herkolategan, @renatolabs, @smg260, and @srosenberg)
pkg/cmd/roachtest/roachtestutil/mixedversion/mixedversion.go
line 256 at r1 (raw file):
} // OnStartup registers a callback that is run once the cluster is
Here and for the other methods too, maybe worth pointing out what the concurrency contract here is? If I register five hooks, will they be called in the order in which they appear? Or may some of them be invoked concurrently? If this is the case, is there a way to opt out of concurrency for steps that want to run sequentially, or is the expectation that such operations need to be packaged into one hook (limiting, to a degree, the quality of the artifacts on failure)?
I think there's no concurrency and so the answer is simple, but a sentence on each of the methods to that effect would help.
pkg/cmd/roachtest/roachtestutil/mixedversion/mixedversion.go
line 269 at r2 (raw file):
Previously, srosenberg (Stan Rosenberg) wrote…
AfterUpgradeFinalized
+1
pkg/cmd/roachtest/roachtestutil/mixedversion/test_planner_test.go
line 93 at r1 (raw file):
│ └── run "mixed-version 2", with connection to node 3 (9) ├── downgrade nodes :1-4 from "<current>" to "%[1]s" │ ├── restart node 3 with binary version %[1]s (10)
Goland says:
Is there something weird going on here?
a11010d
to
778a3aa
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You mention in the commit messages that steps are assigned numerical IDs and logs are made using those IDs
The reason I chose to follow this path is that it provides an unambiguous link between a point in the execution of the test and the corresponding logs (so you can look at the plan printed at the top and match that against the files under mixed-version-test
). That said, I agree there's value in having something more descriptive in the log files themselves.
I made a few changes in this latest set of updates:
- log files will still contain the ID, but will also include the step description. So if a step is executed multiple times during a test, the name will be the same, but the IDs will disambiguate them.
- If a step fails, I append a
_FAILED
to the log file so that it's clear by skimming themixed-version-tests
directory which step failed.
Here's the mixed-version-test logs directory on a branch where I injected a step that errors out:
I encourage everyone to play around with this branch and/or suggest UX improvements they would like to see!
is perhaps requiring that the provided hooks are idempotent
Definitely an interesting idea; something I won't do anything about in this PR (as you mentioned), but I'll take the time to think about this more.
Reviewable status:
complete! 0 of 0 LGTMs obtained (waiting on @herkolategan, @smg260, @srosenberg, and @tbg)
a discussion (no related file):
I pushed quite a few changes based on the feedback received. Ready for another look please!
pkg/cmd/roachtest/roachtestutil/clusterupgrade/clusterupgrade.go
line 28 at r2 (raw file):
Previously, herkolategan (Herko Lategan) wrote…
Nit: this clashes with some of the variable names, can we give this import an alias rather?
Renamed variables.
pkg/cmd/roachtest/roachtestutil/clusterupgrade/clusterupgrade.go
line 32 at r2 (raw file):
Previously, srosenberg (Stan Rosenberg) wrote…
Nit:
sentinel
is more precise thandefault
. Proposed edits,// MainVersion is the sentinel used to represent that the binary // passed to roachtest should be uploaded when version is left unspecified.
Updated.
pkg/cmd/roachtest/roachtestutil/clusterupgrade/clusterupgrade.go
line 41 at r2 (raw file):
Previously, herkolategan (Herko Lategan) wrote…
Unsure if there is a reason for keeping
ctx
as it is unused. I know most of the code originates from somewhere else, but might be worth doing some clean-up now.
Removed.
pkg/cmd/roachtest/roachtestutil/clusterupgrade/clusterupgrade.go
line 52 at r2 (raw file):
Previously, herkolategan (Herko Lategan) wrote…
Nit: I think we can return this as is, without the extra code below.
roachpb.ParseVersion
also seems to return an empty version on error.
Done.
pkg/cmd/roachtest/roachtestutil/clusterupgrade/clusterupgrade.go
line 65 at r2 (raw file):
Previously, smg260 (Miral Gadani) wrote…
Are utility functions like this (and
BinaryVersion
) better exposed somewhere more generic in roachtest, and perhaps implemented in roachprod?. Although relating to upgrades, they may be useful elsewhere.
Not sure I understand the question: in this PR, they are moved to a generic roachtest package (roachtestutil
) and can be reused by other tests if needed. Not sure if this belongs in roachprod, we can move it in the future if we see an advantage in doing that.
pkg/cmd/roachtest/roachtestutil/clusterupgrade/clusterupgrade.go
line 72 at r2 (raw file):
Previously, herkolategan (Herko Lategan) wrote…
Nit: I think we can return this as is, without the extra code below.
roachpb.ParseVersion
also seems to return an empty version on error.
Done.
pkg/cmd/roachtest/roachtestutil/clusterupgrade/clusterupgrade.go
line 92 at r2 (raw file):
Previously, srosenberg (Stan Rosenberg) wrote…
newVersion == MainVersion
Done
pkg/cmd/roachtest/roachtestutil/clusterupgrade/clusterupgrade.go
line 106 at r2 (raw file):
Previously, herkolategan (Herko Lategan) wrote…
Nit: From a readability (not performance) perspective I would rephrase these 3 lines as:
binaryName := BinaryPathFromVersion(newVersion) dir := filepath.Dir(binaryName)
This is a little more than I'm comfortable doing in this PR, as it could impact other tests. I applied most of the simple improvements suggested in the comments. We can have a follow-up PR to clean this up further if necessary.
pkg/cmd/roachtest/roachtestutil/clusterupgrade/clusterupgrade.go
line 112 at r2 (raw file):
Previously, srosenberg (Stan Rosenberg) wrote…
Might be worth adding a TODO since we'll likely want to support staging non-release binaries; see [1].
[1] #93229
What would the TODO
say here? I don't think we'd be staging SHAs in these upgrade tests.
pkg/cmd/roachtest/roachtestutil/clusterupgrade/clusterupgrade.go
line 127 at r2 (raw file):
Previously, smg260 (Miral Gadani) wrote…
nit: Non interpolated commands read better as just a single string. "mkdir -p {store-dir}"
Done.
pkg/cmd/roachtest/roachtestutil/clusterupgrade/clusterupgrade.go
line 143 at r2 (raw file):
Previously, srosenberg (Stan Rosenberg) wrote…
FYI, I stumbled upon this recently but forgot to create an issue. As of [1], pebble is no longer creating
CURRENT
. So, this non-existence check is no longer valid. I believe the same check is used elsewhere in roachtest in conjunction withroachtest monitor --ignore-empty-nodes
.
Good point, created #95170.
pkg/cmd/roachtest/roachtestutil/clusterupgrade/clusterupgrade.go
line 180 at r2 (raw file):
Previously, herkolategan (Herko Lategan) wrote…
startOpts
is unused.
Removed.
pkg/cmd/roachtest/roachtestutil/clusterupgrade/clusterupgrade.go
line 229 at r2 (raw file):
Previously, srosenberg (Stan Rosenberg) wrote…
It seems error-prone to expect that only the first node's binary version denotes the cluster upgrade version to wait for. It also prevents the use-case where the caller is oblivious to the nodes' binary versions and just wants to wait for a specific cluster version. Why not pass the cluster version as an explicit parameter?
I guess it depends on the contract of the function. Currently, the contract is: "this function should be called only when every node in the cluster has been restarted with the same binary version" (I added a comment to make this assumption explicit).
Given this assumption/contract, I prefer the current function signature. If we require callers to pass the specific expected version, that means we'll have to devise a mechanism to go from binary file (passed to roachtest
on the command line) to binary version. While possible, seems like it's work that doesn't give us much at the moment.
pkg/cmd/roachtest/roachtestutil/mixedversion/mixedversion.go
line 256 at r1 (raw file):
Previously, tbg (Tobias Grieger) wrote…
Here and for the other methods too, maybe worth pointing out what the concurrency contract here is? If I register five hooks, will they be called in the order in which they appear? Or may some of them be invoked concurrently? If this is the case, is there a way to opt out of concurrency for steps that want to run sequentially, or is the expectation that such operations need to be packaged into one hook (limiting, to a degree, the quality of the artifacts on failure)?
I think there's no concurrency and so the answer is simple, but a sentence on each of the methods to that effect would help.
Great point. In fact, multiple callbacks are executed concurrently. Added a note to make that explicit.
pkg/cmd/roachtest/roachtestutil/mixedversion/mixedversion.go
line 15 at r2 (raw file):
Previously, srosenberg (Stan Rosenberg) wrote…
I think we need to make this more precise. The sufficient condition for "Mixed-version" is that a pair of nodes are running different binary versions. The migrations don't necessarily have to be enabled to be in the mixed version state.
I think the issue is that we (or maybe just me?) use the term "mixed-version" to mean more than one situation: 1) clusters where nodes don't run the same binary version; and 2) clusters that run the same binary version, but the cluster version is not the same as the binary version (either because migrations are underway, or because preserve_downgrade_option
is set). I tried to make the comment a little clearer.
pkg/cmd/roachtest/roachtestutil/mixedversion/mixedversion.go
line 32 at r2 (raw file):
Previously, srosenberg (Stan Rosenberg) wrote…
A detailed test plan is logged _before_ the test runs; it facilitates debugging a test failure by summarizing all executed steps.
Fixed.
pkg/cmd/roachtest/roachtestutil/mixedversion/mixedversion.go
line 51 at r2 (raw file):
Previously, herkolategan (Herko Lategan) wrote…
Small typo here
Fuctions
Fixed.
pkg/cmd/roachtest/roachtestutil/mixedversion/mixedversion.go
line 53 at r2 (raw file):
Previously, srosenberg (Stan Rosenberg) wrote…
separately
->sequentially
Fixed.
pkg/cmd/roachtest/roachtestutil/mixedversion/mixedversion.go
line 91 at r2 (raw file):
Previously, srosenberg (Stan Rosenberg) wrote…
The comment conflates upgrading of the binary and the cluster versions. How about the following,
runWhileMigratingProbability is the probability that a mixed-version hook will run after all nodes in the cluster have been upgraded to the same binary version, and the cluster version is allowed to be upgraded; this typically involves the execution of migration steps before the new cluster version can be finalized.
Updated, thanks.
pkg/cmd/roachtest/roachtestutil/mixedversion/mixedversion.go
line 93 at r2 (raw file):
Previously, srosenberg (Stan Rosenberg) wrote…
It wasn't immediately clear what this sentinel is for. In fact, it can be removed since it's used only by the pretty printer, and in all the cases with the exception of
runHookStep
.
Other steps that don't use noDBNeeded
are preserveDowngradeOptionStep
and finalizeUpgradeStep
. I suspect over time we'll have more steps that require a database connection, so I'm not removing the sentinel.
pkg/cmd/roachtest/roachtestutil/mixedversion/mixedversion.go
line 119 at r2 (raw file):
Previously, srosenberg (Stan Rosenberg) wrote…
userFunc
seems more precise unless the plan is to allow internal hooks which a user of this API cannot specify.
Renamed.
pkg/cmd/roachtest/roachtestutil/mixedversion/mixedversion.go
line 134 at r2 (raw file):
Previously, srosenberg (Stan Rosenberg) wrote…
sequence
-->sequentially
Fixed.
pkg/cmd/roachtest/roachtestutil/mixedversion/mixedversion.go
line 139 at r2 (raw file):
Previously, srosenberg (Stan Rosenberg) wrote…
In light of the above comment, I'd suggest renaming to
singleStep
;runnable
is ambiguous because it typically denotes an intermediate state. Here, we mean that it's a step that doesn't compose sub-steps and hence can be directly executed.
Makes sense; renamed to singleStep
pkg/cmd/roachtest/roachtestutil/mixedversion/mixedversion.go
line 147 at r2 (raw file):
Previously, srosenberg (Stan Rosenberg) wrote…
As noted above, I think this can be removed. Instead, the pretty-printer can special-case
runHookStep
.
See previous comment; it's not the only step that requires a DB connection.
pkg/cmd/roachtest/roachtestutil/mixedversion/mixedversion.go
line 163 at r2 (raw file):
Previously, srosenberg (Stan Rosenberg) wrote…
The name seems awkward. There isn't anything to execute after a test completed, in the literal sense. How about renaming to
afterUpgradeFinalized
?
Renamed.
pkg/cmd/roachtest/roachtestutil/mixedversion/mixedversion.go
line 194 at r2 (raw file):
Previously, srosenberg (Stan Rosenberg) wrote…
I couldn't immediately see where the import cycle shows up.
This comment was outdated and represented an earlier iteration of the code. There's indeed no import cycle, and I changed function signatures to take a test.Test
where that makes sense.
pkg/cmd/roachtest/roachtestutil/mixedversion/mixedversion.go
line 234 at r2 (raw file):
Previously, srosenberg (Stan Rosenberg) wrote…
Since the
predicate
is used exclusively with the mixed-version step(s), it could be inlined intoFilter
below. It seems an unnecessary complication since all the other cases set the predicate tonil
.
The inclusion of predicate
is a little of foresight as I envisioned an API where callers would be able to pass their own predicates if the builtin ones didn't work for them. That's not possible yet, but I suspect it will be required as we move more mixed-version tests to this framework.
pkg/cmd/roachtest/roachtestutil/mixedversion/mixedversion.go
line 236 at r2 (raw file):
Previously, smg260 (Miral Gadani) wrote…
nit:
according to
Fixed.
pkg/cmd/roachtest/roachtestutil/mixedversion/mixedversion.go
line 247 at r2 (raw file):
Previously, srosenberg (Stan Rosenberg) wrote…
upgradedNodes
-->numUpgradedNodes
Renamed.
pkg/cmd/roachtest/roachtestutil/mixedversion/mixedversion.go
line 269 at r2 (raw file):
Previously, tbg (Tobias Grieger) wrote…
+1
Renamed
pkg/cmd/roachtest/roachtestutil/mixedversion/mixedversion.go
line 273 at r2 (raw file):
Previously, srosenberg (Stan Rosenberg) wrote…
Run runs
reminded of the silly linter we have to appease :)
No longer have to appease!
pkg/cmd/roachtest/roachtestutil/mixedversion/mixedversion.go
line 478 at r2 (raw file):
Previously, srosenberg (Stan Rosenberg) wrote…
Nit:
sequentialRunStep is a "meta-step" which indicates enclosed steps are to be executed sequentially. The default test runner already runs steps sequentially. This meta-step exists primarily as a way to group related steps so that a test plan is easier to understand for a human.
Updated.
pkg/cmd/roachtest/roachtestutil/mixedversion/mixedversion.go
line 495 at r2 (raw file):
Previously, srosenberg (Stan Rosenberg) wrote…
`...avoid biases of certain execution orders."
Updated.
pkg/cmd/roachtest/roachtestutil/mixedversion/mixedversion.go
line 497 at r2 (raw file):
Previously, srosenberg (Stan Rosenberg) wrote…
I'd remove the parenthetical since it doesn't add information.
Removed.
pkg/cmd/roachtest/roachtestutil/mixedversion/mixedversion.go
line 499 at r2 (raw file):
Previously, srosenberg (Stan Rosenberg) wrote…
I'd also remove the parenthetical since it doesn't add information.
Removed.
pkg/cmd/roachtest/roachtestutil/mixedversion/mixedversion.go
line 517 at r2 (raw file):
Previously, srosenberg (Stan Rosenberg) wrote…
That's a pretty huge space--1000^numConcurrentSteps. Realistically, a random walk isn't going to capture this state space in a meaningful way. Since many user-specified steps are expected to take seconds to execute, why don't we reduce the state space by choosing from
[100, 200]
? The added latency doesn't seem concerning. Plus, it's closer to emulating user behavior since an operation would incur network RTT latency, etc.
Fair enough. Changed to pick randomly from a list of hardcoded delays (including no delay).
pkg/cmd/roachtest/roachtestutil/mixedversion/mixedversion.go
line 557 at r2 (raw file):
Previously, srosenberg (Stan Rosenberg) wrote…
What if we want to have multiple steps to run sequentially at some later time? E.g., step
k
could depend on result/side-effect of stepk-1
which forces them to be executed sequentially.
I chose to not encode dependencies in this API for now, as I feel that would overcomplicate things more than we want at this stage. If there are dependencies, they should for now be dealt with out of band: either by inlining them as a single callback passed to InMixedVersion
, or by having one wait for the other's completion (out of band channel or similar mechanism); or by passing a custom predicate
.
pkg/cmd/roachtest/roachtestutil/mixedversion/test_planner.go
line 62 at r2 (raw file):
Previously, srosenberg (Stan Rosenberg) wrote…
They are not entirely arbitrary points; this is determined by the test planner and the implicit state machine. Maybe paraphrase to
at times determined by the test planner
?
Good point, updated.
pkg/cmd/roachtest/roachtestutil/mixedversion/test_planner.go
line 71 at r2 (raw file):
Previously, smg260 (Miral Gadani) wrote…
*instead
Fixed.
pkg/cmd/roachtest/roachtestutil/mixedversion/test_planner.go
line 180 at r2 (raw file):
Previously, srosenberg (Stan Rosenberg) wrote…
Nit:
...making it easy to correlate the step...
Updated.
pkg/cmd/roachtest/roachtestutil/mixedversion/test_planner_test.go
line 93 at r1 (raw file):
Ah, I copy and pasted the output of tree
on my laptop and didn't realized it used NBSP. Updated the pretty printer to use plain old whitespaces instead.
pkg/cmd/roachtest/roachtestutil/mixedversion/test_runner.go
line 31 at r2 (raw file):
Previously, srosenberg (Stan Rosenberg) wrote…
Nit: source file name duplicates roachtest's
test_runner.go
. Any chance we could rename the source torunner.go
and the struct toRunner
? The test prefix seems redundant since it's already underroachtest
. (The same could be done for all other source file names,planner.go
, etc.) This is a rather selfish ask since before I could just typetest_runner
in GoLand, and it wouldn't clash with any other test runners :)
Fair enough! Renamed to planner.go
and runner.go
.
pkg/cmd/roachtest/roachtestutil/mixedversion/test_runner.go
line 69 at r2 (raw file):
Previously, srosenberg (Stan Rosenberg) wrote…
Seems like this belongs to the caller (
Test.run
)? Strictly speaking it's not a concern of executing the test plan.
Moved.
pkg/cmd/roachtest/roachtestutil/mixedversion/test_runner.go
line 145 at r2 (raw file):
Previously, srosenberg (Stan Rosenberg) wrote…
The binary version was cached during step 1. So, it's possible that it would be stale at the time of this call. It should be refreshed in the same way that the cluster version is refreshed.
By design, the binary version would not have changed behind our backs since it's only changed at specific points in the test. IMO, it would be wasteful (and confusing) to query those again.
pkg/cmd/roachtest/roachtestutil/mixedversion/test_runner.go
line 227 at r2 (raw file):
Previously, srosenberg (Stan Rosenberg) wrote…
Note, when this errors out, the cached field (here and in the above helpers) would be only partially initialized. So, it's important to ensure the test bails out; otherwise, you have inconsistent (internal) state which may result in index out of bounds.
That should already be the case, as we never ignore errors returned by functions. The test would just bail.
778a3aa
to
a163b90
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status:
complete! 1 of 0 LGTMs obtained (waiting on @herkolategan, @renatolabs, @smg260, and @tbg)
a discussion (no related file):
Previously, renatolabs (Renato Costa) wrote…
I pushed quite a few changes based on the feedback received. Ready for another look please!
Can't wait to see it running in nightlies!
pkg/cmd/roachtest/roachtestutil/mixedversion/planner.go
line 79 at r6 (raw file):
addSteps := func(ss []testStep) { steps = append(steps, ss...) } addSteps(p.initSteps())
I wonder if we should make this an invariant that can be enforced (e.g., via assert
) in runStep
? Maybe a dummy/noop step which serves only as a marker? Or we can just assume that the TestPlan is immutable after it's created?
pkg/cmd/roachtest/roachtestutil/mixedversion/runner.go
line 88 at r6 (raw file):
// runStep contains the logic of running a single test step, called // recursively in the case of sequentialRunStep and concurrentRunStep. func (tr *testRunner) runStep(step testStep) error {
It just dawned on me that it would be nice to also randomize/shuffle the order in which the single steps are executed (modulo, first and last). Maybe add a TODO for the next iteration?
pkg/cmd/roachtest/roachtestutil/mixedversion/runner.go
line 92 at r6 (raw file):
// if we are running a singleStep that is *not* the first step, // we can initialize the database connections. This represents the // assumption that the first step in the test plan is the one that
It's more than just an assumption, right? It should be an enforceable invariant.
pkg/cmd/roachtest/tests/decommission_self.go
line 26 at r6 (raw file):
func runDecommissionSelf(ctx context.Context, t test.Test, c cluster.Cluster) { allNodes := c.All() u := newVersionUpgradeTest(c,
This should now be called oldVersionUpgradeTest
:)
a163b90
to
c71083b
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status:
complete! 1 of 0 LGTMs obtained (waiting on @herkolategan, @smg260, @srosenberg, and @tbg)
pkg/cmd/roachtest/roachtestutil/mixedversion/planner.go
line 79 at r6 (raw file):
Previously, srosenberg (Stan Rosenberg) wrote…
I wonder if we should make this an invariant that can be enforced (e.g., via
assert
) inrunStep
? Maybe a dummy/noop step which serves only as a marker? Or we can just assume that the TestPlan is immutable after it's created?
Not sure I'm reading it right, but what invariant are you referring to here?
pkg/cmd/roachtest/roachtestutil/mixedversion/runner.go
line 88 at r6 (raw file):
Previously, srosenberg (Stan Rosenberg) wrote…
It just dawned on me that it would be nice to also randomize/shuffle the order in which the single steps are executed (modulo, first and last). Maybe add a TODO for the next iteration?
Not sure I understand this either. The test planner is the place where all of the randomization takes place. I intentionally designed the runner to be as dumb as possible and just follow the test plan verbatim so that the test plan (as displayed to the user) is an accurate description of what happened in the test.
pkg/cmd/roachtest/roachtestutil/mixedversion/runner.go
line 92 at r6 (raw file):
Previously, srosenberg (Stan Rosenberg) wrote…
It's more than just an assumption, right? It should be an enforceable invariant.
IMO it's still an assumption that the test runner makes about the output of the test planner: it assumes that after completion of the first step of the plan, cockroach nodes will be available. That might not always be the case as we change the test planner in the future.
I added a TODO and will revisit this in a next PR involving upgrade testing.
I'll run all mixed-version roachtests on this branch to make sure I didn't accidentally break tests that are using the "old" API when I refactored. For consideration
Since I changed an Option 1: Running the new API on bors and nightlies (current state)
Option 2: Leave test as-is on local runs (bors), run new API on nightlies
As can probably be noted, I'm partial to Option 1 as I think the benefits outweigh the risks, but I'm also happy with option 2. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status:
complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @renatolabs, @smg260, @srosenberg, and @tbg)
a discussion (no related file):
Everything I mentioned has been addressed and looks good to me, thanks! Also excited to see this running nightly.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd also prefer Option 1. Before the merge though, it would be nice to stress several iteration of both modalities. We could also double-check that the average duration was not negatively affected.
Reviewable status:
complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @herkolategan, @renatolabs, @smg260, and @tbg)
pkg/cmd/roachtest/roachtestutil/mixedversion/planner.go
line 79 at r6 (raw file):
Previously, renatolabs (Renato Costa) wrote…
Not sure I'm reading it right, but what invariant are you referring to here?
In runStep
, you say,
This represents the assumption that the first step in the test plan is the one...
That assumption could be made into an invariant.
pkg/cmd/roachtest/roachtestutil/mixedversion/runner.go
line 88 at r6 (raw file):
Previously, renatolabs (Renato Costa) wrote…
Not sure I understand this either. The test planner is the place where all of the randomization takes place. I intentionally designed the runner to be as dumb as possible and just follow the test plan verbatim so that the test plan (as displayed to the user) is an accurate description of what happened in the test.
Sorry, wrong placement of the comment. I was merely commenting that the planner should allow for the randomization of the order of the sequence of steps.
The `clusterupgrade` package is introduced in this commit as a means towards the end goal of reusing upgrade-related functionality in roachtests. Specifically, as we work towards a new API for mixed-version testing in roachtests, we want to continue to support existing upgrade tests (at least for the time being). For that reason, it's useful to be able to reuse certain low level building blocks that both APIs rely on. The `clusterupgrade` package includes functions to query the binary and cluster versions on a node, to replace the binary running on nodes, upload a certain binary, etc. Most of these functions were moved from the `versionupgrade.go` file. Epic: CRDB-19321 Release note: None
This introduces a new mixed-version API for roachtests, located under `roachtestutil/mixedversion`. The goal of this API is to provide a higher level framework than the existing `versionUpgradeTest` set of functions. Currently, all mixed-version roachtests rely on teams having to specify, step by step, which release version nodes should start from, how and in what order they upgrade, whether they attempt a rollback, at what point the feature being tested is invoked, etc. All of this makes the process of writing mixed-version roachtests time consuming and takes the focus away from what is actually being tested in mixed-version. In addition, it's typically not feasible to test multiple combinations of event orderings in mixed-version tests, and as a consequence most tests only exercise functionality after all nodes have upgraded to the new version, potentially hiding bugs that are only exposed if _some_ nodes are running a newer binary, while others are running an older release. The new API has the goal of automatically exploring different orderings in each run. In the most common scenario, teams will only need to specify what they want tested in a mixed-version setting, and the framework will decide when that feature will be invoked: in some runs, it will happen when all nodes are migrated to the new binary (most common scenario in existing tests); in other runs, it will happen when only some nodes have upgraded; sometimes it will run while upgrade migrations are running in the background; etc. It's also possible to specify multiple features to be tested in mixed-version state, and the framework will randomly schedule them during the test, potentially running them concurrently. The randomness is controlled by a seed that is logged when the test is executed. By reusing the same seed, the same mixed-version test is generated which helps with reproducibility (but, of course, does not make the test deterministic). Mixed-version tests that use the `mixedversion` package can be broken down into two stages: _planning_ and _execution_. During planning, all mixed-version hooks will be scheduled and a test plan is generated. All randomness is introduced during test plan generation. The execution stage just takes care of running the plan, and does not introduce any other randomness. When the mixed-version test is run, the test plan is printed in a tree-like format, providing a high level overview of what the test is doing. In addition, each step of the test is assigned a numeric ID, which can be used to reference the output of that step during execution. Each step logs to a different file, and the artifacts containing the output of each step will live in `mixed-version-test/{ID}.log`. The new framework builds on top of the recently introduced `clusterupgrade` package, which provides basic, low-level fiunctions that are also used by the existing mixed-version tests. Epic: CRDB-19321 Release note: None
This commit migrates the `acceptace/version-upgrade` test to use the new API in the `mixedversion` package. Two features were previously being tested in mixed-version state in this test: backups, and basic functionality that can be tested by running a SQL statement. This change simplifies the code as it is no longer required to set up the nodes, and the concept of the `versionFeatureStep` was removed as it is not necessary and was only being used by this test. Epic: CRDB-19321 Release note: None
c71083b
to
bbccd8f
Compare
This has been stressed over 200 times in local runs (as is done by bors builds). It's also been run maybe a dozen times in real clusters in the process of testing the code here.
Results over 10 runs:
So new the API takes generally less time to finish but its run time is also more variable. Both of these are expected as the test planner may decide to run some of the mixed-version test code concurrently, whereas on master everything is always sequential. |
bors r=srosenberg |
Build failed (retrying...): |
Build failed (retrying...): |
Build succeeded: |
roachtest: introduce
clusterupgrade
packageThe
clusterupgrade
package is introduced in this commit as a meanstowards the end goal of reusing upgrade-related functionality in
roachtests. Specifically, as we work towards a new API for
mixed-version testing in roachtests, we want to continue to support
existing upgrade tests (at least for the time being). For that reason,
it's useful to be able to reuse certain low level building blocks that
both APIs rely on. The
clusterupgrade
package includes functions toquery the binary and cluster versions on a node, to replace the binary
running on nodes, upload a certain binary, etc. Most of these
functions were moved from the
versionupgrade.go
file.roachtest: introduce new mixed-version testing API
This introduces a new mixed-version API for roachtests, located under
roachtestutil/mixedversion
. The goal of this API is to provide ahigher level framework than the existing
versionUpgradeTest
set offunctions.
Currently, all mixed-version roachtests rely on teams having to
specify, step by step, which release version nodes should start from,
how and in what order they upgrade, whether they attempt a rollback,
at what point the feature being tested is invoked, etc. All of this
makes the process of writing mixed-version roachtests time consuming
and takes the focus away from what is actually being tested in
mixed-version. In addition, it's typically not feasible to test
multiple combinations of event orderings in mixed-version tests, and
as a consequence most tests only exercise functionality after all
nodes have upgraded to the new version, potentially hiding bugs that
are only exposed if some nodes are running a newer binary, while
others are running an older release.
The new API has the goal of automatically exploring different
orderings in each run. In the most common scenario, teams will only
need to specify what they want tested in a mixed-version setting, and
the framework will decide when that feature will be invoked: in some
runs, it will happen when all nodes are migrated to the new
binary (most common scenario in existing tests); in other runs, it
will happen when only some nodes have upgraded; sometimes it will run
while upgrade migrations are running in the background; etc. It's also
possible to specify multiple features to be tested in mixed-version
state, and the framework will randomly schedule them during the test,
potentially running them concurrently. The randomness is controlled by
a seed that is logged when the test is executed. By reusing the same
seed, the same mixed-version test is generated which helps with
reproducibility (but, of course, does not make the test
deterministic).
Mixed-version tests that use the
mixedversion
package can be brokendown into two stages: planning and execution. During planning, all
mixed-version hooks will be scheduled and a test plan is
generated. All randomness is introduced during test plan
generation. The execution stage just takes care of running the plan,
and does not introduce any other randomness. When the mixed-version
test is run, the test plan is printed in a tree-like format, providing
a high level overview of what the test is doing. In addition, each
step of the test is assigned a numeric ID, which can be used to
reference the output of that step during execution. Each step logs to
a different file, and the artifacts containing the output of each
step will live in
mixed-version-test/{ID}.log
.The new framework builds on top of the recently introduced
clusterupgrade
package, which provides basic, low-level fiunctionsthat are also used by the existing mixed-version tests.
roachtest: rewrite acceptance/version-upgrade to use new API
This commit migrates the
acceptace/version-upgrade
test to use thenew API in the
mixedversion
package. Two features were previouslybeing tested in mixed-version state in this test: backups, and basic
functionality that can be tested by running a SQL statement.
This change simplifies the code as it is no longer required to set up
the nodes, and the concept of the
versionFeatureStep
was removed asit is not necessary and was only being used by this test.
Epic: CRDB-19321
Release note: None