roachtest: introduce new mixed-version testing API #92431

renatolabs · 2022-11-24T00:07:09Z

roachtest: introduce clusterupgrade package
The clusterupgrade package is introduced in this commit as a means
towards the end goal of reusing upgrade-related functionality in
roachtests. Specifically, as we work towards a new API for
mixed-version testing in roachtests, we want to continue to support
existing upgrade tests (at least for the time being). For that reason,
it's useful to be able to reuse certain low level building blocks that
both APIs rely on. The clusterupgrade package includes functions to
query the binary and cluster versions on a node, to replace the binary
running on nodes, upload a certain binary, etc. Most of these
functions were moved from the versionupgrade.go file.

roachtest: introduce new mixed-version testing API
This introduces a new mixed-version API for roachtests, located under
roachtestutil/mixedversion. The goal of this API is to provide a
higher level framework than the existing versionUpgradeTest set of
functions.

Currently, all mixed-version roachtests rely on teams having to
specify, step by step, which release version nodes should start from,
how and in what order they upgrade, whether they attempt a rollback,
at what point the feature being tested is invoked, etc. All of this
makes the process of writing mixed-version roachtests time consuming
and takes the focus away from what is actually being tested in
mixed-version. In addition, it's typically not feasible to test
multiple combinations of event orderings in mixed-version tests, and
as a consequence most tests only exercise functionality after all
nodes have upgraded to the new version, potentially hiding bugs that
are only exposed if some nodes are running a newer binary, while
others are running an older release.

The new API has the goal of automatically exploring different
orderings in each run. In the most common scenario, teams will only
need to specify what they want tested in a mixed-version setting, and
the framework will decide when that feature will be invoked: in some
runs, it will happen when all nodes are migrated to the new
binary (most common scenario in existing tests); in other runs, it
will happen when only some nodes have upgraded; sometimes it will run
while upgrade migrations are running in the background; etc. It's also
possible to specify multiple features to be tested in mixed-version
state, and the framework will randomly schedule them during the test,
potentially running them concurrently. The randomness is controlled by
a seed that is logged when the test is executed. By reusing the same
seed, the same mixed-version test is generated which helps with
reproducibility (but, of course, does not make the test
deterministic).

Mixed-version tests that use the mixedversion package can be broken
down into two stages: planning and execution. During planning, all
mixed-version hooks will be scheduled and a test plan is
generated. All randomness is introduced during test plan
generation. The execution stage just takes care of running the plan,
and does not introduce any other randomness. When the mixed-version
test is run, the test plan is printed in a tree-like format, providing
a high level overview of what the test is doing. In addition, each
step of the test is assigned a numeric ID, which can be used to
reference the output of that step during execution. Each step logs to
a different file, and the artifacts containing the output of each
step will live in mixed-version-test/{ID}.log.

The new framework builds on top of the recently introduced
clusterupgrade package, which provides basic, low-level fiunctions
that are also used by the existing mixed-version tests.

roachtest: rewrite acceptance/version-upgrade to use new API
This commit migrates the acceptace/version-upgrade test to use the
new API in the mixedversion package. Two features were previously
being tested in mixed-version state in this test: backups, and basic
functionality that can be tested by running a SQL statement.

This change simplifies the code as it is no longer required to set up
the nodes, and the concept of the versionFeatureStep was removed as
it is not necessary and was only being used by this test.

Epic: CRDB-19321
Release note: None

cockroach-teamcity · 2022-11-24T00:07:22Z

This change is

renatolabs · 2022-11-24T00:13:13Z

For reference, this is what the output looks like when running acceptance/version-upgrade on this branch: https://gist.github.com/renatolabs/b9f9ed7c2bc77b65341f5d0c2acbfcc5. You can see the test plan associated with the seed picked for that run, and each step being carried out from there.

Another note: I'm in the process of migrating cdc/mixed-versions to this new API as well, but since this PR is already quite large, I'll introduce that as a follow-up or stacked PR.

herkolategan

Reviewed 4 of 31 files at r1, all commit messages.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @renatolabs, @smg260, and @srosenberg)

pkg/cmd/roachtest/roachtestutil/clusterupgrade/clusterupgrade.go line 28 at r2 (raw file):

	"github.com/cockroachdb/cockroach/pkg/roachprod/logger"
	"github.com/cockroachdb/cockroach/pkg/util/retry"
	"github.com/cockroachdb/cockroach/pkg/util/version"

Nit: this clashes with some of the variable names, can we give this import an alias rather?

pkg/cmd/roachtest/roachtestutil/clusterupgrade/clusterupgrade.go line 41 at r2 (raw file):

// NB: version means major.minor[-internal]; the patch level isn't
// returned. For example, a binary of version 19.2.4 will return 19.2.
func BinaryVersion(ctx context.Context, db *gosql.DB) (roachpb.Version, error) {

Unsure if there is a reason for keeping ctx as it is unused. I know most of the code originates from somewhere else, but might be worth doing some clean-up now.

pkg/cmd/roachtest/roachtestutil/clusterupgrade/clusterupgrade.go line 52 at r2 (raw file):

	}

	bv, err := roachpb.ParseVersion(sv)

Nit: I think we can return this as is, without the extra code below. roachpb.ParseVersion also seems to return an empty version on error.

pkg/cmd/roachtest/roachtestutil/clusterupgrade/clusterupgrade.go line 72 at r2 (raw file):

	}

	cv, err := roachpb.ParseVersion(sv)

Nit: I think we can return this as is, without the extra code below. roachpb.ParseVersion also seems to return an empty version on error.

pkg/cmd/roachtest/roachtestutil/clusterupgrade/clusterupgrade.go line 106 at r2 (raw file):

		v := "v" + newVersion
		dir := v
		binaryName = filepath.Join(dir, "cockroach")

Nit: From a readability (not performance) perspective I would rephrase these 3 lines as:

binaryName := BinaryPathFromVersion(newVersion)
dir := filepath.Dir(binaryName)

pkg/cmd/roachtest/roachtestutil/clusterupgrade/clusterupgrade.go line 180 at r2 (raw file):

	c cluster.Cluster,
	nodes option.NodeListOption,
	startOpts option.StartOpts,

startOpts is unused.

herkolategan

Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @renatolabs, @smg260, and @srosenberg)

pkg/cmd/roachtest/roachtestutil/mixedversion/mixedversion.go line 51 at r2 (raw file):

//	mvt.Run()
//
// Fuctions passed to `InMixedVersion` will be called at arbitrary

Small typo here Fuctions

smg260

Reviewed 5 of 31 files at r1.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @renatolabs and @srosenberg)

a discussion (no related file):
Nice work! It's easy to see how much the test can be simplified with the mechanics of a mixed version environment being abstracted away.

Is it easy enough for developers to retain fine grained ordering of steps as they were previous able to do (cumbersomely)?

pkg/cmd/roachtest/roachtestutil/clusterupgrade/clusterupgrade.go line 65 at r2 (raw file):

// NB: cluster versions are always major.minor[-internal]; there isn't
// a patch level.
func ClusterVersion(ctx context.Context, db *gosql.DB) (roachpb.Version, error) {

Are utility functions like this (and BinaryVersion) better exposed somewhere more generic in roachtest, and perhaps implemented in roachprod?. Although relating to upgrades, they may be useful elsewhere.

pkg/cmd/roachtest/roachtestutil/clusterupgrade/clusterupgrade.go line 127 at r2 (raw file):

	ctx context.Context, l *logger.Logger, c cluster.Cluster, nodes option.NodeListOption, v string,
) error {
	c.Run(ctx, nodes, "mkdir", "-p", "{store-dir}")

nit: Non interpolated commands read better as just a single string. "mkdir -p {store-dir}"

pkg/cmd/roachtest/roachtestutil/mixedversion/mixedversion.go line 236 at r2 (raw file):

	predicate := func(testContext Context) bool {
		// If the cluster is finalizing an upgrade, run this hook
		// according with the probability defined in the package.

nit: according to

pkg/cmd/roachtest/roachtestutil/mixedversion/test_planner.go line 71 at r2 (raw file):

//
// TODO(renato): further opportunities for random exploration: going
// back multiple releases instad of just one; picking a patch release

*instead

srosenberg

Reviewed 6 of 31 files at r1.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @herkolategan, @renatolabs, and @smg260)

pkg/cmd/roachtest/roachtestutil/clusterupgrade/clusterupgrade.go line 32 at r2 (raw file):

const (
	// MainVersion is the default used to represent that the binary

Nit: sentinel is more precise than default. Proposed edits,

// MainVersion is the sentinel used to represent that the binary
// passed to roachtest should be uploaded when version is left unspecified.

pkg/cmd/roachtest/roachtestutil/clusterupgrade/clusterupgrade.go line 92 at r2 (raw file):

) (string, error) {
	binaryName := "./cockroach"
	if newVersion == "" {

newVersion == MainVersion

pkg/cmd/roachtest/roachtestutil/clusterupgrade/clusterupgrade.go line 112 at r2 (raw file):

				return "", err
			}
			if err := c.Stage(ctx, l, "release", v, dir, nodes); err != nil {

Might be worth adding a TODO since we'll likely want to support staging non-release binaries; see [1].

[1] #93229

pkg/cmd/roachtest/roachtestutil/clusterupgrade/clusterupgrade.go line 143 at r2 (raw file):

	}
	// Extract fixture. Fail if there's already an LSM in the store dir.
	c.Run(ctx, nodes, "cd {store-dir} && [ ! -f {store-dir}/CURRENT ] && tar -xf fixture.tgz")

FYI, I stumbled upon this recently but forgot to create an issue. As of [1], pebble is no longer creating CURRENT. So, this non-existence check is no longer valid. I believe the same check is used elsewhere in roachtest in conjunction with roachtest monitor --ignore-empty-nodes.

[1] cockroachdb/pebble#1227

pkg/cmd/roachtest/roachtestutil/clusterupgrade/clusterupgrade.go line 229 at r2 (raw file):

// WaitForClusterUpgrade waits for the cluster version to reach the
// first node's binary version (which is assumed to be every node's

It seems error-prone to expect that only the first node's binary version denotes the cluster upgrade version to wait for. It also prevents the use-case where the caller is oblivious to the nodes' binary versions and just wants to wait for a specific cluster version. Why not pass the cluster version as an explicit parameter?

pkg/cmd/roachtest/roachtestutil/mixedversion/mixedversion.go line 15 at r2 (raw file):

//
// Mixed-version is defined as points in time when some (potentially
// not all) nodes have been upgraded to a certain binary version, or

I think we need to make this more precise. The sufficient condition for "Mixed-version" is that a pair of nodes are running different binary versions. The migrations don't necessarily have to be enabled to be in the mixed version state.

pkg/cmd/roachtest/roachtestutil/mixedversion/mixedversion.go line 32 at r2 (raw file):

// want to test in a mixed-version state. The framework will take care
// of exercising that functionality in different mixed-version states,
// randomizing the order of events in each run. A detailed test plan

A detailed test plan is logged _before_ the test runs; it facilitates debugging a test failure by summarizing all executed steps.

pkg/cmd/roachtest/roachtestutil/mixedversion/mixedversion.go line 53 at r2 (raw file):

// Fuctions passed to `InMixedVersion` will be called at arbitrary
// points during an upgrade/downgrade process. They may also be called
// separately or concurrently. This information will be included in

separately -> sequentially

pkg/cmd/roachtest/roachtestutil/mixedversion/mixedversion.go line 91 at r2 (raw file):

	// mixed-version hook will also run while all nodes in the cluster
	// have been upgraded and the cluster is allowed to upgrade.
	runWhileMigratingProbability = 0.5

The comment conflates upgrading of the binary and the cluster versions. How about the following,

runWhileMigratingProbability is the probability that a mixed-version hook will run after all nodes in the cluster
have been upgraded to the same binary version, and the cluster version is allowed to be upgraded; this typically
involves the execution of migration steps before the new cluster version can be finalized.

pkg/cmd/roachtest/roachtestutil/mixedversion/mixedversion.go line 93 at r2 (raw file):

	runWhileMigratingProbability = 0.5

	noDBNeeded = -1

It wasn't immediately clear what this sentinel is for. In fact, it can be removed since it's used only by the pretty printer, and in all the cases with the exception of runHookStep.

pkg/cmd/roachtest/roachtestutil/mixedversion/mixedversion.go line 119 at r2 (raw file):

	}

	hookFunc      func(*logger.Logger, *gosql.DB) error

userFunc seems more precise unless the plan is to allow internal hooks which a user of this API cannot specify.

pkg/cmd/roachtest/roachtestutil/mixedversion/mixedversion.go line 127 at r2 (raw file):

	versionUpgradeHook struct {
		name      string
		predicate predicateFunc

predicate doesn't say much on how this is intended to be used. In fact, later I'll argue that we don't even need it. In case we do keep it, maybe rename to isEnabled?

pkg/cmd/roachtest/roachtestutil/mixedversion/mixedversion.go line 134 at r2 (raw file):

	// test. It can be a runnableStep (see below), or a "meta-step",
	// meaning that it combines other steps in some way (for instance, a
	// series of steps to run in sequence or concurrently).

sequence --> sequentially

pkg/cmd/roachtest/roachtestutil/mixedversion/mixedversion.go line 139 at r2 (raw file):

	// runnableStep represents steps that implement the pieces on top of
	// which a mixed-version test is built.
	runnableStep interface {

In light of the above comment, I'd suggest renaming to singleStep; runnable is ambiguous because it typically denotes an intermediate state. Here, we mean that it's a step that doesn't compose sub-steps and hence can be directly executed.

pkg/cmd/roachtest/roachtestutil/mixedversion/mixedversion.go line 147 at r2 (raw file):

		// connection, this function should return the `noDBNeeded`
		// constant
		DBNode() int

As noted above, I think this can be removed. Instead, the pretty-printer can special-case runHookStep.

pkg/cmd/roachtest/roachtestutil/mixedversion/mixedversion.go line 163 at r2 (raw file):

		startup      hooks
		mixedVersion hooks
		afterTest    hooks

The name seems awkward. There isn't anything to execute after a test completed, in the literal sense. How about renaming to afterUpgradeFinalized?

pkg/cmd/roachtest/roachtestutil/mixedversion/mixedversion.go line 194 at r2 (raw file):

// a mixed-version test. Most of the parameters come from the
// `test.Test` struct, which can't be directly passed here due to
// import cycles.

I couldn't immediately see where the import cycle shows up.

pkg/cmd/roachtest/roachtestutil/mixedversion/mixedversion.go line 234 at r2 (raw file):

	lastFromVersion := "invalid-version"
	var upgradedNodes int
	predicate := func(testContext Context) bool {

Since the predicate is used exclusively with the mixed-version step(s), it could be inlined into Filter below. It seems an unnecessary complication since all the other cases set the predicate to nil.

pkg/cmd/roachtest/roachtestutil/mixedversion/mixedversion.go line 247 at r2 (raw file):

		if testContext.FromVersion != lastFromVersion {
			lastFromVersion = testContext.FromVersion
			upgradedNodes = t.prng.Intn(len(t.opts.crdbNodes)) + 1

upgradedNodes --> numUpgradedNodes

pkg/cmd/roachtest/roachtestutil/mixedversion/mixedversion.go line 269 at r2 (raw file):

// test has started from a previous version, performed an upgrade, and
// waited for the cluster version to upgrade on all nodes.
func (t *Test) AfterTest(desc string, fn hookFunc) {

AfterUpgradeFinalized

pkg/cmd/roachtest/roachtestutil/mixedversion/mixedversion.go line 273 at r2 (raw file):

}

// Run runs the mixed-version test. It should be called once all

Run runs reminded of the silly linter we have to appease :)

pkg/cmd/roachtest/roachtestutil/mixedversion/mixedversion.go line 478 at r2 (raw file):

}

// sequentialRunStep is a "meta-step" that indicates that a sequence

Nit:

sequentialRunStep is a "meta-step" which indicates enclosed steps are to be executed sequentially.
The default test runner already runs steps sequentially. This meta-step exists primarily as a way
to group related steps so that a test plan is easier to understand for a human.

pkg/cmd/roachtest/roachtestutil/mixedversion/mixedversion.go line 495 at r2 (raw file):

// with a random delay that should be honored before the step starts
// executing. This is useful in steps that are to be executed
// concurrently to avoid biases on a certain execution orders. While

`...avoid biases of certain execution orders."

pkg/cmd/roachtest/roachtestutil/mixedversion/mixedversion.go line 497 at r2 (raw file):

// concurrently to avoid biases on a certain execution orders. While
// the Go scheduler is non-deterministic, in practice schedules are not
// uniformly distributed (as that's not the point of the scheduler),

I'd remove the parenthetical since it doesn't add information.

pkg/cmd/roachtest/roachtestutil/mixedversion/mixedversion.go line 499 at r2 (raw file):

// uniformly distributed (as that's not the point of the scheduler),
// and the delay is inteded to expose scenarios where a specific
// ordering of events (unlikely happen if all concurrent steps were

I'd also remove the parenthetical since it doesn't add information.

pkg/cmd/roachtest/roachtestutil/mixedversion/mixedversion.go line 517 at r2 (raw file):

	delayedSteps := make([]testStep, 0, len(steps))
	for _, step := range steps {
		delay := time.Duration(rng.Intn(1000)+1) * time.Millisecond // delays vary from 1ms to 1s

That's a pretty huge space--1000^numConcurrentSteps. Realistically, a random walk isn't going to capture this state space in a meaningful way. Since many user-specified steps are expected to take seconds to execute, why don't we reduce the state space by choosing from [100, 200]? The added latency doesn't seem concerning. Plus, it's closer to emulating user behavior since an operation would incur network RTT latency, etc.

pkg/cmd/roachtest/roachtestutil/mixedversion/mixedversion.go line 542 at r2 (raw file):

}

// Filter filters out hooks where the predicate does not return `true`

Another silly linter appeasement exercise :)

pkg/cmd/roachtest/roachtestutil/mixedversion/mixedversion.go line 557 at r2 (raw file):

// AsSteps transforms the sequence of hooks into a corresponding test
// step. If there is only one hook, the corresponding `runHookStep` is
// returned. Otherwise, a `concurrentRunStep` is returned, where every

What if we want to have multiple steps to run sequentially at some later time? E.g., step k could depend on result/side-effect of step k-1 which forces them to be executed sequentially.

pkg/cmd/roachtest/roachtestutil/mixedversion/test_planner.go line 62 at r2 (raw file):

//     mixed-version hooks at arbitrary points).
//   - downgrade all nodes back to the predecessor version (running
//     mixed-version hooks at arbitrary points again).

They are not entirely arbitrary points; this is determined by the test planner and the implicit state machine. Maybe paraphrase to at times determined by the test planner?

pkg/cmd/roachtest/roachtestutil/mixedversion/test_planner.go line 180 at r2 (raw file):

// PrettyPrint displays a tree-like view of the mixed-version test
// plan, useful when debugging mixed-version test failures. Each step
// is assigned an ID, making it easy to correspond the step that

Nit: ...making it easy to correlate the step...

pkg/cmd/roachtest/roachtestutil/mixedversion/test_planner.go line 199 at r2 (raw file):

		out.WriteString(fmt.Sprintf("%s %s\n", prefix, label))
		for i, subStep := range steps {
			nestedPrefix := strings.ReplaceAll(prefix, branchString, nestedBranchString)

The use of strings.ReplaceAll and the output prefixes is a neat way to encode the (stack) pop!

pkg/cmd/roachtest/roachtestutil/mixedversion/test_runner.go line 31 at r2 (raw file):

type (
	testRunner struct {

Nit: source file name duplicates roachtest's test_runner.go. Any chance we could rename the source to runner.go and the struct to Runner? The test prefix seems redundant since it's already under roachtest. (The same could be done for all other source file names, planner.go, etc.) This is a rather selfish ask since before I could just type test_runner in GoLand, and it wouldn't clash with any other test runners :)

pkg/cmd/roachtest/roachtestutil/mixedversion/test_runner.go line 69 at r2 (raw file):

	defer tr.closeConnections()

	tr.logger.Printf(tr.plan.PrettyPrint())

Seems like this belongs to the caller (Test.run)? Strictly speaking it's not a concern of executing the test plan.

pkg/cmd/roachtest/roachtestutil/mixedversion/test_runner.go line 82 at r2 (raw file):

// recursively in the case of sequentialRunStep and concurrentRunStep.
func (tr *testRunner) runStep(step testStep) error {
	if rs, isRunnable := step.(runnableStep); isRunnable && rs.ID() > 1 {

I had to read this a couple of times due to isRunnable. Typically, runnable denotes a certain attribute (and behavior) of an entity. E.g., a thread is either runnable or waiting, or executing. I think a less ambiguous name for an individual step is singleStep. Here, perhaps we could use ok instead of isRunnable since we're merely checking that the type cast holds?

pkg/cmd/roachtest/roachtestutil/mixedversion/test_runner.go line 102 at r2 (raw file):

	case sequentialRunStep:
		for _, ss := range s.steps {
			if err := tr.runStep(ss); err != nil {

Methinks randomized delays might also be useful for sequential steps. Why not reuse the same idea as below?

pkg/cmd/roachtest/roachtestutil/mixedversion/test_runner.go line 123 at r2 (raw file):

	default:
		rs := s.(runnableStep)

Nit: singleStep seems less ambiguous.

pkg/cmd/roachtest/roachtestutil/mixedversion/test_runner.go line 145 at r2 (raw file):

// reportError augments the error passed with extra
// information. Specifically, the error message will include the ID of
// the step that failed, the random seed used, the binary version on

The binary version was cached during step 1. So, it's possible that it would be stale at the time of this call. It should be refreshed in the same way that the cluster version is refreshed.

pkg/cmd/roachtest/roachtestutil/mixedversion/test_runner.go line 227 at r2 (raw file):

		conn, err := tr.cluster.ConnE(tr.ctx, tr.logger, node)
		if err != nil {
			return fmt.Errorf("failed to connect to node %d: %w", node, err)

Note, when this errors out, the cached field (here and in the above helpers) would be only partially initialized. So, it's important to ensure the test bails out; otherwise, you have inconsistent (internal) state which may result in index out of bounds.

srosenberg

Nice work! This API is a huge improvement over the existing way of mixed-version testing via roachtest.

Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @herkolategan, @renatolabs, and @smg260)

herkolategan

Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @renatolabs, @smg260, and @srosenberg)

a discussion (no related file):
Great work! I like how the execution plan is calculated beforehand. This leaves the opportunity to extend the framework in the future.

tbg

Only gave this a high level look, but this looks great! One question I have is what the artifacts on failure look like. You mention in the commit messages that steps are assigned numerical IDs and logs are made using those IDs; any chance steps could also have something more human readable attached to them which would make it in the files? For roachprod run, roachtest currently generates files like these:

and when a command exits nonzero, we also touch <filename>.failed so that you can visually pick out failing commands right away.

which isn't pretty but at least for me helps me find what I need pretty quickly (including doing things like tail -n 3 artifacts/something/else/run_*/*_run_kv.log.

Going through one failure as a group might highlight a few UX improvements like that.

Another thought - and not suggesting this needs to be in this PR - is perhaps requiring that the provided hooks are idempotent. This will, down the line, make it easier to inject chaos while hooks are running because you can replay any in-progress hooks. Rather than going back at that point and possibly retrofitting many many hooks, you could "just" update the scheduler today to sometimes (or maybe even always!) invoke hooks twice (weeding out obvious non-idempotency).

Reviewed 1 of 31 files at r1.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @herkolategan, @renatolabs, @smg260, and @srosenberg)

pkg/cmd/roachtest/roachtestutil/mixedversion/mixedversion.go line 256 at r1 (raw file):

}

// OnStartup registers a callback that is run once the cluster is

Here and for the other methods too, maybe worth pointing out what the concurrency contract here is? If I register five hooks, will they be called in the order in which they appear? Or may some of them be invoked concurrently? If this is the case, is there a way to opt out of concurrency for steps that want to run sequentially, or is the expectation that such operations need to be packaged into one hook (limiting, to a degree, the quality of the artifacts on failure)?

I think there's no concurrency and so the answer is simple, but a sentence on each of the methods to that effect would help.

pkg/cmd/roachtest/roachtestutil/mixedversion/mixedversion.go line 269 at r2 (raw file):

Previously, srosenberg (Stan Rosenberg) wrote…

AfterUpgradeFinalized

+1

pkg/cmd/roachtest/roachtestutil/mixedversion/test_planner_test.go line 93 at r1 (raw file):

│   └── run "mixed-version 2", with connection to node 3 (9)
├── downgrade nodes :1-4 from "<current>" to "%[1]s"
│   ├── restart node 3 with binary version %[1]s (10)

Goland says:

Is there something weird going on here?

renatolabs

You mention in the commit messages that steps are assigned numerical IDs and logs are made using those IDs

The reason I chose to follow this path is that it provides an unambiguous link between a point in the execution of the test and the corresponding logs (so you can look at the plan printed at the top and match that against the files under mixed-version-test). That said, I agree there's value in having something more descriptive in the log files themselves.

I made a few changes in this latest set of updates:

log files will still contain the ID, but will also include the step description. So if a step is executed multiple times during a test, the name will be the same, but the IDs will disambiguate them.
If a step fails, I append a _FAILED to the log file so that it's clear by skimming the mixed-version-tests directory which step failed.

Here's the mixed-version-test logs directory on a branch where I injected a step that errors out:

I encourage everyone to play around with this branch and/or suggest UX improvements they would like to see!

is perhaps requiring that the provided hooks are idempotent

Definitely an interesting idea; something I won't do anything about in this PR (as you mentioned), but I'll take the time to think about this more.

Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @herkolategan, @smg260, @srosenberg, and @tbg)

a discussion (no related file):
I pushed quite a few changes based on the feedback received. Ready for another look please!

pkg/cmd/roachtest/roachtestutil/clusterupgrade/clusterupgrade.go line 28 at r2 (raw file):

Previously, herkolategan (Herko Lategan) wrote…

Nit: this clashes with some of the variable names, can we give this import an alias rather?

Renamed variables.

pkg/cmd/roachtest/roachtestutil/clusterupgrade/clusterupgrade.go line 32 at r2 (raw file):

Previously, srosenberg (Stan Rosenberg) wrote…

Nit: sentinel is more precise than default. Proposed edits,

// MainVersion is the sentinel used to represent that the binary
// passed to roachtest should be uploaded when version is left unspecified.

Updated.

pkg/cmd/roachtest/roachtestutil/clusterupgrade/clusterupgrade.go line 41 at r2 (raw file):

Previously, herkolategan (Herko Lategan) wrote…

Unsure if there is a reason for keeping ctx as it is unused. I know most of the code originates from somewhere else, but might be worth doing some clean-up now.

Removed.

pkg/cmd/roachtest/roachtestutil/clusterupgrade/clusterupgrade.go line 52 at r2 (raw file):

Previously, herkolategan (Herko Lategan) wrote…

Nit: I think we can return this as is, without the extra code below. roachpb.ParseVersion also seems to return an empty version on error.

Done.

pkg/cmd/roachtest/roachtestutil/clusterupgrade/clusterupgrade.go line 65 at r2 (raw file):

Previously, smg260 (Miral Gadani) wrote…

Are utility functions like this (and BinaryVersion) better exposed somewhere more generic in roachtest, and perhaps implemented in roachprod?. Although relating to upgrades, they may be useful elsewhere.

Not sure I understand the question: in this PR, they are moved to a generic roachtest package (roachtestutil) and can be reused by other tests if needed. Not sure if this belongs in roachprod, we can move it in the future if we see an advantage in doing that.

pkg/cmd/roachtest/roachtestutil/clusterupgrade/clusterupgrade.go line 72 at r2 (raw file):

Previously, herkolategan (Herko Lategan) wrote…

Nit: I think we can return this as is, without the extra code below. roachpb.ParseVersion also seems to return an empty version on error.

Done.

pkg/cmd/roachtest/roachtestutil/clusterupgrade/clusterupgrade.go line 92 at r2 (raw file):

Previously, srosenberg (Stan Rosenberg) wrote…

newVersion == MainVersion

Done

pkg/cmd/roachtest/roachtestutil/clusterupgrade/clusterupgrade.go line 106 at r2 (raw file):

Previously, herkolategan (Herko Lategan) wrote…

Nit: From a readability (not performance) perspective I would rephrase these 3 lines as:
binaryName := BinaryPathFromVersion(newVersion)
dir := filepath.Dir(binaryName)

This is a little more than I'm comfortable doing in this PR, as it could impact other tests. I applied most of the simple improvements suggested in the comments. We can have a follow-up PR to clean this up further if necessary.

pkg/cmd/roachtest/roachtestutil/clusterupgrade/clusterupgrade.go line 112 at r2 (raw file):

Previously, srosenberg (Stan Rosenberg) wrote…

Might be worth adding a TODO since we'll likely want to support staging non-release binaries; see [1].

[1] #93229

What would the TODO say here? I don't think we'd be staging SHAs in these upgrade tests.

pkg/cmd/roachtest/roachtestutil/clusterupgrade/clusterupgrade.go line 127 at r2 (raw file):

Previously, smg260 (Miral Gadani) wrote…

nit: Non interpolated commands read better as just a single string. "mkdir -p {store-dir}"

Done.

pkg/cmd/roachtest/roachtestutil/clusterupgrade/clusterupgrade.go line 143 at r2 (raw file):

Previously, srosenberg (Stan Rosenberg) wrote…

FYI, I stumbled upon this recently but forgot to create an issue. As of [1], pebble is no longer creating CURRENT. So, this non-existence check is no longer valid. I believe the same check is used elsewhere in roachtest in conjunction with roachtest monitor --ignore-empty-nodes.

[1] cockroachdb/pebble#1227

Good point, created #95170.

pkg/cmd/roachtest/roachtestutil/clusterupgrade/clusterupgrade.go line 180 at r2 (raw file):

Previously, herkolategan (Herko Lategan) wrote…

startOpts is unused.

Removed.

pkg/cmd/roachtest/roachtestutil/clusterupgrade/clusterupgrade.go line 229 at r2 (raw file):

Previously, srosenberg (Stan Rosenberg) wrote…

It seems error-prone to expect that only the first node's binary version denotes the cluster upgrade version to wait for. It also prevents the use-case where the caller is oblivious to the nodes' binary versions and just wants to wait for a specific cluster version. Why not pass the cluster version as an explicit parameter?

I guess it depends on the contract of the function. Currently, the contract is: "this function should be called only when every node in the cluster has been restarted with the same binary version" (I added a comment to make this assumption explicit).

Given this assumption/contract, I prefer the current function signature. If we require callers to pass the specific expected version, that means we'll have to devise a mechanism to go from binary file (passed to roachtest on the command line) to binary version. While possible, seems like it's work that doesn't give us much at the moment.

pkg/cmd/roachtest/roachtestutil/mixedversion/mixedversion.go line 256 at r1 (raw file):

Previously, tbg (Tobias Grieger) wrote…

Here and for the other methods too, maybe worth pointing out what the concurrency contract here is? If I register five hooks, will they be called in the order in which they appear? Or may some of them be invoked concurrently? If this is the case, is there a way to opt out of concurrency for steps that want to run sequentially, or is the expectation that such operations need to be packaged into one hook (limiting, to a degree, the quality of the artifacts on failure)?

I think there's no concurrency and so the answer is simple, but a sentence on each of the methods to that effect would help.

Great point. In fact, multiple callbacks are executed concurrently. Added a note to make that explicit.

pkg/cmd/roachtest/roachtestutil/mixedversion/mixedversion.go line 15 at r2 (raw file):

Previously, srosenberg (Stan Rosenberg) wrote…

I think we need to make this more precise. The sufficient condition for "Mixed-version" is that a pair of nodes are running different binary versions. The migrations don't necessarily have to be enabled to be in the mixed version state.

I think the issue is that we (or maybe just me?) use the term "mixed-version" to mean more than one situation: 1) clusters where nodes don't run the same binary version; and 2) clusters that run the same binary version, but the cluster version is not the same as the binary version (either because migrations are underway, or because preserve_downgrade_option is set). I tried to make the comment a little clearer.

pkg/cmd/roachtest/roachtestutil/mixedversion/mixedversion.go line 32 at r2 (raw file):

Previously, srosenberg (Stan Rosenberg) wrote…

A detailed test plan is logged _before_ the test runs; it facilitates debugging a test failure by summarizing all executed steps.

Fixed.

pkg/cmd/roachtest/roachtestutil/mixedversion/mixedversion.go line 51 at r2 (raw file):

Previously, herkolategan (Herko Lategan) wrote…

Small typo here Fuctions

Fixed.

pkg/cmd/roachtest/roachtestutil/mixedversion/mixedversion.go line 53 at r2 (raw file):

Previously, srosenberg (Stan Rosenberg) wrote…

separately -> sequentially

Fixed.

pkg/cmd/roachtest/roachtestutil/mixedversion/mixedversion.go line 91 at r2 (raw file):

Previously, srosenberg (Stan Rosenberg) wrote…

The comment conflates upgrading of the binary and the cluster versions. How about the following,

runWhileMigratingProbability is the probability that a mixed-version hook will run after all nodes in the cluster
have been upgraded to the same binary version, and the cluster version is allowed to be upgraded; this typically
involves the execution of migration steps before the new cluster version can be finalized.

Updated, thanks.

pkg/cmd/roachtest/roachtestutil/mixedversion/mixedversion.go line 93 at r2 (raw file):

Previously, srosenberg (Stan Rosenberg) wrote…

It wasn't immediately clear what this sentinel is for. In fact, it can be removed since it's used only by the pretty printer, and in all the cases with the exception of runHookStep.

Other steps that don't use noDBNeeded are preserveDowngradeOptionStep and finalizeUpgradeStep. I suspect over time we'll have more steps that require a database connection, so I'm not removing the sentinel.

pkg/cmd/roachtest/roachtestutil/mixedversion/mixedversion.go line 119 at r2 (raw file):

Previously, srosenberg (Stan Rosenberg) wrote…

userFunc seems more precise unless the plan is to allow internal hooks which a user of this API cannot specify.

Renamed.

pkg/cmd/roachtest/roachtestutil/mixedversion/mixedversion.go line 134 at r2 (raw file):

Previously, srosenberg (Stan Rosenberg) wrote…

sequence --> sequentially

Fixed.

pkg/cmd/roachtest/roachtestutil/mixedversion/mixedversion.go line 139 at r2 (raw file):

Previously, srosenberg (Stan Rosenberg) wrote…

In light of the above comment, I'd suggest renaming to singleStep; runnable is ambiguous because it typically denotes an intermediate state. Here, we mean that it's a step that doesn't compose sub-steps and hence can be directly executed.

Makes sense; renamed to singleStep

pkg/cmd/roachtest/roachtestutil/mixedversion/mixedversion.go line 147 at r2 (raw file):

Previously, srosenberg (Stan Rosenberg) wrote…

As noted above, I think this can be removed. Instead, the pretty-printer can special-case runHookStep.

See previous comment; it's not the only step that requires a DB connection.

pkg/cmd/roachtest/roachtestutil/mixedversion/mixedversion.go line 163 at r2 (raw file):

Previously, srosenberg (Stan Rosenberg) wrote…

The name seems awkward. There isn't anything to execute after a test completed, in the literal sense. How about renaming to afterUpgradeFinalized?

Renamed.

pkg/cmd/roachtest/roachtestutil/mixedversion/mixedversion.go line 194 at r2 (raw file):

Previously, srosenberg (Stan Rosenberg) wrote…

I couldn't immediately see where the import cycle shows up.

This comment was outdated and represented an earlier iteration of the code. There's indeed no import cycle, and I changed function signatures to take a test.Test where that makes sense.

pkg/cmd/roachtest/roachtestutil/mixedversion/mixedversion.go line 234 at r2 (raw file):

Previously, srosenberg (Stan Rosenberg) wrote…

Since the predicate is used exclusively with the mixed-version step(s), it could be inlined into Filter below. It seems an unnecessary complication since all the other cases set the predicate to nil.

The inclusion of predicate is a little of foresight as I envisioned an API where callers would be able to pass their own predicates if the builtin ones didn't work for them. That's not possible yet, but I suspect it will be required as we move more mixed-version tests to this framework.

pkg/cmd/roachtest/roachtestutil/mixedversion/mixedversion.go line 236 at r2 (raw file):

Previously, smg260 (Miral Gadani) wrote…

nit: according to

Fixed.

pkg/cmd/roachtest/roachtestutil/mixedversion/mixedversion.go line 247 at r2 (raw file):

Previously, srosenberg (Stan Rosenberg) wrote…

upgradedNodes --> numUpgradedNodes

Renamed.

pkg/cmd/roachtest/roachtestutil/mixedversion/mixedversion.go line 269 at r2 (raw file):

Previously, tbg (Tobias Grieger) wrote…

+1

Renamed

pkg/cmd/roachtest/roachtestutil/mixedversion/mixedversion.go line 273 at r2 (raw file):

Previously, srosenberg (Stan Rosenberg) wrote…

Run runs reminded of the silly linter we have to appease :)

No longer have to appease!

pkg/cmd/roachtest/roachtestutil/mixedversion/mixedversion.go line 478 at r2 (raw file):

Previously, srosenberg (Stan Rosenberg) wrote…

Nit:

sequentialRunStep is a "meta-step" which indicates enclosed steps are to be executed sequentially.
The default test runner already runs steps sequentially. This meta-step exists primarily as a way
to group related steps so that a test plan is easier to understand for a human.

Updated.

pkg/cmd/roachtest/roachtestutil/mixedversion/mixedversion.go line 495 at r2 (raw file):

Previously, srosenberg (Stan Rosenberg) wrote…

`...avoid biases of certain execution orders."

Updated.

pkg/cmd/roachtest/roachtestutil/mixedversion/mixedversion.go line 497 at r2 (raw file):

Previously, srosenberg (Stan Rosenberg) wrote…

I'd remove the parenthetical since it doesn't add information.

Removed.

pkg/cmd/roachtest/roachtestutil/mixedversion/mixedversion.go line 499 at r2 (raw file):

Previously, srosenberg (Stan Rosenberg) wrote…

I'd also remove the parenthetical since it doesn't add information.

Removed.

pkg/cmd/roachtest/roachtestutil/mixedversion/mixedversion.go line 517 at r2 (raw file):

Previously, srosenberg (Stan Rosenberg) wrote…

That's a pretty huge space--1000^numConcurrentSteps. Realistically, a random walk isn't going to capture this state space in a meaningful way. Since many user-specified steps are expected to take seconds to execute, why don't we reduce the state space by choosing from [100, 200]? The added latency doesn't seem concerning. Plus, it's closer to emulating user behavior since an operation would incur network RTT latency, etc.

Fair enough. Changed to pick randomly from a list of hardcoded delays (including no delay).

pkg/cmd/roachtest/roachtestutil/mixedversion/mixedversion.go line 557 at r2 (raw file):

Previously, srosenberg (Stan Rosenberg) wrote…

What if we want to have multiple steps to run sequentially at some later time? E.g., step k could depend on result/side-effect of step k-1 which forces them to be executed sequentially.

I chose to not encode dependencies in this API for now, as I feel that would overcomplicate things more than we want at this stage. If there are dependencies, they should for now be dealt with out of band: either by inlining them as a single callback passed to InMixedVersion, or by having one wait for the other's completion (out of band channel or similar mechanism); or by passing a custom predicate.

pkg/cmd/roachtest/roachtestutil/mixedversion/test_planner.go line 62 at r2 (raw file):

Previously, srosenberg (Stan Rosenberg) wrote…

They are not entirely arbitrary points; this is determined by the test planner and the implicit state machine. Maybe paraphrase to at times determined by the test planner?

Good point, updated.

pkg/cmd/roachtest/roachtestutil/mixedversion/test_planner.go line 71 at r2 (raw file):

Previously, smg260 (Miral Gadani) wrote…

*instead

Fixed.

pkg/cmd/roachtest/roachtestutil/mixedversion/test_planner.go line 180 at r2 (raw file):

Previously, srosenberg (Stan Rosenberg) wrote…

Nit: ...making it easy to correlate the step...

Updated.

pkg/cmd/roachtest/roachtestutil/mixedversion/test_planner_test.go line 93 at r1 (raw file):

Previously, tbg (Tobias Grieger) wrote…

Goland says:

Is there something weird going on here?

Ah, I copy and pasted the output of tree on my laptop and didn't realized it used NBSP. Updated the pretty printer to use plain old whitespaces instead.

pkg/cmd/roachtest/roachtestutil/mixedversion/test_runner.go line 31 at r2 (raw file):

Previously, srosenberg (Stan Rosenberg) wrote…

Nit: source file name duplicates roachtest's test_runner.go. Any chance we could rename the source to runner.go and the struct to Runner? The test prefix seems redundant since it's already under roachtest. (The same could be done for all other source file names, planner.go, etc.) This is a rather selfish ask since before I could just type test_runner in GoLand, and it wouldn't clash with any other test runners :)

Fair enough! Renamed to planner.go and runner.go.

pkg/cmd/roachtest/roachtestutil/mixedversion/test_runner.go line 69 at r2 (raw file):

Previously, srosenberg (Stan Rosenberg) wrote…

Seems like this belongs to the caller (Test.run)? Strictly speaking it's not a concern of executing the test plan.

Moved.

pkg/cmd/roachtest/roachtestutil/mixedversion/test_runner.go line 145 at r2 (raw file):

Previously, srosenberg (Stan Rosenberg) wrote…

The binary version was cached during step 1. So, it's possible that it would be stale at the time of this call. It should be refreshed in the same way that the cluster version is refreshed.

By design, the binary version would not have changed behind our backs since it's only changed at specific points in the test. IMO, it would be wasteful (and confusing) to query those again.

pkg/cmd/roachtest/roachtestutil/mixedversion/test_runner.go line 227 at r2 (raw file):

Previously, srosenberg (Stan Rosenberg) wrote…

Note, when this errors out, the cached field (here and in the above helpers) would be only partially initialized. So, it's important to ensure the test bails out; otherwise, you have inconsistent (internal) state which may result in index out of bounds.

That should already be the case, as we never ignore errors returned by functions. The test would just bail.

srosenberg

Reviewable status: complete! 1 of 0 LGTMs obtained (waiting on @herkolategan, @renatolabs, @smg260, and @tbg)

a discussion (no related file):

Previously, renatolabs (Renato Costa) wrote…

I pushed quite a few changes based on the feedback received. Ready for another look please!

Can't wait to see it running in nightlies!

pkg/cmd/roachtest/roachtestutil/mixedversion/planner.go line 79 at r6 (raw file):

	addSteps := func(ss []testStep) { steps = append(steps, ss...) }

	addSteps(p.initSteps())

I wonder if we should make this an invariant that can be enforced (e.g., via assert) in runStep? Maybe a dummy/noop step which serves only as a marker? Or we can just assume that the TestPlan is immutable after it's created?

pkg/cmd/roachtest/roachtestutil/mixedversion/runner.go line 88 at r6 (raw file):

// runStep contains the logic of running a single test step, called
// recursively in the case of sequentialRunStep and concurrentRunStep.
func (tr *testRunner) runStep(step testStep) error {

It just dawned on me that it would be nice to also randomize/shuffle the order in which the single steps are executed (modulo, first and last). Maybe add a TODO for the next iteration?

pkg/cmd/roachtest/roachtestutil/mixedversion/runner.go line 92 at r6 (raw file):

		// if we are running a singleStep that is *not* the first step,
		// we can initialize the database connections. This represents the
		// assumption that the first step in the test plan is the one that

It's more than just an assumption, right? It should be an enforceable invariant.

pkg/cmd/roachtest/tests/decommission_self.go line 26 at r6 (raw file):

func runDecommissionSelf(ctx context.Context, t test.Test, c cluster.Cluster) {
	allNodes := c.All()
	u := newVersionUpgradeTest(c,

This should now be called oldVersionUpgradeTest :)

renatolabs

Reviewable status: complete! 1 of 0 LGTMs obtained (waiting on @herkolategan, @smg260, @srosenberg, and @tbg)

pkg/cmd/roachtest/roachtestutil/mixedversion/planner.go line 79 at r6 (raw file):

Previously, srosenberg (Stan Rosenberg) wrote…

I wonder if we should make this an invariant that can be enforced (e.g., via assert) in runStep? Maybe a dummy/noop step which serves only as a marker? Or we can just assume that the TestPlan is immutable after it's created?

Not sure I'm reading it right, but what invariant are you referring to here?

pkg/cmd/roachtest/roachtestutil/mixedversion/runner.go line 88 at r6 (raw file):

Previously, srosenberg (Stan Rosenberg) wrote…

It just dawned on me that it would be nice to also randomize/shuffle the order in which the single steps are executed (modulo, first and last). Maybe add a TODO for the next iteration?

Not sure I understand this either. The test planner is the place where all of the randomization takes place. I intentionally designed the runner to be as dumb as possible and just follow the test plan verbatim so that the test plan (as displayed to the user) is an accurate description of what happened in the test.

pkg/cmd/roachtest/roachtestutil/mixedversion/runner.go line 92 at r6 (raw file):

Previously, srosenberg (Stan Rosenberg) wrote…

It's more than just an assumption, right? It should be an enforceable invariant.

IMO it's still an assumption that the test runner makes about the output of the test planner: it assumes that after completion of the first step of the plan, cockroach nodes will be available. That might not always be the case as we change the test planner in the future.

I added a TODO and will revisit this in a next PR involving upgrade testing.

renatolabs · 2023-01-19T22:30:00Z

I'll run all mixed-version roachtests on this branch to make sure I didn't accidentally break tests that are using the "old" API when I refactored.

For consideration

Can't wait to see it running in nightlies!

Since I changed an acceptance test here, the new test will (as it stands) run on bors builds too. I wanted to hear your thoughts about this. We could merge as-is, or we could have two versions of acceptance/version-upgrade, one for bors builds (in which case we run the old API), and one for nightly builds (new API).

Option 1: Running the new API on bors and nightlies (current state)

Pro: since each run of the tests creates a new, random test plan, running on bors gives more exposure to random orders and could uncover bugs.
Pro: having a single version of the test is cleaner
Pro: the test is fairly stable. I've been running acceptance/version-upgrade locally on a gceworker for several hours straight and it hasn't failed yet.
Con: failures can be quite disruptive as it blocks people from merging their PRs.
Con: this is a large change, and there could be hidden behaviors that will only surface in CI

Option 2: Leave test as-is on local runs (bors), run new API on nightlies

Pro: the less risky of the options, as bors will run code that is mostly the same as it has been running for a while
Con: a lot less random exploration, as new API will only run a few times nightly.

As can probably be noted, I'm partial to Option 1 as I think the benefits outweigh the risks, but I'm also happy with option 2.

herkolategan

Reviewable status: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @renatolabs, @smg260, @srosenberg, and @tbg)

a discussion (no related file):

Previously, srosenberg (Stan Rosenberg) wrote…

Can't wait to see it running in nightlies!

Everything I mentioned has been addressed and looks good to me, thanks! Also excited to see this running nightly.

srosenberg

I'd also prefer Option 1. Before the merge though, it would be nice to stress several iteration of both modalities. We could also double-check that the average duration was not negatively affected.

Reviewable status: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @herkolategan, @renatolabs, @smg260, and @tbg)

pkg/cmd/roachtest/roachtestutil/mixedversion/planner.go line 79 at r6 (raw file):

Previously, renatolabs (Renato Costa) wrote…

Not sure I'm reading it right, but what invariant are you referring to here?

In runStep, you say,

This represents the assumption that the first step in the test plan is the one...

That assumption could be made into an invariant.

pkg/cmd/roachtest/roachtestutil/mixedversion/runner.go line 88 at r6 (raw file):

Previously, renatolabs (Renato Costa) wrote…

Not sure I understand this either. The test planner is the place where all of the randomization takes place. I intentionally designed the runner to be as dumb as possible and just follow the test plan verbatim so that the test plan (as displayed to the user) is an accurate description of what happened in the test.

Sorry, wrong placement of the comment. I was merely commenting that the planner should allow for the randomization of the order of the sequence of steps.

The `clusterupgrade` package is introduced in this commit as a means towards the end goal of reusing upgrade-related functionality in roachtests. Specifically, as we work towards a new API for mixed-version testing in roachtests, we want to continue to support existing upgrade tests (at least for the time being). For that reason, it's useful to be able to reuse certain low level building blocks that both APIs rely on. The `clusterupgrade` package includes functions to query the binary and cluster versions on a node, to replace the binary running on nodes, upload a certain binary, etc. Most of these functions were moved from the `versionupgrade.go` file. Epic: CRDB-19321 Release note: None

This introduces a new mixed-version API for roachtests, located under `roachtestutil/mixedversion`. The goal of this API is to provide a higher level framework than the existing `versionUpgradeTest` set of functions. Currently, all mixed-version roachtests rely on teams having to specify, step by step, which release version nodes should start from, how and in what order they upgrade, whether they attempt a rollback, at what point the feature being tested is invoked, etc. All of this makes the process of writing mixed-version roachtests time consuming and takes the focus away from what is actually being tested in mixed-version. In addition, it's typically not feasible to test multiple combinations of event orderings in mixed-version tests, and as a consequence most tests only exercise functionality after all nodes have upgraded to the new version, potentially hiding bugs that are only exposed if _some_ nodes are running a newer binary, while others are running an older release. The new API has the goal of automatically exploring different orderings in each run. In the most common scenario, teams will only need to specify what they want tested in a mixed-version setting, and the framework will decide when that feature will be invoked: in some runs, it will happen when all nodes are migrated to the new binary (most common scenario in existing tests); in other runs, it will happen when only some nodes have upgraded; sometimes it will run while upgrade migrations are running in the background; etc. It's also possible to specify multiple features to be tested in mixed-version state, and the framework will randomly schedule them during the test, potentially running them concurrently. The randomness is controlled by a seed that is logged when the test is executed. By reusing the same seed, the same mixed-version test is generated which helps with reproducibility (but, of course, does not make the test deterministic). Mixed-version tests that use the `mixedversion` package can be broken down into two stages: _planning_ and _execution_. During planning, all mixed-version hooks will be scheduled and a test plan is generated. All randomness is introduced during test plan generation. The execution stage just takes care of running the plan, and does not introduce any other randomness. When the mixed-version test is run, the test plan is printed in a tree-like format, providing a high level overview of what the test is doing. In addition, each step of the test is assigned a numeric ID, which can be used to reference the output of that step during execution. Each step logs to a different file, and the artifacts containing the output of each step will live in `mixed-version-test/{ID}.log`. The new framework builds on top of the recently introduced `clusterupgrade` package, which provides basic, low-level fiunctions that are also used by the existing mixed-version tests. Epic: CRDB-19321 Release note: None

This commit migrates the `acceptace/version-upgrade` test to use the new API in the `mixedversion` package. Two features were previously being tested in mixed-version state in this test: backups, and basic functionality that can be tested by running a SQL statement. This change simplifies the code as it is no longer required to set up the nodes, and the concept of the `versionFeatureStep` was removed as it is not necessary and was only being used by this test. Epic: CRDB-19321 Release note: None

renatolabs · 2023-01-20T20:11:10Z

Before the merge though, it would be nice to stress several iteration of both modalities

This has been stressed over 200 times in local runs (as is done by bors builds). It's also been run maybe a dozen times in real clusters in the process of testing the code here.

We could also double-check that the average duration was not negatively affected.

Results over 10 runs:

	mean (s)	stdev (s)
master	220.7	17.133
new API	204.2	29.562

So new the API takes generally less time to finish but its run time is also more variable. Both of these are expected as the test planner may decide to run some of the mixed-version test code concurrently, whereas on master everything is always sequential.

renatolabs · 2023-01-20T20:15:15Z

bors r=srosenberg

craig · 2023-01-20T20:25:37Z

Build failed (retrying...):

Bazel Essential CI (Cockroach)

craig · 2023-01-21T00:21:55Z

Build failed (retrying...):

Bazel Essential CI (Cockroach)

craig · 2023-01-21T02:07:07Z

Build succeeded:

Bazel Essential CI (Cockroach)

renatolabs requested a review from a team as a code owner November 24, 2022 00:07

renatolabs requested review from herkolategan and smg260 and removed request for a team November 24, 2022 00:07

renatolabs force-pushed the new-upgrade-testing-api branch from 80cdeb5 to a11010d Compare November 24, 2022 01:17

srosenberg self-requested a review November 28, 2022 17:40

herkolategan reviewed Dec 2, 2022

View reviewed changes

herkolategan reviewed Dec 7, 2022

View reviewed changes

smg260 reviewed Dec 9, 2022

View reviewed changes

srosenberg reviewed Dec 12, 2022

View reviewed changes

herkolategan reviewed Dec 12, 2022

View reviewed changes

srosenberg requested a review from tbg December 20, 2022 04:58

tbg reviewed Dec 20, 2022

View reviewed changes

renatolabs force-pushed the new-upgrade-testing-api branch from a11010d to 778a3aa Compare January 11, 2023 20:26

renatolabs commented Jan 12, 2023

View reviewed changes

renatolabs force-pushed the new-upgrade-testing-api branch from 778a3aa to a163b90 Compare January 13, 2023 14:58

srosenberg approved these changes Jan 19, 2023

View reviewed changes

renatolabs force-pushed the new-upgrade-testing-api branch from a163b90 to c71083b Compare January 19, 2023 22:17

renatolabs commented Jan 19, 2023

View reviewed changes

herkolategan reviewed Jan 20, 2023

View reviewed changes

srosenberg reviewed Jan 20, 2023

View reviewed changes

renatolabs added 3 commits January 20, 2023 11:36

renatolabs force-pushed the new-upgrade-testing-api branch from c71083b to bbccd8f Compare January 20, 2023 16:36

craig bot merged commit 2861a4e into cockroachdb:master Jan 21, 2023

healthy-pod mentioned this pull request Jan 22, 2023

roachtest: multitenant-upgrade failed #95648

Closed

matthewtodd mentioned this pull request Jan 23, 2023

create migration tests #71536

Closed

msbutler mentioned this pull request Feb 1, 2023

backupccl: write a mixed version restore test #96367

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

roachtest: introduce new mixed-version testing API #92431

roachtest: introduce new mixed-version testing API #92431

renatolabs commented Nov 24, 2022

cockroach-teamcity commented Nov 24, 2022

renatolabs commented Nov 24, 2022

herkolategan left a comment

herkolategan left a comment

smg260 left a comment

srosenberg left a comment

srosenberg left a comment

herkolategan left a comment

tbg left a comment

renatolabs left a comment

srosenberg left a comment

renatolabs left a comment

renatolabs commented Jan 19, 2023

herkolategan left a comment

srosenberg left a comment

renatolabs commented Jan 20, 2023

renatolabs commented Jan 20, 2023

craig bot commented Jan 20, 2023

craig bot commented Jan 21, 2023

craig bot commented Jan 21, 2023

roachtest: introduce new mixed-version testing API #92431

roachtest: introduce new mixed-version testing API #92431

Conversation

renatolabs commented Nov 24, 2022

cockroach-teamcity commented Nov 24, 2022

renatolabs commented Nov 24, 2022

herkolategan left a comment

Choose a reason for hiding this comment

herkolategan left a comment

Choose a reason for hiding this comment

smg260 left a comment

Choose a reason for hiding this comment

srosenberg left a comment

Choose a reason for hiding this comment

srosenberg left a comment

Choose a reason for hiding this comment

herkolategan left a comment

Choose a reason for hiding this comment

tbg left a comment

Choose a reason for hiding this comment

renatolabs left a comment

Choose a reason for hiding this comment

srosenberg left a comment

Choose a reason for hiding this comment

renatolabs left a comment

Choose a reason for hiding this comment

renatolabs commented Jan 19, 2023

For consideration

Option 1: Running the new API on bors and nightlies (current state)

Option 2: Leave test as-is on local runs (bors), run new API on nightlies

herkolategan left a comment

Choose a reason for hiding this comment

srosenberg left a comment

Choose a reason for hiding this comment

renatolabs commented Jan 20, 2023

renatolabs commented Jan 20, 2023

craig bot commented Jan 20, 2023

craig bot commented Jan 21, 2023

craig bot commented Jan 21, 2023