backupccl: introduce backup fixture generator framework #102821

msbutler · 2023-05-05T21:49:32Z

Previously, we either created backup fixtures by manually creating and running workloads on roachprod or using unweildy bash scripts. This patch introduces a framework that makes it easy to generate a backup fixture via the roachtest api. Once the fixture writer specifies a foreground workload (e.g. tpce) and a scheduled backup specification, a single roachprod run invocation will create the fixture in a cloud bucket that can be easily fetched by restore roachtests.

The fixture creator can initialize a foreground workload using a workload init cmd or by restoring from an old fixture.

Note that the vast majority of the test specifications are "skipped" so they are not run in the nightly roachtest suite. Creating large fixtures is expensive and only need to be recreated once a major release. This patch creates 5 new "roachtests":

backupFixture/tpce/15GB/aws [disaster-recovery]
backupFixture/tpce/32TB/aws [disaster-recovery] (skipped)
backupFixture/tpce/400GB/aws [disaster-recovery] (skipped)
backupFixture/tpce/400GB/gce [disaster-recovery] (skipped)
backupFixture/tpce/8TB/aws [disaster-recovery] (skipped)

In the future, this framework should be extended to make it easier to write backup-restore roundtrip tests as well.

Fixes #99787

Release note: None

cockroach-teamcity · 2023-05-05T21:49:41Z

This change is

renatolabs

This definitely seems like an improvement over what we had before! 👏

I find the interaction of the different *Spec* structs and how they overwrite each other a little confusing/hard to wrap my head around, but it might just be that I'm less familiar with this code.

Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @msbutler, @rhu713, and @srosenberg)

-- commits line 5 at r1:
Nit: unwieldy

-- commits line 8 at r1:
Did you mean roachtest run? Or is someone is expected to run roachprod directly at some point?

pkg/cmd/roachtest/tests/backup_fixtures.go line 40 at r1 (raw file):

	override.ignoreExistingBackups = specs.ignoreExistingBackups
	// TODO(msbutler): validate the crdb version roachtest will use. We don't want to create a 23.1.0
	// backup with a master binary, for example.

Sounds like something we'll want to do before actually running this code to generate fixtures.

pkg/cmd/roachtest/tests/backup_fixtures.go line 46 at r1 (raw file):

// defaultBackupFixtureSpecs defines the default scheduled backup used to create a fixture.
var defaultBackupFixtureSpecs = scheduledBackupSpecs{
	crontab: "*/5 * * * *",

Nit: would be nice to have an English translation to help people that don't speak fluent crontab. 🙂

pkg/cmd/roachtest/tests/backup_fixtures.go line 48 at r1 (raw file):

	crontab: "*/5 * * * *",
	backupSpecs: backupSpecs{
		version:         "23.1.0",

The actual backupFixtures/* tests specify this version with the leading v. Which one is right?

Also, is the expectation that someone would have to change this code on every release when generating fixtures?

pkg/cmd/roachtest/tests/backup_fixtures.go line 51 at r1 (raw file):

		cloud:           spec.AWS,
		fullBackupDir:   "LATEST",
		backupsIncluded: 24,

Might be worth mentioning why we are using 2x the default here.

pkg/cmd/roachtest/tests/backup_fixtures.go line 65 at r1 (raw file):

	// of false prevents roachtest users from overriding the latest backup in a
	// collection, which may be used in restore roachtests.
	ignoreExistingBackups bool

Nit: IMO, the "default option of false" part of this comment belongs to the defaultBackupFixtureSpecs definition.

pkg/cmd/roachtest/tests/backup_fixtures.go line 127 at r1 (raw file):

	if bd.c.Spec().Cloud != bd.sp.backup.cloud {
		// For now, only run the test on the cloud provider that also stores the backup.
		bd.t.Skip(fmt.Sprintf("test configured to run on %s", bd.sp.backup.cloud))

This is mostly an FYI at this point, but this style of skipping tests is expensive because is requires roachtest to create a cluster (which includes provisioning VMs, running all sorts of scripts, etc), only to then abort the test 2s in.

Using tags would be preferred (although not that much better due to our inconsistent use of them (e.g., aws meaning "run in this test on GCE and AWS")). We'll have a better API in the near future.

pkg/cmd/roachtest/tests/backup_fixtures.go line 152 at r1 (raw file):

func (bd *backupDriver) initWorkload(ctx context.Context) {
	if bd.sp.initFromBackupSpecs.version == "" {
		bd.t.L().Printf(`Initializing workload via ./workload init`)

This is not really ./workload init (especially with tpce), is it?

pkg/cmd/roachtest/tests/backup_fixtures.go line 189 at r1 (raw file):

		time.Sleep(1 * time.Minute)
		var activeScheduleCount int
		sql.QueryRow(bd.t, `SELECT count(*) FROM [SHOW SCHEDULES] WHERE label ='schedule_cluster' and schedule_status='ACTIVE'`).Scan(&activeScheduleCount)

Can we make schedule_cluster a constant?

Nit: inconsistent SQL formatting: and is not upper case, label = vs schedule_status=.

pkg/cmd/roachtest/tests/backup_fixtures.go line 199 at r1 (raw file):

		bd.t.L().Printf(`%d scheduled backups taken`, backupCount)
		if backupCount >= bd.sp.backup.backupsIncluded {
			sql.QueryRow(bd.t, `PAUSE SCHEDULES WITH x AS (SHOW SCHEDULES) SELECT id FROM x WHERE label = 'schedule_cluster'`)

Interesting, is this the only way to filter schedules while pausing? all_schedules (or similar) is probably better than x.

pkg/cmd/roachtest/tests/backup_fixtures.go line 227 at r1 (raw file):

		{
			// 15 GB Backup Fixture. Note, this fixture is created every night to
			// ensure the fixture generation code works.

Do we need to run this every night on both GCE and AWS? Would weekly be good enough?

pkg/cmd/roachtest/tests/backup_fixtures.go line 236 at r1 (raw file):

						backupsIncluded: 4,
						workload:        tpceRestore{customers: 1000}}}),
			initFromBackupSpecs: backupSpecs{version: "v22.2.1", backupProperties: "inc-count=48"},

Is the expectation that someone would update this every release?

pkg/cmd/roachtest/tests/backup_fixtures.go line 283 at r1 (raw file):

				workloadCtx, workloadCancel := context.WithCancel(ctx)
				defer workloadCancel()

This will cause the workload routine to race with roachtest's test teardown (the Run function will return). One way to deal with that would be:

defer func() {
    workloadCancel()
    m.Wait()
}()

pkg/cmd/roachtest/tests/backup_fixtures.go line 286 at r1 (raw file):

				workloadDoneCh := make(chan struct{})
				m.Go(func(ctx context.Context) error {
					defer close(workloadDoneCh)

This channel doesn't seem to be used.

pkg/cmd/roachtest/tests/backup_fixtures.go line 289 at r1 (raw file):

					err := bd.runWorkload(workloadCtx)
					// The workload should only return an error if the roachtest driver cancels the
					// workloadCtx is cancelled after the backup schedule completes.

This comment reads a little weird.

Also, if I'm reading how these contexts are wired up correctly, I don't think this will work. If a node dies, the monitor will cancel ctx, but workloadCtx will continue to be valid.

I think you want to pass workloadCtx to the monitor, and use the ctx argument in this closure.

pkg/cmd/roachtest/tests/restore.go line 476 at r1 (raw file):

		return hw.nodes + 1
	}
	return 0

So getWorkloadNode is never expected to be called if workloadNode is false? (0 is not a valid node ID). If so, a panic would be a better fit.

pkg/cmd/roachtest/tests/restore.go line 623 at r1 (raw file):

	// initWorkload loads the cluster with the workload's schema and initial data.
	initWorkload(ctx context.Context, t test.Test, c cluster.Cluster, sp hardwareSpecs)

Any reason why this is not just called init?

pkg/cmd/roachtest/tests/restore.go line 627 at r1 (raw file):

	// foregroundRun begins a foreground workload that runs indefinitely until the passed context
	// is cancelled.
	foregroundRun(ctx context.Context, t test.Test, c cluster.Cluster, sp hardwareSpecs) error

What makes it foreground? IMO, it's still the caller's decision at the end of the day.

pkg/cmd/roachtest/tests/restore.go line 647 at r1 (raw file):

	ctx context.Context, t test.Test, c cluster.Cluster, sp hardwareSpecs,
) error {
	tpceSpec, err := initTPCESpec(ctx, t.L(), c, sp.getWorkloadNode(), sp.getCRDBNodes())

Can we cache this? initTPCESpec does some quite expensive things, including installing Docker.

msbutler

thanks for the review and apologies for such a slow response! I sympathize with how hard it is to read the spec structs. I suppose I could implement some functional options instead. Before I do that, I'd like to figure out how I can leverage these spec structs for a general purpose backup-restore roundtrip test framework.

Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @renatolabs, @rhu713, and @srosenberg)

-- commits line 8 at r1: