
backupccl: introduce backup fixture generator framework #102821

Merged
merged 2 commits into cockroachdb:master from butler-backup-framework on Jun 14, 2023

Conversation

Collaborator

@msbutler msbutler commented May 5, 2023

Previously, we either created backup fixtures by manually creating and running workloads on roachprod or by using unwieldy bash scripts. This patch introduces a framework that makes it easy to generate a backup fixture via the roachtest API. Once the fixture writer specifies a foreground workload (e.g. tpce) and a scheduled backup specification, a single `roachtest run` invocation will create the fixture in a cloud bucket that can be easily fetched by restore roachtests.

The fixture creator can initialize a foreground workload using a `workload init` command or by restoring from an old fixture.

Note that the vast majority of the test specifications are "skipped" so they are not run in the nightly roachtest suite. Creating large fixtures is expensive, and they only need to be recreated once per major release. This patch creates 5 new "roachtests":

  • backupFixture/tpce/15GB/aws [disaster-recovery]
  • backupFixture/tpce/32TB/aws [disaster-recovery] (skipped)
  • backupFixture/tpce/400GB/aws [disaster-recovery] (skipped)
  • backupFixture/tpce/400GB/gce [disaster-recovery] (skipped)
  • backupFixture/tpce/8TB/aws [disaster-recovery] (skipped)

In the future, this framework should be extended to make it easier to write backup-restore roundtrip tests as well.

Fixes #99787

Release note: None

@msbutler msbutler self-assigned this May 5, 2023
@msbutler msbutler requested a review from a team as a code owner May 5, 2023 21:49
@cockroach-teamcity
Member

This change is Reviewable

@msbutler msbutler force-pushed the butler-backup-framework branch 2 times, most recently from 1ed504c to 91d7859 on May 5, 2023 at 22:20
Contributor

@renatolabs renatolabs left a comment

This definitely seems like an improvement over what we had before! 👏

I find the interaction of the different *Spec* structs and how they overwrite each other a little confusing/hard to wrap my head around, but it might just be that I'm less familiar with this code.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @msbutler, @rhu713, and @srosenberg)


-- commits line 5 at r1:
Nit: unwieldy


-- commits line 8 at r1:
Did you mean roachtest run? Or is someone expected to run roachprod directly at some point?


pkg/cmd/roachtest/tests/backup_fixtures.go line 40 at r1 (raw file):

	override.ignoreExistingBackups = specs.ignoreExistingBackups
	// TODO(msbutler): validate the crdb version roachtest will use. We don't want to create a 23.1.0
	// backup with a master binary, for example.

Sounds like something we'll want to do before actually running this code to generate fixtures.


pkg/cmd/roachtest/tests/backup_fixtures.go line 46 at r1 (raw file):

// defaultBackupFixtureSpecs defines the default scheduled backup used to create a fixture.
var defaultBackupFixtureSpecs = scheduledBackupSpecs{
	crontab: "*/5 * * * *",

Nit: would be nice to have an English translation to help people that don't speak fluent crontab. 🙂
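
For reference, `*/5 * * * *` reads minute, hour, day-of-month, month, day-of-week, i.e. "every 5 minutes". A minimal sketch of the kind of comment the nit asks for, with the struct trimmed to the single field under discussion (the real spec has more fields):

package tests

// scheduledBackupSpecs is trimmed here to the field under discussion.
type scheduledBackupSpecs struct {
	// crontab controls how often the schedule takes incremental backups.
	crontab string
}

// defaultBackupFixtureSpecs takes an incremental backup every 5 minutes:
// "*/5 * * * *" means "at every minute divisible by 5, every hour, every day".
var defaultBackupFixtureSpecs = scheduledBackupSpecs{
	crontab: "*/5 * * * *",
}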


pkg/cmd/roachtest/tests/backup_fixtures.go line 48 at r1 (raw file):

	crontab: "*/5 * * * *",
	backupSpecs: backupSpecs{
		version:         "23.1.0",

The actual backupFixtures/* tests specify this version with the leading v. Which one is right?

Also, is the expectation that someone would have to change this code on every release when generating fixtures?


pkg/cmd/roachtest/tests/backup_fixtures.go line 51 at r1 (raw file):

		cloud:           spec.AWS,
		fullBackupDir:   "LATEST",
		backupsIncluded: 24,

Might be worth mentioning why we are using 2x the default here.


pkg/cmd/roachtest/tests/backup_fixtures.go line 65 at r1 (raw file):

	// of false prevents roachtest users from overriding the latest backup in a
	// collection, which may be used in restore roachtests.
	ignoreExistingBackups bool

Nit: IMO, the "default option of false" part of this comment belongs to the defaultBackupFixtureSpecs definition.


pkg/cmd/roachtest/tests/backup_fixtures.go line 127 at r1 (raw file):

	if bd.c.Spec().Cloud != bd.sp.backup.cloud {
		// For now, only run the test on the cloud provider that also stores the backup.
		bd.t.Skip(fmt.Sprintf("test configured to run on %s", bd.sp.backup.cloud))

This is mostly an FYI at this point, but this style of skipping tests is expensive because it requires roachtest to create a cluster (which includes provisioning VMs, running all sorts of scripts, etc.), only to then abort the test 2s in.

Using tags would be preferred (although not that much better due to our inconsistent use of them (e.g., aws meaning "run this test on GCE and AWS")). We'll have a better API in the near future.


pkg/cmd/roachtest/tests/backup_fixtures.go line 152 at r1 (raw file):

func (bd *backupDriver) initWorkload(ctx context.Context) {
	if bd.sp.initFromBackupSpecs.version == "" {
		bd.t.L().Printf(`Initializing workload via ./workload init`)

This is not really ./workload init (especially with tpce), is it?


pkg/cmd/roachtest/tests/backup_fixtures.go line 189 at r1 (raw file):

		time.Sleep(1 * time.Minute)
		var activeScheduleCount int
		sql.QueryRow(bd.t, `SELECT count(*) FROM [SHOW SCHEDULES] WHERE label ='schedule_cluster' and schedule_status='ACTIVE'`).Scan(&activeScheduleCount)

Can we make schedule_cluster a constant?

Nit: inconsistent SQL formatting: and is not upper case, label = vs schedule_status=.


pkg/cmd/roachtest/tests/backup_fixtures.go line 199 at r1 (raw file):

		bd.t.L().Printf(`%d scheduled backups taken`, backupCount)
		if backupCount >= bd.sp.backup.backupsIncluded {
			sql.QueryRow(bd.t, `PAUSE SCHEDULES WITH x AS (SHOW SCHEDULES) SELECT id FROM x WHERE label = 'schedule_cluster'`)

Interesting, is this the only way to filter schedules while pausing? all_schedules (or similar) is probably better than x.
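
A sketch of how the two nits here (plus the formatting nit above) could be addressed together; the constant and query names are illustrative, not necessarily what landed:

package tests

import "fmt"

// scheduleLabel is the label of the backup schedule created by the fixture
// generator; both queries below filter on it.
const scheduleLabel = "schedule_cluster"

var (
	// countActiveSchedulesQuery uses consistent keyword casing and spacing.
	countActiveSchedulesQuery = fmt.Sprintf(
		`SELECT count(*) FROM [SHOW SCHEDULES] WHERE label = '%s' AND schedule_status = 'ACTIVE'`,
		scheduleLabel)
	// pauseSchedulesStmt renames the CTE from x to all_schedules for readability.
	pauseSchedulesStmt = fmt.Sprintf(
		`PAUSE SCHEDULES WITH all_schedules AS (SHOW SCHEDULES) SELECT id FROM all_schedules WHERE label = '%s'`,
		scheduleLabel)
)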


pkg/cmd/roachtest/tests/backup_fixtures.go line 227 at r1 (raw file):

		{
			// 15 GB Backup Fixture. Note, this fixture is created every night to
			// ensure the fixture generation code works.

Do we need to run this every night on both GCE and AWS? Would weekly be good enough?


pkg/cmd/roachtest/tests/backup_fixtures.go line 236 at r1 (raw file):

						backupsIncluded: 4,
						workload:        tpceRestore{customers: 1000}}}),
			initFromBackupSpecs: backupSpecs{version: "v22.2.1", backupProperties: "inc-count=48"},

Is the expectation that someone would update this every release?


pkg/cmd/roachtest/tests/backup_fixtures.go line 283 at r1 (raw file):

				workloadCtx, workloadCancel := context.WithCancel(ctx)
				defer workloadCancel()

This will cause the workload routine to race with roachtest's test teardown (the Run function will return). One way to deal with that would be:

defer func() {
    workloadCancel()
    m.Wait()
}()

pkg/cmd/roachtest/tests/backup_fixtures.go line 286 at r1 (raw file):

				workloadDoneCh := make(chan struct{})
				m.Go(func(ctx context.Context) error {
					defer close(workloadDoneCh)

This channel doesn't seem to be used.


pkg/cmd/roachtest/tests/backup_fixtures.go line 289 at r1 (raw file):

					err := bd.runWorkload(workloadCtx)
					// The workload should only return an error if the roachtest driver cancels the
					// workloadCtx is cancelled after the backup schedule completes.

This comment reads a little weird.

Also, if I'm reading how these contexts are wired up correctly, I don't think this will work. If a node dies, the monitor will cancel ctx, but workloadCtx will continue to be valid.

I think you want to pass workloadCtx to the monitor, and use the ctx argument in this closure.
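
For illustration, here is the lifecycle the review describes reduced to plain Go, with errgroup standing in for roachtest's Monitor (both cancel the context their goroutines observe); this is a sketch of the pattern, not the PR's code:

package main

import (
	"context"
	"fmt"
	"time"

	"golang.org/x/sync/errgroup"
)

// runWorkload stands in for bd.runWorkload: it runs until its context is
// cancelled and then reports the cancellation.
func runWorkload(ctx context.Context) error {
	<-ctx.Done()
	return ctx.Err()
}

func main() {
	// workloadCtx is what the test cancels once the backup schedule has taken
	// enough backups.
	workloadCtx, workloadCancel := context.WithCancel(context.Background())

	// Derive the group's context from workloadCtx (analogous to passing
	// workloadCtx to the monitor). The derived ctx is cancelled either by
	// workloadCancel or when another goroutine in the group fails (analogous
	// to the monitor cancelling its context on node death).
	g, ctx := errgroup.WithContext(workloadCtx)
	g.Go(func() error {
		err := runWorkload(ctx)
		if ctx.Err() != nil {
			// Cancellation after the schedule completes is expected.
			return nil
		}
		return err
	})

	// Simulate "backup schedule complete": cancel the workload, then wait for
	// it before returning so it cannot race with test teardown.
	time.Sleep(100 * time.Millisecond)
	workloadCancel()
	if err := g.Wait(); err != nil {
		fmt.Println("workload error:", err)
	}
}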


pkg/cmd/roachtest/tests/restore.go line 476 at r1 (raw file):

		return hw.nodes + 1
	}
	return 0

So getWorkloadNode is never expected to be called if workloadNode is false? (0 is not a valid node ID). If so, a panic would be a better fit.
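
A minimal sketch of the suggested panic variant, with hardwareSpecs trimmed to the two fields visible above (their semantics are inferred):

package tests

type hardwareSpecs struct {
	// nodes is the number of CRDB nodes in the cluster.
	nodes int
	// workloadNode is true if the cluster reserves an extra node for the workload.
	workloadNode bool
}

// getWorkloadNode panics instead of returning 0, which is not a valid node ID.
func (hw hardwareSpecs) getWorkloadNode() int {
	if !hw.workloadNode {
		panic("hardwareSpecs does not reserve a workload node")
	}
	return hw.nodes + 1
}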


pkg/cmd/roachtest/tests/restore.go line 623 at r1 (raw file):

	// initWorkload loads the cluster with the workload's schema and initial data.
	initWorkload(ctx context.Context, t test.Test, c cluster.Cluster, sp hardwareSpecs)

Any reason why this is not just called init?


pkg/cmd/roachtest/tests/restore.go line 627 at r1 (raw file):

	// foregroundRun begins a foreground workload that runs indefinitely until the passed context
	// is cancelled.
	foregroundRun(ctx context.Context, t test.Test, c cluster.Cluster, sp hardwareSpecs) error

What makes it foreground? IMO, it's still the caller's decision at the end of the day.


pkg/cmd/roachtest/tests/restore.go line 647 at r1 (raw file):

	ctx context.Context, t test.Test, c cluster.Cluster, sp hardwareSpecs,
) error {
	tpceSpec, err := initTPCESpec(ctx, t.L(), c, sp.getWorkloadNode(), sp.getCRDBNodes())

Can we cache this? initTPCESpec does some quite expensive things, including installing Docker.
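
Not shown in the PR excerpt, but the standard way to pay an expensive one-time initialization only once in Go is to memoize it behind a sync.Once; a minimal sketch with illustrative names:

package tests

import "sync"

// cachedSpec memoizes the result of an expensive initialization (e.g. building
// a TPC-E spec, which installs Docker) so repeated calls only pay the cost once.
type cachedSpec struct {
	once sync.Once
	spec string // stand-in for the real spec type
	err  error
}

// get runs init the first time it is called and returns the cached result on
// every subsequent call.
func (c *cachedSpec) get(init func() (string, error)) (string, error) {
	c.once.Do(func() {
		c.spec, c.err = init()
	})
	return c.spec, c.err
}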

@msbutler msbutler force-pushed the butler-backup-framework branch 2 times, most recently from 33f0911 to eb1bd55 on May 19, 2023 at 19:40
Collaborator Author

@msbutler msbutler left a comment

thanks for the review and apologies for such a slow response! I sympathize with how hard it is to read the spec structs. I suppose I could implement some functional options instead. Before I do that, I'd like to figure out how I can leverage these spec structs for a general purpose backup-restore roundtrip test framework.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @renatolabs, @rhu713, and @srosenberg)


-- commits line 8 at r1:

Previously, renatolabs (Renato Costa) wrote…

Did you mean roachtest run? Or is someone expected to run roachprod directly at some point?

oops, meant roachtest run. fixed.


pkg/cmd/roachtest/tests/backup_fixtures.go line 40 at r1 (raw file):

Previously, renatolabs (Renato Costa) wrote…

Sounds like something we'll want to do before actually running this code to generate fixtures.

Just added a runtime assertion.
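
The assertion that was actually added isn't shown in this thread; a purely illustrative sketch of such a version check could look like:

package tests

import (
	"fmt"
	"strings"
)

// assertFixtureVersion refuses to generate a fixture labelled with one version
// using a binary built from another (e.g. a 23.1.0 fixture from a master binary).
func assertFixtureVersion(binaryVersion, fixtureVersion string) error {
	want := strings.TrimPrefix(fixtureVersion, "v")
	if !strings.HasPrefix(strings.TrimPrefix(binaryVersion, "v"), want) {
		return fmt.Errorf("refusing to create a %s fixture with a %s binary",
			fixtureVersion, binaryVersion)
	}
	return nil
}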


pkg/cmd/roachtest/tests/backup_fixtures.go line 46 at r1 (raw file):

Previously, renatolabs (Renato Costa) wrote…

Nit: would be nice to have an English translation to help people that don't speak fluent crontab. 🙂

heh. Added.


pkg/cmd/roachtest/tests/backup_fixtures.go line 48 at r1 (raw file):

Previously, renatolabs (Renato Costa) wrote…

The actual backupFixtures/* tests specify this version with the leading v. Which one is right?

Also, is the expectation that someone would have to change this code on every release when generating fixtures?

Not sure I follow the first question-- the test names do not contain the cluster version?

❯ roachtest list | grep 'backupFixture'
backupFixture/tpce/15GB/aws [disaster-recovery]
backupFixture/tpce/32TB/aws [disaster-recovery] (skipped: only for fixture generation)
backupFixture/tpce/400GB/aws [disaster-recovery] (skipped: only for fixture generation)
backupFixture/tpce/400GB/gce [disaster-recovery] (skipped: only for fixture generation)
backupFixture/tpce/8TB/aws [disaster-recovery] (skipped: only for fixture generation)

Yes, I expect that someone will change this code on every release while generating fixtures.


pkg/cmd/roachtest/tests/backup_fixtures.go line 51 at r1 (raw file):

Previously, renatolabs (Renato Costa) wrote…

Might be worth mentioning why we are using 2x the default here.

oof good catch. that was inadvertent.


pkg/cmd/roachtest/tests/backup_fixtures.go line 65 at r1 (raw file):

Previously, renatolabs (Renato Costa) wrote…

Nit: IMO, the "default option of false" part of this comment belongs to the defaultBackupFixtureSpecs definition.

Done


pkg/cmd/roachtest/tests/backup_fixtures.go line 127 at r1 (raw file):

Previously, renatolabs (Renato Costa) wrote…

This is mostly an FYI at this point, but this style of skipping tests is expensive because it requires roachtest to create a cluster (which includes provisioning VMs, running all sorts of scripts, etc.), only to then abort the test 2s in.

Using tags would be preferred (although not that much better due to our inconsistent use of them (e.g., aws meaning "run this test on GCE and AWS")). We'll have a better API in the near future.

ack. I still need this gate to prevent a gce test from running on aws though, i.e. I want the following manual run to fail:

roachtest run backupFixture/tpce/400GB/gce [...] --cloud aws

pkg/cmd/roachtest/tests/backup_fixtures.go line 152 at r1 (raw file):

Previously, renatolabs (Renato Costa) wrote…

This is not really ./workload init (especially with tpce), is it?

true. removing that bit.


pkg/cmd/roachtest/tests/backup_fixtures.go line 189 at r1 (raw file):

Previously, renatolabs (Renato Costa) wrote…

Can we make schedule_cluster a constant?

Nit: inconsistent SQL formatting: and is not upper case, label = vs schedule_status=.

Fixed


pkg/cmd/roachtest/tests/backup_fixtures.go line 199 at r1 (raw file):

Previously, renatolabs (Renato Costa) wrote…

Interesting, is this the only way to filter schedules while pausing? all_schedules (or similar) is probably better than x.

Idk, I found this query in the docs 🤷

https://www.cockroachlabs.com/docs/dev/manage-a-backup-schedule.html#pause-the-schedule


pkg/cmd/roachtest/tests/backup_fixtures.go line 227 at r1 (raw file):

Previously, renatolabs (Renato Costa) wrote…

Do we need to run this every night on both GCE and AWS? Would weekly be good enough?

Good point. We can do this weekly.


pkg/cmd/roachtest/tests/backup_fixtures.go line 236 at r1 (raw file):

Previously, renatolabs (Renato Costa) wrote…

Is the expectation that someone would update this every release?

yup, at least for now


pkg/cmd/roachtest/tests/backup_fixtures.go line 283 at r1 (raw file):

Previously, renatolabs (Renato Costa) wrote…

This will cause the workload routine to race with roachtest's test teardown (the Run function will return). One way to deal with that would be:

defer func() {
    workloadCancel()
    m.Wait()
}()

ah nice catch.


pkg/cmd/roachtest/tests/backup_fixtures.go line 286 at r1 (raw file):

Previously, renatolabs (Renato Costa) wrote…

This channel doesn't seem to be used.

whoops. no longer necessary.


pkg/cmd/roachtest/tests/backup_fixtures.go line 289 at r1 (raw file):

Previously, renatolabs (Renato Costa) wrote…

This comment reads a little weird.

Also, if I'm reading how these contexts are wired up correctly, I don't think this will work. If a node dies, the monitor will cancel ctx, but workloadCtx will continue to be valid.

I think you want to pass workloadCtx to the monitor, and use the ctx argument in this closure.

ah, nice catch! I was under the impression that the ctx passed into the monitor would get cancelled on node death, but after looking at the implementation of Monitor, I learned that it creates a new cancellable context for node monitoring.

Improved the comment as well.


pkg/cmd/roachtest/tests/restore.go line 476 at r1 (raw file):

Previously, renatolabs (Renato Costa) wrote…

So getWorkloadNode is never expected to be called if workloadNode is false? (0 is not a valid node ID). If so, a panic would be a better fit.

Done


pkg/cmd/roachtest/tests/restore.go line 623 at r1 (raw file):

Previously, renatolabs (Renato Costa) wrote…

Any reason why this is not just called init?

good point


pkg/cmd/roachtest/tests/restore.go line 627 at r1 (raw file):

Previously, renatolabs (Renato Costa) wrote…

What makes it foreground? IMO, it's still the caller's decision at the end of the day.

another good point.


pkg/cmd/roachtest/tests/restore.go line 647 at r1 (raw file):

Previously, renatolabs (Renato Costa) wrote…

Can we cache this? initTPCESpec does some quite expensive things, including installing Docker.

Done

@msbutler msbutler force-pushed the butler-backup-framework branch from eb1bd55 to 9f76177 on May 19, 2023 at 19:59
ignoreExistingBackupsOpt = "ignore_existing_backups"
}
backupCmd := fmt.Sprintf(`BACKUP INTO %s WITH revision_history`, sbs.backupCollection())
cmd := fmt.Sprintf(`CREATE SCHEDULE %s FOR %s RECURRING '%s' FULL BACKUP '@weekly' WITH SCHEDULE OPTIONS first_run = 'now', %s`,

Contributor

I'm wondering, given that we have a goroutine that monitors the existence of backups created by a schedule, whether it would be better to manually invoke a backup for every layer we want instead of using a schedule. That would give us better error handling if any individual backup fails, and allows for frequencies more specific than what the crontab permits, etc.
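
For illustration, driving each layer manually could look roughly like the sketch below (database/sql against a local cluster, with a placeholder collection URI and layer count); this is the alternative being floated, not what the PR does:

package main

import (
	"context"
	"database/sql"
	"fmt"
	"log"
	"time"

	_ "github.com/lib/pq"
)

func main() {
	ctx := context.Background()
	db, err := sql.Open("postgres", "postgresql://root@localhost:26257/?sslmode=disable")
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	const collection = "'s3://bucket/fixture?AUTH=implicit'" // placeholder URI

	// One full backup, then a fixed number of incremental layers, with direct
	// per-layer error handling instead of a schedule plus a polling goroutine.
	if _, err := db.ExecContext(ctx,
		fmt.Sprintf(`BACKUP INTO %s WITH revision_history`, collection)); err != nil {
		log.Fatalf("full backup failed: %v", err)
	}
	for i := 0; i < 24; i++ { // placeholder layer count
		time.Sleep(5 * time.Minute) // let the workload generate more data
		if _, err := db.ExecContext(ctx,
			fmt.Sprintf(`BACKUP INTO LATEST IN %s WITH revision_history`, collection)); err != nil {
			log.Fatalf("incremental backup %d failed: %v", i+1, err)
		}
	}
}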

Collaborator Author

@msbutler msbutler May 26, 2023

I think the point of this set of roachtests and this specific call is to create a fixture. I could imagine that for a generic backup-restore roundtrip test, we'd want a more flexible API for scheduling backups, as you say, but I think that's out of scope here.

@msbutler msbutler force-pushed the butler-backup-framework branch from 9f76177 to 6655584 on June 12, 2023 at 18:19
@msbutler
Collaborator Author

@renatolabs @rhu713 friendly ping here!

@msbutler msbutler force-pushed the butler-backup-framework branch from 6655584 to 37463f5 on June 13, 2023 at 09:48
@msbutler
Collaborator Author

two flaky tests, restarting CI

Contributor

@renatolabs renatolabs left a comment

:lgtm:

Thanks for addressing all the comments!

Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @rhu713 and @srosenberg)

@msbutler
Collaborator Author

TFTRs!!

bors r=rhu713, renatolabs

@craig
Contributor

craig bot commented Jun 14, 2023

Build succeeded:

@craig craig bot merged commit 244ef1c into cockroachdb:master Jun 14, 2023
Development

Successfully merging this pull request may close these issues.

backupccl: write new backup roachtest driver
4 participants