backupccl: introduce backup fixture generator framework #102821
Conversation
Force-pushed 1ed504c to 91d7859
This definitely seems like an improvement over what we had before! 👏
I find the interaction of the different *Spec* structs, and how they overwrite each other, a little confusing and hard to wrap my head around, but that might just be because I'm less familiar with this code.
Reviewable status:
complete! 0 of 0 LGTMs obtained (waiting on @msbutler, @rhu713, and @srosenberg)
-- commits
line 5 at r1:
Nit: unwieldy
-- commits
line 8 at r1:
Did you mean `roachtest run`? Or is someone expected to run `roachprod` directly at some point?
pkg/cmd/roachtest/tests/backup_fixtures.go
line 40 at r1 (raw file):
override.ignoreExistingBackups = specs.ignoreExistingBackups
// TODO(msbutler): validate the crdb version roachtest will use. We don't want to create a 23.1.0
// backup with a master binary, for example.
Sounds like something we'll want to do before actually running this code to generate fixtures.
pkg/cmd/roachtest/tests/backup_fixtures.go
line 46 at r1 (raw file):
// defaultBackupFixtureSpecs defines the default scheduled backup used to create a fixture.
var defaultBackupFixtureSpecs = scheduledBackupSpecs{
	crontab: "*/5 * * * *",
Nit: would be nice to have an English translation to help people that don't speak fluent crontab. 🙂
pkg/cmd/roachtest/tests/backup_fixtures.go
line 48 at r1 (raw file):
crontab: "*/5 * * * *", backupSpecs: backupSpecs{ version: "23.1.0",
The actual `backupFixtures/*` tests specify this version with the leading `v`. Which one is right?
Also, is the expectation that someone would have to change this code on every release when generating fixtures?
pkg/cmd/roachtest/tests/backup_fixtures.go
line 51 at r1 (raw file):
cloud: spec.AWS,
fullBackupDir: "LATEST",
backupsIncluded: 24,
Might be worth mentioning why we are using 2x the default here.
pkg/cmd/roachtest/tests/backup_fixtures.go
line 65 at r1 (raw file):
// of false prevents roachtest users from overriding the latest backup in a
// collection, which may be used in restore roachtests.
ignoreExistingBackups bool
Nit: IMO, the "default option of false" part of this comment belongs to the `defaultBackupFixtureSpecs` definition.
pkg/cmd/roachtest/tests/backup_fixtures.go
line 127 at r1 (raw file):
if bd.c.Spec().Cloud != bd.sp.backup.cloud {
	// For now, only run the test on the cloud provider that also stores the backup.
	bd.t.Skip(fmt.Sprintf("test configured to run on %s", bd.sp.backup.cloud))
This is mostly an FYI at this point, but this style of skipping tests is expensive because it requires roachtest to create a cluster (which includes provisioning VMs, running all sorts of scripts, etc.), only to then abort the test 2s in.
Using tags would be preferred (although not that much better due to our inconsistent use of them, e.g. `aws` meaning "run this test on GCE and AWS"). We'll have a better API in the near future.
pkg/cmd/roachtest/tests/backup_fixtures.go
line 152 at r1 (raw file):
func (bd *backupDriver) initWorkload(ctx context.Context) {
	if bd.sp.initFromBackupSpecs.version == "" {
		bd.t.L().Printf(`Initializing workload via ./workload init`)
This is not really `./workload init` (especially with tpce), is it?
pkg/cmd/roachtest/tests/backup_fixtures.go
line 189 at r1 (raw file):
time.Sleep(1 * time.Minute)
var activeScheduleCount int
sql.QueryRow(bd.t, `SELECT count(*) FROM [SHOW SCHEDULES] WHERE label ='schedule_cluster' and schedule_status='ACTIVE'`).Scan(&activeScheduleCount)
Can we make `schedule_cluster` a constant?
Nit: inconsistent SQL formatting: `and` is not upper case, `label =` vs `schedule_status=`.
pkg/cmd/roachtest/tests/backup_fixtures.go
line 199 at r1 (raw file):
bd.t.L().Printf(`%d scheduled backups taken`, backupCount)
if backupCount >= bd.sp.backup.backupsIncluded {
	sql.QueryRow(bd.t, `PAUSE SCHEDULES WITH x AS (SHOW SCHEDULES) SELECT id FROM x WHERE label = 'schedule_cluster'`)
Interesting, is this the only way to filter schedules while pausing? `all_schedules` (or similar) is probably better than `x`.
pkg/cmd/roachtest/tests/backup_fixtures.go
line 227 at r1 (raw file):
{
	// 15 GB Backup Fixture. Note, this fixture is created every night to
	// ensure the fixture generation code works.
Do we need to run this every night on both GCE and AWS? Would weekly be good enough?
pkg/cmd/roachtest/tests/backup_fixtures.go
line 236 at r1 (raw file):
backupsIncluded: 4,
workload: tpceRestore{customers: 1000}}}),
initFromBackupSpecs: backupSpecs{version: "v22.2.1", backupProperties: "inc-count=48"},
Is the expectation that someone would update this every release?
pkg/cmd/roachtest/tests/backup_fixtures.go
line 283 at r1 (raw file):
workloadCtx, workloadCancel := context.WithCancel(ctx)
defer workloadCancel()
This will cause the workload routine to race with roachtest's test teardown (the `Run` function will return). One way to deal with that would be:
defer func() {
workloadCancel()
m.Wait()
}()
pkg/cmd/roachtest/tests/backup_fixtures.go
line 286 at r1 (raw file):
workloadDoneCh := make(chan struct{})
m.Go(func(ctx context.Context) error {
	defer close(workloadDoneCh)
This channel doesn't seem to be used.
pkg/cmd/roachtest/tests/backup_fixtures.go
line 289 at r1 (raw file):
err := bd.runWorkload(workloadCtx)
// The workload should only return an error if the roachtest driver cancels the
// workloadCtx is cancelled after the backup schedule completes.
This comment reads a little weird.
Also, if I'm reading how these contexts are wired up correctly, I don't think this will work. If a node dies, the monitor will cancel `ctx`, but `workloadCtx` will continue to be valid.
I think you want to pass `workloadCtx` to the monitor, and use the `ctx` argument in this closure.
pkg/cmd/roachtest/tests/restore.go
line 476 at r1 (raw file):
	return hw.nodes + 1
}
return 0
So `getWorkloadNode` is never expected to be called if `workloadNode` is false? (`0` is not a valid node ID.) If so, a panic would be a better fit.
pkg/cmd/roachtest/tests/restore.go
line 623 at r1 (raw file):
// initWorkload loads the cluster with the workload's schema and initial data.
initWorkload(ctx context.Context, t test.Test, c cluster.Cluster, sp hardwareSpecs)
Any reason why this is not just called `init`?
pkg/cmd/roachtest/tests/restore.go
line 627 at r1 (raw file):
// foregroundRun begins a foreground workload that runs indefinitely until the passed context
// is cancelled.
foregroundRun(ctx context.Context, t test.Test, c cluster.Cluster, sp hardwareSpecs) error
What makes it foreground? IMO, it's still the caller's decision at the end of the day.
pkg/cmd/roachtest/tests/restore.go
line 647 at r1 (raw file):
ctx context.Context, t test.Test, c cluster.Cluster, sp hardwareSpecs,
) error {
	tpceSpec, err := initTPCESpec(ctx, t.L(), c, sp.getWorkloadNode(), sp.getCRDBNodes())
Can we cache this? `initTPCESpec` does some quite expensive things, including installing Docker.
Force-pushed 33f0911 to eb1bd55
thanks for the review and apologies for such a slow response! I sympathize with how hard it is to read the spec structs. I suppose I could implement some functional options instead. Before I do that, I'd like to figure out how I can leverage these spec structs for a general purpose backup-restore roundtrip test framework.
Reviewable status:
complete! 0 of 0 LGTMs obtained (waiting on @renatolabs, @rhu713, and @srosenberg)
Previously, renatolabs (Renato Costa) wrote…
Did you mean `roachtest run`? Or is someone expected to run `roachprod` directly at some point?
oops, meant `roachtest run`. fixed.
pkg/cmd/roachtest/tests/backup_fixtures.go
line 40 at r1 (raw file):
Previously, renatolabs (Renato Costa) wrote…
Sounds like something we'll want to do before actually running this code to generate fixtures.
Just added a runtime assertion.
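For readers following along, a minimal sketch of what such a runtime assertion could look like; the helper name and the version-comparison strategy are assumptions for illustration, not the actual change:

```go
package fixtures

import (
	"fmt"
	"strings"
)

// assertFixtureVersionMatchesBinary refuses to write a fixture labelled with one
// release (e.g. 23.1.0) using a binary from another release (e.g. a master build).
func assertFixtureVersionMatchesBinary(fixtureVersion, binaryVersion string) error {
	fv := strings.TrimPrefix(fixtureVersion, "v")
	bv := strings.TrimPrefix(binaryVersion, "v")
	if !strings.HasPrefix(bv, fv) {
		return fmt.Errorf("refusing to create a %s backup fixture with a %s binary", fixtureVersion, binaryVersion)
	}
	return nil
}
```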
pkg/cmd/roachtest/tests/backup_fixtures.go
line 46 at r1 (raw file):
Previously, renatolabs (Renato Costa) wrote…
Nit: would be nice to have an English translation to help people that don't speak fluent crontab. 🙂
heh. Added.
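Something along these lines; the struct below is a trimmed-down stand-in rather than the real `scheduledBackupSpecs`, shown only to illustrate where the English gloss lives:

```go
package fixtures

// scheduledBackupSpecs is a trimmed-down stand-in for the real struct.
type scheduledBackupSpecs struct {
	crontab string
}

var defaultBackupFixtureSpecs = scheduledBackupSpecs{
	// "*/5 * * * *": take an incremental backup every 5 minutes. The weekly full
	// backup cadence is configured separately via FULL BACKUP '@weekly'.
	crontab: "*/5 * * * *",
}
```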
pkg/cmd/roachtest/tests/backup_fixtures.go
line 48 at r1 (raw file):
Previously, renatolabs (Renato Costa) wrote…
The actual `backupFixtures/*` tests specify this version with the leading `v`. Which one is right? Also, is the expectation that someone would have to change this code on every release when generating fixtures?
Not sure I follow the first question-- the test names do not contain the cluster version?
❯ roachtest list | grep 'backupFixture'
backupFixture/tpce/15GB/aws [disaster-recovery]
backupFixture/tpce/32TB/aws [disaster-recovery] (skipped: only for fixture generation)
backupFixture/tpce/400GB/aws [disaster-recovery] (skipped: only for fixture generation)
backupFixture/tpce/400GB/gce [disaster-recovery] (skipped: only for fixture generation)
backupFixture/tpce/8TB/aws [disaster-recovery] (skipped: only for fixture generation)
Yes, I expect that someone will change this code on every release while generating fixtures.
pkg/cmd/roachtest/tests/backup_fixtures.go
line 51 at r1 (raw file):
Previously, renatolabs (Renato Costa) wrote…
Might be worth mentioning why we are using 2x the default here.
oof good catch. that was inadvertent.
pkg/cmd/roachtest/tests/backup_fixtures.go
line 65 at r1 (raw file):
Previously, renatolabs (Renato Costa) wrote…
Nit: IMO, the "default option of false" part of this comment belongs to the `defaultBackupFixtureSpecs` definition.
Done
pkg/cmd/roachtest/tests/backup_fixtures.go
line 127 at r1 (raw file):
Previously, renatolabs (Renato Costa) wrote…
This is mostly an FYI at this point, but this style of skipping tests is expensive because it requires roachtest to create a cluster (which includes provisioning VMs, running all sorts of scripts, etc.), only to then abort the test 2s in.
Using tags would be preferred (although not that much better due to our inconsistent use of them, e.g. `aws` meaning "run this test on GCE and AWS"). We'll have a better API in the near future.
ack. I still need this gate to prevent a gce test from running on aws though, i.e. I want the following manual run to fail:
roachtest run backupFixture/tpce/400GB/gce [...] --cloud aws
pkg/cmd/roachtest/tests/backup_fixtures.go
line 152 at r1 (raw file):
Previously, renatolabs (Renato Costa) wrote…
This is not really `./workload init` (especially with tpce), is it?
true. removing that bit.
pkg/cmd/roachtest/tests/backup_fixtures.go
line 189 at r1 (raw file):
Previously, renatolabs (Renato Costa) wrote…
Can we make `schedule_cluster` a constant?
Nit: inconsistent SQL formatting: `and` is not upper case, `label =` vs `schedule_status=`.
Fixed
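A sketch of the kind of fix being described; the constant name is an assumption, not necessarily what the patch uses:

```go
package fixtures

// scheduleLabel is the label attached to the backup schedule created by the
// fixture generator (hypothetical constant name).
const scheduleLabel = "schedule_cluster"

// activeScheduleCountQuery uses upper-case keywords and uniform spacing around
// '=' so the SQL formatting stays consistent.
const activeScheduleCountQuery = `SELECT count(*) FROM [SHOW SCHEDULES] ` +
	`WHERE label = '` + scheduleLabel + `' AND schedule_status = 'ACTIVE'`
```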
pkg/cmd/roachtest/tests/backup_fixtures.go
line 199 at r1 (raw file):
Previously, renatolabs (Renato Costa) wrote…
Interesting, is this the only way to filter schedules while pausing? `all_schedules` (or similar) is probably better than `x`.
Idk, I found this query in the docs 🤷
https://www.cockroachlabs.com/docs/dev/manage-a-backup-schedule.html#pause-the-schedule
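For what it's worth, the docs query with the CTE given a descriptive name should behave identically; something like:

```go
package fixtures

// pauseSchedulesStmt pauses every schedule carrying the fixture's label. The CTE
// is named all_schedules instead of x purely for readability; the semantics are
// the same as the query from the docs.
const pauseSchedulesStmt = `PAUSE SCHEDULES WITH all_schedules AS (SHOW SCHEDULES)
SELECT id FROM all_schedules WHERE label = 'schedule_cluster'`
```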
pkg/cmd/roachtest/tests/backup_fixtures.go
line 227 at r1 (raw file):
Previously, renatolabs (Renato Costa) wrote…
Do we need to run this every night on both GCE and AWS? Would weekly be good enough?
Good point. We can do this weekly.
pkg/cmd/roachtest/tests/backup_fixtures.go
line 236 at r1 (raw file):
Previously, renatolabs (Renato Costa) wrote…
Is the expectation that someone would update this every release?
yup, at least for now
pkg/cmd/roachtest/tests/backup_fixtures.go
line 283 at r1 (raw file):
Previously, renatolabs (Renato Costa) wrote…
This will cause the workload routine to race with roachtest's test teardown (the `Run` function will return). One way to deal with that would be:
defer func() {
	workloadCancel()
	m.Wait()
}()
ah nice catch.
pkg/cmd/roachtest/tests/backup_fixtures.go
line 286 at r1 (raw file):
Previously, renatolabs (Renato Costa) wrote…
This channel doesn't seem to be used.
whoops. no longer necessary.
pkg/cmd/roachtest/tests/backup_fixtures.go
line 289 at r1 (raw file):
Previously, renatolabs (Renato Costa) wrote…
This comment reads a little weird.
Also, if I'm reading how these contexts are wired up correctly, I don't think this will work. If a node dies, the monitor will cancel `ctx`, but `workloadCtx` will continue to be valid.
I think you want to pass `workloadCtx` to the monitor, and use the `ctx` argument in this closure.
ah, nice catch! I was under the impression that the `ctx` passed into `monitor` would get cancelled on node death, but after looking at the implementation of Monitor, I learned that it creates a new cancellable context for node monitoring.
Improved the comment as well.
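To make the suggested wiring concrete, here is a simplified sketch; the monitor interface below is a local stand-in with the same shape as roachtest's monitor, not the real API:

```go
package fixtures

import "context"

// monitor stands in for roachtest's cluster monitor: Go runs a function whose
// context is cancelled if a node dies, and Wait blocks until all such functions return.
type monitor interface {
	Go(fn func(ctx context.Context) error)
	Wait() error
}

// runBackupSchedule sketches the wiring from the review: the monitor owns
// workloadCtx, the workload closure uses the ctx the monitor hands it (cancelled
// on node death), and teardown cancels the workload and waits for it before
// returning, so the workload goroutine cannot race with test teardown.
func runBackupSchedule(
	ctx context.Context,
	newMonitor func(context.Context) monitor,
	runWorkload func(context.Context) error,
	driveSchedule func(context.Context) error,
) (retErr error) {
	workloadCtx, workloadCancel := context.WithCancel(ctx)
	m := newMonitor(workloadCtx)
	defer func() {
		workloadCancel()
		if err := m.Wait(); err != nil && retErr == nil {
			retErr = err
		}
	}()
	m.Go(func(ctx context.Context) error {
		// Use the monitor-provided ctx, not workloadCtx, so a node death stops the workload.
		return runWorkload(ctx)
	})
	// Drive the scheduled backups to completion, then let the deferred block
	// stop the workload and wait for its goroutine to exit.
	return driveSchedule(ctx)
}
```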
pkg/cmd/roachtest/tests/restore.go
line 476 at r1 (raw file):
Previously, renatolabs (Renato Costa) wrote…
So `getWorkloadNode` is never expected to be called if `workloadNode` is false? (`0` is not a valid node ID.) If so, a panic would be a better fit.
Done
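The shape of that change, roughly; the struct here is a simplified stand-in for the real `hardwareSpecs`:

```go
package fixtures

// hardwareSpecs is a simplified stand-in: the workload node, when provisioned,
// is the node after the CRDB nodes.
type hardwareSpecs struct {
	nodes        int
	workloadNode bool
}

// getWorkloadNode panics rather than returning 0 (not a valid node ID) when the
// spec has no workload node, since calling it in that case is a programming error.
func (hw hardwareSpecs) getWorkloadNode() int {
	if !hw.workloadNode {
		panic("hardwareSpecs: spec does not include a workload node")
	}
	return hw.nodes + 1
}
```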
pkg/cmd/roachtest/tests/restore.go
line 623 at r1 (raw file):
Previously, renatolabs (Renato Costa) wrote…
Any reason why this is not just called `init`?
good point
pkg/cmd/roachtest/tests/restore.go
line 627 at r1 (raw file):
Previously, renatolabs (Renato Costa) wrote…
What makes it foreground? IMO, it's still the caller's decision at the end of the day.
another good point.
pkg/cmd/roachtest/tests/restore.go
line 647 at r1 (raw file):
Previously, renatolabs (Renato Costa) wrote…
Can we cache this? `initTPCESpec` does some quite expensive things, including installing Docker.
Done
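One way to cache it, purely illustrative (the real change may instead stash the spec on the test driver rather than a package variable):

```go
package fixtures

import "sync"

// tpceSpec is elided here; the real initTPCESpec does expensive setup
// (installing Docker, etc.), so we only want to run it once per process.
type tpceSpec struct{}

var (
	tpceSpecOnce sync.Once
	tpceSpecVal  *tpceSpec
	tpceSpecErr  error
)

// getOrInitTPCESpec memoizes the first initialization (success or failure) and
// returns the same result on subsequent calls.
func getOrInitTPCESpec(init func() (*tpceSpec, error)) (*tpceSpec, error) {
	tpceSpecOnce.Do(func() {
		tpceSpecVal, tpceSpecErr = init()
	})
	return tpceSpecVal, tpceSpecErr
}
```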
Force-pushed eb1bd55 to 9f76177
ignoreExistingBackupsOpt = "ignore_existing_backups"
}
backupCmd := fmt.Sprintf(`BACKUP INTO %s WITH revision_history`, sbs.backupCollection())
cmd := fmt.Sprintf(`CREATE SCHEDULE %s FOR %s RECURRING '%s' FULL BACKUP '@weekly' WITH SCHEDULE OPTIONS first_run = 'now', %s`,
I'm wondering, given that we have a goroutine monitoring for the existence of backups created by the schedule, whether it's better to manually invoke a backup for every layer we want instead of using a schedule. That would give us better error handling if any individual backup fails, and allows for frequencies more specific than the crontab, etc.
I think the point of this set of roachtests and this specific call is to create a fixture. I could imagine that for a generic backup-restore roundtrip test, we'd want a more flexible API for scheduling backups, as you say, but I think that's out of scope here.
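For context, the alternative being described would look roughly like issuing one statement per layer instead of creating a schedule; a sketch only, with a hypothetical helper (the full-backup statement is the one from the diff above):

```go
package fixtures

import "fmt"

// manualBackupStatements returns one BACKUP statement per layer: a full backup
// followed by n incrementals into the same collection, which lets the caller
// check each statement's error individually and choose any cadence it likes.
func manualBackupStatements(collection string, incrementals int) []string {
	stmts := []string{fmt.Sprintf(`BACKUP INTO %s WITH revision_history`, collection)}
	for i := 0; i < incrementals; i++ {
		stmts = append(stmts, fmt.Sprintf(`BACKUP INTO LATEST IN %s WITH revision_history`, collection))
	}
	return stmts
}
```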
Previously, we either created backup fixtures by manually creating and running workloads on roachprod or by using unwieldy bash scripts. This patch introduces a framework that makes it easy to generate a backup fixture via the roachtest API. Once the fixture writer specifies a foreground workload (e.g. tpce) and a scheduled backup specification, a single `roachtest run` invocation will create the fixture in a cloud bucket that can be easily fetched by restore roachtests.

The fixture creator can initialize the foreground workload using a `workload init` cmd or by restoring from an old fixture.

Note that the vast majority of the test specifications are "skipped" so they are not run in the nightly roachtest suite. Creating large fixtures is expensive, and they only need to be recreated once per major release. This patch creates 5 new "roachtests":
- backupFixture/tpce/15GB/aws [disaster-recovery]
- backupFixture/tpce/32TB/aws [disaster-recovery] (skipped)
- backupFixture/tpce/400GB/aws [disaster-recovery] (skipped)
- backupFixture/tpce/400GB/gce [disaster-recovery] (skipped)
- backupFixture/tpce/8TB/aws [disaster-recovery] (skipped)

In the future, this framework should be extended to make it easier to write backup-restore roundtrip tests as well.

Fixes cockroachdb#99787

Release note: None
Force-pushed 9f76177 to 6655584
@renatolabs @rhu713 friendly ping here!
Force-pushed 6655584 to 37463f5
two flaky tests, restarting CI
Thanks for addressing all the comments!
Reviewable status:
complete! 1 of 0 LGTMs obtained (waiting on @rhu713 and @srosenberg)
TFTRs!! bors r=rhu713, renatolabs
Build succeeded.