changefeedccl: roachtest refactor and initial-scan-only #91394
Conversation
pkg/cmd/roachtest/tests/cdc.go
Outdated
```go
WithPrometheusNode(t.workloadNode.InstallNodes()[0]).
	WithCluster(t.crdbNodes.InstallNodes()).
	WithNodeExporter(t.crdbNodes.InstallNodes()).
	WithGrafanaDashboard("https://gist.githubusercontent.com/samiskin/75b10ccd4a08280cf1f5d20e916188ae/raw/a5556a5c0c3d82b3449f3140107366995fa16de3/dashboard.json")
```
This is just a simple one I threw together right now; I'll be tweaking it prior to merge, and then I'll put it in as a proper file in this repository and link to that. Just wanted to get this up for review.
ack.
all nits
🎉
Thanks for refactoring, this looks great! Minor comments and questions.
pkg/cmd/roachtest/tests/cdc.go
line 114 at r3 (raw file):
```go
Timer:   Periodic{Period: 2 * time.Minute, DownTime: 20 * time.Second},
Target:  t.crdbNodes.RandNode,
Stopper: chaosStopper,
```
Idea: how about randomly setting the `DrainAndQuit` option here? Right now the chaos loop will always force-kill nodes; randomly letting them drain could expose bugs (as has been the case in the past in the cdc/mixed-versions test).
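A minimal sketch of the idea, building on the quoted snippet above; it assumes the `Chaos` struct exposes a `DrainAndQuit` toggle alongside `Timer`/`Target`/`Stopper`, and the coin flip is an assumption of mine, not code from this PR:

```go
// Sketch only (fragment; needs "math/rand" and "time" imports).
// Per test run, randomly choose whether chaos drains nodes
// gracefully or force-kills them.
ch := Chaos{
	Timer:        Periodic{Period: 2 * time.Minute, DownTime: 20 * time.Second},
	Target:       t.crdbNodes.RandNode,
	Stopper:      chaosStopper,
	DrainAndQuit: rand.Float64() < 0.5, // hypothetical: flip a coin each run
}
```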
pkg/cmd/roachtest/tests/cdc.go
line 190 at r3 (raw file):
```go
// TolerateErrors if crdbChaos is true; otherwise, the workload will fail
// if it attempts to use the node which was brought down by chaos.
tolerateErrors: t.tolerateNodeErrors,
```
Comment on API: the relationship between `startCRDBChaos` and this function is subtle. It's not clear (especially to the caller) that you are supposed to call the chaos function first so that this tpcc function will know that it should tolerate errors. IMO, a better API would be to pass, in some way, `tolerateErrors` as a parameter here. This makes things more explicit and makes it possible to run the workload before starting the chaos loop.
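One way the suggestion could look (a sketch; `tpccArgs` exists in this PR, but the `tolerateErrors` field and the command plumbing here are assumptions):

```go
// Sketch: the caller states its intent explicitly, instead of the
// workload inferring it from whether chaos was started first.
type tpccArgs struct {
	warehouses     int
	duration       string
	tolerateErrors bool // set by the caller when it plans to run chaos
}

func (ct *cdcTester) runTPCCWorkload(args tpccArgs) {
	cmd := fmt.Sprintf("./workload run tpcc --warehouses=%d", args.warehouses)
	if args.tolerateErrors {
		cmd += " --tolerate-errors"
	}
	// ... start cmd on the workload node ...
}
```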
pkg/cmd/roachtest/tests/cdc.go
line 200 at r3 (raw file):
```go
}

if args.duration != "" {
```
Is this what you intended? Only run the workload if a duration is passed and print a `skip` message if not passed? Looks confusing.
pkg/cmd/roachtest/tests/cdc.go
line 242 at r3 (raw file):
```go
func (t *cdcTester) DB() *gosql.DB {
	return t.cluster.Conn(t.ctx, t.rt.L(), 1)
```
Since, IMO, one of the goals here is to provide a struct that automatically gives you good defaults, we could take the chance to select a random node when a DB connection is needed.
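For illustration, the suggested change might look like this sketch; using `Spec().NodeCount` to count nodes is an assumption on my part:

```go
// Sketch only: connect to a random node rather than always node 1.
// Needs the "math/rand" import.
func (t *cdcTester) DB() *gosql.DB {
	node := 1 + rand.Intn(t.cluster.Spec().NodeCount) // assumed node-count accessor
	return t.cluster.Conn(t.ctx, t.rt.L(), node)
}
```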
pkg/cmd/roachtest/tests/cdc.go
line 304 at r3 (raw file):
```go
feedOptions := make(map[string]string)
feedOptions["min_checkpoint_frequency"] = "'10s'"
if args.sinkType == cloudStorageSink || args.sinkType == webhookSink {
```
I'm curious if we still think these defaults make sense. Should we really be setting these options based on the sink? I'm not from the CDC team, but I wouldn't know why `10s` is a good setting for `resolved` for cloud/webhook sinks, but an empty `resolved` option is a good default otherwise. Should we be explicit? If this is indeed better, maybe we could at least have a comment explaining why the distinction exists and why the defaults are sensible.
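A sketch of the "at least add a comment" option; the `resolved` assignment inside the branch and the stated rationale are inferred from this discussion, not copied from the PR:

```go
feedOptions := make(map[string]string)
feedOptions["min_checkpoint_frequency"] = "'10s'"
if args.sinkType == cloudStorageSink || args.sinkType == webhookSink {
	// Assumed rationale, to be confirmed by the CDC team: these sinks
	// batch output into files/requests, so a periodic resolved
	// timestamp bounds how long they buffer before flushing. Other
	// sinks leave resolved unset and emit rows as they come.
	feedOptions["resolved"] = "'10s'"
}
```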
pkg/cmd/roachtest/tests/cdc.go
line 751 at r3 (raw file):
```go
}

func runCDCKafkaAuth(ctx context.Context, rt test.Test, c cluster.Cluster) {
```
Nit: why did we change the parameter name here and in other tests? I believe pretty much every other roachtest uses `t` for the `test.Test` struct. Keeping it the same, in my opinion, reduces the cognitive burden. I'm also not sure what the `r` stands for in `rt`.
pkg/cmd/roachtest/tests/cdc.go
line 438 at r1 (raw file):
Previously, miretskiy (Yevgeniy Miretskiy) wrote…
ack.
Updated this with a go.crdb.dev URL that we can freely change.
pkg/cmd/roachtest/tests/cdc.go
line 121 at r2 (raw file):
Previously, miretskiy (Yevgeniy Miretskiy) wrote…
can we switch to `switch` instead?
Done.
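(For illustration only: the code at this line isn't shown in the thread, so this is a generic sketch of the requested if/else-to-switch rewrite, reusing the sink-type names that appear elsewhere in this review.)

```go
// Before (illustrative):
//   if args.sinkType == cloudStorageSink {
//       ...
//   } else if args.sinkType == webhookSink {
//       ...
//   }
// After: a switch reads more clearly as cases are added.
switch args.sinkType {
case cloudStorageSink:
	// ...
case webhookSink:
	// ...
default:
	// ...
}
```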
pkg/cmd/roachtest/tests/cdc.go
line 204 at r2 (raw file):
Previously, miretskiy (Yevgeniy Miretskiy) wrote…
wdyt about moving these if/else branches to their own helpers (e.g. runLedgerWorkload, runTPCCWorkload)?
Done.
pkg/cmd/roachtest/tests/cdc.go
line 114 at r3 (raw file):
Previously, renatolabs (Renato Costa) wrote…
Idea: how about randomly setting the `DrainAndQuit` option here? Right now the chaos loop will always force-kill nodes; randomly letting them drain could expose bugs (as has been the case in the past in the cdc/mixed-versions test).
I thought about it, but I wanted to keep this PR as close to refactor-only (no functionality changes) as possible at first, so that if we do start seeing failures I know it's a roachtest problem rather than a changefeed problem.
pkg/cmd/roachtest/tests/cdc.go
line 190 at r3 (raw file):
Previously, renatolabs (Renato Costa) wrote…
Comment on API: the relationship between `startCRDBChaos` and this function is subtle. It's not clear (especially to the caller) that you are supposed to call the chaos function first so that this tpcc function will know that it should tolerate errors. IMO, a better API would be to pass, in some way, `tolerateErrors` as a parameter here. This makes things more explicit and makes it possible to run the workload before starting the chaos loop.
Good point 👍, I made `tolerateErrors` just be an argument for the feed job / workload.
pkg/cmd/roachtest/tests/cdc.go
line 200 at r3 (raw file):
Previously, renatolabs (Renato Costa) wrote…
Is this what you intended? Only run the workload if a duration is passed and print a `skip` message if not passed? Looks confusing.
Yep, since for initial scan tests we don't necessarily care about a workload running, only the initial fixture.
pkg/cmd/roachtest/tests/cdc.go
line 242 at r3 (raw file):
Previously, renatolabs (Renato Costa) wrote…
Since, IMO, one of the goals here is to provide a struct that automatically gives you good defaults, we could take the chance to select a random node when a DB connection is needed.
Same as above: I want to leave functionality unchanged for this PR to make it easier to isolate the reason for a test failure.
pkg/cmd/roachtest/tests/cdc.go
line 304 at r3 (raw file):
Previously, renatolabs (Renato Costa) wrote…
I'm curious if we still think these defaults make sense. Should we really be setting these options based on the sink? I'm not from the CDC team, but I wouldn't know why `10s` is a good setting for `resolved` for cloud/webhook sinks, but an empty `resolved` option is a good default otherwise. Should we be explicit? If this is indeed better, maybe we could at least have a comment explaining why the distinction exists and why the defaults are sensible.
Not exactly sure why the 10s was picked, but I added a comment about `envelope=wrapped`. We could rethink these, but for now I'm leaving all functionality unchanged.
pkg/cmd/roachtest/tests/cdc.go
line 751 at r3 (raw file):
Previously, renatolabs (Renato Costa) wrote…
Nit: why did we change the parameter name here and in other tests? I believe pretty much every other roachtest uses `t` for the `test.Test` struct. Keeping it the same, in my opinion, reduces the cognitive burden. I'm also not sure what the `r` stands for in `rt`.
I had `rt` for roachtest since it's from the roachtest package. I changed it to `ct` for cdcTester and `t` for roachtest 👍. I forgot to consider folks like test eng who deal with multiple teams' roachtests.
There ended up being a few failures after running all the roachtests manually that I couldn't fully resolve in time, so this'll have to wait until I return from PTO next Tuesday.
I don't think the debug flag works as intended. I tried `roachtest run cdc/initial-scan-only --debug --cockroach=artifacts/cockroach` and the grafana dashboard dies after the test. I think this is because the nodes shut down.
@jayshrivastava try naming the cluster
bors r+
🕐 Waiting for PR status (GitHub check) to be set, probably by CI. Bors will automatically try to run when all required PR statuses are set.
bors r+
Build succeeded.
Epic: https://cockroachlabs.atlassian.net/browse/CRDB-19057
Changefeed roachtests were set up focused on running a workload for a specific duration and then quitting, making it difficult to run an `initial_scan_only` test that terminates upon job success.

We as a team have also noticed a greater need to test and observe changefeeds running in production against real sinks, to catch issues we are unable to mock or observe from simple unit tests. This is currently a notable hassle, as one has to set up each individual sink and run it, ensure the changefeed is pointing to the right URI, and then be able to monitor the metrics of this long-running process.

This change refactors `cdcBasicTest` into distinct pieces that are then put together in a test. This allows for easier experimentation with live tests: we can spin up a cluster and a workload, run one or more changefeeds on it, set up a poller to print out job details, have an accessible Grafana URL to view metrics, and wait for some completion condition.

Changing the specialized `runCDCKafkaAuth`, `runCDCBank`, and `runCDCSchemaRegistry` functions was left out of scope for this first big change.

The main APIs involved in basic roachtests are now:
- `newCDCTester`: creates a tester struct to run the rest of the APIs and initializes the database
- `tester.runTPCCWorkload(tpccArgs)`: starts a TPCC workload from the last node in the cluster
- `tester.runLedgerWorkload(ledgerArgs)`: starts a Ledger workload from the last node in the cluster
- `tester.runFeedLatencyVerifier(changefeedJob, latencyTargets)`: starts a routine that monitors the changefeed latency until the tester is `Close`'d
- `tester.waitForWorkload`: waits for a workload started by `setupAndRunWorkload` to complete its duration
- `tester.startCRDBChaos`: starts a chaos routine that periodically shuts nodes down and brings them back up
- `tester.newChangefeed(feedArgs)`: starts a new changefeed on the cluster and returns a `changefeedJob` object
- `changefeedJob.waitForCompletion`: waits for a changefeed to complete (either success or failure)
- `tester.startGrafana`: sets up a Grafana instance on the last node of the cluster and prints out a link to a Grafana dashboard; this automatically runs unless `--skip-init` is provided. If `--debug` is not used, `StopGrafana` will be called on test teardown to publish Prometheus metrics to the artifacts directory.
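For illustration, a sketch of how these pieces might compose into a test; the names beyond the list above (the `feedArgs` fields, `kafkaSink`, the function signature) are assumptions rather than code from this PR:

```go
// Sketch of an initial-scan-only test built from the APIs above.
func runCDCInitialScanOnly(ctx context.Context, t test.Test, c cluster.Cluster) {
	ct := newCDCTester(ctx, t, c)
	defer ct.Close()

	// No duration: an initial-scan test only needs the initial fixture.
	ct.runTPCCWorkload(tpccArgs{warehouses: 100})

	feed := ct.newChangefeed(feedArgs{
		sinkType: kafkaSink, // assumed enum value
		opts:     map[string]string{"initial_scan": "'only'"},
	})
	feed.waitForCompletion() // succeeds once the initial scan finishes
}
```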
An API that is going to be more useful for experimentation:

- `changefeedJob.runFeedPoller(ctx, stopper, onInfo)`: runs a given callback every second with the changefeed info

Roachtests can be run locally with the `--local` flag, or on an existing cluster, without destroying it afterwards, with `--cluster="my-cluster" --debug`.

Ex: after adding a new test (let's say `"cdc/my-test"`) to the `registerCDC` function, you can keep running

```
./dev build cockroach --cross # if changes made to crdb
./dev build roachtest # if changes made to the test
./bin/roachtest run cdc/my-test --cluster="my-cluster" --debug
```

as you try out different changes or options. If you want to try a set of steps against different versions of the app, you can download those binaries and use the `--cockroach="path-to-binary"` flag to test against those instead.

If you want to set up a large TPCC database on a cluster and reuse it for tests, this can be done with roachtest's `--wipe` and `--skip-init` flags.

Release note: None