Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

changefeedccl: roachtest refactor and initial-scan-only #91394

Merged

Conversation

samiskin
Copy link
Contributor

@samiskin samiskin commented Nov 7, 2022

Epic: https://cockroachlabs.atlassian.net/browse/CRDB-19057

Changefeed roachtests were setup focused on running a workload for a specific duration and then quitting, making it difficult to run an initial_scan_only test that terminated upon Job success.

We as a team have also noticed a greater need to test and observe changefeeds running in production against real sinks to catch issues we are unable to mock or observe from simple unit tests. This is currently a notable hassle as one has to set up each individual sink and run them, ensure the changefeed is pointing to the right URI, and then be able to monitor the metrics of this long running process.

This change refactors the cdcBasicTest into distinct pieces that are then put together in a test. This allows for easier experimentation with live tests, allowing us to spin up a cluster and a workload, run one or more changefeeds on it, set up a poller to print out job details,have an accessible grafana URL to view metrics, and wait for some completion condition.

Changing the specialized runCDCKafkaAuth, runCDCBank, and runCDCSchemaRegistry functions were left out of scope for this first big change.

The main APIs involved in basic roachtests are now:

  • newCDCTester: This creates a tester struct to run the rest of the APIs and initializes the database
  • tester.runTPCCWorkload(tpccArgs): Starts a TPCC workload from the last node in the cluster
  • tester.runLedgerWorkload(ledgerArgs): Starts a Ledger workload from the last node in the cluster
  • tester.runFeedLatencyVerifier(changefeedJob, latencyTargets): starts a routine that monitors the changefeed latency until the tester is Close'd
  • tester.waitForWorkload: waits for a workload started by setupAndRunWorkload to complete its duration
  • tester.startCRDBChaos: This starts a Chaos routine that periodically shuts nodes down and brings them back up
  • tester.newChangefeed(feedArgs): starts a new changefeed on the cluster and returns changefeedJob object
  • changefeedJob.waitForCompletion: waits for a changefeed to complete (either success or failure)
  • tester.startGrafana: Sets up a grafana instance on the last node of the cluster and prints out a link to a grafana, this automatically runs unless --skip-init is provided. If --debug is not used, StopGrafana will be called on test teardown to publish prometheus metrics to the artifacts directory.

An API that is going to be more useful for experimentation are:

  • changefeedJob.runFeedPoller(ctx, stopper, onInfo): runs a given callback every second with the changefeed info

Roachtests can be ran locally with the --local flag or on an existing cluster without destroying it afterwards with --cluster="my-cluster" --debug

Ex: After adding a new test (lets say "cdc/my-test") to the registerCDC function you can keep running

./dev build cockroach --cross # if changes made to crdb
./dev build roachtest         # if changes made to the test

./bin/roachtest run cdc/my-test --cluster="my-cluster" --debug

as you try out different changes or options. If you want to try a set of steps against different versions of the app you could download those binaries and use the --cockroach="path-to-binary" flag to test against those instead.

If you want to set up a large TPCC database on a cluster and reuse it for tests this can be done with roachtests's --wipe and --skip-init flags.

Release note: None

@cockroach-teamcity
Copy link
Member

This change is Reviewable

WithPrometheusNode(t.workloadNode.InstallNodes()[0]).
WithCluster(t.crdbNodes.InstallNodes()).
WithNodeExporter(t.crdbNodes.InstallNodes()).
WithGrafanaDashboard("https://gist.githubusercontent.com/samiskin/75b10ccd4a08280cf1f5d20e916188ae/raw/a5556a5c0c3d82b3449f3140107366995fa16de3/dashboard.json")
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is just a simple one I threw together right now, will be tweaking it prior to merge and then I'll put it in as a proper file in this repository and link to that, just wanted to get this up for review.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ack.

@samiskin samiskin force-pushed the roachtest-initial-scan-only-refactor branch 2 times, most recently from b7d169b to 310f441 Compare November 7, 2022 13:12
@samiskin samiskin marked this pull request as ready for review November 7, 2022 13:13
Copy link
Contributor

@miretskiy miretskiy left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

all nits

WithPrometheusNode(t.workloadNode.InstallNodes()[0]).
WithCluster(t.crdbNodes.InstallNodes()).
WithNodeExporter(t.crdbNodes.InstallNodes()).
WithGrafanaDashboard("https://gist.githubusercontent.com/samiskin/75b10ccd4a08280cf1f5d20e916188ae/raw/a5556a5c0c3d82b3449f3140107366995fa16de3/dashboard.json")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ack.

@samiskin samiskin force-pushed the roachtest-initial-scan-only-refactor branch from 310f441 to 9ca758b Compare November 7, 2022 15:56
Copy link
Contributor

@renatolabs renatolabs left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🎉

Thanks for refactoring, this looks great! Minor comments and questions.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @HonoreDB, @jayshrivastava, and @samiskin)


pkg/cmd/roachtest/tests/cdc.go line 114 at r3 (raw file):

		Timer:   Periodic{Period: 2 * time.Minute, DownTime: 20 * time.Second},
		Target:  t.crdbNodes.RandNode,
		Stopper: chaosStopper,

Idea: how about randomly setting the DrainAndQuit option here? Right now the chaos loop will always force-kill nodes; randomly letting them drain could expose bugs (as has been the case in the past in the cdc/mixed-versions test).


pkg/cmd/roachtest/tests/cdc.go line 190 at r3 (raw file):

		// TolerateErrors if crdbChaos is true; otherwise, the workload will fail
		// if it attempts to use the node which was brought down by chaos.
		tolerateErrors: t.tolerateNodeErrors,

Comment on API: the relationship between startCRDBChaos and this function is subtle. It's not clear (especially to the caller) that you are supposed to call the chaos function first so that this tpcc function will know that it should tolerate errors. IMO, a better API would be to pass, in some way, tolerateErrors as a parameter to here. This makes things more explicit and makes it possible to run the workload before starting the chaos loop.


pkg/cmd/roachtest/tests/cdc.go line 200 at r3 (raw file):

	}

	if args.duration != "" {

Is this what you intended? Only run the workload if a duration is passed and print a skip message if not passed? Looks confusing.


pkg/cmd/roachtest/tests/cdc.go line 242 at r3 (raw file):

func (t *cdcTester) DB() *gosql.DB {
	return t.cluster.Conn(t.ctx, t.rt.L(), 1)

Since, IMO, one of the goals here is to provide a struct that automatically gives you good defaults, we could take the chance to select a random node when a DB connection is needed.


pkg/cmd/roachtest/tests/cdc.go line 304 at r3 (raw file):

	feedOptions := make(map[string]string)
	feedOptions["min_checkpoint_frequency"] = "'10s'"
	if args.sinkType == cloudStorageSink || args.sinkType == webhookSink {

I'm curious if we still think these defaults make sense. Should we really be setting these options based on the sink? I'm not from the CDC team, but I wouldn't know why 10s is a good setting for resolved for cloud/webhook sinks, but an empty resolved option is a good default otherwise. Should we be explicit? If this is indeed better, maybe we could at least have a comment explaining why the distinction exists and why the defaults are sensible.


pkg/cmd/roachtest/tests/cdc.go line 751 at r3 (raw file):

}

func runCDCKafkaAuth(ctx context.Context, rt test.Test, c cluster.Cluster) {

Nit: why did we change the parameter name here and in other tests? I believe pretty much every other roachtest uses t for the test.Test struct. Keeping it the same, in my opinion, reduces the cognitive burden. I'm also not sure what the r stands for in rt

@samiskin samiskin force-pushed the roachtest-initial-scan-only-refactor branch from 9ca758b to c9ad50e Compare November 10, 2022 15:43
@samiskin samiskin requested a review from a team as a code owner November 10, 2022 15:43
Copy link
Contributor Author

@samiskin samiskin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @HonoreDB, @jayshrivastava, @miretskiy, and @renatolabs)


pkg/cmd/roachtest/tests/cdc.go line 438 at r1 (raw file):

Previously, miretskiy (Yevgeniy Miretskiy) wrote…

ack.

Updated this with a go.crdb.dev url that we can freely change.


pkg/cmd/roachtest/tests/cdc.go line 121 at r2 (raw file):

Previously, miretskiy (Yevgeniy Miretskiy) wrote…

can we switch to switch instead?

Done.


pkg/cmd/roachtest/tests/cdc.go line 204 at r2 (raw file):

Previously, miretskiy (Yevgeniy Miretskiy) wrote…

wdyt about moving these if/else branches to their own helpers (e.g. runLedgerWorkload, runTPCCWorkload)?

Done.


pkg/cmd/roachtest/tests/cdc.go line 114 at r3 (raw file):

Previously, renatolabs (Renato Costa) wrote…

Idea: how about randomly setting the DrainAndQuit option here? Right now the chaos loop will always force-kill nodes; randomly letting them drain could expose bugs (as has been the case in the past in the cdc/mixed-versions test).

I thought about it but I wanted to leave this PR as refactor-only-no-functionality-change as possible at first so that if we do start seeing failures I know its a roachtest problem rather than a changefeed problem.


pkg/cmd/roachtest/tests/cdc.go line 190 at r3 (raw file):

Previously, renatolabs (Renato Costa) wrote…

Comment on API: the relationship between startCRDBChaos and this function is subtle. It's not clear (especially to the caller) that you are supposed to call the chaos function first so that this tpcc function will know that it should tolerate errors. IMO, a better API would be to pass, in some way, tolerateErrors as a parameter to here. This makes things more explicit and makes it possible to run the workload before starting the chaos loop.

Good point 👍, I made tolerateErrors just be an argument for the feed job / workload


pkg/cmd/roachtest/tests/cdc.go line 200 at r3 (raw file):

Previously, renatolabs (Renato Costa) wrote…

Is this what you intended? Only run the workload if a duration is passed and print a skip message if not passed? Looks confusing.

Yep, since for initial scan tests we don't necessarily care about a workload running, only the initial fixture.


pkg/cmd/roachtest/tests/cdc.go line 242 at r3 (raw file):

Previously, renatolabs (Renato Costa) wrote…

Since, IMO, one of the goals here is to provide a struct that automatically gives you good defaults, we could take the chance to select a random node when a DB connection is needed.

Same as above I want to leave functionality unchanged for this pr to make it easier to isolate the reason for a test failure.


pkg/cmd/roachtest/tests/cdc.go line 304 at r3 (raw file):

Previously, renatolabs (Renato Costa) wrote…

I'm curious if we still think these defaults make sense. Should we really be setting these options based on the sink? I'm not from the CDC team, but I wouldn't know why 10s is a good setting for resolved for cloud/webhook sinks, but an empty resolved option is a good default otherwise. Should we be explicit? If this is indeed better, maybe we could at least have a comment explaining why the distinction exists and why the defaults are sensible.

Not exactly sure why the 10s was picked, but I added a comment about envelope=wrapped. We could rethink these but for now leaving all functionality unchanged.


pkg/cmd/roachtest/tests/cdc.go line 751 at r3 (raw file):

Previously, renatolabs (Renato Costa) wrote…

Nit: why did we change the parameter name here and in other tests? I believe pretty much every other roachtest uses t for the test.Test struct. Keeping it the same, in my opinion, reduces the cognitive burden. I'm also not sure what the r stands for in rt

I had rt for roachtest since its from the roachtest package. I changed it to ct for cdcTester and t for roachtest 👍, forgot to consider folks like test eng that would be dealing with multiple team's roachtests.

@samiskin
Copy link
Contributor Author

samiskin commented Nov 13, 2022

there ended up being a few failures after running all the roachtests manually that I couldn't fully resolve in time so this'll have to wait until I return from PTO next tuesday

Copy link
Contributor

@jayshrivastava jayshrivastava left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think the debug flag works as intended. I tried roachtest run cdc/initial-scan-only --debug --cockroach=artifacts/cockroach and the grafana dashboard dies after the test. I think this is because the nodes shut down.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @HonoreDB, @miretskiy, @renatolabs, and @samiskin)

@samiskin
Copy link
Contributor Author

@jayshrivastava try naming the cluster

@samiskin samiskin force-pushed the roachtest-initial-scan-only-refactor branch 2 times, most recently from 4b883c6 to 79c9ab7 Compare November 23, 2022 02:43
@samiskin
Copy link
Contributor Author

bors r+

@craig
Copy link
Contributor

craig bot commented Nov 23, 2022

🕐 Waiting for PR status (Github check) to be set, probably by CI. Bors will automatically try to run when all required PR statuses are set.

@samiskin samiskin force-pushed the roachtest-initial-scan-only-refactor branch 2 times, most recently from 376f780 to 9d84e25 Compare November 23, 2022 14:26
Changefeed roachtests were setup focused on running a workload for a
specific duration and then quitting, making it difficult to run an
`initial_scan_only` test that terminated upon Job success.

We as a team have also noticed a greater need to test and observe
changefeeds running in production against real sinks to catch issues we
are unable to mock or observe from simple unit tests. This is currently
a notable hassle as one has to set up each individual sink and run them,
ensure the changefeed is pointing to the right URI, and then be able to
monitor the metrics of this long running process.

This change refactors the cdcBasicTest into distinct pieces that are
then put together in a test.  This allows for easier experimentation
with live tests, allowing us to spin up a cluster and a workload, run
one or more changefeeds on it, set up a poller to print out job
details,have an accessible grafana URL to view metrics, and wait for
some completion condition.

Changing the specialized `runCDCKafkaAuth`, `runCDCBank`, and
`runCDCSchemaRegistry` functions were left out of scope for this first
big change.

The main APIs involved in basic roachtests are now:
- `newCDCTester`: This creates a tester struct to run the rest of the
  APIs and initializes the database
- `tester.runTPCCWorkload(tpccArgs)`: Starts a TPCC workload from the
  last node in the cluster
- `tester.runLedgerWorkload(ledgerArgs)`: Starts a Ledger workload from
  the last node in the cluster
- `tester.newChangefeed(feedArgs)`: starts a new changefeed on the
  cluster and returns `changefeedJob` object
- `tester.runFeedLatencyVerifier(changefeedJob, latencyTargets)`: starts
  a routine that monitors the changefeed latency until the tester is
  `Close`'d
- `tester.waitForWorkload`: waits for a workload started by
  `setupAndRunWorkload` to complete its duration
- `changefeedJob.waitForCompletion`: waits for a changefeed to complete
  (either success or failure)
- `tester.startCRDBChaos`: This starts a Chaos routine that periodically
  shuts nodes down and brings them back up

APIs that are going to be more useful for experimentation are:
- `tester.startGrafana`: Sets up a grafana instance on the last node of
  the cluster and prints out a link to a grafana dashboard with some
  basic changefeed metrics
- `changefeedJob.runFeedPoller(ctx, stopper, onInfo)`: runs a given
  callback every second with the changefeed info

Roachtests can be ran locally with the `--local` flag or on an existing
cluster without destroying it afterwards with `--cluster="my-cluter"
--debug`

Ex: After adding a new test (lets say "cdc/my-test") to the registerCDC
function you can keep running

```
./dev build cockroach --cross # if changes made to crdb
./dev build roachtest         # if changes made to the test

./bin/roachtest run cdc/my-test --cluster="my-cluster" --debug
```

as you try out different changes or options. If you want to try a set of
steps against different versions of the app you could download those
binaries and use the --cockroach="path-to-binary" flag to test against
those instead.

If you want to set up a large TPCC database on a cluster and reuse it
for tests this can be done with roachtests's --wipe and --skip-init
flags.

Release note: None
@samiskin samiskin force-pushed the roachtest-initial-scan-only-refactor branch from 9d84e25 to 46c75a5 Compare November 23, 2022 14:28
@samiskin
Copy link
Contributor Author

bors r+

@craig
Copy link
Contributor

craig bot commented Nov 23, 2022

Build succeeded:

@craig craig bot merged commit 1d04cec into cockroachdb:master Nov 23, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants