Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

workload: fix rand workload #88362

Merged
merged 2 commits into from
Sep 28, 2022
Merged

Conversation

renatolabs
Copy link
Contributor

The rand workload generates random table definitions when
intialized, and performs random writes to the first table created when
run. It relies on the randgen package to generate random data and
has internal logic to convert from Datum to the corresponding Go data
structure to be passed to a database call.

This means that whenever randgen is able to generate new types of
data, the rand workload needs to be changed accordingly. Since there
were no automated means to detect when the rand workload had
drifted (and this workload is probably not run very frequently),
running rand could fail because it wouldn't know how to generate a
Go data structure for a specific datum.

This commit updates the rand workload to add support for the missing
data types and also adds a unit test that verifies that it is able to
generate and insert data for all possible column types. If a new type
is added to randgen.RandomDatum, this test should fail.

This also changes the random seed used in rand in each run.

@renatolabs renatolabs requested review from a team and herkolategan and removed request for a team September 21, 2022 16:34
@cockroach-teamcity
Copy link
Member

This change is Reviewable

@renatolabs renatolabs requested a review from a team September 21, 2022 16:34
@renatolabs
Copy link
Contributor Author

Note that this does not guarantee that no errors are seen when running the rand workload. One issue I observed in my tests is that random data generation is not constrained by the schema definition. In other words, the following is possible:

  • table is created with column c1 as INT8, and column c2 is defined as c1 + 200
  • randgen generates MaxInt64-1 as a value for c1
  • error: c2 overflows

Fixing this type of "bug" is out of scope for this PR as it's not related to the workload itself.

@@ -67,10 +85,11 @@ var randMeta = workload.Meta{
g.flags.StringVar(&g.tableName, `table`, ``, `Table to write to`)
g.flags.IntVar(&g.batchSize, `batch`, 1, `Number of rows to insert in a single SQL statement`)
g.flags.StringVar(&g.method, `method`, `upsert`, `Choice of DML name: insert, upsert, ioc-update (insert on conflict update), ioc-nothing (insert on conflict no nothing)`)
g.flags.Int64Var(&g.seed, `seed`, 1, `Key hash seed.`)
g.flags.Int64Var(&g.seed, `seed`, defaultRandomSeed(), `Key hash seed.`)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice! I wasn't aware we were defaulting to 1, so I checked other workloads. They seem to follow the same pattern, e.g., https://github.com/cockroachdb/cockroach/blob/master/pkg/workload/tpcc/tpcc.go#L187

I wonder why?

go_test(
name = "rand_test",
srcs = ["rand_test.go"],
args = ["-test.timeout=295s"],
Copy link
Member

@srosenberg srosenberg Sep 21, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Bazel defaults to "medium" but it seems like this test doesn't need more than a few seconds, so maybe size = "small"?

Copy link
Contributor Author

@renatolabs renatolabs left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @herkolategan and @srosenberg)


pkg/workload/rand/BUILD.bazel line 29 at r2 (raw file):

Previously, srosenberg (Stan Rosenberg) wrote…

Bazel defaults to "medium" but it seems like this test doesn't need more than a few seconds, so maybe size = "small"?

Good point, fixed.


pkg/workload/rand/rand.go line 88 at r2 (raw file):

Previously, srosenberg (Stan Rosenberg) wrote…

Nice! I wasn't aware we were defaulting to 1, so I checked other workloads. They seem to follow the same pattern, e.g., https://github.com/cockroachdb/cockroach/blob/master/pkg/workload/tpcc/tpcc.go#L187

I wonder why?

Yes, after making this change I realized the other workloads also use 1 as the default seed. IMO, making the seed different on each run is a better default though. Especially in the context of workloads being used in tests.

Copy link
Contributor

@smg260 smg260 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed 2 of 4 files at r1, 2 of 2 files at r3.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @herkolategan, @smg260, and @srosenberg)

Copy link
Member

@srosenberg srosenberg left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Indeed, it doesn't attempt to satisfy types or constraints; adding a couple more examples from my local runs,

Error: pq: duplicate key value violates unique constraint "table6234468500453514806_col6234468500453514806_4_expr_key"
Error: pq: bit string length 1 does not match type BIT(36)

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @herkolategan, @renatolabs, @smg260, and @srosenberg)


pkg/workload/rand/rand.go line 88 at r2 (raw file):

Previously, renatolabs (Renato Costa) wrote…

Yes, after making this change I realized the other workloads also use 1 as the default seed. IMO, making the seed different on each run is a better default though. Especially in the context of workloads being used in tests.

I agree that it shouldn't default to a fixed value. I am just surprised it's been like this since this commit [1] (circa 2019). We could have been fuzzing more states :(
We should update all workloads to use a new seed by default; obviously, it's out of scope for this PR.

[1] 626b1a6

Copy link
Member

@srosenberg srosenberg left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed 2 of 4 files at r1, 2 of 2 files at r3.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @herkolategan, @renatolabs, @smg260, and @srosenberg)


pkg/workload/rand/rand.go line 67 at r4 (raw file):

func defaultRandomSeed() int64 {
	defaultSeedOnce.Do(func() {
		defaultSeed = timeutil.Now().UTC().UnixNano()

nit: UTC is redundant.

On second thought, a time-based seed is not as uniform as a crypto-based seed. cli.Main is already using a seed by reading it from /dev/urandom [1]. I think we should do the same in all the workloads.

[1] https://github.com/cockroachdb/cockroach/blob/master/pkg/cli/cli.go#L53


pkg/workload/rand/rand_test.go line 44 at r4 (raw file):

// insert it into a column
func TestRandRun(t *testing.T) {
	defer leaktest.AfterTest(t)()

Let's also add defer log.Scope(t).Close(t) since there are quite a few subtests; collectively, their output is verbose. In case of failure, it's nice to have a dedicated log with all the output.

Copy link
Member

@srosenberg srosenberg left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @herkolategan, @renatolabs, @smg260, and @srosenberg)


pkg/workload/rand/rand.go line 88 at r2 (raw file):

Previously, srosenberg (Stan Rosenberg) wrote…

I agree that it shouldn't default to a fixed value. I am just surprised it's been like this since this commit [1] (circa 2019). We could have been fuzzing more states :(
We should update all workloads to use a new seed by default; obviously, it's out of scope for this PR.

[1] 626b1a6

Note, another side-effect of this change is that the workload binary needs to be invoked with --init. Previously, it was possible to reuse the same tables between runs. I don't think we need to preserve that behavior, just noting the (obvious) difference :)

Copy link
Contributor Author

@renatolabs renatolabs left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @herkolategan, @smg260, and @srosenberg)


pkg/workload/rand/rand.go line 88 at r2 (raw file):

Previously, srosenberg (Stan Rosenberg) wrote…

Note, another side-effect of this change is that the workload binary needs to be invoked with --init. Previously, it was possible to reuse the same tables between runs. I don't think we need to preserve that behavior, just noting the (obvious) difference :)

It doesn't need to be run with --init; but the caller does need to make sure they pass the same seed that was used when initializing. If a user is running the commands interactively, they can copy and paste the seed that is logged. If calling workload programatically, the best way to go is to probably generate the seed out of band and pass it to every workload call.


pkg/workload/rand/rand.go line 67 at r4 (raw file):

Previously, srosenberg (Stan Rosenberg) wrote…

nit: UTC is redundant.

On second thought, a time-based seed is not as uniform as a crypto-based seed. cli.Main is already using a seed by reading it from /dev/urandom [1]. I think we should do the same in all the workloads.

[1] https://github.com/cockroachdb/cockroach/blob/master/pkg/cli/cli.go#L53

Good catch, fixed.


pkg/workload/rand/rand_test.go line 44 at r4 (raw file):

Previously, srosenberg (Stan Rosenberg) wrote…

Let's also add defer log.Scope(t).Close(t) since there are quite a few subtests; collectively, their output is verbose. In case of failure, it's nice to have a dedicated log with all the output.

Good point, added.

Copy link
Contributor Author

@renatolabs renatolabs left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed 1 of 3 files at r5, 2 of 2 files at r6.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @herkolategan and @srosenberg)

a discussion (no related file):
Created issue #88566 to deal with the default seeds in workloads.


Copy link
Contributor Author

@renatolabs renatolabs left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, the first one is a conflict on a non-primary-key index. It's technically possible to deal with them (the workload already deals with conflicts on the primary key), but it doesn't do that right now.

As for the second one, it's not that it doesn't try to satisfy the type, but more that the type OID read from pg_catalog.pg_attribute for the BIT type maps to a variable-length BIT type even though the schema was defined with as BIT of a specific length. I'm not entirely sure why that is the case.

@ajwerner would you know if there's a reliable way for this workload to get the width of the BIT column? I'm asking you directly because I copied a change you made in the schemachange workload [1] in this PR, and I don't understand why it's needed.

[1]

// Special case CHAR to have the right width.
if objectID == oid.T_bpchar {
t := *types.OidToType[objectID]
t.InternalType.Width = 1
return &t, nil
}

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @herkolategan and @srosenberg)

@ajwerner
Copy link
Contributor

The OIDs don't imply the widths. I guess you can make a char(2) type or whatever width you want but if you don't name the width I guess we do 1? I think I just wanted to make sure randgen always made a valid value and 1 will do that for all char widths.

I've learned now that you can access column widths via the information_schema.columns table's character_maximum_length column. That has to be done on a per column basis.

The `rand` workload generates random table definitions when
intialized, and performs random writes to the first table created when
run. It relies on the `randgen` package to generate random data and
has internal logic to convert from Datum to the corresponding Go data
structure to be passed to a database call.

This means that whenever `randgen` is able to generate new types of
data, the `rand` workload needs to be changed accordingly. Since there
were no automated means to detect when the `rand` workload had
drifted (and this workload is probably not run very frequently),
running `rand` could fail because it wouldn't know how to generate a
Go data structure for a specific datum.

This commit updates the `rand` workload to add support for the missing
data types and also adds a unit test that verifies that it is able to
generate and insert data for all possible column types. If a new type
is added to `randgen.RandomDatum`, this test should fail.

Release note: None
The `rand` workload would default to the using `1` as the random seed
when the corresponding command line flag was not passed. Since it's
easy to _not_ pass this flag, it's also easy to always run the same
workload over and over again.

This commit updates the workload to use a different random seed each
time. A consequence of this change is that the seed used to `init` the
workload will have to be passed when running it.

Release note: None
Copy link
Contributor Author

@renatolabs renatolabs left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @herkolategan, @renatolabs, and @srosenberg)

a discussion (no related file):
Pushed a couple of changes here:

  • Improved logging of seed, especially in the case where the caller passes --seed. Not great, but current solution works. We should come up with a better, reusable strategy in Use different workload seeds by default #88566.
  • Changed the test to actually test all possible column types that randgen can generate; specifically, arrays of different types of data are now part of the test as well.
  • A consequence of the above is that I had to change the logic for arrays; arrays of geography/geometry data types use : as separator.
  • I used the information on @ajwerner's comment about information_schema.columns to get the width of bit/char columns. We shouldn't see mismatched width errors when trying to insert anymore.

@srosenberg or @smg260 could you take another look, please?


a discussion (no related file):
@srosenberg taking a second look at the errors you posted:

Error: pq: duplicate key value violates unique constraint "table6234468500453514806_col6234468500453514806_4_expr_key"

I've seen this happen when I run the workload without first clearing the data from a previous run. As I said, it's technically possible to see this error in other circumstances, but the random generator would have to generate the same value twice, which is quite unlikely for most column types. Could you try running init with --drop and let me know if the issue persists (if you happen to have access to the seed that was used in this run)?

Error: pq: bit string length 1 does not match type BIT(36)

This has to do with the wrong width information being used in the rand workload, which should now be fixed. If you still have the seed for this one, I'd appreciate if you could double check that the fix works.


Copy link
Contributor

@smg260 smg260 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

// defaultRandomSeed sets the `defaultSeed` package variable if it's
// not yet set. This makes it possible for every run of the `rand`
// workload to use a different seed by default
func defaultRandomSeed() int64 {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Clarification - why is sync.Once required over simply binding randutil.NewPseudoSeed() to &g.seedon L92

Copy link
Contributor Author

@renatolabs renatolabs left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @herkolategan, @renatolabs, @smg260, and @srosenberg)


pkg/workload/rand/rand.go line 70 at r8 (raw file):

Previously, smg260 (Miral Gadani) wrote…

Clarification - why is sync.Once required over simply binding randutil.NewPseudoSeed() to &g.seedon L92

Because New is called multiple times, and I just want to set the default seed to a random value once.

@renatolabs
Copy link
Contributor Author

bors r=smg260

TFTR!

@craig
Copy link
Contributor

craig bot commented Sep 28, 2022

Build succeeded:

@craig craig bot merged commit b3c07be into cockroachdb:master Sep 28, 2022
srosenberg added a commit to srosenberg/cockroach that referenced this pull request Sep 29, 2022
Add test-eng as a code owner/watcher for pkg/workload.

In light of recent and future improvements [1], [2],
TestEng would prefer to be in sync with all changes
to the workload code. Over time, the team plans to build
expertise in this area.

[1] cockroachdb#88362
[2] cockroachdb#88566

Release note: None
Release justification: test only change
craig bot pushed a commit that referenced this pull request Sep 30, 2022
89014: jobs: Clear out claim info when pausing r=miretskiy a=miretskiy

Clear out job claim information when job is paused. Clearing out claim information is beneficial since it allows operator to pause/resume job if they want to try to move job coordinator to another node.

Addresses #82698

Release note: none

89026: kvserver: add `SmallEngineBlocks` testing knob and metamorphic params r=erikgrinaker a=erikgrinaker

`@cockroachdb/repl-prs` to do the main review, tagging other teams for visibility/review of metamorphic test params.

Resolves #86648.

---

**kvserver: add `SmallEngineBlocks` testing knob**

This patch adds a store testing knob `SmallEngineBlocks` which
configures Pebble with a block size of 1 byte. This will store every key
in a separate block, which can provoke bugs in time-bound iterators.

Release note: None
  
**sql/logictest: add metamorphic test param for small engine blocks**

Uses a Pebble block size of 1 byte, to provoke bugs in time-bound
iterators.

Release note: None
  
**kvserver/rangefeed: add metamorphic test param for small engine blocks**

Uses a Pebble block size of 1 byte, to provoke bugs in time-bound
iterators.

Release note: None
  
**kvserver/gc: add metamorphic test param for small engine blocks**

Uses a Pebble block size of 1 byte, to provoke bugs in time-bound
iterators.

Release note: None

  
**backupccl: add metamorphic test param for small engine blocks**

Uses a Pebble block size of 1 byte, to provoke bugs in time-bound
iterators.

Release note: None

89030: codeowners: add test-eng to owners of pkg/workload r=srosenberg a=srosenberg

Add test-eng as a code owner/watcher for pkg/workload.

In light of recent and future improvements [1], [2], TestEng would prefer to be in sync with all changes to the workload code. Over time, the team plans to build expertise in this area.

[1] #88362 [2] #88566

Release note: None
Release justification: test only change

Co-authored-by: Yevgeniy Miretskiy <[email protected]>
Co-authored-by: Erik Grinaker <[email protected]>
Co-authored-by: Stan Rosenberg <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants