Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

roachtest: drain-and-decommission/nodes=9 failed #84128

Closed
cockroach-teamcity opened this issue Jul 9, 2022 · 4 comments · Fixed by #84392
Closed

roachtest: drain-and-decommission/nodes=9 failed #84128

cockroach-teamcity opened this issue Jul 9, 2022 · 4 comments · Fixed by #84392
Labels
branch-release-22.1 Used to mark GA and release blockers, technical advisories, and bugs for 22.1 C-test-failure Broken test (automatically or manually discovered). O-roachtest O-robot Originated from a bot. T-kv KV Team
Milestone

Comments

@cockroach-teamcity
Copy link
Member

cockroach-teamcity commented Jul 9, 2022

roachtest.drain-and-decommission/nodes=9 failed with artifacts on release-22.1 @ f9e7181a96fa72e48e3ac0df730843fed4a09ec4:

		  | node is draining... remaining: 12
		  | node is draining... remaining: 12
		  | node is draining... remaining: 12
		  | node is draining... remaining: 12
		  | node is draining... remaining: 12
		  | node is draining... remaining: 12
		  | node is draining... remaining: 12
		  | node is draining... remaining: 12
		  | node is draining... remaining: 12
		  | node is draining... remaining: 13
		  | node is draining... remaining: 12
		  | node is draining... remaining: 12
		  | node is draining... remaining: 12
		  | node is draining... remaining: 12
		  | node is draining... remaining: 12
		  | node is draining... remaining: 12
		  | node is draining... remaining: 12
		  | node is draining... remaining: 12
		  | node is draining... remaining: 12
		  | node is draining... remaining: 12
		  | node is draining... remaining: 12
		  | node is draining... remaining: 12
		  | node is draining... remaining: 12
		  | node is draining... remaining: 12
		  | node is draining... remaining: 12
		  | node is draining... remaining: 12
		  | node is draining... remaining: 12
		  | node is draining... remaining: 12
		  | node is draining... remaining: 12
		  | node is draining... remaining: 12
		  | node is draining... remaining: 12
		  | node is draining... remaining: 12
		  | node is draining... remaining: 12
		  | node is draining... remaining: 12
		  | node is draining... remaining: 12
		  | node is draining... remaining: 12
		  | node is draining... remaining: 12
		  | node is draining... remaining: 12
		  | node is draining... 
		  | ERROR: drain timeout, consider adjusting --drain-wait, especially under custom server.shutdown.{drain,query,connection,lease_transfer}_wait cluster settings
		  | Failed running "node drain"
		  |
		  | stdout:
		Wraps: (4) COMMAND_PROBLEM
		Wraps: (5) Node 8. Command with error:
		  | ``````
		  | ./cockroach node drain --insecure
		  | ``````
		Wraps: (6) exit status 1
		Error types: (1) *withstack.withStack (2) *errutil.withPrefix (3) *cluster.WithCommandDetails (4) errors.Cmd (5) *hintdetail.withDetail (6) *exec.ExitError
Help

See: roachtest README

See: How To Investigate (internal)

Same failure on other branches

/cc @cockroachdb/kv-triage

This test on roachdash | Improve this report!

Jira issue: CRDB-17477

@cockroach-teamcity cockroach-teamcity added branch-release-22.1 Used to mark GA and release blockers, technical advisories, and bugs for 22.1 C-test-failure Broken test (automatically or manually discovered). O-roachtest O-robot Originated from a bot. release-blocker Indicates a release-blocker. Use with branch-release-2x.x label to denote which branch is blocked. labels Jul 9, 2022
@cockroach-teamcity cockroach-teamcity added this to the 22.1 milestone Jul 9, 2022
@blathers-crl blathers-crl bot added the T-kv KV Team label Jul 9, 2022
@lidorcarmel
Copy link
Contributor

Some thoughts:

The test fails because node 8 is not draining - instead of a few seconds it takes more than 10 minutes where 12 ranges are stuck.

I think that this is because those 12 ranges have 3 replicas on the 3 nodes that we are draining, so the lease cannot be drained to another machine. I'm assuming that we don't have the code to initiate a replica rebalance in this case in order to move the lease to a node that is not being drained.

A few more details: the cluster has 9 nodes, where the test decommissions node 6 and drains nodes 7,8,9. The test finishes when all drains and the decommission are done. After writing data (using workload init) we change the replication factor from 3 to 5. I suspect we do this because we want to drain 3 nodes and we don't want to get to a state where all replicas of a range are on drained nodes (I might be wrong here). But, we only wait for a 3x and not 5x replication.

An interesting range to look at is r655 that has only 3 replicas, all on nodes 7,8,9.

One thing to clarify: why don't we upreplicate r655 (from 3 to 5 copies)? do we first try to move away from a drained node and then upreplicate?

Anyway, too many assumptions here, I'll get more info about this, for now cc @aayushshah15 and @nvanbenschoten who maybe had some related changes.

@aayushshah15
Copy link
Contributor

Thanks for that investigation and write-up, @lidorcarmel!

As you point out, we're setting the replication factor to 5 in this test (precisely to avoid the hazard we're running into here, with all the replicas of a given node being on draining nodes) but only waiting for 3x replication before proceeding.

only waiting for 3x replication before proceeding

It looks like this recently regressed with 90d5c80. Before this, we were indeed waiting for 5x replication.

@aayushshah15
Copy link
Contributor

I'll remove the release blocker label from this issue now. I'll also send a fix out for this test.

@aayushshah15 aayushshah15 removed the release-blocker Indicates a release-blocker. Use with branch-release-2x.x label to denote which branch is blocked. label Jul 13, 2022
@lidorcarmel
Copy link
Contributor

Thanks Aayush!!

This will be closed in a bit, mental note: it would be great to rebalance when all replicas are drained (filed #84395).

craig bot pushed a commit that referenced this issue Jul 13, 2022
84360: sql/sem/builtins: move definitions map to new package r=Xiang-Gu a=ajwerner

Previously, the definition of builtin functions live in the `builtins`
package. This was undesirable because various other packages need to
acceess builtins properties by name, but it has a been a headache to
achieve this without importing the `builtins` package, which stands
pretty high in the dependecy chain (e.g. `seqexpr`, `memo`).

This PR moves builtins definition into a new registry package that the
`builtins` package calls to register builtin functions, which happens
in the `init()` function. This way, other lower level packages, who
wish to access builtins properties, need only to import the newly
created `builtinsregistry` package.

Release note: None

84376: opt: add assertion that selectivity is never NaN r=rytaft a=rytaft

This commit addresses a leftover comment from #84366.

Release note: None

84392: roachtest: wait for a 5x replication instead of 3x r=lidorcarmel a=lidorcarmel

Flaky test: we wait for a 3x replication and then drain 3 nodes. Then we
sometimes have ranges with all 3 replicas on those 3 nodes, stuck forever.

Instead the test should wait for a 5x replication before starting the drain.

Fixes #84128.

Release note: None

Co-authored-by: Andrew Werner <[email protected]>
Co-authored-by: Rebecca Taft <[email protected]>
Co-authored-by: Lidor Carmel <[email protected]>
@craig craig bot closed this as completed in 2b62631 Jul 13, 2022
AlexTalks pushed a commit to AlexTalks/cockroach that referenced this issue Oct 14, 2022
Flaky test: we wait for a 3x replication and then drain 3 nodes. Then we
sometimes have ranges with all 3 replicas on those 3 nodes, stuck forever.

Instead the test should wait for a 5x replication before starting the drain.

Fixes cockroachdb#84128.

Release note: None
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
branch-release-22.1 Used to mark GA and release blockers, technical advisories, and bugs for 22.1 C-test-failure Broken test (automatically or manually discovered). O-roachtest O-robot Originated from a bot. T-kv KV Team
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants