roachtest: drain-and-decommission/nodes=9 failed #84128
Comments
Some thoughts: The test fails because node 8 is not draining. Instead of a few seconds, the drain takes more than 10 minutes, with 12 ranges stuck. I think this is because each of those 12 ranges has its 3 replicas on the 3 nodes that we are draining, so the lease cannot be transferred to a node that is not draining. I'm assuming that we don't have code to initiate a replica rebalance in this case in order to move the lease to a node that is not being drained.
A few more details: the cluster has 9 nodes, and the test decommissions node 6 and drains nodes 7, 8 and 9. The test finishes when all drains and the decommission are done. After writing data (using An interesting range to look at is One thing to clarify: why don't we upreplicate
Anyway, too many assumptions here. I'll get more info about this; for now cc @aayushshah15 and @nvanbenschoten, who may have had some related changes.
Thanks for that investigation and write-up, @lidorcarmel! As you point out, we're setting the replication factor to 5 in this test (precisely to avoid the hazard we're running into here, with all the replicas of a given range being on draining nodes) but only waiting for 3x replication before proceeding.
It looks like this recently regressed with 90d5c80. Before this, we were indeed waiting for 5x replication.
I'll remove the release blocker label from this issue now. I'll also send a fix out for this test.
Thanks Aayush!! This will be closed in a bit. Mental note: it would be great to rebalance when all replicas are on draining nodes (filed #84395).
84360: sql/sem/builtins: move definitions map to new package r=Xiang-Gu a=ajwerner
Previously, the definitions of builtin functions lived in the `builtins` package. This was undesirable because various other packages need to access builtin properties by name, and it has been a headache to achieve this without importing the `builtins` package, which stands pretty high in the dependency chain (e.g. `seqexpr`, `memo`). This PR moves the builtin definitions into a new registry package that the `builtins` package calls to register builtin functions, which happens in the `init()` function. This way, lower-level packages that wish to access builtin properties need only import the newly created `builtinsregistry` package.
Release note: None
84376: opt: add assertion that selectivity is never NaN r=rytaft a=rytaft
This commit addresses a leftover comment from #84366.
Release note: None
84392: roachtest: wait for a 5x replication instead of 3x r=lidorcarmel a=lidorcarmel
Flaky test: we wait for a 3x replication and then drain 3 nodes. Then we sometimes have ranges with all 3 replicas on those 3 nodes, stuck forever. Instead the test should wait for a 5x replication before starting the drain.
Fixes #84128.
Release note: None
Co-authored-by: Andrew Werner <[email protected]>
Co-authored-by: Rebecca Taft <[email protected]>
Co-authored-by: Lidor Carmel <[email protected]>
Flaky test: we wait for a 3x replication and then drain 3 nodes. Then we sometimes have ranges with all 3 replicas on those 3 nodes, stuck forever. Instead the test should wait for a 5x replication before starting the drain. Fixes cockroachdb#84128. Release note: None
roachtest.drain-and-decommission/nodes=9 failed with artifacts on release-22.1 @ f9e7181a96fa72e48e3ac0df730843fed4a09ec4:
Jira issue: CRDB-17477