kvserver,rpc: set the stage for maintaining a local blocklist of permanently removed nodes #54936
Conversation
I just reviewed r1, r2, r3, r6, r12, r13, r14. I'll let irfan review the rest.
In r13-r14 I recommend extending the commit message (and/or code comments) to remind the reader that Ping is used prior to actually using the RPC connection for inter-node traffic, so that its logic effectively gates the use of a connection. This wasn't clear at first.
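To illustrate why a failed Ping gates the connection, here is a minimal, self-contained sketch (all names here are invented for illustration and are not the actual pkg/rpc code): if the pre-send hook rejects the request, the heartbeat fails, the connection is never marked healthy, and no inter-node traffic flows over it.

```go
package main

import (
	"context"
	"errors"
	"fmt"
)

// Hypothetical stand-ins for the rpc package's types; the names are
// assumptions made for this illustration only.
type PingRequest struct{ OriginNodeID int }

type heartbeatClient interface {
	Ping(ctx context.Context, req *PingRequest) error
}

// heartbeatOnce shows how an OnSendPing-style interceptor gates the
// connection: if the interceptor rejects the request, the heartbeat fails
// and the connection is never handed out for inter-node traffic.
func heartbeatOnce(
	ctx context.Context, c heartbeatClient, req *PingRequest, onSendPing func(*PingRequest) error,
) error {
	if err := onSendPing(req); err != nil {
		return err
	}
	return c.Ping(ctx, req)
}

type okClient struct{}

func (okClient) Ping(context.Context, *PingRequest) error { return nil }

func main() {
	// Pretend n7 has been decommissioned; refuse to even attempt the ping.
	blockDecommissioned := func(req *PingRequest) error {
		if req.OriginNodeID == 7 {
			return errors.New("n7 is decommissioned; refusing to heartbeat")
		}
		return nil
	}
	err := heartbeatOnce(context.Background(), okClient{}, &PingRequest{OriginNodeID: 7}, blockDecommissioned)
	fmt.Println(err) // n7 is decommissioned; refusing to heartbeat
}
```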
Reviewed 1 of 1 files at r1, 3 of 3 files at r2, 2 of 2 files at r3, 2 of 4 files at r4, 2 of 2 files at r6, 6 of 11 files at r7, 5 of 5 files at r12, 3 of 3 files at r13, 4 of 4 files at r14.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @irfansharif, @knz, and @tbg)
pkg/cli/node.go, line 341 at r6 (raw file):
    c := serverpb.NewAdminClient(conn)
    if err := runDecommissionNodeImpl(ctx, c, nodeCtx.nodeDecommissionWait, nodeIDs); err != nil {
        cause := errors.Cause(err)

`UnwrapAll` is clearer IMHO. `Cause` is only provided for compatibility with pkg/errors.
pkg/cli/node.go, line 583 at r6 (raw file):
    // ValidateLivenessTransition in kvserverpb/liveness.go for where this
    // error is generated.
    if s, ok := status.FromError(err); ok && s.Code() == codes.FailedPrecondition {

I'd recommend extracting `cause := errors.UnwrapAll(err)` for consistency with the other case. Maybe even worth introducing a `FromError` wrapper in our own `grpcutil` package and having that unwrap in all cases.
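A minimal sketch of the wrapper being suggested, assuming it would live in a `grpcutil` package as proposed (this is not the API that was actually added):

```go
package grpcutil

import (
	"github.com/cockroachdb/errors"
	"google.golang.org/grpc/status"
)

// FromError behaves like status.FromError, but first strips every wrapping
// layer from the error chain, so a gRPC status buried under
// cockroachdb/errors wrappers is still recovered.
func FromError(err error) (*status.Status, bool) {
	return status.FromError(errors.UnwrapAll(err))
}
```

Callers would then write `if s, ok := grpcutil.FromError(err); ok && s.Code() == codes.FailedPrecondition { ... }` without having to remember to unwrap first.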
pkg/kv/kvserver/node_liveness.go, line 278 at r2 (raw file):
    // best effort attempt at marking the node as draining. It can return
    // without an error after trying unsuccessfully a few times. Is that what we
    // want?

As the last person interested in this function, I didn't appreciate that DefaultRetryOptions had a finite timeout. With a retry-forever, this code should be fine?
If it does, then I agree there's a bug.
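For reference, a retry-forever loop in plain Go looks roughly like the following. This is a generic sketch, not the crdb retry package, and it assumes the desired behavior is to never give up until the context is canceled.

```go
package livenesssketch

import (
	"context"
	"time"
)

// retryForever keeps attempting fn with capped exponential backoff until it
// succeeds or the context is canceled, instead of giving up after a finite
// number of attempts.
func retryForever(ctx context.Context, fn func(context.Context) error) error {
	backoff := 100 * time.Millisecond
	const maxBackoff = 2 * time.Second
	for {
		if err := fn(ctx); err == nil {
			return nil
		}
		select {
		case <-ctx.Done():
			return ctx.Err()
		case <-time.After(backoff):
		}
		backoff *= 2
		if backoff > maxBackoff {
			backoff = maxBackoff
		}
	}
}
```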
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @irfansharif and @knz)
pkg/cli/node.go, line 341 at r6 (raw file):
Previously, knz (kena) wrote…
`UnwrapAll` is clearer IMHO. `Cause` is only provided for compatibility with pkg/errors.

Done in #54544. I keep getting lost in the very broad API net we have in cockroachdb/errors because of these compatibility constraints; I need to sit down and go through all of it once and for all.
pkg/cli/node.go, line 583 at r6 (raw file):
Previously, knz (kena) wrote…
I'd recommend extracting `cause := errors.UnwrapAll(err)` for consistency with the other case. Maybe even worth introducing a `FromError` wrapper in our own `grpcutil` package and having that unwrap in all cases.

Done in #54544. Ditto the comment above: lately I've found it faster to just throw some random error-handling variant from cockroachdb/errors at a problem and rely on the reviewers to teach me where I'm wrong (so I'm clearly pretty confused by it all).
pkg/kv/kvserver/node_liveness.go, line 278 at r2 (raw file):
Previously, knz (kena) wrote…
As the last person interested in this function, I didn't appreciate that DefaultRetryOptions had a finite timeout. With a retry-forever, this code should be fine?
If it does, then I agree there's a bug.
This all LGTM.
I did manage to confuse myself a bit though: contrasting this with what we were thinking earlier, i.e. persisting a "prevent startup file" (and with `crdb node decommission --force`) like we did in #54373, I realize the guarantees provided by that approach are a bit stronger (unless I'm missing something). Because there we persist the DECOMMISSIONED.txt file remotely first, before updating the liveness record, we could have the target node (the one being decommissioned) intentionally stop sending out PingRequests as part of that FinalizeDecommission API. I think it would have the same effect as this one in that all live connections are abandoned (except of course it would no longer be best effort). The target node would also be unable to restart because of that file, so that problem too goes away.
As for when the target node is permanently downed (and `--force` is specified), I think we'd need the safeguards we're introducing in this PR (stopping outgoing pings to the decommissioned nodes, which we might as well do all the time).
So I guess my question is, do we still need something like #54373? These failure modes do seem exceedingly rare, so I'm unsure.
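To make the ordering being compared concrete, here is a sketch under invented interfaces (nothing below is the actual API; it only spells out the sequence described above): persist the marker on the target first, and only then flip the liveness record.

```go
package decomsketch

import "context"

// These interfaces are invented for this sketch; they exist only to make the
// ordering discussed above concrete.
type nodeHandle interface {
	NodeID() int32
	// PersistDecommissionMarker asks the target node to write its local
	// DECOMMISSIONED marker and stop sending PingRequests.
	PersistDecommissionMarker(ctx context.Context) error
}

type livenessWriter interface {
	SetDecommissioned(ctx context.Context, nodeID int32) error
}

// finalizeDecommission sketches the stronger ordering: the remote marker is
// persisted first, and only then is the liveness record updated. If the
// target is permanently down (the --force case), step 1 fails and the
// cluster-side ping checks introduced in this PR are the remaining safeguard.
func finalizeDecommission(ctx context.Context, target nodeHandle, liveness livenessWriter) error {
	if err := target.PersistDecommissionMarker(ctx); err != nil {
		// Best effort: the target may be gone for good; the RPC-layer
		// blocklist still prevents it from rejoining later.
	}
	return liveness.SetDecommissioned(ctx, target.NodeID())
}
```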
pkg/kv/kvserver/node_liveness.go
Outdated
// Note that there is yet another struct, NodeLivenessStartOptions, which
// is supplied when the instance is started. This is necessary as during
// server startup, some inputs can only be constructed at Start time. The
// separation has grown organically and various option could in principle
s/option/options
type NodeLivenessStartOptions struct {
	Stopper *stop.Stopper
	Engines []storage.Engine
	// OnSelfLive is invoked after every successful heartbeat
Could move this comment to NodeLiveness.mu.onSelfLive, probably the first place to look.
pkg/kv/kvserver/node_liveness.go
Outdated
@@ -1290,6 +1305,12 @@ func (nl *NodeLiveness) maybeUpdate(ctx context.Context, newLivenessRec Liveness
			fn(newLivenessRec.Liveness)
		}
	}
	if newLivenessRec.Membership == kvserverpb.MembershipStatus_DECOMMISSIONED {
We have a helper, `newLivenessRec.Membership.Decommissioned()`.
pkg/rpc/context.go
Outdated
// NB: We want the request to fail-fast (the default), otherwise we won't
// be notified of transport failures.
response, err = heartbeatClient.Ping(goCtx, request)
err := interceptor(request)
[nit]
if err := interceptor(req); err != nil {
	return err
}
// ...
// NB: OnSendPing and OnHandlePing default to noops.
// This is used both for testing and the cli.
_, _ = c.OnSendPing, c.OnHandlePing
Could we just set them to noops here, during validation/setting defaults, instead of down below?
(force-pushed from 782881c to 3ea3038)
Dismissed @knz from 3 discussions.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @irfansharif and @knz)
pkg/kv/kvserver/node_liveness.go, line 243 at r10 (raw file):
Previously, irfansharif (irfan sharif) wrote…
s/option/options
Done.
pkg/kv/kvserver/node_liveness.go, line 633 at r10 (raw file):
Previously, irfansharif (irfan sharif) wrote…
Could move this comment to NodeLiveness.mu.onSelfLive, probably the first place to look.
I referenced this comment from there.
pkg/kv/kvserver/node_liveness.go, line 1308 at r11 (raw file):
Previously, irfansharif (irfan sharif) wrote…
We have a helper, `newLivenessRec.Membership.Decommissioned()`.
Done.
pkg/rpc/context.go, line 411 at r14 (raw file):
Previously, irfansharif (irfan sharif) wrote…
Could we just set them to noops here, during validation/setting defaults, instead of down below?
We could, but I find it easier to discover the defaults when they're applied at the single site of use. It's always kind of tedious to have to go chase down where the default is applied. It's mostly potato/potahto here, but I'll leave as is.
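As a tiny illustration of the pattern being described here (names invented for the sketch; this is not the actual rpc.Context code), the default is applied right where the callback is invoked:

```go
package rpcdefaults

// PingRequest and options are stand-ins invented for this sketch.
type PingRequest struct{}

type options struct {
	OnSendPing func(*PingRequest) error // may be left nil by the caller
}

// sendPing applies the noop default at the single site of use, so a reader
// never has to chase down where defaults were filled in.
func sendPing(opts options, req *PingRequest, send func(*PingRequest) error) error {
	onSend := opts.OnSendPing
	if onSend == nil {
		onSend = func(*PingRequest) error { return nil } // noop default
	}
	if err := onSend(req); err != nil {
		return err
	}
	return send(req)
}
```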
pkg/rpc/context.go, line 1150 at r14 (raw file):
Previously, irfansharif (irfan sharif) wrote…
[nit]
if err := interceptor(req); err != nil {
	return err
}
// ...
Done.
We can extend the approach taken here to also write the prevented startup file and terminate the process when the RPC layer gets shut off from the cluster. However, I think we need to talk about the UX before we decide whether that's useful.
Users are generally expected to run their own supervisor, so they should be in charge of terminating the node. If we "disconnect" the node but the operator doesn't shut it down, it will definitely languish (i.e. SQL connections will hang, etc.). I think that is "ok" for some value of "ok" - the operator supposedly took the node out of SQL load balancing before they initiated the decommissioning process.
We may want to consider a flag to `node decommission` in which the replicas are moved off, but we don't set the DECOMMISSIONED status yet. The idea would be that some users will prefer to move data off first, then shut down the node, and then set the DECOMMISSIONED status.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @irfansharif and @knz)
Of course there would be another snag! Interestingly, with a connection to #39415! In the "kvserver: add Nodeliveness.StartOptions.OnNodeDecommissionedCallback" commit, I am moving where node liveness' gossip callback gets registered. It makes sense that this move would cause problems: before the gossip callback is registered, node liveness basically does not consume any updates, and consequently the KV client is severely restricted, as none of the replicas it's reaching out to seem live. We were basically relying, silently, on the fact that gossip starts filling up NodeLiveness the moment the former (but not the latter) is started. I'll have to think about what we'll want to do here. The main reason I had to put things into …
We were jumping through a number of hoops to create the engines only in `(*Server).Start` since that seems to be the "idiomatic" place to start moving parts. However, it creates a lot of complexity since various callbacks have to be registered with access to engines. Move engine creation to `NewServer`. This unblocks cockroachdb#54936. Release note: None
The signature of NewNodeLiveness is now a single options struct. Release note: None
The bulk of the options for node liveness will shift to the Start method which was renamed to make this a more idiomatic pattern. We want to pass the options at Start time because by that time the caller is usually fully set up to provide its callbacks, etc. At construction time, it is early in the server lifecycle and in particular the `storage.Engine`s are not set up yet. But we want to use them in an upcoming callback, which will be possible once the callbacks are added at Start() time. Release note: None
We'll want to pass more parameters to `Start()`, so pull them out into an options struct similar to how we did it for the constructor a few commits prior. Release note: None
(force-pushed from 3ea3038 to 6565602)
This callback is invoked whenever a node is permanently marked as removed from the cluster. We use it in `Server.Start`, though only to call the optionally provided testing knob to establish a unit test. The plan is to hook this up to a component that keeps track of the decommissioned nodes across restarts, to provide best-effort (but tight) guarantees that these nodes won't be able to reconnect to the cluster. This is required to uphold the invariants for long-running migrations. Release note: None
Make it obvious which ones apply to the origin (the maker of the PingRequest) and which ones to the destination (the receiver of PingRequest). We are going to add `origin_node_id` here separately later. Release note: None
Upcoming commits will augment Ping() to check the sender's node ID against a blocklist of decommissioned nodes. Release note: None
These two methods are invoked before sending and while handling PingRequests. They'll be used to prevent decommissioned nodes from communicating with the cluster. Release note: None
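To make the mechanism concrete, here is a small, self-contained sketch of what such a handler-side hook could look like. All types and names below are simplified assumptions for illustration; the real hooks live in pkg/rpc and differ in detail.

```go
package pingcheck

import (
	"fmt"
	"sync"
)

// PingRequest is an assumed, simplified stand-in for the rpc.PingRequest.
type PingRequest struct{ OriginNodeID int32 }

// decommissionedList is a simple thread-safe set of node IDs known to be
// permanently removed.
type decommissionedList struct {
	mu  sync.Mutex
	ids map[int32]struct{}
}

func (l *decommissionedList) Add(id int32) {
	l.mu.Lock()
	defer l.mu.Unlock()
	if l.ids == nil {
		l.ids = map[int32]struct{}{}
	}
	l.ids[id] = struct{}{}
}

func (l *decommissionedList) Contains(id int32) bool {
	l.mu.Lock()
	defer l.mu.Unlock()
	_, ok := l.ids[id]
	return ok
}

// onHandlePing is the kind of hook the commit describes: it runs while a
// PingRequest is being handled and can reject the heartbeat outright, which
// keeps a decommissioned node from ever establishing a usable connection.
func (l *decommissionedList) onHandlePing(req *PingRequest) error {
	if l.Contains(req.OriginNodeID) {
		return fmt.Errorf("n%d is decommissioned; rejecting heartbeat", req.OriginNodeID)
	}
	return nil
}
```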
(force-pushed from 6565602 to 8552175)
I prepended #55277 here and reworked …
55062: kvserver: remove unused parameter r=TheSamHuang a=tbg

Release note: None

55214: sql: fix performance regression due to increased hash allocations r=nvanbenschoten,knz a=arulajmani

We recently increased allocations in the `appStats` structure when adding support for transaction level statistics in #52704. This was because the interactions with the `fnv` library were expensive in terms of allocations. This patch aims to claw back the regression by:
- Using our own implementation of the FNV algorithm instead of the library, which is significantly lighter weight (microbenchmarks below).
- Re-organizing the code to only construct the statement IDs (deemed the expensive operation) if required.

When comparing the difference between the commit that introduced the regression and the changes proposed by this diff, I got the following improvements on the KV workload:

```
name             old ops/s   new ops/s   delta
kv95-throughput  34.5k ± 6%  35.7k ± 4%  +3.42% (p=0.023 n=10+10)
```

Microbenchmarks for the new hashing algorithm (written/run by @knz):

```
name                     old time/op    new time/op    delta
ConstructStatementID-32    405ns ±17%     39ns ±12%   -90.34% (p=0.008 n=5+5)

name                     old alloc/op   new alloc/op   delta
ConstructStatementID-32     120B ± 0%      16B ± 0%   -86.67% (p=0.008 n=5+5)

name                     old allocs/op  new allocs/op  delta
ConstructStatementID-32     6.00 ± 0%      1.00 ± 0%   -83.33% (p=0.008 n=5+5)
```

Closes #54515

Release note: None

55277: server: create engines in NewServer r=irfansharif a=tbg

We were jumping through a number of hoops to create the engines only in `(*Server).Start` since that seems to be the "idiomatic" place to start moving parts. However, it creates a lot of complexity since various callbacks have to be registered with access to engines. Move engine creation to `NewServer`. This unblocks #54936.

Release note: None

55280: roachtest: recognize GEOS dlls on more platforms r=otan a=knz

This makes roachtest work on the BSDs again.

Release note: None

Co-authored-by: Tobias Grieger <[email protected]>
Co-authored-by: arulajmani <[email protected]>
Co-authored-by: David Hartunian <[email protected]>
Co-authored-by: Raphael 'kena' Poss <[email protected]>
We were jumping through a number of hoops to create the engines only in `(*Server).Start` since that seems to be the "idiomatic" place to start moving parts. However, it creates a lot of complexity since various callbacks have to be registered with access to engines. Move engine creation to `NewServer`. This unblocks #54936. Release note: None
Friendly ping @irfansharif
I'll bookmark this PR for if/when we revisit decommissioning UX.
We may want to consider a flag to `node decommission` in which the replicas are moved off, but we don't set the DECOMMISSIONED status yet. The idea would be that some users will prefer to move data off first, then shut down the node, and then set the DECOMMISSIONED status.
I also like this idea; it removes another conceptual hurdle when having to explain the difference between a decommissioned node and a dead one (and all our questions around decommissioning nodes in absentia go away).
Reviewed 15 of 15 files at r15, 5 of 5 files at r16, 4 of 4 files at r17, 1 of 1 files at r18, 4 of 4 files at r19, 5 of 5 files at r20, 3 of 3 files at r21, 4 of 4 files at r22.
Reviewable status: complete! 1 of 0 LGTMs obtained (waiting on @tbg)
pkg/rpc/context.go, line 383 at r22 (raw file):
//
// It can inject an error.
OnHandlePing func(*PingRequest) error
[nit] Minor preference to name these On{Outgoing,Incoming}Ping instead.
pkg/rpc/heartbeat.go, line 56 at r22 (raw file):
disableClusterNameVerification bool
onHandlePing func(*PingRequest) error // see ContextOptions.OnHandlePing
Ditto.
TFTR! I'll make the rename separately.

bors r=irfansharif
Build succeeded: |
55286: server,rpc: block decommissioned node from interacting with the cluster r=irfansharif a=tbg

This PR completes the work started in #54936 by introducing a component that keeps track of decommissioned nodes and hooks it up to the RPC layer, where it is used to prevent RPC-layer heartbeats between the cluster and decommissioned nodes. Since these heartbeats are a prerequisite for opening a connection in either direction, by failing these heartbeats we are effectively preventing decommissioned nodes from interacting with the cluster.

First commits from #54936. New here:
- keys: introduce NodeTombstoneKey
- server: introduce node tombstone storage

Release note: None

Co-authored-by: Tobias Grieger <[email protected]>
As part of long-running migrations (#54903) we need to make sure, to the best of our abilities, that decommissioned nodes cannot return to the cluster. We will achieve this by persisting this knowledge locally (in an eventually consistent way, via gossip) and adding checks in the RPC layer.
This PR stops short of adding the actual storage but sets the stage for doing so.
First seven commits are from another PR and should be ignored here.
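A rough, self-contained sketch of the kind of component the description alludes to (all names below are assumed; the actual storage lands in a follow-up PR): a locally persisted set of removed node IDs, fed by the decommission callback and consulted by the RPC layer.

```go
package tombstonesketch

import "sync"

// nodeTombstones is an assumed, simplified stand-in for the component the
// description talks about: it remembers decommissioned node IDs locally so
// that, even across restarts, the RPC layer can keep refusing their
// heartbeats.
type nodeTombstones struct {
	mu      sync.Mutex
	removed map[int32]struct{}
	persist func(id int32) error // e.g. write a local key; best effort
}

// onNodeDecommissioned is the kind of callback the liveness layer would
// invoke (driven by gossip updates) when a node is permanently removed.
func (t *nodeTombstones) onNodeDecommissioned(id int32) {
	t.mu.Lock()
	if t.removed == nil {
		t.removed = map[int32]struct{}{}
	}
	t.removed[id] = struct{}{}
	t.mu.Unlock()
	if t.persist != nil {
		// Eventually consistent: losing this write only weakens the
		// check, it does not break correctness elsewhere.
		_ = t.persist(id)
	}
}

// isDecommissioned is what an RPC-layer ping handler would consult before
// accepting a heartbeat.
func (t *nodeTombstones) isDecommissioned(id int32) bool {
	t.mu.Lock()
	defer t.mu.Unlock()
	_, ok := t.removed[id]
	return ok
}
```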