
kvserver,rpc: set the stage for maintaining a local blocklist of permanently removed nodes #54936

Merged (8 commits) on Oct 14, 2020

Conversation

@tbg (Member) commented Sep 29, 2020

As part of long-running migrations (#54903) we need to ensure, to the best of our abilities, that decommissioned nodes cannot return to the cluster. We will achieve this by persisting this knowledge locally (in an eventually consistent way, via gossip) and adding checks in the RPC layer.

This PR stops short of adding the actual storage but sets the stage for doing so.

The first seven commits are from another PR and should be ignored here.
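To make the intended mechanism concrete, here is a minimal sketch of the kind of check the RPC layer could apply, assuming a locally persisted blocklist fed by gossip. All names here (Blocklist, CheckPing) are hypothetical, not the actual implementation; the real hooks land in later commits of this PR.

```go
// Sketch only: gate RPC heartbeats on a local blocklist of removed nodes.
// Because a successful Ping is a prerequisite for using a connection for
// inter-node traffic, failing it here cuts the node off from the cluster.
package rpcsketch

import (
	"fmt"
	"sync"
)

type NodeID int32

// Blocklist is an eventually consistent, locally persisted set of
// permanently removed nodes (fed by gossip in the real design).
type Blocklist struct {
	mu      sync.RWMutex
	removed map[NodeID]struct{}
}

// Add records that a node has been permanently decommissioned.
func (b *Blocklist) Add(id NodeID) {
	b.mu.Lock()
	defer b.mu.Unlock()
	if b.removed == nil {
		b.removed = map[NodeID]struct{}{}
	}
	b.removed[id] = struct{}{}
}

// CheckPing rejects heartbeats involving blocklisted nodes.
func (b *Blocklist) CheckPing(origin NodeID) error {
	b.mu.RLock()
	defer b.mu.RUnlock()
	if _, ok := b.removed[origin]; ok {
		return fmt.Errorf("n%d was permanently removed from the cluster", origin)
	}
	return nil
}
```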

@tbg tbg requested a review from irfansharif September 29, 2020 11:44
@tbg tbg requested a review from a team as a code owner September 29, 2020 11:44
@cockroach-teamcity (Member): This change is Reviewable

@knz (Contributor) left a comment

I just reviewed r1, r2, r3, r6, r12, r13, r14. I'll let irfan review the rest.

In r13-r14 I recommend extending the commit message (and/or code comments) to remind the reader that Ping is used prior to actually using the RPC connection for inter-node traffic, so that its logic effectively gates the use of a connection. This wasn't clear at first.



pkg/cli/node.go, line 341 at r6 (raw file):

	c := serverpb.NewAdminClient(conn)
	if err := runDecommissionNodeImpl(ctx, c, nodeCtx.nodeDecommissionWait, nodeIDs); err != nil {
		cause := errors.Cause(err)

UnwrapAll is clearer IMHO. Cause is only provided for compatibility with pkg/errors.


pkg/cli/node.go, line 583 at r6 (raw file):

		// ValidateLivenessTransition in kvserverpb/liveness.go for where this
		// error is generated.
		if s, ok := status.FromError(err); ok && s.Code() == codes.FailedPrecondition {

I'd recommend extracting cause := errors.UnwrapAll(err) for consistency with the other case.

Maybe even worth introducing a FromError wrapper in our own grpcutil package and have that unwrap in all cases.
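A minimal sketch of what that wrapper could look like (hypothetical; this helper is just the unwrap-then-convert composition being suggested, not an existing API):

```go
// Hypothetical grpcutil helper following the suggestion above: strip the
// whole cockroachdb/errors wrapping chain before asking gRPC for a status.
package grpcutil

import (
	"github.com/cockroachdb/errors"
	"google.golang.org/grpc/status"
)

// FromError is like status.FromError, but first unwraps the error so that
// a wrapped gRPC status error is still recognized.
func FromError(err error) (*status.Status, bool) {
	return status.FromError(errors.UnwrapAll(err))
}
```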


pkg/kv/kvserver/node_liveness.go, line 278 at r2 (raw file):

	// best effort attempt at marking the node as draining. It can return
	// without an error after trying unsuccessfully a few times. Is that what we
	// want?

As the last person interested in this function, I hadn't appreciated that DefaultRetryOptions has a finite timeout. With retry-forever semantics, this code should be fine?
If the timeout is indeed finite, then I agree there's a bug.
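For context, the distinction under discussion, as a generic sketch (not the actual retry package or the real function):

```go
package retrysketch

import (
	"context"
	"time"
)

// markDrainingBestEffort illustrates retry-forever semantics: it only gives
// up when the caller's context is canceled. A finite-retry variant would
// bail out of the loop after N attempts and could return without ever
// having succeeded, which is the suspected bug above.
func markDrainingBestEffort(ctx context.Context, attempt func() error) error {
	backoff := 50 * time.Millisecond
	for {
		if err := attempt(); err == nil {
			return nil
		}
		select {
		case <-ctx.Done():
			return ctx.Err()
		case <-time.After(backoff):
		}
		if backoff *= 2; backoff > time.Second {
			backoff = time.Second // cap the exponential backoff
		}
	}
}
```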

@irfansharif (Contributor) left a comment


pkg/cli/node.go, line 341 at r6 (raw file):

Previously, knz (kena) wrote…

UnwrapAll is clearer IMHO. Cause is only provided for compatibility with pkg/errors.

Done in #54544. I keep getting lost in the very broad API net we have in cockroachdb/errors because of these compatibility constraints; I need to sit down and go through all of it once and for all.


pkg/cli/node.go, line 583 at r6 (raw file):

Previously, knz (kena) wrote…

I'd recommend extracting cause := errors.UnwrapAll(err) for consistency with thge other case.

Maybe even worth introducing a FromError wrapper in our own grpcutil package and have that unwrap in all cases.

Done in #54544. Ditto the comment above; lately I've found it faster to just throw some random error handling variant from cockroachdb/errors at the problem and rely on the reviewers to teach me where I'm wrong (so I'm clearly pretty confused by it all).


pkg/kv/kvserver/node_liveness.go, line 278 at r2 (raw file):

Previously, knz (kena) wrote…

As the last person interested in this function, I didn't appreciate that DefaultRetryOptions had a finite timeout. With a retry-forever, this code should be fine?
If it does, then there's a bug I agree.

Patched in #54544 (6b43996).

@irfansharif (Contributor) left a comment

This all LGTM.

I did manage to confuse myself a bit though: contrasting this with what we were considering earlier, i.e. persisting a "prevent startup file" (with crdb node decommission --force) like we did in #54373, I realize the guarantees provided by that approach are a bit stronger (unless I'm missing something). Because there we persist the DECOMMISSIONED.txt file remotely first, before updating the liveness record, we could have the target node (the one being decommissioned) intentionally stop sending out PingRequests as part of that FinalizeDecommission API. I think that would have the same effect as this PR, in that all live connections are abandoned (except of course it would no longer be best effort). The target node would also be unable to restart because of that file, so that problem goes away too.

As for when the target node is permanently downed (and --force is specified), then I think we'd need the safeguards we're introducing in this PR (stopping outgoing pings to the decomm'ed nodes, which we might as well do all the time).

So I guess my question is, do we still need something like #54373? These failure modes do seem exceedingly rare, so I'm unsure.
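For reference, the prevent-startup-file approach boils down to a check like the following early in node startup (a sketch under assumptions: the file name follows the description above, the helper name is made up):

```go
package startupsketch

import (
	"fmt"
	"os"
	"path/filepath"
)

// checkPreventedStartup refuses to boot a store that carries the
// decommission marker file described above. Writing the marker *before*
// flipping the liveness record is what makes the guarantee stronger:
// the node can neither restart nor keep pinging.
func checkPreventedStartup(storeDir string) error {
	path := filepath.Join(storeDir, "DECOMMISSIONED.txt")
	if _, err := os.Stat(path); err == nil {
		return fmt.Errorf("node permanently decommissioned; refusing to start (%s exists)", path)
	} else if !os.IsNotExist(err) {
		return err
	}
	return nil
}
```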

// Note that there is yet another struct, NodeLivenessStartOptions, which
// is supplied when the instance is started. This is necessary as during
// server startup, some inputs can only be constructed at Start time. The
// separation has grown organically and various option could in principle
@irfansharif (Contributor) commented:

s/option/options

type NodeLivenessStartOptions struct {
Stopper *stop.Stopper
Engines []storage.Engine
// OnSelfLive is invoked after every successful heartbeat
@irfansharif (Contributor) commented:

Could move this comment to NodeLiveness.mu.onSelfLive, probably the first place to look.

@@ -1290,6 +1305,12 @@ func (nl *NodeLiveness) maybeUpdate(ctx context.Context, newLivenessRec Liveness
fn(newLivenessRec.Liveness)
}
}
if newLivenessRec.Membership == kvserverpb.MembershipStatus_DECOMMISSIONED {
@irfansharif (Contributor) commented:

We have a helper, newLivenessRec.Membership.Decommissioned().

// NB: We want the request to fail-fast (the default), otherwise we won't
// be notified of transport failures.
response, err = heartbeatClient.Ping(goCtx, request)
err := interceptor(request)
@irfansharif (Contributor) commented:

[nit]

if err := interceptor(req); err != nil {
  return err
}

// ...


// NB: OnSendPing and OnHandlePing default to noops.
// This is used both for testing and the cli.
_, _ = c.OnSendPing, c.OnHandlePing
@irfansharif (Contributor) commented:

Could we just set them to noops here, during validation/setting defaults, instead of down below?

@tbg tbg force-pushed the decom-bit-rpc-step1 branch 4 times, most recently from 782881c to 3ea3038 Compare October 1, 2020 13:50
@tbg tbg requested review from irfansharif and knz October 1, 2020 13:50
@tbg (Member, Author) left a comment

Dismissed @knz from 3 discussions.


pkg/kv/kvserver/node_liveness.go, line 243 at r10 (raw file):

Previously, irfansharif (irfan sharif) wrote…

s/option/options

Done.


pkg/kv/kvserver/node_liveness.go, line 633 at r10 (raw file):

Previously, irfansharif (irfan sharif) wrote…

Could move this comment to NodeLiveness.mu.onSelfLive, probably the first place to look.

I referenced this comment from there.


pkg/kv/kvserver/node_liveness.go, line 1308 at r11 (raw file):

Previously, irfansharif (irfan sharif) wrote…

We have a helper, newLivenessRec.Membership.Decommissioned().

Done.


pkg/rpc/context.go, line 411 at r14 (raw file):

Previously, irfansharif (irfan sharif) wrote…

Could we just set them to noops here, during validation/setting defaults, instead of down below?

We could, but I find it easier to discover the defaults when they're applied at the single site of use. It's always kind of tedious to have to chase down where a default is applied. It's mostly potato/potahto here, but I'll leave it as is.
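For illustration, the two placements being weighed (a fragment with made-up surrounding code, not the PR's exact shape):

```go
// Option A: default applied during option validation/construction.
if c.OnSendPing == nil {
	c.OnSendPing = func(*PingRequest) error { return nil }
}

// Option B (kept in this PR): the fallback is handled at the single call
// site, so a reader of the calling code sees the default where it matters.
if fn := c.OnSendPing; fn != nil {
	if err := fn(request); err != nil {
		return err
	}
}
```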


pkg/rpc/context.go, line 1150 at r14 (raw file):

Previously, irfansharif (irfan sharif) wrote…

[nit]

if err := interceptor(req); err != nil {
  return err
}

// ...

Done.

@tbg (Member, Author) left a comment

We can extend the approach taken here to also write the prevented startup file and terminate the process when the RPC layer gets shut off from the cluster. However, I think we need to talk about the UX before we decide whether that's useful.
Users are generally expected to run their own supervisor, so they should be in charge of terminating the node. If we "disconnect" the node but the operator doesn't shut it down, it will definitely languish (i.e. SQL connections will hang, etc.). I think that is "ok" for some value of "ok": the operator supposedly took the node out of SQL load balancing before they initiated the decommissioning process.
We may want to consider a flag to node decommission in which the replicas are moved off, but we don't set the DECOMMISSIONED status yet. The idea would be that some users will prefer to move data off first, then shut down the node, and only then set the DECOMMISSIONED status.


@tbg (Member, Author) commented Oct 1, 2020

Of course there would be another snag! Interestingly, with a connection to #39415!

In the "kvserver: add Nodeliveness.StartOptions.OnNodeDecommissionedCallback" commit, I am moving the registration of node liveness's gossip callback into NodeLiveness.Start, and as a result TestBootstrapNewStore times out.

It makes sense that this move would cause problems: before the gossip callback is registered, node liveness basically does not consume any updates, and consequently the KV client is severely restricted as none of the replicas it's reaching out to seem live. We were basically relying, silently, on the fact that gossip starts filling up NodeLiveness the moment the former (but not the latter) is started. I'll have to think about what we'll want to do here. The main reason I had to put things into Start in the first place was that the engines are opened late. Maybe it is time to rethink that, and to open them earlier in the boot process. But I also consider this silent dependency on Gossip worth taking a look at; at the very least it should be more obvious that it matters.
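Roughly, the ordering hazard looks like this (a pseudocode fragment with hypothetical names, just to pin down the window):

```go
// Previously: the gossip callback was registered inside the constructor,
// so liveness records flowed in as soon as gossip connected.
nl := NewNodeLiveness(opts) // callback registered here

// After the commit: registration moves to Start, opening a window during
// server boot in which gossip updates are silently dropped and every
// replica looks non-live to the KV client.
nl = NewNodeLiveness(opts)
// ... KV client traffic here sees an empty liveness map ...
nl.Start(startOpts) // callback registered only now
```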

tbg added a commit to tbg/cockroach that referenced this pull request Oct 7, 2020
We were jumping through a number of hoops to create the engines only in
`(*Server).Start` since that seems to be the "idiomatic" place to start
moving parts. However, it creates a lot of complexity since various
callbacks have to be registered with access to engines. Move engine
creation to `NewServer`.

This unblocks cockroachdb#54936.

Release note: None
tbg added 4 commits October 7, 2020 10:38
The signature of NewNodeLiveness is now a single options struct.

Release note: None
The bulk of the options for node liveness will shift to the Start
method, which was renamed to make this a more idiomatic pattern.

We want to pass the options at Start time because by that time
the caller is usually fully set up to provide its callbacks, etc.

At construction time, it is early in the server lifecycle and in
particular the `storage.Engine`s are not set up yet. But we want
to use them in an upcoming callback, which will be possible once
the callbacks are added at Start() time.

Release note: None
We'll want to pass more parameters to `Start()`, so pull them
out into an options struct similar to how we did it for the
constructor a few commits prior.

Release note: None
We were jumping through a number of hoops to create the engines only in
`(*Server).Start` since that seems to be the "idiomatic" place to start
moving parts. However, it creates a lot of complexity since various
callbacks have to be registered with access to engines. Move engine
creation to `NewServer`.

This unblocks cockroachdb#54936.

Release note: None
@tbg tbg force-pushed the decom-bit-rpc-step1 branch from 3ea3038 to 6565602 Compare October 7, 2020 08:58
tbg added 4 commits October 7, 2020 10:59
This callback is invoked whenever a node is permanently marked as
removed from the cluster.

We use it in `Server.Start`, though only to call the optionally provided
testing knob to establish a unit test.

The plan is to hook this up to a component that keeps track of the
decommissioned nodes across restarts, to provide best-effort (but
tight) guarantees that these nodes won't be able to reconnect to
the cluster. This is required to uphold the invariants for
long-running migrations.

Release note: None
Make it obvious which ones apply to the origin (the maker of the
PingRequest) and which ones to the destination (the receiver of the
PingRequest).

We are going to add `origin_node_id` here separately later.

Release note: None
Upcoming commits will augment Ping() to check the sender's node ID
against a blocklist of decommissioned nodes.

Release note: None
These two methods are invoked before sending and while handling
PingRequests. They'll be used to prevent decommissioned nodes
from communicating with the cluster.

Release note: None
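Taken together, the commits point at wiring along these lines. OnSendPing, OnHandlePing, ContextOptions, and PingRequest are names from this PR; the tombstones registry and the node-ID plumbing (origin_node_id is only planned at this point, per the commit above) are stand-ins for the follow-up work:

```go
// Sketch (fragment): how the new hooks could consult a future registry of
// decommissioned nodes. `tombstones`, `CheckDecommissioned`, and the
// node-ID arguments are assumptions, not the shipped API.
opts := ContextOptions{
	// Called before sending a PingRequest: don't reach out to nodes we
	// already know were permanently removed.
	OnSendPing: func(req *PingRequest) error {
		return tombstones.CheckDecommissioned(targetNodeID)
	},
	// Called while handling a PingRequest: reject pings originating from
	// removed nodes (this is what the planned origin_node_id enables).
	OnHandlePing: func(req *PingRequest) error {
		return tombstones.CheckDecommissioned(originNodeID(req))
	},
}
```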
@tbg tbg force-pushed the decom-bit-rpc-step1 branch from 6565602 to 8552175 Compare October 7, 2020 08:59
@tbg (Member, Author) commented Oct 7, 2020

I prepended #55277 here and reworked "kvserver: add Nodeliveness.StartOptions.OnNodeDecommissionedCallback". PTAL; this should turn green now and be ready to go.

craig bot pushed a commit that referenced this pull request Oct 7, 2020
55062: kvserver: remove unused parameter r=TheSamHuang a=tbg

Release note: None

55214: sql: fix performance regression due to increased hash allocations r=nvanbenschoten,knz a=arulajmani

We recently increased allocations in the `appStats` structure when
adding support for transaction level statistics in #52704. This was
because the interactions with the `fnv` library were expensive in terms
of allocations. This patch aims to claw back the regression by:

- Using our own implementation of the FNV algorithm instead of the
library, which is significantly lighter weight (microbenchmarks below).
- Reorganizing the code to only construct the statement IDs (deemed the
expensive operation) if required.

When comparing the difference between the commit that introduced the
regression and the changes proposed by this diff, I got the following
improvements on the KV workload:

```
name             old ops/s   new ops/s   delta
kv95-throughput  34.5k ± 6%  35.7k ± 4%  +3.42%  (p=0.023 n=10+10)

```

Microbenchmarks for the new hashing algorithm (written/run by @knz):
```
name                     old time/op    new time/op    delta
ConstructStatementID-32     405ns ±17%      39ns ±12%  -90.34%  (p=0.008 n=5+5)

name                     old alloc/op   new alloc/op   delta
ConstructStatementID-32      120B ± 0%       16B ± 0%  -86.67%  (p=0.008 n=5+5)

name                     old allocs/op  new allocs/op  delta
ConstructStatementID-32      6.00 ± 0%      1.00 ± 0%  -83.33%  (p=0.008 n=5+5)
```

Closes #54515

Release note: None

55277: server: create engines in NewServer r=irfansharif a=tbg

We were jumping through a number of hoops to create the engines only in
`(*Server).Start` since that seems to be the "idiomatic" place to start
moving parts. However, it creates a lot of complexity since various
callbacks have to be registered with access to engines. Move engine
creation to `NewServer`.

This unblocks #54936.

Release note: None


55280: roachtest: recognize GEOS dlls on more platforms r=otan a=knz

This makes roachtest work on the BSDs again.

Release note: None

Co-authored-by: Tobias Grieger <[email protected]>
Co-authored-by: arulajmani <[email protected]>
Co-authored-by: David Hartunian <[email protected]>
Co-authored-by: Raphael 'kena' Poss <[email protected]>
jayshrivastava pushed a commit that referenced this pull request Oct 8, 2020 (the same "server: create engines in NewServer" commit quoted above).
@tbg (Member, Author) commented Oct 13, 2020

Friendly ping @irfansharif

@irfansharif (Contributor) left a comment

:lgtm: I'll bookmark this PR for if/when we revisit decommissioning UX.

We may want to consider a flag to node decommission in which the replicas are moved off, but we don't set the DECOMMISSIONED status yet. The idea would be that some users will prefer to move data off first, then shut down the node, and then set the DECOMMISSIONED status

I also like this idea; it removes another conceptual hurdle when explaining the difference between a decommissioned node and a dead one (and all our questions around decommissioning nodes in absentia go away).



pkg/rpc/context.go, line 383 at r22 (raw file):

	//
	// It can inject an error.
	OnHandlePing func(*PingRequest) error

[nit] Minor preference to name these On{Outgoing,Incoming}Ping instead.


pkg/rpc/heartbeat.go, line 56 at r22 (raw file):

	disableClusterNameVerification bool

	onHandlePing func(*PingRequest) error // see ContextOptions.OnHandlePing

Ditto.

@tbg (Member, Author) commented Oct 14, 2020

TFTR! I'll make the rename separately.

bors r=irfansharif

@craig (bot) commented Oct 14, 2020

Build succeeded.

@craig craig bot merged commit cf805a5 into cockroachdb:master Oct 14, 2020
craig bot pushed a commit that referenced this pull request Oct 14, 2020
55286: server,rpc: block decommissioned node from interacting with the cluster r=irfansharif a=tbg

This PR completes the work started in #54936 by introducing a component that
keeps track of decommissioned nodes and hooks it up to the RPC layer, where it
is used to prevent RPC-layer heartbeats between the cluster and decommissioned
nodes. Since these heartbeats are a prerequisite for opening a connection in
either direction, by failing these heartbeats we are effectively preventing
decommissioned nodes from interacting with the cluster.

First commits from #54936. New here:

- keys: introduce NodeTombstoneKey
- server: introduce node tombstone storage

Release note: None


Co-authored-by: Tobias Grieger <[email protected]>
@tbg tbg deleted the decom-bit-rpc-step1 branch October 20, 2020 09:42