rpc: re-work connection maintenance #88625
Conversation
The implementation of `./cockroach init` first dials up the node (in a retry loop) and issues a `Health` RPC, to reduce the likelihood of accidental double-init of the cluster in case of network issues. However, this was trying to be too clever: if it managed to dial the node and check its health (again, using a short-lived context), it would then return that same connection to the caller for use in the Bootstrap RPC.

Unfortunately, that connection would sit on top of an `rpc.Context` whose life was tied to a context[^1] that descended from one wrapped by `RunWithTimeout`. In other words, the context would be cancelled by the time the health check method returned. This seems not to cause issues today, since the `rpc.Context` seems to ignore this context cancellation. But in cockroachdb#88625, this suddenly became a problem, and now that I've looked at this code, I might as well fix it regardless of whether cockroachdb#88625 is ever going to land.

No release note since today's code happens to work.

[^1]: https://github.com/cockroachdb/cockroach/blob/aecc58f125ac611f5793101cbd0323df569369db/pkg/cli/rpc_client.go#L46

Release note: None
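To illustrate the shape of the bug, here is a minimal Go sketch; `runWithTimeout`, `dialAndCheckHealth`, and the `conn` type are hypothetical stand-ins, not the actual `cockroach` CLI code:

```go
// Minimal sketch of the anti-pattern described above: a "connection" created
// under a context that RunWithTimeout cancels on return outlives that context.
package main

import (
	"context"
	"fmt"
	"time"
)

// runWithTimeout mimics RunWithTimeout: the derived context is cancelled as
// soon as fn returns.
func runWithTimeout(ctx context.Context, d time.Duration, fn func(context.Context) error) error {
	ctx, cancel := context.WithTimeout(ctx, d)
	defer cancel()
	return fn(ctx)
}

// conn deliberately stores its dial context, to stand in for a connection
// whose lifetime is tied to that context.
type conn struct{ ctx context.Context }

func dialAndCheckHealth(ctx context.Context) (*conn, error) {
	// Running the health check with the short-lived ctx is fine...
	return &conn{ctx: ctx}, nil // ...but returning a conn tied to ctx is not.
}

func main() {
	var c *conn
	_ = runWithTimeout(context.Background(), time.Second, func(ctx context.Context) error {
		var err error
		c, err = dialAndCheckHealth(ctx) // BUG: c outlives ctx
		return err
	})
	// By now the context backing c has been cancelled.
	fmt.Println(c.ctx.Err()) // context canceled
}
```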
Force-pushed from 6aa5ebb to 0612ea9.
88414: util/tracing: trim trace recordings in a smarter way r=andreimatei a=andreimatei

Before this patch, when the recording of a child span was being added to the parent, if the number of spans in the child recording plus the number of spans in the parent's recording was greater than the span limit (1000), then the child's recording was completely dropped (apart from the structured events, which were still retained). So, for example, if the parent had a recording of 1 span and the child had a recording of 1000 spans, the whole 1000 spans were dropped. This patch improves things by always combining the parent trace and the child trace, and then trimming the result according to the following arbitrary algorithm (a sketch of this trimming step appears after this comment):
- start at the root of the trace and sort its children by size, descending
- drop the fattest children (including their descendants) until the remaining number of spans to drop becomes smaller than the size of the fattest non-dropped child
- recurse into that child, with an adjusted number of spans to drop

So the idea is that, recursively, we drop parts of the largest child - including dropping the whole child if needed. Fixes #87536. Release note: None

88707: ui: reduce frequent Metrics page re-rendering r=koorosh a=koorosh

Before, many components on the Metrics page relied on the `nodesSummarySelector` selector, which in turn relies on `NodeStatus`, which changes constantly with every request (and we request it periodically, every 10 seconds). `NodeStatus` includes lots of data that is unnecessary for the Metrics page (i.e. the node's and stores' last metrics, which aren't used on charts), but these changes force React components to re-render. This patch refactors the selectors used by the Metrics page to provide only the relevant subset of `NodeStatus` info to Graph components and to reduce the propagation of `props` from parent to child components. Instead, child components use selectors where necessary. Release note: None. Resolves #65030.

Video that shows how often components were re-rendered **before** this fix - regardless of timewindow, it always updates every 10 seconds: https://user-images.githubusercontent.com/3106437/192295619-9b2da8bd-2fdb-4f5e-96db-688048fc54cf.mov

And here's after the fix. Components re-render every 10 seconds for a 10-minute interval and do not re-render for timewindows like 2 weeks or 1 month: https://user-images.githubusercontent.com/3106437/192296089-13103781-6632-46a5-85aa-80ad8b20dd02.mov

88973: sql: pool some of the processor allocations r=yuzefovich a=yuzefovich

**sql: clarify closing contract around plan node and row source adapters**

This commit cleans up the `rowSourceToPlanNode` adapter (from the DistSQL processor to the planNode object) in the following manner:
- it removes the incorrect call to `ConsumerClosed` of the wrapped input. This call was redundant (because the other side of the adapter, `planNodeToRowSource`, does the closure too) but was also incorrect since it could access a row source that was put back into the sync.Pool (and maybe even picked up by another query). See issue 88964 for more details.
- it removes the checks around a non-nil "metadata forwarder". These were needed when the local planNode and DistSQL processor engines were merged into one about four years ago, but nowadays the adapter always gets a non-nil forwarder.

Fixes: #88964. Release note: None

**sql: pool some of the processor allocations**

This commit makes it so that we now pool allocations of noop, planNodeToRowSource, and columnarizer processors. Additionally, the trailing meta callbacks for these three, as well as for materializers, are changed to not be anonymous functions, to further reduce the allocations. Fixes: #88525. Release note: None

89059: cli: fix use of canceled connection in `init` r=erikgrinaker a=tbg

The implementation of `./cockroach init` first dials up the node (in a retry loop) and issues a `Health` RPC, to reduce the likelihood of accidental double-init of the cluster in case of network issues. However, this was trying to be too clever: if it managed to dial the node and check its health (using a short-lived context), it would then return that same connection to the caller for use in the Bootstrap RPC. Unfortunately, that connection would sit on top of an `rpc.Context` whose life was tied to a context[^1] that descended from one wrapped by `RunWithTimeout`. In other words, the context would be cancelled by the time the health check method returned. This seems not to cause issues today, since the `rpc.Context` seems to ignore this context cancellation. But in #88625, this suddenly became a problem, and now that I've looked at this code, I might as well fix it regardless of whether #88625 is ever going to land. No release note since today's code happens to work.

[^1]: https://github.com/cockroachdb/cockroach/blob/aecc58f125ac611f5793101cbd0323df569369db/pkg/cli/rpc_client.go#L46

Release note: None

89064: jobs: make the registry logging less chatty r=dt,stevendanna a=knz

The "toggling idleness" log message was the 4th most voluminous log event source in CC, logged 4x more frequently than the next event source by volume. This commit makes it less verbose. Release note: None

Co-authored-by: Andrei Matei <[email protected]>
Co-authored-by: Andrii Vorobiov <[email protected]>
Co-authored-by: Yahor Yuzefovich <[email protected]>
Co-authored-by: Tobias Grieger <[email protected]>
Co-authored-by: Raphael 'kena' Poss <[email protected]>
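For the trimming algorithm in 88414 above, here is a hedged sketch of the recursive step; `span`, `size`, and `trim` are simplified stand-ins, not the actual `util/tracing` types:

```go
// Sketch of the trim-the-fattest-children-first idea from #88414.
package main

import (
	"fmt"
	"sort"
)

type span struct {
	children []*span
}

// size returns the number of spans in the subtree rooted at s.
func (s *span) size() int {
	n := 1
	for _, c := range s.children {
		n += c.size()
	}
	return n
}

// trim drops spans from the trace rooted at root until at most maxSpans
// remain, preferring to drop the largest child subtrees first.
func trim(root *span, maxSpans int) {
	toDrop := root.size() - maxSpans
	if toDrop <= 0 {
		return
	}
	// Sort children by subtree size, descending.
	sort.Slice(root.children, func(i, j int) bool {
		return root.children[i].size() > root.children[j].size()
	})
	// Drop the fattest children (with all their descendants) until the number
	// of spans left to drop is smaller than the fattest remaining child.
	i := 0
	for i < len(root.children) && toDrop >= root.children[i].size() {
		toDrop -= root.children[i].size()
		i++
	}
	root.children = root.children[i:]
	// Recurse into the (new) fattest child with the remaining budget.
	if toDrop > 0 && len(root.children) > 0 {
		c := root.children[0]
		trim(c, c.size()-toDrop)
	}
}

func main() {
	// Build a root with three child subtrees of sizes 5, 3, and 2.
	big := &span{children: []*span{{}, {}, {}, {}}}   // size 5
	mid := &span{children: []*span{{}, {}}}           // size 3
	small := &span{children: []*span{{}}}             // size 2
	root := &span{children: []*span{big, mid, small}} // size 11
	trim(root, 4)
	fmt.Println(root.size()) // 4
}
```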
First commit LGTM.
2nd commit - I reviewed all I could but not in the right frame of mind to ascertain correctness of the "main" thing. I'd need to look at it further next week. Also will welcome Erik's opinion of course.
Reviewed 1 of 1 files at r1, 24 of 24 files at r2, all commit messages.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @erikgrinaker and @tbg)
pkg/rpc/context.go
line 1804 at r2 (raw file):
	target string, remoteNodeID roachpb.NodeID, class ConnectionClass,
) *Connection {
	ctx := rpcCtx.makeDialCtx(target, remoteNodeID, class)
Why did you move this? I put it there for the log.Fatal's benefit.
pkg/util/grpcutil/grpc_err_redaction.go
line 28 at r2 (raw file):
func grpcSpecialCasePrintFn(err error, p errbase.Printer, isLeaf bool) (handled bool, next error) {
	if errors.GetTypeKey(err) != grpcErrTypeKey {
nit: I think the following is just as correct and will get you less overhead:
var refErr = status.Error(codes.Unavailable, "")
...
if errors.HasType(err, refErr)
Alternatively, you could just drop this check entirely and trust status.FromError - it only returns true in the 2nd return value if the status object could be recovered with the right type.
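For illustration, a sketch of the two alternatives discussed above, assuming the `github.com/cockroachdb/errors` and gRPC `status` packages; `isGRPCStatusErr1/2` are made-up helper names:

```go
// Sketch of the two ways to check whether err carries a gRPC status.
package main

import (
	"fmt"

	"github.com/cockroachdb/errors"
	"google.golang.org/grpc/codes"
	"google.golang.org/grpc/status"
)

// refErr is only used as a type reference for errors.HasType; its code and
// message are irrelevant.
var refErr = status.Error(codes.Unavailable, "")

// Option 1: a type check against a reference error.
func isGRPCStatusErr1(err error) bool {
	return errors.HasType(err, refErr)
}

// Option 2: trust status.FromError - it only reports ok=true when a status
// object could actually be recovered from err with the right type.
// (Note: FromError also reports ok for a nil error.)
func isGRPCStatusErr2(err error) bool {
	_, ok := status.FromError(err)
	return ok
}

func main() {
	err := status.Error(codes.Unavailable, "node unreachable")
	fmt.Println(isGRPCStatusErr1(err), isGRPCStatusErr2(err))                                   // true true
	fmt.Println(isGRPCStatusErr1(errors.New("not gRPC")), isGRPCStatusErr2(errors.New("not gRPC"))) // false false
}
```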
pkg/util/netutil/net.go
line 212 at r2 (raw file):
// SafeFormatError implements errors.SafeFormatter.
func (e *InitialHeartbeatFailedError) SafeFormatError(p errors.Printer) (next error) {
	p.Printf("initial connection heartbeat failed")
nit: p.SafeString, less overhead.
pkg/rpc/context_test.go
line 821 at r2 (raw file):
func TestHeartbeatHealth(t *testing.T) {
	defer leaktest.AfterTest(t)()
	// TODO why does this hide log.HEALTH even when -show-logs is active?
It was Andrei's request: #69765 / #70052.
You can define a custom log config for use with log.Scope if you want.
Reviewed 5 of 25 files at r4, 8 of 9 files at r6, 7 of 7 files at r7, all commit messages.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @tbg)
pkg/rpc/context.go
line 1804 at r2 (raw file):
Previously, tbg (Tobias Grieger) wrote…
Done.
I'm still confused - what's the obstacle to having `ctx := rpcCtx.makeDialCtx(target, remoteNodeID, class)` here and passing it as argument to `grpcDialNodeInternal`? Is it solely because you want to avoid creating a context.Context if `rpcCtx.m.Get` returns a valid connection?
pkg/rpc/context.go
line 1871 at r7 (raw file):
// callers that may pass a zero NodeID could coalesce onto the connection
// with the "real" NodeID. This caused issues over the years[^1][^2] and
// was never necessary anyway, so we don't do it anymore.
Does this mean we get 1 extra connection for those callers that use node ID 0?
If so, perhaps worth mentioning in the release note. This will double the number of node-node connections when the cluster goes from non-initialized to initialized at the point `cockroach init` is run.
pkg/rpc/context.go
line 1943 at r7 (raw file):
conn.err.Store(errDialRejected)
close(conn.initialHeartbeatDone)
_ = rpcCtx.m.Remove(k, conn) // ignoring error since it's just on shutdown anyway
nit: period at end of comment.
Also I'm not sure I agree - we're moving to a world where these things get instantiated and torn down dynamically (#84700), so we want clean shutdowns in every case.
pkg/util/netutil/net.go
line 212 at r2 (raw file):
Previously, tbg (Tobias Grieger) wrote…
Does not exist on errors.Printer.
oh sorry I misread. that's cool.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @tbg)
pkg/rpc/context.go
line 1871 at r7 (raw file):
Previously, knz (Raphael 'kena' Poss) wrote…
Does this mean we get 1 extra connection for those callers that use node ID 0? If so, perhaps worth mentioning in the release note. This will double the number of node-node connections when the cluster goes from non-initialized to initialized at the point `cockroach init` is run.
Also, gossip uses UnvalidatedDial (so with Node ID = 0). So it's not just in the non-init case we're doubling the connections. It's all the time.
(Technically, we'd be going to 2 conns per node-node pair, from system/normal classes, to 3: system/normal/gossip-other.)
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @tbg)
pkg/rpc/context.go
line 1871 at r7 (raw file):
Previously, knz (Raphael 'kena' Poss) wrote…
Also, gossip uses UnvalidatedDial (so with Node ID = 0). So it's not just in the non-init case we're doubling the connections. It's all the time.
(Technically, we'd be going to 2 conns per node-node pair, from system/normal classes, to 3: system/normal/gossip-other.)
That incidentally brings me to a related question: does it matter whether the gossip connection uses the same TCP connection as the other RPCs? Is there something that gossip decides wrt connectivity that could be wrong / disruptive if the gossip connection succeeds, but the other RPC connection(s) fail?
Force-pushed from a609889 to 62d76f8.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @erikgrinaker, @knz, and @yuzefovich)
pkg/rpc/context.go
line 1804 at r2 (raw file):
Previously, knz (Raphael 'kena' Poss) wrote…
I'm still confused - what's the obstacle to having `ctx := rpcCtx.makeDialCtx(target, remoteNodeID, class)` here and passing it as argument to `grpcDialNodeInternal`? Is it solely because you want to avoid creating a context.Context if `rpcCtx.m.Get` returns a valid connection?
The dialing API is currently such that you have methods like `GRPCDialNode` that give you a `*Connection` and are non-blocking, so they can never use the cancellation property of the Context (that happens when you call `.Connect`). I think it's a little clearer to pass the context only once, or you end up with (essentially) `rpcCtx.GRPCDialNode(ctx).Connect(ctx)`.
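For reference, a rough sketch of how that call-site shape reads, assuming the `rpc` package API as described in this thread (`GRPCDialNode` returning a `*Connection`, with `Connect` as the only blocking step); `dialNode` is a hypothetical helper:

```go
// Sketch of the two-step dial call site being discussed.
package rpcsketch

import (
	"context"

	"github.com/cockroachdb/cockroach/pkg/roachpb"
	"github.com/cockroachdb/cockroach/pkg/rpc"
	"google.golang.org/grpc"
)

// dialNode illustrates why the context is passed only once: GRPCDialNode is
// non-blocking and merely hands back the *Connection for this (target,
// nodeID, class); only Connect blocks, so that is where the caller's ctx
// (and its cancellation) actually matters.
func dialNode(
	ctx context.Context, rpcCtx *rpc.Context, target string, nodeID roachpb.NodeID,
) (*grpc.ClientConn, error) {
	conn := rpcCtx.GRPCDialNode(target, nodeID, rpc.DefaultClass) // no ctx here
	return conn.Connect(ctx)                                      // ctx passed once, at the blocking step
}
```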
pkg/rpc/context.go
line 1871 at r7 (raw file):
Previously, knz (Raphael 'kena' Poss) wrote…
That incidentally brings me to a related question: does it matter whether the gossip connection uses the same TCP connection as the other RPCs? Is there something that gossip decides wrt connectivity that could be wrong / disruptive if the gossip connection succeeds, but the other RPC connection(s) fail?
Good point, will mention in the release note. It shouldn't matter to Gossip who it's sharing the connection with. If anything, having its own connection might be helpful sometimes. (And of course in async weird network partitions as always all bets are off).
pkg/rpc/context.go
line 1943 at r7 (raw file):
Previously, knz (Raphael 'kena' Poss) wrote…
nit: period at end of comment.
Also I'm not sure I agree - we're moving to a world where these things get instantiated and torn down dynamically (#84700), so we want clean shutdowns in every case.
The shutdown here is clean; the error is there for assertions. But I might as well check the error here too.
bors r=erikgrinaker
Linter failure:
bors r-
Canceled.
This commit simplifies the `rpc` package. The main aim is to make the code and the logging it produces somewhat clearer and to pave the way for an ultimate merging of `nodedialer` with `rpc` as well as adoption of the `util/circuit` package (async-based circuit breaker).

`rpc.Context` hadn't aged well. Back in the day, it was a connection pool that held on to connections even when they failed. The heartbeat loop would run forever and its latest outcome would reflect the health of the connection. We relied on gRPC to reconnect the transport as necessary.

At some point we realized that leaving the connection management to gRPC could cause correctness issues. For example, if a node is decommissioned and brought up again, gRPC might reconnect to it but now we have a connection that claims to be for node X but is actually for node Y. Similarly, we want to be able to vet that the remote node's cluster version is compatible with ours before letting any messages cross the connection.

To achieve this, we added `onlyOnceDialer` - a dialer that fails all but the first attempt to actually connect; in particular, from that point on connections were never long lived once they hit a failure state.

Still, the code and testing around the heartbeat loop continued to nurture this illusion. I found that quite unappealing as it was sure to throw off whoever would ultimately work on this code. So, while my original impetus was to produce better log messages from `rpc.Context` when there were connection issues, I took the opportunity to simplify the architecture of the code to reflect what it actually does. In doing so, a few heartbeat-related metrics were removed, as they made limited sense in a world where failing connections get torn down (i.e. our world).

Connection state logging is now a lot nicer. Previously, one would very frequently see the onlyOnceDialer's error "cannot reuse client connection" which, if one is not familiar with the concept of the onlyOnceDialer, is completely opaque, and for those few in the know is an unhelpful error - you wanted the error that occurred during dialing.

I paid special attention to the "failfast" behavior. If connections don't stay in the pool when they are unhealthy, what prevents us from dialing down nodes in a tight loop? I realized that we got lucky with our use of onlyOnceDialer because it would always prompt gRPC to do one retry, and at a 1s backoff. So, the connection would stay in the pool for at least a second, meaning that this was the maximum frequency at which we'd try to connect to a down node. My cleanups to onlyOnceDialer in pursuit of better logging eliminated this (desirable) property. Instead, we now skip the log messages should they become too frequent. I claim that this is acceptable for now since the vast majority of callers go through a `node.Dialer`, which has a circuit breaker (letting callers through at most once per second).

We previously configured gRPC with a 20s dial timeout. That means that if a remote is unreachable, we'd spend 20s just trying to dial it, possibly tying up callers that could be trying a reachable node in the meantime. That seemed wildly too large; I am reducing it to 5s here, which is still generous (but there's a TLS handshake in here, so better not make it too tight).

We previously tried to re-use connections that were keyed with a NodeID for dial attempts that didn't provide a NodeID. This caused some issues over the years and has now been removed; the extra RPC connections are unlikely to cause any issues.

The internal connection map is now a plain map with an RWMutex. This is just easier and gets the job done. The code has simplified as a result, and it's clearer that it's correct (which it repeatedly was not in the past).

I implemented redactability for gRPC's `status.Status`-style errors, so they format nicer and at least we get to see the error code (the error message is already flattened when gRPC hands us the error, sadly). There are various other improvements to errors and formatting.

The following are excerpts from the health channel when shutting down another node in a local cluster.

Connection is first established:

> [n3,rnode=1,raddr=‹127.0.0.1:26257›,class=system,rpc] 3 connection is now ready

n1 goes down, i.e. the existing connection is interrupted (this error comes from `onlyOnceDialer`):

> [n3,rnode=1,raddr=‹127.0.0.1:26257›,class=system,rpc] 35 closing connection after: ‹rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing connection interrupted (did the remote node shut down or are there networking issues?)"›

Reconnection attempts; these logs are spaced 1-2s apart:

> [n3,rnode=1,raddr=‹127.0.0.1:26257›,class=system,rpc] 37 unable to connect (is the peer up and reachable?): initial connection heartbeat failed: ‹rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing dial tcp 127.0.0.1:26257: connect: connection refused"›

n1 is back up:

> [n3,rnode=1,raddr=‹127.0.0.1:26257›,class=system,rpc] 49 connection is now ready

There are other connection errors in the logs that stem from operations checking for a healthy connection, failing to get through circuit breakers, etc., such as these:

> [n3,intExec=‹claim-jobs›,range-lookup=/Table/15/4/‹NULL›] 33 unable to connect to n1: failed to check for ready connection to n1 at ‹127.0.0.1:26257›: ‹connection not ready: TRANSIENT_FAILURE›

Release note (general change): Previously, CockroachDB employed a 20s connection timeout for cluster-internal connections between nodes. This has been reduced to 5s to potentially reduce the impact of network issues.

Release note (general change): Previously, CockroachDB (mostly) shared a TCP connection for the KV and Gossip subsystems. They now use their own connection each, so the total number of outgoing and incoming TCP connections at each node in the cluster will increase by 30 to 50 percent.
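For readers unfamiliar with the concept, a minimal illustrative sketch of the `onlyOnceDialer` idea mentioned above (not the actual `pkg/rpc` implementation): only the first dial attempt goes through, so a connection that hits an error gets torn down rather than transparently revived by gRPC.

```go
// Minimal, illustrative sketch of the onlyOnceDialer idea.
package rpcsketch

import (
	"context"
	"errors"
	"net"
	"sync"
)

type onlyOnceDialer struct {
	mu      sync.Mutex
	dialed  bool
	dialErr error
}

// dial performs the first dial for real; later calls (i.e. gRPC trying to
// transparently reconnect) fail, surfacing the original dial error if there
// was one, so the broken connection gets torn down instead of silently revived.
func (d *onlyOnceDialer) dial(ctx context.Context, addr string) (net.Conn, error) {
	d.mu.Lock()
	defer d.mu.Unlock()
	if d.dialed {
		if d.dialErr == nil {
			d.dialErr = errors.New("cannot reuse client connection")
		}
		return nil, d.dialErr
	}
	d.dialed = true
	conn, err := (&net.Dialer{}).DialContext(ctx, "tcp", addr)
	if err != nil {
		d.dialErr = err // report the original dial error on any later attempt
	}
	return conn, err
}
```

Such a dialer would typically be plugged in via `grpc.WithContextDialer(d.dial)`.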
It was using a 5ms heartbeat frequency, which translates to a 10ms timeout. I noticed that under stress this test would basically not make progress. I renamed the knob the test used (and which I was able to remove here since it wasn't needed anyway) to also reflect its role as a heartbeat timeout, to make this more obvious. Release note: None
The RPC heartbeat runs at a fixed interval, so if the connection goes bad sometime in the middle, it'll take some time for the heartbeat to notice. But often gRPC notices right away, and we ought to tear down the connection as soon as we can, since that reduces the number of flakes and ambiguous results we see.

This commit uses gRPC's state change watcher API to trigger an additional heartbeat as soon as we see the connection transition to `TRANSIENT_FAILURE` (which, due to our use of `onlyOnceDialer`, is really a permanent - for this connection - failure). In turn, I removed `ConnHealth` and the one place that used it, since the heartbeat loop in effect eagerly carries out that check, too.

I did all of this in response to `TestImportWorkerFailure` becoming quite flaky. Most of the time, it'd fail on the `ConnectionReady` check because the connection was still around after n2 had been shut down. The next problem was that `(*distSQLNodeHealth).checkSystem` was ignoring `ErrNotHeartbeated`, in effect leading to down nodes being used frequently. This special case (discussed on the PR) is now also removed. This in turn caused fallout in other tests that were relying on the old behavior (that nodes that hadn't been heartbeated yet were dialled immediately)[^1]. I decided to pull the nuclear option and to make sure that TestCluster establishes full connectivity in `Start`. The test passes reliably now.

[^1]: `TestDistSQLRangeCachesIntegrationTest` fails reliably without this

Release note: None
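A rough sketch of the watcher shape described above (illustrative only; the real logic lives in the heartbeat loop and uses the package's own plumbing):

```go
// Sketch: watch a *grpc.ClientConn for state transitions and nudge the
// heartbeat loop as soon as it hits TRANSIENT_FAILURE.
package rpcsketch

import (
	"context"

	"google.golang.org/grpc"
	"google.golang.org/grpc/connectivity"
)

// watchConnState signals heartbeatNow (e.g. a channel the heartbeat loop
// selects on) when the connection degrades, instead of waiting for the next
// fixed-interval heartbeat tick.
func watchConnState(ctx context.Context, cc *grpc.ClientConn, heartbeatNow chan<- struct{}) {
	for {
		st := cc.GetState()
		if st == connectivity.TransientFailure {
			// Non-blocking send: a heartbeat may already be pending.
			select {
			case heartbeatNow <- struct{}{}:
			default:
			}
		}
		if st == connectivity.Shutdown {
			return
		}
		// Block until the state changes away from st, or ctx is done.
		if !cc.WaitForStateChange(ctx, st) {
			return // context cancelled
		}
	}
}
```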
Remove some doubled-up logging and simplify the caller of runHeartbeat. Release note: None
bors r=erikgrinaker
Build succeeded:
This commit simplifies the `rpc` package. The main aim is to make the code and the logging it produces somewhat clearer and to pave the way for an ultimate merging of `nodedialer` with `rpc` as well as adoption of the `util/circuit` package (async-based circuit breaker).

`rpc.Context` hadn't aged well. Back in the day, it was a connection pool that held on to connections even when they failed. The heartbeat loop would run forever and its latest outcome would reflect the health of the connection. We relied on gRPC to reconnect the transport as necessary.

At some point we realized that leaving the connection management to gRPC could cause correctness issues. For example, if a node is decommissioned and brought up again, gRPC might reconnect to it but now we have a connection that claims to be for node X but is actually for node Y. Similarly, we want to be able to vet that the remote node's cluster version is compatible with ours before letting any messages cross the connection.

To achieve this, we added `onlyOnceDialer` - a dialer that fails all but the first attempt to actually connect, and in particular, from that point on, connections were never long lived once they hit a failure state.
Still, the code and testing around the heartbeat loop continued to nurture this illusion. I found that quite unappealing as it was sure to throw off whoever would ultimately work on this code. So, while my original impetus was to produce better log messages from `rpc.Context` when there were connection issues, I took the opportunity to simplify the architecture of the code to reflect what it actually does.

In doing so, a few heartbeat-related metrics were removed, as they made limited sense in a world where failing connections get torn down (i.e. our world).
Connection state logging is now a lot nicer. Previously, one would very frequently see the onlyOnceDialer's error "cannot reuse client connection" which, if one is not familiar with the concept of the onlyOnceDialer, is completely opaque, and for those few in the know is an unhelpful error - you wanted the error that occurred during dialing.
I paid special attention to the "failfast" behavior. If connections don't stay in the pool when they are unhealthy, what prevents us from dialing down nodes in a tight loop? I realized that we got lucky with our use of onlyOnceDialer because it would always prompt gRPC to do one retry, and at a 1s backoff. So, the connection would stay in the pool for at least a second, meaning that this was the maximum frequency at which we'd try to connect to a down node.

My cleanups to onlyOnceDialer in pursuit of better logging eliminated this (desirable) property. Instead, we now skip the log messages should they become too frequent. I claim that this is acceptable for now since the vast majority of callers go through a `node.Dialer`, which has a circuit breaker (letting callers through at most once per second).
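The "skip the log messages should they become too frequent" part amounts to rate-limited logging. A generic sketch of that pattern follows; CockroachDB has its own log helpers for this, so the names below are illustrative only:

```go
// Sketch of rate-limited logging for repeated dial failures.
package rpcsketch

import (
	"log"
	"sync"
	"time"
)

type everySoOften struct {
	mu       sync.Mutex
	interval time.Duration
	last     time.Time
}

// shouldLog reports whether enough time has passed since the last event we
// let through, so repeated dial failures don't flood the logs.
func (e *everySoOften) shouldLog() bool {
	e.mu.Lock()
	defer e.mu.Unlock()
	if now := time.Now(); now.Sub(e.last) >= e.interval {
		e.last = now
		return true
	}
	return false
}

var dialFailureLogEvery = &everySoOften{interval: time.Second}

func onDialFailure(err error) {
	if dialFailureLogEvery.shouldLog() {
		log.Printf("unable to connect (is the peer up and reachable?): %v", err)
	}
}
```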
We previously configured gRPC with a 20s dial timeout. That means that
if a remote is unreachable, we'd spend 20s just trying to dial it,
possibly tying up callers that could be trying a reachable node in the
meantime. That seemed wildly too large; I am reducing it to 5s here
which is still generous (but there's a TLS handshake in here so better
not make it too tight).
We previously tried to re-use connections that were keyed with a NodeID for dial attempts that didn't provide a NodeID. This caused some issues over the years and has now been removed; the extra RPC connections are unlikely to cause any issues.
The internal connection map is now a plain map with an RWMutex. This is
just easier and gets the job done. The code has simplified as a result
and it's clearer that it's correct (which it repeatedly was not in the
past).
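For illustration, the "plain map with an RWMutex" shape is roughly the following; the key and value types are simplified stand-ins, not the actual `rpc.Context` fields:

```go
// Sketch of an RWMutex-protected connection map.
package rpcsketch

import "sync"

type connKey struct {
	targetAddr string
	nodeID     int32
	class      int // connection class
}

type connection struct{} // stand-in for the real connection type

type connMap struct {
	mu sync.RWMutex
	m  map[connKey]*connection
}

// get is the common, read-only path.
func (cm *connMap) get(k connKey) (*connection, bool) {
	cm.mu.RLock()
	defer cm.mu.RUnlock()
	c, ok := cm.m[k]
	return c, ok
}

// getOrInsert creates the entry under the write lock if it isn't there yet.
func (cm *connMap) getOrInsert(k connKey) *connection {
	cm.mu.Lock()
	defer cm.mu.Unlock()
	if cm.m == nil {
		cm.m = map[connKey]*connection{}
	}
	if c, ok := cm.m[k]; ok {
		return c
	}
	c := &connection{}
	cm.m[k] = c
	return c
}

// remove drops the entry only if it still maps to the expected connection.
func (cm *connMap) remove(k connKey, expected *connection) {
	cm.mu.Lock()
	defer cm.mu.Unlock()
	if cm.m[k] == expected {
		delete(cm.m, k)
	}
}
```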
I implemented redactability for gRPC's `status.Status`-style errors, so they format nicer and at least we get to see the error code (the error message is already flattened when gRPC hands us the error, sadly).
There are various other improvements to errors and formatting. The following are excerpts from the health channel when shutting down another node in a local cluster.

Connection is first established:

> [n3,rnode=1,raddr=‹127.0.0.1:26257›,class=system,rpc] 3 connection is now ready

n1 goes down, i.e. the existing connection is interrupted (this error comes from `onlyOnceDialer`):

> [n3,rnode=1,raddr=‹127.0.0.1:26257›,class=system,rpc] 35 closing connection after: ‹rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing connection interrupted (did the remote node shut down or are there networking issues?)"›

Reconnection attempts; these logs are spaced 1-2s apart:

> [n3,rnode=1,raddr=‹127.0.0.1:26257›,class=system,rpc] 37 unable to connect (is the peer up and reachable?): initial connection heartbeat failed: ‹rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing dial tcp 127.0.0.1:26257: connect: connection refused"›

n1 is back up:

> [n3,rnode=1,raddr=‹127.0.0.1:26257›,class=system,rpc] 49 connection is now ready

There are other connection errors in the logs that stem from operations checking for a healthy connection, failing to get through circuit breakers, etc., such as these:

> [n3,intExec=‹claim-jobs›,range-lookup=/Table/15/4/‹NULL›] 33 unable to connect to n1: failed to check for ready connection to n1 at ‹127.0.0.1:26257›: ‹connection not ready: TRANSIENT_FAILURE›
Epic: none
Release note (general change): Previously, CockroachDB employed a 20s
connection timeout for cluster-internal connections between nodes. This
has been reduced to 5s to potentially reduce the impact of network
issues.
Release note (general change): Previously, CockroachDB (mostly) shared a
TCP connection for the KV and Gossip subsystems. They now use their own
connection each, so the total number of outgoing and incoming TCP
connections at each node in the cluster will increase by 30 to 50
percent.