
rpc: re-work connection maintenance #88625

Merged · 5 commits · Oct 12, 2022

Conversation

@tbg (Member) commented Sep 23, 2022

This commit simplifies the rpc package.

The main aim is to make the code and the logging it produces somewhat
clearer and to pave the way for an ultimate merging of nodedialer with
rpc as well as adoption of the util/circuit package (async-based
circuit breaker).

rpc.Context hadn't aged well. Back in the day, it was a connection
pool that held on to connections even when they failed. The heartbeat
loop would run forever and its latest outcome would reflect the health
of the connection. We relied on gRPC to reconnect the transport as
necessary.

At some point we realized that leaving the connection management to
gRPC could cause correctness issues. For example, if a node is
decommissioned and brought up again, gRPC might reconnect to it, but
now we have a connection that claims to be for node X but is actually
for node Y. Similarly, we want to be able to vet that the remote
node's cluster version is compatible with ours before letting any
messages cross the connection.

To achieve this, we added onlyOnceDialer - a dialer that fails all but
the first attempt to actually connect. In particular, from that point
on, connections were never long-lived once they hit a failure state.
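
For illustration only (names and wiring are mine, not the actual rpc package code), the shape of such a dialer is roughly:

```go
package rpcsketch

import (
	"context"
	"errors"
	"net"
	"sync"
)

// onlyOnceDialer permits exactly one dial attempt; later attempts (e.g.
// gRPC-internal reconnects) fail, so a broken connection is torn down and
// re-established through the pool instead of being silently reused.
type onlyOnceDialer struct {
	mu     sync.Mutex
	dialed bool
}

func (d *onlyOnceDialer) dial(ctx context.Context, addr string) (net.Conn, error) {
	d.mu.Lock()
	defer d.mu.Unlock()
	if d.dialed {
		return nil, errors.New("cannot reuse client connection")
	}
	d.dialed = true
	var nd net.Dialer
	return nd.DialContext(ctx, "tcp", addr)
}
```

Hooked into gRPC via `grpc.WithContextDialer`, a second dial attempt is what surfaces the opaque "cannot reuse client connection" error discussed below.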

Still, the code and testing around the heartbeat loop continued
to nurture this illusion. I found that quite unappealing as it
was sure to throw off whoever would ultimately work on this code.

So, while my original impetus was to produce better log messages
from rpc.Context when there were connection issues, I took
the opportunity to simplify the architecture of the code to
reflect what it actually does.

In doing so, a few heartbeat-related metrics were removed, as they made
limited sense in a world where failing connections get torn down (i.e.
our world).

Connection state logging is now a lot nicer. Previously, one would very
frequently see the onlyOnceDialer's error "cannot reuse client
connection", which is completely opaque to anyone not familiar with the
concept of the onlyOnceDialer, and even for the few in the know it is an
unhelpful error - you wanted the error that occurred during dialing.

I paid special attention to the "failfast" behavior. If connections
don't stay in the pool when they are unhealthy, what prevents us from
dialing down nodes in a tight loop? I realized that we got lucky with
our use of onlyOnceDialer because it would always prompt gRPC to do
one retry, and at a 1s backoff. So, the connection would stay in the
pool for at least a second, meaning that this was the maximum frequency
at which we'd try to connect to a down node.

My cleanups to onlyOnceDialer in pursuit of better logging eliminated
this (desirable) property. Instead, we now skip the log messages should
they become too frequent. I claim that this is acceptable for now since
the vast majority of callers go through a node.Dialer, which has a
circuit breaker (letting callers through at most once per second).
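
As a sketch of the kind of time-based throttle meant here (illustrative; CockroachDB's log package has its own rate-limiting helpers):

```go
package rpcsketch

import (
	"sync"
	"time"
)

// logEvery allows a message through at most once per interval; callers that
// would log more often simply skip the message.
type logEvery struct {
	mu       sync.Mutex
	interval time.Duration
	last     time.Time
}

func (e *logEvery) shouldLog() bool {
	e.mu.Lock()
	defer e.mu.Unlock()
	now := time.Now()
	if now.Sub(e.last) < e.interval {
		return false
	}
	e.last = now
	return true
}
```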

We previously configured gRPC with a 20s dial timeout. That means that
if a remote is unreachable, we'd spend 20s just trying to dial it,
possibly tying up callers that could be trying a reachable node in the
meantime. That seemed far too large; I am reducing it to 5s here, which
is still generous (but there's a TLS handshake in here, so better not
make it too tight).
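
For illustration only (not the actual rpc.Context configuration), the effect of the shorter dial timeout with grpc-go is roughly:

```go
package rpcsketch

import (
	"context"
	"time"

	"google.golang.org/grpc"
	"google.golang.org/grpc/credentials/insecure"
)

// dialWithTimeout gives up on an unreachable remote after 5s instead of 20s,
// so callers aren't tied up waiting on a node that may never answer.
func dialWithTimeout(ctx context.Context, addr string) (*grpc.ClientConn, error) {
	ctx, cancel := context.WithTimeout(ctx, 5*time.Second)
	defer cancel()
	return grpc.DialContext(ctx, addr,
		grpc.WithBlock(), // block until connected or the timeout fires
		grpc.WithTransportCredentials(insecure.NewCredentials()), // TLS omitted in this sketch
	)
}
```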

We previously tried to re-use connections that were keyed with a NodeID
for dial attempts that didn't provide a NodeID. This caused some issues
over the years and has now been removed; the extra RPC connections are
unlikely to cause any problems.

The internal connection map is now a plain map with an RWMutex. This is
just easier and gets the job done. The code has simplified as a result
and it's clearer that it's correct (which it repeatedly was not in the
past).
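
A bare-bones sketch of that structure (key and value types here are stand-ins, not the real ones):

```go
package rpcsketch

import "sync"

// connKey identifies a connection slot; dial attempts without a NodeID use a
// zero NodeID and get their own entry rather than coalescing onto another.
type connKey struct {
	nodeID int32
	target string
	class  int
}

type connection struct{} // placeholder for the real per-connection state

type connMap struct {
	mu    sync.RWMutex
	conns map[connKey]*connection
}

func (m *connMap) get(k connKey) (*connection, bool) {
	m.mu.RLock()
	defer m.mu.RUnlock()
	c, ok := m.conns[k]
	return c, ok
}

func (m *connMap) remove(k connKey) {
	m.mu.Lock()
	defer m.mu.Unlock()
	delete(m.conns, k)
}
```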

I implemented redactability for gRPC's status.Status-style errors,
so they format nicer and at least we get to see the error code (the
error message is already flattened when gRPC hands us the error,
sadly).
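
For context, the code is recoverable from such errors even though the message has been flattened; an illustrative sketch (not the actual redaction hook):

```go
package rpcsketch

import (
	"fmt"

	"google.golang.org/grpc/status"
)

// describeGRPCError extracts the gRPC code from a status-style error so that
// at least the code survives even when the message must be treated as opaque.
func describeGRPCError(err error) string {
	if s, ok := status.FromError(err); ok {
		return fmt.Sprintf("grpc error: code = %s", s.Code())
	}
	return "not a grpc status error"
}
```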

There are various other improvements to errors and formatting. The
following are excerpts from the health channel when shutting down
another node in a local cluster:

Connection is first established:

[n3,rnode=1,raddr=‹127.0.0.1:26257›,class=system,rpc] 3 connection is now ready

n1 goes down, i.e. existing connection is interrupted (this error comes
from onlyOnceDialer):

[n3,rnode=1,raddr=‹127.0.0.1:26257›,class=system,rpc] 35 closing
connection after: ‹rpc error: code = Unavailable desc = connection
error: desc = "transport: Error while dialing connection interrupted
(did the remote node shut down or are there networking issues?)"›

Reconnection attempts; these logs are spaced 1-2s apart:

[n3,rnode=1,raddr=‹127.0.0.1:26257›,class=system,rpc] 37 unable to
connect (is the peer up and reachable?): initial connection heartbeat
failed: ‹rpc error: code = Unavailable desc = connection error: desc =
"transport: Error while dialing dial tcp 127.0.0.1:26257: connect:
connection refused"›

n1 is back up:

[n3,rnode=1,raddr=‹127.0.0.1:26257›,class=system,rpc] 49 connection is now ready

There are other connection errors in the logs that stem from operations
checking for a healthy connection, failing to get through circuit
breakers, etc., such as these:

[n3,intExec=‹claim-jobs›,range-lookup=/Table/15/4/‹NULL›] 33 unable
to connect to n1: failed to check for ready connection to n1 at
‹127.0.0.1:26257›: ‹connection not ready: TRANSIENT_FAILURE›

Epic: none

Release note (general change): Previously, CockroachDB employed a 20s
connection timeout for cluster-internal connections between nodes. This
has been reduced to 5s to potentially reduce the impact of network
issues.

Release note (general change): Previously, CockroachDB (mostly) shared a
TCP connection for the KV and Gossip subsystems. They now use their own
connection each, so the total number of outgoing and incoming TCP
connections at each node in the cluster will increase by 30 to 50
percent.


tbg added a commit to tbg/cockroach that referenced this pull request Sep 30, 2022
The implementation of `./cockroach init` first dials up the node (in a
retry loop) and issues a `Health` RPC, to reduce the likelihood of
accidental double-init of the cluster in case of network issues.

However, this was trying to be too clever: if it managed to dial the
node and check its health (again, using a short-lived context), it would
then return that same connection to the caller for use in the Bootstrap
RPC. Unfortunately, that connection would sit on top of an `rpc.Context`
whose life was tied to a context[^1] that descended from one wrapped by
`RunWithTimeout`. In other words, the context would be cancelled by the
time the health check method returned.

This seems to not cause issues today, since the `rpc.Context` seems to
ignore this context cancellation. But in cockroachdb#88625, this suddenly
became a problem, and now that I've looked at this code, we might as well
fix it regardless of whether cockroachdb#88625 ever lands.

No release note since today's code happens to work.

[^1]: https://github.com/cockroachdb/cockroach/blob/aecc58f125ac611f5793101cbd0323df569369db/pkg/cli/rpc_client.go#L46

Release note: None
@tbg tbg force-pushed the rpc-redial-err branch 7 times, most recently from 6aa5ebb to 0612ea9 on September 30, 2022 14:30
@tbg tbg marked this pull request as ready for review September 30, 2022 14:32
@tbg tbg requested a review from a team as a code owner September 30, 2022 14:32
@tbg tbg requested review from a team September 30, 2022 14:32
@tbg tbg requested review from a team as code owners September 30, 2022 14:32
@tbg tbg requested review from erikgrinaker and removed request for a team September 30, 2022 14:33
craig bot pushed a commit that referenced this pull request Sep 30, 2022
88414: util/tracing: trim trace recordings in a smarter way  r=andreimatei a=andreimatei

Before this patch, when the recording of a child span was being added to
the parent, if the number of spans in the child recording + the number
of spans in the parent's recording were greater than the span limit
(1000), then the child's recording was completely dropped (apart from
the structured events, which were still retained). So, for example, if
the parent had a recording of 1 span and the child had a recording of
1000 spans, the whole 1000 spans were dropped.
This patch improves things by always combining the parent trace and the
child trace, and then trimming the result according to the following
arbitrary algorithm:
- start at the root of the trace and sort its children by size, desc
- drop the fattest children (including their descendants) until the
  remaining number of spans to drop becomes smaller than the size of the
  fattest non-dropped child
- recurse into that child, with an adjusted number of spans to drop

So, the idea is that, recursively, we drop parts of the largest child -
including dropping the whole child if needed.
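
As a rough illustration of that recursion (a simplified sketch over a plain span tree, not the actual util/tracing code):

```go
package tracingsketch

import "sort"

type spanNode struct {
	children []*spanNode
}

// size counts the spans in the subtree rooted at n, including n itself.
func size(n *spanNode) int {
	s := 1
	for _, c := range n.children {
		s += size(c)
	}
	return s
}

// trim drops about toDrop spans from root's subtree: whole children go first,
// fattest first; once the remainder fits inside the fattest surviving child,
// recurse into it with the adjusted count.
func trim(root *spanNode, toDrop int) {
	for toDrop > 0 && len(root.children) > 0 {
		sort.Slice(root.children, func(i, j int) bool {
			return size(root.children[i]) > size(root.children[j])
		})
		largest := size(root.children[0])
		if toDrop >= largest {
			root.children = root.children[1:] // drop the whole child subtree
			toDrop -= largest
			continue
		}
		trim(root.children[0], toDrop)
		return
	}
}
```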

Fixes #87536

Release note: None

88707: ui: reduce frequent Metrics page re-rendering r=koorosh a=koorosh

Before, many components on the Metrics page relied on the `nodesSummarySelector` selector, which in turn relies on `NodeStatus` objects that change with every request (and we request them periodically, every 10 seconds). `NodeStatus` includes lots of data that is unnecessary for the Metrics page (i.e. a node's and its stores' latest metrics, which aren't used on charts), but these changes force React components to re-render.

This patch refactors the selectors used by the Metrics page to provide only the relevant subset of `NodeStatus` info to Graph components and to reduce the propagation of `props` from parent to child components. Instead, child components use selectors directly where necessary.

Release note: None

Resolves #65030


Video showing how often components were re-rendered **before** this fix - regardless of the time window, they always update every 10 seconds:

https://user-images.githubusercontent.com/3106437/192295619-9b2da8bd-2fdb-4f5e-96db-688048fc54cf.mov

and here's after the fix: components re-render every 10 seconds for a 10-minute interval and do not re-render for time windows like 2 weeks or 1 month:

https://user-images.githubusercontent.com/3106437/192296089-13103781-6632-46a5-85aa-80ad8b20dd02.mov




88973: sql: pool some of the processor allocations r=yuzefovich a=yuzefovich

**sql: clarify closing contract around plan node and row source adapters**

This commit cleans up the `rowSourceToPlanNode` adapter (from the
DistSQL processor to the planNode object) in the following manner:
- it removes the incorrect call to `ConsumerClosed` on the wrapped
input. This call was redundant (because the other side of the adapter,
`planNodeToRowSource`, closes it too) but was also incorrect since it
could access a row source that had already been put back into the
sync.Pool (and maybe even picked up by another query). See issue 88964
for more details.
- it removes the checks around non-nil "metadata forwarder". These were
needed when the local planNode and DistSQL processor engines were merged
into one about four years ago, but nowadays the adapter always gets
a non-nil forwarder.

Fixes: #88964.

Release note: None

**sql: pool some of the processor allocations**

This commit makes it so that we now pool allocations of noop,
planNodeToRowSource, and columnarizer processors. Additionally,
trailing meta callbacks for these three as well as materializers
are changed to not be anonymous functions to further reduce the
allocations.
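
For background, the pooling follows the standard sync.Pool pattern; a minimal sketch with stand-in types (not the actual processor code):

```go
package sqlsketch

import "sync"

// noopProcessor stands in for the pooled processor types.
type noopProcessor struct {
	// per-query state would live here
}

var noopPool = sync.Pool{
	New: func() interface{} { return &noopProcessor{} },
}

// newNoopProcessor reuses a pooled object instead of allocating a fresh one.
func newNoopProcessor() *noopProcessor {
	return noopPool.Get().(*noopProcessor)
}

// release resets the per-query state and returns the object to the pool;
// after this it must not be touched again (see the ConsumerClosed bug above).
func (p *noopProcessor) release() {
	*p = noopProcessor{}
	noopPool.Put(p)
}
```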

Fixes: #88525.

Release note: None

89059: cli: fix use of canceled connection in `init` r=erikgrinaker a=tbg

The implementation of `./cockroach init` first dials up the node (in a
retry loop) and issues a `Health` RPC, to reduce the likelihood of
accidental double-init of the cluster in case of network issues.

However, this was trying to be too clever: if it managed to dial the
node and check its health (using a short-lived context), it would
then return that same connection to the caller for use in the Bootstrap
RPC. Unfortunately, that connection would sit on top of an `rpc.Context`
whose life was tied to a context[^1] that descended from one wrapped by
`RunWithTimeout`. In other words, the context would be cancelled by the
time the health check method returned.

This seems to not cause issues today, since the `rpc.Context` seems to
ignore this context cancellation. But in #88625, this suddenly became a
problem, and now that I've looked at this code, we might as well fix it
regardless of whether #88625 ever lands.

No release note since today's code happens to work.

[^1]: https://github.com/cockroachdb/cockroach/blob/aecc58f125ac611f5793101cbd0323df569369db/pkg/cli/rpc_client.go#L46

Release note: None


89064: jobs: make the registry logging less chatty r=dt,stevendanna a=knz

The "toggling idleness" log message was the 4th most voluminous log event source in CC, logged 4x more frequently than the first next event source in volume.

This commit makes it less verbose.

Release note: None

Co-authored-by: Andrei Matei <[email protected]>
Co-authored-by: Andrii Vorobiov <[email protected]>
Co-authored-by: Yahor Yuzefovich <[email protected]>
Co-authored-by: Tobias Grieger <[email protected]>
Co-authored-by: Raphael 'kena' Poss <[email protected]>
@knz (Contributor) left a comment

First commit LGTM.
2nd commit - I reviewed all I could but not in the right frame of mind to ascertain correctness of the "main" thing. I'd need to look at it further next week. Also will welcome Erik's opinion of course.

Reviewed 1 of 1 files at r1, 24 of 24 files at r2, all commit messages.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @erikgrinaker and @tbg)


pkg/rpc/context.go line 1804 at r2 (raw file):

	target string, remoteNodeID roachpb.NodeID, class ConnectionClass,
) *Connection {
	ctx := rpcCtx.makeDialCtx(target, remoteNodeID, class)

Why did you move this? I put it there for the log.Fatal's benefit.


pkg/util/grpcutil/grpc_err_redaction.go line 28 at r2 (raw file):

func grpcSpecialCasePrintFn(err error, p errbase.Printer, isLeaf bool) (handled bool, next error) {
	if errors.GetTypeKey(err) != grpcErrTypeKey {

nit: I think the following is just as correct and will get you less overhead:

var refErr = status.Error(codes.Unavailable, "")
...
if errors.HasType(err, refErr)

Alternatively, you could just drop this check entirely and trust status.FromError - it only returns true in the 2nd return value if the status object could be recovered with the right type.
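
Filled out just enough to compile (illustrative; assumes cockroachdb/errors and the grpc status package):

```go
package rpcsketch

import (
	"github.com/cockroachdb/errors"
	"google.golang.org/grpc/codes"
	"google.golang.org/grpc/status"
)

// refErr exists only to carry the concrete type of gRPC status errors.
var refErr = status.Error(codes.Unavailable, "")

// isGRPCStatusError reports whether err is (or wraps) a gRPC status error.
func isGRPCStatusError(err error) bool {
	return errors.HasType(err, refErr)
}
```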


pkg/util/netutil/net.go line 212 at r2 (raw file):

// SafeFormatError implements errors.SafeFormatter.
func (e *InitialHeartbeatFailedError) SafeFormatError(p errors.Printer) (next error) {
	p.Printf("initial connection heartbeat failed")

nit: p.SafeString, less overhead.


pkg/rpc/context_test.go line 821 at r2 (raw file):

func TestHeartbeatHealth(t *testing.T) {
	defer leaktest.AfterTest(t)()
	// TODO why does this hide log.HEALTH even when -show-logs is active?

It was Andrei's request: #69765 / #70052.

You can define a custom log config for use with log.Scope if you want.

@tbg tbg requested a review from a team as a code owner October 10, 2022 20:12
@tbg tbg requested review from herkolategan and removed request for a team and herkolategan October 10, 2022 20:12
@knz (Contributor) left a comment

Reviewed 5 of 25 files at r4, 8 of 9 files at r6, 7 of 7 files at r7, all commit messages.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @tbg)


pkg/rpc/context.go line 1804 at r2 (raw file):

Previously, tbg (Tobias Grieger) wrote…

Done.

I'm still confused - what's the obstacle to having ctx := rpcCtx.makeDialCtx(target, remoteNodeID, class) here and passing it as argument to grpcDialNodeInternal? Is it solely because you want to avoid creating a context.Context if rpcCtx.m.Get returns a valid connection?


pkg/rpc/context.go line 1871 at r7 (raw file):

	// callers that may pass a zero NodeID could coalesce onto the connection
	// with the "real" NodeID. This caused issues over the years[^1][^2] and
	// was never necessary anyway, so we don't do it anymore.

Does this mean we get 1 extra connection for those callers that use node ID 0?
If so, perhaps worth mentioning in the release note. This will double the number of node-node connections when the cluster goes from non-initialized to initialized at the point cockroach init is run.


pkg/rpc/context.go line 1943 at r7 (raw file):

		conn.err.Store(errDialRejected)
		close(conn.initialHeartbeatDone)
		_ = rpcCtx.m.Remove(k, conn) // ignoring error since it's just on shutdown anyway

nit: period at end of comment.

Also I'm not sure I agree - we're moving to a world where these things get instantiated and torn down dynamically (#84700), so we want clean shutdowns in every case.


pkg/util/netutil/net.go line 212 at r2 (raw file):

Previously, tbg (Tobias Grieger) wrote…

Does not exist on errors.Printer.

oh sorry I misread. that's cool.

@knz (Contributor) left a comment

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @tbg)


pkg/rpc/context.go line 1871 at r7 (raw file):

Previously, knz (Raphael 'kena' Poss) wrote…

Does this mean we get 1 extra connection for those callers that use node ID 0?
If so, perhaps worth mentioning in the release note. This will double the number of node-node connections when the cluster goes from non-initialized to initialized at the point cockroach init is run.

Also, gossip uses UnvalidatedDial (so with Node ID = 0). So it's not just in the non-init case we're doubling the connections. It's all the time.

(Technically, we'd be going to 2 conns per node-node pair, from system/normal classes, to 3: system/normal/gossip-other.)

@knz (Contributor) left a comment

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @tbg)


pkg/rpc/context.go line 1871 at r7 (raw file):

Previously, knz (Raphael 'kena' Poss) wrote…

Also, gossip uses UnvalidatedDial (so with Node ID = 0). So it's not just in the non-init case we're doubling the connections. It's all the time.

(Technically, we'd be going to 2 conns per node-node pair, from system/normal classes, to 3: system/normal/gossip-other.)

That incidentally brings me to a related question: does it matter whether the gossip connection uses the same TCP connection as the other RPCs? Is there something that gossip decides wrt connectivity that could be wrong / disruptive if the gossip connection succeeds, but the other RPC connection(s) fail?

@tbg tbg force-pushed the rpc-redial-err branch 2 times, most recently from a609889 to 62d76f8 on October 12, 2022 07:17
@tbg (Member, Author) left a comment

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @erikgrinaker, @knz, and @yuzefovich)


pkg/rpc/context.go line 1804 at r2 (raw file):

Previously, knz (Raphael 'kena' Poss) wrote…

I'm still confused - what's the obstacle to having ctx := rpcCtx.makeDialCtx(target, remoteNodeID, class) here and passing it as argument to grpcDialNodeInternal? Is it solely because you want to avoid creating a context.Context if rpcCtx.m.Get returns a valid connection?

The dialing API is currently such that you have methods like GRPCDialNode that give you a *Connection and are non-blocking, so they can never use the cancellation property of the Context (that happens when you call .Connect). I think it's a little clearer to pass the context only once, or you end up with (essentially) rpcCtx.GRPCDialNode(ctx).Connect(ctx).


pkg/rpc/context.go line 1871 at r7 (raw file):

Previously, knz (Raphael 'kena' Poss) wrote…

That incidentally brings me to a related question: does it matter whether the gossip connection uses the same TCP connection as the other RPCs? Is there something that gossip decides wrt connectivity that could be wrong / disruptive if the gossip connection succeeds, but the other RPC connection(s) fail?

Good point, will mention in the release note. It shouldn't matter to Gossip who it's sharing the connection with. If anything, having its own connection might be helpful sometimes. (And of course, in weird async network partitions, as always, all bets are off.)


pkg/rpc/context.go line 1943 at r7 (raw file):

Previously, knz (Raphael 'kena' Poss) wrote…

nit: period at end of comment.

Also I'm not sure I agree - we're moving to a world where these things get instantiated and teared down dynamically (#84700) so we want clean shutdowns in every case.

The shutdown here is clean, the error makes assertions. But I might as well check the error here too.

@tbg (Member, Author) commented Oct 12, 2022

bors r=erikgrinaker

@erikgrinaker (Contributor) commented:
Linter failure:

=== RUN   TestLint/TestRoachVet
    lint_test.go:91:
        pkg/rpc/context_test.go:928:11: github.com/cockroachdb/errors.Errorf format %d reads arg #1, but call has 0 args
        pkg/rpc/context_test.go:951:11: github.com/cockroachdb/errors.Errorf format %d reads arg #1, but call has 0 args
    --- FAIL: TestLint/TestRoachVet (142.81s)
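
For reference, the kind of call vet is flagging and its fix (hypothetical example; the offending lines in context_test.go aren't shown here):

```go
package rpcsketch

import "github.com/cockroachdb/errors"

func heartbeatCountError(n int) error {
	// vet failure: "%d reads arg #1, but call has 0 args".
	// return errors.Errorf("expected %d heartbeats")

	// Fixed: supply the argument the verb reads.
	return errors.Errorf("expected %d heartbeats", n)
}
```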

@tbg (Member, Author) commented Oct 12, 2022

bors r-

@craig (bot) commented Oct 12, 2022

Canceled.

tbg added 4 commits October 12, 2022 10:30
This commit simplifies the `rpc` package; its full commit message is identical to the PR description above.
It was using a 5ms heartbeat frequency, which translates to a 10ms
timeout. I noticed that under stress this test would basically not make
progress. The knob the test used wasn't needed there anyway, so I removed
that use, and I renamed the knob so that it also reflects its role as a
heartbeat timeout.

Release note: None
The RPC heartbeat runs at a fixed interval, so if the connection
goes bad sometime in the middle, it'll take some time for the
heartbeat to notice. But often gRPC notices right away, and we
ought to tear down the connection as soon as we can since that
reduces the number of flakes and ambiguous results we see.

This commit uses gRPC's state change watcher API to trigger an
additional heartbeat as soon as we see the connection transition to
`TRANSIENT_FAILURE` (which due to our use of `onlyOnceDialer` is really
a permanent - for this connection - failure).
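
For illustration only (not the actual rpc package integration), the watcher pattern with grpc-go looks roughly like this:

```go
package rpcsketch

import (
	"context"

	"google.golang.org/grpc"
	"google.golang.org/grpc/connectivity"
)

// watchForFailure blocks until conn enters TRANSIENT_FAILURE (or ctx is done)
// and then nudges the heartbeat loop so the connection is torn down promptly
// instead of waiting for the next scheduled heartbeat.
func watchForFailure(ctx context.Context, conn *grpc.ClientConn, heartbeatNow chan<- struct{}) {
	for {
		s := conn.GetState()
		if s == connectivity.TransientFailure || s == connectivity.Shutdown {
			select {
			case heartbeatNow <- struct{}{}:
			case <-ctx.Done():
			}
			return
		}
		if !conn.WaitForStateChange(ctx, s) {
			return // context cancelled
		}
	}
}
```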

In turn, I removed `ConnHealth` and the one place that used it, since
the heartbeat loop in effect eagerly carries out that check, too.

I did all of this in response to `TestImportWorkerFailure` becoming
quite flaky. Most of the time, it'd fail on the `ConnectionReady`
check because the connection was still around after n2 had been
shut down.

The next problem was that `(*distSQLNodeHealth).checkSystem` was
ignoring `ErrNotHeartbeated`, in effect leading to down nodes being used
frequently. This special case (discussed on the PR) is now also removed.

This in turn caused fallout in other tests that were relying on the
old behavior (that nodes that hadn't been heartbeated yet were dialled
immediately)[^1]. I decided to pull the nuclear option and to make sure
that TestCluster establishes full connectivity in `Start`.

The test passes reliably now.

[^1]: `TestDistSQLRangeCachesIntegrationTest` fails reliably without this

Release note: None
Remove some doubled-up logging and simplify the caller of runHeartbeat.

Release note: None
@tbg (Member, Author) commented Oct 12, 2022

bors r=erikgrinaker

@craig (bot) commented Oct 12, 2022

Build succeeded:
