cli: fix use of canceled connection in init
The implementation of `./cockroach init` first dials up the node (in a
retry loop) and issues a `Health` RPC, to reduce the likelihood of
accidental double-init of the cluster in case of network issues.

However, this was trying to be too clever: if it managed to dial the
node and check its health (again, using a short-lived context), it would
then return that same connection to the caller for use in the Bootstrap
RPC. Unfortunately, that connection would sit on top of an `rpc.Context`
whose life was tied to a context[^1] that descended from one wrapped by
`RunWithTimeout`. In other words, the context would be canceled by the
time the health check method returned.
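
Condensed, the problematic shape looked like this (a sketch distilled
from the old code in the diff below, not a literal excerpt; the real
call also returns a `finish` closure, elided here):

    var conn *grpc.ClientConn
    err := contextutil.RunWithTimeout(ctx, "init-open-conn", 5*time.Second,
        func(ctx context.Context) error {
            var dialErr error
            // conn escapes the closure, but the rpc.Context underneath
            // it is tied to this ctx...
            conn, _, _, dialErr = getClientGRPCConn(ctx, serverCfg)
            return dialErr
        })
    // ...and that ctx is canceled by the time we get here, so the
    // caller is handed a connection sitting on a dead context.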

This does not seem to cause issues today, since the `rpc.Context`
appears to ignore this context cancellation. But in cockroachdb#88625 it
suddenly became a problem, and now that I've looked at this code, we
might as well fix it regardless of whether cockroachdb#88625 ever lands.
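
To see why that matters, here is a self-contained sketch of the
property in play (assuming, as the commit message describes, that the
context wrapped by `RunWithTimeout` is canceled once it returns;
`runWithTimeout` below is a simplified stand-in, not CockroachDB's
actual helper):

    func runWithTimeout(
        ctx context.Context, d time.Duration, fn func(context.Context) error,
    ) error {
        ctx, cancel := context.WithTimeout(ctx, d)
        defer cancel() // the derived ctx dies as soon as fn returns
        return fn(ctx)
    }

    var leaked context.Context
    _ = runWithTimeout(context.Background(), time.Second, func(ctx context.Context) error {
        leaked = ctx // anything holding on to ctx outlives the call...
        return nil
    })
    fmt.Println(leaked.Err()) // ...and observes "context canceled"

The fix below therefore scopes the probe connection to the closure and
has the caller dial its own connection afterwards.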

No release note since today's code happens to work.

[^1]: https://github.com/cockroachdb/cockroach/blob/aecc58f125ac611f5793101cbd0323df569369db/pkg/cli/rpc_client.go#L46

Release note: None
tbg committed Sep 30, 2022
1 parent 2b58861 commit 90aadd6
Showing 1 changed file with 36 additions and 48 deletions.
pkg/cli/init.go
@@ -22,7 +22,6 @@ import (
     "github.com/cockroachdb/cockroach/pkg/util/retry"
     "github.com/cockroachdb/errors"
     "github.com/spf13/cobra"
-    "google.golang.org/grpc"
 )
 
 var initCmd = &cobra.Command{
@@ -44,7 +43,10 @@ func runInit(cmd *cobra.Command, args []string) error {
     defer cancel()
 
     // Wait for the node to be ready for initialization.
-    conn, finish, err := waitForClientReadinessAndGetClientGRPCConn(ctx)
+    if err := dialAndCheckHealth(ctx); err != nil {
+        return err
+    }
+    conn, _, finish, err := getClientGRPCConn(ctx, serverCfg)
     if err != nil {
         return err
     }
@@ -60,61 +62,47 @@ func runInit(cmd *cobra.Command, args []string) error {
     return nil
 }
 
-// waitForClientReadinessAndGetClientGRPCConn waits for the node to
-// be ready for initialization. This check ensures that the `init`
-// command is less likely to fail because it was issued too
-// early. In general, retrying the `init` command is dangerous [0],
-// so we make a best effort at minimizing chances for users to
-// arrive in an uncomfortable situation.
+// dialAndCheckHealth waits for the node to be ready for initialization. This
+// check ensures that the `init` command is less likely to fail because it was
+// issued too early. In general, retrying the `init` command is dangerous [0],
+// so we make a best effort at minimizing chances for users to arrive in an
+// uncomfortable situation.
 //
 // [0]: https://github.com/cockroachdb/cockroach/pull/19753#issuecomment-341561452
-func waitForClientReadinessAndGetClientGRPCConn(
-    ctx context.Context,
-) (conn *grpc.ClientConn, finish func(), err error) {
-    defer func() {
-        // If we're returning with an error, tear down the gRPC connection
-        // that's been established, if any.
-        if finish != nil && err != nil {
-            finish()
-        }
-    }()
+func dialAndCheckHealth(ctx context.Context) error {
+    retryOpts := retry.Options{InitialBackoff: time.Second, MaxBackoff: time.Second}
 
-    retryOpts := retry.Options{InitialBackoff: time.Second, MaxBackoff: time.Second}
-    for r := retry.StartWithCtx(ctx, retryOpts); r.Next(); {
-        if err = contextutil.RunWithTimeout(ctx, "init-open-conn", 5*time.Second,
-            func(ctx context.Context) error {
-                // (Attempt to) establish the gRPC connection. If that fails,
-                // it may be that the server hasn't started to listen yet, in
-                // which case we'll retry.
-                conn, _, finish, err = getClientGRPCConn(ctx, serverCfg)
-                if err != nil {
-                    return err
-                }
-                // Access the /health endpoint. Until/unless this succeeds, the
-                // node is not yet fully initialized and ready to accept
-                // Bootstrap requests.
-                ac := serverpb.NewAdminClient(conn)
-                _, err = ac.Health(ctx, &serverpb.HealthRequest{})
-                return err
-            }); err != nil {
+    tryConnect := func(ctx context.Context) error {
+        // (Attempt to) establish the gRPC connection. If that fails,
+        // it may be that the server hasn't started to listen yet, in
+        // which case we'll retry.
+        conn, _, finish, err := getClientGRPCConn(ctx, serverCfg)
+        if err != nil {
+            return err
+        }
+        defer finish()
+
+        // Access the /health endpoint. Until/unless this succeeds, the
+        // node is not yet fully initialized and ready to accept
+        // Bootstrap requests.
+        ac := serverpb.NewAdminClient(conn)
+        _, err = ac.Health(ctx, &serverpb.HealthRequest{})
+        return err
+    }
+
+    for r := retry.StartWithCtx(ctx, retryOpts); r.Next(); {
+        if err := contextutil.RunWithTimeout(
+            ctx, "init-open-conn", 5*time.Second, tryConnect,
+        ); err != nil {
             err = errors.Wrapf(err, "node not ready to perform cluster initialization")
             fmt.Fprintln(stderr, "warning:", err, "(retrying)")
-
-            // We're going to retry; first cancel the connection that's
-            // been established, if any.
-            if finish != nil {
-                finish()
-                finish = nil
-            }
-            // Then retry.
             continue
         }
-
-        // No error - connection was established and health endpoint is
-        // ready.
-        return conn, finish, err
+        // We managed to connect and run a sanity check. Note however that `conn` in
+        // `tryConnect` was established using a context that is now canceled, so we
+        // let the caller do its own dial.
+        return nil
     }
-    err = errors.New("maximum number of retries exceeded")
-    return
+    return errors.New("maximum number of retries exceeded")
 }
