Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improve tablet types to wait #17622

Open
wants to merge 7 commits into
base: main
Choose a base branch
from
Open
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
6 changes: 4 additions & 2 deletions examples/region_sharding/101_initial_cluster.sh
Original file line number Diff line number Diff line change
Expand Up @@ -30,8 +30,10 @@ fi
CELL=zone1 ../common/scripts/vtctld-up.sh

# start unsharded keyspace and tablet
CELL=zone1 TABLET_UID=100 ../common/scripts/mysqlctl-up.sh
SHARD=0 CELL=zone1 KEYSPACE=main TABLET_UID=100 ../common/scripts/vttablet-up.sh
for T_UID in 100 101; do
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We need two tablets here because the vtgate start-up script waits for primary,replica and with a single tablet we end up with no replicas.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we change vtgate to only wait for PRIMARY?

Copy link
Contributor Author

@ejortegau ejortegau Feb 6, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The script that starts the vtgate (../common/vtgate-up.sh) is common for all tests, not only region_sharding, so changing it would affect all tests. It seemed less intrusive of a change to modify this test only by having an extra vttablet. But if you prefer, I can do it that way. LMK.

CELL=zone1 TABLET_UID="${T_UID}" ../common/scripts/mysqlctl-up.sh
SHARD=0 CELL=zone1 KEYSPACE=main TABLET_UID="${T_UID}" ../common/scripts/vttablet-up.sh
done

# set the correct durability policy for the keyspace
vtctldclient --server localhost:15999 SetKeyspaceDurabilityPolicy --durability-policy=none main || fail "Failed to set keyspace durability policy on the main keyspace"
Expand Down
Original file line number Diff line number Diff line change
Expand Up @@ -19,85 +19,19 @@ package loadkeyspace
import (
"os"
"path"
"strings"
"testing"
"time"

"github.com/stretchr/testify/require"

"vitess.io/vitess/go/test/endtoend/utils"

"vitess.io/vitess/go/test/endtoend/cluster"
)

var (
clusterInstance *cluster.LocalProcessCluster
hostname = "localhost"
keyspaceName = "ks"
cell = "zone1"
sqlSchema = `
create table vt_user (
id bigint,
name varchar(64),
primary key (id)
) Engine=InnoDB;
create table main (
id bigint,
val varchar(128),
primary key(id)
) Engine=InnoDB;
create table test_table (
id bigint,
val varchar(128),
primary key(id)
) Engine=InnoDB;
`
)

func TestLoadKeyspaceWithNoTablet(t *testing.T) {
Copy link
Contributor Author

@ejortegau ejortegau Feb 5, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This test case was removed as discussed on Slack

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is an empty string an allowed value for --tablet_types_to_wait? Or do we validate and reject it and default to PRIMARY, REPLICA?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What I understood that default is empty but it fails the validation, so at least one tablet type have to be provided.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What I understood that default is empty but it fails the validation, so at least one tablet type have to be provided.

Correct, the vtgate refuses to start with empty string. Furthermore, it validates that the string corresponds to a serving tablet type.

var err error

clusterInstance = cluster.NewCluster(cell, hostname)
defer clusterInstance.Teardown()

// Start topo server
err = clusterInstance.StartTopo()
require.NoError(t, err)

// create keyspace
keyspace := &cluster.Keyspace{
Name: keyspaceName,
SchemaSQL: sqlSchema,
}
clusterInstance.VtTabletExtraArgs = append(clusterInstance.VtTabletExtraArgs, "--queryserver-config-schema-change-signal")
err = clusterInstance.StartUnshardedKeyspace(*keyspace, 0, false)
require.NoError(t, err)

// teardown vttablets
for _, vttablet := range clusterInstance.Keyspaces[0].Shards[0].Vttablets {
err = vttablet.VttabletProcess.TearDown()
require.NoError(t, err)
utils.TimeoutAction(t, 1*time.Minute, "timeout - teardown of VTTablet", func() bool {
return vttablet.VttabletProcess.GetStatus() == ""
})
}

// Start vtgate with the schema_change_signal flag
clusterInstance.VtGateExtraArgs = append(clusterInstance.VtGateExtraArgs, "--schema_change_signal")
err = clusterInstance.StartVtgate()
require.NoError(t, err)

// After starting VTGate we need to leave enough time for resolveAndLoadKeyspace to reach
// the schema tracking timeout (5 seconds).
utils.TimeoutAction(t, 5*time.Minute, "timeout - could not find 'Unable to get initial schema reload' in 'vtgate-stderr.txt'", func() bool {
logDir := clusterInstance.VtgateProcess.LogDir
all, _ := os.ReadFile(path.Join(logDir, "vtgate-stderr.txt"))
return strings.Contains(string(all), "Unable to get initial schema reload")
})
}

func TestNoInitialKeyspace(t *testing.T) {
var err error

Expand Down
3 changes: 1 addition & 2 deletions go/vt/vtgate/tabletgateway.go
Original file line number Diff line number Diff line change
Expand Up @@ -207,9 +207,8 @@ func (gw *TabletGateway) WaitForTablets(ctx context.Context, tabletTypesToWait [
case context.DeadlineExceeded:
// In this scenario, we were able to reach the
// topology service, but some tablets may not be
// ready. We just warn and keep going.
// ready.
log.Warningf("Timeout waiting for all keyspaces / shards to have healthy tablets of types %v, may be in degraded mode", tabletTypesToWait)
err = nil
}
}()

Expand Down
17 changes: 15 additions & 2 deletions go/vt/vtgate/vtgate.go
Original file line number Diff line number Diff line change
Expand Up @@ -295,8 +295,21 @@ func Init(
// TabletGateway can create it's own healthcheck
gw := NewTabletGateway(ctx, hc, serv, cell)
gw.RegisterStats()
if err := gw.WaitForTablets(ctx, tabletTypesToWait); err != nil {
log.Fatalf("tabletGateway.WaitForTablets failed: %v", err)

// Retry loop for potential time-outs waiting for all tablets.
OuterLoop:
for {
err := gw.WaitForTablets(ctx, tabletTypesToWait)
switch {
case err == nil:
break OuterLoop
case errors.Is(err, context.DeadlineExceeded):
log.Warning("TabletGateway timed out waiting for tablets to become available - retrying.")

continue
Comment on lines +306 to +309
Copy link
Member

@harshit-gangal harshit-gangal Feb 6, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this would require a new context for retry otherwise it will keep failing with context.DeadlineExceeded
we should add a test for this change

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am unsure I follow what you would like the behavior to be, could you please elaborate? Thanks!

Copy link
Member

@harshit-gangal harshit-gangal Feb 6, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should check the parent context sent to WaitForTablets to ensure it is still valid. If the context has already expired, the retry will continue to fail with the same error.
The key decision here is whether we should stop retrying and fail or attempt again with a new context.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see, thanks for the clarification. However, I am unsure this is necessary. The upstream context passed by Init() to WaitForTablets() does not seem to be have a timeout; it's just a context.WithCancel() instead, it is WaitForTablets() who creates a new context with the timeout given by --gateway_initial_tablet_timeout. Given that, it seems to me that the retries would not fail with the same error - unless the timeout is hit again. Is that not the case?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That is my expectation as well.

default:
log.Fatalf("tabletGateway.WaitForTablets failed: %v", err)
}
}

dynamicConfig := NewDynamicViperConfig()
Expand Down
Loading