Stop waiting for the ready notification if the server has stopped #16754

Draft · wants to merge 1 commit into main

Conversation

ahrtr (Member) commented Oct 13, 2023

ahrtr marked this pull request as draft October 13, 2023 10:05
ahrtr (Member, Author) commented Oct 13, 2023

Need to add a test.
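
For context on the title: the idea is that a caller blocking on the server's ready notification should also watch its stop channel, so it can bail out when the server stops before ever becoming ready. A standalone toy sketch of that pattern (the channel names mirror etcdserver's ReadyNotify()/StopNotify(); this is purely illustrative, not the actual diff in this PR):

package main

import (
	"fmt"
	"time"
)

// waitForReady returns as soon as the server signals ready, or with an error
// as soon as it stops, instead of blocking forever on the ready channel.
func waitForReady(readyC, stopC <-chan struct{}) error {
	select {
	case <-readyC:
		return nil
	case <-stopC:
		// the server stopped before ever becoming ready, e.g. because it
		// had been removed from the cluster; without this case the caller
		// would hang indefinitely
		return fmt.Errorf("server stopped before becoming ready")
	}
}

func main() {
	readyC := make(chan struct{})
	stopC := make(chan struct{})

	// simulate a start failure: the server stops without signalling ready
	go func() {
		time.Sleep(100 * time.Millisecond)
		close(stopC)
	}()

	if err := waitForReady(readyC, stopC); err != nil {
		fmt.Println("start failed:", err) // the caller can now return instead of hanging
	}
}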

@Hendrik-H

I believe a test case could be added to tests/integration/embed/embed_test.go that works like this:

  • create a cluster with 3 nodes
  • stop one node
  • remove the stopped node from the cluster
  • restart the stopped node

The start should fail and Close() must not hang. I'll see what I can do ...

ahrtr (Member, Author) commented Oct 13, 2023

"restart the stopped node"

We shouldn't restart a stopped & removed node, but etcd shouldn't hang indefinitely when users call Close(), so it's OK to add such a case.

Another case is to start only one node of a 3-node cluster; that node will then block on <-s.ReadyNotify(). In that case, when users call Close(), it shouldn't block either.
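
For reference, a rough sketch of that second case (a hypothetical test, not part of this PR; it assumes the newEmbedURLs helper already in embed_test.go and a per-node config helper such as the setupEmbedClusterCfg proposed further down in this thread):

func testEmbedEtcdSingleNodeOfClusterClose(t *testing.T, secure bool) {
	testutil.SkipTestIfShortMode(t, "Cannot start embedded cluster in --short tests")

	nodes := 3
	urls := newEmbedURLs(secure, nodes*2)

	// only start node 0; its peers never come up, so it cannot win an
	// election and ReadyNotify() never fires
	cfg := setupEmbedClusterCfg(t, secure, urls[0:nodes], urls[nodes:], 0)
	e, err := embed.StartEtcd(cfg)
	if err != nil {
		t.Fatal(err)
	}

	donec := make(chan struct{})
	go func() {
		defer close(donec)
		e.Close() // must return even though the server never became ready
	}()
	select {
	case <-donec:
	case <-time.After(10 * time.Second):
		t.Fatal("Close() blocked on a server that never became ready")
	}
}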

@Hendrik-H

The following is a possible test case that shows that the code change fixes the issue:

diff --git a/tests/integration/embed/embed_test.go b/tests/integration/embed/embed_test.go
index 7cdb42934..6b5e1720d 100644
--- a/tests/integration/embed/embed_test.go
+++ b/tests/integration/embed/embed_test.go
@@ -187,6 +187,77 @@ func testEmbedEtcdGracefulStop(t *testing.T, secure bool) {
 	}
 }
 
+func TestEmbedEtcdClusterGracefulStopSecure(t *testing.T) {
+	testEmbedEtcdClusterGracefulStop(t, true)
+}
+
+func TestEmbedEtcdClusterGracefulStopInsecure(t *testing.T) {
+	testEmbedEtcdClusterGracefulStop(t, false)
+}
+
+func testEmbedEtcdClusterGracefulStop(t *testing.T, secure bool) {
+	testutil.SkipTestIfShortMode(t, "Cannot start embedded cluster in --short tests")
+
+	nodes := 3 // need to be an odd number larger than 1
+
+	urls := newEmbedURLs(secure, nodes*2)
+
+	// start a cluster
+	var err error
+	cfg := make([]*embed.Config, nodes)
+	e := make([]*embed.Etcd, nodes)
+	for i := 0; i < nodes; i++ {
+		cfg[i] = setupEmbedClusterCfg(t, secure, urls[0:nodes], urls[nodes:], i)
+		e[i], err = embed.StartEtcd(cfg[i])
+		if err != nil {
+			t.Fatal(err)
+		}
+	}
+
+	// wait for all nodes to be ready
+	for i := 0; i < nodes; i++ {
+		select {
+		case <-e[i].Server.ReadyNotify(): // wait for e.Server to join the cluster
+		case <-time.After(10 * time.Second):
+			t.Fatalf("node %d did not join the cluster in time", i)
+		}
+	}
+
+	// stop one node
+	e[nodes-1].Close()
+
+	// remove the stopped node from the cluster
+	_, err = e[0].Server.RemoveMember(context.Background(), uint64(e[nodes-1].Server.MemberId()))
+	if err != nil {
+		t.Fatal(err)
+	}
+
+	// start the node again -> fails as it was removed from the cluster
+	cfg[nodes-1] = setupEmbedClusterCfg(t, secure, urls[0:nodes], urls[nodes:], nodes-1)
+	e[nodes-1], err = embed.StartEtcd(cfg[nodes-1])
+	if err != nil {
+		t.Fatal(err)
+	}
+	select {
+	case <-e[nodes-1].Server.ReadyNotify():
+		t.Fatal("the start should have failed")
+	case <-e[nodes-1].Server.StopNotify():
+	}
+
+	donec := make(chan struct{})
+	go func() {
+		defer close(donec)
+		for i := 0; i < nodes; i++ {
+			e[i].Close()
+		}
+	}()
+	select {
+	case <-donec:
+	case <-time.After(2*time.Second + e[0].Server.Cfg.ReqTimeout()):
+		t.Fatalf("took too long to close servers")
+	}
+}
+
 func newEmbedURLs(secure bool, n int) (urls []url.URL) {
 	scheme := "unix"
 	if secure {
@@ -213,6 +284,30 @@ func setupEmbedCfg(cfg *embed.Config, curls []url.URL, purls []url.URL) {
 	cfg.InitialCluster = cfg.InitialCluster[1:]
 }
 
+func setupEmbedClusterCfg(t *testing.T, secure bool, curls []url.URL, purls []url.URL, n int) *embed.Config {
+	cfg := embed.NewConfig()
+	if secure {
+		cfg.ClientTLSInfo = testTLSInfo
+		cfg.PeerTLSInfo = testTLSInfo
+	}
+
+	cfg.Logger = "zap"
+	cfg.LogOutputs = []string{"/dev/null"}
+
+	cfg.Name = fmt.Sprintf("node-%d", n)
+	cfg.ClusterState = "new"
+	cfg.ListenClientUrls, cfg.AdvertiseClientUrls = []url.URL{curls[n]}, []url.URL{curls[n]}
+	cfg.ListenPeerUrls, cfg.AdvertisePeerUrls = []url.URL{purls[n]}, []url.URL{purls[n]}
+	cfg.InitialCluster = ""
+	for i := range purls {
+		cfg.InitialCluster += fmt.Sprintf(",node-%d=%s", i, purls[i].String())
+	}
+	cfg.InitialCluster = cfg.InitialCluster[1:]
+	cfg.Dir = filepath.Join(t.TempDir(), fmt.Sprintf("embed-etcd-%d", n))
+
+	return cfg
+}
+
 func TestEmbedEtcdAutoCompactionRetentionRetained(t *testing.T) {
 	cfg := embed.NewConfig()
 	urls := newEmbedURLs(false, 2)

I'm not sure why, but the "Secure" version of the test does not work while the "Insecure" one works just fine. Pretty much the identical code fix and test case also work for 3.5.9. The only difference is that Server.MemberId() needs to be changed to Server.ID().
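
On release-3.5 the member-removal line would then read (same test code otherwise):

	// release-3.5: EtcdServer exposes ID() instead of MemberId()
	_, err = e[0].Server.RemoveMember(context.Background(), uint64(e[nodes-1].Server.ID()))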

ahrtr (Member, Author) commented Oct 13, 2023

@Hendrik-H Please,

  • add the second case (only start one node in a 3-node cluster) I mentioned above in a separate PR; it should be easier. The test should fail.
  • cherry-pick the commit in this PR into your PR as the second commit. Then the test should pass.
  • feel free to continue working on your test case above (restarting an already stopped & removed node) in a follow-up PR.

Does that make sense to you?

@Hendrik-H

The test you suggested (starting only one node of a cluster) is actually a different issue. The Close() call does not result in the server stopping, and thus the goroutine still keeps hanging.
I'll try to open a new PR structured in a slightly different way:

  • first commit: my test case
  • second commit: the cherry-pick of your change
  • third commit: the 1-node test case, which is still failing

@Hendrik-H

new PR is #16758

Hendrik-H mentioned this pull request Oct 29, 2023

stale bot commented Mar 17, 2024

This issue has been automatically marked as stale because it has not had recent activity. It will be closed after 21 days if no further activity occurs. Thank you for your contributions.

@k8s-ci-robot

PR needs rebase.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

stale bot removed the stale label Dec 14, 2024

Successfully merging this pull request may close these issues.

embed server never becomes ready / can't be stopped in certain start failure conditions