Stop waiting for the ready notification if the server has stopped #16754

Draft · wants to merge 1 commit into main

Conversation

ahrtr (Member) commented Oct 13, 2023

ahrtr marked this pull request as draft October 13, 2023 10:05
ahrtr (Member, Author) commented Oct 13, 2023

Need to add a test.
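
For context on the title: the idea is that a caller blocking on the server's ready notification should also watch its stop channel, so it can bail out when the server stops before ever becoming ready. A standalone toy sketch of that pattern (the channel names mirror etcdserver's ReadyNotify()/StopNotify(); this is purely illustrative, not the actual diff in this PR):

package main

import (
	"fmt"
	"time"
)

// waitForReady returns as soon as the server signals ready, or with an error
// as soon as it stops, instead of blocking forever on the ready channel.
func waitForReady(readyC, stopC <-chan struct{}) error {
	select {
	case <-readyC:
		return nil
	case <-stopC:
		// the server stopped before ever becoming ready, e.g. because it
		// had been removed from the cluster; without this case the caller
		// would hang indefinitely
		return fmt.Errorf("server stopped before becoming ready")
	}
}

func main() {
	readyC := make(chan struct{})
	stopC := make(chan struct{})

	// simulate a start failure: the server stops without signalling ready
	go func() {
		time.Sleep(100 * time.Millisecond)
		close(stopC)
	}()

	if err := waitForReady(readyC, stopC); err != nil {
		fmt.Println("start failed:", err) // the caller can now return instead of hanging
	}
}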

@Hendrik-H

I believe a test case could be added to tests/integration/embed/embed_test.go that works like this:

  • create a cluster with 3 nodes
  • stop one node
  • remove the stopped node from the cluster
  • restart the stopped node

The start should fail and Close() must not hang. I'll see what I can do ...

ahrtr (Member, Author) commented Oct 13, 2023

"restart the stopped node"

We shouldn't restart a stopped & removed node, but etcd shouldn't hang indefinitely when users call Close(), so it's OK to add such a case.

Another case is to start only one node of a 3-node cluster; that node will then block on <-s.ReadyNotify(). In that case, when users call Close(), it shouldn't block either.
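
For reference, a rough sketch of that second case (a hypothetical test, not part of this PR; it assumes the newEmbedURLs helper already in embed_test.go and a per-node config helper such as the setupEmbedClusterCfg proposed further down in this thread):

func testEmbedEtcdSingleNodeOfClusterClose(t *testing.T, secure bool) {
	testutil.SkipTestIfShortMode(t, "Cannot start embedded cluster in --short tests")

	nodes := 3
	urls := newEmbedURLs(secure, nodes*2)

	// only start node 0; its peers never come up, so it cannot win an
	// election and ReadyNotify() never fires
	cfg := setupEmbedClusterCfg(t, secure, urls[0:nodes], urls[nodes:], 0)
	e, err := embed.StartEtcd(cfg)
	if err != nil {
		t.Fatal(err)
	}

	donec := make(chan struct{})
	go func() {
		defer close(donec)
		e.Close() // must return even though the server never became ready
	}()
	select {
	case <-donec:
	case <-time.After(10 * time.Second):
		t.Fatal("Close() blocked on a server that never became ready")
	}
}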

@Hendrik-H

The following is a possible test case that shows that the code change fixes the issue:

diff --git a/tests/integration/embed/embed_test.go b/tests/integration/embed/embed_test.go
index 7cdb42934..6b5e1720d 100644
--- a/tests/integration/embed/embed_test.go
+++ b/tests/integration/embed/embed_test.go
@@ -187,6 +187,77 @@ func testEmbedEtcdGracefulStop(t *testing.T, secure bool) {
 	}
 }
 
+func TestEmbedEtcdClusterGracefulStopSecure(t *testing.T) {
+	testEmbedEtcdClusterGracefulStop(t, true)
+}
+
+func TestEmbedEtcdClusterGracefulStopInsecure(t *testing.T) {
+	testEmbedEtcdClusterGracefulStop(t, false)
+}
+
+func testEmbedEtcdClusterGracefulStop(t *testing.T, secure bool) {
+	testutil.SkipTestIfShortMode(t, "Cannot start embedded cluster in --short tests")
+
+	nodes := 3 // need to be an odd number larger than 1
+
+	urls := newEmbedURLs(secure, nodes*2)
+
+	// start a cluster
+	var err error
+	cfg := make([]*embed.Config, nodes)
+	e := make([]*embed.Etcd, nodes)
+	for i := 0; i < nodes; i++ {
+		cfg[i] = setupEmbedClusterCfg(t, secure, urls[0:nodes], urls[nodes:], i)
+		e[i], err = embed.StartEtcd(cfg[i])
+		if err != nil {
+			t.Fatal(err)
+		}
+	}
+
+	// wait for all nodes to be ready
+	for i := 0; i < nodes; i++ {
+		select {
+		case <-e[i].Server.ReadyNotify(): // wait for e.Server to join the cluster
+		case <-time.After(10 * time.Second):
+			t.Fatalf("node %d did not join the cluster in time", i)
+		}
+	}
+
+	// stop one node
+	e[nodes-1].Close()
+
+	// remove the stopped node from the cluster
+	_, err = e[0].Server.RemoveMember(context.Background(), uint64(e[nodes-1].Server.MemberId()))
+	if err != nil {
+		t.Fatal(err)
+	}
+
+	// start the node again -> fails as it was removed from the cluster
+	cfg[nodes-1] = setupEmbedClusterCfg(t, secure, urls[0:nodes], urls[nodes:], nodes-1)
+	e[nodes-1], err = embed.StartEtcd(cfg[nodes-1])
+	if err != nil {
+		t.Fatal(err)
+	}
+	select {
+	case <-e[nodes-1].Server.ReadyNotify():
+		t.Fatal("the start should have failed")
+	case <-e[nodes-1].Server.StopNotify():
+	}
+
+	donec := make(chan struct{})
+	go func() {
+		defer close(donec)
+		for i := 0; i < nodes; i++ {
+			e[i].Close()
+		}
+	}()
+	select {
+	case <-donec:
+	case <-time.After(2*time.Second + e[0].Server.Cfg.ReqTimeout()):
+		t.Fatalf("took too long to close servers")
+	}
+}
+
 func newEmbedURLs(secure bool, n int) (urls []url.URL) {
 	scheme := "unix"
 	if secure {
@@ -213,6 +284,30 @@ func setupEmbedCfg(cfg *embed.Config, curls []url.URL, purls []url.URL) {
 	cfg.InitialCluster = cfg.InitialCluster[1:]
 }
 
+func setupEmbedClusterCfg(t *testing.T, secure bool, curls []url.URL, purls []url.URL, n int) *embed.Config {
+	cfg := embed.NewConfig()
+	if secure {
+		cfg.ClientTLSInfo = testTLSInfo
+		cfg.PeerTLSInfo = testTLSInfo
+	}
+
+	cfg.Logger = "zap"
+	cfg.LogOutputs = []string{"/dev/null"}
+
+	cfg.Name = fmt.Sprintf("node-%d", n)
+	cfg.ClusterState = "new"
+	cfg.ListenClientUrls, cfg.AdvertiseClientUrls = []url.URL{curls[n]}, []url.URL{curls[n]}
+	cfg.ListenPeerUrls, cfg.AdvertisePeerUrls = []url.URL{purls[n]}, []url.URL{purls[n]}
+	cfg.InitialCluster = ""
+	for i := range purls {
+		cfg.InitialCluster += fmt.Sprintf(",node-%d=%s", i, purls[i].String())
+	}
+	cfg.InitialCluster = cfg.InitialCluster[1:]
+	cfg.Dir = filepath.Join(t.TempDir(), fmt.Sprintf("embed-etcd-%d", n))
+
+	return cfg
+}
+
 func TestEmbedEtcdAutoCompactionRetentionRetained(t *testing.T) {
 	cfg := embed.NewConfig()
 	urls := newEmbedURLs(false, 2)

I'm not sure why, but the "Secure" version of the test does not work while the "Insecure" one works just fine. Pretty much the identical code fix and test case also work for 3.5.9. The only difference is that Server.MemberId() needs to be changed to Server.ID().
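
On release-3.5 the member-removal line would then read (same test code otherwise):

	// release-3.5: EtcdServer exposes ID() instead of MemberId()
	_, err = e[0].Server.RemoveMember(context.Background(), uint64(e[nodes-1].Server.ID()))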

ahrtr (Member, Author) commented Oct 13, 2023

@Hendrik-H Please,

  • add the second case (only start one node in a 3-node cluster) I mentioned above in a separate PR; it should be easier. The test should fail.
  • cherry-pick the commit in this PR into your PR as the second commit. Then the test should pass.
  • feel free to continue working on your test case above (restarting an already stopped & removed node) in a follow-up PR.

Does that make sense to you?

@Hendrik-H

The test you suggested (starting only one node of a cluster) is actually a different issue. The Close() call does not result in the server stopping, and thus the goroutine still keeps hanging.
I'll try to open a new PR structured in a slightly different way:

  • first commit: my test case
  • second commit: the cherry-pick of your change
  • third commit: the 1-node test case, which is still failing

@Hendrik-H

new PR is #16758

Hendrik-H mentioned this pull request Oct 29, 2023

stale bot commented Mar 17, 2024

This issue has been automatically marked as stale because it has not had recent activity. It will be closed after 21 days if no further activity occurs. Thank you for your contributions.

@k8s-ci-robot

PR needs rebase.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

stale bot removed the stale label Dec 14, 2024

Successfully merging this pull request may close these issues.

embed server never becomes ready / can't be stopped in certain start failure conditions