mcs: fix watch primary address revision and update cache when meets not leader #6279

lhy1024 · 2023-04-06T15:16:27Z

What problem does this PR solve?

Issue Number: Ref #5895.

What is changed and how does it work?

Check List

Tests

Unit test

Release note

None.

ti-chi-bot · 2023-04-06T15:16:28Z

[REVIEW NOTIFICATION]

This pull request has been approved by:

binshi-bing
rleungx

To complete the pull request process, please ask the reviewers in the list to review by filling /cc @reviewer in the comment.
After your PR has acquired the required number of LGTMs, you can assign this pull request to the committer in the list by filling /assign @committer in the comment to help you merge this pull request.

The full list of commands accepted by this bot can be found here.

Reviewer can indicate their review by submitting an approval review.
Reviewer can cancel approval by submitting a request changes review.

ti-chi-bot · 2023-04-06T15:16:28Z

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

binshi-bing · 2023-04-06T17:50:10Z

server/server.go

+				revision = wresp.CompactRevision
+				break
+			}
+			if wresp.Canceled {


It seems you have forgotten your comment in my PR and didn't check my reply :)

At the same place, you asked "Do we need to solve other errors?", and I replied:

I checked the code in etcd watch.go. There are no other errors to handle, but I do think using wresp.Err() != nil here is better than using wresp.Canceled, because the former can accommodate future etcd changes.

Etcd Watch() returns three types of errors: closed, compacted, and cancelled. The closed error is captured by ranging over watchChan, the compacted error is captured by wresp.CompactRevision != 0, and the cancelled error is captured here. For all other errors, etcd Watch() retries internally. Using wresp.Err() != nil is better than using wresp.Canceled.

Changed to using wresp.Err() != nil here.

Sorry, I forgot about it when the copilot finished.
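A minimal sketch of the distinction discussed above, under stated assumptions: `watchResponse` is a hypothetical, simplified stand-in for etcd's `clientv3.WatchResponse` (it is not the real type), and `nextStep` is an illustrative helper that does not appear in the PR. It shows why compaction is checked first and why `Err() != nil` covers more states than `Canceled` alone.

```go
package main

import (
	"errors"
	"fmt"
)

// watchResponse is a hypothetical, simplified stand-in for etcd's
// clientv3.WatchResponse. Only the states discussed in this thread are modeled.
type watchResponse struct {
	CompactRevision int64 // non-zero when the requested revision was compacted
	Canceled        bool  // set when the watch was canceled
	closeErr        error // set when the watch stream closed with an error
}

// Err reports any failure state as a non-nil error, in the spirit of
// WatchResponse.Err(); the error texts here are illustrative only.
func (r watchResponse) Err() error {
	switch {
	case r.closeErr != nil:
		return r.closeErr
	case r.CompactRevision != 0:
		return errors.New("required revision has been compacted")
	case r.Canceled:
		return errors.New("watch canceled")
	}
	return nil
}

// nextStep decides how a watch loop might react to one response.
// Compaction is checked first because it carries the revision to resume from.
func nextStep(r watchResponse) string {
	if r.CompactRevision != 0 {
		return "rewatch from compact revision"
	}
	if err := r.Err(); err != nil {
		return "stop or recreate watcher"
	}
	return "process events"
}

func main() {
	fmt.Println(nextStep(watchResponse{CompactRevision: 42}))
	fmt.Println(nextStep(watchResponse{Canceled: true}))
	fmt.Println(nextStep(watchResponse{closeErr: errors.New("stream closed")}))
	fmt.Println(nextStep(watchResponse{}))
}
```

In this sketch a response that carries only `closeErr` would be treated as healthy by a `Canceled`-only check, while the `Err()` check catches it.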

binshi-bing · 2023-04-06T17:59:13Z

server/server.go

+				revision = wresp.CompactRevision
+				break
+			}
+			if wresp.Canceled {


This change is an improvement, but it might not be enough. When wresp.Canceled is true, it doesn't necessarily mean the context has been cancelled, so if the primary address loop returns here while the API service is still serving, the service will misbehave. Do we need to recreate the Watcher and re-watch in an endless loop until the context is cancelled, as I do in KeyspaceGroupManager?

	select {
	case reqc <- wr:
		ok = true
	case <-wr.ctx.Done():
	case <-donec:
		if wgs.closeErr != nil {
			closeCh <- WatchResponse{Canceled: true, closeErr: wgs.closeErr}
			break
		}
		// retry; may have dropped stream from no ctxs
		return w.Watch(ctx, key, opts...)
	}

binshi-bing · 2023-04-06T17:59:37Z

server/server.go

@@ -1726,7 +1726,7 @@ func (s *Server) watchServicePrimaryAddrLoop(serviceName string) {
 	log.Info("start to watch", zap.String("service-key", serviceKey))

 	primary := &tsopb.Participant{}
-	ok, rev, err := etcdutil.GetProtoMsgWithModRev(s.client, serviceKey, primary)
+	ok, revision, err := etcdutil.GetProtoMsgWithModRev(s.client, serviceKey, primary)


Do we only need to watch keyspace group 0?

Yes: serviceKey := fmt.Sprintf("/ms/%d/%s/%s/%s", s.clusterID, serviceName, fmt.Sprintf("%05d", 0), "primary")
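For illustration, the key layout in the reply above can be reproduced with a single format string. `servicePrimaryKey` is a hypothetical helper, and taking the keyspace group ID as a parameter is an assumption of this sketch; the code quoted in the thread hard-codes group 0.

```go
package main

import "fmt"

// servicePrimaryKey builds the etcd key under which a service's primary is
// registered. The helper name and the keyspaceGroupID parameter are
// hypothetical; the thread only shows the key for keyspace group 0.
func servicePrimaryKey(clusterID uint64, serviceName string, keyspaceGroupID uint32) string {
	// %05d zero-pads the group ID to five digits, matching fmt.Sprintf("%05d", 0).
	return fmt.Sprintf("/ms/%d/%s/%05d/primary", clusterID, serviceName, keyspaceGroupID)
}

func main() {
	fmt.Println(servicePrimaryKey(1, "tso", 0)) // prints /ms/1/tso/00000/primary
}
```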

binshi-bing

LGTM for the change having improved things, though we might still need more improvements.

rleungx · 2023-04-07T02:32:37Z

server/server.go

@@ -1737,15 +1737,26 @@ func (s *Server) watchServicePrimaryAddrLoop(serviceName string) {
 	} else {
 		log.Warn("service primary addr doesn't exist", zap.String("service-key", serviceKey))
 	}
+	watcher := clientv3.NewWatcher(s.client)


There is a lot of code with the same logic; the only difference is the key. How about extracting a shared function for it?

OK, I will try it later.

Signed-off-by: lhy1024 <[email protected]>

codecov · 2023-04-07T07:37:25Z

Codecov Report

Patch coverage: 56.00% and project coverage change: -0.12 ⚠️

Comparison is base (8c9b4fb) 75.16% compared to head (c5c6b5d) 75.04%.

❗ Current head c5c6b5d differs from pull request most recent head 596d6e4. Consider uploading reports for the commit 596d6e4 to get more accurate results

Additional details and impacted files

@@            Coverage Diff             @@
##           master    #6279      +/-   ##
==========================================
- Coverage   75.16%   75.04%   -0.12%     
==========================================
  Files         404      404              
  Lines       39860    39913      +53     
==========================================
- Hits        29961    29954       -7     
- Misses       7282     7332      +50     
- Partials     2617     2627      +10

Flag	Coverage Δ
unittests	`75.04% <56.00%> (-0.12%)`	⬇️

Flags with carried forward coverage won't be shown. Click here to find out more.

Impacted Files	Coverage Δ
pkg/tso/keyspace_group_manager.go	`85.29% <0.00%> (ø)`
pkg/utils/tsoutil/tso_dispatcher.go	`58.03% <28.57%> (-2.72%)`	⬇️
server/server.go	`74.91% <60.93%> (-0.96%)`	⬇️
server/grpc_service.go	`49.51% <100.00%> (-0.09%)`	⬇️

... and 18 files with indirect coverage changes

Signed-off-by: lhy1024 <[email protected]>

binshi-bing

There is one more comment that you may need to check and confirm.

binshi-bing · 2023-04-07T19:30:55Z

server/server.go

+		case <-ctx.Done():
+			return revision, nil
+		case <-s.updateServicePrimaryAddrCh:
+			revision, err = s.updateServicePrimaryAddr(serviceName)


Most likely, even after we update the service primary address, one problem remains: s.clientConns stores the grpc.ClientConn for each forwarded host, and we never replace broken connections. If a forwarded host's connection breaks, e.g. the forwarded host restarts and breaks the existing connection, we will keep retrieving the broken connection for that host.

It will automatically try to create a new connection when the existing connection is closed.

binshi-bing

LGTM

JmPotato · 2023-04-10T02:56:13Z

pkg/utils/tsoutil/tso_dispatcher.go

+				if len(updateServicePrimaryAddrChs) > 0 {
+					if strings.Contains(err.Error(), errs.NotLeaderErr) || strings.Contains(err.Error(), errs.MismatchLeaderErr) {
+						updateServicePrimaryAddrChs[0] <- struct{}{}
+					}
+				}


I suggest putting this part into a function outside the loop, which may help the compiler to do some inline optimization.
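A minimal sketch of what such a helper could look like, under stated assumptions: `notifyIfNotLeader` is a hypothetical name, the two message constants are stand-ins for `errs.NotLeaderErr` and `errs.MismatchLeaderErr` (their real text may differ), and the non-blocking send is a choice made for this sketch rather than something the thread confirms.

```go
package main

import (
	"errors"
	"fmt"
	"strings"
)

// Hypothetical stand-ins for errs.NotLeaderErr and errs.MismatchLeaderErr;
// the real message text in PD may differ.
const (
	notLeaderErr      = "is not leader"
	mismatchLeaderErr = "mismatch leader id"
)

// notifyIfNotLeader signals the first channel, if one was passed, when err
// looks like a leadership error. It reports whether a signal was sent.
func notifyIfNotLeader(err error, updateChs ...chan<- struct{}) bool {
	if err == nil || len(updateChs) == 0 {
		return false
	}
	msg := err.Error()
	if !strings.Contains(msg, notLeaderErr) && !strings.Contains(msg, mismatchLeaderErr) {
		return false
	}
	select {
	case updateChs[0] <- struct{}{}:
		return true
	default:
		// The receiver is not ready and any buffer is full; drop the signal
		// instead of blocking the dispatcher.
		return false
	}
}

// demoNotify exercises the helper with a buffered channel of size one.
func demoNotify() (first, second, unrelated, noChannel bool) {
	ch := make(chan struct{}, 1)
	notLeader := errors.New("rpc error: " + notLeaderErr)
	first = notifyIfNotLeader(notLeader, ch)                 // buffer empty: sent
	second = notifyIfNotLeader(notLeader, ch)                // buffer full: dropped
	unrelated = notifyIfNotLeader(errors.New("timeout"), ch) // not a leadership error
	noChannel = notifyIfNotLeader(notLeader)                 // caller passed no channel
	return first, second, unrelated, noChannel
}

func main() {
	fmt.Println(demoNotify())
}
```

Keeping the check in a small function also leaves the dispatcher loop body short, which is the readability side of the suggestion regardless of whether the compiler inlines it.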

JmPotato · 2023-04-10T02:58:28Z

server/grpc_service.go

@@ -224,7 +224,7 @@ func (s *GrpcServer) Tso(stream pdpb.PD_TsoServer) error {
 			}

 			tsoRequest := tsoutil.NewPDProtoRequest(forwardedHost, clientConn, request, stream)
-			s.tsoDispatcher.DispatchRequest(ctx, tsoRequest, tsoProtoFactory, doneCh, errCh)
+			s.tsoDispatcher.DispatchRequest(ctx, tsoRequest, tsoProtoFactory, doneCh, errCh, s.updateServicePrimaryAddrCh)


It seems that the channel is always passed to the function, so why use an optional parameter?

DispatchRequest is also used by the TSO server, and the TSO server does not need to watch the API key.

JmPotato · 2023-04-10T02:59:47Z

server/server.go

+				zap.Time("retry-at", time.Now().Add(watchKEtcdChangeRetryInterval)),
+				zap.Error(err))
+			revision = nextRevision
+			time.Sleep(watchKEtcdChangeRetryInterval)


I suggest selecting on a ticker rather than sleeping.

In fact, we need a single wait before the retry here, rather than a periodic tick.
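The trade-off in this exchange can be sketched as a one-shot wait that, unlike time.Sleep, also observes cancellation and, unlike a ticker, fires only once per retry. `waitForRetry` and `demoWait` are hypothetical names used only for this illustration.

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// waitForRetry blocks for one retry interval or until ctx is done.
// It reports true when the interval elapsed and the caller should retry.
func waitForRetry(ctx context.Context, interval time.Duration) bool {
	select {
	case <-ctx.Done():
		return false
	case <-time.After(interval):
		return true
	}
}

// demoWait shows both outcomes: an already-canceled context returns at once,
// and a live context returns after the interval.
func demoWait() (afterCancel, afterInterval bool) {
	canceled, cancel := context.WithCancel(context.Background())
	cancel()
	afterCancel = waitForRetry(canceled, time.Hour)
	afterInterval = waitForRetry(context.Background(), time.Millisecond)
	return afterCancel, afterInterval
}

func main() {
	fmt.Println(demoWait())
}
```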

rleungx · 2023-04-10T03:07:16Z

server/server.go

+	primary := &tsopb.Participant{}
+	ok, revision, err := etcdutil.GetProtoMsgWithModRev(s.client, serviceKey, primary)
+	listenUrls := primary.GetListenUrls()
+	if !ok || err != nil || len(listenUrls) == 0 {


There are two cases where we may return 0, nil, which breaks the retry loop. Is that expected?

rleungx · 2023-04-10T03:08:11Z

server/server.go

+
+	for {
+	WatchChan:
+		watchChan := watcher.Watch(s.serverLoopCtx, serviceKey, clientv3.WithPrefix(), clientv3.WithRev(revision))


Do we still need the prefix?

Signed-off-by: lhy1024 <[email protected]>

JmPotato · 2023-04-17T06:03:26Z

server/server.go

+		select {
+		case <-ctx.Done():
+			return
+		case <-time.After(retryIntervalGetServicePrimary):
+		}


Should we call s.updateServicePrimaryAddr(serviceName) at the beginning of the loop? Otherwise, we have to wait for retryIntervalGetServicePrimary before the first update.
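A minimal sketch of the ordering suggested here, assuming a generic `update` callback in place of `s.updateServicePrimaryAddr(serviceName)`; `runUpdateLoop` and `demoLoop` are hypothetical names. The update runs at the top of each iteration, so the first one happens without waiting for the interval.

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// runUpdateLoop calls update immediately, then once per interval,
// until ctx is done.
func runUpdateLoop(ctx context.Context, interval time.Duration, update func()) {
	for {
		update() // runs first, so there is no initial delay
		select {
		case <-ctx.Done():
			return
		case <-time.After(interval):
		}
	}
}

// demoLoop cancels the context from inside the first update and returns
// how many times update ran.
func demoLoop() int {
	ctx, cancel := context.WithCancel(context.Background())
	defer cancel()
	calls := 0
	runUpdateLoop(ctx, time.Hour, func() {
		calls++
		cancel()
	})
	return calls
}

func main() {
	fmt.Println(demoLoop()) // prints 1
}
```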

rleungx

There is still one unresolved comment: https://github.com/tikv/pd/pull/6279/files#r1161400175

rleungx · 2023-04-17T06:31:56Z

server/server.go

+
+// SetServicePrimaryAddr sets the primary address directly.
+// Note: This function is only used for test.
+func (s *Server) SetServicePrimaryAddr(serviceName, addr string) {


Where do we use it?

Signed-off-by: lhy1024 <[email protected]>

lhy1024 · 2023-04-17T12:23:19Z

/merge

ti-chi-bot · 2023-04-17T12:23:21Z

@lhy1024: It seems you want to merge this PR, I will help you trigger all the tests:

/run-all-tests

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the ti-community-infra/tichi repository.

ti-chi-bot · 2023-04-17T12:23:22Z

This pull request has been accepted and is ready to merge.

Commit hash: c5c6b5d

ti-chi-bot · 2023-04-17T12:23:36Z

@lhy1024: Your PR was out of date, I have automatically updated it for you.

If the CI test fails, you just re-trigger the test that failed and the bot will merge the PR for you after the CI passes.

…ot leader (tikv#6279) (tikv#64) * mcs: fix watch primary address revision Signed-off-by: lhy1024 <[email protected]> * add update cache when meets not leader Signed-off-by: lhy1024 <[email protected]> --------- Signed-off-by: lhy1024 <[email protected]>

* mcs: update client when meet transport is closing (tikv#6341) * mcs: update client when meet transport is closing Signed-off-by: lhy1024 <[email protected]> * address comments Signed-off-by: lhy1024 <[email protected]> * add retry Signed-off-by: lhy1024 <[email protected]> --------- Signed-off-by: lhy1024 <[email protected]> Co-authored-by: ti-chi-bot[bot] <108142056+ti-chi-bot[bot]@users.noreply.github.com> Signed-off-by: lhy1024 <[email protected]> * mcs: fix watch primary address revision and update cache when meets not leader (tikv#6279) ref tikv#5895 Signed-off-by: lhy1024 <[email protected]> Co-authored-by: Ti Chi Robot <[email protected]> Signed-off-by: lhy1024 <[email protected]> --------- Signed-off-by: lhy1024 <[email protected]> Co-authored-by: ti-chi-bot[bot] <108142056+ti-chi-bot[bot]@users.noreply.github.com> Co-authored-by: Ti Chi Robot <[email protected]>

ti-chi-bot added the do-not-merge/work-in-progress and release-note-none labels Apr 6, 2023

lhy1024 marked this pull request as ready for review April 6, 2023 15:16

ti-chi-bot removed the do-not-merge/work-in-progress label Apr 6, 2023

ti-chi-bot requested review from JmPotato and rleungx April 6, 2023 15:16

binshi-bing reviewed Apr 6, 2023

binshi-bing approved these changes Apr 6, 2023

ti-chi-bot added the status/LGT1 label Apr 6, 2023

rleungx reviewed Apr 7, 2023

lhy1024 force-pushed the fix-watch branch from f8fa907 to 742fb12 Compare April 7, 2023 04:43

mcs: fix watch primary address revision

2c43e7a

Signed-off-by: lhy1024 <[email protected]>

lhy1024 force-pushed the fix-watch branch from e9fb0f4 to 2c43e7a Compare April 7, 2023 07:24

lhy1024 force-pushed the fix-watch branch from aa353ad to 4cd6edd Compare April 7, 2023 15:26

lhy1024 mentioned this pull request Apr 7, 2023

mcs: fix forward test with pd mode client #6290

Merged

lhy1024 requested review from binshi-bing and rleungx April 7, 2023 15:29

add update cache when meets not leader

e34a074

Signed-off-by: lhy1024 <[email protected]>

lhy1024 force-pushed the fix-watch branch from 4cd6edd to e34a074 Compare April 7, 2023 16:38

lhy1024 changed the title ~~mcs: fix watch primary address revision~~ mcs: fix watch primary address revision and update cache when meets not leader Apr 7, 2023

binshi-bing requested changes Apr 7, 2023

ti-chi-bot removed the status/LGT1 label Apr 7, 2023

binshi-bing approved these changes Apr 8, 2023

ti-chi-bot added the status/LGT1 label Apr 8, 2023

JmPotato reviewed Apr 10, 2023

rleungx reviewed Apr 10, 2023

rleungx reviewed Apr 10, 2023

lhy1024 added 2 commits April 10, 2023 12:41

address comments

1b09ef4

Signed-off-by: lhy1024 <[email protected]>

fix possible block

642872e

Signed-off-by: lhy1024 <[email protected]>

lhy1024 mentioned this pull request Apr 11, 2023

mcs: add balancer for keyspace group #6274

Merged

address comments

719b6f9

Signed-off-by: lhy1024 <[email protected]>

JmPotato reviewed Apr 17, 2023

rleungx reviewed Apr 17, 2023

ti-chi-bot added the needs-rebase label Apr 17, 2023

lhy1024 added 2 commits April 17, 2023 15:16

address comments

4a6cb7f

Signed-off-by: lhy1024 <[email protected]>

Merge branch 'master' of http://github.com/tikv/pd into fix-watch

c5c6b5d

Signed-off-by: lhy1024 <[email protected]>

ti-chi-bot removed the needs-rebase label Apr 17, 2023

rleungx approved these changes Apr 17, 2023

ti-chi-bot added the status/LGT2 label and removed the status/LGT1 label Apr 17, 2023

ti-chi-bot added the status/can-merge label Apr 17, 2023

Merge branch 'master' into fix-watch

596d6e4

ti-chi-bot merged commit b9a03d2 into tikv:master Apr 17, 2023

ti-chi-bot bot pushed a commit that referenced this pull request May 6, 2023

mcs: fix forward test with pd mode client (#6290)

07399d3

ref #5895, ref #6279, close #6289 Signed-off-by: lhy1024 <[email protected]>

zeminzhou pushed a commit to zeminzhou/pd that referenced this pull request May 8, 2023

mcs: fix forward test with pd mode client (tikv#6290)

49c334f

ref tikv#5895, ref tikv#6279, close tikv#6289 Signed-off-by: lhy1024 <[email protected]>

zeminzhou pushed a commit to zeminzhou/pd that referenced this pull request May 10, 2023

mcs: fix forward test with pd mode client (tikv#6290)

95d1ea2

ref tikv#5895, ref tikv#6279, close tikv#6289 Signed-off-by: lhy1024 <[email protected]> Signed-off-by: zeminzhou <[email protected]>

rleungx pushed a commit to rleungx/pd that referenced this pull request Aug 2, 2023

mcs: fix forward test with pd mode client (tikv#6290)

37efd9b

ref tikv#5895, ref tikv#6279, close tikv#6289 Signed-off-by: lhy1024 <[email protected]>

lhy1024 added a commit to ti-chi-bot/pd that referenced this pull request Feb 27, 2025

fix and pick tikv#6341 tikv#6279 tikv#7327

8ec1daf

Signed-off-by: lhy1024 <[email protected]>
