clientv3: fix balancer and upgrade gRPC v1.7.x #8828
Comments
It seems like either a bug on the gRPC side or an issue in our analysis. Basically, after step 15 in 1.7.x, steps 16, 17, and 18 are a side effect of the previous notification. So what will happen after 18?
It starts over by notifying
With gRPC v1.6.0, the balancer unpins on down(errConnDrain). With gRPC v1.7.2, the balancer waits until the state change is propagated to the balancer, and calls down(errConnClosing).
I will investigate more.
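To make that timing difference concrete, here is a minimal sketch of the Up/down contract the discussion revolves around, written against the old grpc.Balancer callback style. The sketchBalancer type and its fields are hypothetical, not etcd's health_balancer.go: the endpoint stays pinned until gRPC invokes the returned down callback, which under v1.6.x happened promptly with errConnDrain and under v1.7.x only after the SHUTDOWN state change, with errConnClosing.

```go
package balancersketch

import (
	"sync"

	"google.golang.org/grpc"
)

// sketchBalancer is a hypothetical, stripped-down pinning balancer used only
// to illustrate the Up/down contract; it is not etcd's health balancer.
type sketchBalancer struct {
	mu     sync.RWMutex
	pinned string        // currently pinned address, "" if none
	upc    chan struct{} // closed while an address is pinned
}

func newSketchBalancer() *sketchBalancer {
	return &sketchBalancer{upc: make(chan struct{})}
}

// Up follows the old grpc.Balancer contract: gRPC calls it when a connection
// to addr becomes READY and later calls the returned func when it goes away.
func (b *sketchBalancer) Up(addr grpc.Address) func(error) {
	b.mu.Lock()
	if b.pinned == "" {
		b.pinned = addr.Addr
		close(b.upc) // wake anyone waiting for a pinned address
	}
	b.mu.Unlock()

	return func(err error) {
		// With gRPC v1.6.x this fired almost immediately with errConnDrain
		// once the balancer notified a new address list; with v1.7.x it only
		// fires with errConnClosing after the sub-connection's SHUTDOWN state
		// change has propagated, so the unpin below happens noticeably later.
		b.mu.Lock()
		if b.pinned == addr.Addr {
			b.pinned = ""
			b.upc = make(chan struct{}) // nothing is pinned anymore
		}
		b.mu.Unlock()
	}
}
```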
Eventually A should be unpinned due to drain, then B should be pinned. We should figure out why this is not happening instead of "fixing" the race.
Correct. Problem is what happens afterwards.
Will it be pinned again?
Yes, but not always. Balancer could pin
But then B should be pinned again, and things should stabilize, right?
It was pinning. Since this is after
I am more curious whether B will eventually be pinned regardless of the timeout. If that is the case, then all we need to fight against is the timing issue.
I believe we cannot guarantee that, because both A and B are unhealthy for the same reason, errConnClosing, from the balancer's viewpoint. We notify both A and B.
@gyuho OK. Let us stop notifying gRPC of the endpoints that are in the unhealthy list. It should solve this issue, and it is in line with what we are going to do in the future too. Sounds reasonable?
Here's the problem when notifying only healthy addresses (even with gRPC v1.6.x):
The problem with step 9 is that B was never pinned, so the balancer thinks it's healthy. Or even if B was marked unhealthy, it might have been removed from the unhealthy list after a few seconds. The problem with step 16 is that it's still notifying B, rather than A and C (since A and C are still marked as unhealthy).
@gyuho this is actually OK. Retrying the l-get or enabling keepalive should fix it.
I believe even a retry wouldn't help. In the case above, the balancer is stuck at step 18 (when notifying B). Even after A and C get removed from the unhealthy list, it is still stuck waiting for a new endpoint to come up. And I'm not sure how keepalive would help when there's no endpoint to ping.
@gyuho well, maybe you accidentally removed the notify loop at https://github.com/coreos/etcd/blob/master/clientv3/health_balancer.go#L153? This is very important, so that when A and C are back, they will be sent to gRPC.
@xiang90 Good catch! Indeed, my patch was handling the empty pinned address wrong in that code path. Now the test passes. Will run more tests and push a PR.
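Putting that conclusion together, here is a rough sketch of the agreed direction, assuming hypothetical names (healthySketch, waitTime) rather than the actual patch: notify gRPC only with endpoints that are not currently marked unhealthy, and keep a notify loop running so that endpoints whose unhealthy mark has expired (e.g. A and C recovering) are handed back to gRPC.

```go
package balancersketch

import (
	"sync"
	"time"

	"google.golang.org/grpc"
)

// healthySketch is a hypothetical balancer fragment that tracks endpoints
// which recently failed and periodically re-notifies gRPC as those marks age out.
type healthySketch struct {
	mu        sync.RWMutex
	endpoints []string
	unhealthy map[string]time.Time // endpoint -> time it was marked unhealthy
	notifyCh  chan []grpc.Address
	stopc     chan struct{}
	waitTime  time.Duration // how long an endpoint stays on the unhealthy list
}

// healthyAddrs returns the endpoints that are not currently marked unhealthy.
func (h *healthySketch) healthyAddrs() []grpc.Address {
	h.mu.RLock()
	defer h.mu.RUnlock()
	addrs := make([]grpc.Address, 0, len(h.endpoints))
	for _, ep := range h.endpoints {
		if t, bad := h.unhealthy[ep]; bad && time.Since(t) < h.waitTime {
			continue // still considered unhealthy; do not advertise it
		}
		addrs = append(addrs, grpc.Address{Addr: ep})
	}
	return addrs
}

// notifyLoop keeps pushing the current healthy set, so endpoints coming off
// the unhealthy list (e.g. A and C recovering) are handed back to gRPC.
func (h *healthySketch) notifyLoop() {
	ticker := time.NewTicker(time.Second)
	defer ticker.Stop()
	for {
		select {
		case <-h.stopc:
			return
		case <-ticker.C:
		}
		addrs := h.healthyAddrs()
		if len(addrs) == 0 {
			continue // nothing healthy right now; try again on the next tick
		}
		select {
		case h.notifyCh <- addrs:
		case <-h.stopc:
			return
		}
	}
}
```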
clientv3: change balancer for gRPC v1.7.x
v1.6.0

1. A and B are up
2. A is pinned
3. updateNotifyLoop: case upc == nil, so b.notifyCh <- A
4. A becomes blackholed
5. Request to A times out
6. b.notifyCh <- [] in retry.go to drain connection A
7. grpc.tearDown(errConnDrain) on A
8. down(errConnDrain) on A
9. A is unpinned
10. B is pinned
11. updateNotifyLoop: case upc == nil, so b.notifyCh <- B
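The updateNotifyLoop / case upc == nil steps above (3 and 11) reduce to a loop of roughly the following shape; this is a sketch with hypothetical field names, not the actual clientv3 balancer code. While nothing is pinned, it keeps handing the endpoint list to gRPC so a connection can be established; once something is pinned, it waits for the next change before notifying again.

```go
package balancersketch

import (
	"sync"

	"google.golang.org/grpc"
)

// notifySketch illustrates the shape of the updateNotifyLoop referenced in
// steps 3 and 11; the field names are hypothetical.
type notifySketch struct {
	mu       sync.RWMutex
	pinned   string
	addrs    []grpc.Address
	notifyCh chan []grpc.Address
	updatec  chan struct{} // receives a signal on pin/unpin/endpoint changes
	stopc    chan struct{}
}

func (n *notifySketch) updateNotifyLoop() {
	for {
		n.mu.RLock()
		pinned, addrs := n.pinned, n.addrs
		n.mu.RUnlock()

		if pinned == "" {
			// roughly the "case upc == nil" branch: nothing is pinned, so
			// push the endpoint list and let gRPC establish a connection.
			select {
			case n.notifyCh <- addrs:
			case <-n.stopc:
				return
			}
		}

		// block until the pin state or the endpoint list changes
		select {
		case <-n.updatec:
		case <-n.stopc:
			return
		}
	}
}
```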
v1.7.x
No more grpc.tearDown(errConnDrain), only subConnection.down(errConnClosing). So the timing has changed: the custom balancer needs to wait until errConnClosing to unpin an endpoint.

1. A and B are up
2. A is pinned
3. updateNotifyLoop: case upc == nil, so b.notifyCh <- A
4. A becomes blackholed
5. Request to A times out
6. b.notifyCh <- [] in retry.go to drain connection A
7. (ac *addrConn).tearDown(errConnDrain) in clientconn.go
8. handleSubConnStateChange(A, SHUTDOWN)
9. A connection state changes from READY to SHUTDOWN
10. updateNotifyLoop: case upc == nil, so b.notifyCh <- A <=== need fix!
11. down(errConnClosing) on A
12. A is unpinned
13. B is pinned
14. lbWatcher: RemoveSubConn(B) from the step 10 notification <=== need fix!
15. updateNotifyLoop: case upc == nil, so b.notifyCh <- B
16. handleSubConnStateChange(B, SHUTDOWN)
17. down(errConnClosing) on B
18. B is unpinned
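The lbWatcher / RemoveSubConn(B) behavior in step 14 comes from gRPC diffing the newly notified address list against its existing sub-connections: anything connected but no longer advertised is torn down. A tiny, self-contained illustration of that diff idea (not gRPC's actual code):

```go
package balancersketch

// addrsToTearDown sketches the address-diff idea behind step 14: whatever is
// currently connected but missing from the newly notified list gets torn down.
func addrsToTearDown(connected, notified []string) []string {
	keep := make(map[string]bool, len(notified))
	for _, a := range notified {
		keep[a] = true
	}
	var drop []string
	for _, a := range connected {
		if !keep[a] {
			drop = append(drop, a)
		}
	}
	return drop
}
```

With connected = [A, B] and the step 10 notification containing only A, this returns [B], which is why B's sub-connection is removed just as B is about to become the pinned endpoint.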
So with gRPC v1.7.x, we are sending A while A gets blackholed after the context timeout but before A gets unpinned. While A is pinned, if A is notified again, lbWatcher removes the B sub-connection just when B is the next endpoint to be pinned.

Balancer needs fix by either ... ([] to notify channel).

This gRPC change has been failing TestBalancerUnderBlackholeNoKeepAlive*.
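As a sketch of the direction the summary points at (hypothetical failSketch type and endpointError helper, not the actual fix in this PR): when a request against the pinned endpoint times out, mark that endpoint unhealthy and notify gRPC with the remaining endpoints, instead of advertising the blackholed address again while it is still pinned.

```go
package balancersketch

import (
	"context"
	"sync"
	"time"

	"google.golang.org/grpc"
)

// failSketch shows one way to react to a timed-out request against the pinned
// endpoint without re-advertising it; names are hypothetical, not the PR's code.
type failSketch struct {
	mu        sync.Mutex
	endpoints []string
	unhealthy map[string]time.Time
	notifyCh  chan []grpc.Address
}

// endpointError marks ep unhealthy and notifies gRPC with the remaining
// endpoints, so the blackholed address is not handed back while still pinned.
func (f *failSketch) endpointError(ctx context.Context, ep string) {
	if ctx.Err() != context.DeadlineExceeded {
		return // only react to timeouts here; other failures handled elsewhere
	}
	f.mu.Lock()
	f.unhealthy[ep] = time.Now()
	addrs := make([]grpc.Address, 0, len(f.endpoints))
	for _, e := range f.endpoints {
		if _, bad := f.unhealthy[e]; !bad {
			addrs = append(addrs, grpc.Address{Addr: e})
		}
	}
	f.mu.Unlock()

	select {
	case f.notifyCh <- addrs:
	default:
		// notify channel not ready; a periodic notify loop (sketched earlier)
		// can deliver the updated list later.
	}
}
```

This avoids the step 10 / step 14 interaction above, where re-notifying A causes B's sub-connection to be removed.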