
First pass at graceful shutdown handlers #3134

Merged
merged 7 commits into cadence-workflow:master on Mar 26, 2020

Conversation

venkat1109
Contributor

What changed?
This patch implements a mechanism to gracefully drain traffic during deployments to avoid availability drops. The mechanism varies somewhat based on the role.

The shutdown protocol on some host H looks like the following (a minimal sketch of each role's sequence follows its list):

History

  1. Remove H from the membership ring. This will make other members believe that H is unhealthy
  2. Wait for other members to discover that H is unhealthy
  3. Stop acquiring new shards on H (shards are normally acquired periodically or on other membership changes)
  4. Wait for shard ownership to transfer (and in-flight requests to drain) while still accepting new requests on H
  5. Reject all requests arriving at the RPC handler on H to avoid taking on more work
  6. Wait for the grace period
  7. Force-stop the whole world and exit
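
A minimal sketch of this sequence, assuming hypothetical `MembershipRing`, `ShardController`, and `RPCHandler` interfaces and illustrative wait durations (the real Cadence types and timeouts differ):

```go
package history

import (
	"os"
	"time"
)

// Hypothetical interfaces standing in for the real Cadence components;
// the actual types in service/history differ.
type MembershipRing interface {
	EvictSelf() // leave the ring so peers mark this host unhealthy
}

type ShardController interface {
	DisableShardAcquisition() // stop picking up new shards
	Stop()                    // force-stop shard processing
}

type RPCHandler interface {
	StartRejectingRequests() // fail new requests at the handler
	Stop()                   // tear down the transport
}

// gracefulStopHistory walks through the seven steps described above.
// The durations are illustrative, not the values used by Cadence.
func gracefulStopHistory(
	ring MembershipRing,
	shards ShardController,
	rpc RPCHandler,
	failureDetectionTime time.Duration, // time for peers to notice the eviction
	shardDrainTime time.Duration,       // time for shard ownership to move away
	gracePeriod time.Duration,          // final drain of in-flight work
) {
	ring.EvictSelf()                 // 1. remove H from the membership ring
	time.Sleep(failureDetectionTime) // 2. let peers discover H is "unhealthy"
	shards.DisableShardAcquisition() // 3. stop acquiring new shards
	time.Sleep(shardDrainTime)       // 4. wait for shard ownership to transfer
	rpc.StartRejectingRequests()     // 5. reject new requests at the RPC handler
	time.Sleep(gracePeriod)          // 6. wait for the grace period
	rpc.Stop()                       // 7. force-stop the world and exit
	shards.Stop()
	os.Exit(0)
}
```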

Frontend

  1. Fail the RPC health check; this causes the client-side load balancer to stop forwarding requests to this node
  2. Wait for the failure detection time, typically around 5s or less
  3. Stop taking new RPC requests by returning InternalServiceError
  4. Wait for a second
  5. Stop the world and exit
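
A minimal sketch of the frontend sequence, assuming a hypothetical `drainingHandler` with atomic health and drain flags; the real handler wraps every frontend API and returns the service's InternalServiceError type rather than a plain error:

```go
package frontend

import (
	"errors"
	"sync/atomic"
	"time"
)

// drainingHandler is a hypothetical stand-in for the frontend RPC handler.
type drainingHandler struct {
	healthy  int32 // 1 while serving, 0 once shutdown starts
	draining int32 // 1 once new requests should be rejected
}

// HealthCheck is what the client-side load balancer polls (step 1 fails it).
func (h *drainingHandler) HealthCheck() bool {
	return atomic.LoadInt32(&h.healthy) == 1
}

// StartWorkflowExecution models any frontend RPC: once draining, it returns
// an error in place of the service's InternalServiceError type (step 3).
func (h *drainingHandler) StartWorkflowExecution() error {
	if atomic.LoadInt32(&h.draining) == 1 {
		return errors.New("InternalServiceError: host is shutting down")
	}
	// ... normal request handling ...
	return nil
}

// GracefulStop runs the frontend sequence from the description above.
func (h *drainingHandler) GracefulStop(stopAll func()) {
	atomic.StoreInt32(&h.healthy, 0)  // 1. fail the RPC health check
	time.Sleep(5 * time.Second)       // 2. wait ~5s for failure detection
	atomic.StoreInt32(&h.draining, 1) // 3. reject new RPCs with an error
	time.Sleep(1 * time.Second)       // 4. wait for a second
	stopAll()                         // 5. stop the world and exit
}
```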

Matching

  1. Remove H from the membership ring. This will make other members believe that H is unhealthy
  2. Wait for other members to discover that H is unhealthy and for traffic to drain
  3. Stop the world and exit
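
A correspondingly simpler sketch for matching, reusing the same hypothetical membership interface and an illustrative drain duration:

```go
package matching

import "time"

// MembershipRing is the same hypothetical membership interface used above.
type MembershipRing interface {
	EvictSelf()
}

// gracefulStopMatching sketches the matching drain: leave the ring, wait for
// peers to notice and for in-flight traffic to drain, then stop everything.
func gracefulStopMatching(ring MembershipRing, drainTime time.Duration, stopAll func()) {
	ring.EvictSelf()      // 1. remove H from the membership ring
	time.Sleep(drainTime) // 2. wait for discovery and for traffic to drain
	stopAll()             // 3. stop the world and exit
}
```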

Why?
A graceful shutdown mechanism is needed to reduce availability dips during deployments.

How did you test it?
Tested locally. Also tested on a staging environment.

Potential risks
Graceful shutdown fails to work and availability dips continue during deployments.
A bug in the code causes the process to crash during shutdown.

First pass at fixing #1454

@venkat1109 venkat1109 requested a review from a team March 25, 2020 00:27
@venkat1109 venkat1109 self-assigned this Mar 25, 2020
@venkat1109 venkat1109 force-pushed the v_graceful_shutdown branch from 098e109 to 23321e0 on March 25, 2020 00:28
@coveralls

coveralls commented Mar 25, 2020

Coverage Status

Coverage decreased (-0.5%) to 67.032% when pulling 1603000 on venkat1109:v_graceful_shutdown into a5111a6 on uber:master.

Review comments (resolved) on: service/frontend/accessControlledHandler.go, service/frontend/dcRedirectionHandler.go, service/history/handler.go, service/frontend/workflowHandler.go, service/history/service.go
@venkat1109 venkat1109 force-pushed the v_graceful_shutdown branch from dcbe70d to 01e9f26 on March 25, 2020 19:09
@venkat1109 venkat1109 force-pushed the v_graceful_shutdown branch from dec5a0f to 9d0c94b on March 25, 2020 23:18
@venkat1109 venkat1109 force-pushed the v_graceful_shutdown branch from 9d0c94b to 1603000 on March 26, 2020 00:00
@venkat1109 venkat1109 merged commit 9ddcb08 into cadence-workflow:master Mar 26, 2020
@venkat1109 venkat1109 deleted the v_graceful_shutdown branch March 26, 2020 00:43
yux0 pushed a commit that referenced this pull request Apr 14, 2020