
First pass at graceful shutdown handlers #3134

Merged
merged 7 commits into cadence-workflow:master on Mar 26, 2020

Conversation

venkat1109
Contributor

What changed?
This patch implements a mechanism to gracefully drain traffic during deployments to avoid availability drops. The mechanism varies somewhat based on the role.

The shutdown protocol on some host H looks like the following (a minimal sketch of each role's sequence follows its list):

History

  1. Remove H from the membership ring. This will make other members believe that H is unhealthy
  2. Wait for other members to discover that H is unhealthy
  3. Stop acquiring new shards on H (shards are normally acquired periodically or on other membership changes)
  4. Wait for shard ownership to transfer (and in-flight requests to drain) while still accepting new requests on H
  5. Reject all requests arriving at the RPC handler on H to avoid taking on more work
  6. Wait for the grace period
  7. Force-stop the whole world and exit
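
A minimal sketch of this sequence, assuming hypothetical `MembershipRing`, `ShardController`, and `RPCHandler` interfaces and illustrative wait durations (the real Cadence types and timeouts differ):

```go
package history

import (
	"os"
	"time"
)

// Hypothetical interfaces standing in for the real Cadence components;
// the actual types in service/history differ.
type MembershipRing interface {
	EvictSelf() // leave the ring so peers mark this host unhealthy
}

type ShardController interface {
	DisableShardAcquisition() // stop picking up new shards
	Stop()                    // force-stop shard processing
}

type RPCHandler interface {
	StartRejectingRequests() // fail new requests at the handler
	Stop()                   // tear down the transport
}

// gracefulStopHistory walks through the seven steps described above.
// The durations are illustrative, not the values used by Cadence.
func gracefulStopHistory(
	ring MembershipRing,
	shards ShardController,
	rpc RPCHandler,
	failureDetectionTime time.Duration, // time for peers to notice the eviction
	shardDrainTime time.Duration,       // time for shard ownership to move away
	gracePeriod time.Duration,          // final drain of in-flight work
) {
	ring.EvictSelf()                 // 1. remove H from the membership ring
	time.Sleep(failureDetectionTime) // 2. let peers discover H is "unhealthy"
	shards.DisableShardAcquisition() // 3. stop acquiring new shards
	time.Sleep(shardDrainTime)       // 4. wait for shard ownership to transfer
	rpc.StartRejectingRequests()     // 5. reject new requests at the RPC handler
	time.Sleep(gracePeriod)          // 6. wait for the grace period
	rpc.Stop()                       // 7. force-stop the world and exit
	shards.Stop()
	os.Exit(0)
}
```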

Frontend

  1. Fail the RPC health check; this causes the client-side load balancer to stop forwarding requests to this node
  2. Wait for the failure detection time, typically around 5s or less
  3. Stop taking new RPC requests by returning InternalServiceError
  4. Wait for a second
  5. Stop the world and exit
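
A minimal sketch of the frontend sequence, assuming a hypothetical `drainingHandler` with atomic health and drain flags; the real handler wraps every frontend API and returns the service's InternalServiceError type rather than a plain error:

```go
package frontend

import (
	"errors"
	"sync/atomic"
	"time"
)

// drainingHandler is a hypothetical stand-in for the frontend RPC handler.
type drainingHandler struct {
	healthy  int32 // 1 while serving, 0 once shutdown starts
	draining int32 // 1 once new requests should be rejected
}

// HealthCheck is what the client-side load balancer polls (step 1 fails it).
func (h *drainingHandler) HealthCheck() bool {
	return atomic.LoadInt32(&h.healthy) == 1
}

// StartWorkflowExecution models any frontend RPC: once draining, it returns
// an error in place of the service's InternalServiceError type (step 3).
func (h *drainingHandler) StartWorkflowExecution() error {
	if atomic.LoadInt32(&h.draining) == 1 {
		return errors.New("InternalServiceError: host is shutting down")
	}
	// ... normal request handling ...
	return nil
}

// GracefulStop runs the frontend sequence from the description above.
func (h *drainingHandler) GracefulStop(stopAll func()) {
	atomic.StoreInt32(&h.healthy, 0)  // 1. fail the RPC health check
	time.Sleep(5 * time.Second)       // 2. wait ~5s for failure detection
	atomic.StoreInt32(&h.draining, 1) // 3. reject new RPCs with an error
	time.Sleep(1 * time.Second)       // 4. wait for a second
	stopAll()                         // 5. stop the world and exit
}
```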

Matching

  1. Remove H from the membership ring. This will make other members believe that H is unhealthy
  2. Wait for other members to discover that H is unhealthy and for traffic to drain
  3. Stop the world and exit
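
A correspondingly simpler sketch for matching, reusing the same hypothetical membership interface and an illustrative drain duration:

```go
package matching

import "time"

// MembershipRing is the same hypothetical membership interface used above.
type MembershipRing interface {
	EvictSelf()
}

// gracefulStopMatching sketches the matching drain: leave the ring, wait for
// peers to notice and for in-flight traffic to drain, then stop everything.
func gracefulStopMatching(ring MembershipRing, drainTime time.Duration, stopAll func()) {
	ring.EvictSelf()      // 1. remove H from the membership ring
	time.Sleep(drainTime) // 2. wait for discovery and for traffic to drain
	stopAll()             // 3. stop the world and exit
}
```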

Why?
A graceful shutdown mechanism is needed to reduce availability dips during deployments.

How did you test it?
Tested locally. Also tested on a staging environment.

Potential risks
Graceful shutdown fails to work and availability dips continue during deployments.
A bug in the code causes the process to crash during shutdown.

First pass at fixing #1454

@venkat1109 venkat1109 requested a review from a team March 25, 2020 00:27
@venkat1109 venkat1109 self-assigned this Mar 25, 2020
@venkat1109 venkat1109 force-pushed the v_graceful_shutdown branch from 098e109 to 23321e0 on March 25, 2020 00:28
@coveralls

coveralls commented Mar 25, 2020

Coverage Status

Coverage decreased (-0.5%) to 67.032% when pulling 1603000 on venkat1109:v_graceful_shutdown into a5111a6 on uber:master.

Review comments (resolved) on: service/frontend/accessControlledHandler.go, service/frontend/dcRedirectionHandler.go, service/history/handler.go, service/frontend/workflowHandler.go, service/history/service.go
@venkat1109 venkat1109 force-pushed the v_graceful_shutdown branch from dcbe70d to 01e9f26 on March 25, 2020 19:09
@venkat1109 venkat1109 force-pushed the v_graceful_shutdown branch from dec5a0f to 9d0c94b on March 25, 2020 23:18
@venkat1109 venkat1109 force-pushed the v_graceful_shutdown branch from 9d0c94b to 1603000 on March 26, 2020 00:00
@venkat1109 venkat1109 merged commit 9ddcb08 into cadence-workflow:master Mar 26, 2020
@venkat1109 venkat1109 deleted the v_graceful_shutdown branch March 26, 2020 00:43
yux0 pushed a commit that referenced this pull request Apr 14, 2020