
support delay before history joins membership #4582

Merged
alfred-landrum merged 1 commit into master from alfred/startup-membership-join-delay on Jul 6, 2023

Conversation

alfred-landrum
Contributor

(On top of #4510)

What changed?
When a history instance starts, support a configurable (defaulting to zero) delay before joining membership.
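
A minimal sketch of the mechanism, assuming hypothetical names (`Config`, `StartupMembershipJoinDelay`, `joinMembership`) that are illustrative rather than the actual identifiers in this PR:

```go
package history

import (
	"log"
	"time"
)

// Config holds the hypothetical startup settings for a history instance.
// StartupMembershipJoinDelay defaults to zero, which preserves the current
// join-immediately behavior.
type Config struct {
	StartupMembershipJoinDelay time.Duration
}

// Start brings the instance up and, if a delay is configured, waits before
// announcing itself to the membership ring via joinMembership.
func Start(cfg Config, joinMembership func() error) error {
	if cfg.StartupMembershipJoinDelay > 0 {
		log.Printf("delaying membership join by %v", cfg.StartupMembershipJoinDelay)
		time.Sleep(cfg.StartupMembershipJoinDelay)
	}
	return joinMembership()
}
```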

Why?
In environments where the history service runs via a Kubernetes Deployment, rolling restarts or image upgrades cause considerable shard movement, because the Deployment simultaneously terminates one pod and creates a new one. By configuring a non-zero delay on the order of seconds, the shard movement caused by the terminating pod can be separated from the shard movement caused by the newly created pod. Overall, this reduces the impact on user API calls during the change.
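
For illustration only, here is one hedged way an operator-facing knob for this could be wired; the environment variable name and helper below are assumptions, not the configuration mechanism actually used by this PR:

```go
package main

import (
	"fmt"
	"os"
	"time"
)

// delayFromEnv reads STARTUP_MEMBERSHIP_JOIN_DELAY (an illustrative name) so a
// Kubernetes Deployment could set a per-environment delay, e.g. "5s". Empty or
// invalid values fall back to zero, i.e. join membership immediately.
func delayFromEnv() time.Duration {
	raw := os.Getenv("STARTUP_MEMBERSHIP_JOIN_DELAY")
	if raw == "" {
		return 0
	}
	d, err := time.ParseDuration(raw)
	if err != nil {
		return 0
	}
	return d
}

func main() {
	fmt.Println("configured membership join delay:", delayFromEnv())
}
```

With a delay of a few seconds, the new pod's join lands after the shards from the terminating pod have already moved, so each ownership change settles only once.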

How did you test it?
This has been tested in a staging environment.

Potential risks
With the default setting of zero, there is no risk.

Is hotfix candidate?

@dnr
Member

dnr commented Jul 6, 2023

My intuition would be the opposite: that doing the stop and start at the same time (as close as possible) would be better since there would be only one settling period instead of two, and there would be fewer cases of shards moving and then moving again seconds later. Is the problem that it's too hard to get the events to happen close enough for that to be true, and a long delay is better than a medium delay?

Base automatically changed from alfred/ringpop-at-service-start to master on July 6, 2023 13:51
@alfred-landrum
Contributor Author

> ... Is the problem that it's too hard to get the events to happen close enough for that to be true, and a long delay is better than a medium delay?

It appears so. In my test environment, requests show higher latencies (at the 3+ 9's of measurement) when termination and creation happen close to each other. From looking at the logs, some requests bounce between the "temporary" shard owners when they land on a history instance that has only seen the impact of either the newly created pod or the terminating pod.

@alfred-landrum alfred-landrum force-pushed the alfred/startup-membership-join-delay branch from 2b0c4e1 to 5dc16b7 on July 6, 2023 14:00
Member

@dnr dnr left a comment


I wonder if this (and related work) is worth doing for matching too. The impact is probably less there, but avoiding flipping back and forth should also help somewhat.

@alfred-landrum
Contributor Author

> I wonder if this (and related work) is worth doing for matching too. The impact is probably less there, but avoiding flipping back and forth should also help somewhat.

It might be; delays related to shard ownership movement can easily show up in the high 9's of end-user request latency. Would the impact be similar for matching?

@alfred-landrum alfred-landrum merged commit a84c2e0 into master Jul 6, 2023
@alfred-landrum alfred-landrum deleted the alfred/startup-membership-join-delay branch July 6, 2023 21:02
@dnr
Copy link
Member

dnr commented Jul 6, 2023

> > I wonder if this (and related work) is worth doing for matching too. The impact is probably less there, but avoiding flipping back and forth should also help somewhat.
>
> It might be; delays related to shard ownership movement can easily show up in the high 9's of end-user request latency. Would the impact be similar for matching?

It's not quite the same since, in general, matching is not in the path of user requests (except query, and possibly now update). Also, matching should be faster to start up. But of course, reducing latency for dispatching tasks is always good. It would be an interesting conversation to have once you're finished with this series of work.

alfred-landrum added a commit that referenced this pull request Jul 28, 2023
alfred-landrum added a commit that referenced this pull request Aug 1, 2023
rodrigozhou pushed a commit that referenced this pull request Aug 7, 2023