
support delay before history joins membership #4582

Merged
alfred-landrum merged 1 commit into master from alfred/startup-membership-join-delay on Jul 6, 2023

Conversation

alfred-landrum
Contributor

(On top of #4510)

What changed?
When a history instance starts, support a configurable (defaulting to zero) delay before joining membership.
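
A minimal sketch of the mechanism, assuming hypothetical names (`Config`, `StartupMembershipJoinDelay`, `joinMembership`) that are illustrative rather than the actual identifiers in this PR:

```go
package history

import (
	"log"
	"time"
)

// Config holds the hypothetical startup settings for a history instance.
// StartupMembershipJoinDelay defaults to zero, which preserves the current
// join-immediately behavior.
type Config struct {
	StartupMembershipJoinDelay time.Duration
}

// Start brings the instance up and, if a delay is configured, waits before
// announcing itself to the membership ring via joinMembership.
func Start(cfg Config, joinMembership func() error) error {
	if cfg.StartupMembershipJoinDelay > 0 {
		log.Printf("delaying membership join by %v", cfg.StartupMembershipJoinDelay)
		time.Sleep(cfg.StartupMembershipJoinDelay)
	}
	return joinMembership()
}
```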

Why?
In environments where the history service runs via a Kubernetes Deployment, rolling restarts or image upgrades cause considerable shard movement, because the Deployment simultaneously terminates one pod and creates a new one. By configuring a non-zero delay on the order of seconds, the shard movement caused by the terminating pod can be separated from the shard movement caused by the newly created pod. Overall, this reduces the impact on user API calls during the change.
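
For illustration only, here is one hedged way an operator-facing knob for this could be wired; the environment variable name and helper below are assumptions, not the configuration mechanism actually used by this PR:

```go
package main

import (
	"fmt"
	"os"
	"time"
)

// delayFromEnv reads STARTUP_MEMBERSHIP_JOIN_DELAY (an illustrative name) so a
// Kubernetes Deployment could set a per-environment delay, e.g. "5s". Empty or
// invalid values fall back to zero, i.e. join membership immediately.
func delayFromEnv() time.Duration {
	raw := os.Getenv("STARTUP_MEMBERSHIP_JOIN_DELAY")
	if raw == "" {
		return 0
	}
	d, err := time.ParseDuration(raw)
	if err != nil {
		return 0
	}
	return d
}

func main() {
	fmt.Println("configured membership join delay:", delayFromEnv())
}
```

With a delay of a few seconds, the new pod's join lands after the shards from the terminating pod have already moved, so each ownership change settles only once.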

How did you test it?
This has been tested in a staging environment.

Potential risks
With the default setting of zero, there is no risk.

Is hotfix candidate?

@dnr
Member

dnr commented Jul 6, 2023

My intuition would be the opposite: that doing the stop and start at the same time (as close as possible) would be better since there would be only one settling period instead of two, and there would be fewer cases of shards moving and then moving again seconds later. Is the problem that it's too hard to get the events to happen close enough for that to be true, and a long delay is better than a medium delay?

Base automatically changed from alfred/ringpop-at-service-start to master on July 6, 2023 13:51
@alfred-landrum
Contributor Author

> ... Is the problem that it's too hard to get the events to happen close enough for that to be true, and a long delay is better than a medium delay?

It appears so. In my test environment, requests show higher latencies (at the 3+ 9's of measurement) when termination and creation happen close to each other. From looking at the logs, some requests bounce between the "temporary" shard owners when they land on a history instance that has only seen the impact of either the newly created pod or the terminating pod.

@alfred-landrum alfred-landrum force-pushed the alfred/startup-membership-join-delay branch from 2b0c4e1 to 5dc16b7 on July 6, 2023 14:00
Member

@dnr dnr left a comment


I wonder if this (and related work) is worth doing for matching too. The impact is probably less there, but avoiding flipping back and forth should also help somewhat.

@alfred-landrum
Contributor Author

> I wonder if this (and related work) is worth doing for matching too. The impact is probably less there, but avoiding flipping back and forth should also help somewhat.

It might be; delays related to shard ownership movement can easily show up in the high 9's of end-user request latency. Would the impact be similar for matching?

@alfred-landrum alfred-landrum merged commit a84c2e0 into master Jul 6, 2023
@alfred-landrum alfred-landrum deleted the alfred/startup-membership-join-delay branch July 6, 2023 21:02
@dnr
Copy link
Member

dnr commented Jul 6, 2023

> > I wonder if this (and related work) is worth doing for matching too. The impact is probably less there, but avoiding flipping back and forth should also help somewhat.
>
> It might be; delays related to shard ownership movement can easily show up in the high 9's of end-user request latency. Would the impact be similar for matching?

It's not quite the same since, in general, matching is not in the path of user requests (except query, and possibly now update). Also, matching should be faster to start up. But of course, reducing latency for dispatching tasks is always good. It would be an interesting conversation to have once you're finished with this series of work.

alfred-landrum added a commit that referenced this pull request Jul 28, 2023
alfred-landrum added a commit that referenced this pull request Aug 1, 2023
rodrigozhou pushed a commit that referenced this pull request Aug 7, 2023