support delay before history joins membership (#4582)
<!-- Describe what has changed in this PR -->
**What changed?**
A history instance can now wait for a configurable delay (defaulting to
zero) before it joins membership on startup.

<!-- Tell your future self why have you made these changes -->
**Why?**
In environments where the history service runs as a Kubernetes
Deployment, rolling restarts or image upgrades cause considerable shard
movement, because the Deployment simultaneously terminates one pod and
creates a new one. With a non-zero delay on the order of seconds, the
shard movement caused by the terminating pod can be separated from the
shard movement caused by the newly created pod. Overall, this reduces
the impact on user API calls during the change.

<!-- How have you verified this change? Tested locally? Added a unit
test? Checked in staging env? -->
**How did you test it?**
This has been tested in a staging environment.

<!-- Assuming the worst case, what can be broken when deploying this
change to production? -->
**Potential risks**
With the default setting of zero, no risk.

<!-- Is this PR a hotfix candidate or require that a notification be
sent to the broader community? (Yes/No) -->
**Is hotfix candidate?**
alfred-landrum authored and rodrigozhou committed Aug 7, 2023
1 parent a6dfbc6 commit 35f6670
Showing 2 changed files with 18 additions and 6 deletions.
12 changes: 7 additions & 5 deletions service/history/configs/config.go
@@ -57,11 +57,12 @@ type Config struct {
 	VisibilityDisableOrderByClause   dynamicconfig.BoolPropertyFnWithNamespaceFilter
 	VisibilityEnableManualPagination dynamicconfig.BoolPropertyFnWithNamespaceFilter

-	EmitShardLagLog       dynamicconfig.BoolPropertyFn
-	MaxAutoResetPoints    dynamicconfig.IntPropertyFnWithNamespaceFilter
-	ThrottledLogRPS       dynamicconfig.IntPropertyFn
-	EnableStickyQuery     dynamicconfig.BoolPropertyFnWithNamespaceFilter
-	ShutdownDrainDuration dynamicconfig.DurationPropertyFn
+	EmitShardLagLog            dynamicconfig.BoolPropertyFn
+	MaxAutoResetPoints         dynamicconfig.IntPropertyFnWithNamespaceFilter
+	ThrottledLogRPS            dynamicconfig.IntPropertyFn
+	EnableStickyQuery          dynamicconfig.BoolPropertyFnWithNamespaceFilter
+	ShutdownDrainDuration      dynamicconfig.DurationPropertyFn
+	StartupMembershipJoinDelay dynamicconfig.DurationPropertyFn

 	// HistoryCache settings
 	// Change of these configs require shard restart
@@ -335,6 +336,7 @@ func NewConfig(
 		EnablePersistencePriorityRateLimiting: dc.GetBoolProperty(dynamicconfig.HistoryEnablePersistencePriorityRateLimiting, true),
 		PersistenceDynamicRateLimitingParams: dc.GetMapProperty(dynamicconfig.HistoryPersistenceDynamicRateLimitingParams, dynamicconfig.DefaultDynamicRateLimitingParams),
 		ShutdownDrainDuration: dc.GetDurationProperty(dynamicconfig.HistoryShutdownDrainDuration, 0*time.Second),
+		StartupMembershipJoinDelay: dc.GetDurationProperty(dynamicconfig.HistoryStartupMembershipJoinDelay, 0*time.Second),
 		MaxAutoResetPoints: dc.GetIntPropertyFilteredByNamespace(dynamicconfig.HistoryMaxAutoResetPoints, DefaultHistoryMaxAutoResetPoints),
 		DefaultWorkflowTaskTimeout: dc.GetDurationPropertyFilteredByNamespace(dynamicconfig.DefaultWorkflowTaskTimeout, common.DefaultWorkflowTaskTimeout),
 		ContinueAsNewMinInterval: dc.GetDurationPropertyFilteredByNamespace(dynamicconfig.ContinueAsNewMinInterval, time.Second),
12 changes: 11 additions & 1 deletion service/history/service.go
@@ -113,7 +113,17 @@ func (s *Service) Start() {
 	// that we own. Ideally, then, we would start the GRPC server, and only then
 	// join membership. That's not possible with the GRPC interface, though, hence
 	// we start membership in a goroutine.
-	go s.membershipMonitor.Start()
+	go func() {
+		if delay := s.config.StartupMembershipJoinDelay(); delay > 0 {
+			// In some situations, like rolling upgrades of the history service,
+			// pausing before joining membership can help separate the shard movement
+			// caused by another history instance terminating with this instance starting.
+			logger.Info("history start: delaying before membership start",
+				tag.NewDurationTag("startupMembershipJoinDelay", delay))
+			time.Sleep(delay)
+		}
+		s.membershipMonitor.Start()
+	}()

 	logger.Info("Starting to serve on history listener")
 	if err := s.server.Serve(s.grpcListener); err != nil {
