
Zero-downtime blue-green updates with AWS CNI & AWS LoadBalancer Controller #1283

Closed
jessesuen opened this issue Jun 16, 2021 · 2 comments

Labels: enhancement (New feature or request), workaround (There's a workaround, might not be great, but exists)


jessesuen commented Jun 16, 2021

Summary

Currently, the way we perform blue-green is as follows:

  1. scale up the new V2 stack
  2. wait for all pods of V2 to be available
  3. switch the label selector of the active Service from the V1 stack to the V2 stack
  4. scale down old V1 stack (after scaleDownDelaySeconds - default 30s)
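
For reference, a minimal sketch of this blue-green configuration (resource names, image, and replica count are illustrative, not taken from any real manifest):

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: my-app                        # hypothetical name
spec:
  replicas: 3
  selector:
    matchLabels:
      app: my-app
  template:
    metadata:
      labels:
        app: my-app
    spec:
      containers:
      - name: my-app
        image: example/my-app:v2      # bumping this image creates the V2 stack
  strategy:
    blueGreen:
      activeService: my-app-active    # Service whose selector is switched from V1 to V2
      autoPromotionEnabled: true
      scaleDownDelaySeconds: 30       # delay before the old V1 ReplicaSet is scaled down (default 30s)
```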

This is a commonly used technique to achieve blue-green in Kubernetes without the need for special integration with a service mesh or ingress controller. The technique works well on Kubernetes clusters using overlay networks, where traffic is proxied via kube-proxy, since after the service selector switch:

  • the update to the Endpoints object, and the corresponding iptables rule propagation to the nodes, happens very rapidly (well within the 30s scaleDownDelaySeconds)
  • no additional updates to external LoadBalancers (e.g. TargetGroups) are needed, since the Service is reachable from any node in the cluster (via kube-proxy).

However, with pod-aware ingress controllers/load balancers that operate on flat networks, such as the AWS Load Balancer Controller on EKS with the AWS CNI, changing the active Service's selectors becomes problematic. This is because it is critical for the LoadBalancer/TargetGroup to accurately reflect the IPs of the pods selected by the active Service, and changing Service selectors introduces a window of opportunity for the TargetGroup targets to be inaccurate. In fact, accurate target registration is one of the main motivations for pod readinessGates. The problematic scenario is as follows:

  1. scale up the new V2 stack
  2. wait for all pods of V2 to be available
  3. switch the label selector of the active Service from the V1 stack to the V2 stack
  4. due to various issues, the LoadBalancer does not retarget the new pods in a timely manner
  5. scale down old V1 stack (after scaleDownDelaySeconds - default 30s)

With flat networking, after step 3, unlike the iptables update that happens with an overlay network, an external update to the TargetGroup of the underlying LoadBalancer is now needed. Retargeting the new set of pods can take time (e.g. due to rate-limiting or downtime of the ingress controller). The rollout controller currently has no way of knowing whether the target group was updated, so it would proceed to scale down the old V1 stack, despite the fact that the LoadBalancer might still be targeting the V1 stack and not yet be updated to target the V2 stack.

With a normal rolling update, the potential for downtime is mitigated by the fact that pod readiness gates are injected by the AWS LoadBalancer Controller. Readiness gates, in conjunction with a rolling update, indirectly prevent premature scale-down of the old stack by blocking readiness of the new stack until the new pods are registered in the TargetGroup. However, the way the AWS LoadBalancer Controller currently works, it injects readiness gates into pods only if they are reachable from an ALB Ingress/Service. Unfortunately, this is in direct conflict with blue-green's need to create the pods before they are targeted by the ALB Ingress/Service. So by design, the AWS LoadBalancer Controller's readiness gate never gets injected into the new V2 stack.
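
For context, readiness gate injection with the AWS Load Balancer Controller is opt-in per namespace, and the injected condition is tied to a specific target group. A rough illustration (namespace, pod, and condition-type names below are made up; the real condition type is generated by the controller per target group):

```yaml
# Namespace opt-in for readiness gate injection by the AWS Load Balancer Controller
apiVersion: v1
kind: Namespace
metadata:
  name: my-app                      # hypothetical namespace
  labels:
    elbv2.k8s.aws/pod-readiness-gate-inject: enabled
---
# What an injected readiness gate looks like on a pod the controller considers
# reachable from an ALB Ingress/Service (condition type shown is illustrative)
apiVersion: v1
kind: Pod
metadata:
  name: my-app-v2-abc123
  namespace: my-app
spec:
  readinessGates:
  - conditionType: target-health.elbv2.k8s.aws/k8s-myapp-tg-0123456789
  containers:
  - name: app
    image: example/my-app:v2
```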

See: kubernetes-sigs/aws-load-balancer-controller#2061

Rollouts should implement a mechanism to guarantee zero-downtime updates when using the blue-green strategy with the AWS CNI and the AWS Load Balancer Controller.

Current workarounds:

  1. postPromotionAnalysis can be leveraged to verify that the TargetGroup has been updated to point to the V2 pods. For example, postPromotionAnalysis could simply curl the external endpoint and verify that the new version is returned (a sketch follows after this list). If this fails, rollouts would abort the update, swap the selector back to the old stack, and return to the starting state.

  2. spec.strategy.blueGreen.scaleDownDelaySeconds can be increased from the default 30s to reduce the possibility of downtime when the AWS LoadBalancer Controller is delayed, by giving it more time to recover/catch up.
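
A rough sketch of the first workaround, using an Argo Rollouts AnalysisTemplate with a job-based metric that curls the external endpoint (the endpoint URL, version string, and template name are hypothetical):

```yaml
apiVersion: argoproj.io/v1alpha1
kind: AnalysisTemplate
metadata:
  name: verify-active-version         # hypothetical template name
spec:
  metrics:
  - name: active-endpoint-serves-v2
    provider:
      job:
        spec:
          backoffLimit: 3
          template:
            spec:
              restartPolicy: Never
              containers:
              - name: curl
                image: curlimages/curl:7.78.0
                command: ["sh", "-c"]
                # Fail the analysis (and abort the rollout) if the public endpoint
                # is not yet serving the new version.
                args:
                - curl -fsS https://my-app.example.com/version | grep -q 'v2'

# Referenced from the Rollout's blue-green strategy:
# spec:
#   strategy:
#     blueGreen:
#       activeService: my-app-active
#       postPromotionAnalysis:
#         templates:
#         - templateName: verify-active-version
```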

Potential solutions:

  1. Similar to the weight verification feature for ALB canary, the rollout controller could verify that the TargetGroup was updated to point to the V2 stack before proceeding with postPromotionAnalysis / scaling down the old stack. However, with this approach, pods would not benefit from readiness gates (since they never get injected into the V2 stack in the first place). The lack of readiness gates for V2 pods may be acceptable, since health check annotations on the Ingress may be sufficient to deregister pods from the TargetGroup in the event a pod goes from available to unavailable after starting.

  2. Argo Rollouts could offer a model where Service selectors are never modified, or at least modified in a way that works well with the AWS LoadBalancer Controller's implementation of readiness gate injection. For example, instead of changing Service selectors, rollouts could alternate traffic between a "ping" Service and a "pong" Service. On every update, the rollout controller would leverage weighted target groups and update the Ingress annotations to shift production traffic from 100% ping to 100% pong (and vice versa on the next update), as illustrated below. With this approach, readiness gates would be leveraged to ensure accurate target group registration.
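
For illustration, the weighted forwarding between the ping and pong Services could be expressed with the AWS Load Balancer Controller's actions annotation, roughly as follows (Ingress, Service names, and ports are made up; the rollout controller would rewrite the weights on each update):

```yaml
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: my-app                       # hypothetical Ingress
  annotations:
    kubernetes.io/ingress.class: alb
    # Weighted forward action; on each update the weights would flip between
    # the "ping" and "pong" Services (100/0 -> 0/100).
    alb.ingress.kubernetes.io/actions.my-app-root: |
      {
        "type": "forward",
        "forwardConfig": {
          "targetGroups": [
            {"serviceName": "my-app-ping", "servicePort": "80", "weight": 100},
            {"serviceName": "my-app-pong", "servicePort": "80", "weight": 0}
          ]
        }
      }
spec:
  rules:
  - http:
      paths:
      - path: /
        pathType: Prefix
        backend:
          service:
            name: my-app-root        # must match the action name in the annotation
            port:
              name: use-annotation
```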

However, the second approach has the following drawbacks:

  • It only works for HTTP services behind an ALB. Other protocols would not work.
  • Additional Service objects would need to be managed (and possibly created) by rollouts, increasing complexity.
  • It would require introducing ALB integration into the blue-green strategy, which is a maintenance cost we'd like to avoid.

Message from the maintainers:

Impacted by this bug? Give it a 👍. We prioritize the issues with the most 👍.

jessesuen added the enhancement and workaround labels on Jun 16, 2021

jessesuen commented Aug 5, 2021

Slides on the approaches to solving this:

https://docs.google.com/presentation/d/1JnvlE-oKL7HPErwFnBBhH2pfWUf0kSoFRLUDt2Glc6E/edit?usp=sharing

We will be implementing both option 1 (target group verification) and option 2 (ping-pong service support for ALB canary, and other implementations if necessary).

jessesuen commented:

Closing this since solution 1 has been implemented in v1.1 using the new Target Group verification feature.

I opened #1453 for implementing the second approach, but approach 2 is not necessarily required for zero-downtime.
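
For reference, a minimal sketch of enabling the verification, assuming it is gated behind the controller's --aws-verify-target-group flag (confirm the flag name and resource names against the docs for your installed version):

```yaml
# Add the flag to the argo-rollouts controller Deployment
# (names as in the default install manifest; adjust to your install).
apiVersion: apps/v1
kind: Deployment
metadata:
  name: argo-rollouts
  namespace: argo-rollouts
spec:
  template:
    spec:
      containers:
      - name: argo-rollouts
        args:
        - --aws-verify-target-group
```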
