
Zero-downtime blue-green updates with AWS CNI & AWS LoadBalancer Controller #1283

Closed
jessesuen opened this issue Jun 16, 2021 · 2 comments

Labels: enhancement (New feature or request), workaround (There's a workaround, might not be great, but exists)


jessesuen commented Jun 16, 2021

Summary

Currently, the way we perform blue-green is as follows:

  1. scale up the new V2 stack
  2. wait for all pods of V2 to be available
  3. switch the label selector of the active Service from the V1 stack to the V2 stack
  4. scale down old V1 stack (after scaleDownDelaySeconds - default 30s)
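
For reference, a minimal sketch of this blue-green configuration (resource names, image, and replica count are illustrative, not taken from any real manifest):

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: my-app                        # hypothetical name
spec:
  replicas: 3
  selector:
    matchLabels:
      app: my-app
  template:
    metadata:
      labels:
        app: my-app
    spec:
      containers:
      - name: my-app
        image: example/my-app:v2      # bumping this image creates the V2 stack
  strategy:
    blueGreen:
      activeService: my-app-active    # Service whose selector is switched from V1 to V2
      autoPromotionEnabled: true
      scaleDownDelaySeconds: 30       # delay before the old V1 ReplicaSet is scaled down (default 30s)
```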

This is a commonly used technique to achieve blue-green in Kubernetes without the need for special integration with a service mesh or ingress controller. The technique works well on Kubernetes clusters using overlay networks, where traffic is proxied via kube-proxy, since after the service selector switch:

  • the update to the Endpoints object, and the corresponding iptables rule propagation to the nodes, happens very rapidly (well within the 30s scaleDownDelaySeconds)
  • no additional updates to external LoadBalancers (e.g. TargetGroups) are needed, since the Service is reachable from any node in the cluster (via kube-proxy).

However, with pod-aware ingress controllers/load balancers that operate on flat networks, such as the AWS Load Balancer Controller on EKS with the AWS CNI, changing the active Service's selectors becomes problematic. This is because it is critical for the LoadBalancer/TargetGroup to accurately reflect the IPs of the pods selected by the active Service, and changing Service selectors introduces a window of opportunity for the TargetGroup targets to be inaccurate. In fact, accurate target registration is one of the main motivations for pod readinessGates. The problematic scenario is as follows:

  1. scale up the new V2 stack
  2. wait for all pods of V2 to be available
  3. switch the label selector of the active Service from the V1 stack to the V2 stack
  4. due to various issues, the LoadBalancer does not retarget the new pods in a timely manner
  5. scale down old V1 stack (after scaleDownDelaySeconds - default 30s)

With flat networking, after step 3, unlike the iptables update that happens with an overlay network, an external update to the TargetGroup of the underlying LoadBalancer is now needed. Retargeting the new set of pods can take time (e.g. due to rate-limiting or downtime of the ingress controller). The rollout controller currently has no way of knowing whether the target group was updated, so it would proceed to scale down the old V1 stack, despite the fact that the LoadBalancer might still be targeting the V1 stack and not yet be updated to target the V2 stack.

With a normal rolling update, the potential for downtime is mitigated by the fact that pod readiness gates are injected by the AWS LoadBalancer Controller. Readiness gates, in conjunction with a rolling update, indirectly prevent premature scale-down of the old stack by blocking readiness of the new stack until the new pods are registered in the TargetGroup. However, the way the AWS LoadBalancer Controller currently works, it injects readiness gates into pods only if they are reachable from an ALB Ingress/Service. Unfortunately, this is in direct conflict with blue-green's need to create the pods before they are targeted by the ALB Ingress/Service. So by design, the AWS LoadBalancer Controller's readiness gate never gets injected into the new V2 stack.
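
For context, readiness gate injection with the AWS Load Balancer Controller is opt-in per namespace, and the injected condition is tied to a specific target group. A rough illustration (namespace, pod, and condition-type names below are made up; the real condition type is generated by the controller per target group):

```yaml
# Namespace opt-in for readiness gate injection by the AWS Load Balancer Controller
apiVersion: v1
kind: Namespace
metadata:
  name: my-app                      # hypothetical namespace
  labels:
    elbv2.k8s.aws/pod-readiness-gate-inject: enabled
---
# What an injected readiness gate looks like on a pod the controller considers
# reachable from an ALB Ingress/Service (condition type shown is illustrative)
apiVersion: v1
kind: Pod
metadata:
  name: my-app-v2-abc123
  namespace: my-app
spec:
  readinessGates:
  - conditionType: target-health.elbv2.k8s.aws/k8s-myapp-tg-0123456789
  containers:
  - name: app
    image: example/my-app:v2
```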

See: kubernetes-sigs/aws-load-balancer-controller#2061

Rollouts should implement a mechanism to guarantee zero-downtime updates when using the blue-green strategy with the AWS CNI and the AWS Load Balancer Controller.

Current workarounds:

  1. postPromotionAnalysis can be leveraged to verify that the TargetGroup has been updated to point to the V2 pods. For example, postPromotionAnalysis could simply curl the external endpoint and verify that the new version is returned (a sketch follows after this list). If this fails, rollouts would abort the update, swap the selector back to the old stack, and return to the starting state.

  2. spec.strategy.blueGreen.scaleDownDelaySeconds can be increased from the default 30s to reduce the possibility of downtime when the AWS LoadBalancer Controller is delayed, by giving it more time to recover/catch up.
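
A rough sketch of the first workaround, using an Argo Rollouts AnalysisTemplate with a job-based metric that curls the external endpoint (the endpoint URL, version string, and template name are hypothetical):

```yaml
apiVersion: argoproj.io/v1alpha1
kind: AnalysisTemplate
metadata:
  name: verify-active-version         # hypothetical template name
spec:
  metrics:
  - name: active-endpoint-serves-v2
    provider:
      job:
        spec:
          backoffLimit: 3
          template:
            spec:
              restartPolicy: Never
              containers:
              - name: curl
                image: curlimages/curl:7.78.0
                command: ["sh", "-c"]
                # Fail the analysis (and abort the rollout) if the public endpoint
                # is not yet serving the new version.
                args:
                - curl -fsS https://my-app.example.com/version | grep -q 'v2'

# Referenced from the Rollout's blue-green strategy:
# spec:
#   strategy:
#     blueGreen:
#       activeService: my-app-active
#       postPromotionAnalysis:
#         templates:
#         - templateName: verify-active-version
```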

Potential solutions:

  1. Similar to the weight verification feature for ALB canary, the rollout controller could verify that the TargetGroup was updated to point to the V2 stack before proceeding with postPromotionAnalysis / scaling down the old stack. However, with this approach, pods would not benefit from readiness gates (since they never get injected into the V2 stack in the first place). The lack of readiness gates for V2 pods may be acceptable, since health check annotations on the Ingress may be sufficient to deregister pods from the TargetGroup in the event a pod goes from available to unavailable after starting.

  2. Argo Rollouts could offer a model where Service selectors are never modified, or at least modified in a way that works well with the AWS LoadBalancer Controller's implementation of readiness gate injection. For example, instead of changing Service selectors, rollouts could alternate traffic between a "ping" Service and a "pong" Service. On every update, the rollout controller would leverage weighted target groups and update the Ingress annotations to shift production traffic from 100% ping to 100% pong (and vice versa on the next update), as illustrated below. With this approach, readiness gates would be leveraged to ensure accurate target group registration.
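
For illustration, the weighted forwarding between the ping and pong Services could be expressed with the AWS Load Balancer Controller's actions annotation, roughly as follows (Ingress, Service names, and ports are made up; the rollout controller would rewrite the weights on each update):

```yaml
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: my-app                       # hypothetical Ingress
  annotations:
    kubernetes.io/ingress.class: alb
    # Weighted forward action; on each update the weights would flip between
    # the "ping" and "pong" Services (100/0 -> 0/100).
    alb.ingress.kubernetes.io/actions.my-app-root: |
      {
        "type": "forward",
        "forwardConfig": {
          "targetGroups": [
            {"serviceName": "my-app-ping", "servicePort": "80", "weight": 100},
            {"serviceName": "my-app-pong", "servicePort": "80", "weight": 0}
          ]
        }
      }
spec:
  rules:
  - http:
      paths:
      - path: /
        pathType: Prefix
        backend:
          service:
            name: my-app-root        # must match the action name in the annotation
            port:
              name: use-annotation
```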

However, the second approach has the following drawbacks:

  • It only works for HTTP services behind an ALB. Other protocols would not work.
  • Additional Service objects would need to be managed (and possibly created) by rollouts, increasing complexity.
  • It would require introducing ALB integration into the blue-green strategy, which is a maintenance cost we'd like to avoid.

Message from the maintainers:

Impacted by this bug? Give it a 👍. We prioritize the issues with the most 👍.

jessesuen added the enhancement and workaround labels on Jun 16, 2021

jessesuen commented Aug 5, 2021

Slides on the approaches to solving this:

https://docs.google.com/presentation/d/1JnvlE-oKL7HPErwFnBBhH2pfWUf0kSoFRLUDt2Glc6E/edit?usp=sharing

We will be implementing both option 1 (target group verification) and option 2 (ping-pong service support for ALB canary, and other implementations if necessary).

jessesuen commented:

Closing this since solution 1 has been implemented in v1.1 using the new Target Group verification feature.

I opened #1453 for implementing the second approach, but approach 2 is not necessarily required for zero-downtime.
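
For reference, a minimal sketch of enabling the verification, assuming it is gated behind the controller's --aws-verify-target-group flag (confirm the flag name and resource names against the docs for your installed version):

```yaml
# Add the flag to the argo-rollouts controller Deployment
# (names as in the default install manifest; adjust to your install).
apiVersion: apps/v1
kind: Deployment
metadata:
  name: argo-rollouts
  namespace: argo-rollouts
spec:
  template:
    spec:
      containers:
      - name: argo-rollouts
        args:
        - --aws-verify-target-group
```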
