
Ping-Pong service management in canary updates #1453

Closed
jessesuen opened this issue Aug 26, 2021 · 3 comments · Fixed by #1697
Labels: enhancement (New feature or request)

@jessesuen
Member

jessesuen commented Aug 26, 2021

Summary

Spawning this from #1283.

AWS Load Balancer Controller (and possibly others) suffers from an issue where modifying the selectors of Services behind an ALB Ingress is problematic: changing Service selectors prevents readiness gates from being injected properly. This is highly dependent on the ingress controller implementation, but with AWS Load Balancer Controller v2.x, the controller only injects pod readiness gates if the Services are reachable through an ALB Ingress. Pods are considered reachable by an Ingress if they match the labels of an Ingress/Service at the time of pod creation. A more in-depth description of the issue is here:
https://argoproj.github.io/argo-rollouts/features/traffic-management/alb/#zero-downtime-updates-with-aws-targetgroup-verification.
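
To illustrate the mechanism in question, here is a sketch of the readiness gate the AWS Load Balancer Controller v2.x injects into matching pods. The conditionType prefix follows the controller's convention; the suffix and all names are placeholders, not taken from this issue:

```yaml
# Hypothetical pod after readiness-gate injection by AWS Load Balancer
# Controller v2.x. The pod only matched the Service/Ingress labels at
# creation time, so the controller added the gate below; the pod is not
# considered Ready until the ALB target group reports it healthy.
apiVersion: v1
kind: Pod
metadata:
  name: example-pod
  labels:
    app: alb-canary
spec:
  readinessGates:
  # conditionType suffix is generated per target group binding; placeholder here
  - conditionType: target-health.elbv2.k8s.aws/example-target-group
  containers:
  - name: app
    image: example/app:latest
```

If the Service selector changes after pod creation, this gate is never added, which is the window the issue describes.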

The v1.1 Target Group verification feature was implemented to provide zero-downtime guarantees in the absence of proper readiness gate injection, and it should address any zero-downtime concerns. But when that feature is not used, Argo Rollouts could offer a model where Service selectors are modified in a way that works well with the AWS Load Balancer Controller's implementation of readiness gate injection.

This proposal is for Rollouts to provide a second canary model, which alternates traffic between a "ping" Service and a "pong" Service (both managed and deployed by the user). On every update, the rollout controller would leverage weighted target groups, updating the Ingress annotations to split production traffic between ping and pong (and vice versa on the next update). With this approach, readiness gates would be injected properly, because the ping/pong Service selectors would only ever be modified before the pods are created.
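
As a sketch of what the weighted split could look like, the controller might manage an ALB action annotation like the one below (the action annotation format follows AWS Load Balancer Controller v2.x conventions; the ingress, service names, and weights are illustrative assumptions, not part of this proposal's spec):

```yaml
# Hypothetical Ingress during a 5% canary step, with traffic split between
# the ping and pong Services via ALB weighted target groups. On the next
# update the weights would move in the opposite direction.
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: alb-canary-ingress
  annotations:
    kubernetes.io/ingress.class: alb
    alb.ingress.kubernetes.io/actions.alb-canary: |
      {
        "Type": "forward",
        "ForwardConfig": {
          "TargetGroups": [
            {"ServiceName": "ping-svc", "ServicePort": "80", "Weight": 95},
            {"ServiceName": "pong-svc", "ServicePort": "80", "Weight": 5}
          ]
        }
      }
spec:
  rules:
  - http:
      paths:
      - path: /
        pathType: Prefix
        backend:
          service:
            name: alb-canary       # references the action annotation above
            port:
              name: use-annotation
```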

Some slides that detail the problem and approach to solving it:
https://docs.google.com/presentation/d/1JnvlE-oKL7HPErwFnBBhH2pfWUf0kSoFRLUDt2Glc6E/edit#slide=id.ge7a629063e_1_451

Proposed spec:

apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: alb-canary
spec:
  ...
  strategy:
    canary:
      pingPong:
        pingService: ping-svc
        pongService: pong-svc
      trafficRouting:
        alb:
          ingress: alb-canary-ingress
          servicePort: 80
      steps:
      - setWeight: 5
      - pause: {duration: 1h}
      - setWeight: 25
      - pause: {duration: 1h}
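
For context, the two user-managed Services referenced by the spec might look like the following sketch. The port numbers and app label are assumptions; the key point is that the controller alternates which Service's selector is updated, and does so before new pods exist:

```yaml
# Hypothetical user-managed ping/pong Services. The rollout controller
# would update the pod-template-hash portion of each selector only
# before the corresponding pods are created, so readiness gates are
# injected correctly.
apiVersion: v1
kind: Service
metadata:
  name: ping-svc
spec:
  ports:
  - port: 80
    targetPort: 8080
  selector:
    app: alb-canary
---
apiVersion: v1
kind: Service
metadata:
  name: pong-svc
spec:
  ports:
  - port: 80
    targetPort: 8080
  selector:
    app: alb-canary
```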

I am open to a better name than ping/pong service.

Use Cases

When would you use this?

I use AWS Load Balancer Controller with Rollouts' AWS integration, and I want readiness gates to be injected properly.


Message from the maintainers:

Impacted by this bug? Give it a 👍. We prioritize the issues with the most 👍.

@gaganapplatix

Found another window where Rollout Pods never become ready when dealing with readiness gates:

  1. Desired and Stable both point to current replicaset
  2. New Pod scales up, gets both TargetGroup readiness gates set on it (2 readiness gates)
  3. Before containers become Ready, new replicaset is rolled out and desired service points to new hash for new replicaset
  4. At this point the Pod will never become ready, because the Endpoints object for the desired service will never contain the Pod IP in a ready state.
  5. I am not sure that https://argoproj.github.io/argo-rollouts/features/traffic-management/alb/#zero-downtime-updates-with-aws-targetgroup-verification actually closes this window, because I don't think any verification will fix this case.

@harikrongali harikrongali added this to the v1.2 milestone Sep 16, 2021
@perenesenko
Member

Hi @jessesuen
BTW, how are we going to implement the case of an experiment that references the canary via ping-pong?
We cannot specify the ping (or pong) service there, as it always switches.

Example of the experiment with canary:

  strategy:
    canary:
      steps:
      - setWeight: 20
      - experiment:
          templates:
            - name: experiment-istio
              specRef: canary
              weight: 20

@perenesenko
Member

We just talked with @harikrongali about this and see no issue here.
In this case, "specRef: canary" means "use the new template here". There is no relation to the ping/pong services.
