
Support retaining clusters when still referenced by route/listeners #18958

Open
howardjohn opened this issue Nov 10, 2021 · 29 comments

@howardjohn
Contributor

Title: Support retaining clusters when still referenced by route/listeners

Description:

Currently, our ingress gets clusters for all services we know about. This is often extremely wasteful, since most of them are not referenced by any routes.

We added an option to filter these out, but started seeing issues with transitions like the following:

  • Initial state: route /foo -> cluster foo-v1
  • Desired end state: route /foo -> cluster foo-v2, remove foo-v1

In practice, there is a gap between the cluster being removed and the route being updated. This results in a period of downtime on route updates.

Our ideal would be that Envoy doesn't delete the cluster while a route still references it.


The alternative would be to solve it in the control plane. But that seems pretty hard... I guess we would need to realize foo-v1 is referenced by a route and keep it around until the route is updated.

This may be ~impossible for us (until we move to delta xDS, anyway), since "keep it around" with SotW means we literally need a copy of it in the control plane, and we don't store config after we send it, nor can we regenerate it.

And to know it's updated, we need to send the route and wait for the ACK, I guess? That also introduces a ton of complexity to the control plane and requires it to be stateful.

It seems a bit complex, and I am not even sure it can work if the proxy reconnects to a new control plane instance in the middle of this operation, unless our control plane instances sync data, which is extremely complicated.
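
For illustration, here is a rough Go sketch of what "keep it around" would mean on the control plane side, assuming we stored the clusters we previously sent and tracked route ACKs. All names here are hypothetical, not Istio or go-control-plane APIs.

```go
package main

import "fmt"

// Hypothetical types; not Istio or go-control-plane APIs.
type RouteConfig struct {
	Name     string
	Clusters []string // cluster names referenced by this route's actions
}

type State struct {
	desiredClusters map[string]bool // clusters generated from the current service registry
	sentClusters    map[string]bool // clusters previously sent (requires storing what we sent)
	ackedRoutes     []RouteConfig   // routes the proxy has ACKed
}

// clustersToPush keeps every currently desired cluster, plus any previously
// sent cluster that an ACKed route still references, so the proxy never has
// a route pointing at a cluster we just removed.
func (s *State) clustersToPush() map[string]bool {
	out := map[string]bool{}
	for c := range s.desiredClusters {
		out[c] = true
	}
	for _, r := range s.ackedRoutes {
		for _, c := range r.Clusters {
			if s.sentClusters[c] {
				out[c] = true // retain until the route stops referencing it
			}
		}
	}
	return out
}

func main() {
	s := &State{
		desiredClusters: map[string]bool{"foo-v2": true},
		sentClusters:    map[string]bool{"foo-v1": true, "foo-v2": true},
		ackedRoutes:     []RouteConfig{{Name: "/foo", Clusters: []string{"foo-v1"}}},
	}
	fmt.Println(s.clustersToPush()) // map[foo-v1:true foo-v2:true]
}
```

As noted above, this only works if the control plane stores the previously sent cluster config and tracks ACKs, which is exactly the statefulness we want to avoid.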

Relevant Links:
Issue tracker on our side: istio/istio#29131

@howardjohn added the enhancement and triage labels Nov 10, 2021
@Shikugawa
Member

So the expectation is to prevent the immediate drain of the cluster bound to the route when the route is updated via RDS, and to wait to update the cluster until the next cluster is available. Is that correct?

@howardjohn
Contributor Author

Yeah that would be ideal I think.

@mattklein123
Member

Adding this is going to be very complicated, and IMHO this is a control plane concern, not an Envoy concern. I'm going to push back pretty hard on adding this unless there is a very good reason.

@mattklein123 added the area/cluster_manager label and removed the enhancement and triage labels Nov 11, 2021
@howardjohn
Contributor Author

howardjohn commented Nov 11, 2021

I actually generally agree with that, but my concern is that I am not convinced a control plane can actually solve this in any reasonable way.

Consider:

  1. Initial state: route /foo -> cluster foo-v1. Connected to control plane instance CP-1.
  2. User config changes to point /foo -> cluster foo-v2.
  3. Control plane is smart, so it sends the new cluster foo-v2 first.
  4. Control plane crashes; a new instance CP-2 starts up and Envoy reconnects.
  5. Now CP-2 needs to resend all config. It seems that the user wants /foo -> cluster foo-v2; no config needs cluster foo-v1 (and in fact, CP-2 may not even know foo-v1 exists at all!), so it removes foo-v1.
  6. Traffic is now broken.
  7. Eventually CP-2 finishes the push and moves /foo -> foo-v2. Traffic is now fixed.

In our specific case, our control plane is essentially always in the 'Control plane crashes' state, because we do not maintain the state required to handle this complex operation; I used a crash because that is something I assume no control plane is immune to.

@mattklein123
Member

IMO if your control plane crashes it's sufficient for there to be a delay until things converge; downstream retries, etc. should take care of this.

@howardjohn
Contributor Author

When the control plane crashes I would expect configuration updates to be delayed. I would not expect traffic to be broken (i.e. 503s until the control plane comes back up).

@mattklein123
Member

When the control plane crashes I would expect configuration updates to be delayed. I would not expect traffic to be broken (i.e. 503s until the control plane comes back up).

I don't understand why your control plane can't come up in a consistent state and send the new config.

@howardjohn
Contributor Author

The user's state is configured in Kubernetes. This includes service discovery (Services/Endpoints) as well as routing configuration. In Kubernetes, we can only access the current configuration - we cannot view historical state unless we maintain it out of band.

In step 3 from #18958 (comment), the control plane can do this because it saw the old config, and now it sees the new configuration, so it could be smart enough to know we need to "stage" the update in two parts (in theory - Istio cannot do this today).

With a new control plane coming up, though, there is no way it can understand the state of the old config. All it sees is that the user has config for a foo-v2 Service (i.e. a foo-v2 Envoy Cluster), at which point it doesn't really have much choice other than to send just foo-v2. Even if it somehow knew it should send foo-v1, we don't have the Service for foo-v1, so we wouldn't know how to synthesize a cluster for it.

It's possible I am overcomplicating this :-)

@mattklein123
Member

I understand what you are saying in terms of the corner case, but I don't think Envoy holding configuration is the right solution to this problem. It's too complicated. If you need atomic updates I would rather just figure out a way in which a route update is held until all of the clusters it references are actually available. This has been discussed in the past. cc @htuch

@htuch
Member

htuch commented Nov 12, 2021

Yeah. The neat way to approach this would be to have the xDS Resource declare its dependencies in the wrapper and have the xDS machinery wait to deliver until these are available from all sources. The downsides are that this is going to add extra complexity and debug pain when using this feature (it would be opt-in), and our xDS delivery right now is not really set up as a cache (where this would be more natural). CC @adisuissa
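
To sketch the shape of that idea (hypothetical Go types only; today's xDS Resource wrapper has no dependency field):

```go
package xdsdeps

// ResourceRef names a resource of a given type; hypothetical, for illustration.
type ResourceRef struct {
	TypeURL string
	Name    string
}

// WrappedResource is a hypothetical wrapper shape, not the real xDS Resource proto.
type WrappedResource struct {
	Name         string
	TypeURL      string
	Dependencies []ResourceRef // e.g. a RouteConfiguration listing the Clusters it routes to
	Body         []byte        // the serialized resource itself
}

// Ready reports whether every declared dependency is already available, so
// the xDS machinery knows it can hand the resource to its consumer.
func Ready(r WrappedResource, have map[ResourceRef]bool) bool {
	for _, dep := range r.Dependencies {
		if !have[dep] {
			return false
		}
	}
	return true
}
```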

@howardjohn
Contributor Author

Yeah. The neat way to approach this would be to have the xDS Resource declare its dependencies in the wrapper and have the xDS machinery wait to deliver until these are available from all sources. The downsides are that this is going to add extra complexity and debug pain when using this feature (it would be opt-in), and our xDS delivery right now is not really set up as a cache (where this would be more natural). CC @adisuissa

I can definitely imagine other cases where this could be useful. One similar problem (not a high priority for us) is that we have a coupling of the SAN in the Cluster to EDS endpoints. A similar approach could probably work there (although putting the SAN in the EDS may be better). If I think about it more, there are probably other cross-resource dependencies where this would be useful.

@github-actions

This issue has been automatically marked as stale because it has not had activity in the last 30 days. It will be closed in the next 7 days unless it is tagged "help wanted" or "no stalebot" or other activity occurs. Thank you for your contributions.

@github-actions bot added the stale label Dec 13, 2021
@github-actions

This issue has been automatically closed because it has not had activity in the last 37 days. If this issue is still valid, please ping a maintainer and ask them to label it as "help wanted" or "no stalebot". Thank you for your contributions.

@howardjohn
Contributor Author

Not stale; maybe we can add "help wanted"? Unless this feature would not be accepted.

@htuch
Member

htuch commented Dec 21, 2021

@howardjohn which variant? The original proposal or what we ended up discussing?

@howardjohn
Contributor Author

I am fine with either one; I think both will satisfy our issue.

@htuch
Member

htuch commented Dec 27, 2021

I'd be interested in hearing from @mattklein123 and @adisuissa here, but I think what would be ideal would be to open an issue to track making the xDS system more cache-like and using that structure to then accommodate the resource staging (probably two issues). I can file them if there is general agreement on this approach. It would then need to be help wanted as I'm not sure if anyone in the xDS maintainer orbit has cycles for a major undertaking like this.

Absent this, I'm not sure if the first approach is going to fly, but I'm open-minded to options.

@mattklein123 reopened this Jan 4, 2022
@mattklein123 added the help wanted label Jan 4, 2022
@mattklein123
Member

I can file them if there is general agreement on this approach. It would then need to be help wanted as I'm not sure if anyone in the xDS maintainer orbit has cycles for a major undertaking like this.

Yes this is the approach that I would recommend. @htuch I marked this help wanted and reopened, but feel free to close this out and replace it with a fresh issue if you prefer.

@adisuissa
Contributor

I also think the approach of using an xDS internal cache and explicitly setting the dependencies in the resource will make things easier to understand, and will generalize this cross-xDS dependency requirement for other use cases.

@github-actions bot removed the stale label Jan 4, 2022
@howardjohn
Contributor Author

Coming back to this, I actually tried implementing #18958 (comment) (ignoring the control plane crashing edge case). Even that is seemingly ~impossible to do correctly.

Despite a correct ordering:

  1. RDS -> points to foo
  2. CDS: push foo and bar
  3. RDS: update to point to bar
  4. CDS: push bar only

There is actually an intermediate step needed between 2 and 3: wait for "bar" to finish warming. This information is not exposed to control planes, so it's ambiguous when to do step 3.
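
As a sketch, the staged push from a control plane's point of view would look roughly like this (hypothetical Proxy interface; the warming wait is the part that has no real equivalent today):

```go
package staging

// Proxy abstracts a single connected Envoy; everything here is hypothetical.
type Proxy interface {
	PushCDS(clusters []string) error
	PushRDS(route string, cluster string) error
	// WaitForClusterWarmed is the missing piece: SotW xDS gives the control
	// plane no warming signal, so this cannot be implemented reliably today.
	WaitForClusterWarmed(cluster string) error
}

// StagedRouteUpdate walks through the ordering above, with the extra
// warming wait between steps 2 and 3 made explicit.
func StagedRouteUpdate(p Proxy) error {
	// 1. Routes currently point at "foo".
	// 2. Push both clusters so "bar" can start warming alongside "foo".
	if err := p.PushCDS([]string{"foo", "bar"}); err != nil {
		return err
	}
	// 2b. Wait until "bar" finishes warming before switching routes.
	if err := p.WaitForClusterWarmed("bar"); err != nil {
		return err
	}
	// 3. Only now is it safe to point the route at "bar".
	if err := p.PushRDS("route-foo", "bar"); err != nil {
		return err
	}
	// 4. Finally drop the old cluster.
	return p.PushCDS([]string{"bar"})
}
```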

@howardjohn
Contributor Author

I guess #23321 resolves the concern that #18958 (comment) is impossible to implement.

@vikaschoudhary16
Contributor

I guess #23321 resolves the concern that #18958 (comment) is impossible to implement.

@howardjohn I realized the pausing on CDS ACK is already working. In my earlier testing, it seems I was missing the proper changes on the Istiod side and incorrectly thought that Envoy's pausing was not working. I retested istio/istio#41229 with no custom proxy and it works, i.e. no 503s.

@jewertow
Contributor

@vikaschoudhary16 Pausing CDS works correctly, but the problem is that the listener manager and the cluster manager pause requesting resources independently. The listener manager pauses requesting Cluster, ClusterLoadAssignment, LbEndpoints, and Secret resources. The cluster manager, on the other hand, pauses requesting Listener, RouteConfiguration, and Secret resources.

I was trying to work around getting 503s by updating a route in two steps:

  1. Switch traffic from existing route to a new cluster.
  2. Remove an old cluster.

I expected that it would work without 503s, because I didn't remove and add clusters in the same update, but the update failed after the first step. I observed the following order of events in Envoy:

  1. Cluster test-app-8087 is serving requests successfully.
  2. CDS added test-app-8088.
  3. Envoy paused requesting Clusters, ClusterLoadAssignments, LbEndpoints and Secrets.
  4. Cluster test-app-8088 started being warmed.
  5. Envoy resumed requesting ClusterLoadAssignment.
  6. Received new LDS configuration.
  7. Received new RDS configuration.
  8. New route was loaded!
  9. Responding with 503, because cluster test-app-8088 is not yet warmed.
  10. Received EDS response with ClusterLoadAssignments.
  11. Cluster test-app-8088 is warmed.
  12. Next requests succeed.

This is because CDS does not pause requests for Listener and RouteConfiguration resources. You can find more logs here.

The documentation describes a similar case here and calls the solution the "make before break" model, but it complicates the control plane significantly. On the other hand, implementing dependent resources in the xDS API, as was suggested, is much more complex and couldn't be delivered in the short term.

I'm looking for a trade-off, so what do you think about extending the Envoy API with an option that would pause requesting Listener and RouteConfiguration resources on a CDS update? Then a control plane could send the clusters referenced both previously and currently by a route and update that route without 503s.
cc: @mattklein123 @htuch @howardjohn

@mattklein123
Member

I'm looking for a trade-off, so what do you think about extending the Envoy API with an option that would pause requesting Listener and RouteConfiguration resources on a CDS update? Then a control plane could send the clusters referenced both previously and currently by a route and update that route without 503s.

I still think implementing the caching/dependencies is the right solution, but this sounds reasonable to me at a high level. I'm not enough in the weeds anymore to understand the full implications of this change so I will defer to @adisuissa on that.

@jewertow
Contributor

jewertow commented Jan 3, 2023

@mattklein123 Thank you for your feedback, but it turned out that my solution does not make sense. @howardjohn helped me realize that in the case I described above, the route was loaded because it was pushed by the control plane - it was not requested by the proxy. So pausing requests does not help in this case.

@jewertow
Contributor

I can file them if there is general agreement on this approach. It would then need to be help wanted as I'm not sure if anyone in the xDS maintainer orbit has cycles for a major undertaking like this.

@htuch could you file these issues? We are interested in this feature and we would like to know its scope and requirements.

@htuch
Member

htuch commented Jan 27, 2023

@adisuissa this relates to your current arc with XdsManager I think, right?

@adisuissa
Contributor

@adisuissa this relates to your current arc with XdsManager I think, right?

There are a few threads in this issue, so I'll give a short summary of what I think can be done.

As discussed IRL, the idea is that the XdsManager will ensure warming by not exposing a resource to the upper-level component (RouteConfigProviderManager in this case) until the resource's dependencies arrive and it can become active.
Some details on how this will be achieved: when a resource arrives at Envoy, there will be a create() method that takes the resource proto, converts it to an Envoy-internal representation, and states whether there are any dependencies.
If there are no dependencies, the resource will be added to a cache and any of the internal observers (e.g., RouteConfigProviderManager) will be notified.
If there are dependencies, the resource will be stored in a "warming" state, and the internal observers will not be notified. Once the dependencies arrive, the resource will become "active", be stored in the cache, and the internal observers will be notified.

From the protocol side, Envoy will still send an ACK for a resource received and kept in a warmed state. It will just not use that resource until the dependencies arrive.
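
A rough Go sketch of that gating flow (hypothetical names; the real implementation would live in Envoy's C++ XdsManager):

```go
package xdscache

// DecodedResource is a hypothetical stand-in for what create() would return:
// an Envoy-internal representation plus the dependencies it declared.
type DecodedResource struct {
	Name         string
	Dependencies []string
}

// Observer is notified when a resource becomes active
// (e.g. RouteConfigProviderManager).
type Observer interface {
	OnResourceActive(name string)
}

type Cache struct {
	active    map[string]DecodedResource
	warming   map[string]DecodedResource
	observers []Observer
}

func NewCache(obs ...Observer) *Cache {
	return &Cache{
		active:    map[string]DecodedResource{},
		warming:   map[string]DecodedResource{},
		observers: obs,
	}
}

// Add is called when a resource arrives. The resource would still be ACKed,
// but observers are only notified once every dependency is active.
func (c *Cache) Add(r DecodedResource) {
	if c.depsSatisfied(r) {
		c.activate(r)
	} else {
		c.warming[r.Name] = r // hold in "warming" until its dependencies arrive
	}
	// A newly active resource may unblock resources that were warming.
	// Single pass for brevity; a real implementation would iterate to a fixpoint.
	for _, w := range c.warming {
		if c.depsSatisfied(w) {
			c.activate(w)
		}
	}
}

func (c *Cache) depsSatisfied(r DecodedResource) bool {
	for _, d := range r.Dependencies {
		if _, ok := c.active[d]; !ok {
			return false
		}
	}
	return true
}

func (c *Cache) activate(r DecodedResource) {
	c.active[r.Name] = r
	delete(c.warming, r.Name)
	for _, o := range c.observers {
		o.OnResourceActive(r.Name)
	}
}
```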

@jewertow
Contributor

@adisuissa thank you very much for the explanation!

  1. Is someone already working on this, or is it just an idea?
  2. Will it be the default behavior or an optional feature controlled by a config option?
