Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Clarify that on-demand CDS does not support SotW xDS #24763

Merged

Conversation

jewertow
Copy link
Contributor

@jewertow jewertow commented Jan 4, 2023

Signed-off-by: Jacek Ewertowski [email protected]

Commit Message: Clarify that on-demand CDS does not support SotW xDS
Additional Description:
Risk Level: n/a
Testing: n/a
Docs Changes: It's a doc change.
Release Notes: n/a
Platform Specific Features: n/a

I was trying to configure on-demand CDS in an Istio ingress-gateway, but the process failed with the following error:

2023-01-04T13:38:10.587354Z     critical        envoy assert    panic: not supported
2023-01-04T13:38:10.587411Z     critical        envoy backtrace Caught Aborted, suspect faulting address 0x5390000001e
2023-01-04T13:38:10.587416Z     critical        envoy backtrace Backtrace (use tools/stack_decode.py to get line numbers):
2023-01-04T13:38:10.587418Z     critical        envoy backtrace Envoy version: 383b3bdbec79392badac7cbf3525f09d1f0f18f8/1.24.1-dev/Clean/RELEASE/BoringSSL
2023-01-04T13:38:10.587694Z     critical        envoy backtrace #0: [0x7fcd2f3e4520]
...
2023-01-04T13:38:11.139240Z     info    ads     ADS: "@" istio-ingressgateway-5d68845c68-hqp4t.istio-system-2 terminated
2023-01-04T13:38:11.139360Z     info    ads     ADS: "@" istio-ingressgateway-5d68845c68-hqp4t.istio-system-1 terminated
2023-01-04T13:38:11.140610Z     error   Envoy exited with error: signal: aborted (core dumped)

I had to dive into the code base to figure out why it fails and I found this method:

void OdCdsApiImpl::onConfigUpdate(const std::vector<Config::DecodedResourceRef>& resources,
  ...
  // On-demand cluster updates are only supported for delta, not sotw.
  PANIC("not supported");
}

so I think it's worth to mention about this limitation in the docs.

Could I also add more details to that panic message or is it intentionally so general? I noticed that most panic messages are very general, so I wonder if this is a convention...?

Question:
Btw. is it an intended limitation or could it be implemented? I'm considering on-demand CDS as a workaround for the problem described in #18958, so if there are no obstacles to make it work with SotW ADS, I would be interested in implementing this feature.

@repokitteh-read-only
Copy link

CC @envoyproxy/api-shepherds: Your approval is needed for changes made to (api/envoy/|docs/root/api-docs/).
envoyproxy/api-shepherds assignee is @adisuissa
CC @envoyproxy/api-watchers: FYI only for changes made to (api/envoy/|docs/root/api-docs/).

🐱

Caused by: #24763 was opened by jewertow.

see: more, trace.

@markdroth
Copy link
Contributor

A quick drive-by comment: I can't think of any reason in principle that on-demand CDS wouldn't work with SotW. This is very similar to how gRPC uses CDS at the protocol level (although in gRPC, the fetch of a CDS resource is triggered by the presence of the cluster name in the RouteConfig, not by receiving a request that actually needs to be sent to that cluster).

@adisuissa
Copy link
Contributor

Looking at the discussion of the original issue #15523, and quoting from the doc:

... the interaction between “on-demand” discovery and “state-of-the-world” discovery may be troublesome. Basically once a cluster is discovered on-demand, it will also need to appear in the state-of-the-world discovery response, otherwise Envoy will remove the cluster. This means that the management service would need to track which Envoy instance requested for which cluster, so it can send a “tailored” state-of-the-world response.

so this seems intentional.

I wonder if this is a misbehavior of the specific management service that was used.

@markdroth
Copy link
Contributor

@adisuissa I think that conversation assumes that Envoy will always use a wildcard subscription for CDS. In principle, there's no reason that Envoy can't request individual CDS resources in addition to the wildcard subscription. I would expect it to do exactly that when on-demand CDS is used, as per discussion in #16861 and #16855.

It may be that Envoy does not yet support this mechanism for SotW, but there's no protocol-level reason that it can't do so.

@jewertow
Copy link
Contributor Author

jewertow commented Jan 5, 2023

@adisuissa thank you for linking that document. Now I understand the wider context.

the interaction between “on-demand” discovery and “state-of-the-world” discovery may be troublesome. Basically once a cluster is discovered on-demand, it will also need to appear in the state-of-the-world discovery response, otherwise Envoy will remove the cluster. This means that the management service would need to track which Envoy instance requested for which cluster, so it can send a “tailored” state-of-the-world response.

I was thinking about this and in my opinion SotW could also be supported and there's no protocol-level reason that it can't do so, as @markdroth said. Even if a control plane will not to "tailor" the SotW pushes that remove previously requested clusters, then on-demand CDS will load those clusters again on the next request.

My use case is to prevent loading new routes by the listener manager when ClusterLoadAssignments were not received and clusters were not warmed. In such a case, a listener returns 503, because it routes request to a non-existing cluster that in fact exists, but is not ready to serve traffic, because of warming.
On-demand CDS should solve this problem, because it pauses processing requests until cluster exist. It's also important to note that in this case, the control plane would not need to tailor SotW response, because it will send the expected state.

@adisuissa
Copy link
Contributor

I agree that Envoy essentially supports CDS as wildcard/collection subscription, and that is why SotW isn't supported (although generally speaking, it shouldn't be a problem).

My use case is to prevent loading new routes by the listener manager when ClusterLoadAssignments were not received and clusters were not warmed. In such a case, a listener returns 503, because it routes request to a non-existing cluster that in fact exists, but is not ready to serve traffic, because of warming. On-demand CDS should solve this problem, because it pauses processing requests until cluster exist. It's also important to note that in this case, the control plane would not need to tailor SotW response, because it will send the expected state.

IIUC this is because there's no warming between RDS and CDS. Technically speaking, a control-plane can send an RDS update with a route to non-existent cluster (or that will be added in a future CDS response), and that route will also return 503s.
If the control-plane were to wait until the endpoints (EDS) of that cluster were ack'd before sending the listener and/or route, that wouldn't have happened, no?

@jewertow
Copy link
Contributor Author

jewertow commented Jan 5, 2023

IIUC this is because there's no warming between RDS and CDS.

Yes, you understand it correctly.

If the control-plane were to wait until the endpoints (EDS) of that cluster were ack'd before sending the listener and/or route, that wouldn't have happened, no?

You're right. I was considering this approach, but it has the following downsides when we consider SotW stream:

  1. When a route is changed so that it switches from cluster c1 to c2, the control plane would have to send both c1 and c2 clusters to avoid breaking traffic during an update. It means the control plane must somehow remember the previous state, but this implies more memory consumption and scalability concerns. Beside that, there is still a case when it would not work.
  2. Waiting on EDS ACK before sending would impact performance of pushing resources to Envoy and @howardjohn is concerned about this.

@markdroth
Copy link
Contributor

If the control-plane were to wait until the endpoints (EDS) of that cluster were ack'd before sending the listener and/or route, that wouldn't have happened, no?

In general, I don't think it's a good idea for control planes to do things like that, because this is a layering violation: the transport protocol layer should just treat the resources as opaque blobs and should not need to know anything about their contents or their relationships to each other. (Note that this is essentially the same argument I made against the control plane being required to re-send the EDS resource when the CDS resource changes in #13009.)

@markdroth
Copy link
Contributor

I think the ideal solution to @jewertow's use case would be for Envoy to have the ability to fetch only a specific set of individual cluster names (determined from processing the RouteConfig) rather than making a wildcard CDS subscription. This is basically what gRPC does today, and I think it would effectively involve warming between RDS and CDS.

But if that is too complicated for Envoy to implement, then I think on-demand CDS would also be a fine approach. And I do not believe that there is any conceptual reason that this won't work with the xDS transport protocol.

Ultimately, I think this comes down to a question of which approach is easier to implement in Envoy.

@jewertow
Copy link
Contributor Author

Thank you very much for feedback, but after analyzing this feature in more detail, I decided to not working on this for now.
@adisuissa can we merge this warning?

Copy link
Contributor

@adisuissa adisuissa left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, left a minor suggestion.

@adisuissa
Copy link
Contributor

FWIW, I think that it is possible that ODCDS will use SotW, it just needs to handle a few corner cases.
It would be simpler if there will be a separation between singleton and collection subscriptions, but this is somewhat orthogonal to this discussion.

Co-authored-by: Adi (Suissa) Peleg <[email protected]>
Signed-off-by: Jacek Ewertowski <[email protected]>
@jewertow
Copy link
Contributor Author

FWIW, I think that it is possible that ODCDS will use SotW, it just needs to handle a few corner cases.

Sure, thanks. I wanted to add this warning to just save someone's time when trying to use this feature with Istio.

Copy link
Contributor

@adisuissa adisuissa left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, thanks!

@KBaichoo
Copy link
Contributor

/retest

@repokitteh-read-only
Copy link

Retrying Azure Pipelines:
Retried failed jobs in: envoy-presubmit

🐱

Caused by: a #24763 (comment) was created by @KBaichoo.

see: more, trace.

@KBaichoo KBaichoo enabled auto-merge (squash) January 10, 2023 18:28
@KBaichoo KBaichoo merged commit cdc1c1d into envoyproxy:main Jan 10, 2023
VishalDamgude pushed a commit to freshworks/envoy that referenced this pull request Feb 1, 2023
* Clarify that on-demand CDS does not support SotW xDS

Signed-off-by: Jacek Ewertowski <[email protected]>
Signed-off-by: VishalDamgude <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants