Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

eds: introducing EDS resources cache #28079

Merged
merged 6 commits into from
Jun 28, 2023

Conversation

adisuissa
Copy link
Contributor

@adisuissa adisuissa commented Jun 21, 2023

Commit Message: introducing EDS resources cache
Additional Description:

Currently after an EDS-cluster update, Envoy waits for an EDS response. If a timeout occurs, the EDS-cluster will be used without endpoints.
This PR introduces a cache to store EDS resources (ClusterLoadAssignments), that allows invoking a callback when a resource is removed or expired.
This will be followed up by 2 PRs:

  • Plumbing the cache in the cluster-manager and GrpcMux implementations.
  • Modifying the EdsClusterImpl class to use the cache.

Risk Level: low - introducing a new component which isn't used anywhere.
Testing: Adding a unit test for the class.
Docs Changes: N/A.
Release Notes: N/A.
Platform Specific Features: N/A.
Part of #26749

The entire change can be looked here: adisuissa@f0b7ac8

Signed-off-by: Adi Suissa-Peleg <[email protected]>
@zuercher
Copy link
Member

/assign-from @envoyproxy/envoy-maintainers

@repokitteh-read-only
Copy link

@envoyproxy/envoy-maintainers assignee is @ggreenway

🐱

Caused by: a #28079 (comment) was created by @zuercher.

see: more, trace.

@adisuissa
Copy link
Contributor Author

cc @abeyad as an expert on the config-plane pipeline

Copy link
Contributor

@abeyad abeyad left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for this thoroughly well-written and well-tested PR! Overall looks great, left a few minor comments.

// Represents an xDS resources cache for EDS resources, and currently supports
// a single config-source (ADS). The motivation is that clusters that are
// updated (not added) during a CDS response will be able to use the current EDS
// configuration, thus avoiding an additional EDS response.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will the CDS response always be followed by an EDS response if it's an EDS cluster? Or not necessarily?
The reason I ask is, assuming connectivity to the xDS server is sound, would that mean the cached EDS resources would only be used for a very brief period until the EDS response comes soon after the CDS response's arrival? An explanation to that effect in the comments might be helpful.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll provide some more context here, and let me know if you think this should go into the comment.

A CDS response will create a cluster in warming mode. If an EDS response will not be delivered in 15 seconds, that cluster will become active with an empty assignment (no endpoints). The series of these PRs will allow Envoy to use the previously cached assignment if no EDS response is received by the time the warming ends.

According to the xDS-protocol, Envoy should use the cached EDS assignment as soon as the CDS response arrives. However, Envoy intentionally doesn't do so, because it supports a case where the cluster's endpoints are substantially changed. For example, when a cluster is changed from a non-TLS cluster to a TLS cluster, the old assignment should not be used, and Envoy should use the new assignment.

This work is essentially allowing the use of cached resources, thus not requiring the server to send an assignment if there was no update, while still allowing a use-case as described above.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That makes sense, thanks for the explanation. I think just an extra note in the comments saying the xDS server isn't required to send an EDS response after a CDS response would help clarify things.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Clarified the comment.

envoy/config/eds_resources_cache.h Outdated Show resolved Hide resolved
envoy/config/eds_resources_cache.h Outdated Show resolved Hide resolved
// The value of the map, holds the resource and the removal callbacks.
struct ResourceData {
envoy::config::endpoint::v3::ClusterLoadAssignment resource_;
std::list<EdsResourceRemovalCallback*> removal_cbs_;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think std::list probably makes sense here since removal from the list would likely be frequent, but I think if we expect removal to not be a regular occurrence than maybe std::vector is a better choice for better cache locality.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmmm... good point. I expect the list to be of size 1 or 2 at most. I'm not sure if it will matter that much, but I'm willing to change it to a vector if you think it makes more sense.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think std::vector probably makes more sense, with a call to reserve(2) ? The only benefit of std::list would be frequent random insertion and deletion in a larger list, but with such a small list, that benefit wouldn't exist.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've changed to a vector. I don't think that reserving the size to 2 makes sense here, as the typical use-case is an empty-vector.

Signed-off-by: Adi Suissa-Peleg <[email protected]>
// Represents an xDS resources cache for EDS resources, and currently supports
// a single config-source (ADS). The motivation is that clusters that are
// updated (not added) during a CDS response will be able to use the current EDS
// configuration, thus avoiding an additional EDS response.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That makes sense, thanks for the explanation. I think just an extra note in the comments saying the xDS server isn't required to send an EDS response after a CDS response would help clarify things.

* resource doesn't exist.
*/
virtual OptRef<const envoy::config::endpoint::v3::ClusterLoadAssignment>
getResource(absl::string_view resource_name, EdsResourceRemovalCallback* removal_cb) PURE;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you add a comment on the lifetime expectations of the EdsResourceRemovalCallback pointer? i.e. why is it not owned by the EdsResourcesCache class?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've updated the comment to clarify the lifetime.

// The value of the map, holds the resource and the removal callbacks.
struct ResourceData {
envoy::config::endpoint::v3::ClusterLoadAssignment resource_;
std::list<EdsResourceRemovalCallback*> removal_cbs_;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think std::vector probably makes more sense, with a call to reserve(2) ? The only benefit of std::list would be frequent random insertion and deletion in a larger list, but with such a small list, that benefit wouldn't exist.

@abeyad abeyad self-assigned this Jun 23, 2023
Signed-off-by: Adi Suissa-Peleg <[email protected]>
Signed-off-by: Adi Suissa-Peleg <[email protected]>
Copy link
Contributor

@abeyad abeyad left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @adisuissa !

Copy link
Contributor

@ggreenway ggreenway left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@ggreenway ggreenway merged commit 057c80b into envoyproxy:main Jun 28, 2023
reskin89 pushed a commit to reskin89/envoy that referenced this pull request Jul 11, 2023
Signed-off-by: Adi Suissa-Peleg <[email protected]>
Signed-off-by: Ryan Eskin <[email protected]>
htuch pushed a commit that referenced this pull request Aug 2, 2023
Continuation of PR #28079 (as part of the work for issue #26749).

Currently after an EDS-cluster update, Envoy waits for an EDS response. If a timeout occurs, the EDS-cluster will be used without endpoints.

This PR adds the use of caching into the GrpcMux. The GrpcMux object adds an EDS resource to the cache when it is received/updated, and removes it when there are no longer subscriptions (watchers).

A runtime flag is added to disable the use of the cache, and will be enabled in a future PR when ADS is used.

Next PR will plumb this into ADS, and add fetching of resources from the cache as part of the EdsClusterImpl.

The entire change can be looked here: adisuissa/envoy@f0b7ac8

Risk Level: Low - the disabled runtime flag should prevent the use of the cache in non-tests code.
Testing: Added unit tests.
Docs Changes: N/A.
Release Notes: N/A (future PR).
Platform Specific Features: N/A.
Runtime guard: disabled by default: envoy_restart_features_use_eds_cache_for_ads

Signed-off-by: Adi Suissa-Peleg <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants