-
Notifications
You must be signed in to change notification settings - Fork 4.9k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
eds: introducing EDS resources cache #28079
Conversation
Signed-off-by: Adi Suissa-Peleg <[email protected]>
Signed-off-by: Adi Suissa-Peleg <[email protected]>
/assign-from @envoyproxy/envoy-maintainers |
@envoyproxy/envoy-maintainers assignee is @ggreenway |
cc @abeyad as an expert on the config-plane pipeline |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for this thoroughly well-written and well-tested PR! Overall looks great, left a few minor comments.
envoy/config/eds_resources_cache.h
Outdated
// Represents an xDS resources cache for EDS resources, and currently supports | ||
// a single config-source (ADS). The motivation is that clusters that are | ||
// updated (not added) during a CDS response will be able to use the current EDS | ||
// configuration, thus avoiding an additional EDS response. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Will the CDS response always be followed by an EDS response if it's an EDS cluster? Or not necessarily?
The reason I ask is, assuming connectivity to the xDS server is sound, would that mean the cached EDS resources would only be used for a very brief period until the EDS response comes soon after the CDS response's arrival? An explanation to that effect in the comments might be helpful.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'll provide some more context here, and let me know if you think this should go into the comment.
A CDS response will create a cluster in warming mode. If an EDS response will not be delivered in 15 seconds, that cluster will become active with an empty assignment (no endpoints). The series of these PRs will allow Envoy to use the previously cached assignment if no EDS response is received by the time the warming ends.
According to the xDS-protocol, Envoy should use the cached EDS assignment as soon as the CDS response arrives. However, Envoy intentionally doesn't do so, because it supports a case where the cluster's endpoints are substantially changed. For example, when a cluster is changed from a non-TLS cluster to a TLS cluster, the old assignment should not be used, and Envoy should use the new assignment.
This work is essentially allowing the use of cached resources, thus not requiring the server to send an assignment if there was no update, while still allowing a use-case as described above.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That makes sense, thanks for the explanation. I think just an extra note in the comments saying the xDS server isn't required to send an EDS response after a CDS response would help clarify things.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Clarified the comment.
// The value of the map, holds the resource and the removal callbacks. | ||
struct ResourceData { | ||
envoy::config::endpoint::v3::ClusterLoadAssignment resource_; | ||
std::list<EdsResourceRemovalCallback*> removal_cbs_; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think std::list probably makes sense here since removal from the list would likely be frequent, but I think if we expect removal to not be a regular occurrence than maybe std::vector is a better choice for better cache locality.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmmm... good point. I expect the list to be of size 1 or 2 at most. I'm not sure if it will matter that much, but I'm willing to change it to a vector if you think it makes more sense.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think std::vector probably makes more sense, with a call to reserve(2)
? The only benefit of std::list would be frequent random insertion and deletion in a larger list, but with such a small list, that benefit wouldn't exist.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've changed to a vector. I don't think that reserving the size to 2 makes sense here, as the typical use-case is an empty-vector.
test/extensions/config_subscription/grpc/eds_resources_cache_impl_test.cc
Outdated
Show resolved
Hide resolved
Signed-off-by: Adi Suissa-Peleg <[email protected]>
envoy/config/eds_resources_cache.h
Outdated
// Represents an xDS resources cache for EDS resources, and currently supports | ||
// a single config-source (ADS). The motivation is that clusters that are | ||
// updated (not added) during a CDS response will be able to use the current EDS | ||
// configuration, thus avoiding an additional EDS response. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That makes sense, thanks for the explanation. I think just an extra note in the comments saying the xDS server isn't required to send an EDS response after a CDS response would help clarify things.
* resource doesn't exist. | ||
*/ | ||
virtual OptRef<const envoy::config::endpoint::v3::ClusterLoadAssignment> | ||
getResource(absl::string_view resource_name, EdsResourceRemovalCallback* removal_cb) PURE; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you add a comment on the lifetime expectations of the EdsResourceRemovalCallback pointer? i.e. why is it not owned by the EdsResourcesCache class?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've updated the comment to clarify the lifetime.
// The value of the map, holds the resource and the removal callbacks. | ||
struct ResourceData { | ||
envoy::config::endpoint::v3::ClusterLoadAssignment resource_; | ||
std::list<EdsResourceRemovalCallback*> removal_cbs_; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think std::vector probably makes more sense, with a call to reserve(2)
? The only benefit of std::list would be frequent random insertion and deletion in a larger list, but with such a small list, that benefit wouldn't exist.
Signed-off-by: Adi Suissa-Peleg <[email protected]>
Signed-off-by: Adi Suissa-Peleg <[email protected]>
Signed-off-by: Adi Suissa-Peleg <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @adisuissa !
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
Signed-off-by: Adi Suissa-Peleg <[email protected]> Signed-off-by: Ryan Eskin <[email protected]>
Continuation of PR #28079 (as part of the work for issue #26749). Currently after an EDS-cluster update, Envoy waits for an EDS response. If a timeout occurs, the EDS-cluster will be used without endpoints. This PR adds the use of caching into the GrpcMux. The GrpcMux object adds an EDS resource to the cache when it is received/updated, and removes it when there are no longer subscriptions (watchers). A runtime flag is added to disable the use of the cache, and will be enabled in a future PR when ADS is used. Next PR will plumb this into ADS, and add fetching of resources from the cache as part of the EdsClusterImpl. The entire change can be looked here: adisuissa/envoy@f0b7ac8 Risk Level: Low - the disabled runtime flag should prevent the use of the cache in non-tests code. Testing: Added unit tests. Docs Changes: N/A. Release Notes: N/A (future PR). Platform Specific Features: N/A. Runtime guard: disabled by default: envoy_restart_features_use_eds_cache_for_ads Signed-off-by: Adi Suissa-Peleg <[email protected]>
Commit Message: introducing EDS resources cache
Additional Description:
Currently after an EDS-cluster update, Envoy waits for an EDS response. If a timeout occurs, the EDS-cluster will be used without endpoints.
This PR introduces a cache to store EDS resources (ClusterLoadAssignments), that allows invoking a callback when a resource is removed or expired.
This will be followed up by 2 PRs:
Risk Level: low - introducing a new component which isn't used anywhere.
Testing: Adding a unit test for the class.
Docs Changes: N/A.
Release Notes: N/A.
Platform Specific Features: N/A.
Part of #26749
The entire change can be looked here: adisuissa@f0b7ac8