Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
cds: Configure LRS in CDS response #7953
cds: Configure LRS in CDS response #7953
Changes from 3 commits
8d3285e
6dd7370
cc3906d
1c66fa6
a3e0810
6fa9e39
File filter
Filter by extension
Conversations
Jump to
There are no files selected for viewing
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In discussing this with @vishalpowar, one problem here is that some management server deployments want the LRS stream to be sent to the same management server that is providing the CDS request, but they don't currently know what name or credentials their clients are using to access the management server. It would be more work for them to figure that out just to give the client information that it already has (because it used the name and credentials to access the CDS request in the first place).
Currently, ConfigSource has a special case for ADS that basically says "use the same ADS stream you're already using", but that won't work here, because LRS can't be sent over ADS. To address this, can we add a new field to the "oneof" in ConfigSource that basically says "talk to the same server you're getting this message from" but without being linked to ADS?
I think this new type of ConfigSource might also be useful in cases that are not using ADS but where LDS points to RDS on the same server or where CDS points to EDS on the same server.
@htuch and @mattklein123, WDYT?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This makes sense to me but can we track this in a new issue/proposal?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure. I've sent out #8201 for review.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks, I'm largely supportive of adding this. There are a few things to sort out though:
envoy/api/envoy/config/bootstrap/v2/bootstrap.proto
Line 194 in 817b2e3
LoadStatsResponse
in here? Should we just have aninitial_load_stats_response
message? Or should we just leave this to the bidi stream as exists today. Would be good to discuss this one further.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah, interesting -- I didn't realize that Envoy had the LRS server configured via its bootstrap. Should we deprecate that field in the bootstrap proto as part of this? (See below for further discussion about migration strategy.)
I've added the requested comments.
My thinking here was that this new mechanism would completely replace the existing mechanism, so when this new field is populated in CDS, there will never be a
LoadStatsResponse
message sent on the LRS stream. Instead, for both initial responses and for subsequent changes, CDS will send an update of the cluster with the new values in these fields. For example, to change the load reporting interval for a given cluster, CDS will send an update that includes the new load reporting interval. Or, to stop sending load reports for a given cluster, CDS will send an update that leaves theload_report_config
field unset.However, thinking about this further, maybe we need to give more thought to the migration story between the current approach and the new approach. I see two migration cases to consider here:
Envoy clients that do not yet support this new field. These will ignore the new field even if the management server sends it, and they will require the management server to send the
LoadStatsResponse
message on the LRS stream.Envoy clients that do support this new field but are talking to a management server that does not support the new field. In this case, the client would need to continue configuring LRS via the bootstrap file as a fallback, and the management server would continue to send the
LoadStatsResponse
message on the LRS stream.One possible way to handle this is to say that clients that start LRS based on the new CDS field will ignore any
LoadStatsResponse
sent on the LRS stream and document that management servers must continue sendingLoadStatsResponse
on the LRS stream until they are certain that there are no more old clients.Another way to do this would be to add a new boolean field to the LRS request indicating whether the client was configured to use LRS via the new CDS field, in which case the LRS server will know not to send any responses on the LRS stream. This is more explicit, but it does leave this new boolean field lying around even after the migration is complete, which seems sub-optimal.
Do we need to worry about federation cases where some clusters come from a management server that provides the new CDS field but others come from a management server that expects LRS to be configured via the bootstrap? Or is that not a thing in the v2 APIs?
I was not expecting to have to implement this in Envoy. Envoy and gRPC are both consumers of this API, but if we agree jointly on an API change, I was assuming that it would be up to each of us to update our implementations to support the change. If Envoy needs an API change in the future, I would not expect you folks to implement the change in gRPC. And this approach certainly will not scale once there are other clients of this API.
We should probably discuss this further offline. I do agree that this should be implemented in Envoy, and I don't want us to be in a situation where one client of this API is incentivized to say no to a change because they don't want to have to do any work to implement the change, nor do I want us to be in a situation where the ecosystem gets fragmented because new API additions get supported by only a subset of clients. But I also don't want to be on the hook for updating Envoy whenever gRPC needs an API change. So let's talk about this further to see if we can come up with a more sustainable approach.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think that we have two situations:
LoadStatsResponse
here, that will be used by the client in the absence of any server message. We can deprecate in documentation etc. the streamed aspect here.How does that sound?
I agree we need to better coordinate on the implementation front. I don't think it should be expected that gRPC folks implement any API change on the Envoy side (although we wouldn't complain!), but simultaneously, we need to schedule and budget engineering bandwidth in OSS or at Google to make the Envoy migration happen when we introduce deprecations like this; we can chat offline further about this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm okay in principle with the idea of leaving the LRS protocol as-is for v2 and making further changes for v3. However, I want to make sure that the semantics we provide actually make sense.
If we leave the LRS protocol itself as-is but still allow configuring the LRS server in the CDS response, it seems like there could be some confusion in the protocol. For example, let's say CDS for cluster A points to LRS server 1 and CDS for cluster B points to LRS server 2. But when the client contacts each of the LRS servers, the LRS server will indicate to the client which clusters it wants data for. So what happens if LRS server 1 says that it wants data for clusters A and B? Is that okay, given that the CDS response for B indicated to use a different LRS server? And what happens if both LRS servers request data for the same cluster? Wouldn't that result in double-reporting the load?
This potential confusion is why I was trying to change LRS such that if the LRS server is configured via CDS, the set of clusters to send data for is no longer controlled by LRS itself. Otherwise, we have two potentially conflicting sources of information.
To be clear, I don't actually have a use-case where I want to configure a different LRS server for different clusters. What I'm really after here is a way to get the LRS server via xDS, so that I don't have to configure it on the client (e.g., via a bootstrap file). The CDS response seems like the appropriate place to put it in xDS, which means that it would inherently be able to be set on a per-cluster basis, even though that's not really what I'm after.
If you have an alternative suggestion, I'd be open to hearing it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In an earlier comment you mentioned using a flag in LoadStatsRequest indicating that it was configured via CDS. This along with making the client ignore LoadStatsResponse from Management Server (if the Management Server has not yet implemented the flag) should cover the migration and compatibility story between the clients and server.
Including the example of two clusters that you have above.
Or, am I missing something,