cds: Configure LRS in CDS response #7953

Merged — 6 commits, Sep 20, 2019
Conversation

@markdroth (Contributor) opened this PR:

Description: This PR adds fields to CDS that allow for configuring LRS.
Risk Level: Low
Testing: None (but if anything is needed, please let me know)
Docs Changes: Inline with API protos
Release Notes: N/A

cc @htuch @vishalpowar

@htuch (Member) left a comment:

/wait

google.protobuf.Duration load_reporting_interval = 2;
bool report_endpoint_granularity = 3;
}
LoadReportConfig load_report_config = 41;
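For context, a rough reconstruction of the full shape of the addition under review, pieced together from the diff excerpts quoted in this thread (the enclosing Cluster message and exact comment text are assumptions, not the final merged form):

```protobuf
// Sketch only: reconstructed from the excerpts quoted in this review.
// Field numbers match the excerpts; surrounding structure is assumed.
message Cluster {
  // ... existing Cluster fields ...

  // Configuration for sending load reports for this cluster to an LRS server.
  message LoadReportConfig {
    // A management server endpoint to stream load stats to via StreamLoadStats.
    core.ApiConfigSource lrs_server = 1;
    // The interval at which load reports are sent.
    google.protobuf.Duration load_reporting_interval = 2;
    // Whether stats are reported at per-endpoint granularity.
    bool report_endpoint_granularity = 3;
  }
  LoadReportConfig load_report_config = 41;
}
```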
htuch (Member) commented on the diff:

Thanks, I'm largely supportive of adding this. There are a few things to sort out though:

  1. We should have comments similar to `// A management server endpoint to stream load stats to via`.
  2. Do we really want to duplicate bits of LoadStatsResponse in here? Should we just have an initial_load_stats_response message? Or should we just leave this to the bidi stream as exists today. Would be good to discuss this one further.
  3. As per the other PR today, I'm wondering if someone is signed up to implement this in Envoy? We can't really take fields in the API like this that don't have an Envoy implementation.

markdroth (Contributor, author) replied:

> 1. We should have comments that are similar to `// A management server endpoint to stream load stats to via`.

Ah, interesting -- I didn't realize that Envoy had the LRS server configured via its bootstrap. Should we deprecate that field in the bootstrap proto as part of this? (See below for further discussion about migration strategy.)

I've added the requested comments.

> 2. Do we really want to duplicate bits of LoadStatsResponse in here? Should we just have an initial_load_stats_response message? Or should we just leave this to the bidi stream as exists today? Would be good to discuss this one further.

My thinking here was that this new mechanism would completely replace the existing mechanism, so when this new field is populated in CDS, there will never be a LoadStatsResponse message sent on the LRS stream. Instead, for both initial responses and for subsequent changes, CDS will send an update of the cluster with the new values in these fields. For example, to change the load reporting interval for a given cluster, CDS will send an update that includes the new load reporting interval. Or, to stop sending load reports for a given cluster, CDS will send an update that leaves the load_report_config field unset.

However, thinking about this further, maybe we need to give more thought to the migration story between the current approach and the new approach. I see two migration cases to consider here:

  1. Envoy clients that do not yet support this new field. These will ignore the new field even if the management server sends it, and they will require the management server to send the LoadStatsResponse message on the LRS stream.

  2. Envoy clients that do support this new field but are talking to a management server that does not support the new field. In this case, the client would need to continue configuring LRS via the bootstrap file as a fallback, and the management server would continue to send the LoadStatsResponse message on the LRS stream.

One possible way to handle this is to say that clients that start LRS based on the new CDS field will ignore any LoadStatsResponse sent on the LRS stream and document that management servers must continue sending LoadStatsResponse on the LRS stream until they are certain that there are no more old clients.

Another way to do this would be to add a new boolean field to the LRS request indicating whether the client was configured to use LRS via the new CDS field, in which case the LRS server will know not to send any responses on the LRS stream. This is more explicit, but it does leave this new boolean field lying around even after the migration is complete, which seems sub-optimal.
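The second option could be sketched roughly as follows; the field name and number are placeholders for illustration, not part of this PR:

```protobuf
// Hypothetical sketch of the explicit-migration option discussed above.
message LoadStatsRequest {
  // ... existing fields (node, cluster_stats) ...

  // If true, the client was pointed at this LRS server via the new CDS
  // field, so the server should not send LoadStatsResponse messages on
  // this stream.
  bool configured_via_cds = 3;
}
```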

Do we need to worry about federation cases where some clusters come from a management server that provides the new CDS field but others come from a management server that expects LRS to be configured via the bootstrap? Or is that not a thing in the v2 APIs?

> 3. As per the other PR today, I'm wondering if someone is signed up to implement this in Envoy? We can't really take fields in the API like this that don't have an Envoy implementation.

I was not expecting to have to implement this in Envoy. Envoy and gRPC are both consumers of this API, but if we agree jointly on an API change, I was assuming that it would be up to each of us to update our implementations to support the change. If Envoy needs an API change in the future, I would not expect you folks to implement the change in gRPC. And this approach certainly will not scale once there are other clients of this API.

We should probably discuss this further offline. I do agree that this should be implemented in Envoy, and I don't want us to be in a situation where one client of this API is incentivized to say no to a change because they don't want to have to do any work to implement the change, nor do I want us to be in a situation where the ecosystem gets fragmented because new API additions get supported by only a subset of clients. But I also don't want to be on the hook for updating Envoy whenever gRPC needs an API change. So let's talk about this further to see if we can come up with a more sustainable approach.

htuch (Member) replied:

I think that we have two situations:

  1. For v2 xDS, we should add the LRS field, to allow distinct config sources and servers per Cluster. This is directionally where we want to go. We can even specify an initial LoadStatsResponse here, that will be used by the client in the absence of any server message. We can deprecate in documentation etc. the streamed aspect here.
  2. For v3 (which we can make happen at EOQ roughly), we can make the more radical changes.

How does that sound?

I agree we need to better coordinate on the implementation front. I don't think it should be expected that gRPC folks implement any API change on the Envoy side (although we wouldn't complain!), but simultaneously, we need to schedule and budget engineering bandwidth in OSS or at Google to make the Envoy migration happen when we introduce deprecations like this; we can chat offline further about this.

markdroth (Contributor, author) replied:

I'm okay in principle with the idea of leaving the LRS protocol as-is for v2 and making further changes for v3. However, I want to make sure that the semantics we provide actually make sense.

If we leave the LRS protocol itself as-is but still allow configuring the LRS server in the CDS response, it seems like there could be some confusion in the protocol. For example, let's say CDS for cluster A points to LRS server 1 and CDS for cluster B points to LRS server 2. But when the client contacts each of the LRS servers, the LRS server will indicate to the client which clusters it wants data for. So what happens if LRS server 1 says that it wants data for clusters A and B? Is that okay, given that the CDS response for B indicated to use a different LRS server? And what happens if both LRS servers request data for the same cluster? Wouldn't that result in double-reporting the load?

This potential confusion is why I was trying to change LRS such that if the LRS server is configured via CDS, the set of clusters to send data for is no longer controlled by LRS itself. Otherwise, we have two potentially conflicting sources of information.

To be clear, I don't actually have a use-case where I want to configure a different LRS server for different clusters. What I'm really after here is a way to get the LRS server via xDS, so that I don't have to configure it on the client (e.g., via a bootstrap file). The CDS response seems like the appropriate place to put it in xDS, which means that it would inherently be able to be set on a per-cluster basis, even though that's not really what I'm after.

If you have an alternative suggestion, I'd be open to hearing it.

Another contributor replied:

> To be clear, I don't actually have a use-case where I want to configure a different LRS server for different clusters. What I'm really after here is a way to get the LRS server via xDS, so that I don't have to configure it on the client (e.g., via a bootstrap file). The CDS response seems like the appropriate place to put it in xDS, which means that it would inherently be able to be set on a per-cluster basis, even though that's not really what I'm after.

In an earlier comment you mentioned using a flag in LoadStatsRequest indicating that it was configured via CDS. That, along with making the client ignore LoadStatsResponse from the management server (if the management server has not yet implemented the flag), should cover the migration and compatibility story between the clients and the server, including the two-cluster example you give above.

Or am I missing something?

Signed-off-by: Mark D. Roth <[email protected]>
@repokitteh-read-only commented:

CC @envoyproxy/api-shepherds: Your approval is needed for changes made to api/.

🐱

Caused by: #7953 was synchronize by markdroth.


stale bot commented Aug 27, 2019:

This pull request has been automatically marked as stale because it has not had activity in the last 7 days. It will be closed in 7 days if no further activity occurs. Please feel free to give a status update now, ping for review, or re-open when it's ready. Thank you for your contributions!

@stale stale bot added stale stalebot believes this issue/PR has not been touched recently and removed stale stalebot believes this issue/PR has not been touched recently labels Aug 27, 2019
@costinm (Contributor) left a comment:

How does this fit with UDPA? I was hoping that in UDPA the client can use a single bi-directional streaming sticky connection to handle all its config needs, including sending load info and getting load assignments. There are a few benefits at large scale to doing this. Would it be possible to send the load report payload as part of xDS, by extending DiscoveryRequest to allow an Any payload (and making it closer to UDPA)?

@htuch commented Aug 29, 2019:

@costinm in UDPA we will have support for affinity. This issue is about v2/v3, which will be supported APIs (in aggregate) for the next 2 and a bit years.

stale bot posted the same stale notice again on Sep 5, 2019.

@stale stale bot added the stale stalebot believes this issue/PR has not been touched recently label Sep 5, 2019
@stale stale bot removed the stale stalebot believes this issue/PR has not been touched recently label Sep 5, 2019
@repokitteh-read-only repeated the api-shepherds approval notice after the PR was synchronized again.

@mattklein123 mattklein123 self-assigned this Sep 9, 2019
@mattklein123 (Member) left a comment:

Thanks for adding this. I need this for something I want to build at Lyft so great to see this in flight. A couple of comments/questions.

// If multiple clusters are configured to use the same LRS server, then the data for all of
// those clusters may be sent on the same LRS stream.
//
// Note that if LRS is configured this way, then there is no need for the LRS server to send
mattklein123 (Member) commented on the diff:

From a federation perspective I think it would be better if the LRS server could still independently configure the interval, and potentially even turn off reports until they are needed. So I guess 2 questions:

  1. Can we clarify that the duration can be a sentinel value (empty, 0, whatever) meaning that no reports are sent?
  2. And that this can be configured in either the CDS or the LRS response?

WDYT?

markdroth (Contributor, author) replied:

Playing devil's advocate for a moment, why is it better for federation purposes to configure the interval via LRS rather than via CDS? Shouldn't federation be able to configure this via CDS?

This seems similar to the federation use-case we discussed for setting load balancing weights. How did we wind up deciding to handle that use-case once we removed proto-merging from the UDPA federation design? Could we apply whatever solution we used for that to this case as well?

If there's a good reason why this can be more easily done via LRS than via CDS, then maybe we should just leave the negotiation in the LRS stream where it is today and instead restrict this change to simply point the client at the right LRS server. I'd prefer to avoid the complexity of trying to support two different ways to do the same thing.

mattklein123 (Member) replied:

> then maybe we should just leave the negotiation in the LRS stream where it is today and instead restrict this change to simply point the client at the right LRS server. I'd prefer to avoid the complexity of trying to support two different ways to do the same thing.

+1. The reason I want this is that there are different implementations and a separation of concerns between CDS and LRS, and IMO it's better for LRS to control the reporting rate, so my strong preference is to have that done by LRS only, as you propose.

markdroth (Contributor, author) replied:

Thinking about this further... I think the reason I was trying to move this out of LRS in the first place is that as soon as we start configuring the LRS server in the CDS response, if we leave the negotiation in the LRS stream, then a misconfiguration could cause duplicate load reporting. Consider the case where cluster A configures LRS server X and cluster B configures LRS server Y. If the LRS streams to X and Y both request data for both clusters, then the data could wind up being double-reported.

Also, I think Harvey's proposal for UDPA was that use-cases like LRS were going to be handled by having the client publish the load-report resources on the single UDPA stream. I'm not sure we've fully thought through how we were going to decide which management server was going to get which load-report data in a federation deployment.

We should probably talk this through in a bit more detail. How about we have a short meeting about this once Harvey is back?

A member replied:

SGTM

// *StreamLoadStats*. This must have :ref:`api_type
// <envoy_api_field_core.ApiConfigSource.api_type>` :ref:`GRPC
// <envoy_api_enum_value_core.ApiConfigSource.ApiType.GRPC>`.
core.ApiConfigSource lrs_server = 1;
markdroth (Contributor, author) commented on the diff:

In discussing this with @vishalpowar, one problem here is that some management server deployments want the LRS stream to be sent to the same management server that is serving the CDS response, but they don't currently know what name or credentials their clients are using to access the management server. It would be more work for them to figure that out just to give the client information it already has (because it used that name and those credentials to send the CDS request in the first place).

Currently, ConfigSource has a special case for ADS that basically says "use the same ADS stream you're already using", but that won't work here, because LRS can't be sent over ADS. To address this, can we add a new field to the "oneof" in ConfigSource that basically says "talk to the same server you're getting this message from" but without being linked to ADS?

I think this new type of ConfigSource might also be useful in cases that are not using ADS but where LDS points to RDS on the same server or where CDS points to EDS on the same server.
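The proposal could be sketched roughly like this (an illustration of the idea, not the text of #8201; the variant's name and field number are assumptions):

```protobuf
// Sketch of the proposed ConfigSource variant: "talk to the same server you
// are getting this resource from", over a separate stream, without being
// linked to ADS. Names and numbering are illustrative.
message SelfConfigSource {
}

message ConfigSource {
  oneof config_source_specifier {
    // Existing cases: filesystem path, explicit API config, ADS.
    string path = 1;
    ApiConfigSource api_config_source = 2;
    AggregatedConfigSource ads = 3;
    // New: reuse the server that supplied the enclosing resource.
    SelfConfigSource self = 5;
  }
}
```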

@htuch and @mattklein123, WDYT?

A member replied:

This makes sense to me but can we track this in a new issue/proposal?

markdroth (Contributor, author) replied:

Sure. I've sent out #8201 for review.

@markdroth (Contributor, author) commented:

As per my discussion yesterday with @htuch and @mattklein123, I've changed this PR to leave all of the negotiation in the LRS stream but to configure the LRS server in the CDS response. Note that this requires #8201, which Harvey is investigating the feasibility of.

I have addressed the concern about duplicate load reporting by saying that the client is responsible for ensuring that the data for a given cluster is reported on no more than one stream.
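That client-side responsibility could look something like the following sketch (illustrative only, not Envoy or gRPC source; all names are hypothetical):

```python
# Illustrative sketch: a client-side guard ensuring that each cluster's load
# data is reported on at most one LRS stream, even if multiple streams
# request the same cluster.

class LrsStreamManager:
    def __init__(self):
        # Maps cluster name -> stream id that "owns" reporting for it.
        self._owner = {}

    def clusters_to_report(self, stream_id, requested_clusters):
        """Return the subset of requested clusters this stream may report.

        The first stream to request a cluster becomes its owner; later
        streams requesting the same cluster get nothing for it.
        """
        allowed = []
        for cluster in requested_clusters:
            owner = self._owner.setdefault(cluster, stream_id)
            if owner == stream_id:
                allowed.append(cluster)
        return allowed

    def stream_closed(self, stream_id):
        # Release ownership so another stream can pick these clusters up.
        self._owner = {c: s for c, s in self._owner.items() if s != stream_id}
```

With this, if streams to LRS servers X and Y both ask for cluster A, only the first stream to ask actually reports it, avoiding accidental double-reporting.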

mattklein123 previously approved these changes Sep 17, 2019.
@mattklein123 (Member) left a comment:

Thanks, this LGTM. @htuch, will defer to you for further review.

@htuch (Member) left a comment:

LGTM, some comment changes requested.

//
// Note that if multiple clusters point to the same LRS server, the client may choose to
// create a separate stream for each cluster or it may choose to coalesce the data for
// multiple clusters onto a single stream. Either way, the client must make sure to send
htuch (Member) commented on the diff:

Why is this the case? It would seem reasonable (but potentially a complication the client can opt into) to report on multiple streams for a given cluster. This basically depends on what the LRS server is asking for.

markdroth (Contributor, author) replied:

I added this to mitigate the possibility of accidentally double-reporting data for the same cluster, since we're configuring the LRS server to use on a per-cluster basis but leaving the cluster negotiation in the LRS stream. If CDS returns two clusters pointing to the same LRS server and the client implementation chooses to create a separate LRS stream for each cluster rather than coalesce them into a single stream, then even if the LRS server asks for data for both clusters on both streams, each stream needs to report the data only for the cluster it was created for.

As far as I know, we don't have any use-case where we'd want to send the same data to multiple LRS servers, and that's not functionality that we provide today with LRS configured in the bootstrap file, so I'd prefer to not complicate this by trying to address that hypothetical case. But if we do need to do this for some reason, I would prefer to have a way to allow the user to explicitly specify that intention (e.g., by making this new field a repeated field, so that multiple LRS servers can be configured) rather than relaxing the restriction I've described here, because this is still not something that we want happening by accident.

htuch (Member) replied:

Yeah, fair enough. I definitely want to see the CDS-LRS relationship cleaned up in the future.

Signed-off-by: Mark D. Roth <[email protected]>
@htuch htuch merged commit ca3056b into envoyproxy:master Sep 20, 2019
@markdroth markdroth deleted the cds_lrs_config branch September 20, 2019 21:14
danzh2010 pushed a commit to danzh2010/envoy that referenced this pull request Sep 24, 2019
This PR adds fields to CDS that allow for configuring LRS.

Risk Level: Low
Testing: None (but if anything is needed, please let me know)
Docs Changes: Inline with API protos

Signed-off-by: Mark D. Roth <[email protected]>
danzh2010 pushed the same commit to danzh2010/envoy twice more on Oct 4, 2019.
5 participants