Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

cds: Configure LRS in CDS response #7953

Merged
merged 6 commits into from
Sep 20, 2019
Merged
Show file tree
Hide file tree
Changes from 2 commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
19 changes: 18 additions & 1 deletion api/envoy/api/v2/cds.proto
Original file line number Diff line number Diff line change
Expand Up @@ -52,7 +52,7 @@ service ClusterDiscoveryService {
// [#protodoc-title: Clusters]

// Configuration for a single upstream cluster.
// [#comment:next free field: 41]
// [#comment:next free field: 42]
message Cluster {
// Supplies the name of the cluster which must be unique across all clusters.
// The cluster name is used when emitting
Expand Down Expand Up @@ -640,6 +640,23 @@ message Cluster {
// outgoing connections made by the cluster. Order matters as the filters are
// processed sequentially as connection events happen.
repeated cluster.Filter filters = 40;

// If present, tells the client where to send load reports via LRS.
//
// If multiple clusters are configured to use the same LRS server,
// then the data for all of those clusters may be sent on the same LRS
// stream.
//
// Note that if LRS is configured this way, then there is no need for
// the LRS server to send LoadStatsResponse messages on the LRS
// stream, because the data it would have sent is already configured
// here.
message LoadReportConfig {
core.ApiConfigSource lrs_server = 1;
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In discussing this with @vishalpowar, one problem here is that some management server deployments want the LRS stream to be sent to the same management server that is providing the CDS request, but they don't currently know what name or credentials their clients are using to access the management server. It would be more work for them to figure that out just to give the client information that it already has (because it used the name and credentials to access the CDS request in the first place).

Currently, ConfigSource has a special case for ADS that basically says "use the same ADS stream you're already using", but that won't work here, because LRS can't be sent over ADS. To address this, can we add a new field to the "oneof" in ConfigSource that basically says "talk to the same server you're getting this message from" but without being linked to ADS?

I think this new type of ConfigSource might also be useful in cases that are not using ADS but where LDS points to RDS on the same server or where CDS points to EDS on the same server.

@htuch and @mattklein123, WDYT?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This makes sense to me but can we track this in a new issue/proposal?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure. I've sent out #8201 for review.

google.protobuf.Duration load_reporting_interval = 2;
bool report_endpoint_granularity = 3;
}
LoadReportConfig load_report_config = 41;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, I'm largely supportive of adding this. There are a few things to sort out though:

  1. We should have comments that are similar to
    // A management server endpoint to stream load stats to via
    .
  2. Do we really want to duplicate bits of LoadStatsResponse in here? Should we just have an initial_load_stats_response message? Or should we just leave this to the bidi stream as exists today. Would be good to discuss this one further.
  3. As per the other PR today, I'm wondering if someone is signed up to implement this in Envoy? We can't really take fields in the API like this that don't have an Envoy implementation.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  1. We should have comments that are similar to
    // A management server endpoint to stream load stats to via

    .

Ah, interesting -- I didn't realize that Envoy had the LRS server configured via its bootstrap. Should we deprecate that field in the bootstrap proto as part of this? (See below for further discussion about migration strategy.)

I've added the requested comments.

  1. Do we really want to duplicate bits of LoadStatsResponse in here? Should we just have an initial_load_stats_response message? Or should we just leave this to the bidi stream as exists today. Would be good to discuss this one further.

My thinking here was that this new mechanism would completely replace the existing mechanism, so when this new field is populated in CDS, there will never be a LoadStatsResponse message sent on the LRS stream. Instead, for both initial responses and for subsequent changes, CDS will send an update of the cluster with the new values in these fields. For example, to change the load reporting interval for a given cluster, CDS will send an update that includes the new load reporting interval. Or, to stop sending load reports for a given cluster, CDS will send an update that leaves the load_report_config field unset.

However, thinking about this further, maybe we need to give more thought to the migration story between the current approach and the new approach. I see two migration cases to consider here:

  1. Envoy clients that do not yet support this new field. These will ignore the new field even if the management server sends it, and they will require the management server to send the LoadStatsResponse message on the LRS stream.

  2. Envoy clients that do support this new field but are talking to a management server that does not support the new field. In this case, the client would need to continue configuring LRS via the bootstrap file as a fallback, and the management server would continue to send the LoadStatsResponse message on the LRS stream.

One possible way to handle this is to say that clients that start LRS based on the new CDS field will ignore any LoadStatsResponse sent on the LRS stream and document that management servers must continue sending LoadStatsResponse on the LRS stream until they are certain that there are no more old clients.

Another way to do this would be to add a new boolean field to the LRS request indicating whether the client was configured to use LRS via the new CDS field, in which case the LRS server will know not to send any responses on the LRS stream. This is more explicit, but it does leave this new boolean field lying around even after the migration is complete, which seems sub-optimal.

Do we need to worry about federation cases where some clusters come from a management server that provides the new CDS field but others come from a management server that expects LRS to be configured via the bootstrap? Or is that not a thing in the v2 APIs?

  1. As per the other PR today, I'm wondering if someone is signed up to implement this in Envoy? We can't really take fields in the API like this that don't have an Envoy implementation.

I was not expecting to have to implement this in Envoy. Envoy and gRPC are both consumers of this API, but if we agree jointly on an API change, I was assuming that it would be up to each of us to update our implementations to support the change. If Envoy needs an API change in the future, I would not expect you folks to implement the change in gRPC. And this approach certainly will not scale once there are other clients of this API.

We should probably discuss this further offline. I do agree that this should be implemented in Envoy, and I don't want us to be in a situation where one client of this API is incentivized to say no to a change because they don't want to have to do any work to implement the change, nor do I want us to be in a situation where the ecosystem gets fragmented because new API additions get supported by only a subset of clients. But I also don't want to be on the hook for updating Envoy whenever gRPC needs an API change. So let's talk about this further to see if we can come up with a more sustainable approach.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think that we have two situations:

  1. For v2 xDS, we should add the LRS field, to allow distinct config sources and servers per Cluster. This is directionally where we want to go. We can even specify an initial LoadStatsResponse here, that will be used by the client in the absence of any server message. We can deprecate in documentation etc. the streamed aspect here.
  2. For v3 (which we can make happen at EOQ roughly), we can make the more radical changes.

How does that sound?

I agree we need to better coordinate on the implementation front. I don't think it should be expected that gRPC folks implement any API change on the Envoy side (although we wouldn't complain!), but simultaneously, we need to schedule and budget engineering bandwidth in OSS or at Google to make the Envoy migration happen when we introduce deprecations like this; we can chat offline further about this.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm okay in principle with the idea of leaving the LRS protocol as-is for v2 and making further changes for v3. However, I want to make sure that the semantics we provide actually make sense.

If we leave the LRS protocol itself as-is but still allow configuring the LRS server in the CDS response, it seems like there could be some confusion in the protocol. For example, let's say CDS for cluster A points to LRS server 1 and CDS for cluster B points to LRS server 2. But when the client contacts each of the LRS servers, the LRS server will indicate to the client which clusters it wants data for. So what happens if LRS server 1 says that it wants data for clusters A and B? Is that okay, given that the CDS response for B indicated to use a different LRS server? And what happens if both LRS servers request data for the same cluster? Wouldn't that result in double-reporting the load?

This potential confusion is why I was trying to change LRS such that if the LRS server is configured via CDS, the set of clusters to send data for is no longer controlled by LRS itself. Otherwise, we have two potentially conflicting sources of information.

To be clear, I don't actually have a use-case where I want to configure a different LRS server for different clusters. What I'm really after here is a way to get the LRS server via xDS, so that I don't have to configure it on the client (e.g., via a bootstrap file). The CDS response seems like the appropriate place to put it in xDS, which means that it would inherently be able to be set on a per-cluster basis, even though that's not really what I'm after.

If you have an alternative suggestion, I'd be open to hearing it.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To be clear, I don't actually have a use-case where I want to configure a different LRS server for different clusters.
What I'm really after here is a way to get the LRS server via xDS, so that I don't have to configure it on the
client (e.g., via a bootstrap file). The CDS response seems like the appropriate place to put it in xDS, which means
that it would inherently be able to be set on a per-cluster basis, even though that's not really what I'm after.

In an earlier comment you mentioned using a flag in LoadStatsRequest indicating that it was configured via CDS. This along with making the client ignore LoadStatsResponse from Management Server (if the Management Server has not yet implemented the flag) should cover the migration and compatibility story between the clients and server.
Including the example of two clusters that you have above.

Or, am I missing something,

}

// An extensible structure containing the address Envoy should bind to when
Expand Down
1 change: 1 addition & 0 deletions tools/spelling_dictionary.txt
Original file line number Diff line number Diff line change
Expand Up @@ -149,6 +149,7 @@ LDS
LEV
LHS
LF
LRS
MB
MD
MGET
Expand Down