upstream: subsetting load balancer [rough draft] #1735
Conversation
Thanks @zuercher, this should be rad. Can you provide some worked examples of how the above precomputed subsets and match algorithms work? E.g. imagine we have 5 hosts and, say, 2 different match criteria: what would it look like for a few different requests? I think this would help me grok the intuition in the above algorithm.
```cpp
    return ProtobufUtil::MessageDifferencer::Equals(v1.struct_value(), v2.struct_value());
  case ProtobufWkt::Value::kListValue:
    return ProtobufUtil::MessageDifferencer::Equals(v1.list_value(), v2.list_value());
```
Why can't we just use `MessageDifferencer::Equals` on `v1` and `v2`?
@mattklein123 voiced concern about using protobuf reflection (which is what the MessageDifferencer seems to use, in my casual inspection of its implementation). The equality check is used in the load balancer path (albeit only after a hash comparison).
I'm happy to be swayed here, but IMO we should try to make this path as fast as possible and we should avoid reflection if possible in the fast path. I'm not super familiar with this stuff. This function basically does not use reflection unless it involves an embedded struct or list?
Yes, this implementation only uses reflection for embedded lists and structs. That said, I believe I can write a version that doesn't use reflection at all, since the values in the list/struct are just ProtobufWkt::Value instances.
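To make the reflection-free approach concrete, here's a hypothetical sketch (not the PR's actual code) of a recursive equality check over a simplified stand-in for `ProtobufWkt::Value`, assuming a variant-based `Value` type. Structs and lists just hold more `Value`s, so plain recursion replaces `MessageDifferencer` and reflection entirely:

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <utility>
#include <variant>
#include <vector>

// Simplified stand-in for ProtobufWkt::Value (illustrative only).
struct Value;
using List = std::vector<Value>;
using Field = std::pair<std::string, Value>;
using Struct = std::vector<Field>; // ordered (key, value) fields for simplicity

struct Value {
  std::variant<std::nullptr_t, double, std::string, bool, Struct, List> kind;
};

// Returns true when two values have the same type and contents. Recursion
// covers nested structs/lists; scalars compare directly, no reflection needed.
bool valuesEqual(const Value& a, const Value& b) {
  if (a.kind.index() != b.kind.index()) {
    return false;
  }
  if (const auto* s = std::get_if<Struct>(&a.kind)) {
    const auto& t = std::get<Struct>(b.kind);
    if (s->size() != t.size()) return false;
    for (std::size_t i = 0; i < s->size(); ++i) {
      if ((*s)[i].first != t[i].first || !valuesEqual((*s)[i].second, t[i].second)) {
        return false;
      }
    }
    return true;
  }
  if (const auto* l = std::get_if<List>(&a.kind)) {
    const auto& m = std::get<List>(b.kind);
    if (l->size() != m.size()) return false;
    for (std::size_t i = 0; i < l->size(); ++i) {
      if (!valuesEqual((*l)[i], m[i])) return false;
    }
    return true;
  }
  if (const auto* d = std::get_if<double>(&a.kind)) return *d == std::get<double>(b.kind);
  if (const auto* str = std::get_if<std::string>(&a.kind)) return *str == std::get<std::string>(b.kind);
  if (const auto* bv = std::get_if<bool>(&a.kind)) return *bv == std::get<bool>(b.kind);
  return true; // both null
}
```

The mismatched-index check at the top also makes the per-kind comparisons safe, since both sides are known to hold the same alternative.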
@htuch, not sure if this is exactly what you were looking for, but here goes. Given a cluster configuration with subset selectors, and a corresponding set of endpoints:
The load balancer will create a default subset containing e1 (the only host with metadata matching stage="prod", version="1.0"). It will also create the following subsets:
Given routes with no metadata_match information, all requests will be routed to the default subset (e1). Given a virtual host containing these `envoy::api::v2::Route` entries:
Requests matching the first route entry are routed using that route's metadata_match criteria. In the other route, the metadata from the weighted clusters and the route are combined, so endpoints must match the merged criteria.
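As a sketch of the precomputation in this example (illustrative names only; metadata is flattened to plain strings here, whereas Envoy stores `ProtobufWkt::Value`), endpoints could be binned into one subset per distinct value combination for each subset key list:

```cpp
#include <cassert>
#include <map>
#include <string>
#include <vector>

// Hypothetical sketch of subset precomputation (not the PR's actual types).
using Metadata = std::map<std::string, std::string>;

struct Host {
  std::string name;
  Metadata metadata;
};

// Maps the (key, value) pairs a subset was built from to the hosts in it.
using SubsetMap = std::map<Metadata, std::vector<std::string>>;

// Bin hosts into one subset per distinct value combination for each selector,
// where a selector is a list of metadata keys, e.g. {"stage", "version"}.
SubsetMap buildSubsets(const std::vector<Host>& hosts,
                       const std::vector<std::vector<std::string>>& selectors) {
  SubsetMap subsets;
  for (const auto& selector : selectors) {
    for (const auto& host : hosts) {
      Metadata kvs;
      bool all_present = true;
      for (const auto& key : selector) {
        auto it = host.metadata.find(key);
        if (it == host.metadata.end()) {
          all_present = false; // host lacks a key: it joins no subset for this selector
          break;
        }
        kvs.emplace(key, it->second);
      }
      if (all_present) {
        subsets[kvs].push_back(host.name);
      }
    }
  }
  return subsets;
}
```

With selectors `{version}` and `{stage, version}`, this yields one subset per distinct version plus one per (stage, version) pair, with each host appearing at most once per selector.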
@zuercher I mostly skimmed but in general I like this approach. It all fits together very nicely. Well done!
I hate asking this because it sounds like an interview question, but can you briefly describe the runtime of the metadata matching at LB time (I don't care that much about host update time)? Also whether/when it involves reflection? I'm sure I can figure it out by looking through in detail, but it might be better if you just call it out. The example you provided @htuch is great.
```cpp
switch (cluster->lbType()) {
case LoadBalancerType::LeastRequest: {
  lb_.reset(new LeastRequestLoadBalancer(host_set_, parent.local_host_set_, cluster->stats(),
...
if (cluster->lbSubsetInfo().isEnabled() && cluster->lbType() != LoadBalancerType::OriginalDst) {
```
small nit: we should just fail config validation / document that subset and original_dst can't be used together.
Yes, this example is great. Can you add this to a Markdown file or big block comment somewhere? Also, can you provide examples of requests that match only some of the keys?

I think worst case space complexity for precomputed subsets is O(E*N), where E is the number of endpoints and N is the number of subset key lists: we're binning endpoints into distinct subsets for each subset key list. Is this correct? Is the worst case time complexity the same, i.e. would we have to iterate over all subsets?
Nice example! One question: if you recall our earlier discussion on the shard router use case, I have one key mapping to 100K possible values, distributed over a few hundred nodes. Based on the request, the route block would select one value for version. I don't think this use case can be satisfied by the implementation above; just thought I would confirm. I could flatten this into a simple map from version value to endpoints, though.
The runtime at load balancing time, as written (see below), is linear in the number of subsets in the worst case. Each of the n iterations in the search incurs one comparison of a pair of ProtobufWkt::Value instances (short-circuited by comparing their hashes first).

Above, I say "as written" because I don't think it searches quite hard enough when the keys for subsets overlap. For instance, given subsets over [a, b] and [b, c], if a route says [a, b, c] it will only find [a, b] right now, but should probably find both and choose one based on definition order. I can think of a small change that will fix this, making the worst case O(n^2).

I need to get home, so @rshriram I'll try to address your question tonight or first thing tomorrow.
@rshriram I misunderstood your use case when we discussed it before. It should, however, be possible to change the config definition to allow subset keys to call out prefixes in addition to full keys. You could then flatten your endpoint metadata and just specify "version:" for subset keys, yes? Is that worth it for your use case?
I would prefer that we be very careful about anything that is n^2 in this path, especially anything that uses user-supplied config. IMO the way you have it written now is flexible enough w/ good performance. IMO we could make more powerful matching opt-in later? Anyway, I'm fine to proceed with this PR for full tests, etc. I think this will get kind of big, so if there is any way to break this up into smaller individual PRs w/ tests, that would be cool.
What if, instead of that worst case, we'd iterate over the subset keys, checking to see if the metadata matches contained a value for each key in the subset (and simultaneously doing the constant-time hashtable lookups that lead us to the actual subset lb)? The first entry in the subset keys that is fully matched and has a non-empty subset wins. I think this produces the "correct" behavior for overlapping keys.

Notably, this would actually be far worse for @rshriram's use case since it would be a linear search over the subset keys (of which he'd have many).
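A hypothetical sketch of the search described here, with illustrative types (criteria and subset keys flattened to string maps): scan the configured subset key lists in definition order, project the route's criteria onto each list's keys, and take the first fully matched, non-empty subset:

```cpp
#include <cassert>
#include <map>
#include <string>
#include <vector>

// Illustrative stand-ins, not the PR's actual types.
using Metadata = std::map<std::string, std::string>;
using SubsetMap = std::map<Metadata, std::vector<std::string>>;

// Scan selectors in definition order; project the route's criteria onto each
// selector's keys and return the first fully matched, non-empty subset.
std::vector<std::string>
firstMatchingSubset(const SubsetMap& subsets,
                    const std::vector<std::vector<std::string>>& selectors,
                    const Metadata& criteria) {
  for (const auto& selector : selectors) {
    Metadata projected;
    bool all_present = true;
    for (const auto& key : selector) {
      auto it = criteria.find(key);
      if (it == criteria.end()) {
        all_present = false; // criteria lack a key this selector requires
        break;
      }
      projected.emplace(key, it->second);
    }
    if (!all_present) {
      continue;
    }
    auto sub = subsets.find(projected);
    if (sub != subsets.end() && !sub->second.empty()) {
      return sub->second; // first match in definition order wins
    }
  }
  return {}; // no selector fully matched
}
```

With subsets over [a, b] and [b, c], criteria naming all of a, b, and c resolve to the [a, b] subset because it comes first in definition order, which is the tie-breaking behavior proposed above.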
@zuercher I'm still trying to grok this. In your example, why is that subset there?
It's there for the backoff behavior you mentioned. Having said that, I spent time thinking about it last night and I think it's fine for the first version to attempt to match the metadata from the route exactly (with no backoff). If you want to reach the subset based on version alone, you'd need a route that only specifies version=v. If the route calls for stage=x and version=y and there's no subset for that combination, you get the behavior defined by the fallback policy (default subset, any host, or no host). This lets us use the simple exact-match lookup.

Then, in the future, if we wanted more complex subset selection behavior, we could add a configuration flag for enabling it and use that to enable a different search algorithm.
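The exact-match-plus-fallback behavior described here might look like this sketch (types and names are illustrative, not the PR's):

```cpp
#include <cassert>
#include <map>
#include <string>
#include <vector>

// Illustrative stand-ins; criteria are the route's merged metadata_match values.
using Metadata = std::map<std::string, std::string>;
using SubsetMap = std::map<Metadata, std::vector<std::string>>;

enum class FallbackPolicy { NoEndpoint, AnyEndpoint, DefaultSubset };

// Exact-match selection with no backoff: either the criteria name a
// precomputed subset exactly, or the configured fallback policy decides.
std::vector<std::string> hostsForCriteria(const SubsetMap& subsets, const Metadata& criteria,
                                          FallbackPolicy policy,
                                          const std::vector<std::string>& all_hosts,
                                          const std::vector<std::string>& default_subset) {
  auto it = subsets.find(criteria); // single map lookup, no search over key combinations
  if (it != subsets.end() && !it->second.empty()) {
    return it->second;
  }
  switch (policy) {
  case FallbackPolicy::AnyEndpoint:
    return all_hosts;
  case FallbackPolicy::DefaultSubset:
    return default_subset;
  default:
    return {}; // NoEndpoint: the request gets no host
  }
}
```

The design choice is that a miss never triggers a search over other key combinations; the cost of a miss is the same single lookup, and behavior is fully determined by the fallback policy.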
@zuercher +1 on that. I think it's a lot simpler to understand (and faster). We can always add more complex stuff later.
Cool. I'll close this PR and see about cleaning this up and splitting it into smaller chunks.
@zuercher I don't fully comprehend the prefix thing you said. Would you mind explaining with an example? Also, why would the runtime be O(n^2) or even O(n)? Can't we materialize the LB pools a priori, one for each subset? If we impose the restriction that the route's keys must match exactly with the subset, and perhaps even assign unique hash IDs to each subset as well as convert the key combo described in the weighted_cluster to the respective subset hash ID, then it's just a matter of hashing the key set and getting a pointer to the appropriate pre-computed LB pool, right? [in the fast path I mean]
For reference, here was an example I posted in envoy slack in response to the question about prefixes: https://gist.github.com/zuercher/65ede6d0296ff2a9e4a310262ebf8b05
Updates the comment to reflect the design decision made in envoyproxy/envoy#1735 and renames subset_keys to subset_selectors. My reasoning for the rename is that the field is all about determining how endpoints are grouped into subsets. Using the names of metadata keys in the endpoints is only one way and others might be added. Signed-off-by: Stephan Zuercher <[email protected]>
First review split off of #1735. This provides the interface used to retrieve the RouteAction metadata_match fields, but leaves them unimplemented for now. It also provides protobuf utilities that are used to implement using values from metadata_match in STL maps. Signed-off-by: Stephan Zuercher <[email protected]>
Second PR split out of #1735. This one adds the LbSubsetConfig data from the CDS Cluster message to the existing ClusterInfo class. Signed-off-by: Stephan Zuercher <[email protected]>
…1844) 3rd of 4 reviews split out from #1735. Router::MetadataMatchCriteria is implemented. Instances of the implementation are created when a route's RouteAction has an "envoy.lb" section in metadata_match values. These are merged with additional "envoy.lb" metadata_match values from the WeightedCluster, if any. In the event of a collision, the WeightedCluster keys take precedence. The LoadBalancerContext interface is modified to allow MetadataMatchCriteria to be accessed by load balancers. Router::Filter's implementation of the LoadBalancerContext is updated to return the selected route's criteria, if any. Signed-off-by: Stephan Zuercher <[email protected]>
Fixes #1279, and is the final part of splitting #1735. Provides the actual load balancer for subsets with basic stats for the number of subsets active, created, removed, selected (used for host selection), and fallbacks (times the fallback path was used). Signed-off-by: Stephan Zuercher <[email protected]>
As requested in envoyproxy/data-plane-api#173, here's a rough draft of the implementation of the subsetting load balancer.
The load balancer's chooseHost is implemented as follows:
Metadata values are stored as a ProtobufWkt::Value and corresponding hash value. I reused the existing protobuf hash utility function and it seems expensive, so I took care not to compute any hash in the load balancing path.
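A minimal sketch of the hash-then-compare idea, using `std::string` as a stand-in for `ProtobufWkt::Value` and `std::hash` in place of the protobuf hash utility: the hash is computed once at construction (i.e. at config/update time), so the load-balancing path usually resolves equality with a single integer comparison:

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <string>
#include <utility>

// Illustrative stand-in for a metadata value stored alongside its hash.
struct HashedValue {
  std::string value;   // stand-in for ProtobufWkt::Value
  std::size_t hash;    // computed once, up front

  explicit HashedValue(std::string v)
      : value(std::move(v)), hash(std::hash<std::string>{}(value)) {}

  bool operator==(const HashedValue& other) const {
    // Cheap hash check first; the (potentially expensive) full comparison
    // only runs when the hashes agree.
    return hash == other.hash && value == other.value;
  }
};
```

Unequal hashes prove inequality immediately; equal hashes still require the full comparison because distinct values can collide.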
The load balancer produces subsets by watching the original HostSet(s) for updates. Each subset has its own pair of HostSetImpl objects which represent filtered copies of the `host_set` and `local_host_set` originally passed to the load balancer.

Presuming the other PR is approved, this PR has a bunch of known deficiencies that I will address: