
upstream: subsetting load balancer [rough draft] #1735

Closed

Conversation

zuercher (Member)

As requested in envoyproxy/data-plane-api#173, here's a rough draft of the implementation of the subsetting load balancer.

The load balancer's chooseHost is implemented as follows (a rough sketch in code follows the list):

  1. If the LoadBalancerContext contains no metadata to match, skip to 6.
  2. Otherwise, for each entry in the metadata, do a pair of unordered map lookups for the name and value.
  3. If both lookups succeed, we see if there is a load balancer defined. If so, we have a match and we remember it if its creation order is smaller than that of any previous match. Match or not, we continue the metadata iteration.
  4. If either lookup fails or we reach the end of the metadata, there are no further matching subsets, so return the best found, if any.
  5. If a subset is found, we invoke its load balancer with the LoadBalancerContext and return.
  6. Otherwise, we implement the fallback policy.
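For illustration only, here is a minimal sketch of the lookup described above. It uses plain string keys rather than the PR's hashed ProtobufWkt::Value keys, and SubsetEntry, LbSubsetMap, and findSubset are hypothetical names, not the PR's actual types:

```cpp
#include <cstdint>
#include <memory>
#include <string>
#include <unordered_map>
#include <utility>
#include <vector>

struct SubsetEntry;
// name -> (value -> subset entry); nested entries handle subsequent metadata keys.
using LbValueMap = std::unordered_map<std::string, std::unique_ptr<SubsetEntry>>;
using LbSubsetMap = std::unordered_map<std::string, LbValueMap>;

struct SubsetEntry {
  uint32_t creation_order{0};
  bool has_lb{false};   // stands in for "this subset has a load balancer defined"
  LbSubsetMap children; // matches on later (sorted) metadata keys
};

// Sorted (key, value) pairs taken from the LoadBalancerContext's metadata matches.
using MetadataCriteria = std::vector<std::pair<std::string, std::string>>;

const SubsetEntry* findSubset(const LbSubsetMap& root, const MetadataCriteria& criteria) {
  const LbSubsetMap* level = &root;
  const SubsetEntry* best = nullptr;
  for (const auto& kv : criteria) {                        // step 2
    auto name_it = level->find(kv.first);
    if (name_it == level->end()) {
      break;                                               // step 4: name lookup failed
    }
    auto value_it = name_it->second.find(kv.second);
    if (value_it == name_it->second.end()) {
      break;                                               // step 4: value lookup failed
    }
    const SubsetEntry& entry = *value_it->second;
    // Step 3: remember the earliest-defined matching subset; match or not, keep iterating.
    if (entry.has_lb && (best == nullptr || entry.creation_order < best->creation_order)) {
      best = &entry;
    }
    level = &entry.children;
  }
  // Steps 1/5/6: with no criteria or no match the caller applies the fallback policy;
  // otherwise it delegates to the matched subset's load balancer.
  return best;
}
```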

Metadata values are stored as a ProtobufWkt::Value along with a corresponding hash value. I reused the existing protobuf hash utility function; it seems expensive, so I took care not to compute any hashes in the load-balancing path.

The load balancer produces subsets by watching the original HostSet(s) for updates. Each subset has its own pair of HostSetImpl objects which represent filtered copies of the host_set and local_host_set originally passed to the load balancer.
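As a rough illustration of the filtering (not the PR's HostSetImpl-based code; Host, Subset, and refreshSubset are invented stand-ins), a subset's host list could be rebuilt from the parent host set whenever the parent's membership changes:

```cpp
#include <memory>
#include <string>
#include <unordered_map>
#include <vector>

struct Host {
  std::string address;
  std::unordered_map<std::string, std::string> lb_metadata; // "envoy.lb" entries
};
using HostPtr = std::shared_ptr<Host>;

struct Subset {
  // The key/value pairs that define this subset, e.g. {"stage": "prod", "version": "1.0"}.
  std::unordered_map<std::string, std::string> match;
  std::vector<HostPtr> hosts; // filtered copy of the parent host set
};

bool hostMatchesSubset(const Host& host, const Subset& subset) {
  for (const auto& [key, value] : subset.match) {
    auto it = host.lb_metadata.find(key);
    if (it == host.lb_metadata.end() || it->second != value) {
      return false;
    }
  }
  return true;
}

// Called from the parent host set's member-update callback.
void refreshSubset(Subset& subset, const std::vector<HostPtr>& all_hosts) {
  subset.hosts.clear();
  for (const HostPtr& host : all_hosts) {
    if (hostMatchesSubset(*host, subset)) {
      subset.hosts.push_back(host);
    }
  }
}
```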

Presuming the other PR is approved, there are a number of known deficiencies in this PR that I will address:

  1. It needs tests and additional comments.
  2. ValueUtil::equal (for ProtobufWkt::Value comparison) should operate recursively on list and struct values.
  3. MetadataMatches is in the Router namespace -- it should probably find another home to simplify dependencies.
  4. I think there's an opportunity to have MetadataMatches return a set instead of a vector -- the load balancer depends on the matches being sorted and using a set with a custom Compare operator should make it possible to put that requirement into the code rather than just documenting it.
  5. Similarly, the LoadBalancer depends on using the same hash algorithm as Upstream::MetadataMatchImpl, so it probably makes sense to introduce a small class that wraps a ProtobufWkt::Value and its hash and to use it in both places (a rough sketch of such a wrapper follows the list).
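A minimal sketch of such a wrapper, assuming the hash is computed once up front by an existing utility (something like Envoy's MessageUtil::hash) and using MessageDifferencer merely as a placeholder for the full value comparison; HashedValue and valueEquals are hypothetical names:

```cpp
#include <cstddef>
#include <cstdint>
#include <functional>

#include <google/protobuf/struct.pb.h>
#include <google/protobuf/util/message_differencer.h>

namespace ProtobufWkt = google::protobuf;

// Placeholder full comparison; a reflection-free version is discussed later in this thread.
bool valueEquals(const ProtobufWkt::Value& a, const ProtobufWkt::Value& b) {
  return google::protobuf::util::MessageDifferencer::Equals(a, b);
}

class HashedValue {
public:
  // The hash is precomputed at config/update time so the LB path never hashes.
  HashedValue(const ProtobufWkt::Value& value, uint64_t hash) : value_(value), hash_(hash) {}

  uint64_t hash() const { return hash_; }
  const ProtobufWkt::Value& value() const { return value_; }

  bool operator==(const HashedValue& rhs) const {
    // Cheap hash comparison first; only fall back to a full comparison when the hashes match.
    return hash_ == rhs.hash_ && valueEquals(value_, rhs.value_);
  }

private:
  ProtobufWkt::Value value_;
  uint64_t hash_;
};

// Allows HashedValue to be used as an unordered_map/set key without re-hashing.
namespace std {
template <> struct hash<HashedValue> {
  size_t operator()(const HashedValue& v) const { return static_cast<size_t>(v.hash()); }
};
} // namespace std
```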

@zuercher zuercher force-pushed the subsetting-load-balancer branch from 631d6d0 to e51658d on September 25, 2017 21:36
@zuercher zuercher force-pushed the subsetting-load-balancer branch from b69f3e2 to 1f6a078 on September 25, 2017 22:01
@htuch (Member) commented Sep 25, 2017

Thanks @zuercher, this should be rad. Can you provide some worked examples of how the above precomputed subsets and match algorithms work? E.g. imagine we have 5 hosts and, let's say, 2 different match criteria; what would it look like for a few different requests? I think this would help me grok the intuition in the above algorithm.

return ProtobufUtil::MessageDifferencer::Equals(v1.struct_value(), v2.struct_value());

case ProtobufWkt::Value::kListValue:
return ProtobufUtil::MessageDifferencer::Equals(v1.list_value(), v2.list_value());
Member:

Why can't we just use MessageDifferencer::Equals on v1 and v2?

Member Author:

@mattklein123 voiced concern about using protobuf reflection (which is what the MessageDifferencer seems to use, in my casual inspection of its implementation). The equality check is used in the load balancer path (albeit only after a hash comparison).

Member:

I'm happy to be swayed here, but IMO we should try to make this path as fast as possible and we should avoid reflection if possible in the fast path. I'm not super familiar with this stuff. This function basically does not use reflection unless it involves an embedded struct or list?

Member Author:

Yes, this implementation only uses reflection for embedded lists and structs. That said, I believe I can write a version that doesn't use reflection at all, since the values in the list/struct are just ProtobufWkt::Value instances.
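For example, a reflection-free comparison could recurse directly on the contained Values. This is a sketch only (hypothetical valueEquals), not the code in this PR:

```cpp
#include <google/protobuf/struct.pb.h>

namespace ProtobufWkt = google::protobuf;

// Deep equality for ProtobufWkt::Value without reflection: lists and structs are
// compared by recursing on their contained Values.
bool valueEquals(const ProtobufWkt::Value& v1, const ProtobufWkt::Value& v2) {
  if (v1.kind_case() != v2.kind_case()) {
    return false;
  }
  switch (v1.kind_case()) {
  case ProtobufWkt::Value::kNullValue:
    return true;
  case ProtobufWkt::Value::kNumberValue:
    return v1.number_value() == v2.number_value();
  case ProtobufWkt::Value::kStringValue:
    return v1.string_value() == v2.string_value();
  case ProtobufWkt::Value::kBoolValue:
    return v1.bool_value() == v2.bool_value();
  case ProtobufWkt::Value::kStructValue: {
    const auto& fields1 = v1.struct_value().fields();
    const auto& fields2 = v2.struct_value().fields();
    if (fields1.size() != fields2.size()) {
      return false;
    }
    for (const auto& entry : fields1) {
      auto it = fields2.find(entry.first);
      if (it == fields2.end() || !valueEquals(entry.second, it->second)) {
        return false;
      }
    }
    return true;
  }
  case ProtobufWkt::Value::kListValue: {
    const auto& list1 = v1.list_value();
    const auto& list2 = v2.list_value();
    if (list1.values_size() != list2.values_size()) {
      return false;
    }
    for (int i = 0; i < list1.values_size(); ++i) {
      if (!valueEquals(list1.values(i), list2.values(i))) {
        return false;
      }
    }
    return true;
  }
  case ProtobufWkt::Value::KIND_NOT_SET:
    return true; // both unset
  }
  return false;
}
```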

@zuercher (Member Author)

@htuch, not sure if this is exactly what you were looking for, but here goes.

Given an envoy::api::v2::Cluster with

{
      "name": "c1",
      "lb_policy": "ROUND_ROBIN",    
      "lb_subset_config": {
        "fallback_policy": "DEFAULT_SUBSET",
        "default_subset": {
          "version": "1.0",
          "stage": "prod"
        },
        "subset_keys": [
          { "keys": [ "version", "stage" ]},
          { "keys": [ "version" ]}
        ]
      }
}

And a corresponding envoy::api::v2::ClusterLoadAssignment with

"endpoints": {
  "lb_endpoints": [
    {
      "endpoint": { /* omitted for brevity, call it e1 */ },
      "metadata": {
        "filter_metadata": {
          "envoy.lb": {
            "version": "1.0",
            "stage": "prod"
          }
        }
      }
    },
    {
      "endpoint": { /* omitted, call it e2 */ },
      "metadata": {
        "filter_metadata": {
          "envoy.lb": {
            "version": "1.1",
            "stage": "prod"
          }
        }
      }
    },
    {
      "endpoint": { /* omitted, call it e3 */ },
      "metadata": {
        "filter_metadata": {
          "envoy.lb": {
            "version": "1.2-pre",
            "stage": "dev"
          }
        }
      }
    }
  ]
}

The load balancer will create a default subset containing e1 (the only host with metadata matching stage="prod", version="1.0"). It will also create the following subsets:

| Metadata | Host |
| --- | --- |
| stage="prod", version="1.0" | e1 |
| stage="prod", version="1.1" | e2 |
| stage="dev", version="1.2-pre" | e3 |
| version="1.0" | e1 |
| version="1.1" | e2 |
| version="1.2-pre" | e3 |

Given routes with no metadata_match information, all requests will be routed to the default subset (e1), per the DEFAULT_SUBSET fallback policy.

Given a virtual host containing these `envoy::api::v2::Route` entries:

"routes": [
  {
    "match": {
      "prefix": "/",
      "headers": [
        {
          "name": "x-custom-version",
          "value": "1.2-pre"
        }
      ]
    },
    "route": {
      "cluster": "c1",
      "metadata_match": {
        "filter_metadata": {
          "envoy.lb": {
            "version": "1.2-pre",
            "stage": "dev"
          }
        }
      }
    }
  },
  {
    "match": {
      "prefix": "/"
    },
    "route": {
      "weighted_clusters": {
"clusters": [
          {
            "name": "c1",
            "weight": 90,
            "metadata_match": {
              "filter_metadata": {
                "envoy.lb": {
                  "version": "1.0"
                }
              }
            }
          },
          {
            "name": "c1",
            "weight": 10,
            "metadata_match": {
              "filter_metadata": {
                "envoy.lb": {
                  "version": "1.1"
                }
              }
            }
          }
        ]
      },
      "metadata_match": {
        "filter_metadata": {
          "envoy.lb": {
            "stage": "prod"
          }
        }
      }
    }
  }
]

Requests that match the first route entry (with an x-custom-version header) will produce a LoadBalancerContext whose metadata matches are stage=dev, version=1.2-pre. The endpoint e3 is matched (and if there were multiple hosts with that metadata, round-robin balancing would occur over them). If the cluster is updated and e3 is removed, the current implementation will return no host (triggering a 5xx), but that's a bug: it should detect this condition and trigger the fallback policy.

In the other route, the metadata from the weighted clusters and the route is combined so that endpoints with stage=prod, version=1.0 are selected 90% of the time, while endpoints with stage=prod, version=1.1 are selected 10% of the time. Again, given multiple endpoints with the same metadata, round-robin load balancing would take place.

@mattklein123 (Member) left a comment

@zuercher I mostly skimmed but in general I like this approach. It all fits together very nicely. Well done!

I hate asking this because it sounds like an interview question, but can you briefly describe the runtime of the metadata matching at LB time (I don't care that much about host update time)? Also, whether/when does it involve reflection? I'm sure I can figure it out by looking through in detail, but it might be better if you just call it out. The example you provided for @htuch is great.

switch (cluster->lbType()) {
case LoadBalancerType::LeastRequest: {
lb_.reset(new LeastRequestLoadBalancer(host_set_, parent.local_host_set_, cluster->stats(),
if (cluster->lbSubsetInfo().isEnabled() && cluster->lbType() != LoadBalancerType::OriginalDst) {
Member:

small nit: we should just fail config / doc that subset and original_dst can't be used together.

@htuch (Member) commented Sep 26, 2017

Yes, this example is great. Can you add this to a Markdown file or big block comment somewhere? Also, can you provide examples of requests that just match version, or where the two subset keys are reversed in order? And also in the big comment, some examples of subsets > |1| in size.

I think worst case space complexity for precomputed subsets is O(E*N) where E is the number of endpoints and N is the number of subset key lists. We're binning endpoints into distinct subsets for each subset key list. Is this correct?

Is the worst case time complexity the same? I.e. we would have to iterate over all subsets?

@rshriram (Member)

Nice example! One question: if you recall our earlier discussion on the shard router use case, I have one key mapping to 100K possible values, and these are distributed over a few hundred nodes.
E.g.,
e1 could have version: 1, version: 100, version: 1000
e2 could have version: 2, version: 200, version: 2000
e3 could have version: 1000, version: 2000

Based on the request, the route block would select one value for version (e.g., version: 2000). This would have to be routed to e2/e3.

I don't think this use case can be satisfied by the implementation above. Just thought I would confirm. I could flatten this into a simple map of the form version1: true, version2000: true, so that I have unique key-value pairs. However, I would then have to implement both CDS and EDS (instead of just EDS) to update the subset keys and the endpoint metadata. And in this case, I am virtually creating 100K clusters, which is fine by me as well.

@zuercher (Member Author)

The runtime at load balancing time, as written (see below), is O(n), where n is the number of metadata keys from the LoadBalancerContext, provided that the keys used in the unordered_maps have well-distributed hash values. At that point we have the subset's load balancer and delegate to it, incurring whatever its runtime is for the number of hosts in the subset.

Each of the n iterations in the search incurs one comparison of a pair of ProtobufWkt::Value instances (short-circuited by comparing their hashes first).

Above, I say "as written" because I don't think it searches quite hard enough when the keys for subsets overlap. For instance, given subsets over [a, b] and [b, c], if a route says [a, b, c] it will only find [a, b] right now, but should probably find both and choose one based on definition order. I can think of a small change that will fix this, making the worst case n(n + 1)/2 iterations, so O(n^2) if I understand the syntax.

I need to get home, so @rshriram I'll try to address your question tonight or first thing tomorrow.

@zuercher (Member Author)

@rshriram I misunderstood your use case when we discussed it before. It should, however, be possible to change the config definition to allow subset keys to call out prefixes in addition to full keys. You could then flatten your endpoint metadata and just specify "version:" for subset keys, yes? Is that worth it for your use case?

@mattklein123 (Member)

Above, I say "as written" because I don't think it searches quite hard enough when the keys for subsets overlap. For instance, given subsets over [a, b] and [b, c], if a route says [a, b, c] it will only find [a, b] right now, but should probably find both and choose one based on definition order. I can think of a small change that will fix this, making the worst case n(n + 1)/2 iterations, so O(n^2) if I understand the syntax.

I would prefer that we be very careful about anything that is n^2 in this path, especially anything that uses user-supplied config. IMO the way you have it written now is flexible enough w/ good performance. IMO we could make more powerful matching opt-in later?

Anyway, I'm fine to proceed with this PR for full tests, etc. I think this will get kind of big so if there is any way to break this up into smaller individual PRs w/ tests that would be cool.

@zuercher (Member Author)

What if instead of worst case O(n^2) with n = number of metadata matches in the route, it were worst-case O(j*k) with j = the number of entries in the lb_subset_config.subset_keys (which is <= the number of subsets) and k = the average number of keys used for each subset?

We'd iterate over the subset keys, checking to see if the metadata matches contain a value for each key in the subset (and simultaneously doing the constant-time hashtable lookups that lead us to the actual subset lb). The first entry in the subset keys that is fully matched and has a non-empty subset wins. I think this produces the "correct" behavior that was O(n^2) w.r.t. metadata matches. (This would require the metadata matches returned from the LoadBalancerContext to be an unordered_map. With no metadata matches it short-circuits to the fallback lb in O(1). The subset keys no longer need to be sorted, and you can place the most restrictive keys first in a subset to improve performance.)
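A rough sketch of that search, with invented names (SubsetSelector, findSubsetBySelectors) and a flattened string key standing in for the real nested-map lookup over hashed values:

```cpp
#include <string>
#include <unordered_map>
#include <vector>

using MetadataMatches = std::unordered_map<std::string, std::string>; // key -> value

struct SubsetSelector {
  std::vector<std::string> keys; // e.g. {"version", "stage"}
};

struct Subset {
  bool has_hosts{false}; // the precomputed host list for this subset is non-empty
  // per-subset load balancer omitted
};

// For illustration, subsets are keyed by a "key=value," string built in selector key
// order; the precomputation would have to use the same ordering.
const Subset* findSubsetBySelectors(const std::vector<SubsetSelector>& selectors,
                                    const std::unordered_map<std::string, Subset>& subsets,
                                    const MetadataMatches& matches) {
  if (matches.empty()) {
    return nullptr; // O(1) short-circuit to the fallback LB
  }
  for (const SubsetSelector& selector : selectors) { // j iterations
    std::string subset_key;
    bool all_keys_present = true;
    for (const std::string& key : selector.keys) {   // k constant-time lookups
      auto it = matches.find(key);
      if (it == matches.end()) {
        all_keys_present = false;
        break;
      }
      subset_key += key + "=" + it->second + ",";
    }
    if (!all_keys_present) {
      continue;
    }
    auto subset_it = subsets.find(subset_key);
    if (subset_it != subsets.end() && subset_it->second.has_hosts) {
      return &subset_it->second; // first fully matched, non-empty subset wins
    }
  }
  return nullptr; // fall back
}
```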

Notably, this would actually be far worse for @rshriram's use case since it would be a linear search over the subset keys (of which he'd have many).

@mattklein123 (Member)

@zuercher I'm still trying to grok this. In your example:

"subset_keys": [
          { "keys": [ "version", "stage" ]},
          { "keys": [ "version" ]}
        ]

Why is { "keys": [ "version" ]} needed? Is that just an example? Since AFAICT your routes always want to match on both version/stage. Or is this the point that it would then back off to matching only on version if you don't get a good match on version/stage?

@zuercher (Member Author) commented Sep 28, 2017

It's there for the backoff behavior you mentioned.

Having said that, I spent time thinking about it last night and I think it's fine for the first version to attempt to match the metadata from the route exactly (with no backoff). If you want to reach the subset based on version alone, you'd need a route that only specifies version=v. If the route calls for stage=x and version=y and there's no subset for that combination, you get the behavior defined by the fallback policy (default subset, any host, or no host). This lets us use the O(n) algorithm currently in the review (with a tweak to ensure a complete match, but that shouldn't change the complexity).

Then, in the future, if we wanted more complex subset selection behavior, we could add a configuration flag for enabling it and use that to enable a different search algorithm.

@mattklein123 (Member)

@zuercher +1 on that. I think it's a lot simpler to understand (and faster). We can always add more complex stuff later.

@zuercher (Member Author)

Cool. I'll close this PR and see about cleaning this up and splitting it into smaller chunks.

@zuercher zuercher closed this Sep 28, 2017
@rshriram (Member)

@zuercher I don't fully comprehend the prefix thing you said. Would you mind explaining with an example?

Also, why would the runtime be O(n^2) or even O(n)? Can't we materialize the LB pools a priori, one for each subset? If we impose the restriction that the route's keys must match exactly with the subset, and perhaps even assign unique hash IDs to each subset as well as convert the key combo described in the weighted_cluster to the respective subset hash ID, then it's just a matter of hashing the key set and getting a pointer to the appropriate pre-computed LB pool, right? [in the fast path, I mean]
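For what it's worth, a sketch of the exact-match O(1) lookup described here, with hypothetical names (canonicalSubsetId, SubsetLbMap) and a canonical string ID standing in for a real hash; the ID for a route's key set would be computed once at config time, not per request:

```cpp
#include <map>
#include <string>
#include <unordered_map>

// Deterministic ID for a set of key/value pairs: iterate in sorted key order and join.
std::string canonicalSubsetId(const std::map<std::string, std::string>& kvs) {
  std::string id;
  for (const auto& kv : kvs) {
    id += kv.first + "=" + kv.second + ";";
  }
  return id;
}

struct SubsetLb {}; // stands in for a precomputed per-subset load balancer

using SubsetLbMap = std::unordered_map<std::string, SubsetLb>;

const SubsetLb* lookup(const SubsetLbMap& pools,
                       const std::map<std::string, std::string>& route_kvs) {
  auto it = pools.find(canonicalSubsetId(route_kvs)); // amortized O(1) in the fast path
  return it != pools.end() ? &it->second : nullptr;   // nullptr -> fallback policy
}
```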

@zuercher (Member Author)

For reference, here was an example I posted in envoy slack in response to the question about prefixes:

https://gist.github.com/zuercher/65ede6d0296ff2a9e4a310262ebf8b05

htuch pushed a commit to envoyproxy/data-plane-api that referenced this pull request Sep 29, 2017
Updates the comment to reflect the design decision made in envoyproxy/envoy#1735 and renames subset_keys to subset_selectors.

My reasoning for the rename is that the field is all about determining how endpoints are grouped into subsets. Using the names of metadata keys in the endpoints is only one way and others might be added.

Signed-off-by: Stephan Zuercher <[email protected]>
htuch pushed a commit that referenced this pull request Oct 11, 2017
First review split off of #1735.

This provides the interface used to retrieve the RouteAction metadata_match fields, but leaves them unimplemented for now. It also provides protobuf utilities that make it possible to use metadata_match values in STL maps.

Signed-off-by: Stephan Zuercher <[email protected]>
htuch pushed a commit that referenced this pull request Oct 11, 2017
Second PR split out of #1735.

This one adds the LbSubsetConfig data from the CDS Cluster message to the existing ClusterInfo class.

Signed-off-by: Stephan Zuercher <[email protected]>
htuch pushed a commit that referenced this pull request Oct 13, 2017
…1844)

3rd of 4 reviews split out from #1735.

Router::MetadataMatchCriteria is implemented. Instances of the implementation are created when a route's RouteAction has an "envoy.lb" section in metadata_match values. These are merged with additional "envoy.lb" metadata_match values from the WeightedCluster, if any. In the event of a collision, the WeightedCluster keys take precedence.

The LoadBalancerContext interface is modified to allow MetadataMatchCriteria to be accessed by load balancers. Router::Filter's implementation of the LoadBalancerContext is updated to return the selected route's criteria, if any.

Signed-off-by: Stephan Zuercher <[email protected]>
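As a hypothetical illustration of the precedence rule described in that commit (not the commit's code; mergeMetadataMatch and the map type are invented), the merge amounts to:

```cpp
#include <map>
#include <string>

using MetadataMatchMap = std::map<std::string, std::string>; // "envoy.lb" key -> value

MetadataMatchMap mergeMetadataMatch(const MetadataMatchMap& route_entries,
                                    const MetadataMatchMap& weighted_cluster_entries) {
  MetadataMatchMap merged = route_entries;
  for (const auto& kv : weighted_cluster_entries) {
    merged[kv.first] = kv.second; // weighted cluster keys override the route's on collision
  }
  return merged;
}
```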
htuch pushed a commit that referenced this pull request Oct 24, 2017
Fixes #1279, and is the final part of splitting #1735.

Provides the actual load balancer for subsets, with basic stats for the number of subsets active, created, removed, and selected (used for host selection), plus fallbacks (times the fallback path was used).

Signed-off-by: Stephan Zuercher <[email protected]>
@zuercher zuercher deleted the subsetting-load-balancer branch October 25, 2017 22:37