
upstream: subsetting load balancer [rough draft] #1735

Closed

Conversation

zuercher (Member)

As requested in envoyproxy/data-plane-api#173, here's a rough draft of the implementation of the subsetting load balancer.

The load balancer's chooseHost is implemented as follows (a rough sketch in code follows the list):

  1. If the LoadBalancerContext contains no metadata to match, skip to 6.
  2. Otherwise, for each entry in the metadata, do a pair of unordered map lookups for the name and value.
  3. If both lookups succeed, we see if there is a load balancer defined. If so, we have a match and we remember it if its creation order is smaller than that of any previous match. Match or not, we continue the metadata iteration.
  4. If either lookup fails or we reach the end of the metadata, there are no further matching subsets, so return the best found, if any.
  5. If a subset is found, we invoke its load balancer with the LoadBalancerContext and return.
  6. Otherwise, we implement the fallback policy.
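For illustration only, here is a minimal sketch of the lookup described above. It uses plain string keys rather than the PR's hashed ProtobufWkt::Value keys, and SubsetEntry, LbSubsetMap, and findSubset are hypothetical names, not the PR's actual types:

```cpp
#include <cstdint>
#include <memory>
#include <string>
#include <unordered_map>
#include <utility>
#include <vector>

struct SubsetEntry;
// name -> (value -> subset entry); nested entries handle subsequent metadata keys.
using LbValueMap = std::unordered_map<std::string, std::unique_ptr<SubsetEntry>>;
using LbSubsetMap = std::unordered_map<std::string, LbValueMap>;

struct SubsetEntry {
  uint32_t creation_order{0};
  bool has_lb{false};   // stands in for "this subset has a load balancer defined"
  LbSubsetMap children; // matches on later (sorted) metadata keys
};

// Sorted (key, value) pairs taken from the LoadBalancerContext's metadata matches.
using MetadataCriteria = std::vector<std::pair<std::string, std::string>>;

const SubsetEntry* findSubset(const LbSubsetMap& root, const MetadataCriteria& criteria) {
  const LbSubsetMap* level = &root;
  const SubsetEntry* best = nullptr;
  for (const auto& kv : criteria) {                        // step 2
    auto name_it = level->find(kv.first);
    if (name_it == level->end()) {
      break;                                               // step 4: name lookup failed
    }
    auto value_it = name_it->second.find(kv.second);
    if (value_it == name_it->second.end()) {
      break;                                               // step 4: value lookup failed
    }
    const SubsetEntry& entry = *value_it->second;
    // Step 3: remember the earliest-defined matching subset; match or not, keep iterating.
    if (entry.has_lb && (best == nullptr || entry.creation_order < best->creation_order)) {
      best = &entry;
    }
    level = &entry.children;
  }
  // Steps 1/5/6: with no criteria or no match the caller applies the fallback policy;
  // otherwise it delegates to the matched subset's load balancer.
  return best;
}
```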

Metadata values are stored as a ProtobufWkt::Value along with a corresponding hash value. I reused the existing protobuf hash utility function; it seems expensive, so I took care not to compute any hashes in the load-balancing path.

The load balancer produces subsets by watching the original HostSet(s) for updates. Each subset has its own pair of HostSetImpl objects which represent filtered copies of the host_set and local_host_set originally passed to the load balancer.
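As a rough illustration of the filtering (not the PR's HostSetImpl-based code; Host, Subset, and refreshSubset are invented stand-ins), a subset's host list could be rebuilt from the parent host set whenever the parent's membership changes:

```cpp
#include <memory>
#include <string>
#include <unordered_map>
#include <vector>

struct Host {
  std::string address;
  std::unordered_map<std::string, std::string> lb_metadata; // "envoy.lb" entries
};
using HostPtr = std::shared_ptr<Host>;

struct Subset {
  // The key/value pairs that define this subset, e.g. {"stage": "prod", "version": "1.0"}.
  std::unordered_map<std::string, std::string> match;
  std::vector<HostPtr> hosts; // filtered copy of the parent host set
};

bool hostMatchesSubset(const Host& host, const Subset& subset) {
  for (const auto& [key, value] : subset.match) {
    auto it = host.lb_metadata.find(key);
    if (it == host.lb_metadata.end() || it->second != value) {
      return false;
    }
  }
  return true;
}

// Called from the parent host set's member-update callback.
void refreshSubset(Subset& subset, const std::vector<HostPtr>& all_hosts) {
  subset.hosts.clear();
  for (const HostPtr& host : all_hosts) {
    if (hostMatchesSubset(*host, subset)) {
      subset.hosts.push_back(host);
    }
  }
}
```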

Presuming the other PR is approved, there are a number of known deficiencies in this PR that I will address:

  1. It needs tests and additional comments.
  2. ValueUtil::equal (for ProtobufWkt::Value comparison) should operate recursively on list and struct values.
  3. MetadataMatches is in the Router namespace -- it should probably find another home to simplify dependencies.
  4. I think there's an opportunity to have MetadataMatches return a set instead of a vector -- the load balancer depends on the matches being sorted and using a set with a custom Compare operator should make it possible to put that requirement into the code rather than just documenting it.
  5. Similarly, the LoadBalancer depends on using the same hash algorithm as Upstream::MetadataMatchImpl, so it probably makes sense to introduce a small class that wraps a ProtobufWkt::Value and its hash and to use it in both places (a rough sketch of such a wrapper follows the list).
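A minimal sketch of such a wrapper, assuming the hash is computed once up front by an existing utility (something like Envoy's MessageUtil::hash) and using MessageDifferencer merely as a placeholder for the full value comparison; HashedValue and valueEquals are hypothetical names:

```cpp
#include <cstddef>
#include <cstdint>
#include <functional>

#include <google/protobuf/struct.pb.h>
#include <google/protobuf/util/message_differencer.h>

namespace ProtobufWkt = google::protobuf;

// Placeholder full comparison; a reflection-free version is discussed later in this thread.
bool valueEquals(const ProtobufWkt::Value& a, const ProtobufWkt::Value& b) {
  return google::protobuf::util::MessageDifferencer::Equals(a, b);
}

class HashedValue {
public:
  // The hash is precomputed at config/update time so the LB path never hashes.
  HashedValue(const ProtobufWkt::Value& value, uint64_t hash) : value_(value), hash_(hash) {}

  uint64_t hash() const { return hash_; }
  const ProtobufWkt::Value& value() const { return value_; }

  bool operator==(const HashedValue& rhs) const {
    // Cheap hash comparison first; only fall back to a full comparison when the hashes match.
    return hash_ == rhs.hash_ && valueEquals(value_, rhs.value_);
  }

private:
  ProtobufWkt::Value value_;
  uint64_t hash_;
};

// Allows HashedValue to be used as an unordered_map/set key without re-hashing.
namespace std {
template <> struct hash<HashedValue> {
  size_t operator()(const HashedValue& v) const { return static_cast<size_t>(v.hash()); }
};
} // namespace std
```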

@zuercher zuercher force-pushed the subsetting-load-balancer branch from 631d6d0 to e51658d on September 25, 2017 21:36
@zuercher zuercher force-pushed the subsetting-load-balancer branch from b69f3e2 to 1f6a078 on September 25, 2017 22:01
@htuch (Member) commented Sep 25, 2017

Thanks @zuercher, this should be rad. Can you provide some worked examples of how the above precomputed subsets and match algorithms work? E.g. imagine we have 5 hosts and, let's say, 2 different match criteria; what would it look like for a few different requests? I think this would help me grok the intuition in the above algorithm.

return ProtobufUtil::MessageDifferencer::Equals(v1.struct_value(), v2.struct_value());

case ProtobufWkt::Value::kListValue:
return ProtobufUtil::MessageDifferencer::Equals(v1.list_value(), v2.list_value());
Member:

Why can't we just use MessageDifferencer::Equals on v1 and v2?

Member Author:

@mattklein123 voiced concern about using protobuf reflection (which is what the MessageDifferencer seems to use, in my casual inspection of its implementation). The equality check is used in the load balancer path (albeit only after a hash comparison).

Member:

I'm happy to be swayed here, but IMO we should try to make this path as fast as possible and we should avoid reflection if possible in the fast path. I'm not super familiar with this stuff. This function basically does not use reflection unless it involves an embedded struct or list?

Member Author:

Yes, this implementation only uses reflection for embedded lists and structs. That said, I believe I can write a version that doesn't use reflection at all, since the values in the list/struct are just ProtobufWkt::Value instances.
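For example, a reflection-free comparison could recurse directly on the contained Values. This is a sketch only (hypothetical valueEquals), not the code in this PR:

```cpp
#include <google/protobuf/struct.pb.h>

namespace ProtobufWkt = google::protobuf;

// Deep equality for ProtobufWkt::Value without reflection: lists and structs are
// compared by recursing on their contained Values.
bool valueEquals(const ProtobufWkt::Value& v1, const ProtobufWkt::Value& v2) {
  if (v1.kind_case() != v2.kind_case()) {
    return false;
  }
  switch (v1.kind_case()) {
  case ProtobufWkt::Value::kNullValue:
    return true;
  case ProtobufWkt::Value::kNumberValue:
    return v1.number_value() == v2.number_value();
  case ProtobufWkt::Value::kStringValue:
    return v1.string_value() == v2.string_value();
  case ProtobufWkt::Value::kBoolValue:
    return v1.bool_value() == v2.bool_value();
  case ProtobufWkt::Value::kStructValue: {
    const auto& fields1 = v1.struct_value().fields();
    const auto& fields2 = v2.struct_value().fields();
    if (fields1.size() != fields2.size()) {
      return false;
    }
    for (const auto& entry : fields1) {
      auto it = fields2.find(entry.first);
      if (it == fields2.end() || !valueEquals(entry.second, it->second)) {
        return false;
      }
    }
    return true;
  }
  case ProtobufWkt::Value::kListValue: {
    const auto& list1 = v1.list_value();
    const auto& list2 = v2.list_value();
    if (list1.values_size() != list2.values_size()) {
      return false;
    }
    for (int i = 0; i < list1.values_size(); ++i) {
      if (!valueEquals(list1.values(i), list2.values(i))) {
        return false;
      }
    }
    return true;
  }
  case ProtobufWkt::Value::KIND_NOT_SET:
    return true; // both unset
  }
  return false;
}
```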

@zuercher (Member Author)

@htuch, not sure if this is exactly what you were looking for, but here goes.

Given an envoy::api::v2::Cluster with

{
      "name": "c1",
      "lb_policy": "ROUND_ROBIN",    
      "lb_subset_config": {
        "fallback_policy": "DEFAULT_SUBSET",
        "default_subset": {
          "version": "1.0",
          "stage": "prod"
        },
        "subset_keys": [
          { "keys": [ "version", "stage" ]},
          { "keys": [ "version" ]}
        ]
      }
}

And a corresponding envoy::api::v2::ClusterLoadAssignment with

"endpoints": {
  "lb_endpoints": [
    {
      "endpoint": { /* omitted for brevity, call it e1 */ },
      "metadata": {
        "filter_metadata": {
          "envoy.lb": {
            "version": "1.0",
            "stage": "prod"
          }
        }
      }
    },
    {
      "endpoint": { /* omitted, call it e2 */ },
      "metadata": {
        "filter_metadata": {
          "envoy.lb": {
            "version": "1.1",
            "stage": "prod"
          }
        }
      }
    },
    {
      "endpoint": { /* omitted, call it e3 */ },
      "metadata": {
        "filter_metadata": {
          "envoy.lb": {
            "version": "1.2-pre",
            "stage": "dev"
          }
        }
      }
    }
  ]
}

The load balancer will create a default subset containing e1 (the only host with metadata matching stage="prod", version="1.0"). It will also create the following subsets:

| Metadata | Host |
| --- | --- |
| stage="prod", version="1.0" | e1 |
| stage="prod", version="1.1" | e2 |
| stage="dev", version="1.2-pre" | e3 |
| version="1.0" | e1 |
| version="1.1" | e2 |
| version="1.2-pre" | e3 |

Given routes with no metadata_match information, all requests will be routed to the default subset (e1), per the DEFAULT_SUBSET fallback policy.

Given a virtual host containing these `envoy::api::v2::Route` entries:

"routes": [
  {
    "match": {
      "prefix": "/",
      "headers": [
        {
          "name": "x-custom-version",
          "value": "1.2-pre"
        }
      ]
    },
    "route": {
      "cluster": "c1",
      "metadata_match": {
        "filter_metadata": {
          "envoy.lb": {
            "version": "1.2-pre",
            "stage": "dev"
          }
        }
      }
    }
  },
  {
    "match": {
      "prefix": "/"
    },
    "route": {
      "weighted_clusters": {
"clusters": [
          {
            "name": "c1",
            "weight": 90,
            "metadata_match": {
              "filter_metadata": {
                "envoy.lb": {
                  "version": "1.0"
                }
              }
            }
          },
          {
            "name": "c1",
            "weight": 10,
            "metadata_match": {
              "filter_metadata": {
                "envoy.lb": {
                  "version": "1.1"
                }
              }
            }
          }
        ]
      },
      "metadata_match": {
        "filter_metadata": {
          "envoy.lb": {
            "stage": "prod"
          }
        }
      }
    }
  }
]

Requests that match the first route entry (with an x-custom-version header) will produce a LoadBalancerContext whose metadata matches are stage=dev, version=1.2-pre. The endpoint e3 is matched (and if there were multiple hosts with that metadata, round-robin balancing would occur over them). If the cluster is updated and e3 is removed, the current implementation will return no host (triggering a 5xx), but that's a bug: it should detect this condition and trigger the fallback policy.

In the other route, the metadata from the weighted clusters and the route is combined so that endpoints with stage=prod, version=1.0 are selected 90% of the time, while endpoints with stage=prod, version=1.1 are selected 10% of the time. Again, given multiple endpoints with the same metadata, round-robin load balancing would take place.

@mattklein123 (Member) left a comment

@zuercher I mostly skimmed but in general I like this approach. It all fits together very nicely. Well done!

I hate asking this because it sounds like an interview question, but can you briefly describe the runtime of the metadata matching at LB time (I don't care that much about host update time)? Also, whether/when does it involve reflection? I'm sure I can figure it out by looking through in detail, but it might be better if you just call it out. The example you provided for @htuch is great.

switch (cluster->lbType()) {
case LoadBalancerType::LeastRequest: {
lb_.reset(new LeastRequestLoadBalancer(host_set_, parent.local_host_set_, cluster->stats(),
if (cluster->lbSubsetInfo().isEnabled() && cluster->lbType() != LoadBalancerType::OriginalDst) {
Member:

small nit: we should just fail config / doc that subset and original_dst can't be used together.

@htuch (Member) commented Sep 26, 2017

Yes, this example is great. Can you add this to a Markdown file or big block comment somewhere? Also, can you provide examples of requests that just match version, or where the two subset keys are reversed in order? And also in the big comment, some examples of subsets > |1| in size.

I think worst case space complexity for precomputed subsets is O(E*N) where E is the number of endpoints and N is the number of subset key lists. We're binning endpoints into distinct subsets for each subset key list. Is this correct?

Is the worst case time complexity the same? I.e. we would have to iterate over all subsets?

@rshriram (Member)

Nice example! One question: if you recall our earlier discussion on the shard router use case, I have one key mapping to 100K possible values, and these are distributed over a few hundred nodes.
E.g.,
e1 could have version: 1, version: 100, version: 1000
e2 could have version: 2, version: 200, version: 2000
e3 could have version: 1000, version: 2000

Based on the request, the route block would select one value for version (e.g., version: 2000). This would have to be routed to e2/e3.

I don't think this use case can be satisfied by the implementation above. Just thought I would confirm. I could flatten this into a simple map of the form version1: true, version2000: true, so that I have unique key-value pairs. However, I would then have to implement both CDS and EDS (instead of just EDS) to update the subset keys and the endpoint metadata. And in this case, I am virtually creating 100K clusters, which is fine by me as well.

@zuercher (Member Author)

The runtime at load balancing time, as written (see below), is O(n), where n is the number of metadata keys from the LoadBalancerContext, provided that the keys used in the unordered_maps have well-distributed hash values. At that point we have the subset's load balancer and delegate to it, incurring whatever its runtime is for the number of hosts in the subset.

Each of the n iterations in the search incurs one comparison of a pair of ProtobufWkt::Value instances (short-circuited by comparing their hashes first).

Above, I say "as written" because I don't think it searches quite hard enough when the keys for subsets overlap. For instance, given subsets over [a, b] and [b, c], if a route says [a, b, c] it will only find [a, b] right now, but should probably find both and choose one based on definition order. I can think of a small change that will fix this, making the worst case n(n + 1)/2 iterations, so O(n^2) if I understand the syntax.

I need to get home, so @rshriram I'll try to address your question tonight or first thing tomorrow.

@zuercher (Member Author)

@rshriram I misunderstood your use case when we discussed it before. It should, however, be possible to change the config definition to allow subset keys to call out prefixes in addition to full keys. You could then flatten your endpoint metadata and just specify "version:" for subset keys, yes? Is that worth it for your use case?

@mattklein123 (Member)

Above, I say "as written" because I don't think it searches quite hard enough when the keys for subsets overlap. For instance, given subsets over [a, b] and [b, c], if a route says [a, b, c] it will only find [a, b] right now, but should probably find both and choose one based on definition order. I can think of a small change that will fix this, making the worst case n(n + 1)/2 iterations, so O(n^2) if I understand the syntax.

I would prefer that we be very careful about anything that is n^2 in this path, especially anything that uses user-supplied config. IMO the way you have it written now is flexible enough w/ good performance. IMO we could make more powerful matching opt-in later?

Anyway, I'm fine to proceed with this PR for full tests, etc. I think this will get kind of big so if there is any way to break this up into smaller individual PRs w/ tests that would be cool.

@zuercher (Member Author)

What if instead of worst case O(n^2) with n = number of metadata matches in the route, it were worst-case O(j*k) with j = the number of entries in the lb_subset_config.subset_keys (which is <= the number of subsets) and k = the average number of keys used for each subset?

We'd iterate over the subset keys, checking to see if the metadata matches contain a value for each key in the subset (and simultaneously doing the constant-time hashtable lookups that lead us to the actual subset lb). The first entry in the subset keys that is fully matched and has a non-empty subset wins. I think this produces the "correct" behavior that was O(n^2) w.r.t. metadata matches. (This would require the metadata matches returned from the LoadBalancerContext to be an unordered_map. With no metadata matches it short-circuits to the fallback lb in O(1). The subset keys no longer need to be sorted, and you can place the most restrictive keys first in a subset to improve performance.)
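A rough sketch of that search, with invented names (SubsetSelector, findSubsetBySelectors) and a flattened string key standing in for the real nested-map lookup over hashed values:

```cpp
#include <string>
#include <unordered_map>
#include <vector>

using MetadataMatches = std::unordered_map<std::string, std::string>; // key -> value

struct SubsetSelector {
  std::vector<std::string> keys; // e.g. {"version", "stage"}
};

struct Subset {
  bool has_hosts{false}; // the precomputed host list for this subset is non-empty
  // per-subset load balancer omitted
};

// For illustration, subsets are keyed by a "key=value," string built in selector key
// order; the precomputation would have to use the same ordering.
const Subset* findSubsetBySelectors(const std::vector<SubsetSelector>& selectors,
                                    const std::unordered_map<std::string, Subset>& subsets,
                                    const MetadataMatches& matches) {
  if (matches.empty()) {
    return nullptr; // O(1) short-circuit to the fallback LB
  }
  for (const SubsetSelector& selector : selectors) { // j iterations
    std::string subset_key;
    bool all_keys_present = true;
    for (const std::string& key : selector.keys) {   // k constant-time lookups
      auto it = matches.find(key);
      if (it == matches.end()) {
        all_keys_present = false;
        break;
      }
      subset_key += key + "=" + it->second + ",";
    }
    if (!all_keys_present) {
      continue;
    }
    auto subset_it = subsets.find(subset_key);
    if (subset_it != subsets.end() && subset_it->second.has_hosts) {
      return &subset_it->second; // first fully matched, non-empty subset wins
    }
  }
  return nullptr; // fall back
}
```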

Notably, this would actually be far worse for @rshriram's use case since it would be a linear search over the subset keys (of which he'd have many).

@mattklein123 (Member)

@zuercher I'm still trying to grok this. In your example:

"subset_keys": [
          { "keys": [ "version", "stage" ]},
          { "keys": [ "version" ]}
        ]

Why is { "keys": [ "version" ]} needed? Is that just an example? Since AFAICT your routes always want to match on both version/stage. Or is this the point that it would then back off to matching only on version if you don't get a good match on version/stage?

@zuercher (Member Author) commented Sep 28, 2017

It's there for the backoff behavior you mentioned.

Having said that, I spent time thinking about it last night and I think it's fine for the first version to attempt to match the metadata from the route exactly (with no backoff). If you want to reach the subset based on version alone, you'd need a route that only specifies version=v. If the route calls for stage=x and version=y and there's no subset for that combination, you get the behavior defined by the fallback policy (default subset, any host, or no host). This lets us use the O(n) algorithm currently in the review (with a tweak to ensure a complete match, but that shouldn't change the complexity).

Then, in the future, if we wanted more complex subset selection behavior, we could add a configuration flag for enabling it and use that to enable a different search algorithm.

@mattklein123 (Member)

@zuercher +1 on that. I think it's a lot simpler to understand (and faster). We can always add more complex stuff later.

@zuercher (Member Author)

Cool. I'll close this PR and see about cleaning this up and splitting it into smaller chunks.

@zuercher zuercher closed this Sep 28, 2017
@rshriram (Member)

@zuercher I don't fully comprehend the prefix thing you said. Would you mind explaining with an example?

Also, why would the runtime be O(n^2) or even O(n)? Can't we materialize the LB pools a priori, one for each subset? If we impose the restriction that the route's keys must match exactly with the subset, and perhaps even assign unique hash IDs to each subset as well as convert the key combo described in the weighted_cluster to the respective subset hash ID, then it's just a matter of hashing the key set and getting a pointer to the appropriate pre-computed LB pool, right? [in the fast path, I mean]
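For what it's worth, a sketch of the exact-match O(1) lookup described here, with hypothetical names (canonicalSubsetId, SubsetLbMap) and a canonical string ID standing in for a real hash; the ID for a route's key set would be computed once at config time, not per request:

```cpp
#include <map>
#include <string>
#include <unordered_map>

// Deterministic ID for a set of key/value pairs: iterate in sorted key order and join.
std::string canonicalSubsetId(const std::map<std::string, std::string>& kvs) {
  std::string id;
  for (const auto& kv : kvs) {
    id += kv.first + "=" + kv.second + ";";
  }
  return id;
}

struct SubsetLb {}; // stands in for a precomputed per-subset load balancer

using SubsetLbMap = std::unordered_map<std::string, SubsetLb>;

const SubsetLb* lookup(const SubsetLbMap& pools,
                       const std::map<std::string, std::string>& route_kvs) {
  auto it = pools.find(canonicalSubsetId(route_kvs)); // amortized O(1) in the fast path
  return it != pools.end() ? &it->second : nullptr;   // nullptr -> fallback policy
}
```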

@zuercher (Member Author)

For reference, here was an example I posted in envoy slack in response to the question about prefixes:

https://gist.github.com/zuercher/65ede6d0296ff2a9e4a310262ebf8b05

htuch pushed a commit to envoyproxy/data-plane-api that referenced this pull request Sep 29, 2017
Updates the comment to reflect the design decision made in envoyproxy/envoy#1735 and renames subset_keys to subset_selectors.

My reasoning for the rename is that the field is all about determining how endpoints are grouped into subsets. Using the names of metadata keys in the endpoints is only one way and others might be added.

Signed-off-by: Stephan Zuercher <[email protected]>
htuch pushed a commit that referenced this pull request Oct 11, 2017
First review split off of #1735.

This provides the interface used to retrieve the RouteAction metadata_match fields, but leaves them unimplemented for now. It also provides protobuf utilities that make it possible to use metadata_match values in STL maps.

Signed-off-by: Stephan Zuercher <[email protected]>
htuch pushed a commit that referenced this pull request Oct 11, 2017
Second PR split out of #1735.

This one adds the LbSubsetConfig data from the CDS Cluster message to the existing ClusterInfo class.

Signed-off-by: Stephan Zuercher <[email protected]>
htuch pushed a commit that referenced this pull request Oct 13, 2017
…1844)

3rd of 4 reviews split out from #1735.

Router::MetadataMatchCriteria is implemented. Instances of the implementation are created when a route's RouteAction has an "envoy.lb" section in metadata_match values. These are merged with additional "envoy.lb" metadata_match values from the WeightedCluster, if any. In the event of a collision, the WeightedCluster keys take precedence.

The LoadBalancerContext interface is modified to allow MetadataMatchCriteria to be accessed by load balancers. Router::Filter's implementation of the LoadBalancerContext is updated to return the selected route's criteria, if any.

Signed-off-by: Stephan Zuercher <[email protected]>
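As a hypothetical illustration of the precedence rule described in that commit (not the commit's code; mergeMetadataMatch and the map type are invented), the merge amounts to:

```cpp
#include <map>
#include <string>

using MetadataMatchMap = std::map<std::string, std::string>; // "envoy.lb" key -> value

MetadataMatchMap mergeMetadataMatch(const MetadataMatchMap& route_entries,
                                    const MetadataMatchMap& weighted_cluster_entries) {
  MetadataMatchMap merged = route_entries;
  for (const auto& kv : weighted_cluster_entries) {
    merged[kv.first] = kv.second; // weighted cluster keys override the route's on collision
  }
  return merged;
}
```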
htuch pushed a commit that referenced this pull request Oct 24, 2017
Fixes #1279, and is the final part of splitting #1735.

Provides the actual load balancer for subsets, with basic stats for the number of subsets active, created, removed, and selected (used for host selection), plus fallbacks (times the fallback path was used).

Signed-off-by: Stephan Zuercher <[email protected]>
@zuercher zuercher deleted the subsetting-load-balancer branch October 25, 2017 22:37