Adaptive Load Controller protos #398

Merged
merged 33 commits into from
Jul 29, 2020

Conversation

eric846
Contributor

@eric846 eric846 commented Jul 9, 2020

All protos, including plugin-specific config protos for all plugins we plan to ship initially.

Works on #416

@eric846 eric846 requested review from oschaaf and mum4k July 9, 2020 02:02
@eric846 eric846 added the waiting-for-review and waiting-for-changes labels and removed the waiting-for-review label Jul 9, 2020
@mum4k
Collaborator

mum4k commented Jul 9, 2020

@oschaaf please assign to me once you are done reviewing this PR (after Eric implements the latest round of changes).

@eric846 eric846 changed the title from "Adaptive RPS protos" to "Adaptive Load Controller protos" Jul 10, 2020
@eric846
Contributor Author

eric846 commented Jul 10, 2020

Hi Otto,
This should be ready to review now. Renamed Adaptive RPS Controller to Adaptive Load Controller to reflect the general capability.

The step_controller_impl.proto configs now take a NighthawkFieldSelector enum input (defined in step_controller.proto) that tells the controller which single field within CommandLineOptions it should adjust dynamically. The enum includes every numeric field that I had no reason to rule out. Some of the fields I included in NighthawkFieldSelector may not actually make sense and should be removed.

Future numerical fields added to CommandLineOptions would need a corresponding value added to NighthawkFieldSelector.

We will ship only basic single-variable StepControllers at first, but they will already work equally well for any field, not just RPS.

Advanced multivariable optimization StepControllers could still use NighthawkFieldSelector fields to specify the multiple variables.
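
To illustrate the selector idea described above, here is a minimal Python sketch of how a single-variable controller could adjust one chosen field and leave the rest untouched. `FieldSelector`, `Options`, and `apply_load` are hypothetical stand-ins, not the actual NighthawkFieldSelector or CommandLineOptions definitions, and the two fields shown are only illustrative.

```python
from dataclasses import dataclass, replace
from enum import Enum


class FieldSelector(Enum):
    """Hypothetical stand-in for NighthawkFieldSelector.

    Each value is the name of the option field it selects.
    """
    REQUESTS_PER_SECOND = "requests_per_second"
    CONNECTIONS = "connections"


@dataclass(frozen=True)
class Options:
    """Hypothetical two-field subset of CommandLineOptions."""
    requests_per_second: int = 5
    connections: int = 1


def apply_load(options: Options, selector: FieldSelector, value: int) -> Options:
    """Return a copy of options with only the selected field set to value."""
    return replace(options, **{selector.value: value})


base = Options()
adjusted = apply_load(base, FieldSelector.CONNECTIONS, 8)
```

Because the controller only sees the selector and a number, the same stepping logic would apply to any numeric field, which is the point made above about not being limited to RPS.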

Thanks!

@eric846 eric846 added the waiting-for-review label and removed the waiting-for-changes label Jul 10, 2020
@eric846
Contributor Author

eric846 commented Jul 11, 2020

Updated again to remove some of the hard-coded logic for setting multiple fields and use plugins instead, which actually simplifies the design.

Member

@oschaaf oschaaf left a comment

Generally looks great to me, just a couple of thoughts and a possible proto3/validation nit.

oschaaf
oschaaf previously approved these changes Jul 13, 2020
Member

@oschaaf oschaaf left a comment

Thanks for the explanations, LGTM!

@eric846
Contributor Author

eric846 commented Jul 15, 2020

@mum4k PTAL, simplified protos as discussed

mum4k
mum4k previously approved these changes Jul 17, 2020
Collaborator

@mum4k mum4k left a comment

Still looks good, let's wait to see what Harvey thinks about this.

@dubious90
Contributor

/retest

@repokitteh-read-only

🤷‍♀️ nothing to rebuild.

🐱

Caused by: a #398 (comment) was created by @dubious90.

see: more, trace.

…ent.Output turns out not to include the status

Signed-off-by: eric846 <[email protected]>
@eric846 eric846 force-pushed the adaptive-rps-protos2 branch from 070e404 to 677b783 Compare July 19, 2020 23:42
Member

@htuch htuch left a comment

Generally looks great, a few questions/comments.

@eric846 eric846 requested a review from htuch July 22, 2020 02:05
@mum4k
Collaborator

mum4k commented Jul 22, 2020

@pamorgan please review and assign back to me when done.

@mum4k mum4k requested a review from pamorgan July 22, 2020 13:59
@mum4k mum4k removed their assignment Jul 22, 2020
Member

@htuch htuch left a comment

API LGTM modulo comment.

message MetricSpec {
// Name of the metric to evaluate. For the set of built-in metric names, see
// source/adaptive_load/metrics_plugin_impl.cc. Required.
string metric_name = 1 [(validate.rules).string.min_len = 1];
Member

How do plugins and metric names relate? Does each plugin only export a fixed number of metric names? Or is the metric name an opaque ID?

Contributor Author

Each plugin will support its own fixed set of metric names. We will query each plugin for its metric names at startup in order to validate the MetricSpec in the adaptive load session input proto.

For nighthawk.builtin:

  • latency-min
  • latency-mean
  • latency-mean-plus-1stdev
  • latency-mean-plus-2stdev
  • latency-mean-plus-3stdev
  • latency-max
  • achieved-rps
  • attempted-rps
  • send-rate
  • success-rate
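
The startup validation described above could look roughly like the following Python sketch. The plugin name and metric names come from this comment; the `MetricSpec` dataclass, the `validate_metric_spec` helper, and the dictionary standing in for "querying each plugin" are assumptions made for illustration, not the actual Nighthawk implementation.

```python
from dataclasses import dataclass
from typing import Dict, FrozenSet, List

# Metric names listed above for the nighthawk.builtin plugin.
BUILTIN_METRIC_NAMES: FrozenSet[str] = frozenset({
    "latency-min",
    "latency-mean",
    "latency-mean-plus-1stdev",
    "latency-mean-plus-2stdev",
    "latency-mean-plus-3stdev",
    "latency-max",
    "achieved-rps",
    "attempted-rps",
    "send-rate",
    "success-rate",
})


@dataclass(frozen=True)
class MetricSpec:
    """Hypothetical Python mirror of the MetricSpec proto fields."""
    metric_name: str
    metrics_plugin_name: str


def validate_metric_spec(
    spec: MetricSpec, names_by_plugin: Dict[str, FrozenSet[str]]
) -> List[str]:
    """Return a list of error strings; an empty list means the spec is valid."""
    supported = names_by_plugin.get(spec.metrics_plugin_name)
    if supported is None:
        return [f"unknown metrics plugin '{spec.metrics_plugin_name}'"]
    if spec.metric_name not in supported:
        return [
            f"metric '{spec.metric_name}' is not provided by "
            f"plugin '{spec.metrics_plugin_name}'"
        ]
    return []


# Assumption: each plugin reports its fixed set of names once at startup.
names_by_plugin = {"nighthawk.builtin": BUILTIN_METRIC_NAMES}

ok = validate_metric_spec(MetricSpec("latency-mean", "nighthawk.builtin"), names_by_plugin)
bad_metric = validate_metric_spec(MetricSpec("latency-p99", "nighthawk.builtin"), names_by_plugin)
bad_plugin = validate_metric_spec(MetricSpec("latency-mean", "example.plugin"), names_by_plugin)
```

Rejecting an unknown name before the session starts means a typo in the input proto fails fast rather than partway through a benchmark run.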

Member

SGTM. I'm going to be OOTO for the next two weeks, so please go ahead with this set of protos. I think as long as we're the only consumers, which is highly likely for the near future, it's fine to iterate as needed.

eric846 added 2 commits July 22, 2020 19:08
Signed-off-by: eric846 <[email protected]>
Signed-off-by: eric846 <[email protected]>
@mum4k mum4k self-assigned this Jul 23, 2020
@mum4k
Collaborator

mum4k commented Jul 23, 2020

fyi pending review by @pamorgan. Please assign to me once done.

@mum4k mum4k removed their assignment Jul 23, 2020
BenchmarkResult testing_stage_result = 3;
// Metrics and thresholds that were used to determine load adjustments, as referenced in the
// BenchmarkResults.
repeated MetricSpecWithThreshold metric_thresholds = 4;
Contributor

Can this be different from the metric_thresholds in AdaptiveLoadSessionSpec? If not, this may be redundant.

Contributor Author

It's copied directly from metric_thresholds in the input. I included it in the output to preserve the context when dumping the output proto to an archive. Otherwise the archive would only have the value and the score, but not the threshold. If somebody just archived the full input proto alongside the output proto, this field would be redundant.
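
As a rough illustration of that copy, the Python sketch below echoes the input thresholds into the output so an archived output is self-describing. The dataclasses are simplified, hypothetical mirrors of the protos (a single `threshold` float stands in for whatever the real threshold message holds), and `build_output` is an invented helper.

```python
import copy
from dataclasses import dataclass, field
from typing import List


@dataclass
class MetricSpecWithThreshold:
    """Hypothetical simplification: one metric name and one numeric threshold."""
    metric_name: str
    threshold: float


@dataclass
class SessionSpec:
    """Hypothetical mirror of the input proto's metric_thresholds field."""
    metric_thresholds: List[MetricSpecWithThreshold] = field(default_factory=list)


@dataclass
class SessionOutput:
    """Hypothetical mirror of the output proto's metric_thresholds field."""
    metric_thresholds: List[MetricSpecWithThreshold] = field(default_factory=list)


def build_output(spec: SessionSpec) -> SessionOutput:
    """Copy the thresholds from the input so the output can be archived alone."""
    return SessionOutput(metric_thresholds=copy.deepcopy(spec.metric_thresholds))


spec = SessionSpec([MetricSpecWithThreshold("latency-mean", 0.5)])
output = build_output(spec)
```

A deep copy keeps the archived output independent of any later change to the input object.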

// The duration of the single benchmark session of the testing stage to
// confirm the performance at the level of load found in the adjusting stage.
// Required.
google.protobuf.Duration testing_stage_duration = 7
Contributor

Can we add a reasonable default value for this?

Contributor Author

Done

// Service input spec.
nighthawk.client.Output nighthawk_service_output = 1;
// Status of this Nighthawk Service benchmark session.
google.rpc.Status status = 2;
Contributor

Can we add more comments on what status we should get for special errors? For example, what is the error if we never converge?

Contributor Author

Clarified the comment on this field, which applies narrowly to failures to call the Nighthawk Service and internal errors returned by Nighthawk Service.

Also extended the comment in AdaptiveLoadSessionOutput where things like convergence status are recorded.

string metric_name = 1 [(validate.rules).string.min_len = 1];
// Name of the MetricsPlugin providing the metric ("nighthawk.builtin" for built-in).
// Required.
string metrics_plugin_name = 2 [(validate.rules).string.min_len = 1];
Contributor

Can we add a default of "nighthawk.builtin"?

Contributor Author

Done
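
Since proto3 string fields have no custom defaults and read as empty when unset, one way such a default could be applied is in the consuming code. The Python sketch below assumes that an empty `metrics_plugin_name` means the built-in plugin; the helper name is hypothetical and the actual change made in this PR is not shown here.

```python
DEFAULT_METRICS_PLUGIN_NAME = "nighthawk.builtin"


def effective_plugin_name(metrics_plugin_name: str) -> str:
    """Treat an unset (empty) plugin name as the built-in plugin."""
    return metrics_plugin_name or DEFAULT_METRICS_PLUGIN_NAME


unset = effective_plugin_name("")
explicit = effective_plugin_name("example.plugin")
```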

@@ -0,0 +1,21 @@
load("@envoy_api//bazel:api_build_system.bzl", "api_cc_py_proto_library")
Contributor

I have some nits but LGTM

@mum4k mum4k self-assigned this Jul 28, 2020
@mum4k mum4k merged commit 67d0f96 into envoyproxy:master Jul 29, 2020
Labels
waiting-for-review A PR waiting for a review.
6 participants