Adaptive Load main loop #483

eric846 · 2020-08-24T22:15:34Z

This is the main function of the Adaptive Load Controller library:

Check the input proto for errors and apply default input values.
For the Adjusting Stage: One big while loop:
- Check for convergence deadline exceeded.
- Get the latest dynamically generated CommandLineOptions from the StepController.
- Run a short benchmark with the Nighthawk Service.
- Obtain metric values for this benchmark from MetricsPlugins.
- Score the metrics using ScoringFunction plugins.
- Report scores back to the StepController, which recalculates the load for the next iteration.
- Check for StepController convergence.
- Check for StepController doom.
For the Testing Stage: Run one long benchmark on the Nighthawk Service at the converged load.
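
The control flow above can be sketched roughly as follows. This is an illustration only, not the code in this PR: `SketchStepController`, `SketchSessionResult`, and `RunSketchSession` are hypothetical stand-ins, the benchmark, metrics, and scoring steps are collapsed into a single `score_at_rps` callback, and an iteration cap stands in for the convergence deadline.

```cpp
#include <cassert>
#include <functional>
#include <string>

// Hypothetical stand-in for a StepController doing a simple linear search:
// raise the load while scores are non-negative, then settle on the last good
// load once a score goes negative.
class SketchStepController {
 public:
  int CurrentRps() const { return rps_; }
  bool IsConverged() const { return converged_; }
  bool IsDoomed() const { return doomed_; }

  void UpdateAndRecompute(double score) {
    if (score >= 0.0) {
      last_good_rps_ = rps_;
      rps_ += kStep;
    } else if (last_good_rps_ > 0) {
      rps_ = last_good_rps_;
      converged_ = true;
    } else {
      doomed_ = true;  // Even the lowest load failed, so give up.
    }
  }

 private:
  static constexpr int kStep = 100;
  int rps_ = kStep;
  int last_good_rps_ = 0;
  bool converged_ = false;
  bool doomed_ = false;
};

struct SketchSessionResult {
  bool ok = false;
  std::string error;      // Meaningful only when !ok.
  int converged_rps = 0;  // Meaningful only when ok.
};

// score_at_rps stands in for "run a benchmark, collect metrics, score them".
SketchSessionResult RunSketchSession(const std::function<double(int)>& score_at_rps,
                                     int max_iterations) {
  assert(max_iterations > 0);
  SketchStepController controller;
  SketchSessionResult result;
  // Adjusting stage: the iteration cap plays the role of the deadline check.
  for (int i = 0; i < max_iterations; ++i) {
    const double score = score_at_rps(controller.CurrentRps());
    controller.UpdateAndRecompute(score);
    if (controller.IsConverged()) {
      // Testing stage: one more (long) benchmark at the converged load.
      score_at_rps(controller.CurrentRps());
      result.ok = true;
      result.converged_rps = controller.CurrentRps();
      return result;
    }
    if (controller.IsDoomed()) {
      result.error = "doomed";
      return result;
    }
  }
  result.error = "convergence deadline exceeded";
  return result;
}
```

With a callback that reports a non-negative score up to 300 RPS and a negative score beyond it, this sketch settles on 300; a callback that always returns a negative score ends the session as doomed on the first iteration.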

The unit test has an example of mocking a gRPC stub, which is not easy.

Signed-off-by: eric846 <[email protected]>

mum4k · 2020-08-24T22:36:32Z

@dubious90 please review and assign back to me once done.

extract some helper functions, support input setting failure in FakeStepController, improve coverage, fix clang-tidy

Signed-off-by: eric846 <[email protected]>

eric846 · 2020-08-25T03:32:33Z

tools/check_format.sh

@@ -8,7 +8,7 @@ TO_CHECK="${2:-$PWD}"
 bazel run @envoy//tools:code_format/check_format.py -- \
  --skip_envoy_build_rule_check  --namespace_check Nighthawk \
  --build_fixer_check_excluded_paths=$(realpath ".") \
-  --include_dir_order envoy,nighthawk,external/source/envoy,external,api,common,source,exe,server,client,test_common,test \
+  --include_dir_order envoy,nighthawk,external/source/envoy,external,api,common,source,exe,server,client,test_common,grpcpp,test \


Without separating grpcpp into its own section, the formatter arranges the include files in an order that is then detected as invalid.

Signed-off-by: eric846 <[email protected]>

oschaaf

Flushing out my first comments based on a first read of this PR; I feel this looks great overall

source/adaptive_load/adaptive_load_controller_impl.cc

dubious90 · 2020-08-25T14:03:07Z

source/adaptive_load/adaptive_load_controller_impl.cc

+  }
+  ::grpc::Status status = stream->Finish();
+  if (!status.ok()) {
+    response.mutable_error_detail()->set_code(status.error_code());


Shouldn't the ExecutionResponse already have error information here? Or is this information more valuable?

Actually, I wonder if returning an absl::StatusOr<nighthawk.client.Output> would be clearer here for what you want.

Done -- we are now all-in on StatusOr!
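
For readers following this thread, the shape being discussed (one return value that is either the benchmark output or an error) might look like the sketch below. It deliberately avoids the real `absl::StatusOr` and `grpc::Status` types so that it compiles on its own: `SketchStatus`, `SketchStatusOr`, `SketchOutput`, and `CollectBenchmarkOutput` are hypothetical names, and only the numeric code 13 is taken from gRPC's `INTERNAL` status code.

```cpp
#include <cassert>
#include <optional>
#include <string>
#include <utility>
#include <variant>

// Hypothetical stand-in for grpc::Status / absl::Status.
struct SketchStatus {
  int code = 0;  // 0 means OK, mirroring gRPC's convention.
  std::string message;
  bool ok() const { return code == 0; }
};

// Hypothetical stand-in for the benchmark output proto.
struct SketchOutput {
  std::string summary;
};

// Minimal value-or-error holder in the spirit of absl::StatusOr.
template <typename T>
class SketchStatusOr {
 public:
  SketchStatusOr(T value) : data_(std::move(value)) {}
  SketchStatusOr(SketchStatus status) : data_(std::move(status)) {
    assert(!std::get<SketchStatus>(data_).ok());  // An error holder needs an error.
  }
  bool ok() const { return std::holds_alternative<T>(data_); }
  const T& value() const {
    assert(ok());
    return std::get<T>(data_);
  }
  const SketchStatus& status() const { return std::get<SketchStatus>(data_); }

 private:
  std::variant<T, SketchStatus> data_;
};

constexpr int kSketchInternalError = 13;  // Same number as grpc::StatusCode::INTERNAL.

// Folds "what the stream delivered" and "how the stream finished" into one result.
SketchStatusOr<SketchOutput> CollectBenchmarkOutput(std::optional<SketchOutput> response,
                                                    const SketchStatus& stream_status) {
  if (!stream_status.ok()) {
    return stream_status;  // Transport-level failure reported when finishing the stream.
  }
  if (!response.has_value()) {
    return SketchStatus{kSketchInternalError, "stream finished without a response"};
  }
  return std::move(*response);
}
```

The caller then checks `ok()` once instead of inspecting status fields inside a response proto.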

source/adaptive_load/adaptive_load_controller_impl.cc

dubious90 · 2020-08-25T14:40:54Z

source/adaptive_load/adaptive_load_controller_impl.cc

+  absl::flat_hash_map<const nighthawk::adaptive_load::MetricSpec*,
+                      const nighthawk::adaptive_load::ThresholdSpec*>
+      threshold_spec_from_metric_spec;
+  for (const MetricSpecWithThreshold& metric_threshold : spec.metric_thresholds()) {


There are a lot of variables here, which makes this hard to follow. Maybe add comments explaining what maps are being built (functionally: not describing the elements, but what their purposes are).

Commented the complicated map and also extracted another function.
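
One plausible reading of the pointer-to-pointer map in the hunk above is a lookup from each metric to its threshold, with a null entry for metrics that are reported but not scored. The sketch below illustrates that reading only. All `Sketch*` types are hypothetical stand-ins for the protos, `informational_metric_specs` is an assumed field name, and `std::unordered_map` replaces `absl::flat_hash_map` to keep the example self-contained.

```cpp
#include <cassert>
#include <string>
#include <unordered_map>
#include <vector>

struct SketchMetricSpec {
  std::string metric_name;
};
struct SketchThresholdSpec {
  double upper_bound = 0.0;
};
struct SketchMetricSpecWithThreshold {
  SketchMetricSpec metric_spec;
  SketchThresholdSpec threshold_spec;
};
struct SketchSessionSpec {
  std::vector<SketchMetricSpecWithThreshold> metric_thresholds;
  std::vector<SketchMetricSpec> informational_metric_specs;  // Assumed field.
};

// Purpose of the map: let later code iterate over all metrics uniformly and
// ask "does this metric have a threshold to score against?".
using ThresholdLookup =
    std::unordered_map<const SketchMetricSpec*, const SketchThresholdSpec*>;

// The returned pointers refer into `spec`, so `spec` must outlive the lookup
// and must not be modified while the lookup is in use.
ThresholdLookup BuildThresholdLookup(const SketchSessionSpec& spec) {
  ThresholdLookup lookup;
  for (const SketchMetricSpecWithThreshold& entry : spec.metric_thresholds) {
    lookup[&entry.metric_spec] = &entry.threshold_spec;
  }
  for (const SketchMetricSpec& metric : spec.informational_metric_specs) {
    lookup[&metric] = nullptr;  // Reported only, never scored.
  }
  return lookup;
}

bool HasThreshold(const ThresholdLookup& lookup, const SketchMetricSpec& metric) {
  const auto it = lookup.find(&metric);
  assert(it != lookup.end());  // Every metric in the spec should be registered.
  return it->second != nullptr;
}
```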

dubious90 · 2020-08-25T14:42:14Z

source/adaptive_load/adaptive_load_controller_impl.cc

+ * @return BenchmarkResult Proto containing metric scores for this Nighthawk Service benchmark
+ * session, or an error propagated from the Nighthawk Service or MetricsPlugins.
+ */
+BenchmarkResult AnalyzeNighthawkBenchmark(


This function is very long, which adds to it being difficult to read. Is there any way to break it up?

Extracted EvaluateMetric function, should make it a bit simpler.

dubious90 · 2020-08-25T16:30:06Z

source/adaptive_load/adaptive_load_controller_impl.cc

+
+} // namespace
+
+AdaptiveLoadSessionOutput PerformAdaptiveLoadSession(


I realize that this API was already partially reviewed, but that was before we brought in absl::Status. I question whether much of this would be easier to follow (and potentially easier to use) if you used StatusOr rather than having a proto that contains status fields. Open to pushback here.

Done, adopted StatusOr throughout, including the API.

dubious90 · 2020-08-25T16:39:51Z

test/adaptive_load/fake_time_source.h

+ * Fake time source that ticks 1 second on every query, starting from the Unix epoch. Supports only
+ * monotonicTime().
+ */
+class FakeIncrementingMonotonicTimeSource : public Envoy::TimeSource {


I thought Envoy already supported mock timers. Is there a reason why we need to implement a fake?

I see a mock TimeSource being used, and there's TestTimeSystem, but here we need something where:

- time advances between calls
- no intervention is needed between calls

because once we unleash the main loop we can't pause it to manipulate the clock.

So personally, I love the simplicity of this. It might be possible to use simulated time, but then we may need another thread to advance time or some other relatively complex construct. Also, as I think monotonic time promises us to advance each time we call it, this seems quite an accurate modelling of it. So personally, when weighing the potential duplication introduced against the simplicity we win, I'm rooting for the way this is right now.
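
A minimal sketch of the idea discussed in this thread, assuming nothing about the real class beyond what the doc comment says: `SketchIncrementingTimeSource` is a hypothetical standalone class (it does not derive from `Envoy::TimeSource`), it starts at the zero point of `std::chrono::steady_clock` rather than the Unix epoch, and returning the current value before advancing is a choice made for the sketch.

```cpp
#include <cassert>
#include <chrono>

// Hypothetical fake clock: every query returns a value one second later than
// the previous query, with no intervention from the test.
class SketchIncrementingTimeSource {
 public:
  using TimePoint = std::chrono::steady_clock::time_point;

  TimePoint monotonicTime() {
    const TimePoint now(elapsed_);
    elapsed_ += std::chrono::seconds(1);
    return now;
  }

 private:
  std::chrono::seconds elapsed_{0};
};

// A deadline-bounded loop like the one in the controller terminates by
// itself, because each clock query moves the fake time forward.
int CountIterationsBeforeDeadline(SketchIncrementingTimeSource& time_source,
                                  std::chrono::seconds deadline) {
  assert(deadline.count() > 0);
  const auto start = time_source.monotonicTime();
  int iterations = 0;
  while (time_source.monotonicTime() - start < deadline) {
    ++iterations;
  }
  return iterations;
}
```

Because the loop body never has to touch the clock, the code under test can run start to finish in one call.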

oschaaf

Another round of feedback; I've read most of the PR now, but still need to go through most of the tests

source/adaptive_load/adaptive_load_controller_impl.cc

test/adaptive_load/fake_time_source.h

test/adaptive_load/adaptive_load_controller_test.cc

fix open loop setting, update API to use StatusOr, catch more gRPC errors, shorten function, rearrange main loop

Signed-off-by: eric846 <[email protected]>

eric846 · 2020-08-26T22:50:17Z

Added several more commits with nontrivial content, starting from "delete proto error status fields." I had to introduce new functionality to the FakeStepController to be able to cover the last bit of the main function.

add pre-countdown fixed rps value to failed input value setter FakeStepController helper

Signed-off-by: eric846 <[email protected]>

mum4k · 2020-08-26T23:37:07Z

I am wondering what we can do about the fact that this PR is both fairly long and hard to navigate / review. The main issue may be that there are multiple effort tracks combined in this PR which makes it hard to mentally map the code changes to individual tracks. As it stands, I don't feel confident enough that I won't miss anything important while reviewing this.

The best solution might be to shelve this PR, while keeping it as a reference for the reviewers who already posted comments in here. We can then try to split the individual effort tracks into multiple separate PRs, each of them ideally focused on just one thing. This can also allow us to better track the reasoning for individual changes in the description of each PR.

I may not know enough to prescribe the split exactly, but here is a rough list of ideas. Please feel free to adapt this to reality as you see fit.

- a separate PR performing the API changes related to StatusOr.
- a separate PR fixing any proto errors.
- a separate PR implementing the fake time source.
- a separate PR performing changes to the fake step controller.

Next we should address how long the adaptive load controller implementation itself is. Skimming over the helper functions in the anonymous namespace, some of them look like they contain enough logic to warrant their own tests. I will leave this to your best judgment. Having private functions makes the code agile, but there is a balance between keeping everything private and having separate test coverage for large enough behaviors. Maybe we can find some such candidates; moving them to standalone libraries with test coverage will both simplify the PRs and make it faster to find errors when their individual unit tests break. E.g., maybe the function that evaluates metrics deserves its own library? This should leave us in a situation where the main library of the adaptive load controller only contains its main business logic.

Any such libraries can then be sent as separate PRs. We can then send one last PR to add the main logic of the adaptive load controller.

eric846 · 2020-08-27T02:39:03Z

Commencing this split now

Migrates PerformAdaptiveLoadSession() to return absl::StatusOr. Deletes status fields from result protos since they are no longer needed. Temporarily deletes the simulated doom mechanism from FakeStepController because it relied on one of the deleted proto fields; it will be replaced by a new mechanism in a subsequent PR. Changes the contract of the adaptive load session spec proto with respect to open_loop in nighthawk_traffic_template: We no longer return an error if the user sets a value for open_loop; instead we override open_loop to true if it was not specified (the opposite of the Nighthawk client default), while still allowing the user to explicitly set open_loop to either true or false. Part 1 of splitting PR #483.
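
The `open_loop` contract described above amounts to "fill in `true` only when the user left the field unset". A minimal sketch of that rule, using `std::optional<bool>` in place of the proto field and a hypothetical `SketchTrafficTemplate` type:

```cpp
#include <cassert>
#include <optional>

// Hypothetical stand-in for the relevant part of nighthawk_traffic_template.
struct SketchTrafficTemplate {
  std::optional<bool> open_loop;  // Empty means the user did not set it.
};

// Returns a copy where open_loop defaults to true, while an explicit user
// choice (true or false) is preserved.
SketchTrafficTemplate ApplyOpenLoopDefault(SketchTrafficTemplate traffic_template) {
  if (!traffic_template.open_loop.has_value()) {
    traffic_template.open_loop = true;
  }
  assert(traffic_template.open_loop.has_value());
  return traffic_template;
}
```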

A fake `Envoy::TimeSource` that ticks forward every time it is called, starting from 1970-01-01 00:00:00. Only `monotonicTime()` is implemented. This simple autonomous behavior is needed because the adaptive load controller checks the clock many times automatically and can't be interrupted in the middle of the run to update the fake time. Part 2 of splitting PR #483.

Updates FakeStepController:
- Set doomed state when `FakeStepController::UpdateAndRecompute()` receives a `BenchmarkResult` with a negative metric score. Previously `BenchmarkResult` could contain an error status, and we used the error status to trigger simulated doom. Now errors do not get passed to `StepController::UpdateAndRecompute()`, so we need a new mechanism.
- Adds the ability to simulate an `InputVariableSetter` failure in `GetCurrentCommandLineOptions()`. The artificial failure is specified in the config proto.
- Adds a countdown mechanism so that `GetCurrentCommandLineOptions()` returns successfully until `UpdateAndRecompute()` has been called a configured number of times, then starts returning the simulated input variable setter failure mentioned above. This simple autonomous behavior is needed to test error handling in the testing stage without triggering an early exit in the adjusting stage; the adaptive load controller can't be paused in the middle of the test to reconfigure the `FakeStepController` to change its behavior.

The behavior of `FakeStepController` after an `UpdateAndRecompute()` with a `BenchmarkResult` containing the given metric scores:
- Any positive score: converged
- All scores zero: neither converged nor doomed
- Any negative score: doomed

Part 3 of splitting PR #483.
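
The score rules and the countdown can be sketched as below. `SketchFakeStepController` is a hypothetical simplification: scores are passed as a plain vector instead of a `BenchmarkResult`, success or failure of `GetCurrentCommandLineOptions()` is reduced to a `bool`, and letting a negative score take precedence over a positive one in the same update is an assumption, since the description does not say how a mix is handled.

```cpp
#include <cassert>
#include <vector>

class SketchFakeStepController {
 public:
  // updates_before_failure: how many UpdateAndRecompute() calls are allowed
  // before GetCurrentCommandLineOptions() starts failing.
  explicit SketchFakeStepController(int updates_before_failure)
      : updates_remaining_(updates_before_failure) {
    assert(updates_before_failure >= 0);
  }

  // true means success; false means the simulated input variable setter failure.
  bool GetCurrentCommandLineOptions() const { return updates_remaining_ > 0; }

  void UpdateAndRecompute(const std::vector<double>& metric_scores) {
    if (updates_remaining_ > 0) {
      --updates_remaining_;
    }
    is_converged_ = false;
    is_doomed_ = false;
    for (const double score : metric_scores) {
      if (score < 0.0) {
        is_doomed_ = true;
      } else if (score > 0.0) {
        is_converged_ = true;
      }
    }
    if (is_doomed_) {
      is_converged_ = false;  // Assumption: doom wins over convergence.
    }
  }

  bool IsConverged() const { return is_converged_; }
  bool IsDoomed() const { return is_doomed_; }

 private:
  int updates_remaining_;
  bool is_converged_ = false;
  bool is_doomed_ = false;
};
```

A test can then configure the whole scenario up front, for example "converge on the second update, then fail the next options request", without touching the controller while the main loop runs.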

A library that calls a Nighthawk Service gRPC stub with the given `CommandLineOptions`, translating all possible gRPC errors into `absl::StatusOr`. Will need to be updated when Nighthawk Service starts returning more than one message over the stream. Part 4 of splitting PR #483.

A library that combines the latest Nighthawk Service response, metric and threshold spec configuration, and results from MetricsPlugins to produce metric scores. Part 5 of splitting PR #483. Signed-off-by: eric846 <[email protected]>

- `SetSessionSpecDefaults()`: Returns a copy of the `AdaptiveLoadSessionSpec` with default values added.
- `CheckSessionSpec()`: Checks an `AdaptiveLoadSessionSpec` for illegal values, invalid plugin references, and invalid plugin configs.

Part 6 of splitting PR #483.

Signed-off-by: eric846 <[email protected]>

- Add missing `const` to helper methods.
- Add a missing build dep.
- Add a missing `#pragma once`.

Part 7 of splitting PR #483.

Signed-off-by: eric846 <[email protected]>

- MockNighthawkServiceClient
- MockMetricsEvaluator
- MockAdaptiveLoadSessionSpecProtoHelper

Part 8 of splitting PR #483.

Signed-off-by: eric846 <[email protected]>

This is the main function of the Adaptive Load Controller library:
- Check the input proto for errors
- Apply default input values
- Adjusting Stage loop:
  - Get the latest dynamically generated CommandLineOptions from the StepController
  - Run a short benchmark with the Nighthawk Service
  - Obtain values from MetricsPlugins
  - Score metrics using ScoringFunction plugins
  - Report scores back to the StepController, which recalculates the load for the next iteration
  - Check for convergence deadline exceeded
  - Check for StepController convergence
  - Check for StepController doom
- Testing Stage: Run one long benchmark on the Nighthawk Service at the converged load

Fixes #485. Part 9 of splitting PR #483.

Signed-off-by: eric846 <[email protected]>

eric846 added 19 commits June 1, 2020 17:23

Merge pull request #5 from envoyproxy/master

8ea442d

update from master

Merge pull request #6 from envoyproxy/master

5ac755a

merge from upstream

Merge pull request #7 from envoyproxy/master

b8c25a5

merge from upstream

Merge pull request #11 from envoyproxy/master

9907bf9

merge from upstream

Merge remote-tracking branch 'upstream/master' into master

651e699

Signed-off-by: eric846 <[email protected]>

Merge remote-tracking branch 'upstream/master' into master

3df3d13

Signed-off-by: eric846 <[email protected]>

Merge remote-tracking branch 'upstream/master'

28e5056

Signed-off-by: eric846 <[email protected]>

Merge remote-tracking branch 'upstream/master'

8ae837c

Signed-off-by: eric846 <[email protected]>

Merge remote-tracking branch 'upstream/master'

3522ba9

Signed-off-by: eric846 <[email protected]>

Merge remote-tracking branch 'upstream/master'

3500924

Signed-off-by: eric846 <[email protected]>

Merge remote-tracking branch 'upstream/master'

6a83fbc

Signed-off-by: eric846 <[email protected]>

Merge remote-tracking branch 'upstream/master'

9b65e56

Signed-off-by: eric846 <[email protected]>

Merge remote-tracking branch 'upstream/master'

f0289ac

Signed-off-by: eric846 <[email protected]>

Merge remote-tracking branch 'upstream/master'

62e47b5

Signed-off-by: eric846 <[email protected]>

adaptive load main loop initial commit

ce68103

Signed-off-by: eric846 <[email protected]>

add assertions for conditions impossible after input validation

d870cf2

Signed-off-by: eric846 <[email protected]>

edit comments

9d7769b

Signed-off-by: eric846 <[email protected]>

fix format

3b102c1

Signed-off-by: eric846 <[email protected]>

fix typo

28ae670

Signed-off-by: eric846 <[email protected]>

eric846 added the waiting-for-review label Aug 24, 2020

mum4k requested review from oschaaf and dubious90 August 24, 2020 22:35

extract some helper functions, support input setting failure in FakeStepController, improve coverage, fix clang-tidy

ecf6cfe

Signed-off-by: eric846 <[email protected]>

eric846 commented Aug 25, 2020

fix comments

2fc3d7a

Signed-off-by: eric846 <[email protected]>

oschaaf reviewed Aug 25, 2020

dubious90 suggested changes Aug 25, 2020

oschaaf reviewed Aug 25, 2020

fix open loop setting, update API to use StatusOr, catch more gRPC errors, shorten function, rearrange main loop

d5dfbcc

Signed-off-by: eric846 <[email protected]>

eric846 added the waiting-for-review label Aug 26, 2020

eric846 requested review from oschaaf and dubious90 August 26, 2020 22:48

add pre-countdown fixed rps value to failed input value setter FakeStepController helper

76986d3

Signed-off-by: eric846 <[email protected]>

mum4k added waiting-for-changes and removed waiting-for-review labels Aug 26, 2020

eric846 mentioned this pull request Aug 27, 2020

Adaptive Load migration to absl::StatusOr #490

Merged

eric846 closed this Aug 27, 2020

This was referenced Aug 27, 2020

Adaptive Load fake time source #491

Merged

Adaptive Load FakeStepController doom update #492

Merged

Adaptive Load library for calling Nighthawk Service #493

Merged

eric846 mentioned this pull request Aug 27, 2020

Adaptive Load metrics evaluator library #495

Merged

eric846 mentioned this pull request Sep 3, 2020

Adaptive load session spec proto helpers #508

Merged

This was referenced Sep 11, 2020

Adaptive load controller main loop #526

Closed

Adaptive load helper interface cleanup and dep/header cleanup #527

Merged

eric846 mentioned this pull request Sep 12, 2020

Adaptive load helper mocks #529

Merged

dubious90 pushed a commit that referenced this pull request Sep 14, 2020

Adaptive load helper mocks (#529)

e744a10

- MockNighthawkServiceClient
- MockMetricsEvaluator
- MockAdaptiveLoadSessionSpecProtoHelper

Part 8 of splitting PR #483.

Signed-off-by: eric846 <[email protected]>

eric846 mentioned this pull request Sep 14, 2020

Adaptive Load Controller main loop #535

Merged
