Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Adaptive Load main loop #483

Closed
wants to merge 32 commits into from

Conversation

eric846
Copy link
Contributor

@eric846 eric846 commented Aug 24, 2020

This is the main function of the Adaptive Load Controller library:

  • Check the input proto for errors and apply default input values.
  • For the Adjusting Stage: One big while loop:
    • Check for convergence deadline exceeded.
    • Get the latest dynamically generated CommandLineOptions from the StepController.
    • Run a short benchmark with the Nighthawk Service.
    • Obtain metric values for this benchmark from MetricsPlugins.
    • Score the metrics using ScoringFunction plugins.
    • Report scores back to the StepController, which recalculates the load for the next iteration.
    • Check for StepController convergence.
    • Check for StepController doom.
  • For the Testing Stage: Run one long benchmark on the Nighthawk Service at the converged load.

The unit test has an example of mocking a gRPC stub, which is not easy.

eric846 added 19 commits June 1, 2020 17:23
Signed-off-by: eric846 <[email protected]>
Signed-off-by: eric846 <[email protected]>
Signed-off-by: eric846 <[email protected]>
@eric846 eric846 added the waiting-for-review A PR waiting for a review. label Aug 24, 2020
@mum4k mum4k requested review from oschaaf and dubious90 August 24, 2020 22:35
@mum4k
Copy link
Collaborator

mum4k commented Aug 24, 2020

@dubious90 please review and assign back to me once done.

…tepController, improve coverage, fix clang-tidy

Signed-off-by: eric846 <[email protected]>
@@ -8,7 +8,7 @@ TO_CHECK="${2:-$PWD}"
bazel run @envoy//tools:code_format/check_format.py -- \
--skip_envoy_build_rule_check --namespace_check Nighthawk \
--build_fixer_check_excluded_paths=$(realpath ".") \
--include_dir_order envoy,nighthawk,external/source/envoy,external,api,common,source,exe,server,client,test_common,test \
--include_dir_order envoy,nighthawk,external/source/envoy,external,api,common,source,exe,server,client,test_common,grpcpp,test \
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Without separating grpcpp into its own section, the formatter arranges the include files in an order that is then detected as invalid.

Signed-off-by: eric846 <[email protected]>
Copy link
Member

@oschaaf oschaaf left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Flushing out my first comments based on a first read of this PR; I feel this looks great overall

}
::grpc::Status status = stream->Finish();
if (!status.ok()) {
response.mutable_error_detail()->set_code(status.error_code());
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shouldn't the ExecutionResponse already have error information here? Or is this information more valuable?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually, I wonder if returning an absl::StatusOr<nighthawk.client.Output> would be clearer here for what you want.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done -- we are now all-in on StatusOr!

absl::flat_hash_map<const nighthawk::adaptive_load::MetricSpec*,
const nighthawk::adaptive_load::ThresholdSpec*>
threshold_spec_from_metric_spec;
for (const MetricSpecWithThreshold& metric_threshold : spec.metric_thresholds()) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There are a lot of variables here that are a little difficult to keep track of in readability. Maybe comments explaining what maps are being built (functionally, not describing the elements but what their purposes are)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Commented the complicated map and also extracted another function.

* @return BenchmarkResult Proto containing metric scores for this Nighthawk Service benchmark
* session, or an error propagated from the Nighthawk Service or MetricsPlugins.
*/
BenchmarkResult AnalyzeNighthawkBenchmark(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This function is very long, which adds to it being difficult to read. Is there any way to break it up?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Extracted EvaluateMetric function, should make it a bit simpler.


} // namespace

AdaptiveLoadSessionOutput PerformAdaptiveLoadSession(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I realize that this API was already partially reviewed, but that was before we brought in absl::Status. I question whether much of this would be easier to follow (and potentially easier to use) if you used StatusOr rather than having a proto that contains status fields. Open to pushback here.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done, adopted StatusOr throughout, including the API.

* Fake time source that ticks 1 second on every query, starting from the Unix epoch. Supports only
* monotonicTime().
*/
class FakeIncrementingMonotonicTimeSource : public Envoy::TimeSource {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I thought envoy already supported mock timers. Is there a reason why we need to implement a fake?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see a mock TimeSource being used, and there's TestTimeSystem, but here we need something where:

  • time advances between calls
  • no intervention is needed between calls

because once we unleash the main loop we can't pause it to manipulate the clock.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So personally, I love the simplicity of this. It might be possible to use simulated time, but then we may need another thread to advance time or some other relatively complex construct. Also, as I think monotonic time promises us to advance each time we call it, this seems quite an accurate modelling of it. So personally, when weighing the potential duplication introduced against the simplicity we win, I'm rooting for the way this is right now.

Copy link
Member

@oschaaf oschaaf left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Another round of feedback; I've read most of the PR now, but still need to go through most of the tests

…rors, shorten function, rearrange main loop

Signed-off-by: eric846 <[email protected]>
@eric846 eric846 added the waiting-for-review A PR waiting for a review. label Aug 26, 2020
@eric846 eric846 requested review from oschaaf and dubious90 August 26, 2020 22:48
@eric846
Copy link
Contributor Author

eric846 commented Aug 26, 2020

Added several more commits with nontrivial content, starting from "delete proto error status fields." I had to introduce new functionality to the FakeStepController to be able to cover the last bit of the main function.

@mum4k
Copy link
Collaborator

mum4k commented Aug 26, 2020

I am wondering what we can do about the fact that this PR is both fairly long and hard to navigate / review. The main issue may be that there are multiple effort tracks combined in this PR which makes it hard to mentally map the code changes to individual tracks. As it stands, I don't feel confident enough that I won't miss anything important while reviewing this.

The best solution might be to shelve this PR, while keeping it as a reference for the reviewers who already posted comments in here. We can then try to split the individual effort tracks into multiple separate PRs, each of them ideally focused on just one thing. This can also allow us to better track the reasoning for individual changes in the description of each PR.

I may not know enough to prescribe the split exactly, but here is a rough list of ideas. Please feel free to adapt this to reality as you see fit.

  • a separate PR performing the API changes related to StatusOr.
  • a separate PR fixing any proto errors.
  • a separate PR implementing the fake time source.
  • a separate PR performing changes to the fake step controller.

Next we should address how long the adaptive load controller implementation is itself. Skimming over the helper functions in the anonymous namespace - some of them look like they contain enough logic to warrant their own tests. I will leave this to your best judgment. Having private functions makes the code agile, but there is a balance between keeping everything private and having separate test coverage for large enough behaviors. Maybe we can find some such candidates and their move to standalone libraries with test coverage will both simplify the PRs and make it faster to find errors when their individual unit tests break. E.g maybe the function that evaluates metrics deserves its own library? This should leave us in a situation where the main library of the adaptive load controller only contains its main business logic.

Any such libraries can then be sent as separate PRs. We can then send one last PR to add the main logic of the adaptive load controller.

@mum4k mum4k added waiting-for-changes A PR waiting for comments to be resolved and changes to be applied. and removed waiting-for-review A PR waiting for a review. labels Aug 26, 2020
@eric846
Copy link
Contributor Author

eric846 commented Aug 27, 2020

Commencing this split now

@eric846 eric846 closed this Aug 27, 2020
dubious90 pushed a commit that referenced this pull request Aug 27, 2020
Migrates PerformAdaptiveLoadSession() to return absl::StatusOr.

Deletes status fields from result protos since they are no longer needed.

Temporarily deletes the simulated doom mechanism from FakeStepController because it relied on one of the deleted proto fields; it will be replaced by a new mechanism in a subsequent PR.

Changes the contract of the adaptive load session spec proto with respect to open_loop in nighthawk_traffic_template: We no longer return an error if the user sets a value for open_loop; instead we override open_loop to true if it was not specified (the opposite of the Nighthawk client default), while still allowing the user to explicitly set open_loop to either true or false.

Part 1 of splitting PR #483.
dubious90 pushed a commit that referenced this pull request Aug 27, 2020
A fake `Envoy::TimeSource` that ticks forward every time it is called, starting from 1970-01-01 00:00:00. Only `monotonicTime()` is implemented.

This simple autonomous behavior is needed because the adaptive load controller checks the clock many times automatically and can't be interrupted in the middle of the run to update the fake time.

Part 2 of splitting PR #483.
dubious90 pushed a commit that referenced this pull request Aug 29, 2020
Updates FakeStepController:

- Set doomed state when `FakeStepController::UpdateAndRecompute()` receives a `BenchmarkResult` with a negative metric score. Previously `BenchmarkResult` could contain an error status, and we used the error status to trigger simulated doom. Now errors do not get passed to  `StepController::UpdateAndRecompute()`, so we need a new mechanism.
- Adds the ability to simulate an `InputVariableSetter` failure in `GetCurrentCommandLineOptions()`. The artificial failure is specified in the config proto.
- Adds a countdown mechanism so that `GetCurrentCommandLineOptions()` returns successfully until `UpdateAndRecompute()` has been called a configured number of times, then starts returning the simulated input variable setter failure mentioned above. This simple autonomous behavior is needed to test error handling in the testing stage without triggering an early exit in the adjusting stage; the adaptive load controller can't be paused in the middle of the test to reconfigure the `FakeStepController` to change its behavior.

The behavior of `FakeStepController` after an `UpdateAndRecompute()` with a `BenchmarkResult` containing the given metric scores:
- Any positive score: converged
- All scores zero: neither converged nor doomed
- Any negative score: doomed

Part 3 of splitting PR #483.
dubious90 pushed a commit that referenced this pull request Sep 8, 2020
A library that calls a Nighthawk Service gRPC stub with the given `CommandLineOptions`, translating all possible gRPC errors into `absl::StatusOr`.

Will need to be updated when Nighthawk Service starts returning more than one message over the stream.

Part 4 of splitting PR #483.
abaptiste pushed a commit to abaptiste/nighthawk that referenced this pull request Sep 8, 2020
A library that calls a Nighthawk Service gRPC stub with the given `CommandLineOptions`, translating all possible gRPC errors into `absl::StatusOr`.

Will need to be updated when Nighthawk Service starts returning more than one message over the stream.

Part 4 of splitting PR envoyproxy#483.
dubious90 pushed a commit that referenced this pull request Sep 10, 2020
A library that combines the latest Nighthawk Service response, metric and threshold spec configuration, and results from MetricsPlugins to produce metric scores.

Part 5 of splitting PR #483.

Signed-off-by: eric846 <[email protected]>
dubious90 pushed a commit that referenced this pull request Sep 11, 2020
- `SetSessionSpecDefaults()`: Returns a copy of the `AdaptiveLoadSessionSpec` with default values added.
- `CheckSessionSpec()`: Checks an `AdaptiveLoadSessionSpec` for illegal values, invalid plugin references, and invalid plugin configs.

Part 6 of splitting PR #483.

Signed-off-by: eric846 <[email protected]>
dubious90 pushed a commit that referenced this pull request Sep 12, 2020
- Add missing `const` to helper methods.
- Add a missing build dep
- Add a missing `#pragma once `

Part 7 of splitting PR #483.

Signed-off-by: eric846 <[email protected]>
dubious90 pushed a commit that referenced this pull request Sep 14, 2020
- MockNighthawkServiceClient
- MockMetricsEvaluator
- MockAdaptiveLoadSessionSpecProtoHelper

Part 8 of splitting PR #483.

Signed-off-by: eric846 <[email protected]>
mum4k pushed a commit that referenced this pull request Sep 22, 2020
This is the main function of the Adaptive Load Controller library:

- Check the input proto for errors
- Apply default input values
- Adjusting Stage loop:
  - Get the latest dynamically generated CommandLineOptions from the StepController
  - Run a short benchmark with the Nighthawk Service
  - Obtain values from MetricsPlugins
  - Score metrics using ScoringFunction plugins
  - Report scores back to the StepController, which recalculates the load for the next iteration
  - Check for convergence deadline exceeded
  - Check for StepController convergence
  - Check for StepController doom
- Testing Stage: Run one long benchmark on the Nighthawk Service at the converged load

Fixes #485.

Part 9 of splitting PR #483.

Signed-off-by: eric846 <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
waiting-for-changes A PR waiting for comments to be resolved and changes to be applied.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants