-
Notifications
You must be signed in to change notification settings - Fork 83
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Adaptive Load main loop #483
Conversation
update from master
merge from upstream
merge from upstream
merge from upstream
Signed-off-by: eric846 <[email protected]>
Signed-off-by: eric846 <[email protected]>
Signed-off-by: eric846 <[email protected]>
Signed-off-by: eric846 <[email protected]>
Signed-off-by: eric846 <[email protected]>
Signed-off-by: eric846 <[email protected]>
Signed-off-by: eric846 <[email protected]>
Signed-off-by: eric846 <[email protected]>
Signed-off-by: eric846 <[email protected]>
Signed-off-by: eric846 <[email protected]>
Signed-off-by: eric846 <[email protected]>
Signed-off-by: eric846 <[email protected]>
Signed-off-by: eric846 <[email protected]>
Signed-off-by: eric846 <[email protected]>
Signed-off-by: eric846 <[email protected]>
@dubious90 please review and assign back to me once done. |
…tepController, improve coverage, fix clang-tidy Signed-off-by: eric846 <[email protected]>
@@ -8,7 +8,7 @@ TO_CHECK="${2:-$PWD}" | |||
bazel run @envoy//tools:code_format/check_format.py -- \ | |||
--skip_envoy_build_rule_check --namespace_check Nighthawk \ | |||
--build_fixer_check_excluded_paths=$(realpath ".") \ | |||
--include_dir_order envoy,nighthawk,external/source/envoy,external,api,common,source,exe,server,client,test_common,test \ | |||
--include_dir_order envoy,nighthawk,external/source/envoy,external,api,common,source,exe,server,client,test_common,grpcpp,test \ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Without separating grpcpp into its own section, the formatter arranges the include files in an order that is then detected as invalid.
Signed-off-by: eric846 <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Flushing out my first comments based on a first read of this PR; I feel this looks great overall
} | ||
::grpc::Status status = stream->Finish(); | ||
if (!status.ok()) { | ||
response.mutable_error_detail()->set_code(status.error_code()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Shouldn't the ExecutionResponse already have error information here? Or is this information more valuable?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actually, I wonder if returning an absl::StatusOr<nighthawk.client.Output> would be clearer here for what you want.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done -- we are now all-in on StatusOr!
absl::flat_hash_map<const nighthawk::adaptive_load::MetricSpec*, | ||
const nighthawk::adaptive_load::ThresholdSpec*> | ||
threshold_spec_from_metric_spec; | ||
for (const MetricSpecWithThreshold& metric_threshold : spec.metric_thresholds()) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There are a lot of variables here that are a little difficult to keep track of in readability. Maybe comments explaining what maps are being built (functionally, not describing the elements but what their purposes are)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Commented the complicated map and also extracted another function.
* @return BenchmarkResult Proto containing metric scores for this Nighthawk Service benchmark | ||
* session, or an error propagated from the Nighthawk Service or MetricsPlugins. | ||
*/ | ||
BenchmarkResult AnalyzeNighthawkBenchmark( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This function is very long, which adds to it being difficult to read. Is there any way to break it up?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Extracted EvaluateMetric function, should make it a bit simpler.
|
||
} // namespace | ||
|
||
AdaptiveLoadSessionOutput PerformAdaptiveLoadSession( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I realize that this API was already partially reviewed, but that was before we brought in absl::Status. I question whether much of this would be easier to follow (and potentially easier to use) if you used StatusOr rather than having a proto that contains status fields. Open to pushback here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done, adopted StatusOr throughout, including the API.
* Fake time source that ticks 1 second on every query, starting from the Unix epoch. Supports only | ||
* monotonicTime(). | ||
*/ | ||
class FakeIncrementingMonotonicTimeSource : public Envoy::TimeSource { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I thought envoy already supported mock timers. Is there a reason why we need to implement a fake?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see a mock TimeSource being used, and there's TestTimeSystem, but here we need something where:
- time advances between calls
- no intervention is needed between calls
because once we unleash the main loop we can't pause it to manipulate the clock.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So personally, I love the simplicity of this. It might be possible to use simulated time, but then we may need another thread to advance time or some other relatively complex construct. Also, as I think monotonic time promises us to advance each time we call it, this seems quite an accurate modelling of it. So personally, when weighing the potential duplication introduced against the simplicity we win, I'm rooting for the way this is right now.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Another round of feedback; I've read most of the PR now, but still need to go through most of the tests
…rors, shorten function, rearrange main loop Signed-off-by: eric846 <[email protected]>
Added several more commits with nontrivial content, starting from "delete proto error status fields." I had to introduce new functionality to the FakeStepController to be able to cover the last bit of the main function. |
…epController helper Signed-off-by: eric846 <[email protected]>
I am wondering what we can do about the fact that this PR is both fairly long and hard to navigate / review. The main issue may be that there are multiple effort tracks combined in this PR which makes it hard to mentally map the code changes to individual tracks. As it stands, I don't feel confident enough that I won't miss anything important while reviewing this. The best solution might be to shelve this PR, while keeping it as a reference for the reviewers who already posted comments in here. We can then try to split the individual effort tracks into multiple separate PRs, each of them ideally focused on just one thing. This can also allow us to better track the reasoning for individual changes in the description of each PR. I may not know enough to prescribe the split exactly, but here is a rough list of ideas. Please feel free to adapt this to reality as you see fit.
Next we should address how long the adaptive load controller implementation is itself. Skimming over the helper functions in the anonymous namespace - some of them look like they contain enough logic to warrant their own tests. I will leave this to your best judgment. Having private functions makes the code agile, but there is a balance between keeping everything private and having separate test coverage for large enough behaviors. Maybe we can find some such candidates and their move to standalone libraries with test coverage will both simplify the PRs and make it faster to find errors when their individual unit tests break. E.g maybe the function that evaluates metrics deserves its own library? This should leave us in a situation where the main library of the adaptive load controller only contains its main business logic. Any such libraries can then be sent as separate PRs. We can then send one last PR to add the main logic of the adaptive load controller. |
Commencing this split now |
Migrates PerformAdaptiveLoadSession() to return absl::StatusOr. Deletes status fields from result protos since they are no longer needed. Temporarily deletes the simulated doom mechanism from FakeStepController because it relied on one of the deleted proto fields; it will be replaced by a new mechanism in a subsequent PR. Changes the contract of the adaptive load session spec proto with respect to open_loop in nighthawk_traffic_template: We no longer return an error if the user sets a value for open_loop; instead we override open_loop to true if it was not specified (the opposite of the Nighthawk client default), while still allowing the user to explicitly set open_loop to either true or false. Part 1 of splitting PR #483.
A fake `Envoy::TimeSource` that ticks forward every time it is called, starting from 1970-01-01 00:00:00. Only `monotonicTime()` is implemented. This simple autonomous behavior is needed because the adaptive load controller checks the clock many times automatically and can't be interrupted in the middle of the run to update the fake time. Part 2 of splitting PR #483.
Updates FakeStepController: - Set doomed state when `FakeStepController::UpdateAndRecompute()` receives a `BenchmarkResult` with a negative metric score. Previously `BenchmarkResult` could contain an error status, and we used the error status to trigger simulated doom. Now errors do not get passed to `StepController::UpdateAndRecompute()`, so we need a new mechanism. - Adds the ability to simulate an `InputVariableSetter` failure in `GetCurrentCommandLineOptions()`. The artificial failure is specified in the config proto. - Adds a countdown mechanism so that `GetCurrentCommandLineOptions()` returns successfully until `UpdateAndRecompute()` has been called a configured number of times, then starts returning the simulated input variable setter failure mentioned above. This simple autonomous behavior is needed to test error handling in the testing stage without triggering an early exit in the adjusting stage; the adaptive load controller can't be paused in the middle of the test to reconfigure the `FakeStepController` to change its behavior. The behavior of `FakeStepController` after an `UpdateAndRecompute()` with a `BenchmarkResult` containing the given metric scores: - Any positive score: converged - All scores zero: neither converged nor doomed - Any negative score: doomed Part 3 of splitting PR #483.
A library that calls a Nighthawk Service gRPC stub with the given `CommandLineOptions`, translating all possible gRPC errors into `absl::StatusOr`. Will need to be updated when Nighthawk Service starts returning more than one message over the stream. Part 4 of splitting PR #483.
A library that calls a Nighthawk Service gRPC stub with the given `CommandLineOptions`, translating all possible gRPC errors into `absl::StatusOr`. Will need to be updated when Nighthawk Service starts returning more than one message over the stream. Part 4 of splitting PR envoyproxy#483.
A library that combines the latest Nighthawk Service response, metric and threshold spec configuration, and results from MetricsPlugins to produce metric scores. Part 5 of splitting PR #483. Signed-off-by: eric846 <[email protected]>
- `SetSessionSpecDefaults()`: Returns a copy of the `AdaptiveLoadSessionSpec` with default values added. - `CheckSessionSpec()`: Checks an `AdaptiveLoadSessionSpec` for illegal values, invalid plugin references, and invalid plugin configs. Part 6 of splitting PR #483. Signed-off-by: eric846 <[email protected]>
- Add missing `const` to helper methods. - Add a missing build dep - Add a missing `#pragma once ` Part 7 of splitting PR #483. Signed-off-by: eric846 <[email protected]>
- MockNighthawkServiceClient - MockMetricsEvaluator - MockAdaptiveLoadSessionSpecProtoHelper Part 8 of splitting PR #483. Signed-off-by: eric846 <[email protected]>
This is the main function of the Adaptive Load Controller library: - Check the input proto for errors - Apply default input values - Adjusting Stage loop: - Get the latest dynamically generated CommandLineOptions from the StepController - Run a short benchmark with the Nighthawk Service - Obtain values from MetricsPlugins - Score metrics using ScoringFunction plugins - Report scores back to the StepController, which recalculates the load for the next iteration - Check for convergence deadline exceeded - Check for StepController convergence - Check for StepController doom - Testing Stage: Run one long benchmark on the Nighthawk Service at the converged load Fixes #485. Part 9 of splitting PR #483. Signed-off-by: eric846 <[email protected]>
This is the main function of the Adaptive Load Controller library:
while
loop:The unit test has an example of mocking a gRPC stub, which is not easy.