Adaptive Load FakeStepController doom update #492

@eric846 eric846 commented Aug 27, 2020

Updates FakeStepController:

  • Sets the doomed state when FakeStepController::UpdateAndRecompute() receives a BenchmarkResult with a negative metric score. Previously, a BenchmarkResult could contain an error status, and we used that error status to trigger simulated doom. Errors are no longer passed to StepController::UpdateAndRecompute(), so we need a new mechanism.
  • Adds the ability to simulate an InputVariableSetter failure in GetCurrentCommandLineOptions(). The artificial failure is specified in the config proto.
  • Adds a countdown mechanism: GetCurrentCommandLineOptions() returns successfully until UpdateAndRecompute() has been called a configured number of times, then starts returning the simulated input variable setter failure mentioned above. This simple autonomous behavior is needed to test error handling in the testing stage without triggering an early exit in the adjusting stage, because the adaptive load controller can't be paused in the middle of the test to reconfigure the FakeStepController.

The state of FakeStepController after a call to UpdateAndRecompute(), depending on the metric scores in the BenchmarkResult:

  • Any positive score: converged
  • All scores zero: neither converged nor doomed
  • Any negative score: doomed
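
The score rules above can be sketched as follows. This is a minimal illustration, not the code from this PR: `FakeState` and `ClassifyScores` are hypothetical names, and plain doubles stand in for the metric scores carried by a BenchmarkResult. The description doesn't say what happens when positive and negative scores are mixed; this sketch simply sets both flags in that case, which is an assumption.

```cpp
#include <cassert>
#include <vector>

// Hypothetical state holder for illustration; not part of Nighthawk's API.
struct FakeState {
  bool is_converged = false;
  bool is_doomed = false;
};

// Applies the rules from the description to a list of metric scores:
//   any positive score -> converged
//   any negative score -> doomed
//   all scores zero    -> neither
// Assumption: with mixed signs, both flags end up set.
FakeState ClassifyScores(const std::vector<double>& scores) {
  FakeState state;
  for (const double score : scores) {
    if (score < 0.0) {
      state.is_doomed = true;
    } else if (score > 0.0) {
      state.is_converged = true;
    }
  }
  return state;
}
```

A test can then steer the fake purely through the scores it feeds in, for example `ClassifyScores({0.0, 0.0})` to keep the adjusting stage running.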

Part 3 of splitting PR #483.

eric846 added 30 commits June 1, 2020 17:23
Signed-off-by: eric846 <[email protected]>
…ent.Output turns out not to include the status

Signed-off-by: eric846 <[email protected]>
…plugin names, log thresholds only once per session

Signed-off-by: eric846 <[email protected]>
Signed-off-by: eric846 <[email protected]>
Contributor

@dubious90 dubious90 left a comment

Mostly one big question that I want more clarity on.

  nighthawk::client::CommandLineOptions command_line_options_template)
- : is_converged_{false}, is_doomed_{false}, fixed_rps_value_{config.fixed_rps_value()},
+ : input_setting_failure_countdown_{config.artificial_input_setting_failure_countdown()},
+ config_{std::move(config)}, is_converged_{false}, is_doomed_{false},
Contributor

not a complaint, but what is std::move accomplishing here, given that this isn't a smart pointer?

Contributor Author

Funny story, this was actually required by clang-tidy. If the function implementation makes a copy of a const reference parameter, clang-tidy tells you to change the parameter to pass-by-value so it's obvious to the caller that a copy is happening. The argument gets copied at the call site into the by-value parameter. Then, in the constructor, we can move that parameter into the field without incurring the cost of a second copy.
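
A small sketch of the pass-by-value-then-move pattern described here. `TrackedConfig` and `Controller` are hypothetical stand-ins, not the classes from this PR; the config type counts how many copies led to the stored object so the effect can be observed.

```cpp
#include <cassert>
#include <utility>

// Hypothetical stand-in for a config proto. It records how many copy
// constructions are in the history of this particular object.
struct TrackedConfig {
  int fixed_rps_value = 0;
  int copies_made = 0;

  TrackedConfig() = default;
  // Copying increments the counter.
  TrackedConfig(const TrackedConfig& other)
      : fixed_rps_value(other.fixed_rps_value), copies_made(other.copies_made + 1) {}
  // Moving preserves the counter.
  TrackedConfig(TrackedConfig&& other) noexcept
      : fixed_rps_value(other.fixed_rps_value), copies_made(other.copies_made) {}
};

// Hypothetical class that stores its own copy of the config.
class Controller {
public:
  // Pass by value: the caller's argument is copied (or moved, for an rvalue)
  // into `config`, which is then moved into the member with no further copy.
  explicit Controller(TrackedConfig config) : config_(std::move(config)) {}

  int copies_made() const { return config_.copies_made; }

private:
  TrackedConfig config_;
};
```

Constructing `Controller` from an lvalue costs exactly one copy, and constructing it from an rvalue costs none. With a `const TrackedConfig&` parameter, the copy into the member would happen in both cases.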

int32 fixed_rps_value = 1;
// Artificial error that the plugin factory should return during validation. Optional.
google.rpc.Status artificial_validation_failure = 2;
// Artificial error that should be returned from GetCurrentCommandLineOptions(). Optional.
Contributor

Can we enhance this comment to indicate the relation to the countdown?

Contributor Author

Done

int32 fixed_rps_value = 1;
// Artificial error that the plugin factory should return during validation. Optional.
google.rpc.Status artificial_validation_failure = 2;
// Artificial error that should be returned from GetCurrentCommandLineOptions(). Optional.
google.rpc.Status artificial_input_setting_failure = 3;
Contributor

I think I'm having trouble understanding the full use case here. If this is 3, then we are going to succeed 3 times, then fail on the 4th attempt. Makes sense, but I'm not sure I understand why that's useful. What is the test that you're supporting by creating it in this way?

Contributor Author

GetCurrentCommandLineOptions() can return an error status that should be handled cleanly by the main controller loop. The controller calls GetCurrentCommandLineOptions() repeatedly during the adjusting stage until the step controller says it has converged, and then in the testing stage we call GetCurrentCommandLineOptions() one last time, reusing the last converged value.

To test handling of these errors, we need the FakeStepController to return successfully from GetCurrentCommandLineOptions() during the adjusting stage, but then start returning errors just in time for the testing stage. We don't have any way to update the FakeStepController during the run, so it has to be programmed up front to behave differently at different times.

An alternative would be for magic values in UpdateAndRecompute() to trigger the GetCurrentCommandLineOptions() error behavior. We already use magic values to control convergence and doom, but there's only so much information we can encode in metric score doubles before it gets out of hand. Currently the UpdateAndRecompute() behavior is: all scores zero = neither converged nor doomed, any positive score = converged, any negative score = doomed.

This trick wouldn't be necessary if the step controller were aware of what stage it was operating in.
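
The countdown described in this reply can be sketched roughly as below. This is a simplified illustration under stated assumptions, not the implementation in this PR: `CountdownFakeStepController` and `OptionsResult` are hypothetical names, a plain struct replaces the status and proto types, and an empty failure message is assumed to mean that no artificial failure is configured.

```cpp
#include <cassert>
#include <string>
#include <utility>

// Hypothetical result type standing in for a status-or-options return value.
struct OptionsResult {
  bool ok;
  int requests_per_second;  // Meaningful only when ok is true.
  std::string error;        // Non-empty only when ok is false.
};

// Hypothetical, simplified fake illustrating the countdown idea.
class CountdownFakeStepController {
public:
  CountdownFakeStepController(int fixed_rps_value, int failure_countdown,
                              std::string failure_message)
      : fixed_rps_value_(fixed_rps_value), failure_countdown_(failure_countdown),
        failure_message_(std::move(failure_message)) {}

  // Succeeds until the countdown has reached zero, then returns the
  // configured artificial failure (if any) on every call.
  OptionsResult GetCurrentCommandLineOptions() const {
    if (failure_countdown_ <= 0 && !failure_message_.empty()) {
      return {false, 0, failure_message_};
    }
    return {true, fixed_rps_value_, ""};
  }

  // Each update moves the fake one step closer to the failure.
  void UpdateAndRecompute() {
    if (failure_countdown_ > 0) {
      --failure_countdown_;
    }
  }

private:
  const int fixed_rps_value_;
  int failure_countdown_;
  const std::string failure_message_;
};
```

A test would pick the countdown to match the number of UpdateAndRecompute() calls it expects in the adjusting stage, so that the first failing GetCurrentCommandLineOptions() call lands in the testing stage.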

@dubious90 dubious90 assigned mum4k and unassigned dubious90 Aug 27, 2020
@dubious90
Contributor

@mum4k Can you give this a review and assign back to me when done?

@mum4k
Collaborator

mum4k commented Aug 27, 2020

@eric846 please address / respond to the comments from @dubious90 as that can help me by providing additional context for the review.

@mum4k mum4k requested a review from oschaaf August 27, 2020 22:56
@mum4k mum4k added the waiting-for-changes label (a PR waiting for comments to be resolved and changes to be applied) and removed the waiting-for-review label (a PR waiting for a review) Aug 27, 2020
@eric846 eric846 added the waiting-for-review label (a PR waiting for a review) and removed the waiting-for-changes label (a PR waiting for comments to be resolved and changes to be applied) Aug 28, 2020
oschaaf previously approved these changes Aug 28, 2020
Member

@oschaaf oschaaf left a comment

Just a const nit, LGTM otherwise. Left a question that popped up over at #466 (comment) as that's out of scope here.

@oschaaf oschaaf added the waiting-for-changes label (a PR waiting for comments to be resolved and changes to be applied) and removed the waiting-for-review label (a PR waiting for a review) Aug 28, 2020
mum4k previously approved these changes Aug 28, 2020
Collaborator

@mum4k mum4k left a comment

Looks good with just a few nits.

*
* @param fixed_rps_value Value for RPS to set in the FakeStepControllerConfig proto until the
* countdown reaches zero.
* @param artificial_input_setting_failure An error status.
Collaborator

Can we expand this comment to explain what the error status means, i.e. what it is used for?

Contributor Author

Done

@@ -22,7 +22,7 @@ class FakeStepController : public StepController {
* @param config FakeStepControllerConfig proto for setting the fixed RPS value.
Collaborator

(pre-existing, optional) We should probably mention in the class comment that this class isn't thread-safe. At least I am assuming it isn't meant to be.

Contributor Author

Done

@mum4k mum4k assigned dubious90 and unassigned mum4k Aug 28, 2020
@eric846 eric846 dismissed stale reviews from mum4k and oschaaf via 049baed August 28, 2020 17:13
@eric846 eric846 added the waiting-for-review label (a PR waiting for a review) and removed the waiting-for-changes label (a PR waiting for comments to be resolved and changes to be applied) Aug 28, 2020
Member

@oschaaf oschaaf left a comment

Thanks for resolving the const nit. LGTM for my part of the review.

@dubious90 dubious90 merged commit 886702f into envoyproxy:master Aug 29, 2020