Make Habitat supervisor startup behavior with event stream reporting to Automate configurable #6740

Closed
ericcalabretta opened this issue Jul 17, 2019 · 2 comments
Assignees
Labels
Type: Feature Issues that describe a new desired feature

Comments

@ericcalabretta
Contributor

Currently, if the Habitat Event Stream flags are passed to the Supervisor, it requires Automate to be reachable and available in order to successfully start or load services.

Automate could be unavailable due to upgrades, network outages, or other issues. In these situations new Supervisors would not be able to start, and any existing Supervisors that rebooted or restarted would not be able to load or manage services. This means a simple Automate outage could impact production workloads.

I recommend we make this behavior configurable so a user can choose whether Habitat should fail to start or start successfully when Automate is unavailable.

The user could indicate their preference by an environment variable like this:

HAB_REQUIRE_AUTOMATE=false

If it is set to false, the Habitat Supervisor should always start regardless of Automate's availability. This would allow services managed by Habitat to function even if a problem exists with Automate at the moment.
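To make the proposal concrete, here is a minimal sketch of the decision logic in Python. It is not Habitat code: `require_automate`, `start_supervisor`, and the `connect` callback are hypothetical names, and the default of `true` (keep today's blocking behavior when the variable is unset) is an assumption.

```python
import os


def require_automate(environ=os.environ):
    """Return False only when HAB_REQUIRE_AUTOMATE is set to a false-like value.

    Assumption: an unset variable keeps the current (blocking) behavior.
    """
    raw = environ.get("HAB_REQUIRE_AUTOMATE", "true")
    return raw.strip().lower() not in ("false", "0", "no")


def start_supervisor(connect, environ=os.environ):
    """Hypothetical startup path: `connect` stands in for the event stream connection."""
    try:
        connect()
        return "started (event stream connected)"
    except ConnectionError as err:
        if require_automate(environ):
            # Current behavior: an unreachable Automate aborts startup.
            raise
        # Proposed behavior: log the problem and keep managing services.
        print(f"warning: event stream unavailable, starting anyway: {err}")
        return "started (event stream disconnected)"
```

With `HAB_REQUIRE_AUTOMATE=false`, a failed connection is reported as a warning and startup continues; with the variable unset or set to `true`, the connection error still propagates.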

This would also allow users to use Habitat and its upgrade capabilities to ship fixes for Automate misconfigurations. For example, if this is combined with the Effortless Infra pattern, a user may need to ship a cookbook change via Habitat to remedy the problem. If the Habitat Supervisor doesn't start, that may not be possible.

@christophermaier
Contributor

One subtlety is that if a Supervisor is allowed to start in event stream mode, but without being able to connect to Automate, how do we (and should we?) differentiate between the following cases:

  • User provided incorrect connection parameters
  • User provided correct parameters, but Automate is unavailable for some reason

@OkJaybird

OkJaybird commented Jul 17, 2019

I know that our team would prefer something more like errors in the logs that indicate connection failures. Our big criterion is this: because we are going the Effortless design route, the hab supervisor is our hand into managing the system. So the sup and our services have to be able to come up, even if reporting isn't working properly. If the supervisor is up, we can ship updates and address the issue, but if not we could find ourselves dead in the water. Basically, reporting is important, but it's secondary to being able to get the system up and running.

I think a different error logged based on the kind of issue that occurred would work well for us - i.e. authentication issue with Automate, Automate server wasn't reachable at all, etc. But that's just for ease of human diagnosis. From an automation standpoint, it probably matters less... down is down.
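As a rough illustration of "a different error logged based on the kind of issue", the sketch below maps a connection failure to a distinct category and message. It is only an assumption about how such a classification could look: `AuthenticationError` and `describe_failure` are hypothetical names, and the Python exception types stand in for whatever errors the real event stream client would surface.

```python
import socket


class AuthenticationError(Exception):
    """Hypothetical error for a token that Automate rejects."""


def describe_failure(err):
    """Return a (category, log message) pair for a failed event stream connection."""
    if isinstance(err, AuthenticationError):
        return ("auth", "Automate rejected the supplied credentials")
    if isinstance(err, socket.gaierror):
        # The host name did not resolve: most likely incorrect connection parameters.
        return ("bad-address", "could not resolve the Automate host; check the connection parameters")
    if isinstance(err, ConnectionRefusedError):
        return ("unreachable", "Automate host refused the connection; the service may be down")
    if isinstance(err, TimeoutError):
        return ("unreachable", "timed out connecting to Automate")
    return ("unknown", f"event stream connection failed: {err}")
```

A name resolution failure hints at incorrect parameters, while a refused or timed-out connection hints at an outage. This is a heuristic only: a mistyped port on a valid host would still look like an outage.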

If there is a strong desire to keep a blocking startup, maybe this could be flag-driven, so the consumer can pick the behavior needed.

@davidMcneil davidMcneil self-assigned this Aug 9, 2019
@dmccown dmccown closed this as completed Sep 16, 2019
@christophermaier christophermaier added Type: Feature Issues that describe a new desired feature and removed C-feature labels Jul 24, 2020

5 participants