Make Habitat supervisor startup behavior with event stream reporting to Automate configurable #6740

ericcalabretta · 2019-07-17T20:11:33Z

Currently, if the Habitat event stream flags are passed to the Supervisor, Automate must be reachable and available for the Supervisor to start or load services.

Automate could be unavailable due to upgrades, network outages, or other issues. In these situations, new Supervisors would not be able to start, and any existing Supervisors that rebooted or restarted would not be able to load or manage services. A simple Automate outage could therefore impact production workloads.

I recommend we make this behavior configurable, so a user can choose whether Habitat should fail to start or start successfully when Automate is unavailable.

The user could indicate their preference by an environment variable like this:

HAB_REQUIRE_AUTOMATE=false

If this is set to false, the Habitat Supervisor should always start regardless of Automate's availability. This will allow services managed by Habitat to function even if a problem exists with Automate at the moment.
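A minimal sketch of the proposed behavior, in Python for brevity. `HAB_REQUIRE_AUTOMATE` is the variable suggested in this issue, not an existing Habitat setting, and `require_automate`, `start_supervisor`, and the `connect` callable are hypothetical names used only for illustration:

```python
import os

# Values treated as "false" for the proposed HAB_REQUIRE_AUTOMATE variable
# (assumption: the issue only shows the literal value "false").
FALSE_VALUES = {"0", "false", "no", "off"}


def require_automate(env=None):
    """Return True unless HAB_REQUIRE_AUTOMATE is explicitly set to a false value."""
    env = os.environ if env is None else env
    raw = env.get("HAB_REQUIRE_AUTOMATE", "true")
    return raw.strip().lower() not in FALSE_VALUES


def start_supervisor(connect, env=None):
    """Try to open the event stream; abort startup only if Automate is required.

    `connect` is a hypothetical callable that raises ConnectionError when
    Automate cannot be reached.
    """
    try:
        connect()
        return "started with event stream"
    except ConnectionError as err:
        if require_automate(env):
            # Current behavior: a failed connection blocks startup.
            raise
        # Proposed behavior: report the problem and keep going.
        print(f"warning: event stream unavailable ({err}); starting anyway")
        return "started without event stream"
```

Defaulting to "required" when the variable is unset keeps the current blocking startup, so only users who opt out see the new behavior.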

This will also allow users to use Habitat and its upgrade capabilities to ship fixes for Automate misconfigurations. For example, if this is combined with the Effortless Infra pattern, a user may need to ship a cookbook change via Habitat to remedy the problem. If the Habitat Supervisor doesn't start, that may not be possible.


christophermaier · 2019-07-17T20:37:11Z

One subtlety is that if a Supervisor is allowed to start in event stream mode, but without being able to connect to Automate, how do we (and should we?) differentiate between the following cases:

- User provided incorrect connection parameters
- User provided correct parameters, but Automate is unavailable for some reason

OkJaybird · 2019-07-17T20:47:30Z

I know that our team would prefer something more like errors in the logs that indicate connection failures. Our main criterion is that, because we are going the Effortless design route, the hab supervisor is our hand into managing the system. So the sup and our services have to be able to come up, even if reporting isn't working properly. If the supervisor is up, we can ship updates and address the issue, but if not, we could find ourselves dead in the water. Basically, reporting is important, but it's secondary to being able to get the system up and running.

I think a different error logged based on the kind of issue that occurred would work well for us, e.g. authentication issue with Automate, Automate server wasn't reachable at all, etc. But that's just for ease of human diagnosis. From an automation standpoint, it probably matters less... down is down.
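As a rough illustration of logging a different error per kind of failure, here is a Python sketch that sorts connection errors into coarse categories. The category names, the `classify_failure` and `describe_failure` helpers, and the use of `PermissionError` as a stand-in for a rejected Automate token are all assumptions for the example, not Habitat behavior. The split is also only a heuristic: a DNS failure, for instance, could come from a mistyped URL or from a real outage.

```python
import socket

# Hypothetical log messages, one per failure category.
MESSAGES = {
    "config": "could not resolve the Automate address; check the connection parameters",
    "auth": "Automate rejected the credentials; check the token",
    "unavailable": "Automate is unreachable; it may be down or blocked by the network",
    "unknown": "unexpected error while connecting to Automate",
}


def classify_failure(exc):
    """Map an exception to a coarse failure category (heuristic only)."""
    if isinstance(exc, socket.gaierror):
        # Name resolution failed: most often a wrong host name.
        return "config"
    if isinstance(exc, PermissionError):
        # Stand-in for an authentication failure reported by the server.
        return "auth"
    if isinstance(exc, (ConnectionRefusedError, ConnectionResetError,
                        TimeoutError, socket.timeout)):
        # The address looks valid, but nothing answered in time.
        return "unavailable"
    return "unknown"


def describe_failure(exc):
    """Build the log line a human would read when diagnosing the problem."""
    return f"event stream: {MESSAGES[classify_failure(exc)]}: {exc}"
```

For automation, all four categories can still be treated as "down"; the distinction only changes the message a human sees.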

If there is a strong desire to keep a blocking startup, maybe this could be flag-driven, so the consumer can pick the behavior needed.

ericcalabretta added the C-feature label Jul 17, 2019

davidMcneil self-assigned this Aug 9, 2019

davidMcneil mentioned this issue Aug 14, 2019

Update event stream #6853

Merged

dmccown closed this as completed Sep 16, 2019

christophermaier added the Type: Feature label and removed the C-feature label Jul 24, 2020
