Currently, if the Habitat Event Stream flags are passed to the Supervisor, Automate must be reachable and available for the Supervisor to start or load services.
Automate could be unavailable due to upgrades, network outages, or other issues. In these situations new Supervisors would not be able to start, and any existing Supervisors that went through a reboot or restart would not be able to load or manage services. A simple Automate outage could therefore impact production workloads.
I recommend we make this behavior configurable so a user can choose whether Habitat should fail to start or start successfully when Automate is unavailable.
The user could indicate their preference by an environment variable like this:
HAB_REQUIRE_AUTOMATE=false
If set to false, the Habitat Supervisor should always start regardless of Automate's availability. This will allow services managed by Habitat to function even if a problem exists with Automate at the moment.
This will also allow users to use Habitat and its upgrade capabilities to ship fixes that could resolve any Automate misconfigurations. For example, if this is combined with the Effortless Infra pattern, a user may need to ship a cookbook change via Habitat to remedy the problem. If the Habitat Supervisor doesn't start, that may not be possible.
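The proposed behavior can be sketched roughly as follows. This is only an illustration, not Habitat's actual code: `require_automate`, `start_supervisor`, and the `connect` callback are hypothetical names, and the assumption that an unset variable keeps today's blocking behavior is mine, not something decided in this issue.

```python
import os


def require_automate(environ=os.environ):
    """Hypothetical helper: read the proposed HAB_REQUIRE_AUTOMATE variable.

    Assumption: when the variable is unset, the current (blocking)
    behavior is kept, so the default is True.
    """
    raw = environ.get("HAB_REQUIRE_AUTOMATE", "true").strip().lower()
    return raw not in ("false", "0", "no")


def start_supervisor(connect, environ=os.environ):
    """Try to connect the event stream, then decide whether startup may proceed.

    `connect` is a stand-in for whatever establishes the Automate connection.
    Returns True if the event stream connected, False if the Supervisor
    started without it. Re-raises the error if Automate is required.
    """
    try:
        connect()
        return True
    except ConnectionError as err:
        if require_automate(environ):
            raise  # current behavior: refuse to start
        # proposed behavior: log the failure and keep going
        print(f"event stream unavailable, starting anyway: {err}")
        return False
```

With `HAB_REQUIRE_AUTOMATE=false`, a failed connection is logged and services still load; with the variable unset, the failure still aborts startup as it does today.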
One subtlety is that if a Supervisor is allowed to start in event stream mode, but without being able to connect to Automate, how do we (and should we?) differentiate between the following cases:
User provided incorrect connection parameters
User provided correct parameters, but Automate is unavailable for some reason
I know that our team would prefer something more like errors in the logs that indicate connection failures. Our main criterion is that, because we are going the Effortless design route, the hab supervisor is our hand into managing the system. So the sup and our services have to be able to come up, even if reporting isn't working properly. If the supervisor is up, we can ship updates and address the issue, but if not we could find ourselves dead in the water. Basically reporting is important, but it's secondary to being able to get the system up and running.
I think a different error logged based on the kind of issue that occurred would work well for us - i.e. authentication issue with Automate, Automate server wasn't reachable at all, etc. But that's just for ease of human diagnosis. From an automation standpoint, it probably matters less... down is down.
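As a rough sketch of what per-cause log messages could look like: the `AuthRejected` exception, the `classify` function, and the category names below are all hypothetical, and the mapping is only a heuristic. A name-resolution failure, for example, usually points at a bad address but can also happen during a network outage, so this does not fully separate "wrong parameters" from "Automate is down".

```python
import socket


class AuthRejected(Exception):
    """Hypothetical error raised when Automate rejects the supplied token."""


def classify(err):
    """Map a connection failure to a coarse category for human-readable logs."""
    if isinstance(err, AuthRejected):
        return "auth"          # likely incorrect connection parameters
    if isinstance(err, socket.gaierror):
        return "bad-address"   # host name did not resolve
    if isinstance(err, (ConnectionError, TimeoutError)):
        return "unreachable"   # Automate refused, reset, or timed out
    return "unknown"


def log_line(err):
    """Build the message an operator would see for a given failure."""
    return f"event stream connection failed ({classify(err)}): {err}"
```

Automation can treat every category the same way ("down is down"), while the category in the message gives a person reading the logs a first hint about where to look.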
If there is a strong desire to keep a blocking startup, maybe this could be flag-driven, so the consumer can pick the behavior needed.