Add crashloop back off for k3s-server release #63

Open
gberche-orange opened this issue Aug 1, 2024 · 0 comments


Expected behavior

As an operator
In order to avoid crash loops that go unnoticed and mask the root cause of an error, such as https://github.com/orange-cloudfoundry/paas-templates/issues/2398
I need k3s-wrapper-boshrelease to back off when entering a crash loop

Observed behavior

tail -f -n 200 /var/vcap/monit/monit.log

#> [UTC Aug  1 10:41:31] info     : 'k3s-server' start: /var/vcap/jobs/k3s-server/bin/ctl
#> [UTC Aug  1 10:41:41] info     : 'k3s-server' process is running with pid 366216
#> [UTC Aug  1 10:42:41] error    : 'k3s-server' process is not running
#> [UTC Aug  1 10:42:41] info     : 'k3s-server' trying to restart
#> [UTC Aug  1 10:42:41] info     : 'k3s-server' start: /var/vcap/jobs/k3s-server/bin/ctl
#> [UTC Aug  1 10:42:52] info     : 'k3s-server' process is running with pid 366278
#> [UTC Aug  1 10:43:42] error    : 'k3s-server' process is not running
#> [UTC Aug  1 10:43:42] info     : 'k3s-server' trying to restart
#> [UTC Aug  1 10:43:42] info     : 'k3s-server' start: /var/vcap/jobs/k3s-server/bin/ctl
#> [UTC Aug  1 10:43:52] info     : 'k3s-server' process is running with pid 366344
#> [UTC Aug  1 10:44:12] error    : 'k3s-server' process is not running
#> [UTC Aug  1 10:44:12] info     : 'k3s-server' trying to restart

Possible fix

Use monit support for slow process start

https://web.archive.org/web/20110816041503/https://mmonit.com/monit/documentation/monit.html

if 2 restarts within 3 cycles then timeout

SERVICE TIMEOUT

monit provides a service timeout mechanism for situations where a service simply refuses to start or respond over a longer period.

The timeout mechanism is based on the number of service restarts and the number of poll-cycles. For example, if a service had x restarts within y poll-cycles (where x <= y) then Monit will perform an action (for example unmonitor the service). If a timeout occurs Monit will send an alert message if you have registered interest for this event.

The syntax for the timeout statement is as follows (keywords are in capital):

IF <number> RESTART <number> CYCLE(S) THEN <action>

Here is an example where Monit will unmonitor the service if it was restarted 2 times within 3 cycles:

if 2 restarts within 3 cycles then unmonitor

To have Monit check the service again after monitoring was disabled, run 'monit monitor <servicename>' from the command line.

Example for setting custom exec on timeout:

if 5 restarts within 5 cycles then exec "/foo/bar"

Example for stopping the service:

if 7 restarts within 10 cycles then stop
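
Applied to this release, the rule could look roughly like the stanza below. This is only a sketch of the idea, not the release's actual monit file: the pidfile path, the `start`/`stop` arguments passed to `ctl`, and the `5 restarts within 5 cycles` thresholds are all assumptions. The log above shows restarts roughly 30 to 60 seconds apart, so the cycle window has to be wide enough to span several of them for the rule to fire; the right numbers depend on the monit poll interval, which is not shown here.

```
check process k3s-server
  # Assumed pidfile location; use the path the ctl script actually writes.
  with pidfile /var/vcap/sys/run/k3s-server/k3s-server.pid
  # Assumed ctl arguments; the log only shows the script path.
  start program "/var/vcap/jobs/k3s-server/bin/ctl start"
  stop program "/var/vcap/jobs/k3s-server/bin/ctl stop"
  group vcap
  # Placeholder thresholds: stop retrying once the process keeps crashing.
  if 5 restarts within 5 cycles then timeout
```

With a rule of this shape, a crash loop would leave the process unmonitored and surface as a failing instance instead of restarting silently forever. Monitoring would then have to be re-enabled explicitly once the root cause is fixed.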

For inspiration, see how Cloud Foundry releases use this monit rule: https://github.com/search?q=org%3Acloudfoundry+if+restart+cycles+within+then+path%3A%2F%28%5E%7C%5C%2F%29monit%24%2F&type=code

https://github.com/cloudfoundry/healthchecker-release

This repository is a BOSH release for healthchecker that is a go executable designed to perform TCP/HTTP based health checks of processes managed by monit in BOSH releases. Since the version of monit included in BOSH does not support specific tcp/http health checks, we designed this utility to perform health checking and restart processes if they become unreachable.
