Add crashloop back off for k3s-server release #63

gberche-orange · 2024-08-01T11:00:49Z

Expected behavior

As an operator
In order to avoid crash loops that go unnoticed and mask the root cause of errors, such as https://github.com/orange-cloudfoundry/paas-templates/issues/2398
I need k3s-wrapper-boshrelease to back off when entering a crash loop

Observed behavior

tail -f -n 200 /var/vcap/monit/monit.log

#> [UTC Aug  1 10:41:31] info     : 'k3s-server' start: /var/vcap/jobs/k3s-server/bin/ctl
#> [UTC Aug  1 10:41:41] info     : 'k3s-server' process is running with pid 366216
#> [UTC Aug  1 10:42:41] error    : 'k3s-server' process is not running
#> [UTC Aug  1 10:42:41] info     : 'k3s-server' trying to restart
#> [UTC Aug  1 10:42:41] info     : 'k3s-server' start: /var/vcap/jobs/k3s-server/bin/ctl
#> [UTC Aug  1 10:42:52] info     : 'k3s-server' process is running with pid 366278
#> [UTC Aug  1 10:43:42] error    : 'k3s-server' process is not running
#> [UTC Aug  1 10:43:42] info     : 'k3s-server' trying to restart
#> [UTC Aug  1 10:43:42] info     : 'k3s-server' start: /var/vcap/jobs/k3s-server/bin/ctl
#> [UTC Aug  1 10:43:52] info     : 'k3s-server' process is running with pid 366344
#> [UTC Aug  1 10:44:12] error    : 'k3s-server' process is not running
#> [UTC Aug  1 10:44:12] info     : 'k3s-server' trying to restart

Possible fix

Use monit's service timeout support, which covers processes that are slow to start or keep failing after a restart:

https://web.archive.org/web/20110816041503/https://mmonit.com/monit/documentation/monit.html

if 2 restarts within 3 cycles then timeout

SERVICE TIMEOUT

monit provides a service timeout mechanism for situations where a service simply refuses to start or respond over a longer period.

The timeout mechanism is based on the number of service restarts and the number of poll-cycles. For example, if a service had x restarts within y poll-cycles (where x <= y) then Monit will perform an action (for example unmonitor the service). If a timeout occurs Monit will send an alert message if you have registered interest for this event.

The syntax for the timeout statement is as follows (keywords are in capital):

IF <number> RESTART <number> CYCLE(S) THEN <action>

Here is an example where Monit will unmonitor the service if it was restarted 2 times within 3 cycles:

if 2 restarts within 3 cycles then unmonitor

To have Monit check the service again after monitoring was disabled, run 'monit monitor <servicename>' from the command line.

Example for setting custom exec on timeout:

if 5 restarts within 5 cycles then exec "/foo/bar"

Example for stopping the service:

if 7 restarts within 10 cycles then stop
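
Applied to this release, the rule could be added to the k3s-server job's monit file. The stanza below is a minimal sketch, not the release's actual file: the pidfile path, the `start`/`stop` arguments passed to `ctl`, and the `vcap` group are assumptions based on common BOSH conventions, and the thresholds are illustrative only. The log timestamps above suggest a poll cycle of about 10 seconds with restarts 30 to 60 seconds apart, so a window of only 3 cycles may not reliably contain two restarts; a wider window is used here.

```
check process k3s-server
  # Assumed pidfile location; adjust to what the ctl script actually writes
  with pidfile /var/vcap/sys/run/k3s-server/k3s-server.pid
  # Assumed ctl arguments
  start program "/var/vcap/jobs/k3s-server/bin/ctl start"
  stop program "/var/vcap/jobs/k3s-server/bin/ctl stop"
  group vcap
  # Illustrative thresholds: give up after 3 restarts within 15 poll cycles
  if 3 restarts within 15 cycles then timeout
```

With the `timeout` action, monit should stop restarting the process once the threshold is reached, so the failure stays visible instead of looping. Monitoring can then be re-enabled with `monit monitor k3s-server` once the root cause is fixed.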

For inspiration, see how cloudfoundry releases use this monit rule: https://github.com/search?q=org%3Acloudfoundry+if+restart+cycles+within+then+path%3A%2F%28%5E%7C%5C%2F%29monit%24%2F&type=code

https://github.com/cloudfoundry/healthchecker-release

This repository is a BOSH release for healthchecker, a Go executable designed to perform TCP/HTTP based health checks of processes managed by monit in BOSH releases. Since the version of monit included in BOSH does not support specific TCP/HTTP health checks, we designed this utility to perform health checking and restart processes if they become unreachable.
