
Daemonset sleep period/restart causes confusing logs #970

Closed
johnSchnake opened this issue Oct 21, 2019 · 3 comments · Fixed by #971

Comments

johnSchnake (Contributor) commented Oct 21, 2019

What steps did you take and what happened:
Run an e2e + systemd-logs run. If the e2e tests take a long time (multiple hours), the systemd-logs plugin gathers its logs, sleeps for an hour, and then shuts down (which causes a restart, since a DaemonSet has to have restartPolicy: Always). It then gathers the logs again, and the server rejects them as duplicates.

This can be really confusing in error cases, since the logs make it unclear where the results came from, when the good results were processed, why retries were occurring, and so on.

This was one of the issues encountered in #969.

What did you expect to happen:
I want the DaemonSets to be able to run once with restartPolicy: Never, i.e. a job that runs on every node. Kubernetes just doesn't have that as of now, so we've fallen back to do work && sleep 3600. We need to reconsider how that integrates with the aggregator and logging to avoid the confusion.

Maybe on startup it could even check somehow whether it has already reported results and exit if it has; a rough sketch of that idea is below. Could that cause worse problems at some point?
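
A minimal sketch of that idea, assuming the container has a mount that survives restarts (for example an emptyDir volume at a hypothetical /results path) where it can drop a marker file after a successful submission. It sleeps rather than exits in the already-submitted case, since exiting would just trigger another restart:

      # Hypothetical wrapper around the existing plugin script.
      # /results/.already-submitted is an assumed marker path, not part of Sonobuoy.
      if [ -f /results/.already-submitted ]; then
        echo "Results already submitted; skipping resubmission"
      else
        /get_systemd_logs.sh && touch /results/.already-submitted
      fi
      # Keep the container alive so restartPolicy: Always never kicks in.
      while true; do sleep 3600; done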

Anything else you would like to add:
The problem in the logs is that the flow looks like this:

  • the worker submits results and the aggregator logs "got results..."
  • the worker/plugin restarts at some point, tries again, gets a 409 duplicate response, and the aggregator logs the retry/rejection
  • this repeats N times
  • the final pod logs show the plugin/worker starting up hours later, getting only a 409 response, and exiting

This makes it unclear who ever submitted the results and when.

johnSchnake (Contributor, Author) commented:

There are multiple Kubernetes issues about run-once DaemonSet behavior, but nothing seems to be happening. Those issues suggested that a new supported restart policy of OnFailure might be added, but that discussion has just gone stale: kubernetes/kubernetes#64623

johnSchnake (Contributor, Author) commented:

I think we can just sleep in a loop forever instead of just sleeping for 1h.

Is there a downside to that? It just changes the systemd-logs startup command from

      - /get_systemd_logs.sh && sleep 3600

to

      - /get_systemd_logs.sh && while true; do echo "Sleeping to avoid daemonset restart"; sleep 3600; done
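
For context, in the plugin's DaemonSet manifest this command sits in the container spec, roughly like the sketch below (the container name, image, and surrounding field values here are illustrative, not the actual Sonobuoy manifest):

      containers:
        - name: systemd-logs                    # illustrative container name
          image: sonobuoy/systemd-logs:latest   # illustrative image reference
          command: ["/bin/sh", "-c"]
          args:
            # Gather the logs once, then sleep in a loop so the container never
            # exits and restartPolicy: Always never restarts it.
            - /get_systemd_logs.sh && while true; do echo "Sleeping to avoid daemonset restart"; sleep 3600; done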

@zubron what do you think? Any downside to that?

zubron (Contributor) commented Oct 21, 2019

@johnSchnake I just took a look through that issue and some of the other duplicate issues. This approach seems fine and looks like what most other people are doing anyway. I don't know how else we could work around this. Even if support for a new restart policy were introduced, it would be quite a few release cycles before we could make use of it, given that we need to support older cluster versions.

johnSchnake added a commit that referenced this issue Oct 21, 2019
We currently sleep for 1h after the logs are gathered, but some test runs take multiple hours, which leads to this plugin getting restarted and trying to resubmit logs. This causes confusing messages in the logs, since the original plugin pod is gone and the only one that exists gets a 409 error from submitting duplicate results.

Fixes #970

Signed-off-by: John Schnake <[email protected]>