
Daemonset sleep period/restart causes confusing logs #970

Closed
johnSchnake opened this issue Oct 21, 2019 · 3 comments · Fixed by #971

Comments

johnSchnake (Contributor) commented Oct 21, 2019

What steps did you take and what happened:
Run an e2e + systemd-logs run. If the e2e tests take a long time (multiple hours), the systemd-logs plugin gathers its logs, sleeps for an hour, and then shuts down (which causes a restart, since a DaemonSet has to have restartPolicy: Always). It then gathers the logs again, and the server rejects them as duplicates.

This can be really confusing in error cases, since the logs make it unclear where the results came from, when the good results were processed, why retries were occurring, and so on.

This was one of the issues encountered in #969.

What did you expect to happen:
I want the DaemonSets to be able to run once with restartPolicy: Never, i.e. a job that runs on every node. Kubernetes just doesn't have that as of now, so we've fallen back to do work && sleep 3600. We need to reconsider how that integrates with the aggregator and logging to avoid the confusion.

Maybe on startup it could even check somehow whether it has already reported results and exit if it has; a rough sketch of that idea is below. Could that cause worse problems at some point?
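
A minimal sketch of that idea, assuming the container has a mount that survives restarts (for example an emptyDir volume at a hypothetical /results path) where it can drop a marker file after a successful submission. It sleeps rather than exits in the already-submitted case, since exiting would just trigger another restart:

      # Hypothetical wrapper around the existing plugin script.
      # /results/.already-submitted is an assumed marker path, not part of Sonobuoy.
      if [ -f /results/.already-submitted ]; then
        echo "Results already submitted; skipping resubmission"
      else
        /get_systemd_logs.sh && touch /results/.already-submitted
      fi
      # Keep the container alive so restartPolicy: Always never kicks in.
      while true; do sleep 3600; done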

Anything else you would like to add:
The problem in the logs is that the flow looks like this:

  • the worker submits results and the aggregator logs "got results..."
  • the worker/plugin restarts at some point, tries again, gets a 409 duplicate response, and the aggregator logs the retry/rejection
  • this repeats N times
  • the final pod logs show the plugin/worker starting up hours later, getting only a 409 response, and exiting

This makes it unclear who ever submitted the results and when.

johnSchnake (Contributor, Author) commented:

There are multiple Kubernetes issues about run-once DaemonSet behavior, but nothing seems to be happening. Those issues suggested that a new supported restart policy of OnFailure might be added, but that discussion has just gone stale: kubernetes/kubernetes#64623

johnSchnake (Contributor, Author) commented:

I think we can just sleep in a loop forever instead of just sleeping for 1h.

Is there a downside to that? It just changes the systemd-logs startup command from

      - /get_systemd_logs.sh && sleep 3600

to

      - /get_systemd_logs.sh && while true; do echo "Sleeping to avoid daemonset restart"; sleep 3600; done
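
For context, in the plugin's DaemonSet manifest this command sits in the container spec, roughly like the sketch below (the container name, image, and surrounding field values here are illustrative, not the actual Sonobuoy manifest):

      containers:
        - name: systemd-logs                    # illustrative container name
          image: sonobuoy/systemd-logs:latest   # illustrative image reference
          command: ["/bin/sh", "-c"]
          args:
            # Gather the logs once, then sleep in a loop so the container never
            # exits and restartPolicy: Always never restarts it.
            - /get_systemd_logs.sh && while true; do echo "Sleeping to avoid daemonset restart"; sleep 3600; done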

@zubron what do you think? Any downside to that?

zubron (Contributor) commented Oct 21, 2019

@johnSchnake I just took a look through that issue and some of the other duplicate issues. This approach seems fine and looks like what most other people are doing anyway. I don't know how else we could work around this. Even if support for a new restart policy were introduced, it would be quite a few release cycles before we could make use of it, given that we need to support older cluster versions.

johnSchnake added a commit that referenced this issue Oct 21, 2019
We currently sleep for 1h after the logs are gathered, but some test runs take multiple hours, which leads to this plugin getting restarted and trying to resubmit logs. This causes confusing messages in the logs, since the original plugin pod is gone and the only one that exists gets a 409 error from submitting duplicate results.

Fixes #970

Signed-off-by: John Schnake <[email protected]>