This repository has been archived by the owner on Mar 5, 2024. It is now read-only.

kiam-agent random errors on node startup #358

Closed
caiohasouza opened this issue Jan 14, 2020 · 10 comments · Fixed by #367

Comments

@caiohasouza
Contributor

Hello,

I'm seeing random errors when new nodes are added to the cluster: the kiam-agent randomly fails on startup with this error:

{"level":"fatal","msg":"error creating server gateway: error dialing grpc server: context deadline exceeded","time":"2020-01-14T17:53:00Z"}

After some restarts the kiam-agent works normally. This happens randomly, and only on some nodes:

kiam-agent-4rxlg 1/1 Running 0 2m37s
kiam-agent-4whq4 1/1 Running 1 2m46s

Is this behavior normal? Do I need to fix or change some parameter?

Thank you!

@nirnanaaa
Contributor

nirnanaaa commented Jan 15, 2020

We're experiencing the exact same behavior when rolling out new kiam servers. What we have seen is that for most agents the gRPC connection is updated to the new servers, but for some reason it sometimes just stays as is.

For us the kiam agent stays alive, but only answers with the above message.

@caiohasouza
Contributor Author

When I enabled gRPC debug logging, I saw this error:

grpc: failed dns SRV record lookup due to lookup _grpclb._tcp.kiam-server on 100.64.0.10:53: no such host.

So in my case the problem is DNS name resolution. I created a PR (#367) that makes it possible to add an initContainer to the kiam-agent. With an initContainer, DNS resolution can be checked before the kiam-agent starts:

initContainers:
  - name: check-dns
    image: busybox:1.31
    imagePullPolicy: IfNotPresent
    command: ['sh', '-c', 'until nslookup kiam-server; do sleep 1; done;']
    resources:
      requests:
        cpu: 20m
        memory: 16Mi
      limits:
        cpu: 20m
        memory: 16Mi

I tested this in my local environment and the problem was fixed.
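
For illustration only, the wait-for-DNS idea behind the `until nslookup kiam-server` loop could be sketched in Go roughly as follows. This is not kiam code: `waitForDNS`, `lookupFunc`, and `flakyResolver` are hypothetical names, and the resolver is simulated so the example runs without a cluster. Unlike the shell loop, which retries forever, this sketch assumes a cap on the number of attempts.

```go
package main

import (
	"errors"
	"fmt"
	"net"
	"time"
)

// lookupFunc matches the signature of net.LookupHost so the resolver can be swapped out.
type lookupFunc func(host string) ([]string, error)

// Compile-time check that the standard library resolver fits lookupFunc.
var _ lookupFunc = net.LookupHost

// waitForDNS (hypothetical helper) calls lookup until host resolves
// or the given number of attempts is used up.
func waitForDNS(host string, attempts int, interval time.Duration, lookup lookupFunc) error {
	if attempts <= 0 {
		return errors.New("attempts must be positive")
	}
	var lastErr error
	for i := 0; i < attempts; i++ {
		if _, lastErr = lookup(host); lastErr == nil {
			return nil
		}
		// Sleep between attempts, but not after the last one.
		if i < attempts-1 {
			time.Sleep(interval)
		}
	}
	return fmt.Errorf("%q did not resolve after %d attempts: %w", host, attempts, lastErr)
}

// flakyResolver (hypothetical, for the demo) fails the first `failures`
// calls and then returns a made-up address.
func flakyResolver(failures int) lookupFunc {
	calls := 0
	return func(host string) ([]string, error) {
		calls++
		if calls <= failures {
			return nil, errors.New("no such host (simulated)")
		}
		return []string{"10.0.0.1"}, nil
	}
}

func main() {
	// Simulated resolver: DNS "comes up" on the third attempt.
	// Against a real cluster, net.LookupHost would be passed instead.
	err := waitForDNS("kiam-server", 5, 10*time.Millisecond, flakyResolver(2))
	fmt.Println("resolved:", err == nil)
}
```

Passing the resolver in as a function keeps the retry logic testable without network access; whether a bounded retry or an endless loop is preferable depends on how the pod should behave when DNS never becomes available.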

@nirnanaaa
Contributor

I think such a change works against the principles behind k8s, mainly exponential backoff and failing fast. The agent should just die if it gets into a state it can't recover from, right?
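
As a rough sketch of the exponential backoff mentioned here: the Kubernetes documentation describes the kubelet restarting crashed containers with a delay that starts at 10s, doubles each time, and is capped at five minutes. The snippet below only reproduces that schedule arithmetically; `restartDelay` is a hypothetical helper, and the 10s and 5m values are taken from that documentation, not from this thread.

```go
package main

import (
	"fmt"
	"time"
)

// restartDelay (hypothetical helper) returns the wait before restart n
// (0-based): base doubled n times, never exceeding max.
func restartDelay(n int, base, max time.Duration) time.Duration {
	d := base
	for i := 0; i < n; i++ {
		d *= 2
		if d >= max {
			return max
		}
	}
	if d > max {
		return max
	}
	return d
}

func main() {
	// Print the delay before each of the first seven restarts.
	for n := 0; n < 7; n++ {
		fmt.Printf("restart %d: wait %s\n", n+1, restartDelay(n, 10*time.Second, 5*time.Minute))
	}
}
```

Under this schedule a crashing agent is retried quickly at first, so a short DNS outage at node startup costs only a few restarts, while a persistent failure settles into one attempt every five minutes.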

@Joseph-Irving
Contributor

The problem would appear to be that the Kiam agent is starting on your nodes before DNS is working properly on them (maybe the CNI is still initialising, or something like that). In that case it's perfectly reasonable for Kiam to restart.

Is the kiam agent restarting causing you application-level issues?
If so, it might be worth implementing a fix for your setup to prevent this; otherwise it's probably fine to leave it as is.

@nirnanaaa
Contributor

nirnanaaa commented Jan 21, 2020

For us at least, this is not happening on startup but during regular operations, when the kiam servers are rotated.

@caiohasouza
Contributor Author

@Joseph-Irving the problem happens after startup too, but it is more frequent on startup. I implemented the initContainer solution with the DNS check on startup and everything works fine. I'm waiting for PR #367 to be approved.

@Joseph-Irving
Contributor

What you're describing sounds like #217, @nirnanaaa. There's a potential fix in v3.5, if you're not on that already.

@caiohasouza
Contributor Author

@Joseph-Irving I'm already using 3.5.

@jpugliesi

@nirnanaaa Did you find that updating to 3.5 fixed this for you? We're running 3.5 and still seeing the same issue on node startup. We're going to try @caiohasouza's initContainer check for DNS, but it seems like a less-than-ideal fix for this problem.

@jkassis

jkassis commented Jul 29, 2020

also seeing this on 3.5
