kiam-agent random errors on node startup #358
Comments
We're experiencing the exact same behavior when rolling out new kiam servers. What we have seen is that for most agents the gRPC connection is updated to the new servers, but for some reason it sometimes just stays as it is. For us the kiam agent stays alive, but only answers with the above message. |
When I enabled gRPC debug logging, I saw this error:
So in my case the problem is DNS name resolution. I created a PR (#367) that makes it possible to add an initContainer to the kiam-agent. With the initContainer it is possible to check DNS resolution before the kiam-agent starts:
I tested this in my local environment and the problem was fixed. |
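For readers following along, here is a minimal sketch of what such a pre-start DNS check could look like. It is not the code from PR #367; the function names, the retry count, the delay, and the `kiam-server` hostname are all assumptions made for illustration. The idea is simply to retry name resolution a bounded number of times and exit non-zero if it never succeeds, so the initContainer blocks the agent until DNS answers.

```python
import socket
import time


def wait_for_dns(hostname, attempts=30, delay=2.0,
                 resolve=socket.gethostbyname, sleep=time.sleep):
    """Return True once `hostname` resolves, False after `attempts` failures.

    `resolve` and `sleep` are injectable so the loop can be exercised
    without real DNS lookups or real waiting.
    """
    for attempt in range(1, attempts + 1):
        try:
            resolve(hostname)
            return True
        except OSError:
            # socket.gaierror (lookup failure) is a subclass of OSError.
            if attempt < attempts:
                sleep(delay)
    return False


def main(argv):
    """Exit code for an init step: 0 if DNS resolved, 1 otherwise."""
    # "kiam-server" is a placeholder; pass the real service name as argv[1].
    hostname = argv[1] if len(argv) > 1 else "kiam-server"
    return 0 if wait_for_dns(hostname) else 1
```

To use it as a script, call `sys.exit(main(sys.argv))` at the bottom. A non-zero exit makes the initContainer fail, and Kubernetes then retries it according to the pod's restart policy.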
I think such a change just works against the principles behind k8s, mainly exponential backoff and fail-fast. The agent should just die if it gets into a state it can't recover from, right? |
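To make the backoff argument concrete: the kubelet restarts a crashing container with an exponentially growing delay (10s, 20s, 40s, and so on) capped at five minutes. The sketch below only illustrates that schedule; `restart_delays` is a hypothetical helper, not part of Kubernetes or kiam, and the exact timing on a real node depends on the kubelet.

```python
def restart_delays(restarts, base=10, cap=300):
    """Seconds waited before each of the first `restarts` restarts,
    doubling from `base` and capped at `cap` (illustrative values)."""
    return [min(base * 2 ** i, cap) for i in range(restarts)]


print(restart_delays(7))  # [10, 20, 40, 80, 160, 300, 300]
```

Under these assumptions, an agent that fails fast while DNS is still coming up would be retried several times within the first couple of minutes, which matches the "works after some restarts" behavior reported in this issue.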
The problem would appear to be that the Kiam agent is starting on your nodes before DNS is working properly on them (maybe the CNI is still initialising, or something like that). In that case it's perfectly reasonable for Kiam to restart. Is the kiam agent restarting causing you application-level issues? |
For us at least, this is not happening on startup but during regular operations, when the kiam servers are rotated. |
@Joseph-Irving the problem happens after startup too, but it is more frequent on startup. I implemented the initContainer solution with a DNS check on startup and everything works fine. I'm waiting for PR #367 to be approved. |
What you're describing sounds like #217, @nirnanaaa. There's a potential fix in v3.5, if you're not on that yet. |
@Joseph-Irving I'm already using 3.5. |
@nirnanaaa Did you find that updating to 3.5 fixed this for you? We're running 3.5 and still seeing the same issue on node startup. We're going to try @caiohasouza's initContainer check for DNS, but it seems like a less-than-ideal fix for this problem. |
Also seeing this on 3.5. |
Hello,
I'm seeing some random errors when new nodes are added to the cluster: the kiam agent randomly fails on start with this error:
After some restarts the kiam-agent works normally. This happens randomly, and only on some nodes:
Is this behavior normal? Do I need to fix or change some parameter?
Thank you!