kiam server HA. Context deadline exceeded while rotating servers. #217
Comments
Same issue here, KIAM v3.2.
Similar issue, kiam v3.3.
Just been burned by this myself - v3.2.
Same issue for us on kiam v3.0.
We run Kiam in relatively large clusters and we regularly terminate our master nodes (we don't allow them to live longer than 7 days), and we've never seen this behaviour. Kiam agents should be discovering the servers through a service, so I'd be interested to know what networking stack you're using, in case there's some commonality there which could be what's causing this behaviour.
We are using the amazon-k8s-cni plugin for networking. No other service mesh/network policy currently.
Same here.
Using Calico over AWS EC2.
OK, so I have been doing some more testing of this issue. From inside a pod, when I run a cURL to request credentials from the metadata service, it works fine before killing a KIAM server, as expected. As soon as I kill a KIAM server, the same cURL command fails immediately, and the matching log line from the KIAM agent shows similar information. I also ran an nslookup from inside a pod at the point of killing the KIAM server: the record for the killed KIAM server disappears almost instantaneously (for reference, we are using CoreDNS). It feels very much like something is being cached on the KIAM agents in relation to the IP addresses available for gRPC load balancing - DNS is very quickly removing the retired IP from the pool, but the KIAM agent still appears to be routing requests to it. The access key/secret/token above is nonsensical; it is just an example.
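To make that theory concrete, here is a minimal sketch (an assumption about the setup, not kiam's actual source) of a gRPC client dialled with a DNS-resolved target and a round-robin balancer, which is the pattern the agent appears to use. If the balancer's pool keeps an address after its pod is gone, requests picked for that address fail even though DNS has already dropped the record.

```go
// Minimal illustrative sketch, not kiam's code: a dns:/// target resolved to
// multiple server pod IPs, with round-robin picking between the resulting
// connections. "kiam-server" is a placeholder for whatever address the agent
// is configured with.
package main

import (
	"context"
	"log"
	"time"

	"google.golang.org/grpc"
)

func main() {
	ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
	defer cancel()

	conn, err := grpc.DialContext(ctx,
		"dns:///kiam-server:443",
		grpc.WithDefaultServiceConfig(`{"loadBalancingConfig": [{"round_robin":{}}]}`),
		grpc.WithInsecure(), // kiam itself uses mutual TLS; omitted to keep the sketch short
		grpc.WithBlock(),    // block until a connection is ready (or the context times out)
	)
	if err != nil {
		log.Fatalf("dial failed: %v", err)
	}
	defer conn.Close()

	// RPCs issued on conn are spread round-robin across the resolved addresses;
	// a stale address only leaves the pool once its connection is seen to fail
	// or the resolver returns a new address set.
}
```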
I am struggling to reproduce the issue myself - when killing Kiam servers in various ways, the Kiam agent correctly load balances between the remaining Kiam servers. In order to observe the agent's behaviour, I set the following env vars in the agent pod to increase the log verbosity of gRPC:
The following events occur on one agent that load balances between two different servers. The agent starts up and performs some service discovery to find the servers:
The agent connects to the servers successfully (below). Note that each server is associated with a hex value; in this instance we have the following two servers:
^ Here, one of the gRPC connections becomes ready before the other, and the 'roundrobinPicker' lists this as the only server available in its pool to load balance with.
^ The other connection becomes ready and the load balancing pool is updated to include the second server. At this point, I kill one of the Kiam servers (10.244.1.3 - 0xc4203c1d40):
^ The agent attempts to reconnect to the server I killed. It will continue trying to reconnect to the server that has gone away over time (I have omitted these logs).
^ Immediately after the connection error, the pool of servers to load balance with is updated to include just the one remaining server. Throughout this, my curls from another pod to the metadata API have been successful - there hasn't been any sign of the issue reported (these curls also show in the agent logs but have been omitted above for clarity). Would somebody be able to set the gRPC log level env vars mentioned above in their agent deployments and provide agent logs, including examples of agent startup and of when a server is killed, so we can compare? Thanks.
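For anyone trying to collect the same logs: the env vars above control grpc-go's built-in logger, and the equivalent can also be set programmatically. A short sketch (illustrative only, not something kiam exposes as configuration):

```go
// Raise grpc-go's logger to verbosity 2 so resolver, connectivity and
// balancer state changes (like the roundrobinPicker updates discussed above)
// are written to the logs. Illustrative sketch only.
package main

import (
	"os"

	"google.golang.org/grpc/grpclog"
)

func init() {
	// Roughly equivalent to setting GRPC_GO_LOG_SEVERITY_LEVEL=info and
	// GRPC_GO_LOG_VERBOSITY_LEVEL=2 in the pod spec (grpc-go's standard
	// logging env vars; the exact values used above are assumed).
	grpclog.SetLoggerV2(grpclog.NewLoggerV2WithVerbosity(os.Stdout, os.Stdout, os.Stderr, 2))
}

func main() {
	// ... dial the kiam server as usual; connection and balancer events will
	// now appear at the higher verbosity.
}
```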
Hi. I have collected logs as requested. Kiam Agent startup logs:
We have two KIAM servers running:
I then kill the EC2 host running one of the KIAM servers (10-2-118-49.kiam-server.iam-proxy.svc.cluster.local); these are dedicated worker nodes that only run the KIAM servers.
The KIAM agent logs during this period are as follows:
Hi @mehstg - thanks for providing those logs. It is strange that you aren't seeing any gRPC logs.
No, unfortunately not. I just tested again and I only really see the info messages during startup.
OK - thanks for double checking. A few questions: are the IP addresses that are resolved in your logs above the same as your kiam server pod IPs? Also, could you please try running kiam v3.4 (released today), as it includes some gRPC updates, and let us know if this changes any behaviour.
I actually am not setting any resources on the pods. I'm deploying from the stable helm charts (v2.2.0).
I'd recommend setting some resource requests, at least for the agent pods, so they are not starved of resources. We use the following values for the agents, which work fine (but this will depend on the scale of your cluster):
and the server resources are set to the following:
I have set resource requests on my KIAM pods and also upgraded to KIAM v3.4 on one of my test clusters. Previously, every request would die after killing one of the KIAM servers, so I wonder if the load balancing is working better in v3.4 but the dead server is still being cached somewhere. Log attached.
I have been bitten by this as well. I haven't had a chance to enable gRPC logs to provide, but I am wondering if this has to do with how the agent's connections are handled when a server goes away ungracefully. My knowledge of gRPC is lacking, but perhaps when you manually terminate a server pod the connection is closed cleanly, whereas when the node disappears abruptly it is not. I'll try and get some time to do further testing on my end and let you know how I go. As a side note, I have a PR open to convert the kiam-server Deployment to a DaemonSet.
Here are the agent logs when running the server as a DaemonSet:
I think this is down to the way the connection to the old server is handled when its node goes away. I'm off to test this by rolling nodes with the server running as a DaemonSet.
OK, testing with the server running as a DaemonSet: I rebooted 5 nodes and none of my workloads lost their IAM credentials. I'd love for someone to double-check these findings, though. The PR for the DaemonSet conversion is open. I don't think this solves the problem entirely, though, as the same "pull the plug" event could happen at any time and agents might still serve errors.
Hi @daviddyball, thanks for adding your findings. I have been trying to replicate the issue and I am still struggling - killing the Kiam server in various ways (deleting the pod, forcefully deleting the pod, terminating the node, rebooting the node) whilst sending requests to a Kiam agent has not reproduced the context deadline exceeded errors. From looking at the logs you've provided when Kiam server is running as a daemonset, the gRPC client on the agent does seem to be behaving normally when a server goes away; it notices the gRPC server connection failing as the master nodes are recycled:
^ The connection to a server enters a failure state.
^ The load balancing pool updates immediately to drop the failing server. This occurs with another Kiam server at the same time:
So I am not entirely sure what is causing the issue yet. I am still trying to replicate it, though, and if you are happy to share, it would be useful if you could confirm the following: How are you deploying Kiam to your cluster (e.g., Helm/manifest files)? Which version of Kiam are you running? What DNS service are you running in your cluster (e.g., kube-dns, CoreDNS)?
Hey @rhysemmas, here's my setup:
How are you deploying Kiam to your cluster (e.g., Helm/manifest files)? Flat YAML manifests.
Which version of Kiam are you running? quay.io/uswitch/kiam:v3.3
What DNS service are you running in your cluster (e.g., kube-dns, CoreDNS)? CoreDNS: k8s.gcr.io/coredns:1.1.3
Out of interest, David - what CNI are you using? I am using the amazon-vpc-cni. Wondering if there is another similarity in our configurations that we are missing.
I'm seeing this on a kops update and rolling of the cluster. I see kops try to drain the nodes, but it fails to take any notice of DaemonSets; kops then terminates the node via the AWS autoscaling group API.
@mehstg I'm using the Calico CNI, although I've also recreated this with Cilium on our development cluster.
Ha, OK - so all on different CNIs.
Saw this (again) today on a rolling update of 1.13 masters (kops), running quay.io/uswitch/kiam:v3.3.
I should report that since moving the kiam-server to a DaemonSet, we haven't been bitten by this when rolling the control plane. While this doesn't solve the issue of random server failures causing the same behaviour, it does mean we can roll our control plane without fear of breaking our workloads' IAM credentials.
If you need to evict a DS before a node dies, you can actually do so by applying a NoExecute taint to the node.
This will remove any/all pods from the node - it should be done after a drain (as usually the running non-DS pods have dependencies on the DS pods) - if you (for some reason) still need a DS instead of a Deployment.
Nice to know, @jacksontj.
OK, so I have just been burned by this again on our prod system: cert-manager replaced the TLS certificates used by KIAM and the kiam-servers restarted.
My root cause theory: without gRPC keepalives, the agent's connections to a server whose host disappears ungracefully are never torn down, so requests keep being routed to the dead server until they time out.
@mechpen - great spot. I have been doing some testing today and inserted the keepalive into a local branch of Kiam - https://github.com/mehstg/kiam/tree/grpc-keepalive-testing. I have just spun it up on one of my test clusters and cannot get it to fail now, whether I ungracefully terminate the instance or drain it correctly.
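For reference, the change being tested is along these lines - a hedged sketch of client-side gRPC keepalives (illustrative values; the branch above is authoritative). The agent pings each server connection, so a connection to a host that disappeared ungracefully is closed and falls out of the load-balancing pool instead of swallowing requests until the context deadline:

```go
// Illustrative sketch of gRPC client keepalives, not the exact code in the
// branch linked above. The values are examples only.
package main

import (
	"context"
	"log"
	"time"

	"google.golang.org/grpc"
	"google.golang.org/grpc/keepalive"
)

func main() {
	kp := keepalive.ClientParameters{
		Time:                10 * time.Second, // ping the server if the connection has been idle this long
		Timeout:             5 * time.Second,  // drop the connection if the ping is not acknowledged in time
		PermitWithoutStream: true,             // send pings even when there are no active RPCs
	}

	ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
	defer cancel()

	// "kiam-server" stands in for the agent's configured server address.
	conn, err := grpc.DialContext(ctx, "dns:///kiam-server:443",
		grpc.WithKeepaliveParams(kp),
		grpc.WithInsecure(), // kiam uses TLS in reality; omitted for brevity
	)
	if err != nil {
		log.Fatalf("dial failed: %v", err)
	}
	defer conn.Close()
}
```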
Kiam v3.5 (https://github.com/uswitch/kiam/releases/tag/v3.5) has a potential fix for this issue - has anyone been able to reproduce it on v3.5?
How should the CLI arguments be tuned to stop this issue from happening, @Joseph-Irving?
@nirnanaaa the default values should be good. |
The problem is that I want to migrate from …
I'm seeing this now in KIAM v3.5 with OpenShift 4.5 in AWS. It shows up right away when I install the helm chart; I can't get the kiam-server stable.
In my case, the server was continuously and consistently failing the first liveness/readiness probes. It was 100% due to running the health check against 127.0.0.1 rather than localhost. Here are my current working OKD 4.5 overrides for the v3.5 chart:
I am still getting the same error even after upgrading to v3.5. Whenever a new node comes up, the agent restarts with the following error:
I have the same issue. I'm running v3.2, installed with the helm chart v2.3.0. I am not sure whether the issue is triggered by the rotation of the certificates generated by cert-manager or by a rotation of kiam-server itself. When the issue is occurring, changing --server-address=kiam-server:443 to --server-address=localhost:443 in the agent's daemonset seems to fix it. Sadly, that value is hardcoded in the helm chart.
I linked this to #425, but it's potentially related to the way the gRPC load balancer manages connection failures. We've updated the version of gRPC and, according to the gRPC team, this can be managed by controlling the MaxConnectionAge setting. We'll hopefully get this into v4, so this issue ought to be resolvable through that.
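For illustration, a minimal sketch of the server-side setting being referred to (example values, not kiam's defaults or flags): MaxConnectionAge makes the server close client connections once they reach a given age, so agents have to re-resolve and reconnect rather than holding on to one set of connections indefinitely.

```go
// Illustrative sketch of gRPC server keepalive options; values are examples.
package main

import (
	"log"
	"net"
	"time"

	"google.golang.org/grpc"
	"google.golang.org/grpc/keepalive"
)

func main() {
	lis, err := net.Listen("tcp", ":443")
	if err != nil {
		log.Fatalf("listen failed: %v", err)
	}

	srv := grpc.NewServer(
		grpc.KeepaliveParams(keepalive.ServerParameters{
			MaxConnectionAge:      2 * time.Minute,  // close client connections once they reach this age
			MaxConnectionAgeGrace: 30 * time.Second, // give in-flight RPCs time to finish first
		}),
	)

	// Service registration omitted; this only shows where the keepalive
	// parameters are applied.
	if err := srv.Serve(lis); err != nil {
		log.Fatalf("serve failed: %v", err)
	}
}
```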
We've updated gRPC, and the changes in #433 should make it possible to mitigate the situation, according to the advice from the gRPC team: setting a shorter connection age forces clients to re-poll for servers frequently. I'm going to close this issue for now; people should follow that advice to manage it. If we see other problems we can re-open. Thanks to everyone for contributing and helping us close this down. It was also reported in #425.
In my test AWS lab I run 3 masters with KIAM servers (quay.io/uswitch/kiam:v3.0) and 6 nodes with KIAM agents (same images). I'm getting these errors every time I rotate masters. The error starts to appear on every one of the 6 agents once I remove at least 1 master. To stop it I need to restart all agents; when they restart, they start to work normally. This does not happen if I rotate the kiam server itself (delete the pod), only when I rotate the whole master instance.
I'm using this chart https://github.com/helm/charts/tree/master/stable/kiam to run kiam. No special tuning, except that I'm setting gatewayTimeoutCreation: 500ms on both agent and server. Am I missing something important in the docs, some flag? Please suggest.