kiam server HA. Context deadline exceeded while rotating servers. #217

max-lobur · 2019-02-18T16:59:42Z

In my test AWS lab I run 3 masters with KIAM servers (quay.io/uswitch/kiam:v3.0) and 6 nodes with KIAM agents (same images). I'm getting these

{"addr":"100.112.154.80:56850","level":"error","method":"GET","msg":"error processing request: rpc error: code = DeadlineExceeded desc = context deadline exceeded","path":"/latest/meta-data/iam/security-credentials/","status":500,"time":"2019-02-18T16:48:17Z"}

every time I'm rotating masters. The error starts to appear on every one of 6 agents, once I remove at least 1 master. To stop it I need to restart all agents, when they restart they start to work normally. This does not happen if I rotate kiam server itself (delete the pod), only when I rotate the whole master instance.

I'm using this chart https://github.com/helm/charts/tree/master/stable/kiam to run kiam. No special tuning, except for I'm setting gatewayTimeoutCreation: 500ms on both agent and server.

Am I missing smth important in docs, some flag? Please suggest

The text was updated successfully, but these errors were encountered:

complex64 · 2019-04-23T15:20:36Z

Same issue here, KIAM v.3.2

MilanDasek · 2019-07-17T09:14:10Z

similar issue, kiam v3.3

mehstg · 2019-07-17T15:35:29Z

Just been burned by this myself - v3.2

mseiwald · 2019-07-24T11:06:11Z

Same issue for us on kiam v3.0

Joseph-Irving · 2019-07-30T08:35:15Z

We run Kiam in relatively large clusters and we regularly terminate our master nodes (we don't allow them to live longer than 7 days) and we've never seen this behaviour.

Kiam should be using a service with clusterIP: none this is so that the agents can use grpc loadbalancing, they go and discover the pod ips of each kiam-server and will loadbalance between them, if one goes down it will just try another. So I'm very puzzled as to how you're seeing this.

I'd be interested to know what networking stack you're using, incase there's some commonality there which could be what's causing this behaviour.

mehstg · 2019-07-30T09:02:45Z

We are using the amazon-k8s-cni plugin for networking. No other service mesh/network policy currently.

denispesikov · 2019-07-30T16:00:38Z

Same here
error processing request: error fetching credentials: rpc error: code = DeadlineExceeded desc = context deadline exceeded
KIAM v3.2
Any workaround?

max-lobur · 2019-07-30T17:18:22Z

Using calico over aws ec2.
Also using this chart to deploy https://github.com/helm/charts/tree/master/stable/kiam
It seems that it uses proper service as you mentioned https://github.com/helm/charts/blob/master/stable/kiam/templates/server-service.yaml#L26

mehstg · 2019-07-31T09:34:40Z

Ok. So I have been doing some more testing of this issue.

From inside a pod, when I run a cURL to request the credentials from the metadata service, it works fine before killing a KIAM server, as expected:
curl http://169.254.169.254/latest/meta-data/iam/security-credentials/dev-eu-west-1-dev01-pod-api {"Code":"Success","Type":"AWS-HMAC","AccessKeyId":"ASIAYXSSZXR7IBB2RQMP","SecretAccessKey":"O3h+vJPsu8vhWtyWOCkiV9rwh4F/3HMF+jHpMpCP","Token":"FQoGZXIvYXdzEDkaDIL6Kpd6o56IKCaqdyLtAXPxNnhIs6lvNrJwANlEupVSSwGqUzOxOhry10nANFU5VB28YPNc8BYnSy6B8+tMkw10d9ubPwfFj63sdGpuHAS+4wtrpPRbb5p2ma/qvyUu8p33t9+NGZ6BNGYUga43E9+erGSHjDuvD9I4Lwjcz831D3lAYidCAjb99Q+iOWMyScklGfH3YELKGpEtzg55KYPTH16xc+QtOUtdmcUZZsbGnvw04zb6gxhctUCneknWY9+ZfiIGRU2pXvUlavGPCtekyrTp4DnCN2IbDQnmNCfwsZfG0LZZ1yQw41FuZ3igPzzb6TM6a6LlU9gM+Cj4/oTqBQ==","Expiration":"2019-07-31T07:27:56Z","LastUpdated":"2019-07-31T07:12:56Z"}

As soon as I kill a KIAM server, the same cURL command fails immediately:
curl http://169.254.169.254/latest/meta-data/iam/security-credentials/dev-eu-west-1-dev01-pod-api error fetching credentials: rpc error: code = DeadlineExceeded desc = context deadline exceeded
I can then repeatedly cURL the same request and sometimes it returns OK, sometimes it fails.

The matching log line from the KIAM agent shows similar information:
{"addr":"10.1.113.124:49850","level":"error","method":"HEAD","msg":"error processing request: error fetching credentials: rpc error: code = DeadlineExceeded desc = context deadline exceeded","path":"/latest/meta-data/iam/security-credentials/dev-eu-west-1-dev01-pod-api","status":500,"time":"2019-07-31T07:22:00Z"}`

I ran a nslookup from inside a pod at the point of killing the KIAM server too. The record for the killed KIAM server disappears almost instantaneously. (For reference, we are using CoreDNS) .

It feels very much like something is being cached on the KIAM agents in relation to the IP addresses available to gRPC load balance with. DNS is very quickly removing the retired IP from the pool, but it feels like the KIAM agent is still trying to route requests to it.

The access key/secret/token above is nonsensical, it is just an example

rhysemmas · 2019-08-16T08:38:02Z

I am struggling to reproduce the issue myself - when killing Kiam servers in various different ways, the Kiam agent is correctly load balancing between the remaining Kiam servers.

In order to observe the agent's behaviour, I set the following env vars in the agent pod to increase the log verbosity of grpc:

GRPC_GO_LOG_SEVERITY_LEVEL=info
GRPC_GO_LOG_VERBOSITY_LEVEL=8

The following events occur on 1 agent that load balances between 2 different servers:

The agent starts up and performs some service discovery to find the servers:

INFO: 2019/08/15 14:01:25 parsed scheme: "dns"
INFO: 2019/08/15 14:01:25 ccResolverWrapper: got new service config:
INFO: 2019/08/15 14:01:25 ccResolverWrapper: sending new addresses to cc: [{10.244.2.2:443 1 10-244-2-2.kiam-server.kube-system.svc.cluster.local. <nil>} {10.244.1.3:443 1 10-244-1-3.kiam-server.kube-system.svc.cluster.local. <nil>} {10.244.1.3:443 0  <nil>} {10.244.2.2:443 0  <nil>}]

The agent connects to the servers successfully (below) - note that each server is associated with a hex value, in this instance we have the following 2 servers:
10.244.1.3 - 0xc4203c1d40
10.244.2.2 - 0xc4203c1d60

INFO: 2019/08/15 14:01:25 base.baseBalancer: got new ClientConn state:  {{[{10.244.1.3:443 0  <nil>} {10.244.2.2:443 0  <nil>}] <nil>} <nil>}
INFO: 2019/08/15 14:01:25 base.baseBalancer: handle SubConn state change: 0xc4203c1d40, CONNECTING
INFO: 2019/08/15 14:01:25 base.baseBalancer: handle SubConn state change: 0xc4203c1d60, CONNECTING
INFO: 2019/08/15 14:01:25 base.baseBalancer: handle SubConn state change: 0xc4203c1d60, READY
INFO: 2019/08/15 14:01:25 roundrobinPicker: newPicker called with readySCs: map[{10.244.2.2:443 0  <nil>}:0xc4203c1d60]

^ here, one of the grpc connections becomes ready before the other and the 'roundrobinPicker' lists this as the only server available in its pool to load balance with.

INFO: 2019/08/15 14:01:25 base.baseBalancer: handle SubConn state change: 0xc4203c1d40, READY
INFO: 2019/08/15 14:01:25 roundrobinPicker: newPicker called with readySCs: map[{10.244.1.3:443 0  <nil>}:0xc4203c1d40 {10.244.2.2:443 0  <nil>}:0xc4203c1d60]

^ the other connection becomes ready and the load balancing pool is updated to include the second server.

At this point, I kill one of the Kiam servers (10.244.1.3 - 0xc4203c1d40):

INFO: 2019/08/15 14:15:32 transport: loopyWriter.run returning. connection error: desc = "transport is closing"
INFO: 2019/08/15 14:15:32 base.baseBalancer: handle SubConn state change: 0xc4203c1d40, CONNECTING
WARNING: 2019/08/15 14:15:32 grpc: addrConn.createTransport failed to connect to {10.244.1.3:443 0  <nil>}. Err :connection error: desc = "transport: Error while dialing dial tcp 10.244.1.3:443: connect: connection refused". Reconnecting...`

^ the agent attempts to reconnect to the server I killed. The agent will continue to try and re-connect with the server that has gone away over time (I have omitted these logs)

INFO: 2019/08/15 14:15:32 roundrobinPicker: newPicker called with readySCs: map[{10.244.2.2:443 0  <nil>}:0xc4203c1d60]

^ immediately after the connection error, the pool of servers to load balance with is updated to just include the 1 remaining server.

Throughout this, my curls from another pod to the meta-data API have been successful - there hasn't been any sign of the issue reported (these curls will also show in the agent logs but have been omitted above for clarity).

Would somebody be able to set the grpc log level env vars mentioned above in their agent deployments and provide agent logs? Including examples of agent startup and when a server is killed, so we can compare. Thanks.

mehstg · 2019-08-16T13:17:03Z

Hi. I have collected logs as requested.

Kiam Agent startup logs:

`{"level":"info","msg":"configuring iptables","time":"2019-08-16T12:47:59Z"}
{"level":"info","msg":"started prometheus metric listener 0.0.0.0:9620","time":"2019-08-16T12:47:59Z"}
INFO: 2019/08/16 12:47:59 parsed scheme: "dns"
INFO: 2019/08/16 12:47:59 ccResolverWrapper: got new service config: 
INFO: 2019/08/16 12:47:59 ccResolverWrapper: sending new addresses to cc: [{10.2.81.71:443 1 10-2-81-71.kiam-server.iam-proxy.svc.cluster.local. <nil>} {10.2.118.49:443 1 10-2-118-49.kiam-server.iam-proxy.svc.cluster.local. <nil>} {10.2.118.49:443 0  <nil>} {10.2.81.71:443 0  <nil>}]
INFO: 2019/08/16 12:47:59 base.baseBalancer: got new resolved addresses:  [{10.2.81.71:443 1 10-2-81-71.kiam-server.iam-proxy.svc.cluster.local. <nil>} {10.2.118.49:443 1 10-2-118-49.kiam-server.iam-proxy.svc.cluster.local. <nil>} {10.2.118.49:443 0  <nil>} {10.2.81.71:443 0  <nil>}]
INFO: 2019/08/16 12:47:59 base.baseBalancer: handle SubConn state change: 0xc4201891a0, CONNECTING
INFO: 2019/08/16 12:47:59 base.baseBalancer: handle SubConn state change: 0xc4201891c0, CONNECTING
INFO: 2019/08/16 12:47:59 base.baseBalancer: handle SubConn state change: 0xc4201891e0, CONNECTING
INFO: 2019/08/16 12:47:59 base.baseBalancer: handle SubConn state change: 0xc420189210, CONNECTING
INFO: 2019/08/16 12:47:59 base.baseBalancer: handle SubConn state change: 0xc4201891c0, READY
INFO: 2019/08/16 12:47:59 roundrobinPicker: newPicker called with readySCs: map[{10.2.118.49:443 1 10-2-118-49.kiam-server.iam-proxy.svc.cluster.local. <nil>}:0xc4201891c0]
{"level":"info","msg":"listening :8181","time":"2019-08-16T12:47:59Z"}
INFO: 2019/08/16 12:47:59 base.baseBalancer: handle SubConn state change: 0xc4201891e0, READY
INFO: 2019/08/16 12:47:59 roundrobinPicker: newPicker called with readySCs: map[{10.2.118.49:443 1 10-2-118-49.kiam-server.iam-proxy.svc.cluster.local. <nil>}:0xc4201891c0 {10.2.118.49:443 0  <nil>}:0xc4201891e0]
INFO: 2019/08/16 12:47:59 base.baseBalancer: handle SubConn state change: 0xc420189210, READY
INFO: 2019/08/16 12:47:59 roundrobinPicker: newPicker called with readySCs: map[{10.2.118.49:443 1 10-2-118-49.kiam-server.iam-proxy.svc.cluster.local. <nil>}:0xc4201891c0 {10.2.118.49:443 0  <nil>}:0xc4201891e0 {10.2.81.71:443 0  <nil>}:0xc420189210]
INFO: 2019/08/16 12:47:59 base.baseBalancer: handle SubConn state change: 0xc4201891a0, READY
INFO: 2019/08/16 12:47:59 roundrobinPicker: newPicker called with readySCs: map[{10.2.81.71:443 1 10-2-81-71.kiam-server.iam-proxy.svc.cluster.local. <nil>}:0xc4201891a0 {10.2.118.49:443 1 10-2-118-49.kiam-server.iam-proxy.svc.cluster.local. <nil>}:0xc4201891c0 {10.2.118.49:443 0  <nil>}:0xc4201891e0 {10.2.81.71:443 0  <nil>}:0xc420189210]
{"addr":"10.2.45.198:41134","duration":0,"headers":{},"level":"debug","method":"GET","msg":"processed request","path":"/ping","status":200,"time":"2019-08-16T12:48:03Z"}
{"addr":"10.2.45.198:41172","duration":0,"headers":{},"level":"debug","method":"GET","msg":"processed request","path":"/ping","status":200,"time":"2019-08-16T12:48:06Z"}
{"addr":"10.2.45.198:41188","duration":0,"headers":{},"level":"debug","method":"GET","msg":"processed request","path":"/ping","status":200,"time":"2019-08-16T12:48:09Z"}
{"addr":"10.2.45.198:41198","duration":0,"headers":{},"level":"debug","method":"GET","msg":"processed request","path":"/ping","status":200,"time":"2019-08-16T12:48:12Z"}
{"addr":"10.2.45.198:41240","duration":0,"headers":{},"level":"debug","method":"GET","msg":"processed request","path":"/ping","status":200,"time":"2019-08-16T12:48:15Z"}
{"addr":"10.2.45.198:41264","duration":0,"headers":{},"level":"debug","method":"GET","msg":"processed request","path":"/ping","status":200,"time":"2019-08-16T12:48:18Z"}
{"addr":"10.2.45.198:41276","duration":0,"headers":{},"level":"debug","method":"GET","msg":"processed request","path":"/ping","status":200,"time":"2019-08-16T12:48:21Z"}
{"addr":"10.2.45.198:41310","duration":0,"headers":{},"level":"debug","method":"GET","msg":"processed request","path":"/ping","status":200,"time":"2019-08-16T12:48:24Z"}
{"addr":"10.2.45.198:41336","duration":0,"headers":{},"level":"debug","method":"GET","msg":"processed request","path":"/ping","status":200,"time":"2019-08-16T12:48:27Z"}
{"addr":"10.2.45.198:41348","duration":0,"headers":{},"level":"debug","method":"GET","msg":"processed request","path":"/ping","status":200,"time":"2019-08-16T12:48:30Z"}`

We have two KIAM servers running,
1 10-2-81-71.kiam-server.iam-proxy.svc.cluster.local 10-2-118-49.kiam-server.iam-proxy.svc.cluster.local
If I run a curl from one of my containers, it works successfully:

`root@channel-service-588c949b66-tvxrr:/opt/app# curl http://169.254.169.254/latest/meta-data/iam/security-credentials/stage-eu-west-1-stage01-pod-connectt-api-channel-service
{"Code":"Success","Type":"AWS-HMAC","AccessKeyId":"redacted","SecretAccessKey":"redacted","Token":"redacted","Expiration":"2019-08-16T13:19:47Z","LastUpdated":"2019-08-16T13:04:47Z"}`

I then kill the EC2 host running one of the KIAM servers (10-2-118-49.kiam-server.iam-proxy.svc.cluster.local), these are specific worker nodes that only run the KIAM servers.
Almost immediately the curl begins to fail:

`root@channel-service-588c949b66-tvxrr:/opt/app# curl http://169.254.169.254/latest/meta-data/iam/security-credentials/stage-eu-west-1-stage01-pod-connectt-api-channel-service
error fetching credentials: rpc error: code = DeadlineExceeded desc = context deadline exceeded`

The KIAM agent logs during this period are as follows:

`{"addr":"10.2.39.230:52366","level":"error","method":"GET","msg":"error processing request: error fetching credentials: rpc error: code = DeadlineExceeded desc = context deadline exceeded","path":"/latest/meta-data/iam/security-credentials/stage-eu-west-1-stage01-pod-connectt-api-channel-service","status":500,"time":"2019-08-16T12:50:52Z"}
{"addr":"10.2.39.230:52366","duration":5000,"headers":{"Content-Type":["text/plain; charset=utf-8"],"X-Content-Type-Options":["nosniff"]},"level":"info","method":"GET","msg":"processed request","path":"/latest/meta-data/iam/security-credentials/stage-eu-west-1-stage01-pod-connectt-api-channel-service","status":500,"time":"2019-08-16T12:50:52Z"}
{"addr":"10.2.45.198:42582","duration":0,"headers":{},"level":"debug","method":"GET","msg":"processed request","path":"/ping","status":200,"time":"2019-08-16T12:50:54Z"}
{"addr":"10.2.39.230:52422","level":"error","method":"GET","msg":"error processing request: error fetching credentials: rpc error: code = Canceled desc = context canceled","path":"/latest/meta-data/iam/security-credentials/stage-eu-west-1-stage01-pod-connectt-api-channel-service","status":500,"time":"2019-08-16T12:50:56Z"}
{"addr":"10.2.39.230:52422","duration":2407,"headers":{"Content-Type":["text/plain; charset=utf-8"],"X-Content-Type-Options":["nosniff"]},"level":"info","method":"GET","msg":"processed request","path":"/latest/meta-data/iam/security-credentials/stage-eu-west-1-stage01-pod-connectt-api-channel-service","status":500,"time":"2019-08-16T12:50:56Z"}
{"addr":"10.2.45.198:42612","duration":0,"headers":{},"level":"debug","method":"GET","msg":"processed request","path":"/ping","status":200,"time":"2019-08-16T12:50:57Z"}
{"addr":"10.2.45.198:42620","duration":0,"headers":{},"level":"debug","method":"GET","msg":"processed request","path":"/ping","status":200,"time":"2019-08-16T12:51:00Z"}
{"addr":"10.2.39.230:52466","duration":2,"headers":{"Content-Type":["application/json"]},"level":"debug","method":"GET","msg":"processed request","path":"/latest/meta-data/iam/security-credentials/stage-eu-west-1-stage01-pod-connectt-api-channel-service","status":200,"time":"2019-08-16T12:51:01Z"}
{"addr":"10.2.45.198:42650","duration":0,"headers":{},"level":"debug","method":"GET","msg":"processed request","path":"/ping","status":200,"time":"2019-08-16T12:51:03Z"}
{"addr":"10.2.39.230:52488","duration":1,"headers":{"Content-Type":["application/json"]},"level":"debug","method":"GET","msg":"processed request","path":"/latest/meta-data/iam/security-credentials/stage-eu-west-1-stage01-pod-connectt-api-channel-service","status":200,"time":"2019-08-16T12:51:03Z"}
{"addr":"10.2.45.198:42688","duration":0,"headers":{},"level":"debug","method":"GET","msg":"processed request","path":"/ping","status":200,"time":"2019-08-16T12:51:06Z"}
{"addr":"10.2.39.230:52504","level":"error","method":"GET","msg":"error processing request: error fetching credentials: rpc error: code = DeadlineExceeded desc = context deadline exceeded","path":"/latest/meta-data/iam/security-credentials/stage-eu-west-1-stage01-pod-connectt-api-channel-service","status":500,"time":"2019-08-16T12:51:09Z"}
{"addr":"10.2.39.230:52504","duration":5000,"headers":{"Content-Type":["text/plain; charset=utf-8"],"X-Content-Type-Options":["nosniff"]},"level":"info","method":"GET","msg":"processed request","path":"/latest/meta-data/iam/security-credentials/stage-eu-west-1-stage01-pod-connectt-api-channel-service","status":500,"time":"2019-08-16T12:51:09Z"}
{"addr":"10.2.45.198:42700","duration":0,"headers":{},"level":"debug","method":"GET","msg":"processed request","path":"/ping","status":200,"time":"2019-08-16T12:51:09Z"}
{"addr":"10.2.45.198:42710","duration":0,"headers":{},"level":"debug","method":"GET","msg":"processed request","path":"/ping","status":200,"time":"2019-08-16T12:51:12Z"}
{"addr":"10.2.45.198:42748","duration":0,"headers":{},"level":"debug","method":"GET","msg":"processed request","path":"/ping","status":200,"time":"2019-08-16T12:51:15Z"}
{"addr":"10.2.45.198:42772","duration":0,"headers":{},"level":"debug","method":"GET","msg":"processed request","path":"/ping","status":200,"time":"2019-08-16T12:51:18Z"}
{"addr":"10.2.45.198:42790","duration":0,"headers":{},"level":"debug","method":"GET","msg":"processed request","path":"/ping","status":200,"time":"2019-08-16T12:51:21Z"}
{"addr":"10.2.39.230:52602","level":"error","method":"GET","msg":"error processing request: error fetching credentials: rpc error: code = DeadlineExceeded desc = context deadline exceeded","path":"/latest/meta-data/iam/security-credentials/stage-eu-west-1-stage01-pod-connectt-api-channel-service","status":500,"time":"2019-08-16T12:51:21Z"}
{"addr":"10.2.39.230:52602","duration":5000,"headers":{"Content-Type":["text/plain; charset=utf-8"],"X-Content-Type-Options":["nosniff"]},"level":"info","method":"GET","msg":"processed request","path":"/latest/meta-data/iam/security-credentials/stage-eu-west-1-stage01-pod-connectt-api-channel-service","status":500,"time":"2019-08-16T12:51:21Z"}
{"addr":"10.2.45.198:42824","duration":0,"headers":{},"level":"debug","method":"GET","msg":"processed request","path":"/ping","status":200,"time":"2019-08-16T12:51:24Z"}
{"addr":"10.2.45.198:42850","duration":0,"headers":{},"level":"debug","method":"GET","msg":"processed request","path":"/ping","status":200,"time":"2019-08-16T12:51:27Z"}
{"addr":"10.2.45.198:42858","duration":0,"headers":{},"level":"debug","method":"GET","msg":"processed request","path":"/ping","status":200,"time":"2019-08-16T12:51:30Z"}
`

rhysemmas · 2019-08-16T13:45:03Z

Hi @mehstg - thanks for providing those logs. It is strange that you aren't seeing any gRPC logs (the INFO and WARNING events) in the agent logs after killing the kiam server - are there definitely none of these entries in the agent logs when the cURLs begin to fail?

mehstg · 2019-08-16T13:59:41Z

No unfortunately not. I just tested again and I only really see the info messages during startup.

rhysemmas · 2019-08-16T14:27:21Z

OK - thanks for double checking. A few questions:

The IP addresses that are resolved in your logs above, are these the same as your kiam server pod IPs?
Could you confirm what your cpu and memory resource requests/limits are for the agent pods?
Are you also using the helm chart to deploy kiam?

Also, could you please try running kiam v3.4 (released today), as this includes some gRPC updates, and let us know if this changes any behaviour.

mehstg · 2019-08-16T14:55:49Z

I actually am not setting any resources on the pods. Deploying from the stable helm charts (v2.2.0).
IP addresses resolved are correct.
Will try upgrading to v3.4 on one of the test clusters on Monday and test again

rhysemmas · 2019-08-16T15:15:48Z

I'd recommend setting some resource requests, at least for the agent pods so they are not starved of resources. We use the following values for the agents which works fine (but will depend on the scale of your cluster)

resources:
            requests:
              cpu: 5m
              memory: 50Mi
            limits:
              memory: 100Mi

and the server resources are set to the following:

resources:
            requests:
              cpu: 500m
              memory: 256Mi
            limits:
              memory: 512Mi

mehstg · 2019-08-19T12:59:29Z

I have set resource requests on my KIAM pods and also upgraded to KIAM v3.4 on one of my test cluster.
Running the same test, I am seeing the same issues, however it seems to be alternate requests fail now after killing one of my KIAM servers (This cluster has two KIAM servers running).

Previously, every request would die after killing one of the KIAM servers, so I wonder if the load balancing is working better in v3.4 but the dead server is still being cached somewhere. Log attached

{"addr":"10.2.67.14:59876","duration":0,"headers":{},"level":"debug","method":"GET","msg":"processed request","path":"/ping","status":200,"time":"2019-08-19T12:54:33Z"}
{"addr":"10.2.71.87:54470","level":"error","method":"GET","msg":"error processing request: error fetching credentials: rpc error: code = DeadlineExceeded desc = context deadline exceeded","path":"/latest/meta-data/iam/security-credentials/stage-eu-west-1-stage01-pod-connectt-api-channel-service","status":500,"time":"2019-08-19T12:54:33Z"}
{"addr":"10.2.71.87:54470","duration":5000,"headers":{"Content-Type":["text/plain; charset=utf-8"],"X-Content-Type-Options":["nosniff"]},"level":"info","method":"GET","msg":"processed request","path":"/latest/meta-data/iam/security-credentials/stage-eu-west-1-stage01-pod-connectt-api-channel-service","status":500,"time":"2019-08-19T12:54:33Z"}
{"addr":"10.2.71.87:54516","duration":1,"headers":{"Content-Type":["application/json"]},"level":"debug","method":"GET","msg":"processed request","path":"/latest/meta-data/iam/security-credentials/stage-eu-west-1-stage01-pod-connectt-api-channel-service","status":200,"time":"2019-08-19T12:54:35Z"}
{"addr":"10.2.67.14:59892","duration":0,"headers":{},"level":"debug","method":"GET","msg":"processed request","path":"/ping","status":200,"time":"2019-08-19T12:54:36Z"}
{"addr":"10.2.67.14:59914","duration":0,"headers":{},"level":"debug","method":"GET","msg":"processed request","path":"/ping","status":200,"time":"2019-08-19T12:54:39Z"}
{"addr":"10.2.71.87:54524","level":"error","method":"GET","msg":"error processing request: error fetching credentials: rpc error: code = Canceled desc = context canceled","path":"/latest/meta-data/iam/security-credentials/stage-eu-west-1-stage01-pod-connectt-api-channel-service","status":500,"time":"2019-08-19T12:54:39Z"}
{"addr":"10.2.71.87:54524","duration":2004,"headers":{"Content-Type":["text/plain; charset=utf-8"],"X-Content-Type-Options":["nosniff"]},"level":"info","method":"GET","msg":"processed request","path":"/latest/meta-data/iam/security-credentials/stage-eu-west-1-stage01-pod-connectt-api-channel-service","status":500,"time":"2019-08-19T12:54:39Z"}

daviddyball · 2019-09-11T20:37:54Z

I have been bitten by this as well. I haven't had a chance to enable gRPC logs to provide, but I am wondering if this has to do with kiam-server running as a DaemonSet and not receiving proper signals when a master terminates (e.g. when running a kops rolling-update cluster on masters).

My knowledge of gRPC is lacking, but perhaps when you manually terminate a kiam-server pod by running kubectl delete pod kiam-server-xxxxxx it sends some kind of message to the gRPC client in kiam-agent?

I'll try and get some time to do further testing on my end and let you know how I go anyway.

As a side-note I have a PR open to convert the DaemonSet for kiam-server to a Deployment so that it handles termination more gracefully. Perhaps the two are linked?

daviddyball · 2019-09-12T10:30:59Z

Here's the agent logs when running as a DaemonSet while using kops rolling-update cluster to recycle our master nodes:

INFO: 2019/09/12 10:12:33 transport: loopyWriter.run returning. connection error: desc = "transport is closing"
INFO: 2019/09/12 10:12:33 base.baseBalancer: handle SubConn state change: 0xc4201779f0, TRANSIENT_FAILURE
INFO: 2019/09/12 10:12:33 roundrobinPicker: newPicker called with readySCs: map[{100.96.12.99:443 1 100-96-12-99.kiam-server.kube-system.svc.cluster.local. <nil>}:0xc4203d4db0 {100.96.9.121:443 0  <nil>}:0xc4203d5b10 {100.96.12.99:443 0  <nil>}:0xc4203d4e80 {100.96.11.93:443 1 100-96-11-93.kiam-server.kube-system.svc.cluster.local. <nil>}:0xc420177a10 {100.96.10.125:443 1 100-96-10-125.kiam-server.kube-system.svc.cluster.local. <nil>}:0xc420177a30 {100.96.11.93:443 0  <nil>}:0xc420177ac0 {100.96.10.125:443 0  <nil>}:0xc420177aa0]
INFO: 2019/09/12 10:12:33 base.baseBalancer: handle SubConn state change: 0xc4201779f0, CONNECTING
INFO: 2019/09/12 10:12:33 base.baseBalancer: handle SubConn state change: 0xc4203d5b10, TRANSIENT_FAILURE
INFO: 2019/09/12 10:12:33 roundrobinPicker: newPicker called with readySCs: map[{100.96.12.99:443 1 100-96-12-99.kiam-server.kube-system.svc.cluster.local. <nil>}:0xc4203d4db0 {100.96.12.99:443 0  <nil>}:0xc4203d4e80 {100.96.10.125:443 0  <nil>}:0xc420177aa0 {100.96.10.125:443 1 100-96-10-125.kiam-server.kube-system.svc.cluster.local. <nil>}:0xc420177a30 {100.96.11.93:443 0  <nil>}:0xc420177ac0 {100.96.11.93:443 1 100-96-11-93.kiam-server.kube-system.svc.cluster.local. <nil>}:0xc420177a10]
INFO: 2019/09/12 10:12:33 base.baseBalancer: handle SubConn state change: 0xc4203d5b10, CONNECTING
INFO: 2019/09/12 10:12:33 ccResolverWrapper: got new service config: 
INFO: 2019/09/12 10:12:33 ccResolverWrapper: sending new addresses to cc: [{100.96.9.121:443 1 100-96-9-121.kiam-server.kube-system.svc.cluster.local. <nil>} {100.96.11.93:443 1 100-96-11-93.kiam-server.kube-system.svc.cluster.local. <nil>} {100.96.10.125:443 1 100-96-10-125.kiam-server.kube-system.svc.cluster.local. <nil>} {100.96.0.236:443 1 100-96-0-236.kiam-server.kube-system.svc.cluster.local. <nil>} {100.96.12.99:443 1 100-96-12-99.kiam-server.kube-system.svc.cluster.local. <nil>} {100.96.0.236:443 0  <nil>} {100.96.10.125:443 0  <nil>} {100.96.11.93:443 0  <nil>} {100.96.12.99:443 0  <nil>} {100.96.9.121:443 0  <nil>}]
INFO: 2019/09/12 10:12:33 base.baseBalancer: got new resolved addresses:  [{100.96.9.121:443 1 100-96-9-121.kiam-server.kube-system.svc.cluster.local. <nil>} {100.96.11.93:443 1 100-96-11-93.kiam-server.kube-system.svc.cluster.local. <nil>} {100.96.10.125:443 1 100-96-10-125.kiam-server.kube-system.svc.cluster.local. <nil>} {100.96.0.236:443 1 100-96-0-236.kiam-server.kube-system.svc.cluster.local. <nil>} {100.96.12.99:443 1 100-96-12-99.kiam-server.kube-system.svc.cluster.local. <nil>} {100.96.0.236:443 0  <nil>} {100.96.10.125:443 0  <nil>} {100.96.11.93:443 0  <nil>} {100.96.12.99:443 0  <nil>} {100.96.9.121:443 0  <nil>}]
INFO: 2019/09/12 10:12:33 base.baseBalancer: handle SubConn state change: 0xc4203d4790, CONNECTING
INFO: 2019/09/12 10:12:33 base.baseBalancer: handle SubConn state change: 0xc4203d47d0, CONNECTING
INFO: 2019/09/12 10:12:33 ccResolverWrapper: got new service config: 
INFO: 2019/09/12 10:12:33 ccResolverWrapper: sending new addresses to cc: [{100.96.12.99:443 1 100-96-12-99.kiam-server.kube-system.svc.cluster.local. <nil>} {100.96.10.125:443 1 100-96-10-125.kiam-server.kube-system.svc.cluster.local. <nil>} {100.96.9.121:443 1 100-96-9-121.kiam-server.kube-system.svc.cluster.local. <nil>} {100.96.0.236:443 1 100-96-0-236.kiam-server.kube-system.svc.cluster.local. <nil>} {100.96.11.93:443 1 100-96-11-93.kiam-server.kube-system.svc.cluster.local. <nil>} {100.96.0.236:443 0  <nil>} {100.96.10.125:443 0  <nil>} {100.96.11.93:443 0  <nil>} {100.96.12.99:443 0  <nil>} {100.96.9.121:443 0  <nil>}]
INFO: 2019/09/12 10:12:33 base.baseBalancer: got new resolved addresses:  [{100.96.12.99:443 1 100-96-12-99.kiam-server.kube-system.svc.cluster.local. <nil>} {100.96.10.125:443 1 100-96-10-125.kiam-server.kube-system.svc.cluster.local. <nil>} {100.96.9.121:443 1 100-96-9-121.kiam-server.kube-system.svc.cluster.local. <nil>} {100.96.0.236:443 1 100-96-0-236.kiam-server.kube-system.svc.cluster.local. <nil>} {100.96.11.93:443 1 100-96-11-93.kiam-server.kube-system.svc.cluster.local. <nil>} {100.96.0.236:443 0  <nil>} {100.96.10.125:443 0  <nil>} {100.96.11.93:443 0  <nil>} {100.96.12.99:443 0  <nil>} {100.96.9.121:443 0  <nil>}]
INFO: 2019/09/12 10:12:33 base.baseBalancer: handle SubConn state change: 0xc4203d4790, READY
INFO: 2019/09/12 10:12:33 roundrobinPicker: newPicker called with readySCs: map[{100.96.0.236:443 1 100-96-0-236.kiam-server.kube-system.svc.cluster.local. <nil>}:0xc4203d4790 {100.96.11.93:443 0  <nil>}:0xc420177ac0 {100.96.11.93:443 1 100-96-11-93.kiam-server.kube-system.svc.cluster.local. <nil>}:0xc420177a10 {100.96.10.125:443 1 100-96-10-125.kiam-server.kube-system.svc.cluster.local. <nil>}:0xc420177a30 {100.96.12.99:443 0  <nil>}:0xc4203d4e80 {100.96.10.125:443 0  <nil>}:0xc420177aa0 {100.96.12.99:443 1 100-96-12-99.kiam-server.kube-system.svc.cluster.local. <nil>}:0xc4203d4db0]
INFO: 2019/09/12 10:12:33 base.baseBalancer: handle SubConn state change: 0xc4203d47d0, READY
INFO: 2019/09/12 10:12:33 roundrobinPicker: newPicker called with readySCs: map[{100.96.0.236:443 0  <nil>}:0xc4203d47d0 {100.96.10.125:443 0  <nil>}:0xc420177aa0 {100.96.12.99:443 1 100-96-12-99.kiam-server.kube-system.svc.cluster.local. <nil>}:0xc4203d4db0 {100.96.0.236:443 1 100-96-0-236.kiam-server.kube-system.svc.cluster.local. <nil>}:0xc4203d4790 {100.96.11.93:443 0  <nil>}:0xc420177ac0 {100.96.11.93:443 1 100-96-11-93.kiam-server.kube-system.svc.cluster.local. <nil>}:0xc420177a10 {100.96.10.125:443 1 100-96-10-125.kiam-server.kube-system.svc.cluster.local. <nil>}:0xc420177a30 {100.96.12.99:443 0  <nil>}:0xc4203d4e80]

{"level":"warning","msg":"error finding role for pod: rpc error: code = DeadlineExceeded desc = context deadline exceeded","pod.ip":"100.96.14.178","time":"2019-09-12T10:12:34Z"}
{"addr":"100.96.14.178:35122","level":"error","method":"GET","msg":"error processing request: rpc error: code = DeadlineExceeded desc = context deadline exceeded","path":"/latest/meta-data/iam/security-credentials/","status":500,"time":"2019-09-12T10:12:34Z"}
{"addr":"100.96.14.178:35122","duration":5000,"headers":{"Content-Type":["text/plain; charset=utf-8"],"X-Content-Type-Options":["nosniff"]},"level":"info","method":"GET","msg":"processed request","path":"/latest/meta-data/iam/security-credentials/","status":500,"time":"2019-09-12T10:12:34Z"}

INFO: 2019/09/12 10:12:34 transport: loopyWriter.run returning. connection error: desc = "transport is closing"

WARNING: 2019/09/12 10:12:53 grpc: addrConn.createTransport failed to connect to {100.96.9.121:443 1 100-96-9-121.kiam-server.kube-system.svc.cluster.local. <nil>}. Err :connection error: desc = "transport: Error while dialing dial tcp 100.96.9.121:443: i/o timeout". Reconnecting...
WARNING: 2019/09/12 10:12:53 grpc: addrConn.createTransport failed to connect to {100.96.9.121:443 0  <nil>}. Err :connection error: desc = "transport: Error while dialing dial tcp 100.96.9.121:443: i/o timeout". Reconnecting...
INFO: 2019/09/12 10:12:53 base.baseBalancer: handle SubConn state change: 0xc4201779f0, TRANSIENT_FAILURE
INFO: 2019/09/12 10:12:53 base.baseBalancer: handle SubConn state change: 0xc4203d5b10, TRANSIENT_FAILURE
INFO: 2019/09/12 10:12:53 base.baseBalancer: handle SubConn state change: 0xc4203d5b10, CONNECTING
INFO: 2019/09/12 10:12:53 base.baseBalancer: handle SubConn state change: 0xc4201779f0, CONNECTING
INFO: 2019/09/12 10:12:53 ccResolverWrapper: got new service config: 
INFO: 2019/09/12 10:12:53 ccResolverWrapper: sending new addresses to cc: [{100.96.0.236:443 1 100-96-0-236.kiam-server.kube-system.svc.cluster.local. <nil>} {100.96.10.125:443 1 100-96-10-125.kiam-server.kube-system.svc.cluster.local. <nil>} {100.96.11.93:443 1 100-96-11-93.kiam-server.kube-system.svc.cluster.local. <nil>} {100.96.12.99:443 1 100-96-12-99.kiam-server.kube-system.svc.cluster.local. <nil>} {100.96.0.236:443 0  <nil>} {100.96.10.125:443 0  <nil>} {100.96.11.93:443 0  <nil>} {100.96.12.99:443 0  <nil>}]
INFO: 2019/09/12 10:12:53 base.baseBalancer: got new resolved addresses:  [{100.96.0.236:443 1 100-96-0-236.kiam-server.kube-system.svc.cluster.local. <nil>} {100.96.10.125:443 1 100-96-10-125.kiam-server.kube-system.svc.cluster.local. <nil>} {100.96.11.93:443 1 100-96-11-93.kiam-server.kube-system.svc.cluster.local. <nil>} {100.96.12.99:443 1 100-96-12-99.kiam-server.kube-system.svc.cluster.local. <nil>} {100.96.0.236:443 0  <nil>} {100.96.10.125:443 0  <nil>} {100.96.11.93:443 0  <nil>} {100.96.12.99:443 0  <nil>}]
INFO: 2019/09/12 10:12:53 base.baseBalancer: handle SubConn state change: 0xc4203d5b10, SHUTDOWN
INFO: 2019/09/12 10:12:53 base.baseBalancer: handle SubConn state change: 0xc4201779f0, SHUTDOWN
WARNING: 2019/09/12 10:12:53 grpc: addrConn.transportMonitor exits due to: grpc: the connection is closing
WARNING: 2019/09/12 10:12:53 grpc: addrConn.transportMonitor exits due to: grpc: the connection is closing
INFO: 2019/09/12 10:12:53 ccResolverWrapper: got new service config: 
INFO: 2019/09/12 10:12:53 ccResolverWrapper: sending new addresses to cc: [{100.96.0.236:443 1 100-96-0-236.kiam-server.kube-system.svc.cluster.local. <nil>} {100.96.10.125:443 1 100-96-10-125.kiam-server.kube-system.svc.cluster.local. <nil>} {100.96.11.93:443 1 100-96-11-93.kiam-server.kube-system.svc.cluster.local. <nil>} {100.96.12.99:443 1 100-96-12-99.kiam-server.kube-system.svc.cluster.local. <nil>} {100.96.0.236:443 0  <nil>} {100.96.10.125:443 0  <nil>} {100.96.11.93:443 0  <nil>} {100.96.12.99:443 0  <nil>}]

At 10:12:34Z you can see the agent returns context deadline exceeded to the client, which is what is causing issues. Our client apps receive this as the response from the meta-data API and this causes further issues (confirmed by running a pod curling the meta-data endpoint every second. At least on request came back with context deadline exceeded as the response).

I think this is down to the kiam-server being available right up until the power is switched off (EC2 API call to TerminateInstance). I think there's a TCP wait/timeout thing going on, where kiam-agent was either 1/2 way through a request to the now dead server, or just thinks the server is still there because it never got notified it was going away.

I'm off to test this when running the kiam-server pods as part of a Deployment. My hypothesis is that they will receive graceful shutdowns and notify any gRPC clients it's talking to that it is going away. We'll see

daviddyball · 2019-09-12T10:51:53Z

OK testing with kiam-server as a Deployment completed and it looks a lot healthier:

INFO: 2019/09/12 10:47:05 transport: loopyWriter.run returning. connection error: desc = "transport is closing"
INFO: 2019/09/12 10:47:05 transport: loopyWriter.run returning. connection error: desc = "transport is closing"
INFO: 2019/09/12 10:47:05 base.baseBalancer: handle SubConn state change: 0xc4203b6df0, TRANSIENT_FAILURE
INFO: 2019/09/12 10:47:05 roundrobinPicker: newPicker called with readySCs: map[{100.96.3.77:443 0  <nil>}:0xc420520af0 {100.96.0.159:443 0  <nil>}:0xc4203b6e30 {100.96.10.42:443 0  <nil>}:0xc4203b6e70 {100.96.3.77:443 1 100-96-3-77.kiam-server.kube-system.svc.cluster.local. <nil>}:0xc420520ad0 {100.96.11.137:443 1 100-96-11-137.kiam-server.kube-system.svc.cluster.local. <nil>}:0xc4203b6da0 {100.96.11.137:443 0  <nil>}:0xc4203b6e90 {100.96.0.159:443 1 100-96-0-159.kiam-server.kube-system.svc.cluster.local. <nil>}:0xc4203b6d80]
INFO: 2019/09/12 10:47:05 base.baseBalancer: handle SubConn state change: 0xc4203b6df0, CONNECTING
INFO: 2019/09/12 10:47:05 base.baseBalancer: handle SubConn state change: 0xc4203b6e70, TRANSIENT_FAILURE
INFO: 2019/09/12 10:47:05 roundrobinPicker: newPicker called with readySCs: map[{100.96.11.137:443 1 100-96-11-137.kiam-server.kube-system.svc.cluster.local. <nil>}:0xc4203b6da0 {100.96.0.159:443 0  <nil>}:0xc4203b6e30 {100.96.3.77:443 1 100-96-3-77.kiam-server.kube-system.svc.cluster.local. <nil>}:0xc420520ad0 {100.96.0.159:443 1 100-96-0-159.kiam-server.kube-system.svc.cluster.local. <nil>}:0xc4203b6d80 {100.96.3.77:443 0  <nil>}:0xc420520af0 {100.96.11.137:443 0  <nil>}:0xc4203b6e90]
INFO: 2019/09/12 10:47:05 base.baseBalancer: handle SubConn state change: 0xc4203b6e70, CONNECTING
WARNING: 2019/09/12 10:47:05 grpc: addrConn.createTransport failed to connect to {100.96.10.42:443 1 100-96-10-42.kiam-server.kube-system.svc.cluster.local. <nil>}. Err :connection error: desc = "transport: Error while dialing dial tcp 100.96.10.42:443: connect: connection refused". Reconnecting...
INFO: 2019/09/12 10:47:05 base.baseBalancer: handle SubConn state change: 0xc4203b6df0, TRANSIENT_FAILURE
WARNING: 2019/09/12 10:47:05 grpc: addrConn.createTransport failed to connect to {100.96.10.42:443 0  <nil>}. Err :connection error: desc = "transport: Error while dialing dial tcp 100.96.10.42:443: connect: connection refused". Reconnecting...
INFO: 2019/09/12 10:47:05 base.baseBalancer: handle SubConn state change: 0xc4203b6e70, TRANSIENT_FAILURE
INFO: 2019/09/12 10:47:05 ccResolverWrapper: got new service config: 
INFO: 2019/09/12 10:47:05 ccResolverWrapper: sending new addresses to cc: [{100.96.11.137:443 1 100-96-11-137.kiam-server.kube-system.svc.cluster.local. <nil>} {100.96.10.42:443 1 100-96-10-42.kiam-server.kube-system.svc.cluster.local. <nil>} {100.96.4.29:443 1 100-96-4-29.kiam-server.kube-system.svc.cluster.local. <nil>} {100.96.3.77:443 1 100-96-3-77.kiam-server.kube-system.svc.cluster.local. <nil>} {100.96.0.159:443 1 100-96-0-159.kiam-server.kube-system.svc.cluster.local. <nil>} {100.96.0.159:443 0  <nil>} {100.96.11.137:443 0  <nil>} {100.96.3.77:443 0  <nil>} {100.96.4.29:443 0  <nil>}]
INFO: 2019/09/12 10:47:05 base.baseBalancer: got new resolved addresses:  [{100.96.11.137:443 1 100-96-11-137.kiam-server.kube-system.svc.cluster.local. <nil>} {100.96.10.42:443 1 100-96-10-42.kiam-server.kube-system.svc.cluster.local. <nil>} {100.96.4.29:443 1 100-96-4-29.kiam-server.kube-system.svc.cluster.local. <nil>} {100.96.3.77:443 1 100-96-3-77.kiam-server.kube-system.svc.cluster.local. <nil>} {100.96.0.159:443 1 100-96-0-159.kiam-server.kube-system.svc.cluster.local. <nil>} {100.96.0.159:443 0  <nil>} {100.96.11.137:443 0  <nil>} {100.96.3.77:443 0  <nil>} {100.96.4.29:443 0  <nil>}]
INFO: 2019/09/12 10:47:05 base.baseBalancer: handle SubConn state change: 0xc4209792a0, CONNECTING
INFO: 2019/09/12 10:47:05 base.baseBalancer: handle SubConn state change: 0xc4209792c0, CONNECTING
INFO: 2019/09/12 10:47:05 base.baseBalancer: handle SubConn state change: 0xc4203b6e70, SHUTDOWN
WARNING: 2019/09/12 10:47:05 grpc: addrConn.transportMonitor exits due to: context canceled
INFO: 2019/09/12 10:47:05 ccResolverWrapper: got new service config: 
INFO: 2019/09/12 10:47:05 ccResolverWrapper: sending new addresses to cc: [{100.96.0.159:443 1 100-96-0-159.kiam-server.kube-system.svc.cluster.local. <nil>} {100.96.11.137:443 1 100-96-11-137.kiam-server.kube-system.svc.cluster.local. <nil>} {100.96.3.77:443 1 100-96-3-77.kiam-server.kube-system.svc.cluster.local. <nil>} {100.96.4.29:443 1 100-96-4-29.kiam-server.kube-system.svc.cluster.local. <nil>} {100.96.0.159:443 0  <nil>} {100.96.11.137:443 0  <nil>} {100.96.3.77:443 0  <nil>} {100.96.4.29:443 0  <nil>}]
INFO: 2019/09/12 10:47:05 base.baseBalancer: got new resolved addresses:  [{100.96.0.159:443 1 100-96-0-159.kiam-server.kube-system.svc.cluster.local. <nil>} {100.96.11.137:443 1 100-96-11-137.kiam-server.kube-system.svc.cluster.local. <nil>} {100.96.3.77:443 1 100-96-3-77.kiam-server.kube-system.svc.cluster.local. <nil>} {100.96.4.29:443 1 100-96-4-29.kiam-server.kube-system.svc.cluster.local. <nil>} {100.96.0.159:443 0  <nil>} {100.96.11.137:443 0  <nil>} {100.96.3.77:443 0  <nil>} {100.96.4.29:443 0  <nil>}]
INFO: 2019/09/12 10:47:05 base.baseBalancer: handle SubConn state change: 0xc4203b6df0, SHUTDOWN
WARNING: 2019/09/12 10:47:05 grpc: addrConn.transportMonitor exits due to: context canceled
INFO: 2019/09/12 10:47:05 base.baseBalancer: handle SubConn state change: 0xc4209792c0, READY
INFO: 2019/09/12 10:47:05 roundrobinPicker: newPicker called with readySCs: map[{100.96.3.77:443 0  <nil>}:0xc420520af0 {100.96.11.137:443 0  <nil>}:0xc4203b6e90 {100.96.3.77:443 1 100-96-3-77.kiam-server.kube-system.svc.cluster.local. <nil>}:0xc420520ad0 {100.96.11.137:443 1 100-96-11-137.kiam-server.kube-system.svc.cluster.local. <nil>}:0xc4203b6da0 {100.96.4.29:443 0  <nil>}:0xc4209792c0 {100.96.0.159:443 0  <nil>}:0xc4203b6e30 {100.96.0.159:443 1 100-96-0-159.kiam-server.kube-system.svc.cluster.local. <nil>}:0xc4203b6d80]
INFO: 2019/09/12 10:47:05 base.baseBalancer: handle SubConn state change: 0xc4209792a0, READY
INFO: 2019/09/12 10:47:05 roundrobinPicker: newPicker called with readySCs: map[{100.96.3.77:443 0  <nil>}:0xc420520af0 {100.96.4.29:443 1 100-96-4-29.kiam-server.kube-system.svc.cluster.local. <nil>}:0xc4209792a0 {100.96.11.137:443 0  <nil>}:0xc4203b6e90 {100.96.3.77:443 1 100-96-3-77.kiam-server.kube-system.svc.cluster.local. <nil>}:0xc420520ad0 {100.96.11.137:443 1 100-96-11-137.kiam-server.kube-system.svc.cluster.local. <nil>}:0xc4203b6da0 {100.96.4.29:443 0  <nil>}:0xc4209792c0 {100.96.0.159:443 0  <nil>}:0xc4203b6e30 {100.96.0.159:443 1 100-96-0-159.kiam-server.kube-system.svc.cluster.local. <nil>}:0xc4203b6d80]

I rebooted 5 nodes none of my kiam-agents returned context deadline exceeded in this period. I think this is because the pods in a Deployment are correctly terminated in a graceful manner and gRPC can cope with that better than the power cord being unplugged like it is with a DaemonSet.

I'd love for someone to double-check these findings though? The PR for kiam-server as a Deployment is here

I don't think this solves the problem entirely though, as the same "pull the plug" event could happen at any time and agents might still serve context deadline exceeded responses to clients.

rhysemmas · 2019-09-24T15:43:43Z

Hi @daviddyball, thanks for adding your findings. I have been trying to replicate the issue and I am still struggling - killing the Kiam server in various ways (deleting pod, forcefully deleting pod, terminating node, rebooting node) whilst sending requests to a Kiam agent has not reproduced the context deadline exceeded error for me that is being reported.

From looking at the logs you've provided when Kiam server is running as a daemonset, the gRPC client on the agent does seem to be behaving normally when a server goes away; it notices the gRPC server connection failing as the master nodes are recycled:

INFO: 2019/09/12 10:12:33 base.baseBalancer: handle SubConn state change: 0xc4201779f0, TRANSIENT_FAILURE

^ The connection to a server enters failure state

INFO: 2019/09/12 10:12:33 roundrobinPicker: newPicker called with readySCs: map[{100.96.12.99:443 1 100-96-12-99.kiam-server.kube-system.svc.cluster.local. <nil>}:0xc4203d4db0 {100.96.9.121:443 0  <nil>}:0xc4203d5b10 {100.96.12.99:443 0  <nil>}:0xc4203d4e80 {100.96.11.93:443 1 100-96-11-93.kiam-server.kube-system.svc.cluster.local. <nil>}:0xc420177a10 {100.96.10.125:443 1 100-96-10-125.kiam-server.kube-system.svc.cluster.local. <nil>}:0xc420177a30 {100.96.11.93:443 0  <nil>}:0xc420177ac0 {100.96.10.125:443 0  <nil>}:0xc420177aa0]

^ The loadbalancing pool updates immediately to drop the failing server (the 0xc4201779f0 value does not appear in the list)

This occurs with another Kiam server at the same time 10:12:33:

INFO: 2019/09/12 10:12:33 base.baseBalancer: handle SubConn state change: 0xc4203d5b10, TRANSIENT_FAILURE
INFO: 2019/09/12 10:12:33 roundrobinPicker: newPicker called with readySCs: map[{100.96.12.99:443 1 100-96-12-99.kiam-server.kube-system.svc.cluster.local. <nil>}:0xc4203d4db0 {100.96.12.99:443 0  <nil>}:0xc4203d4e80 {100.96.10.125:443 0  <nil>}:0xc420177aa0 {100.96.10.125:443 1 100-96-10-125.kiam-server.kube-system.svc.cluster.local. <nil>}:0xc420177a30 {100.96.11.93:443 0  <nil>}:0xc420177ac0 {100.96.11.93:443 1 100-96-11-93.kiam-server.kube-system.svc.cluster.local. <nil>}:0xc420177a10]

So I am not entirely sure what is causing the issue yet. I am still trying to replicate this issue, though, and if you are happy to share it would be useful to me if you could confirm the following:

How are you deploying Kiam to your cluster (e.g, Helm/manifest files)?
Which version of Kiam are you running?
What DNS service are you running in your cluster (e.g, kube-dns, coreDNS)?

daviddyball · 2019-09-24T16:45:14Z

Hey @rhysemmas,

Here's my setup:

How are you deploying Kiam to your cluster (e.g, Helm/manifest files)?

Flat YAML manifests

Which version of Kiam are you running?

quay.io/uswitch/kiam:v3.3

What DNS service are you running in your cluster (e.g, kube-dns, coreDNS)?

CoreDNS: k8s.gcr.io/coredns:1.1.3

mehstg · 2019-09-24T18:59:25Z

Out of interest David. What CNI are you using? I am using the amazon-vpc-cni. Wondering if there is another similarity in our configurations that we are missing.

…

On Tue, 24 Sep 2019 at 17:45, David Dyball ***@***.***> wrote: Hey @rhysemmas <https://github.com/rhysemmas>, Here's my setup: How are you deploying Kiam to your cluster (e.g, Helm/manifest files)? Flat YAML manifests Which version of Kiam are you running? quay.io/uswitch/kiam:v3.3 What DNS service are you running in your cluster (e.g, kube-dns, coreDNS)? CoreDNS: k8s.gcr.io/coredns:1.1.3 — You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub <#217?email_source=notifications&email_token=ABDWBU6EFHVVVSTOUOKZARTQLI7Z3A5CNFSM4GYFD7D2YY3PNVWWK3TUL52HS4DFVREXG43VMVBW63LNMVXHJKTDN5WW2ZLOORPWSZGOD7PAROA#issuecomment-534644920>, or mute the thread <https://github.com/notifications/unsubscribe-auth/ABDWBU5AHXDZYV62UMVK7HLQLI7Z3ANCNFSM4GYFD7DQ> .

rgardam · 2019-09-25T05:43:56Z

I'm seeing this on a kops update and rolling of the cluster.

I'm seeing kops try to drain the nodes, but fails to take any notice of daemon-sets.
WARNING: Ignoring DaemonSet-managed pods:

Kops then terminates the node via the aws autoscaling group api.

We are using
calico cni
k8s v1.13.10
quay.io/uswitch/kiam:v3.3

daviddyball · 2019-09-25T09:21:28Z

@mehstg I'm using Calcio CNI although I've also recreated this with Cilium on our development cluster

mehstg · 2019-09-25T12:40:17Z

Ha. Ok, so all on different CNI's

…

On Wed, 25 Sep 2019 at 10:21, David Dyball ***@***.***> wrote: @mehstg <https://github.com/mehstg> I'm using Calcio CNI — You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub <#217?email_source=notifications&email_token=ABDWBU6WB5P4QCJY6LQ7EPDQLMURVA5CNFSM4GYFD7D2YY3PNVWWK3TUL52HS4DFVREXG43VMVBW63LNMVXHJKTDN5WW2ZLOORPWSZGOD7RG75Q#issuecomment-534933494>, or mute the thread <https://github.com/notifications/unsubscribe-auth/ABDWBU6C7B4DEGKPPN3345TQLMURVANCNFSM4GYFD7DQ> .

wskulley · 2019-10-31T17:57:04Z

Saw this (again) today on a rolling update of 1.13 masters (kops)

quay.io/uswitch/kiam:v3.3
helm kiam-2.5.1
canal CNI
CoreDNS

daviddyball · 2019-11-01T09:45:55Z

I should report that since moving kiam-server to a Deployment instead of a DaemonSet we've not experienced any issues on our 6 clusters, including a migration from 1.11 -> 1.12 -> 1.13 -> 1.14 for all of them. I think this is down to each server component being gracefully terminated when the Deployment drains correctly due to kubectl drain correctly handling the Deployment and sending proper SIGTERM events to pods.

While this doesn't solve the issue of random server failures causing the same issue, it does mean we can roll our control plane without fear of breaking our workloads IAM credentials.

jacksontj · 2019-11-01T15:48:15Z

If you need to evict a DS before a node dies you can actually do so by applying a NoExecute taint like so:

kubectl taint nodes node1 key=value:NoExecute

This will remove any/all pods from the node -- this should be done after a drain (as usually the running non-ds pods have deps on the ds pods). If you (for some reason) still need a ds instead of a deployment.

daviddyball · 2019-11-01T15:49:53Z

Nice to know @jacksontj. We're using kops to roll our nodes and there's currently no option that I know of to apply an additional taint to a node before it gets killed. It does seem like a good way to get around this issue too while still using a Deployment.

mehstg · 2019-11-06T16:39:23Z

Ok. So I have just been burned by this again on our prod system. Cert-manager replaced the TLS certificates used by KIAM and the kiam-servers rebooted.
This caused all of the kiam-agents to immediately start logging error processing request: rpc error: code = DeadlineExceeded desc = context deadline exceeded and we lost all pod IAM access.

mechpen · 2019-11-22T22:58:41Z

My root cause theory is based on the following facts:

for calico cni, when a node is deleted, all packets from the node is dropped.
when kiam-server is deployed as a daemonset, rolling nodes does not kill the kiam-server pods before deleting the nodes.
this gRPC issue: detect and close dead TCP connection grpc/grpc-go#3206

mehstg · 2019-11-24T13:18:37Z

@mechpen - Great spot. I have been doing some testing today and inserted the Keepalive in to a local branch of Kiam - https://github.com/mehstg/kiam/tree/grpc-keepalive-testing

I have just spun it up on one of my test clusters, and cannot get it to fail now, whether I ungracefully terminate the instance, or drain it correctly.

When a TCP connection dies, i.e. no more packet from peer, gRPC does not close the connection by default. We need to enable gRPC keepalive to detect and close dead TCP connections.

Joseph-Irving · 2020-01-21T13:35:29Z

Kiam v3.5 https://github.com/uswitch/kiam/releases/tag/v3.5 has a potential fix for this issue, has anyone been able to reproduce it on 3.5?

nirnanaaa · 2020-01-21T13:37:49Z

how should cli arguments be tuned to fix this issue from happening @Joseph-Irving ?

mechpen · 2020-01-21T21:44:51Z

@nirnanaaa the default values should be good.

mbelang · 2020-03-06T16:28:35Z

~~I'm at 3.5 with both server and agent as daemonset and I have the problem.~~

The problem is that I want to migrate from kube2iam to kiam and the kiam-agent and kube2iam conflict on the port. Removing kube2iam makes everthing OK.

jkassis · 2020-07-29T02:56:37Z

i'm seeing this now in KIAM v3.5 with OpenShift 4.5 in AWS. Comes right up when i install the helm chart. can't get the kiam-server stable.

jkassis · 2020-07-29T03:41:15Z

in my case... the server was continuously and consistently failing the first liveness / readiness probes. it was 100% due to running the health check against 127.0.0.1 rather than localhost... here are my current working okd4.5 overrides for v3.5 chart...

server: assumeRoleArn: "arn:aws:iam::807125168235:role/k8s-kiam-server" useHostNetwork: true log: level: debug nodeSelector: node-role.kubernetes.io/master: "" tolerations: - key: "node-role.kubernetes.io/master" operator: "Exists" effect: "NoSchedule" sslCertHostPath: /etc/ssl/certs probes: serverAddress: localhost agent: host: iptables: true log: level: debug extraHostPathMounts: - name: ssl-certs mountPath: /etc/ssl/certs hostPath: /etc/ssl/certs readOnly: true nodeSelector: node-role.kubernetes.io/worker: ""

shrutilamba · 2020-08-19T09:32:04Z

I am still getting the same error even after upgrading to v3.5. Whenever a new node comes up, the agent restarts with the following error:
"msg":"error creating server gateway: error dialing grpc server: context deadline exceeded"
Anyone else also facing this issue?

vtremblay · 2020-10-14T01:44:55Z

I have the same issue. I'm running v3.2, installed with the helm chart v2.3.0. I am not sure what triggers the issue between the rotation of the certificates generated by cert-manager or a rotation of kiam-server itself.

When the issue is occurring, changing the --server-address=kiam-server:443 to --server-address=localhost:443 in the agent's daemonset seems to fix it. Sadly enough, that value is hardcoded in the helm chart.

pingles · 2020-11-11T18:31:02Z

I linked to #425 but it's potentially related to the way the gRPC load-balancer manages connection failures. We've updated the version of gRPC and it (according to the gRPC team) can be managed through controlling the MaxConnectionAge setting.

We'll hopefully get this into v4 so this issue ought to be resolvable through that.

pingles · 2020-12-16T10:39:41Z

We've updated gRPC and the changes in #433 should make it possible to mitigate the situation, according to the advice from the gRPC team: controlling the connection age to be shorter forces clients to re-poll for servers frequently.

I'm going to close this issue for now and people should follow that advice to manage. If we see other problems we can re-open.

Thanks for everyone in contributing and helping us to close this down. It was also reported in #425.

max-lobur changed the title ~~kiam server rotation: error processing request: rpc error: code = DeadlineExceeded desc = context deadline exceeded~~ kiam server HA. Context deadline exceeded while rotating servers. Apr 5, 2019

daviddyball mentioned this issue Sep 17, 2019

Move Server to Deployment instead of DaemonSet #296

Merged

rhysemmas self-assigned this Sep 23, 2019

nirnanaaa mentioned this issue Oct 28, 2019

Ensuring the KIAM Endpoint is ready when a pod is scheduled #277

Open

jaygorrell mentioned this issue Nov 23, 2019

[helm] adds option to run server as deployment #332

Merged

mehstg mentioned this issue Nov 25, 2019

Issue #217 - Add GRPC keepalive to server/client #333

Closed

Joseph-Irving mentioned this issue Jan 21, 2020

kiam-agent random errors on node startup #358

Closed

pingles mentioned this issue Oct 20, 2020

Agent stuck with log "resolver returned no addresses" #425

Closed

pingles added this to the v4 milestone Nov 11, 2020

pingles mentioned this issue Dec 1, 2020

gRPC Resolution #441

Closed

This was referenced Dec 10, 2020

Update gRPC version #449

Closed

Create tests to prove recover of gRPC keep-alive #450

Open

Update gRPC to latest release (#449) #451

Merged

pingles closed this as completed Dec 16, 2020

kiam server HA. Context deadline exceeded while rotating servers. #217

kiam server HA. Context deadline exceeded while rotating servers. #217

Comments

max-lobur commented Feb 18, 2019 • edited Loading

complex64 commented Apr 23, 2019

MilanDasek commented Jul 17, 2019

mehstg commented Jul 17, 2019 • edited Loading

mseiwald commented Jul 24, 2019

Joseph-Irving commented Jul 30, 2019

mehstg commented Jul 30, 2019

denispesikov commented Jul 30, 2019

max-lobur commented Jul 30, 2019

mehstg commented Jul 31, 2019 • edited Loading

rhysemmas commented Aug 16, 2019

mehstg commented Aug 16, 2019 • edited by Joseph-Irving Loading

rhysemmas commented Aug 16, 2019

mehstg commented Aug 16, 2019

rhysemmas commented Aug 16, 2019

mehstg commented Aug 16, 2019

rhysemmas commented Aug 16, 2019

mehstg commented Aug 19, 2019 • edited by Joseph-Irving Loading

daviddyball commented Sep 11, 2019

daviddyball commented Sep 12, 2019

daviddyball commented Sep 12, 2019

rhysemmas commented Sep 24, 2019

daviddyball commented Sep 24, 2019

How are you deploying Kiam to your cluster (e.g, Helm/manifest files)?

Which version of Kiam are you running?

What DNS service are you running in your cluster (e.g, kube-dns, coreDNS)?

mehstg commented Sep 24, 2019 via email

rgardam commented Sep 25, 2019

daviddyball commented Sep 25, 2019 • edited Loading

mehstg commented Sep 25, 2019 via email

wskulley commented Oct 31, 2019

daviddyball commented Nov 1, 2019

jacksontj commented Nov 1, 2019

daviddyball commented Nov 1, 2019

mehstg commented Nov 6, 2019

mechpen commented Nov 22, 2019

mehstg commented Nov 24, 2019

Joseph-Irving commented Jan 21, 2020

nirnanaaa commented Jan 21, 2020

mechpen commented Jan 21, 2020

mbelang commented Mar 6, 2020 • edited Loading

jkassis commented Jul 29, 2020

jkassis commented Jul 29, 2020

shrutilamba commented Aug 19, 2020

vtremblay commented Oct 14, 2020 • edited Loading

pingles commented Nov 11, 2020

pingles commented Dec 16, 2020

max-lobur commented Feb 18, 2019 •

edited

Loading

mehstg commented Jul 17, 2019 •

edited

Loading

mehstg commented Jul 31, 2019 •

edited

Loading

mehstg commented Aug 16, 2019 •

edited by Joseph-Irving

Loading

mehstg commented Aug 19, 2019 •

edited by Joseph-Irving

Loading

daviddyball commented Sep 25, 2019 •

edited

Loading

mbelang commented Mar 6, 2020 •

edited

Loading

vtremblay commented Oct 14, 2020 •

edited

Loading