This repository has been archived by the owner on Mar 5, 2024. It is now read-only.

kiam server HA. Context deadline exceeded while rotating servers. #217

Closed
max-lobur opened this issue Feb 18, 2019 · 43 comments

@max-lobur
Contributor

max-lobur commented Feb 18, 2019

In my test AWS lab I run 3 masters with KIAM servers (quay.io/uswitch/kiam:v3.0) and 6 nodes with KIAM agents (same image). I'm getting errors like this

{"addr":"100.112.154.80:56850","level":"error","method":"GET","msg":"error processing request: rpc error: code = DeadlineExceeded desc = context deadline exceeded","path":"/latest/meta-data/iam/security-credentials/","status":500,"time":"2019-02-18T16:48:17Z"}

every time I rotate masters. The error starts to appear on every one of the 6 agents once I remove at least one master. To stop it I need to restart all the agents; after a restart they work normally again. This does not happen if I rotate the kiam server itself (delete the pod), only when I rotate the whole master instance.

I'm using this chart https://github.com/helm/charts/tree/master/stable/kiam to run kiam. No special tuning, except that I'm setting gatewayTimeoutCreation: 500ms on both agent and server.
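For reference, the relevant chart values look roughly like this (key paths assumed from the chart linked above, not verified against a specific chart version):

agent:
  gatewayTimeoutCreation: 500ms
server:
  gatewayTimeoutCreation: 500ms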

Am I missing something important in the docs, some flag? Please advise.

@max-lobur max-lobur changed the title kiam server rotation: error processing request: rpc error: code = DeadlineExceeded desc = context deadline exceeded kiam server HA. Context deadline exceeded while rotating servers. Apr 5, 2019
@complex64

Same issue here, KIAM v3.2

@MilanDasek

similar issue, kiam v3.3

@mehstg

mehstg commented Jul 17, 2019

Just been burned by this myself - v3.2

@mseiwald

Same issue for us on kiam v3.0

@Joseph-Irving
Contributor

We run Kiam in relatively large clusters and regularly terminate our master nodes (we don't allow them to live longer than 7 days), and we've never seen this behaviour.

Kiam should be using a service with clusterIP: None so that the agents can use gRPC load balancing: they discover the pod IPs of each kiam-server and load balance between them, and if one goes down they just try another. So I'm very puzzled as to how you're seeing this.
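For anyone checking their setup, a minimal sketch of such a headless Service (names, namespace, selector and port are illustrative, not the chart's exact output):

apiVersion: v1
kind: Service
metadata:
  name: kiam-server
  namespace: kube-system
spec:
  clusterIP: None
  selector:
    app: kiam
    component: server
  ports:
    - name: grpc
      port: 443
      targetPort: 443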

I'd be interested to know what networking stack you're using, in case there's some commonality there which could be what's causing this behaviour.

@mehstg

mehstg commented Jul 30, 2019

We are using the amazon-k8s-cni plugin for networking. No other service mesh/network policy currently.

@denispesikov

Same here
error processing request: error fetching credentials: rpc error: code = DeadlineExceeded desc = context deadline exceeded
KIAM v3.2
Any workaround?

@max-lobur
Contributor Author

Using calico over aws ec2.
Also using this chart to deploy https://github.com/helm/charts/tree/master/stable/kiam
It seems that it uses the proper service, as you mentioned: https://github.com/helm/charts/blob/master/stable/kiam/templates/server-service.yaml#L26

@mehstg

mehstg commented Jul 31, 2019

Ok. So I have been doing some more testing of this issue.

From inside a pod, when I run a cURL to request the credentials from the metadata service, it works fine before killing a KIAM server, as expected:
curl http://169.254.169.254/latest/meta-data/iam/security-credentials/dev-eu-west-1-dev01-pod-api
{"Code":"Success","Type":"AWS-HMAC","AccessKeyId":"ASIAYXSSZXR7IBB2RQMP","SecretAccessKey":"O3h+vJPsu8vhWtyWOCkiV9rwh4F/3HMF+jHpMpCP","Token":"FQoGZXIvYXdzEDkaDIL6Kpd6o56IKCaqdyLtAXPxNnhIs6lvNrJwANlEupVSSwGqUzOxOhry10nANFU5VB28YPNc8BYnSy6B8+tMkw10d9ubPwfFj63sdGpuHAS+4wtrpPRbb5p2ma/qvyUu8p33t9+NGZ6BNGYUga43E9+erGSHjDuvD9I4Lwjcz831D3lAYidCAjb99Q+iOWMyScklGfH3YELKGpEtzg55KYPTH16xc+QtOUtdmcUZZsbGnvw04zb6gxhctUCneknWY9+ZfiIGRU2pXvUlavGPCtekyrTp4DnCN2IbDQnmNCfwsZfG0LZZ1yQw41FuZ3igPzzb6TM6a6LlU9gM+Cj4/oTqBQ==","Expiration":"2019-07-31T07:27:56Z","LastUpdated":"2019-07-31T07:12:56Z"}

As soon as I kill a KIAM server, the same cURL command fails immediately:
curl http://169.254.169.254/latest/meta-data/iam/security-credentials/dev-eu-west-1-dev01-pod-api
error fetching credentials: rpc error: code = DeadlineExceeded desc = context deadline exceeded
I can then repeatedly cURL the same request and sometimes it returns OK, sometimes it fails.

The matching log line from the KIAM agent shows similar information:
{"addr":"10.1.113.124:49850","level":"error","method":"HEAD","msg":"error processing request: error fetching credentials: rpc error: code = DeadlineExceeded desc = context deadline exceeded","path":"/latest/meta-data/iam/security-credentials/dev-eu-west-1-dev01-pod-api","status":500,"time":"2019-07-31T07:22:00Z"}`

I ran an nslookup from inside a pod at the point of killing the KIAM server too. The record for the killed KIAM server disappears almost instantaneously. (For reference, we are using CoreDNS.)

It feels very much like something is being cached on the KIAM agents in relation to the IP addresses available to gRPC load balance with. DNS is very quickly removing the retired IP from the pool, but it feels like the KIAM agent is still trying to route requests to it.

The access key/secret/token above are nonsensical; they are just an example.

@rhysemmas
Contributor

I am struggling to reproduce the issue myself - when killing Kiam servers in various ways, the Kiam agent correctly load balances between the remaining Kiam servers.

In order to observe the agent's behaviour, I set the following env vars in the agent pod to increase the log verbosity of grpc:

GRPC_GO_LOG_SEVERITY_LEVEL=info
GRPC_GO_LOG_VERBOSITY_LEVEL=8
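(For anyone wanting to do the same, these can be set on the agent container in the DaemonSet spec; the snippet below is a minimal sketch, with the placement under the agent container's env block assumed:)

env:
  - name: GRPC_GO_LOG_SEVERITY_LEVEL
    value: "info"
  - name: GRPC_GO_LOG_VERBOSITY_LEVEL
    value: "8"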

The following events occur on 1 agent that load balances between 2 different servers:

The agent starts up and performs some service discovery to find the servers:

INFO: 2019/08/15 14:01:25 parsed scheme: "dns"
INFO: 2019/08/15 14:01:25 ccResolverWrapper: got new service config:
INFO: 2019/08/15 14:01:25 ccResolverWrapper: sending new addresses to cc: [{10.244.2.2:443 1 10-244-2-2.kiam-server.kube-system.svc.cluster.local. <nil>} {10.244.1.3:443 1 10-244-1-3.kiam-server.kube-system.svc.cluster.local. <nil>} {10.244.1.3:443 0  <nil>} {10.244.2.2:443 0  <nil>}]

The agent connects to the servers successfully (below) - note that each server is associated with a hex value, in this instance we have the following 2 servers:
10.244.1.3 - 0xc4203c1d40
10.244.2.2 - 0xc4203c1d60

INFO: 2019/08/15 14:01:25 base.baseBalancer: got new ClientConn state:  {{[{10.244.1.3:443 0  <nil>} {10.244.2.2:443 0  <nil>}] <nil>} <nil>}
INFO: 2019/08/15 14:01:25 base.baseBalancer: handle SubConn state change: 0xc4203c1d40, CONNECTING
INFO: 2019/08/15 14:01:25 base.baseBalancer: handle SubConn state change: 0xc4203c1d60, CONNECTING
INFO: 2019/08/15 14:01:25 base.baseBalancer: handle SubConn state change: 0xc4203c1d60, READY
INFO: 2019/08/15 14:01:25 roundrobinPicker: newPicker called with readySCs: map[{10.244.2.2:443 0  <nil>}:0xc4203c1d60]

^ here, one of the grpc connections becomes ready before the other and the 'roundrobinPicker' lists this as the only server available in its pool to load balance with.

INFO: 2019/08/15 14:01:25 base.baseBalancer: handle SubConn state change: 0xc4203c1d40, READY
INFO: 2019/08/15 14:01:25 roundrobinPicker: newPicker called with readySCs: map[{10.244.1.3:443 0  <nil>}:0xc4203c1d40 {10.244.2.2:443 0  <nil>}:0xc4203c1d60]

^ the other connection becomes ready and the load balancing pool is updated to include the second server.

At this point, I kill one of the Kiam servers (10.244.1.3 - 0xc4203c1d40):

INFO: 2019/08/15 14:15:32 transport: loopyWriter.run returning. connection error: desc = "transport is closing"
INFO: 2019/08/15 14:15:32 base.baseBalancer: handle SubConn state change: 0xc4203c1d40, CONNECTING
WARNING: 2019/08/15 14:15:32 grpc: addrConn.createTransport failed to connect to {10.244.1.3:443 0  <nil>}. Err :connection error: desc = "transport: Error while dialing dial tcp 10.244.1.3:443: connect: connection refused". Reconnecting...`

^ the agent attempts to reconnect to the server I killed. The agent will continue to try and re-connect with the server that has gone away over time (I have omitted these logs)

INFO: 2019/08/15 14:15:32 roundrobinPicker: newPicker called with readySCs: map[{10.244.2.2:443 0  <nil>}:0xc4203c1d60]

^ immediately after the connection error, the pool of servers to load balance with is updated to just include the 1 remaining server.

Throughout this, my curls from another pod to the meta-data API have been successful - there hasn't been any sign of the issue reported (these curls will also show in the agent logs but have been omitted above for clarity).

Would somebody be able to set the grpc log level env vars mentioned above in their agent deployments and provide agent logs? Including examples of agent startup and when a server is killed, so we can compare. Thanks.

@mehstg

mehstg commented Aug 16, 2019

Hi. I have collected logs as requested.

Kiam Agent startup logs:

`{"level":"info","msg":"configuring iptables","time":"2019-08-16T12:47:59Z"}
{"level":"info","msg":"started prometheus metric listener 0.0.0.0:9620","time":"2019-08-16T12:47:59Z"}
INFO: 2019/08/16 12:47:59 parsed scheme: "dns"
INFO: 2019/08/16 12:47:59 ccResolverWrapper: got new service config: 
INFO: 2019/08/16 12:47:59 ccResolverWrapper: sending new addresses to cc: [{10.2.81.71:443 1 10-2-81-71.kiam-server.iam-proxy.svc.cluster.local. <nil>} {10.2.118.49:443 1 10-2-118-49.kiam-server.iam-proxy.svc.cluster.local. <nil>} {10.2.118.49:443 0  <nil>} {10.2.81.71:443 0  <nil>}]
INFO: 2019/08/16 12:47:59 base.baseBalancer: got new resolved addresses:  [{10.2.81.71:443 1 10-2-81-71.kiam-server.iam-proxy.svc.cluster.local. <nil>} {10.2.118.49:443 1 10-2-118-49.kiam-server.iam-proxy.svc.cluster.local. <nil>} {10.2.118.49:443 0  <nil>} {10.2.81.71:443 0  <nil>}]
INFO: 2019/08/16 12:47:59 base.baseBalancer: handle SubConn state change: 0xc4201891a0, CONNECTING
INFO: 2019/08/16 12:47:59 base.baseBalancer: handle SubConn state change: 0xc4201891c0, CONNECTING
INFO: 2019/08/16 12:47:59 base.baseBalancer: handle SubConn state change: 0xc4201891e0, CONNECTING
INFO: 2019/08/16 12:47:59 base.baseBalancer: handle SubConn state change: 0xc420189210, CONNECTING
INFO: 2019/08/16 12:47:59 base.baseBalancer: handle SubConn state change: 0xc4201891c0, READY
INFO: 2019/08/16 12:47:59 roundrobinPicker: newPicker called with readySCs: map[{10.2.118.49:443 1 10-2-118-49.kiam-server.iam-proxy.svc.cluster.local. <nil>}:0xc4201891c0]
{"level":"info","msg":"listening :8181","time":"2019-08-16T12:47:59Z"}
INFO: 2019/08/16 12:47:59 base.baseBalancer: handle SubConn state change: 0xc4201891e0, READY
INFO: 2019/08/16 12:47:59 roundrobinPicker: newPicker called with readySCs: map[{10.2.118.49:443 1 10-2-118-49.kiam-server.iam-proxy.svc.cluster.local. <nil>}:0xc4201891c0 {10.2.118.49:443 0  <nil>}:0xc4201891e0]
INFO: 2019/08/16 12:47:59 base.baseBalancer: handle SubConn state change: 0xc420189210, READY
INFO: 2019/08/16 12:47:59 roundrobinPicker: newPicker called with readySCs: map[{10.2.118.49:443 1 10-2-118-49.kiam-server.iam-proxy.svc.cluster.local. <nil>}:0xc4201891c0 {10.2.118.49:443 0  <nil>}:0xc4201891e0 {10.2.81.71:443 0  <nil>}:0xc420189210]
INFO: 2019/08/16 12:47:59 base.baseBalancer: handle SubConn state change: 0xc4201891a0, READY
INFO: 2019/08/16 12:47:59 roundrobinPicker: newPicker called with readySCs: map[{10.2.81.71:443 1 10-2-81-71.kiam-server.iam-proxy.svc.cluster.local. <nil>}:0xc4201891a0 {10.2.118.49:443 1 10-2-118-49.kiam-server.iam-proxy.svc.cluster.local. <nil>}:0xc4201891c0 {10.2.118.49:443 0  <nil>}:0xc4201891e0 {10.2.81.71:443 0  <nil>}:0xc420189210]
{"addr":"10.2.45.198:41134","duration":0,"headers":{},"level":"debug","method":"GET","msg":"processed request","path":"/ping","status":200,"time":"2019-08-16T12:48:03Z"}
{"addr":"10.2.45.198:41172","duration":0,"headers":{},"level":"debug","method":"GET","msg":"processed request","path":"/ping","status":200,"time":"2019-08-16T12:48:06Z"}
{"addr":"10.2.45.198:41188","duration":0,"headers":{},"level":"debug","method":"GET","msg":"processed request","path":"/ping","status":200,"time":"2019-08-16T12:48:09Z"}
{"addr":"10.2.45.198:41198","duration":0,"headers":{},"level":"debug","method":"GET","msg":"processed request","path":"/ping","status":200,"time":"2019-08-16T12:48:12Z"}
{"addr":"10.2.45.198:41240","duration":0,"headers":{},"level":"debug","method":"GET","msg":"processed request","path":"/ping","status":200,"time":"2019-08-16T12:48:15Z"}
{"addr":"10.2.45.198:41264","duration":0,"headers":{},"level":"debug","method":"GET","msg":"processed request","path":"/ping","status":200,"time":"2019-08-16T12:48:18Z"}
{"addr":"10.2.45.198:41276","duration":0,"headers":{},"level":"debug","method":"GET","msg":"processed request","path":"/ping","status":200,"time":"2019-08-16T12:48:21Z"}
{"addr":"10.2.45.198:41310","duration":0,"headers":{},"level":"debug","method":"GET","msg":"processed request","path":"/ping","status":200,"time":"2019-08-16T12:48:24Z"}
{"addr":"10.2.45.198:41336","duration":0,"headers":{},"level":"debug","method":"GET","msg":"processed request","path":"/ping","status":200,"time":"2019-08-16T12:48:27Z"}
{"addr":"10.2.45.198:41348","duration":0,"headers":{},"level":"debug","method":"GET","msg":"processed request","path":"/ping","status":200,"time":"2019-08-16T12:48:30Z"}`

We have two KIAM servers running:
10-2-81-71.kiam-server.iam-proxy.svc.cluster.local
10-2-118-49.kiam-server.iam-proxy.svc.cluster.local
If I run a curl from one of my containers, it works successfully:

root@channel-service-588c949b66-tvxrr:/opt/app# curl http://169.254.169.254/latest/meta-data/iam/security-credentials/stage-eu-west-1-stage01-pod-connectt-api-channel-service
{"Code":"Success","Type":"AWS-HMAC","AccessKeyId":"redacted","SecretAccessKey":"redacted","Token":"redacted","Expiration":"2019-08-16T13:19:47Z","LastUpdated":"2019-08-16T13:04:47Z"}

I then kill the EC2 host running one of the KIAM servers (10-2-118-49.kiam-server.iam-proxy.svc.cluster.local); these are specific worker nodes that only run the KIAM servers.
Almost immediately the curl begins to fail:

root@channel-service-588c949b66-tvxrr:/opt/app# curl http://169.254.169.254/latest/meta-data/iam/security-credentials/stage-eu-west-1-stage01-pod-connectt-api-channel-service
error fetching credentials: rpc error: code = DeadlineExceeded desc = context deadline exceeded

The KIAM agent logs during this period are as follows:

`{"addr":"10.2.39.230:52366","level":"error","method":"GET","msg":"error processing request: error fetching credentials: rpc error: code = DeadlineExceeded desc = context deadline exceeded","path":"/latest/meta-data/iam/security-credentials/stage-eu-west-1-stage01-pod-connectt-api-channel-service","status":500,"time":"2019-08-16T12:50:52Z"}
{"addr":"10.2.39.230:52366","duration":5000,"headers":{"Content-Type":["text/plain; charset=utf-8"],"X-Content-Type-Options":["nosniff"]},"level":"info","method":"GET","msg":"processed request","path":"/latest/meta-data/iam/security-credentials/stage-eu-west-1-stage01-pod-connectt-api-channel-service","status":500,"time":"2019-08-16T12:50:52Z"}
{"addr":"10.2.45.198:42582","duration":0,"headers":{},"level":"debug","method":"GET","msg":"processed request","path":"/ping","status":200,"time":"2019-08-16T12:50:54Z"}
{"addr":"10.2.39.230:52422","level":"error","method":"GET","msg":"error processing request: error fetching credentials: rpc error: code = Canceled desc = context canceled","path":"/latest/meta-data/iam/security-credentials/stage-eu-west-1-stage01-pod-connectt-api-channel-service","status":500,"time":"2019-08-16T12:50:56Z"}
{"addr":"10.2.39.230:52422","duration":2407,"headers":{"Content-Type":["text/plain; charset=utf-8"],"X-Content-Type-Options":["nosniff"]},"level":"info","method":"GET","msg":"processed request","path":"/latest/meta-data/iam/security-credentials/stage-eu-west-1-stage01-pod-connectt-api-channel-service","status":500,"time":"2019-08-16T12:50:56Z"}
{"addr":"10.2.45.198:42612","duration":0,"headers":{},"level":"debug","method":"GET","msg":"processed request","path":"/ping","status":200,"time":"2019-08-16T12:50:57Z"}
{"addr":"10.2.45.198:42620","duration":0,"headers":{},"level":"debug","method":"GET","msg":"processed request","path":"/ping","status":200,"time":"2019-08-16T12:51:00Z"}
{"addr":"10.2.39.230:52466","duration":2,"headers":{"Content-Type":["application/json"]},"level":"debug","method":"GET","msg":"processed request","path":"/latest/meta-data/iam/security-credentials/stage-eu-west-1-stage01-pod-connectt-api-channel-service","status":200,"time":"2019-08-16T12:51:01Z"}
{"addr":"10.2.45.198:42650","duration":0,"headers":{},"level":"debug","method":"GET","msg":"processed request","path":"/ping","status":200,"time":"2019-08-16T12:51:03Z"}
{"addr":"10.2.39.230:52488","duration":1,"headers":{"Content-Type":["application/json"]},"level":"debug","method":"GET","msg":"processed request","path":"/latest/meta-data/iam/security-credentials/stage-eu-west-1-stage01-pod-connectt-api-channel-service","status":200,"time":"2019-08-16T12:51:03Z"}
{"addr":"10.2.45.198:42688","duration":0,"headers":{},"level":"debug","method":"GET","msg":"processed request","path":"/ping","status":200,"time":"2019-08-16T12:51:06Z"}
{"addr":"10.2.39.230:52504","level":"error","method":"GET","msg":"error processing request: error fetching credentials: rpc error: code = DeadlineExceeded desc = context deadline exceeded","path":"/latest/meta-data/iam/security-credentials/stage-eu-west-1-stage01-pod-connectt-api-channel-service","status":500,"time":"2019-08-16T12:51:09Z"}
{"addr":"10.2.39.230:52504","duration":5000,"headers":{"Content-Type":["text/plain; charset=utf-8"],"X-Content-Type-Options":["nosniff"]},"level":"info","method":"GET","msg":"processed request","path":"/latest/meta-data/iam/security-credentials/stage-eu-west-1-stage01-pod-connectt-api-channel-service","status":500,"time":"2019-08-16T12:51:09Z"}
{"addr":"10.2.45.198:42700","duration":0,"headers":{},"level":"debug","method":"GET","msg":"processed request","path":"/ping","status":200,"time":"2019-08-16T12:51:09Z"}
{"addr":"10.2.45.198:42710","duration":0,"headers":{},"level":"debug","method":"GET","msg":"processed request","path":"/ping","status":200,"time":"2019-08-16T12:51:12Z"}
{"addr":"10.2.45.198:42748","duration":0,"headers":{},"level":"debug","method":"GET","msg":"processed request","path":"/ping","status":200,"time":"2019-08-16T12:51:15Z"}
{"addr":"10.2.45.198:42772","duration":0,"headers":{},"level":"debug","method":"GET","msg":"processed request","path":"/ping","status":200,"time":"2019-08-16T12:51:18Z"}
{"addr":"10.2.45.198:42790","duration":0,"headers":{},"level":"debug","method":"GET","msg":"processed request","path":"/ping","status":200,"time":"2019-08-16T12:51:21Z"}
{"addr":"10.2.39.230:52602","level":"error","method":"GET","msg":"error processing request: error fetching credentials: rpc error: code = DeadlineExceeded desc = context deadline exceeded","path":"/latest/meta-data/iam/security-credentials/stage-eu-west-1-stage01-pod-connectt-api-channel-service","status":500,"time":"2019-08-16T12:51:21Z"}
{"addr":"10.2.39.230:52602","duration":5000,"headers":{"Content-Type":["text/plain; charset=utf-8"],"X-Content-Type-Options":["nosniff"]},"level":"info","method":"GET","msg":"processed request","path":"/latest/meta-data/iam/security-credentials/stage-eu-west-1-stage01-pod-connectt-api-channel-service","status":500,"time":"2019-08-16T12:51:21Z"}
{"addr":"10.2.45.198:42824","duration":0,"headers":{},"level":"debug","method":"GET","msg":"processed request","path":"/ping","status":200,"time":"2019-08-16T12:51:24Z"}
{"addr":"10.2.45.198:42850","duration":0,"headers":{},"level":"debug","method":"GET","msg":"processed request","path":"/ping","status":200,"time":"2019-08-16T12:51:27Z"}
{"addr":"10.2.45.198:42858","duration":0,"headers":{},"level":"debug","method":"GET","msg":"processed request","path":"/ping","status":200,"time":"2019-08-16T12:51:30Z"}

@rhysemmas
Contributor

Hi @mehstg - thanks for providing those logs. It is strange that you aren't seeing any gRPC logs (the INFO and WARNING events) in the agent logs after killing the kiam server - are there definitely none of these entries in the agent logs when the cURLs begin to fail?

@mehstg

mehstg commented Aug 16, 2019

No, unfortunately not. I just tested again and I only really see the info messages during startup.

@rhysemmas
Contributor

OK - thanks for double checking. A few questions:

The IP addresses that are resolved in your logs above, are these the same as your kiam server pod IPs?
Could you confirm what your cpu and memory resource requests/limits are for the agent pods?
Are you also using the helm chart to deploy kiam?

Also, could you please try running kiam v3.4 (released today), as this includes some gRPC updates, and let us know if this changes any behaviour.

@mehstg

mehstg commented Aug 16, 2019

I actually am not setting any resources on the pods. Deploying from the stable helm charts (v2.2.0).
IP addresses resolved are correct.
Will try upgrading to v3.4 on one of the test clusters on Monday and test again

@rhysemmas
Contributor

I'd recommend setting some resource requests, at least for the agent pods, so they are not starved of resources. We use the following values for the agents, which work fine (but this will depend on the scale of your cluster):

resources:
  requests:
    cpu: 5m
    memory: 50Mi
  limits:
    memory: 100Mi

and the server resources are set to the following:

resources:
  requests:
    cpu: 500m
    memory: 256Mi
  limits:
    memory: 512Mi

@mehstg

mehstg commented Aug 19, 2019

I have set resource requests on my KIAM pods and also upgraded to KIAM v3.4 on one of my test clusters.
Running the same test, I am seeing the same issues; however, it now seems that alternate requests fail after killing one of my KIAM servers (this cluster has two KIAM servers running).

Previously, every request would fail after killing one of the KIAM servers, so I wonder if the load balancing is working better in v3.4 but the dead server is still being cached somewhere. Log attached:

{"addr":"10.2.67.14:59876","duration":0,"headers":{},"level":"debug","method":"GET","msg":"processed request","path":"/ping","status":200,"time":"2019-08-19T12:54:33Z"}
{"addr":"10.2.71.87:54470","level":"error","method":"GET","msg":"error processing request: error fetching credentials: rpc error: code = DeadlineExceeded desc = context deadline exceeded","path":"/latest/meta-data/iam/security-credentials/stage-eu-west-1-stage01-pod-connectt-api-channel-service","status":500,"time":"2019-08-19T12:54:33Z"}
{"addr":"10.2.71.87:54470","duration":5000,"headers":{"Content-Type":["text/plain; charset=utf-8"],"X-Content-Type-Options":["nosniff"]},"level":"info","method":"GET","msg":"processed request","path":"/latest/meta-data/iam/security-credentials/stage-eu-west-1-stage01-pod-connectt-api-channel-service","status":500,"time":"2019-08-19T12:54:33Z"}
{"addr":"10.2.71.87:54516","duration":1,"headers":{"Content-Type":["application/json"]},"level":"debug","method":"GET","msg":"processed request","path":"/latest/meta-data/iam/security-credentials/stage-eu-west-1-stage01-pod-connectt-api-channel-service","status":200,"time":"2019-08-19T12:54:35Z"}
{"addr":"10.2.67.14:59892","duration":0,"headers":{},"level":"debug","method":"GET","msg":"processed request","path":"/ping","status":200,"time":"2019-08-19T12:54:36Z"}
{"addr":"10.2.67.14:59914","duration":0,"headers":{},"level":"debug","method":"GET","msg":"processed request","path":"/ping","status":200,"time":"2019-08-19T12:54:39Z"}
{"addr":"10.2.71.87:54524","level":"error","method":"GET","msg":"error processing request: error fetching credentials: rpc error: code = Canceled desc = context canceled","path":"/latest/meta-data/iam/security-credentials/stage-eu-west-1-stage01-pod-connectt-api-channel-service","status":500,"time":"2019-08-19T12:54:39Z"}
{"addr":"10.2.71.87:54524","duration":2004,"headers":{"Content-Type":["text/plain; charset=utf-8"],"X-Content-Type-Options":["nosniff"]},"level":"info","method":"GET","msg":"processed request","path":"/latest/meta-data/iam/security-credentials/stage-eu-west-1-stage01-pod-connectt-api-channel-service","status":500,"time":"2019-08-19T12:54:39Z"}

@daviddyball
Contributor

I have been bitten by this as well. I haven't had a chance to enable gRPC logs to provide yet, but I am wondering if this has to do with kiam-server running as a DaemonSet and not receiving proper signals when a master terminates (e.g. when running kops rolling-update cluster on masters).

My knowledge of gRPC is lacking, but perhaps when you manually terminate a kiam-server pod by running kubectl delete pod kiam-server-xxxxxx it sends some kind of message to the gRPC client in kiam-agent?

I'll try and get some time to do further testing on my end and let you know how I go anyway.

As a side-note I have a PR open to convert the DaemonSet for kiam-server to a Deployment so that it handles termination more gracefully. Perhaps the two are linked?

@daviddyball
Contributor

Here's the agent logs when running as a DaemonSet while using kops rolling-update cluster to recycle our master nodes:

INFO: 2019/09/12 10:12:33 transport: loopyWriter.run returning. connection error: desc = "transport is closing"
INFO: 2019/09/12 10:12:33 base.baseBalancer: handle SubConn state change: 0xc4201779f0, TRANSIENT_FAILURE
INFO: 2019/09/12 10:12:33 roundrobinPicker: newPicker called with readySCs: map[{100.96.12.99:443 1 100-96-12-99.kiam-server.kube-system.svc.cluster.local. <nil>}:0xc4203d4db0 {100.96.9.121:443 0  <nil>}:0xc4203d5b10 {100.96.12.99:443 0  <nil>}:0xc4203d4e80 {100.96.11.93:443 1 100-96-11-93.kiam-server.kube-system.svc.cluster.local. <nil>}:0xc420177a10 {100.96.10.125:443 1 100-96-10-125.kiam-server.kube-system.svc.cluster.local. <nil>}:0xc420177a30 {100.96.11.93:443 0  <nil>}:0xc420177ac0 {100.96.10.125:443 0  <nil>}:0xc420177aa0]
INFO: 2019/09/12 10:12:33 base.baseBalancer: handle SubConn state change: 0xc4201779f0, CONNECTING
INFO: 2019/09/12 10:12:33 base.baseBalancer: handle SubConn state change: 0xc4203d5b10, TRANSIENT_FAILURE
INFO: 2019/09/12 10:12:33 roundrobinPicker: newPicker called with readySCs: map[{100.96.12.99:443 1 100-96-12-99.kiam-server.kube-system.svc.cluster.local. <nil>}:0xc4203d4db0 {100.96.12.99:443 0  <nil>}:0xc4203d4e80 {100.96.10.125:443 0  <nil>}:0xc420177aa0 {100.96.10.125:443 1 100-96-10-125.kiam-server.kube-system.svc.cluster.local. <nil>}:0xc420177a30 {100.96.11.93:443 0  <nil>}:0xc420177ac0 {100.96.11.93:443 1 100-96-11-93.kiam-server.kube-system.svc.cluster.local. <nil>}:0xc420177a10]
INFO: 2019/09/12 10:12:33 base.baseBalancer: handle SubConn state change: 0xc4203d5b10, CONNECTING
INFO: 2019/09/12 10:12:33 ccResolverWrapper: got new service config: 
INFO: 2019/09/12 10:12:33 ccResolverWrapper: sending new addresses to cc: [{100.96.9.121:443 1 100-96-9-121.kiam-server.kube-system.svc.cluster.local. <nil>} {100.96.11.93:443 1 100-96-11-93.kiam-server.kube-system.svc.cluster.local. <nil>} {100.96.10.125:443 1 100-96-10-125.kiam-server.kube-system.svc.cluster.local. <nil>} {100.96.0.236:443 1 100-96-0-236.kiam-server.kube-system.svc.cluster.local. <nil>} {100.96.12.99:443 1 100-96-12-99.kiam-server.kube-system.svc.cluster.local. <nil>} {100.96.0.236:443 0  <nil>} {100.96.10.125:443 0  <nil>} {100.96.11.93:443 0  <nil>} {100.96.12.99:443 0  <nil>} {100.96.9.121:443 0  <nil>}]
INFO: 2019/09/12 10:12:33 base.baseBalancer: got new resolved addresses:  [{100.96.9.121:443 1 100-96-9-121.kiam-server.kube-system.svc.cluster.local. <nil>} {100.96.11.93:443 1 100-96-11-93.kiam-server.kube-system.svc.cluster.local. <nil>} {100.96.10.125:443 1 100-96-10-125.kiam-server.kube-system.svc.cluster.local. <nil>} {100.96.0.236:443 1 100-96-0-236.kiam-server.kube-system.svc.cluster.local. <nil>} {100.96.12.99:443 1 100-96-12-99.kiam-server.kube-system.svc.cluster.local. <nil>} {100.96.0.236:443 0  <nil>} {100.96.10.125:443 0  <nil>} {100.96.11.93:443 0  <nil>} {100.96.12.99:443 0  <nil>} {100.96.9.121:443 0  <nil>}]
INFO: 2019/09/12 10:12:33 base.baseBalancer: handle SubConn state change: 0xc4203d4790, CONNECTING
INFO: 2019/09/12 10:12:33 base.baseBalancer: handle SubConn state change: 0xc4203d47d0, CONNECTING
INFO: 2019/09/12 10:12:33 ccResolverWrapper: got new service config: 
INFO: 2019/09/12 10:12:33 ccResolverWrapper: sending new addresses to cc: [{100.96.12.99:443 1 100-96-12-99.kiam-server.kube-system.svc.cluster.local. <nil>} {100.96.10.125:443 1 100-96-10-125.kiam-server.kube-system.svc.cluster.local. <nil>} {100.96.9.121:443 1 100-96-9-121.kiam-server.kube-system.svc.cluster.local. <nil>} {100.96.0.236:443 1 100-96-0-236.kiam-server.kube-system.svc.cluster.local. <nil>} {100.96.11.93:443 1 100-96-11-93.kiam-server.kube-system.svc.cluster.local. <nil>} {100.96.0.236:443 0  <nil>} {100.96.10.125:443 0  <nil>} {100.96.11.93:443 0  <nil>} {100.96.12.99:443 0  <nil>} {100.96.9.121:443 0  <nil>}]
INFO: 2019/09/12 10:12:33 base.baseBalancer: got new resolved addresses:  [{100.96.12.99:443 1 100-96-12-99.kiam-server.kube-system.svc.cluster.local. <nil>} {100.96.10.125:443 1 100-96-10-125.kiam-server.kube-system.svc.cluster.local. <nil>} {100.96.9.121:443 1 100-96-9-121.kiam-server.kube-system.svc.cluster.local. <nil>} {100.96.0.236:443 1 100-96-0-236.kiam-server.kube-system.svc.cluster.local. <nil>} {100.96.11.93:443 1 100-96-11-93.kiam-server.kube-system.svc.cluster.local. <nil>} {100.96.0.236:443 0  <nil>} {100.96.10.125:443 0  <nil>} {100.96.11.93:443 0  <nil>} {100.96.12.99:443 0  <nil>} {100.96.9.121:443 0  <nil>}]
INFO: 2019/09/12 10:12:33 base.baseBalancer: handle SubConn state change: 0xc4203d4790, READY
INFO: 2019/09/12 10:12:33 roundrobinPicker: newPicker called with readySCs: map[{100.96.0.236:443 1 100-96-0-236.kiam-server.kube-system.svc.cluster.local. <nil>}:0xc4203d4790 {100.96.11.93:443 0  <nil>}:0xc420177ac0 {100.96.11.93:443 1 100-96-11-93.kiam-server.kube-system.svc.cluster.local. <nil>}:0xc420177a10 {100.96.10.125:443 1 100-96-10-125.kiam-server.kube-system.svc.cluster.local. <nil>}:0xc420177a30 {100.96.12.99:443 0  <nil>}:0xc4203d4e80 {100.96.10.125:443 0  <nil>}:0xc420177aa0 {100.96.12.99:443 1 100-96-12-99.kiam-server.kube-system.svc.cluster.local. <nil>}:0xc4203d4db0]
INFO: 2019/09/12 10:12:33 base.baseBalancer: handle SubConn state change: 0xc4203d47d0, READY
INFO: 2019/09/12 10:12:33 roundrobinPicker: newPicker called with readySCs: map[{100.96.0.236:443 0  <nil>}:0xc4203d47d0 {100.96.10.125:443 0  <nil>}:0xc420177aa0 {100.96.12.99:443 1 100-96-12-99.kiam-server.kube-system.svc.cluster.local. <nil>}:0xc4203d4db0 {100.96.0.236:443 1 100-96-0-236.kiam-server.kube-system.svc.cluster.local. <nil>}:0xc4203d4790 {100.96.11.93:443 0  <nil>}:0xc420177ac0 {100.96.11.93:443 1 100-96-11-93.kiam-server.kube-system.svc.cluster.local. <nil>}:0xc420177a10 {100.96.10.125:443 1 100-96-10-125.kiam-server.kube-system.svc.cluster.local. <nil>}:0xc420177a30 {100.96.12.99:443 0  <nil>}:0xc4203d4e80]

{"level":"warning","msg":"error finding role for pod: rpc error: code = DeadlineExceeded desc = context deadline exceeded","pod.ip":"100.96.14.178","time":"2019-09-12T10:12:34Z"}
{"addr":"100.96.14.178:35122","level":"error","method":"GET","msg":"error processing request: rpc error: code = DeadlineExceeded desc = context deadline exceeded","path":"/latest/meta-data/iam/security-credentials/","status":500,"time":"2019-09-12T10:12:34Z"}
{"addr":"100.96.14.178:35122","duration":5000,"headers":{"Content-Type":["text/plain; charset=utf-8"],"X-Content-Type-Options":["nosniff"]},"level":"info","method":"GET","msg":"processed request","path":"/latest/meta-data/iam/security-credentials/","status":500,"time":"2019-09-12T10:12:34Z"}

INFO: 2019/09/12 10:12:34 transport: loopyWriter.run returning. connection error: desc = "transport is closing"

WARNING: 2019/09/12 10:12:53 grpc: addrConn.createTransport failed to connect to {100.96.9.121:443 1 100-96-9-121.kiam-server.kube-system.svc.cluster.local. <nil>}. Err :connection error: desc = "transport: Error while dialing dial tcp 100.96.9.121:443: i/o timeout". Reconnecting...
WARNING: 2019/09/12 10:12:53 grpc: addrConn.createTransport failed to connect to {100.96.9.121:443 0  <nil>}. Err :connection error: desc = "transport: Error while dialing dial tcp 100.96.9.121:443: i/o timeout". Reconnecting...
INFO: 2019/09/12 10:12:53 base.baseBalancer: handle SubConn state change: 0xc4201779f0, TRANSIENT_FAILURE
INFO: 2019/09/12 10:12:53 base.baseBalancer: handle SubConn state change: 0xc4203d5b10, TRANSIENT_FAILURE
INFO: 2019/09/12 10:12:53 base.baseBalancer: handle SubConn state change: 0xc4203d5b10, CONNECTING
INFO: 2019/09/12 10:12:53 base.baseBalancer: handle SubConn state change: 0xc4201779f0, CONNECTING
INFO: 2019/09/12 10:12:53 ccResolverWrapper: got new service config: 
INFO: 2019/09/12 10:12:53 ccResolverWrapper: sending new addresses to cc: [{100.96.0.236:443 1 100-96-0-236.kiam-server.kube-system.svc.cluster.local. <nil>} {100.96.10.125:443 1 100-96-10-125.kiam-server.kube-system.svc.cluster.local. <nil>} {100.96.11.93:443 1 100-96-11-93.kiam-server.kube-system.svc.cluster.local. <nil>} {100.96.12.99:443 1 100-96-12-99.kiam-server.kube-system.svc.cluster.local. <nil>} {100.96.0.236:443 0  <nil>} {100.96.10.125:443 0  <nil>} {100.96.11.93:443 0  <nil>} {100.96.12.99:443 0  <nil>}]
INFO: 2019/09/12 10:12:53 base.baseBalancer: got new resolved addresses:  [{100.96.0.236:443 1 100-96-0-236.kiam-server.kube-system.svc.cluster.local. <nil>} {100.96.10.125:443 1 100-96-10-125.kiam-server.kube-system.svc.cluster.local. <nil>} {100.96.11.93:443 1 100-96-11-93.kiam-server.kube-system.svc.cluster.local. <nil>} {100.96.12.99:443 1 100-96-12-99.kiam-server.kube-system.svc.cluster.local. <nil>} {100.96.0.236:443 0  <nil>} {100.96.10.125:443 0  <nil>} {100.96.11.93:443 0  <nil>} {100.96.12.99:443 0  <nil>}]
INFO: 2019/09/12 10:12:53 base.baseBalancer: handle SubConn state change: 0xc4203d5b10, SHUTDOWN
INFO: 2019/09/12 10:12:53 base.baseBalancer: handle SubConn state change: 0xc4201779f0, SHUTDOWN
WARNING: 2019/09/12 10:12:53 grpc: addrConn.transportMonitor exits due to: grpc: the connection is closing
WARNING: 2019/09/12 10:12:53 grpc: addrConn.transportMonitor exits due to: grpc: the connection is closing
INFO: 2019/09/12 10:12:53 ccResolverWrapper: got new service config: 
INFO: 2019/09/12 10:12:53 ccResolverWrapper: sending new addresses to cc: [{100.96.0.236:443 1 100-96-0-236.kiam-server.kube-system.svc.cluster.local. <nil>} {100.96.10.125:443 1 100-96-10-125.kiam-server.kube-system.svc.cluster.local. <nil>} {100.96.11.93:443 1 100-96-11-93.kiam-server.kube-system.svc.cluster.local. <nil>} {100.96.12.99:443 1 100-96-12-99.kiam-server.kube-system.svc.cluster.local. <nil>} {100.96.0.236:443 0  <nil>} {100.96.10.125:443 0  <nil>} {100.96.11.93:443 0  <nil>} {100.96.12.99:443 0  <nil>}]

At 10:12:34Z you can see the agent returns context deadline exceeded to the client, which is what is causing issues. Our client apps receive this as the response from the meta-data API and this causes further issues (confirmed by running a pod curling the meta-data endpoint every second; at least one request came back with context deadline exceeded as the response).

I think this is down to the kiam-server being available right up until the power is switched off (EC2 API call to TerminateInstance). I think there's a TCP wait/timeout thing going on, where kiam-agent was either halfway through a request to the now-dead server, or just thinks the server is still there because it was never notified that it was going away.

I'm off to test this with the kiam-server pods running as part of a Deployment. My hypothesis is that they will receive graceful shutdowns and notify any gRPC clients they're talking to that they are going away. We'll see.

@daviddyball
Contributor

OK, testing with kiam-server as a Deployment is complete and it looks a lot healthier:

INFO: 2019/09/12 10:47:05 transport: loopyWriter.run returning. connection error: desc = "transport is closing"
INFO: 2019/09/12 10:47:05 transport: loopyWriter.run returning. connection error: desc = "transport is closing"
INFO: 2019/09/12 10:47:05 base.baseBalancer: handle SubConn state change: 0xc4203b6df0, TRANSIENT_FAILURE
INFO: 2019/09/12 10:47:05 roundrobinPicker: newPicker called with readySCs: map[{100.96.3.77:443 0  <nil>}:0xc420520af0 {100.96.0.159:443 0  <nil>}:0xc4203b6e30 {100.96.10.42:443 0  <nil>}:0xc4203b6e70 {100.96.3.77:443 1 100-96-3-77.kiam-server.kube-system.svc.cluster.local. <nil>}:0xc420520ad0 {100.96.11.137:443 1 100-96-11-137.kiam-server.kube-system.svc.cluster.local. <nil>}:0xc4203b6da0 {100.96.11.137:443 0  <nil>}:0xc4203b6e90 {100.96.0.159:443 1 100-96-0-159.kiam-server.kube-system.svc.cluster.local. <nil>}:0xc4203b6d80]
INFO: 2019/09/12 10:47:05 base.baseBalancer: handle SubConn state change: 0xc4203b6df0, CONNECTING
INFO: 2019/09/12 10:47:05 base.baseBalancer: handle SubConn state change: 0xc4203b6e70, TRANSIENT_FAILURE
INFO: 2019/09/12 10:47:05 roundrobinPicker: newPicker called with readySCs: map[{100.96.11.137:443 1 100-96-11-137.kiam-server.kube-system.svc.cluster.local. <nil>}:0xc4203b6da0 {100.96.0.159:443 0  <nil>}:0xc4203b6e30 {100.96.3.77:443 1 100-96-3-77.kiam-server.kube-system.svc.cluster.local. <nil>}:0xc420520ad0 {100.96.0.159:443 1 100-96-0-159.kiam-server.kube-system.svc.cluster.local. <nil>}:0xc4203b6d80 {100.96.3.77:443 0  <nil>}:0xc420520af0 {100.96.11.137:443 0  <nil>}:0xc4203b6e90]
INFO: 2019/09/12 10:47:05 base.baseBalancer: handle SubConn state change: 0xc4203b6e70, CONNECTING
WARNING: 2019/09/12 10:47:05 grpc: addrConn.createTransport failed to connect to {100.96.10.42:443 1 100-96-10-42.kiam-server.kube-system.svc.cluster.local. <nil>}. Err :connection error: desc = "transport: Error while dialing dial tcp 100.96.10.42:443: connect: connection refused". Reconnecting...
INFO: 2019/09/12 10:47:05 base.baseBalancer: handle SubConn state change: 0xc4203b6df0, TRANSIENT_FAILURE
WARNING: 2019/09/12 10:47:05 grpc: addrConn.createTransport failed to connect to {100.96.10.42:443 0  <nil>}. Err :connection error: desc = "transport: Error while dialing dial tcp 100.96.10.42:443: connect: connection refused". Reconnecting...
INFO: 2019/09/12 10:47:05 base.baseBalancer: handle SubConn state change: 0xc4203b6e70, TRANSIENT_FAILURE
INFO: 2019/09/12 10:47:05 ccResolverWrapper: got new service config: 
INFO: 2019/09/12 10:47:05 ccResolverWrapper: sending new addresses to cc: [{100.96.11.137:443 1 100-96-11-137.kiam-server.kube-system.svc.cluster.local. <nil>} {100.96.10.42:443 1 100-96-10-42.kiam-server.kube-system.svc.cluster.local. <nil>} {100.96.4.29:443 1 100-96-4-29.kiam-server.kube-system.svc.cluster.local. <nil>} {100.96.3.77:443 1 100-96-3-77.kiam-server.kube-system.svc.cluster.local. <nil>} {100.96.0.159:443 1 100-96-0-159.kiam-server.kube-system.svc.cluster.local. <nil>} {100.96.0.159:443 0  <nil>} {100.96.11.137:443 0  <nil>} {100.96.3.77:443 0  <nil>} {100.96.4.29:443 0  <nil>}]
INFO: 2019/09/12 10:47:05 base.baseBalancer: got new resolved addresses:  [{100.96.11.137:443 1 100-96-11-137.kiam-server.kube-system.svc.cluster.local. <nil>} {100.96.10.42:443 1 100-96-10-42.kiam-server.kube-system.svc.cluster.local. <nil>} {100.96.4.29:443 1 100-96-4-29.kiam-server.kube-system.svc.cluster.local. <nil>} {100.96.3.77:443 1 100-96-3-77.kiam-server.kube-system.svc.cluster.local. <nil>} {100.96.0.159:443 1 100-96-0-159.kiam-server.kube-system.svc.cluster.local. <nil>} {100.96.0.159:443 0  <nil>} {100.96.11.137:443 0  <nil>} {100.96.3.77:443 0  <nil>} {100.96.4.29:443 0  <nil>}]
INFO: 2019/09/12 10:47:05 base.baseBalancer: handle SubConn state change: 0xc4209792a0, CONNECTING
INFO: 2019/09/12 10:47:05 base.baseBalancer: handle SubConn state change: 0xc4209792c0, CONNECTING
INFO: 2019/09/12 10:47:05 base.baseBalancer: handle SubConn state change: 0xc4203b6e70, SHUTDOWN
WARNING: 2019/09/12 10:47:05 grpc: addrConn.transportMonitor exits due to: context canceled
INFO: 2019/09/12 10:47:05 ccResolverWrapper: got new service config: 
INFO: 2019/09/12 10:47:05 ccResolverWrapper: sending new addresses to cc: [{100.96.0.159:443 1 100-96-0-159.kiam-server.kube-system.svc.cluster.local. <nil>} {100.96.11.137:443 1 100-96-11-137.kiam-server.kube-system.svc.cluster.local. <nil>} {100.96.3.77:443 1 100-96-3-77.kiam-server.kube-system.svc.cluster.local. <nil>} {100.96.4.29:443 1 100-96-4-29.kiam-server.kube-system.svc.cluster.local. <nil>} {100.96.0.159:443 0  <nil>} {100.96.11.137:443 0  <nil>} {100.96.3.77:443 0  <nil>} {100.96.4.29:443 0  <nil>}]
INFO: 2019/09/12 10:47:05 base.baseBalancer: got new resolved addresses:  [{100.96.0.159:443 1 100-96-0-159.kiam-server.kube-system.svc.cluster.local. <nil>} {100.96.11.137:443 1 100-96-11-137.kiam-server.kube-system.svc.cluster.local. <nil>} {100.96.3.77:443 1 100-96-3-77.kiam-server.kube-system.svc.cluster.local. <nil>} {100.96.4.29:443 1 100-96-4-29.kiam-server.kube-system.svc.cluster.local. <nil>} {100.96.0.159:443 0  <nil>} {100.96.11.137:443 0  <nil>} {100.96.3.77:443 0  <nil>} {100.96.4.29:443 0  <nil>}]
INFO: 2019/09/12 10:47:05 base.baseBalancer: handle SubConn state change: 0xc4203b6df0, SHUTDOWN
WARNING: 2019/09/12 10:47:05 grpc: addrConn.transportMonitor exits due to: context canceled
INFO: 2019/09/12 10:47:05 base.baseBalancer: handle SubConn state change: 0xc4209792c0, READY
INFO: 2019/09/12 10:47:05 roundrobinPicker: newPicker called with readySCs: map[{100.96.3.77:443 0  <nil>}:0xc420520af0 {100.96.11.137:443 0  <nil>}:0xc4203b6e90 {100.96.3.77:443 1 100-96-3-77.kiam-server.kube-system.svc.cluster.local. <nil>}:0xc420520ad0 {100.96.11.137:443 1 100-96-11-137.kiam-server.kube-system.svc.cluster.local. <nil>}:0xc4203b6da0 {100.96.4.29:443 0  <nil>}:0xc4209792c0 {100.96.0.159:443 0  <nil>}:0xc4203b6e30 {100.96.0.159:443 1 100-96-0-159.kiam-server.kube-system.svc.cluster.local. <nil>}:0xc4203b6d80]
INFO: 2019/09/12 10:47:05 base.baseBalancer: handle SubConn state change: 0xc4209792a0, READY
INFO: 2019/09/12 10:47:05 roundrobinPicker: newPicker called with readySCs: map[{100.96.3.77:443 0  <nil>}:0xc420520af0 {100.96.4.29:443 1 100-96-4-29.kiam-server.kube-system.svc.cluster.local. <nil>}:0xc4209792a0 {100.96.11.137:443 0  <nil>}:0xc4203b6e90 {100.96.3.77:443 1 100-96-3-77.kiam-server.kube-system.svc.cluster.local. <nil>}:0xc420520ad0 {100.96.11.137:443 1 100-96-11-137.kiam-server.kube-system.svc.cluster.local. <nil>}:0xc4203b6da0 {100.96.4.29:443 0  <nil>}:0xc4209792c0 {100.96.0.159:443 0  <nil>}:0xc4203b6e30 {100.96.0.159:443 1 100-96-0-159.kiam-server.kube-system.svc.cluster.local. <nil>}:0xc4203b6d80]

I rebooted 5 nodes and none of my kiam-agents returned context deadline exceeded in this period. I think this is because the pods in a Deployment are correctly terminated in a graceful manner, and gRPC can cope with that better than with the power cord being unplugged, as it effectively is with a DaemonSet.

I'd love for someone to double-check these findings, though. The PR for kiam-server as a Deployment is here.

I don't think this solves the problem entirely though, as the same "pull the plug" event could happen at any time and agents might still serve context deadline exceeded responses to clients.
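For anyone wanting to try the same approach, a minimal sketch of what a kiam-server Deployment pinned to the masters might look like (replica count, labels, image tag and service account are illustrative, not the actual manifest from the PR):

apiVersion: apps/v1
kind: Deployment
metadata:
  name: kiam-server
  namespace: kube-system
spec:
  replicas: 3
  selector:
    matchLabels:
      app: kiam
      component: server
  template:
    metadata:
      labels:
        app: kiam
        component: server
    spec:
      serviceAccountName: kiam-server
      nodeSelector:
        node-role.kubernetes.io/master: ""
      tolerations:
        - key: node-role.kubernetes.io/master
          operator: Exists
          effect: NoSchedule
      containers:
        - name: kiam-server
          image: quay.io/uswitch/kiam:v3.3
          command: ["kiam", "server"]
          # TLS certificates, role base ARN and other server flags omitted for brevity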

@rhysemmas
Contributor

Hi @daviddyball, thanks for adding your findings. I have been trying to replicate the issue and I am still struggling - killing the Kiam server in various ways (deleting the pod, forcefully deleting the pod, terminating the node, rebooting the node) whilst sending requests to a Kiam agent has not reproduced the reported context deadline exceeded error for me.

From looking at the logs you've provided when Kiam server is running as a daemonset, the gRPC client on the agent does seem to be behaving normally when a server goes away; it notices the gRPC server connection failing as the master nodes are recycled:

INFO: 2019/09/12 10:12:33 base.baseBalancer: handle SubConn state change: 0xc4201779f0, TRANSIENT_FAILURE

^ The connection to a server enters failure state

INFO: 2019/09/12 10:12:33 roundrobinPicker: newPicker called with readySCs: map[{100.96.12.99:443 1 100-96-12-99.kiam-server.kube-system.svc.cluster.local. <nil>}:0xc4203d4db0 {100.96.9.121:443 0  <nil>}:0xc4203d5b10 {100.96.12.99:443 0  <nil>}:0xc4203d4e80 {100.96.11.93:443 1 100-96-11-93.kiam-server.kube-system.svc.cluster.local. <nil>}:0xc420177a10 {100.96.10.125:443 1 100-96-10-125.kiam-server.kube-system.svc.cluster.local. <nil>}:0xc420177a30 {100.96.11.93:443 0  <nil>}:0xc420177ac0 {100.96.10.125:443 0  <nil>}:0xc420177aa0]

^ The loadbalancing pool updates immediately to drop the failing server (the 0xc4201779f0 value does not appear in the list)

This occurs for another Kiam server at the same time, 10:12:33:

INFO: 2019/09/12 10:12:33 base.baseBalancer: handle SubConn state change: 0xc4203d5b10, TRANSIENT_FAILURE
INFO: 2019/09/12 10:12:33 roundrobinPicker: newPicker called with readySCs: map[{100.96.12.99:443 1 100-96-12-99.kiam-server.kube-system.svc.cluster.local. <nil>}:0xc4203d4db0 {100.96.12.99:443 0  <nil>}:0xc4203d4e80 {100.96.10.125:443 0  <nil>}:0xc420177aa0 {100.96.10.125:443 1 100-96-10-125.kiam-server.kube-system.svc.cluster.local. <nil>}:0xc420177a30 {100.96.11.93:443 0  <nil>}:0xc420177ac0 {100.96.11.93:443 1 100-96-11-93.kiam-server.kube-system.svc.cluster.local. <nil>}:0xc420177a10]

So I am not entirely sure what is causing the issue yet. I am still trying to replicate this issue, though, and if you are happy to share it would be useful to me if you could confirm the following:

How are you deploying Kiam to your cluster (e.g, Helm/manifest files)?
Which version of Kiam are you running?
What DNS service are you running in your cluster (e.g, kube-dns, coreDNS)?

@daviddyball
Contributor

Hey @rhysemmas,

Here's my setup:

How are you deploying Kiam to your cluster (e.g, Helm/manifest files)?

Flat YAML manifests

Which version of Kiam are you running?

quay.io/uswitch/kiam:v3.3

What DNS service are you running in your cluster (e.g, kube-dns, coreDNS)?

CoreDNS: k8s.gcr.io/coredns:1.1.3

@mehstg

mehstg commented Sep 24, 2019 via email

@rgardam

rgardam commented Sep 25, 2019

I'm seeing this on a kops update and rolling restart of the cluster.

I'm seeing kops try to drain the nodes, but it takes no notice of DaemonSets:
WARNING: Ignoring DaemonSet-managed pods:

Kops then terminates the node via the AWS Auto Scaling group API.

We are using
calico cni
k8s v1.13.10
quay.io/uswitch/kiam:v3.3

@daviddyball
Contributor

daviddyball commented Sep 25, 2019

@mehstg I'm using Calico CNI, although I've also recreated this with Cilium on our development cluster.

@mehstg

mehstg commented Sep 25, 2019 via email

@wskulley

Saw this (again) today on a rolling update of 1.13 masters (kops)

quay.io/uswitch/kiam:v3.3
helm kiam-2.5.1
canal CNI
CoreDNS

@daviddyball
Contributor

I should report that since moving kiam-server to a Deployment instead of a DaemonSet we've not experienced any issues on our 6 clusters, including a migration from 1.11 -> 1.12 -> 1.13 -> 1.14 for all of them. I think this is down to each server pod being terminated gracefully: kubectl drain handles Deployment-managed pods correctly and sends proper SIGTERM events to them.

While this doesn't solve the issue of random server failures causing the same problem, it does mean we can roll our control plane without fear of breaking our workloads' IAM credentials.

@jacksontj

If you need to evict a DS before a node dies you can actually do so by applying a NoExecute taint like so:

kubectl taint nodes node1 key=value:NoExecute

This will remove any/all pods from the node; it should be done after a drain (as the running non-DaemonSet pods usually depend on the DaemonSet pods). It's an option if you (for some reason) still need a DaemonSet instead of a Deployment.

@daviddyball
Contributor

Nice to know @jacksontj. We're using kops to roll our nodes and there's currently no option that I know of to apply an additional taint to a node before it gets killed. It does seem like a good way to get around this issue while still using a DaemonSet, though.

@mehstg

mehstg commented Nov 6, 2019

Ok. So I have just been burned by this again on our prod system. Cert-manager replaced the TLS certificates used by KIAM and the kiam-servers rebooted.
This caused all of the kiam-agents to immediately start logging error processing request: rpc error: code = DeadlineExceeded desc = context deadline exceeded and we lost all pod IAM access.

@mechpen
Contributor

mechpen commented Nov 22, 2019

My root cause theory is based on the following facts:

  1. With the Calico CNI, when a node is deleted, all packets from the node are dropped.
  2. When kiam-server is deployed as a DaemonSet, rolling nodes does not kill the kiam-server pods before deleting the nodes.
  3. This gRPC issue: detect and close dead TCP connection grpc/grpc-go#3206

@mehstg

mehstg commented Nov 24, 2019

@mechpen - Great spot. I have been doing some testing today and inserted the keepalive into a local branch of Kiam - https://github.com/mehstg/kiam/tree/grpc-keepalive-testing

I have just spun it up on one of my test clusters, and cannot get it to fail now, whether I ungracefully terminate the instance, or drain it correctly.

Joseph-Irving pushed a commit that referenced this issue Dec 6, 2019
When a TCP connection dies, i.e. no more packets arrive from the peer, gRPC
does not close the connection by default. We need to enable gRPC
keepalive to detect and close dead TCP connections.
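For illustration, enabling client-side keepalive in gRPC-Go looks roughly like this (a minimal sketch, not kiam's actual code; the dial target, insecure transport, service config and timing values are all assumptions):

package main

import (
	"log"
	"time"

	"google.golang.org/grpc"
	"google.golang.org/grpc/keepalive"
)

func main() {
	// Ping the server after 10s of inactivity and drop the connection if no
	// ack arrives within 20s, so connections to servers that died without a
	// FIN/RST are detected as dead instead of hanging indefinitely.
	conn, err := grpc.Dial(
		"dns:///kiam-server:443", // illustrative target; kiam uses TLS in practice
		grpc.WithInsecure(),
		grpc.WithDefaultServiceConfig(`{"loadBalancingPolicy":"round_robin"}`),
		grpc.WithKeepaliveParams(keepalive.ClientParameters{
			Time:                10 * time.Second,
			Timeout:             20 * time.Second,
			PermitWithoutStream: true,
		}),
	)
	if err != nil {
		log.Fatal(err)
	}
	defer conn.Close()
}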
@Joseph-Irving
Contributor

Kiam v3.5 https://github.com/uswitch/kiam/releases/tag/v3.5 has a potential fix for this issue. Has anyone been able to reproduce it on v3.5?

@nirnanaaa
Contributor

How should the CLI arguments be tuned to prevent this issue from happening, @Joseph-Irving?

@mechpen
Contributor

mechpen commented Jan 21, 2020

@nirnanaaa the default values should be good.

@mbelang

mbelang commented Mar 6, 2020

I'm on 3.5 with both server and agent as DaemonSets and I have the problem.

The problem is that I'm migrating from kube2iam to kiam, and the kiam-agent and kube2iam conflict on the port. Removing kube2iam makes everything OK.

@jkassis

jkassis commented Jul 29, 2020

I'm seeing this now in KIAM v3.5 with OpenShift 4.5 on AWS. It comes right up when I install the helm chart; I can't get the kiam-server stable.

@jkassis

jkassis commented Jul 29, 2020

In my case, the server was continuously and consistently failing the first liveness/readiness probes. It was 100% due to running the health check against 127.0.0.1 rather than localhost. Here are my current working OKD 4.5 overrides for the v3.5 chart:

server:
  assumeRoleArn: "arn:aws:iam::807125168235:role/k8s-kiam-server"
  useHostNetwork: true
  log:
    level: debug
  nodeSelector:
    node-role.kubernetes.io/master: ""
  tolerations:
    - key: "node-role.kubernetes.io/master"
      operator: "Exists"
      effect: "NoSchedule"
  sslCertHostPath: /etc/ssl/certs
  probes:
    serverAddress: localhost
agent:
  host:
    iptables: true
  log:
    level: debug
  extraHostPathMounts:
    - name: ssl-certs
      mountPath: /etc/ssl/certs
      hostPath: /etc/ssl/certs
      readOnly: true
  nodeSelector:
    node-role.kubernetes.io/worker: ""

@shrutilamba

I am still getting the same error even after upgrading to v3.5. Whenever a new node comes up, the agent restarts with the following error:
"msg":"error creating server gateway: error dialing grpc server: context deadline exceeded"
Anyone else also facing this issue?

@vtremblay

vtremblay commented Oct 14, 2020

I have the same issue. I'm running v3.2, installed with the helm chart v2.3.0. I am not sure whether the issue is triggered by the rotation of the certificates generated by cert-manager or by a rotation of kiam-server itself.

When the issue is occurring, changing --server-address=kiam-server:443 to --server-address=localhost:443 in the agent's DaemonSet seems to fix it. Sadly, that value is hardcoded in the helm chart.

@pingles
Contributor

pingles commented Nov 11, 2020

I linked this to #425 as it's potentially related to the way the gRPC load balancer manages connection failures. We've updated the version of gRPC, and (according to the gRPC team) this can be managed by controlling the MaxConnectionAge setting.

We'll hopefully get this into v4 so this issue ought to be resolvable through that.

@pingles
Contributor

pingles commented Dec 16, 2020

We've updated gRPC, and the changes in #433 should make it possible to mitigate the situation, following the advice from the gRPC team: setting a shorter connection age forces clients to re-poll for servers frequently.
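For illustration, the server-side setting looks roughly like this in gRPC-Go (a minimal sketch; the durations are assumptions rather than kiam's flags or defaults):

package main

import (
	"log"
	"net"
	"time"

	"google.golang.org/grpc"
	"google.golang.org/grpc/keepalive"
)

func main() {
	lis, err := net.Listen("tcp", ":443")
	if err != nil {
		log.Fatal(err)
	}
	// A short MaxConnectionAge makes the server recycle client connections
	// periodically, forcing agents to re-resolve DNS and re-balance across
	// whichever server pods currently exist.
	srv := grpc.NewServer(
		grpc.KeepaliveParams(keepalive.ServerParameters{
			MaxConnectionAge:      5 * time.Minute,
			MaxConnectionAgeGrace: 30 * time.Second,
		}),
	)
	// register the kiam gRPC service here (omitted)
	log.Fatal(srv.Serve(lis))
}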

I'm going to close this issue for now; people should follow that advice to manage it. If we see other problems we can re-open.

Thanks to everyone for contributing and helping us close this down. It was also reported in #425.

@pingles pingles closed this as completed Dec 16, 2020