
Unstable cluster #55

Closed
mitar opened this issue Oct 3, 2018 · 49 comments · Fixed by #293
Labels
triage/needs-information Indicates an issue needs more information in order to work on it.

Comments

@mitar
Contributor

mitar commented Oct 3, 2018

When running it locally on my machine, the cluster seems much less stable than on our CI. The cluster is created inside a privileged container, but then I get strange errors:

$ kubectl cluster-info
Unable to connect to the server: unexpected EOF
$ kubectl cluster-info
Error from server (InternalError): an error on the server ("") has prevented the request from succeeding (get services)
$ kubectl cluster-info 
error: the server doesn't have a resource type "services"
@BenTheElder
Member

Can you provide more details about how you're running this?

Creating the cluster locally should not require anything more than:

  • have docker running
  • kind create

I've tested regularly on both Docker for Mac and docker-ce on Linux. kind currently doesn't depend on any recent or advanced functionality. Sticking it in another layer of Docker will likely make things less debuggable.
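
A minimal sketch of that flow (the kubectl check is added for illustration; the subcommand spelling has varied between versions, and the kubeconfig kind prints after creation is assumed to be in use):

docker info > /dev/null    # confirm the docker daemon is reachable
kind create cluster        # early versions used plain `kind create`
kubectl cluster-info       # sanity check that the API server answers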

@mitar
Contributor Author

mitar commented Oct 3, 2018

I am running kind inside a Docker container to simulate how my CI (GitLab) works. It seems the cluster just dies after some time. I am running it more or less like this.

I am not sure what to inspect when these errors start happening. Can I exec into kind-1-control-plane?

@BenTheElder
Member

Right now docker exec kind-1-control-plane journalctl > logs.txt works, but I'm working on a nicer kind command for this soon... stuck on go1.11 upgrade issues for kubernetes/kubernetes right now, xref kubernetes/test-infra#9695.

@BenTheElder
Member

Those errors look like the API server is not actually running, or the networking is not pointing at it correctly. It's not clear from that output what else is going on beyond that yet.

docker exec (not the prettiest, since the container names are technically not really exposed yet) can get you onto the "node", after which normal Debian debugging tools should generally work (ps, journalctl, etc.).
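
For example, roughly (node container name as used elsewhere in this thread):

docker exec -it kind-1-control-plane bash                              # interactive shell on the "node"
docker exec kind-1-control-plane ps aux                                # check whether kube-apiserver is actually running
docker exec kind-1-control-plane journalctl -u kubelet > kubelet.log   # kubelet logs, for eviction / networking clues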

@mitar
Contributor Author

mitar commented Oct 3, 2018

OK, it is failing even if I run it directly on my laptop/host.

@mitar
Contributor Author

mitar commented Oct 4, 2018

See log: log.txt

@BenTheElder
Member

Can you provide more details about your laptop/host? Docker version? Any special network settings?

I think we'll need the API server logs as well. They will be in a location like /var/log/containers/kube-apiserver-kind-1-control-plane.*.log. Again, I'll be adding a tool to collect these shortly... 😬
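
Until that tool exists, something like this should pull them off the node (container name and log path as above; the glob expands inside the node):

docker exec kind-1-control-plane sh -c 'cat /var/log/containers/kube-apiserver-kind-1-control-plane.*.log' > apiserver.log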

@mitar
Contributor Author

mitar commented Oct 4, 2018

Docker version 18.06.1-ce, build e68fc7a. Host is Ubuntu 18.04 personal laptop. No fancy configuration or network settings.

@BenTheElder
Member

BenTheElder commented Oct 4, 2018 via email

@BenTheElder
Member

BenTheElder commented Oct 4, 2018 via email

@mitar
Contributor Author

mitar commented Oct 4, 2018

Sure. It seems pretty reproducible on my side, so once you get things going, feel free to ping me and I can retry.

@BenTheElder
Member

Still getting to this. The volume change may(?) help. I hope to get more of the debugging tooling improvements in tomorrow...

@mitar
Contributor Author

mitar commented Oct 9, 2018

I tried with yesterday's version and there was not much difference.

BTW, I am noticing it is using aufs inside, not overlay2. Is there a way to control this? It might be the reason.
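
For reference, the driver can be checked roughly like this (node container name assumed; at this point kind still runs a Docker daemon inside the node):

docker exec kind-1-control-plane docker info --format '{{.Driver}}'   # storage driver inside the node
docker info --format '{{.Driver}}'                                    # storage driver on the host, for comparison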

Edit: It seems go get sigs.k8s.io/kind does not yet give you the version with the volume change.

@BenTheElder
Member

BTW, I am noticing it is using aufs inside, not overlay2. Is there a way to control this? It might be the reason.

There's not an option now, but we can and should configure this.

Edit: It seems go get sigs.k8s.io/kind does not yet give you the version with volume change.

It won't if you already have a copy unless you do go get -u sigs.k8s.io/kind; otherwise it should.

@mitar
Contributor Author

mitar commented Oct 26, 2018

Any progress on logging?

@BenTheElder
Member

I put this on the backburner to make a few more breaking changes prior to putting up a 0.1.0, but I found something else to debug that was causing instability on a few machines I know about: local disk space!

The kubelet can see how much space is left on the disk that the docker graph volume is on and will start evicting everything if it's low.

I expect to have this in soon though, #77 and #75 were some of the remaining changes I wanted to get in.
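
A quick way to check that locally (sketch; uses the Docker data root reported by docker info):

df -h "$(docker info --format '{{.DockerRootDir}}')"   # free space on the disk backing Docker's storage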

@BenTheElder
Member

So after an absurd delay (I'm sorry!) I finally filed a PR with an initial implementation, after rewriting it a few times in between other refactors. See #123.

There've been a lot of other changes to kind though.

@mitar
Contributor Author

mitar commented Nov 21, 2018

Thanks. I am currently busy with some other projects, so maybe after this gets merged in I can try to see if it makes it easier to debug issues with kind on my machine (or maybe the issues are even gone with other updates).

@BenTheElder
Member

If you do get some time, I'll be happy to poke around the logs and see if I can find anything. The implementation still needs some improvement, but it should provide at least some useful information.

No rush if you don't have time though, apologies again for the large delay in getting this out. Getting things very stable and very debuggable is a major priority.

@BenTheElder BenTheElder added the triage/needs-information Indicates an issue needs more information in order to work on it. label Nov 22, 2018
@danielepolencic

I'm experiencing the same error:

$ kubectl get pods
Unable to connect to the server: EOF

If you let me know what I should do, I can try to debug it.

@BenTheElder
Member

BenTheElder commented Jan 10, 2019

@danielepolencic kind export logs (possibly with --name to match whatever you supplied when creating the cluster) can help to debug this. I'd guess that this is the workloads being evicted due to disk pressure / memory pressure, which is a common issue. See: #156

If you're on Docker for Mac / Windows, it's common for the Docker disk to run out of space; docker system prune can help.
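
Roughly (sketch; --name only matters if you passed one at creation time):

kind export logs        # add --name <cluster name> for a non-default cluster name
docker system prune     # reclaims stopped containers, dangling images, and unused networks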

@davidewatson

davidewatson commented Jan 11, 2019

Fwiw, I struggled with this problem today. Initially, I tried increasing the amount of memory given to Docker (on macOS) as well as freeing what I thought ought to be enough disk space.

Then, after finding this issue, I ran docker system prune and I was able to create and use a kind cluster. 🎉 Thanks for the tip @BenTheElder!

@mitar
Contributor Author

mitar commented Feb 7, 2019

I tried the newly updated version today and I still have the same issues as when I opened this issue. Creating a cluster on my laptop is unstable. Sometimes it does not even get created. Sometimes it does, but it is not really working.

Attaching an exported log from a run where it did create a cluster, in case it helps.

664736857.zip

@BenTheElder
Member

Thanks, looks like eviction thresholds -> evicting the API server. I'm going to go poke a SIG-Node expert about thoughts on us just setting the thresholds to the limits.

@mitar
Copy link
Contributor Author

mitar commented Feb 7, 2019

I cannot determine why cluster creation itself sometimes does not work. If I run kind create --loglevel debug cluster it always succeeds, but if I run kind create cluster it fails with:

Creating cluster "kind" ...
 ✓ Ensuring node image (kindest/node:v1.13.2) 🖼
 ✓ [control-plane] Creating node container 📦 
 ✓ [control-plane] Fixing mounts 🗻 
 ✓ [control-plane] Starting systemd 🖥 
 ✓ [control-plane] Waiting for docker to be ready 🐋 
 ✓ [control-plane] Pre-loading images 🐋 
 ✓ [control-plane] Creating the kubeadm config file ⛵ 
ERRO[12:26:05] failed to remove master taint: exit status 1
 ✗ [control-plane] Starting Kubernetes (this may take a minute) ☸
ERRO[12:26:05] failed to remove master taint: exit status 1 
Error: failed to create cluster: failed to remove master taint: exit status 1

Not sure why it works with debug logging (in the sense that it starts the cluster; it is then unstable, as the logs attached above show).

@neolit123
Member

Could you please provide your complete system spec?

Debug vs. non-debug making a difference could mean memory corruption or hitting some sort of resource cap.

@mitar
Contributor Author

mitar commented Feb 7, 2019

Debug vs. non-debug making a difference could mean memory corruption or hitting some sort of resource cap.

It is very reproducible (tried multiple times).

Docker otherwise runs perfectly.

I think I have pretty standard specs:

Distributor ID:	Ubuntu
Description:	Ubuntu 18.04.1 LTS
Release:	18.04
Codename:	bionic
Docker version 18.09.1, build 4c52b90

16 GB of memory, 4-core Intel(R) Core(TM) i7-8550U CPU @ 1.80GHz. Enough disk space.

Not sure what other specs should be relevant.

@BenTheElder
Member

This is due to the API server being evicted by the kubelet during bringup, which is visible in the kubelet logs.

We just need to tune the eviction limits; I had a chat with a SIG-Node expert just now to confirm that this is sane :^)

@mitar
Contributor Author

mitar commented Feb 7, 2019

But why does debug logging influence whether a cluster gets created or not?

@neolit123
Member

We don't override e.g. pod-eviction-timeout for the controller manager from kubeadm, because the default of 5 minutes is sufficient for the regular use case.

Try clearing your images and try again.
Related?
#156 (comment)

@mitar
Contributor Author

mitar commented Feb 7, 2019

I do not think I have disk space issues.

@BenTheElder
Member

But why does debug logging influence whether a cluster gets created or not?

It shouldn't, but it's racy. I can't tell from that kubelet log which threshold is being crossed, but one of the eviction thresholds is being crossed and the API server is being evicted, which will prevent bootup.
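
One way to see which threshold is being crossed is to grep the kubelet journal on the node, e.g. (sketch; node container name assumed):

docker exec kind-1-control-plane journalctl -u kubelet | grep -i -E 'evict|threshold'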

@BenTheElder
Member

Also @neolit123, the system spec is in the log zip. I would guess it's the memory threshold.

@mitar
Contributor Author

mitar commented Feb 11, 2019

BTW, any ETA on this? Asking so that I can better plan my work (should I wait for this so that I can develop on my laptop, or should I invest time into deploying kind somewhere else so that I can work there?).

@BenTheElder
Member

I just filed the PR. It needs more testing, but it should also be possible to do this with a patch targeting the KubeletConfiguration on a recent cluster with:

evictionHard:
  memory.available: "1"
  nodefs.available: "0%"
  nodefs.inodesFree: "0%"
  imagefs.available: "0%"

@mitar
Contributor Author

mitar commented Feb 12, 2019

I can test this out if you help me a bit with how to do so. Do I just do go get -u <path to git branch somehow>?

@BenTheElder
Member

BenTheElder commented Feb 12, 2019

it should be possible to install with this:

cd "$(go env GOPATH)/src/sigs.k8s.io/kind"
git fetch origin pull/293/head:pr293
git checkout pr293
go install .

@mitar
Contributor Author

mitar commented Feb 12, 2019

I can confirm that it works for me. Thanks so much!

@mitar
Contributor Author

mitar commented Feb 12, 2019

Things do work, but I am seeing some strange events in my case:

[watch event] namespace=default, reason=Starting, message=Starting kubelet., for={kind=Node, name=kind-1-control-plane}
[watch event] namespace=default, reason=NodeHasSufficientMemory, message=Node kind-1-control-plane status is now: NodeHasSufficientMemory, for={kind=Node, name=kind-1-control-plane}
[watch event] namespace=default, reason=NodeHasNoDiskPressure, message=Node kind-1-control-plane status is now: NodeHasNoDiskPressure, for={kind=Node, name=kind-1-control-plane}
[watch event] namespace=default, reason=NodeHasSufficientPID, message=Node kind-1-control-plane status is now: NodeHasSufficientPID, for={kind=Node, name=kind-1-control-plane}
[watch event] namespace=default, reason=NodeAllocatableEnforced, message=Updated Node Allocatable limit across pods, for={kind=Node, name=kind-1-control-plane}
[watch event] namespace=default, reason=RegisteredNode, message=Node kind-1-control-plane event: Registered Node kind-1-control-plane in Controller, for={kind=Node, name=kind-1-control-plane}
[watch event] namespace=default, reason=Starting, message=Starting kube-proxy., for={kind=Node, name=kind-1-control-plane}
[watch event] namespace=default, reason=RegisteredNode, message=Node kind-1-control-plane event: Registered Node kind-1-control-plane in Controller, for={kind=Node, name=kind-1-control-plane}
[watch event] namespace=default, reason=FreeDiskSpaceFailed, message=failed to garbage collect required amount of images. Wanted to free 306389538406 bytes, but freed 0 bytes, for={kind=Node, name=kind-1-control-plane}
[watch event] namespace=default, reason=ImageGCFailed, message=failed to garbage collect required amount of images. Wanted to free 306395731558 bytes, but freed 0 bytes, for={kind=Node, name=kind-1-control-plane}
[watch event] namespace=default, reason=ImageGCFailed, message=failed to garbage collect required amount of images. Wanted to free 306400130662 bytes, but freed 0 bytes, for={kind=Node, name=kind-1-control-plane}
[watch event] namespace=default, reason=RegisteredNode, message=Node kind-1-control-plane event: Registered Node kind-1-control-plane in Controller, for={kind=Node, name=kind-1-control-plane}
[watch event] namespace=default, reason=RegisteredNode, message=Node kind-1-control-plane event: Registered Node kind-1-control-plane in Controller, for={kind=Node, name=kind-1-control-plane}
[watch event] namespace=default, reason=FreeDiskSpaceFailed, message=failed to garbage collect required amount of images. Wanted to free 306529232486 bytes, but freed 0 bytes, for={kind=Node, name=kind-1-control-plane}

Not sure why it wants to free 300 GB of images.

@mitar
Contributor Author

mitar commented Feb 12, 2019

I do have 197.6 GB of images and 44.61 GB in local volumes on my laptop. Maybe it is trying to empty that? But that is on my host, not inside Docker inside kind.
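
(For reference, the host-side usage can be summarized with:)

docker system df   # space used by images, containers, and local volumes on the host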

@BenTheElder
Member

But that is on my host, not inside Docker inside kind.

Ah, so the kubelet inside Docker can see which disk the storage is on and find the resource usage. It can trace that back through the mounts etc.

@mitar
Contributor Author

mitar commented Feb 12, 2019

Can we disable those attempts at garbage collection?

@BenTheElder
Member

Yes we can tune that threshold too 👍

@BenTheElder
Member

BenTheElder commented Feb 12, 2019

imageGCHighThresholdPercent: 100 in addition to #293 (comment)

# config.yaml
kind: Config
apiVersion: kind.sigs.k8s.io/v1alpha2
nodes:
- role: control-plane
  kubeadmConfigPatches:
  - |
    apiVersion: kubelet.config.k8s.io/v1beta1
    kind: KubeletConfiguration
    imageGCHighThresholdPercent: 100
    evictionHard:
      memory.available:  "100Mi"
      nodefs.available:  "10%"
      nodefs.inodesFree: "5%"
      imagefs.available: "0%"
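
and then create the cluster pointing at that file, roughly:

kind create cluster --config config.yaml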

@BenTheElder
Member

The correct YAML is updated above; that role is a single node in a list of nodes, and the nodes key was missing 😅

@mitar
Contributor Author

mitar commented Feb 12, 2019

This didn't work. But I also tried the following, and it works:

# config.yaml
kind: Config
apiVersion: kind.sigs.k8s.io/v1alpha2
nodes:
- role: control-plane
  kubeadmConfigPatches:
  - |
    apiVersion: kubelet.config.k8s.io/v1beta1
    kind: KubeletConfiguration
    imageGCHighThresholdPercent: 100
    evictionHard:
      memory.available:  "100Mi"
      nodefs.available:  "0%"
      nodefs.inodesFree: "0%"
      imagefs.available: "0%"

Not sure why we would not just leave those values at 0?

@BenTheElder
Member

Yep, those make sense. Disk management is not going to work well in kind right now. Thanks for testing!

@BenTheElder
Member

🤞 Please re-open or file a new bug if this continues. Apologies for the long time frame on these early issues; we're getting things better spun up now 😅

@mitar
Contributor Author

mitar commented Feb 12, 2019

No worries. It was perfect timing for when I needed this working.
