Unstable cluster #55
Comments
Can you provide more details about how you're running this? Creating the cluster locally should not require anything more than:
I've tested regularly both on docker for mac and on docker-ce on linux. Currently it doesn't depend on any recent or advanced functionality. Sticking it in another layer of docker will likely make things less debuggable. |
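For reference, the minimal local flow being referred to is presumably just creating and deleting a cluster with the CLI. A hedged sketch (guarded so it is a no-op without a Docker daemon; the `kind-kind` context name assumes a recent kind release, not necessarily what existed at the time of this thread):

```shell
#!/bin/sh
# Minimal local kind workflow (a sketch, not the maintainer's exact commands).
# Guarded: does nothing unless both kind and a running Docker daemon exist.
if command -v kind >/dev/null 2>&1 && docker info >/dev/null 2>&1; then
  kind create cluster                       # boots a one-node cluster in a container
  kubectl cluster-info --context kind-kind  # "kind-kind" context assumes kind >= 0.6
  kind delete cluster                       # tear it back down
else
  echo "kind/docker not available; skipping"
fi
DONE=1
```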
So I am running inside a Docker container to simulate how my CI (GitLab) works. It seems it just dies after some time. I am running it more or less like this. So I am not sure what to inspect when these errors start happening? I can exec into |
right now |
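As a rough illustration of the CI-style setup described above (a hypothetical sketch: the image name and inner commands are placeholders, not the reporter's actual configuration, and mounting the host's Docker socket is only one of several ways to do this):

```shell
#!/bin/sh
# Hypothetical GitLab-CI-style invocation: run the kind workflow from inside a
# privileged container. Mounting /var/run/docker.sock shares the host daemon
# (docker-out-of-docker); a true docker-in-docker runner differs in detail.
# Guarded: does nothing unless a Docker daemon is reachable.
if docker info >/dev/null 2>&1; then
  docker run --rm --privileged \
    -v /var/run/docker.sock:/var/run/docker.sock \
    some-ci-image:latest \
    sh -c 'kind create cluster && kubectl get nodes'
else
  echo "docker daemon not available; skipping"
fi
DONE=1
```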
Those errors look like the API server is not actually running or the networking is not pointing at it correctly again. It's not clear from that output what else is going on beyond that yet.
|
OK, it is failing even if I run it directly on my laptop/host. |
See log: log.txt |
Can you provide more details about your laptop/host? Docker version? Any special network settings? I think we'll need the API server logs as well. That will be in a location like |
|
Hmm.. I have a very similar setup running kind at home both on the host
docker and in DinD from replicating the gitlab setup myself, I'm not sure
what this would be...
Do you mind if we dig into this more once we have something to scoop up the
logs? It's a bit hard to pin down otherwise.
…On Wed, Oct 3, 2018, 20:02 Mitar ***@***.***> wrote:
Docker version 18.06.1-ce, build e68fc7a. Host is Ubuntu 18.04 personal
laptop. No fancy configuration or network settings.
|
I have started a fresh prototype of collecting up debug info today.
|
Sure. It seems pretty reproducible on my side, so once you get things going, feel free to ping me and I can retry. |
Still getting to this. The volume change may(?) help. I expect to get more of the tooling improvements around debugging in tomorrow hopefully... |
I tried with yesterday's version and there was not much difference. BTW, I am noticing it is using aufs inside, not overlay2. Is there a way to control this? It might be the reason. Edit: It seems |
There's not an option now, but we can and should configure this.
it won't if you already have a copy unless you do |
Any progress on logging? |
I put this on the backburner to make a few more breaking changes prior to putting up a 0.1.0, but I found something else to debug that was causing instability on a few machines I know about: local disk space! The kubelet can see how much space is left on the disk that the docker graph volume is on, and it will start evicting everything if that space is low. I expect to have this in soon though; #77 and #75 were some of the remaining changes I wanted to get in. |
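A quick way to check whether low disk space could be triggering evictions (assuming the default Linux graph root of /var/lib/docker; on Docker for Mac it is the Docker VM's disk that matters):

```shell
#!/bin/sh
# Show free space on the filesystem backing Docker's storage.
# /var/lib/docker is the default graph root on Linux; fall back to /.
if [ -d /var/lib/docker ]; then
  df -h /var/lib/docker
else
  df -h /
fi
DONE=1
```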
So after an absurd delay (I'm sorry!) I finally filed a PR with an initial implementation, after rewriting it a few times in between other refactors. See #123. There have been a lot of other changes to kind, though. |
Thanks. I am currently busy with some other projects, so maybe after this gets merged in I can try to see if this makes it easier to debug issues with kind on my machine (or even, maybe issues are gone with other updates). |
If you do get some time, I'll be happy to poke around the logs and see if I can find anything. The implementation also needs some improvement still but it should provide at least some useful information. No rush if you don't have time though, apologies again for the large delay in getting this out. Getting things very stable and very debuggable is a major priority. |
I'm experiencing the same error:

```
$ kubectl get pods
Unable to connect to the server: EOF
```

If you let me know what I should do, I can try to debug it. |
@danielepolencic if you're on docker for mac / windows, it's common that the docker disk runs out of space, |
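A common way to reclaim that space (a generic cleanup sketch, not a command from this thread; `prune` deletes stopped containers and dangling images, so inspect usage first):

```shell
#!/bin/sh
# Reclaim Docker disk space; a frequent fix for kind instability on
# Docker for Mac/Windows, where the Docker VM's disk fills up.
# Guarded: does nothing unless a Docker daemon is reachable.
if docker info >/dev/null 2>&1; then
  docker system df        # inspect current usage first
  docker system prune -f  # remove stopped containers, dangling images, networks
else
  echo "docker daemon not available; skipping"
fi
DONE=1
```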
Fwiw, I struggled with this problem today. Initially, I tried increasing the amount of memory given to Docker (on MacOS) as well as freeing what I thought ought to be enough disk space. Then, after finding this issue, I ran |
I tried the newly updated version today and I still have the same issues as when I opened this issue. Creating a cluster on my laptop is unstable. Sometimes it does not even create the cluster; sometimes it does, but the cluster is not really working. Attaching an exported log, if it helps, from a run where it did create a cluster. |
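The exported log mentioned here can be produced with kind's log collection (the feature tracked in #123); in current kind versions the command is `kind export logs`:

```shell
#!/bin/sh
# Dump node logs (kubelet, container runtime, system info) from a running
# kind cluster into a local directory for debugging.
# Guarded: does nothing unless kind and a Docker daemon are available.
if command -v kind >/dev/null 2>&1 && docker info >/dev/null 2>&1; then
  kind export logs ./kind-logs
  ls ./kind-logs   # one subdirectory per node
else
  echo "kind/docker not available; skipping"
fi
DONE=1
```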
Thanks, looks like eviction thresholds -> evicting API server. I'm going to go poke a SIG-Node expert about thoughts on us just setting the thresholds to the limits. |
I cannot determine why sometimes cluster creation itself does not work. If I run
Not sure why it works with debug logging (in the sense that it starts the container; then it is unstable, as the logs attached above show). |
Could you please provide your complete system spec? Debug vs non-debug differences could mean memory corruption or hitting some sort of resource cap. |
It is very reproducible (tried multiple times). Docker otherwise runs perfectly. I think I have pretty standard specs:
16 GB memory, 4 core Intel(R) Core(TM) i7-8550U CPU @ 1.80GHz. Enough disk space. Not sure what other specs should be relevant. |
This is due to the api server being evicted by kubelet during bringup, which is visible in the kubelet logs. We just need to tune the eviction limits, had a chat with a SIG-Node expert just now to confirm that this is sane :^) |
But why does debug logging influence whether a cluster gets created or not? |
We don't override that; e.g., try clearing your images and try again. |
I do not think I have disk space issues. |
it shouldn't be, it's racy though. I can't tell from that kubelet log which threshold is being passed, but one of the eviction thresholds is being passed and the api server is being evicted, that will prevent bootup. |
also @neolit123 system spec is in the log zip. I would guess it's the memory threshold. |
BTW, any ETA on this? Asking so that I can better plan my work (should I wait for this so that I can develop on my laptop, or should I invest time into deploying kind somewhere else so that I can work there). |
I just filed the PR. It needs more testing, but it should also be possible to do this with a patch targeting the KubeletConfiguration on a recent cluster with:

```yaml
evictionHard:
  memory.available: "1"
  nodefs.available: "0%"
  nodefs.inodesFree: "0%"
  imagefs.available: "0%"
```
|
I can test this out if you help me a bit how to do so? I just do |
it should be possible to install with this:
|
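For reference, a sketch of installing kind from source (the module path `sigs.k8s.io/kind` is real; the `@latest` selector is an assumption, and the exact command depends on your Go version rather than on what was suggested at the time):

```shell
#!/bin/sh
# Install kind from source. Modern Go (>= 1.16) uses `go install pkg@version`;
# 2018-era toolchains used `go get` instead.
# Guarded: does nothing without a Go toolchain.
if command -v go >/dev/null 2>&1; then
  GO111MODULE=on go install sigs.k8s.io/kind@latest 2>/dev/null \
    || echo "go install failed; on older Go try: go get sigs.k8s.io/kind"
else
  echo "go toolchain not available; skipping"
fi
DONE=1
```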
I can confirm that it works for me. Thanks so much! |
Things do work, but I am seeing some strange events in my case:
Not sure why it wants to free 300 GB of images. |
I do have 197.6GB of images and 44.61GB in local volumes on my laptop. Maybe it is trying to empty that? But that is on my host, not inside Docker inside kind. |
Ah so, the kubelet inside docker can see which disk the storage is on and measure its resource usage; it can trace that back through the mounts, etc. |
Can we disable those attempts at garbage collection? |
Yes we can tune that threshold too 👍 |
```yaml
# config.yaml
kind: Config
apiVersion: kind.sigs.k8s.io/v1alpha2
nodes:
- role: control-plane
  kubeadmConfigPatches:
  - |
    apiVersion: kubelet.config.k8s.io/v1beta1
    kind: KubeletConfiguration
    imageGCHighThresholdPercent: 100
    evictionHard:
      memory.available: "100Mi"
      nodefs.available: "10%"
      nodefs.inodesFree: "5%"
      imagefs.available: "0%"
```
|
Correct, the yaml is updated; that role is a single node in a list of nodes, the |
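To try a config like the one above (assuming it is saved as ./config.yaml; the `--config` flag matches current kind releases):

```shell
#!/bin/sh
# Create a cluster using the kubelet-configuration patches from the thread.
# Assumes the yaml above was saved as ./config.yaml.
# Guarded: does nothing unless kind and a Docker daemon are available.
if command -v kind >/dev/null 2>&1 && docker info >/dev/null 2>&1; then
  kind create cluster --config ./config.yaml
else
  echo "kind/docker not available; skipping"
fi
DONE=1
```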
This didn't work. But I also tried and it works:
Not sure why we would not just leave those values at 0? |
yep, those make sense. disk management is not going to work well in kind right now. thanks for testing! |
🤞 please re-open or file a new bug if this continues. Apologies for the long time frame on these early issues, we're getting things better spun up now 😅 |
No worries. It was perfect timing for when I needed this working. |
When running it locally on my machine the cluster seems much more unstable than on our CI. So now the cluster is created inside a privileged container, but then I am getting strange errors: