Fine-Grained Resource Allocation for Spegel DaemonSets #718
Comments
Hey there @ugur99! We had some pretty wild growth in our own deployment a while back and posted a similar issue (#546), but after some intense digging we discovered that you can safely set the request/limit to a fairly low number (such as 256MB) and Spegel should continue to operate without issue. It appears this is more of an OS/kernel-level memory utilization that will eventually take whatever it is given and keep using it, while the spegel binary itself doesn't truly need that memory to operate. We have been set up with this 256MB limit on a 50+ node cluster since September and have not really had any OOM kills or container crashes. Hope this helps!
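For reference, a minimal sketch of what that kind of cap could look like in the chart's values, assuming the Spegel Helm chart exposes a standard Kubernetes `resources` block (verify the exact key names against the chart's own values.yaml):

```yaml
# Illustrative values only -- check the chart's values.yaml for the actual key names.
resources:
  requests:
    memory: 256Mi
  limits:
    memory: 256Mi
```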
@ugur99 this is an issue that I have been trying to track down for a while now. Something consumes the memory when no limits have been set. Maybe it is time to set some defaults in the Helm chart. It would be really helpful if you could share profiler data from one of the pods consuming 12 GB of memory; that would really help me pinpoint what is consuming it. If you port-forward to port 9090 on one of the pods consuming the memory, you can run the following command to start the profiler.
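A sketch of what that profiling step can look like, assuming Spegel serves the standard Go `net/http/pprof` endpoints on its metrics port; the namespace and pod name below are placeholders:

```sh
# Forward the metrics port of one of the high-memory pods (namespace and pod name are placeholders)
kubectl port-forward --namespace spegel pod/<spegel-pod> 9090:9090 &

# Fetch a heap profile and open the interactive pprof shell
go tool pprof http://localhost:9090/debug/pprof/heap
```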
Thank you both for the answers! @phillebaba I sent it to you in a DM on the Kubernetes Slack :) Hope it helps.
I was hoping the pprof data would point me in the right direction, but it gave me nothing. I have a hunch about where the problem is. I am going to spin up an AKS cluster tomorrow and try to reproduce this by throwing a bunch of parallel requests at a large layer. If I can reproduce it and then show that #725 does not have this problem, we can close this issue with that PR.
I have tried a lot of different things today to put pressure on Spegel. While I have found some interesting behavior, I have not been able to trigger the memory leak. Looking at your metrics, there has to be something that drives memory usage from 100 MB to 12 GB, which I am guessing is the memory available on the node.
I finally managed to reproduce this in an AKS cluster. I still don't know what the source of this memory leak is because pprof is not giving me anything. What surprised me is that the consumption is not happening in the proxy but when serving blobs. I can clearly see that memory usage increases each time I pull an image and is never released. Now starts the difficult process of figuring out what is leaking memory.
It has been very educational for me to research this issue 😄 I now have an explanation for why this happens when no memory limit is set.

What is happening is that the page cache is counted as part of the container's memory usage. When we stream large blobs we populate the page cache with large amounts of data. Once the files have been served and closed, that cached data is technically no longer needed, but it is not cleaned up until memory pressure occurs. When no limit is set, that pressure threshold is effectively the node's available memory, so as more and more layers are served the reported memory usage just keeps growing until memory pressure occurs and the kernel kicks in to free the page cache. Here is an issue which discusses things in more detail.

The goal of Spegel should be to give a good UX right out of the box, and I do not think page cache accounting is something the majority of users will consider. Right now, unless I find some other solution, I think the best way forward is to set a default memory request and limit in the Helm chart.
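One way to check whether the reported usage is really page cache rather than the Go process itself is to read the container's cgroup accounting directly. This is only a sketch: it assumes cgroup v2 and an image that can run `cat` (if the image is shell-less, the same files can be read from the node under the pod's cgroup path); namespace and pod name are placeholders.

```sh
# Total memory charged to the container (what the metrics report)
kubectl exec --namespace spegel <spegel-pod> -- cat /sys/fs/cgroup/memory.current

# Breakdown: "anon" is the process's own memory, while "file"/"inactive_file" is
# page cache left behind by streamed blobs, which the kernel can reclaim under pressure.
kubectl exec --namespace spegel <spegel-pod> -- cat /sys/fs/cgroup/memory.stat
```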
Spegel version
v0.0.28
Kubernetes distribution
Kubeadm
Kubernetes version
v1.31.3
CNI
Cilium
Describe the bug
We are deploying Spegel across a Kubernetes cluster with multiple node groups, each with different resource availability and workload patterns. A single DaemonSet doesn’t work efficiently due to varying resource needs.
Additionally, Spegel’s memory usage differs significantly even within the same node group, making it hard to set uniform resource requests/limits without over- or under-provisioning.
You can see the differences in Spegel memory usage for one specific node group in this image:
Is there a way to manage the Spegel DaemonSet's resource requests/limits more efficiently for the same node group?