calendar scaler does not work for data100 - also I suspect it caused a hub outage #3935
Comments
I suspect the node scaler needs to be made smarter: it should check the configured values for the node pool it receives events for, and use those to decide how large the RAM requests for the node placeholder pods need to be. As for the outage, I'm not sure what caused that.
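A rough sketch of what that could look like: derive the placeholder's memory request from the pool's node size instead of hardcoding ~48 GB. The function name and the 0.55 fraction here are hypothetical illustrations, not the scaler's actual API or behavior.

```python
def placeholder_memory_request_gb(node_allocatable_gb: float) -> float:
    """Hypothetical sizing rule (not the scaler's current behavior):
    request just over half the node's allocatable RAM, so the scheduler
    can never bin-pack two placeholder pods onto one node, no matter
    how large the pool's nodes are."""
    return node_allocatable_gb * 0.55

# On a ~200 GB data100 node this asks for ~110 GB instead of a fixed
# 48 GB, so at most one placeholder fits per node and each placeholder
# still forces a scale-up.
```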
Looking at this now, it appears two of the three core nodes went away at the same time and were replaced by new nodes, causing this outage.
The node-autoscaler doesn't actually touch the core pool itself, so this is possibly not directly related? I'm not 100% sure.
Bug description
My limited understanding of this tool is that it polls a calendar every minute and checks whether any events are scheduled; if there are, it provisions placeholder pods that request a large amount of resources in order to get the autoscaler to scale up more nodes. Unfortunately for data100, we increased the size of the nodes, which breaks this mechanism.
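As I understand it, the control loop is roughly the following. This is a minimal sketch under my assumptions: the function names (`get_current_events`, `set_placeholder_replicas`) and the event data shape are made up, not the tool's real interface.

```python
import time

def desired_placeholder_replicas(active_events):
    """One decision step: how many placeholder pods should exist right now.
    `active_events` is a list of (event_name, replica_count) pairs; this
    shape is an assumption for illustration."""
    if not active_events:
        return 0
    return max(count for _name, count in active_events)

def scaling_loop(get_current_events, set_placeholder_replicas, poll_seconds=60):
    """Poll the calendar once a minute; while an event is active, keep
    resource-hungry placeholder pods running so the cluster autoscaler
    brings up extra nodes, then scale them away when the event ends."""
    while True:
        set_placeholder_replicas(desired_placeholder_replicas(get_current_events()))
        time.sleep(poll_seconds)
```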
Basically, the placeholder pods request ~48 GB of RAM, which with our typical node configuration causes a new node to come up. However, the data100 nodes have ~200 GB of RAM, which means multiple of these placeholder pods can now be packed onto a single node. As a result, they may... or may not cause additional nodes to come up.
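The bin-packing arithmetic behind this, using the rough numbers above (real allocatable memory will be somewhat lower once system daemons are subtracted):

```python
PLACEHOLDER_REQUEST_GB = 48  # each placeholder pod's memory request
DATA100_NODE_GB = 200        # approximate RAM on one data100 node

# How many placeholder pods the scheduler can pack on a single node:
fits = DATA100_NODE_GB // PLACEHOLDER_REQUEST_GB  # 4
```

So scheduling N placeholders can bring up anywhere from N/4 (rounded up) to N new nodes, depending entirely on where the scheduler happens to place them.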
Additionally, when conducting a scaling test over a 15m period, something bad happened. Once the event was over, the node-scaler crashed and all the hubs went down. I suspect the node-scaler was not happy about having multiple placeholder pods scheduled on the same node, though I'm not sure how that translates into an outage.
During this period, the hub pods terminated and then got stuck on startup with the following log:
After several minutes this issue resolved itself.
Environment & setup
data100 specifically, but all hubs were briefly impacted by an outage
How to reproduce
I suspect scheduling additional node-scaling events for data100 will trigger this, but only if bad luck results in multiple placeholders ending up on the same node. You could probably force it to happen by increasing the data100 node size to 2-3x the current size.