
[EKS] Managed worker nodes #139

Closed
tabern opened this issue Jan 30, 2019 · 49 comments
Labels
EKS Amazon Elastic Kubernetes Service

Comments

@tabern
Contributor

tabern commented Jan 30, 2019

Managed Kubernetes worker nodes will allow you to provision, scale, and update groups of EC2 worker nodes through EKS.

This feature fulfills #57


EKS Managed Node Groups are now GA!

@tabern tabern added the EKS Amazon Elastic Kubernetes Service label Jan 30, 2019
@dcherman

As a quick note, can we make sure that this will interact/play nicely with cluster-autoscaler? If we can get managed, autoscaling worker nodes, this would be amazing.

@jaredeis

jaredeis commented Feb 5, 2019

Along with draining nodes in an upgrade situation.

@dnobre

dnobre commented Feb 23, 2019

Shouldn't this one be solved as part of implementing Fargate for EKS? #32

@danmx

danmx commented Jul 29, 2019

You could use Virtual Kubelet

@whereisaaron

Fargate for EKS is a different thing, @dnobre; with Fargate there are no worker nodes to manage, so this issue about managing worker nodes is not relevant to Fargate.

@tabern being able to support cluster-autoscaler is important to people. At the moment it expects to manipulate ASGs through the ASG API. But if you add an EKS API, then please contribute a patch or fork to cluster-autoscaler so it can use the new API.

If you provide your own autoscaling instead, it has to be aware of the cluster workload and all ASGs; you’d need a k8s service or daemonset to provide custom metrics to the ASGs, and, when there are many ASGs, some way to choose which one to scale up/down next, as cluster-autoscaler does.

cluster-autoscaler has to use some tricks to scale to/from zero nodes in an ASG, because it doesn’t know what a node would look like when there are none. The improvement the EKS API could provide would be to expose what a node would be (instance type, AZ, node labels, node taints, tags) when the node group is scaled to zero.

cluster-autoscaler also has trouble when scaling up multi-AZ ASGs, because it can’t specify which AZ the new node will be in (e.g. when the un-scheduled workload is AZ-specific). The ability to specify an AZ when scaling up an EKS node group would be great.
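For context, cluster-autoscaler’s current workaround for scale-from-zero is to read node-template hints from tags on the ASG itself. A rough sketch with the AWS CLI, assuming a hypothetical ASG called my-eks-workers and example label/taint values:

# Tag the ASG so cluster-autoscaler knows what a node would look like at size zero
aws autoscaling create-or-update-tags --tags \
  "ResourceId=my-eks-workers,ResourceType=auto-scaling-group,PropagateAtLaunch=true,Key=k8s.io/cluster-autoscaler/node-template/label/workload,Value=batch" \
  "ResourceId=my-eks-workers,ResourceType=auto-scaling-group,PropagateAtLaunch=true,Key=k8s.io/cluster-autoscaler/node-template/taint/dedicated,Value=batch:NoSchedule"

An EKS node group API that exposed the same information natively would remove the need for these tags.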

@cdenneen

@whereisaaron interesting, that's exactly what I would consider Fargate for EKS to be. Since it was never released 2 years ago, its implementation is pretty hypothetical, but theoretically I would expect an endpoint for your kubeconfig and to be able to deploy via kubectl apply or helm install.
I wouldn't expect things like "task" definitions because that's what Fargate for ECS already is.
How would "managed worker nodes" be any different?

Granted, the fact that "Fargate for EKS" was never released means we are all just spitballing here.

@whereisaaron

With Fargate, whether ECS Fargate or EKS Fargate, there are no worker nodes. That’s why you use a Fargate solution: so you do not have to manage worker nodes. So this issue has no overlap with a Fargate product.

@cdenneen not sure I understand, but what you describe sounds correct, just like EKS endpoint, except no (real) worker nodes, just a virtual-kubelet running as a sidecar to the Pod. The virtual-kubelet and Pod run who-knows-where, because the instances they run on are not our problem with a Fargate solution.

@groodt

groodt commented Oct 1, 2019

Any updates on this issue?

@tabern
Contributor Author

tabern commented Oct 1, 2019

@groodt coming soon.... we'll be sure to update when there are updates to share!

@ejlp12

ejlp12 commented Oct 6, 2019

Will this feature add the capability to create worker node groups from the AWS console (UI)?

@tabern
Contributor Author

tabern commented Oct 6, 2019

@ejlp12 yes.

@lilley2412

@tabern will there be an option to add a userdata script or otherwise modify the instances?

@pfremm

pfremm commented Oct 17, 2019

I am curious about logging aggregation as well for managed workers. Any details on how we can aggregate logs as part of this feature?

@tabern
Contributor Author

tabern commented Oct 18, 2019

@lilley2412 not at launch, but we plan to add this in the future.

@pfremm yes. You'll be able to use EC2 Autoscaling for reporting group-level metrics. Since managed nodes are standard EC2 instances that run in your account, you will be able to implement any log forwarding/aggregation tooling that you are using today, such as FluentBit/S3 and Fluentd/CloudWatch.
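To make the log-forwarding point concrete, here is a minimal sketch of a Fluent Bit output section using the cloudwatch_logs output plugin; the region, log group name, and stream prefix are placeholders, and a real setup also needs the input/filter sections plus a DaemonSet to run the agent on each node:

[OUTPUT]
    Name              cloudwatch_logs
    Match             kube.*
    region            us-west-2
    log_group_name    /eks/my-cluster/containers
    log_stream_prefix node-
    auto_create_group On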

@bcmedeiros

@tabern will this support windows worker nodes?

@mikestef9 mikestef9 changed the title from "Managed worker nodes" to "[EKS] Managed worker nodes" Oct 30, 2019
@hrushig

hrushig commented Nov 13, 2019

Who manages security patches or addresses CVEs on these managed worker nodes? Will this still fall under the "Security in the Cloud" customer responsibility?

@virgofx

virgofx commented Nov 18, 2019

Released GA 11/18 👍

@pkoch

pkoch commented Nov 18, 2019

Can we have a link to the docs?

@nrdlngr

nrdlngr commented Nov 18, 2019

Hi! The documentation is deploying now. It should be available shortly, and I'll update with a link here when it is.

@tabern
Contributor Author

tabern commented Nov 18, 2019

We're excited to announce that Amazon EKS Managed Node Groups are now generally available!

With Amazon EKS managed node groups you don’t need to separately provision or connect the EC2 instances that provide compute capacity to run your Kubernetes applications. You can create, update, or terminate nodes for your cluster with a single command. Nodes run using the latest EKS-optimized AMIs in your AWS account while node updates and terminations gracefully drain nodes to ensure your applications stay available.

Today, EKS managed node groups are available for new Amazon EKS clusters running Kubernetes version 1.14 with platform version eks.3. You can also update clusters (1.13 or lower) to version 1.14 to take advantage of this feature. Support for existing version 1.14 clusters is coming soon.

Learn more
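For reference, the single-command workflow looks roughly like this with the AWS CLI (the cluster name, node group name, subnets, and IAM role ARN are placeholders):

# Create a managed node group attached to an existing EKS cluster
aws eks create-nodegroup \
  --cluster-name my-cluster \
  --nodegroup-name standard-workers \
  --subnets subnet-0abc123 subnet-0def456 \
  --node-role arn:aws:iam::111122223333:role/eksNodeRole \
  --scaling-config minSize=2,maxSize=5,desiredSize=3 \
  --instance-types t3.large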

@tabern tabern closed this as completed Nov 18, 2019
@robgott

robgott commented Nov 18, 2019

@tabern congrats on the release!
Is CF support in a future release or is doco just pending updates?
Ready to use this but can't use without CF support :/

@MarcusNoble

Does something need to be done to enable this on existing clusters?

Latest EKS 1.14

(screenshot: CleanShot 2019-11-19 at 06 20 07)

Also, as @nxf5025 mentions, it doesn't look like there's any ability to pass in userdata or kubelet flags?

Also, will there be support for spot instances?

@tabern
Contributor Author

tabern commented Nov 19, 2019

Thanks all! We're pretty excited to introduce this new feature.

@robgott @pc-rshetty CloudFormation support for managed node groups is there today; it's just that the documentation is taking a bit longer to publish than we had originally expected.

Specifically, EKS managed node groups introduce a new resource type "AWS::EKS::Nodegroup" and an update to the existing resource type "AWS::EKS::Cluster" to add ClusterSecurityGroupId in CloudFormation. The documentation updates for these changes will be published by 11/21.
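A minimal CloudFormation sketch of the new resource type, with placeholder values (consult the official docs once published for the full property list):

Resources:
  ManagedNodeGroup:
    Type: AWS::EKS::Nodegroup
    Properties:
      ClusterName: my-cluster
      NodeRole: arn:aws:iam::111122223333:role/eksNodeRole
      Subnets:
        - subnet-0abc123
        - subnet-0def456
      ScalingConfig:
        MinSize: 2
        DesiredSize: 3
        MaxSize: 5
      InstanceTypes:
        - t3.large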

@pc-rshetty Cluster Autoscaler should continue to work just like it does today. The biggest change from our end is that we tag every node for auto discovery by cluster autoscaler. Overprovisioner should work. Seems like a helm chart that basically implements the method described here?
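For anyone wiring this up, Cluster Autoscaler's ASG auto-discovery mode can pick up those tags with flags roughly like the following; the cluster name is a placeholder, and this assumes the node group's underlying ASG carries the standard k8s.io/cluster-autoscaler tags:

# Container args for the cluster-autoscaler deployment
- ./cluster-autoscaler
- --cloud-provider=aws
- --node-group-auto-discovery=asg:tag=k8s.io/cluster-autoscaler/enabled,k8s.io/cluster-autoscaler/my-cluster
- --balance-similar-node-groups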

@nxf5025 @MarcusNoble today you cannot pass this to managed node groups. However! we're planning to add this in the future as part of support for EC2 Launch Templates #585

Yes, we also will be working on spot support - tracking in #583

The other feature we're currently tracking on the roadmap is Windows Support (#584) but feel free to add more if there are important features you think we should be looking at.

@JanneEN

JanneEN commented Nov 19, 2019

Are managed Ubuntu node groups also being worked on, or should that be added to the roadmap? That was mentioned in the blog post when comparing the EKS API with eksctl; it's a feature we need.

@drewhemm

drewhemm commented Nov 19, 2019

In addition to spot instances, we need to be able to utilise a mixed instances policy, as per kubernetes/autoscaler#1886, i.e. t3.large and t3a.large, or m5.large and m5d.large, etc. This increases the probability of successful instance fulfilment. We are currently using this functionality to good effect and would need the same ability with managed worker nodes, along with the ability to specify userdata.

In the UI, this would simply be represented by being able to select multiple instance types, and preferably being able to sort them in order of preference. This is how launch template mixed instances policy and overrides currently work:

https://docs.aws.amazon.com/AWSCloudFormation/latest/UserGuide/aws-properties-autoscaling-autoscalinggroup-launchtemplateoverrides.html

https://docs.aws.amazon.com/AWSCloudFormation/latest/UserGuide/aws-properties-autoscaling-autoscalinggroup-launchtemplate.html
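For illustration, the CloudFormation shape described in those two pages looks roughly like this (the launch template ID, subnets, and instance types are placeholders):

Resources:
  MixedWorkerGroup:
    Type: AWS::AutoScaling::AutoScalingGroup
    Properties:
      MinSize: "2"
      MaxSize: "6"
      VPCZoneIdentifier:
        - subnet-0abc123
        - subnet-0def456
      MixedInstancesPolicy:
        InstancesDistribution:
          OnDemandBaseCapacity: 0
          OnDemandPercentageAboveBaseCapacity: 0
          SpotAllocationStrategy: lowest-price
        LaunchTemplate:
          LaunchTemplateSpecification:
            LaunchTemplateId: lt-0123456789abcdef0
            Version: "1"
          Overrides:
            # Multiple instance types increase the chance of fulfilment
            - InstanceType: t3.large
            - InstanceType: t3a.large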

@omerfsen

omerfsen commented Nov 19, 2019 via email

@tabern
Contributor Author

tabern commented Nov 21, 2019

@omerfsen we're tracking spot support in #583

@drewhemm we're considering mixed instance groups part of spot support; agreed that without them spot will be difficult to use properly.

@JanneEN that's a good call out, we'd love to make this happen. Thanks for adding as #588

@tabern
Contributor Author

tabern commented Nov 22, 2019

CloudFormation documentation for EKS Managed Node Groups is now published: https://docs.aws.amazon.com/AWSCloudFormation/latest/UserGuide/aws-resource-eks-nodegroup.html

@stevenoctopus

@tabern Doesn't look like docs have been updated still. That link redirects me to: https://docs.aws.amazon.com/AWSCloudFormation/latest/UserGuide/Welcome.html

@drewhemm

Interesting. The link worked when I clicked it 12 hours ago...

@stevenoctopus

It's working for me now! 👍

@splieth

splieth commented Nov 27, 2019

@tabern How are rolling updates supposed to work? Draining nodes actually works, but apparently leads to downtime.

Let's say I have an existing node group and want to rotate the nodes. To do this (manually), I would replace the node group by creating a new one, waiting for it to become available, and then deleting the old one afterwards. When doing this, I can see that the nodes get drained before the instances get terminated. However, the running pods are more or less terminated simultaneously, which leads to downtime.

In terraform, the mechanism is basically the same, leading to the same result.

Am I doing something wrong?

Edit:
I can also see this behavior when just scaling an existing node group (e.g. by scaling from 3 nodes to 6 and back from 6 to 3).

2nd edit:
Downtime in this case means that I can see some failing requests.

@reegnz

reegnz commented Nov 28, 2019

@splieth look into pod disruption budgets; that's what you need to avoid all pods terminating at once.
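A minimal sketch of such a budget, assuming a workload labelled app: my-api (at the time of this thread the API group is policy/v1beta1):

apiVersion: policy/v1beta1
kind: PodDisruptionBudget
metadata:
  name: my-api-pdb
spec:
  minAvailable: 2        # keep at least 2 pods up during node drains
  selector:
    matchLabels:
      app: my-api

With a budget in place, the drains triggered by node group scaling, updates, or terminations should evict pods gradually rather than all at once.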

@groodt

groodt commented Dec 1, 2019

When will we see support for existing 1.14 clusters? My clusters are currently stuck on platform version eks.2.

@reegnz

reegnz commented Dec 2, 2019

@groodt I don't think there's much hope there. You could just create a new cluster and move your workloads there.
Hopefully everyone realizes that you're not supposed to build pet clusters either.

@MarcusNoble

I don't see why it couldn't support existing clusters. A little more involved maybe, but a cluster can have multiple ASGs associated with it, so the new managed nodes could be brought up alongside the existing self-managed ones, and then the self-managed nodes removed once the new nodes are stable.

@groodt

groodt commented Dec 3, 2019

This isn't a new Kubernetes version. Presumably it's some additional process running in the control plane that is aware of the ASGs, and that's it.

I saw this in the original announcement:

Today, EKS managed node groups are available for new Amazon EKS clusters running Kubernetes version 1.14 with platform version eks.3. You can also update clusters (1.13 or lower) to version 1.14 to take advantage of this feature. Support for existing version 1.14 clusters is coming soon.

So presumably they do plan to upgrade existing clusters; I'm curious about the timeline. If it takes too long, sure, I can create a new cluster and migrate workloads easily enough, but it's still annoying to do without downtime.

@PerGon

PerGon commented Jan 3, 2020

My 1.14 clusters are still stuck on platform version eks.2. The newest platform version is eks.7 - it seems that the rollout of new platform versions for existing clusters is really slow.
While it's fair to expect new clusters for new K8s versions, it seems a bit excessive to create new clusters just for new platform versions.
What expectations can we have for the timeline of platform version updates for existing clusters?

@llamahunter

FWIW, my clusters weren’t updating either, but when I updated all my workers to a newer AMI ahead of the control plane version, all my control planes updated within 48 hours. Coincidence?

@pfremm

pfremm commented Feb 10, 2020

Been trying out managed worker nodes, and unless I am missing something, do I have no ability to see kubelet-related logs unless I provision with an SSH key?

@splieth

splieth commented Feb 27, 2020

@pfremm I didn't find another method apart from deploying the SSM agent as a DaemonSet and accessing the logs via SSM rather than SSH. But imho that's the better option, since the SSH key doesn't need to be shared.
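For what it's worth, once the agent is running and the node's instance profile allows Session Manager, the kubelet logs can be reached without any SSH key (the instance ID is a placeholder):

aws ssm start-session --target i-0123456789abcdef0
# then, on the node, the kubelet runs as a systemd unit:
journalctl -u kubelet --no-pager | tail -n 100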

@reegnz

reegnz commented Feb 27, 2020

@pfremm I suggest you set up Container Insights to ship logs and metrics into CloudWatch Logs; that worked great for me.
https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/deploy-container-insights-EKS.html

All logs from pods and the kubelet + kube-proxy are then shipped and viewable in CloudWatch. You can then ship that further into Elasticsearch as well, so that's also an option if you don't like CloudWatch queries.

@pdonorio

@tabern will there be an option to add a userdata script or otherwise modify the instances?

Is there any update on this?

I couldn't find any reference in https://docs.aws.amazon.com/AWSCloudFormation/latest/UserGuide/aws-resource-eks-nodegroup.html

This is quite critical to set up nodes for different purposes.

@mikestef9
Contributor

Hi @pdonorio, that feature request is being tracked in this issue #596

@dcherman

Out of curiosity, does the current managed node group setup specify any kind of flags for --kube-reserved and friends? If so, what are the values based on? I know we'll be able to control those values once #596 has been addressed, but I'm wondering if managed nodegroups would be usable for us today.
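One way to check what the current bootstrap actually sets, assuming the API server is allowed to proxy to the kubelet, is to read the live kubelet configuration (the node name is a placeholder):

kubectl get --raw "/api/v1/nodes/ip-10-0-1-23.ec2.internal/proxy/configz" \
  | jq '.kubeletconfig | {kubeReserved, systemReserved, evictionHard}'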

@erez-rabih

Is there any support for setting taints on a nodegroup?
Is there any support for customizing the bootstrap script that is run on the nodegroup instances?

@amitkatyal

I am not able to scale down the managed node group. I had deployed a 2-node cluster (managed node group) and was able to scale the number of nodes from 2 to 3, but scaling down doesn't work. I tried from the AWS management console and a CF template, but no luck.

On scaling down, the node does go into the SchedulingDisabled state, but the workload is not evicted from that node and eventually the node becomes Ready again.

To scale the managed node group up or down, I am only updating its scaling config.

NAME STATUS ROLES AGE VERSION
ip-10-0-24-220.ap-northeast-1.compute.internal Ready,SchedulingDisabled 101m v1.21.2-eks-c1718fb
ip-10-0-36-47.ap-northeast-1.compute.internal Ready 5h48m v1.21.2-eks-c1718fb
ip-10-0-5-244.ap-northeast-1.compute.internal Ready 25m v1.21.2-eks-c1718fb
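For reference, updating only the scaling config via the CLI looks roughly like this (the cluster and node group names are placeholders):

aws eks update-nodegroup-config \
  --cluster-name my-cluster \
  --nodegroup-name standard-workers \
  --scaling-config minSize=2,maxSize=4,desiredSize=2

If the resulting drain cannot complete, for example because a PodDisruptionBudget cannot be satisfied, the node may come back to Ready as described above; the node group's update history (aws eks list-updates / describe-update) is a good place to check for a failure reason.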
