[EKS] [request]: Speed up cluster autoscale by automatically creating new EKS Optimized AMIs with security patches #1712

Open
ssanders1449 opened this issue Apr 14, 2022 · 3 comments
Labels
EKS Amazon Elastic Kubernetes Service Proposed Community submitted issue

Comments

@ssanders1449

ssanders1449 commented Apr 14, 2022

Community Note

  • Please vote on this issue by adding a 👍 reaction to the original issue to help the community and maintainers prioritize this request
  • Please do not leave "+1" or "me too" comments, they generate extra noise for issue followers and do not help prioritize the request
  • If you are interested in working on this issue or have submitted a pull request, please leave a comment

Tell us about your request
Downloading and installing security patches can add 45 seconds to the boot time when the Cluster Autoscaler adds new nodes. This affects responsiveness to traffic spikes. Therefore, the request is twofold:

  1. Add a process which automatically releases new sub-versions of EKS Optimized AMIs containing security patches whenever a new security patch is released that would normally be installed during boot of the original AMI.
  2. Add an SSM API that receives an AMI ID and returns the ID of a new AMI that is exactly the same as the original, but has all security patches

Which service(s) is this request for?
EKS

Tell us about the problem you're trying to solve. What are you trying to do, and why is it hard?

One of the factors that affects the responsiveness of the Cluster Autoscaler is how long new nodes take to initialize. Part of node bootup is checking for, downloading, and installing security patches. If all security patches are included in the AMI, bootup time is reduced by as much as 45 seconds.

Even though new AMIs are released approximately every 10 days, it is not good enough to just always use the latest recommended AMI for two reasons:

  1. New AMI versions contain changes that are not just security related, so it is dangerous to roll them into production systems automatically without testing (see [EKS] [request]: A reliable EKS AMI release process #319). However, it should be perfectly safe to take a new AMI that is identical to the original one except for the security patches, since the original AMI downloads and installs these same security patches during its initial boot anyway.

  2. Security patches are released more often than recommended AMIs. In a recent test, I took an AMI that was 6 days old, and when the node booted, it downloaded and installed 4 security patches. When I created a custom AMI that included these 4 patches, node bootup time was reduced by 45 seconds. See the attached excerpts from cloud-init-output.log for the recommended AMI versus the custom AMI.
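
A comparison like the one above can be scripted. The following is a minimal sketch, assuming each cloud-init-output.log ends with cloud-init's usual "Up N seconds" completion line; the function names `boot_seconds` and `boot_time_saved` are hypothetical and not part of any AWS tooling.

```python
import re

# Assumption: cloud-init reports total uptime as "Up <seconds> seconds"
# on its completion lines in cloud-init-output.log.
_UP_RE = re.compile(r"Up (\d+(?:\.\d+)?) seconds")


def boot_seconds(log_text: str) -> float:
    """Return the last 'Up N seconds' value found in the log text."""
    matches = _UP_RE.findall(log_text)
    if not matches:
        raise ValueError("no 'Up N seconds' line found in log")
    return float(matches[-1])


def boot_time_saved(recommended_log: str, custom_log: str) -> float:
    """Seconds saved by the custom (pre-patched) AMI versus the recommended one."""
    return boot_seconds(recommended_log) - boot_seconds(custom_log)
```

Reading both attached logs with `open(path).read()` and passing the contents to `boot_time_saved` would give the difference in seconds.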

So the request is to:

  1. Create an automated procedure that will create new AMIs with a naming convention that includes the original AMI name, plus a patch version suffix. For example: amazon-eks-node-1.21-v20220406-p1, amazon-eks-node-1.21-v20220406-p2, etc

  2. Add an SSM API where we can pass in the original AMI name (e.g. amazon-eks-node-1.21-v20220406) and get back the ID/Name of the latest patched AMI which is based on the original AMI. This can be used to automate updating of AutoScaling Groups with the patched AMI ID.
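
To illustrate the lookup that item 2 asks for, here is a minimal sketch of the selection logic, assuming the proposed `-pN` suffix convention from item 1. The function `latest_patched_ami` is hypothetical; no such SSM API exists today, and the list of names would have to come from wherever the patched AMIs are published.

```python
import re
from typing import Iterable, Optional


def latest_patched_ami(base_name: str, ami_names: Iterable[str]) -> Optional[str]:
    """Pick the highest '-pN' patch of base_name, or None if none exists.

    Patch numbers are compared numerically, so '-p10' ranks above '-p2'.
    """
    pattern = re.compile(re.escape(base_name) + r"-p(\d+)")
    best_name = None
    best_patch = 0
    for name in ami_names:
        match = pattern.fullmatch(name)
        if match and int(match.group(1)) > best_patch:
            best_patch = int(match.group(1))
            best_name = name
    return best_name
```

A caller would fall back to the original AMI when the function returns `None`, which keeps the behavior identical to today's for AMIs that have no patched successors.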

Are you currently working around this issue?
We are considering using the techniques described in https://docs.aws.amazon.com/systems-manager/latest/userguide/automation-walk-ami-patching.html to create a lambda that will periodically check for security updates, generate new custom AMIs, and patch the ASG. However, we are probably not the only ones who can benefit from this.
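
The "patch the ASG" step of that workaround could look roughly like the sketch below. It assumes the Auto Scaling Group launches from an EC2 launch template and tracks its `$Latest` version, which may not match every setup. The helper name `point_launch_template_at_ami` is hypothetical; `create_launch_template_version` is the EC2 API call as exposed by boto3, and the client is passed in so the function itself has no dependencies.

```python
def point_launch_template_at_ami(ec2_client, launch_template_id: str, ami_id: str) -> int:
    """Create a new launch template version that only swaps the AMI ID.

    ec2_client is expected to behave like boto3.client("ec2").
    Returns the version number of the newly created template version.
    """
    response = ec2_client.create_launch_template_version(
        LaunchTemplateId=launch_template_id,
        SourceVersion="$Latest",  # copy every other setting from the latest version
        VersionDescription=f"patched AMI {ami_id}",
        LaunchTemplateData={"ImageId": ami_id},
    )
    return response["LaunchTemplateVersion"]["VersionNumber"]
```

In a Lambda this would be called with `boto3.client("ec2")` after the new custom AMI becomes available; nodes launched afterwards would then use the patched image.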

Attachments

custom-cloud-init-output.log
recommended-cloud-init-output.log

@ssanders1449 ssanders1449 added the Proposed Community submitted issue label Apr 14, 2022
@stevehipwell

@ssanders1449 why are you downloading patches on an optimised AMI where the combination of packages has been tested to make sure they work together? This doesn't seem to have any advantages and has lots of downsides; AWS already releases new optimised AMIs when there are vulnerabilities.

Do you have any example where the response was too slow for an exploitable vulnerability?

@mikestef9 mikestef9 added the EKS Amazon Elastic Kubernetes Service label Apr 14, 2022
@ssanders1449
Author

> @ssanders1449 why are you downloading patches on an optimised AMI where the combination of packages has been tested to make sure they work together? This doesn't seem to have any advantages and has lots of downsides; AWS already releases new optimised AMIs when there are vulnerabilities.
>
> Do you have any example where the response was too slow for an exploitable vulnerability?

I am not explicitly downloading patches; the Optimized AMI itself is downloading them. This is because of the following line in /etc/cloud/cloud.cfg:

  • package-update-upgrade-install

Removing this line from cloud.cfg significantly improves launch time.
My suggestion is to remove this line from cloud.cfg and, instead of installing the patches at launch time, include them in updated AMIs.
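
For anyone baking a custom AMI in the meantime, the removal can be scripted. This is a minimal sketch, assuming the module appears as its own list entry in cloud.cfg as shown above; the function name `drop_boot_time_upgrade` is hypothetical.

```python
MODULE = "package-update-upgrade-install"


def drop_boot_time_upgrade(cloud_cfg_text: str) -> str:
    """Return cloud.cfg text without the package-update-upgrade-install entry.

    Only lines that consist of the module name, optionally preceded by a
    YAML list dash, are removed; commented-out lines and everything else
    are kept unchanged.
    """
    kept = []
    for line in cloud_cfg_text.splitlines(keepends=True):
        entry = line.strip().lstrip("-").strip()
        if entry != MODULE:
            kept.append(line)
    return "".join(kept)
```

The result would be written back to /etc/cloud/cloud.cfg during the AMI build, after the build itself has applied the pending patches, so that nodes launched from the image skip the upgrade step.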

@stevehipwell

Thanks for the clarification @ssanders1449, this does seem like something that we should have control over. For Bottlerocket you can control if your AMI is upgraded or if you want to replace it.

Do you happen to have any numbers for packages installed from day 0 of an AL2 AMI release to when the next AMI is released?
