Karpenter 1.1.1 TLS Handshake Timeout in US-WEST-2 #7542

Closed

SKY-Mark-Rhoades-Brown opened this issue Dec 19, 2024 · 3 comments
Labels
bug Something isn't working triage/needs-information Marks that the issue still needs more information to properly triage

Comments

@SKY-Mark-Rhoades-Brown

Description

Observed Behavior:

We deploy multiple clusters using Terraform to ensure they are identical. After upgrading Karpenter from 1.0.0 to 1.1.1, the pods in US-WEST-2 fail with the following log message:

error retrieving attempts data due to: no attempts initialized. NewAttempt() should be called first. Skipping Throughput metrics

{"level":"ERROR","time":"2024-12-19T16:35:59.049Z","logger":"controller","message":"ec2 api connectivity check failed","commit":"0a85efb","error":"operation error EC2: DescribeInstanceTypes, get identity: get credentials: failed to refresh cached credentials, failed to retrieve credentials, operation error STS: AssumeRoleWithWebIdentity, exceeded maximum number of attempts, 3, https response error StatusCode: 0, RequestID: , request send failed, Post \"https://sts.us-west-2.amazonaws.com/\": net/http: TLS handshake timeout"}

This does not happen in EU-WEST-1 with identical configuration. We tested rolling back: 1.1.0 fails in the same way, while 1.0.8 and below work perfectly fine.
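One way we could try to isolate this (a minimal sketch, assuming the default karpenter namespace and service account name from the chart install) is to run a one-off pod with the same IRSA service account and call STS directly, to see whether the timeout is specific to the Karpenter controller or affects any workload on those nodes:

apiVersion: v1
kind: Pod
metadata:
  name: sts-check
  namespace: karpenter            # assumption: namespace of the Karpenter release
spec:
  serviceAccountName: karpenter   # assumption: the service account carrying the IRSA annotation above
  restartPolicy: Never
  containers:
    - name: awscli
      image: public.ecr.aws/aws-cli/aws-cli:latest
      # Calls STS in us-west-2 via the injected web identity token; a timeout
      # here would point at the network path rather than Karpenter itself.
      command: ["aws", "sts", "get-caller-identity", "--region", "us-west-2"]

One caveat: the AWS CLI uses a different TLS stack than the Go-based controller, so a successful call here only rules out general network and IAM problems, not client-specific behaviour.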

Expected Behavior:

We would expect not to receive TLS errors.

Reproduction Steps (Please include YAML):

helm install karpenter oci://public.ecr.aws/karpenter/karpenter  --values /karpentervalues.yaml --version 1.1.0

This is the contents of karpentervalues.yaml

controller:
  resources:
    limits:
      cpu: 1
      memory: 1Gi
    requests:
      cpu: 1
      memory: 1Gi
ec2nodeclass:
  nodeRole: <my-node-role> 
serviceAccount:
  annotations:
    eks.amazonaws.com/role-arn: arn:aws:iam::<my-account-id>:role/<my-iam-role>
settings:
  clusterCABundle: <my-cluster-ca-bundle>
  clusterEndpoint: <my-cluster-endpoint>
  clusterName: <my-cluster-name>
  eksControlPlane: true
  interruptionQueue: <my-interruption-queue>

Versions:

  • Chart Version: 1.1.0, 1.1.1
  • Kubernetes Version (kubectl version): 1.30
  • Please vote on this issue by adding a 👍 reaction to the original issue to help the community and maintainers prioritize this request
  • Please do not leave "+1" or "me too" comments, they generate extra noise for issue followers and do not help prioritize the request
  • If you are interested in working on this issue or have submitted a pull request, please leave a comment
@SKY-Mark-Rhoades-Brown SKY-Mark-Rhoades-Brown added bug Something isn't working needs-triage Issues that need to be triaged labels Dec 19, 2024

talhermon commented Dec 22, 2024

Faced the same error and opened an issue in the kubernetes-sigs/karpenter repo; waiting for updates.
kubernetes-sigs/karpenter#1890

jmdeal (Contributor) commented Dec 23, 2024

@SKY-Mark-Rhoades-Brown Is this consistently reproducible, or is it transient? I'm unable to reproduce with v1.1.1 in a cluster in us-west-2, so I suspect this has something to do with your specific environment rather than Karpenter. Are you able to reproduce it in a different cluster?

@talhermon I believe your issue is slightly different: you aren't seeing a TLS handshake error, the connection is refused. It sounds like the controller eventually succeeds in communicating with STS, at which point you encounter authorization failures. I recommend opening a separate issue.

@jmdeal jmdeal added triage/needs-information Marks that the issue still needs more information to properly triage and removed needs-triage Issues that need to be triaged labels Dec 23, 2024
SKY-Mark-Rhoades-Brown (Author) commented

Hi. I am sorry to have raised this with you; it turned out to be an issue with a firewall in our environment combined with the Go version upgrade. We found that another component we upgraded suffered identical symptoms and had also had its Go version bumped.

In the end we built a test of our own and found that the issue was introduced between Go 1.22.0 and 1.23.4. Further digging led us to golang/go#67061: setting GODEBUG=tlskyber=0 fixes it for us, and we are planning a firewall upgrade to resolve the problem properly.
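For anyone else who hits this behind a similar firewall, a minimal sketch of the workaround via the chart, assuming your chart version exposes controller.env for passing extra environment variables to the controller (check the values for your release):

controller:
  env:
    # Assumption: the chart passes these through to the controller container.
    # Disables the larger post-quantum (Kyber) TLS key share enabled by default
    # in newer Go toolchains, which some firewalls mishandle; see golang/go#67061.
    - name: GODEBUG
      value: tlskyber=0

This is only a stopgap; the proper fix on our side is the firewall upgrade.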

I will close this ticket out as it is certainly not a Karpenter issue.

Thanks very much for taking a look, Karpenter is awesome!
