Autoscaler getting into segmentation faults #4741

Closed
jkl373 opened this issue Mar 16, 2022 · 8 comments
Labels: area/cluster-autoscaler, kind/bug

Comments


jkl373 commented Mar 16, 2022

Which component are you using?: cluster-autoscaler

What version of the component are you using?: v1.21.2

Component version: v1.21.2

What k8s version are you using (kubectl version)?:

kubectl version Output
$ kubectl version
Client Version: version.Info{Major:"1", Minor:"23", GitVersion:"v1.23.0", GitCommit:"ab69524f795c42094a6630298ff53f3c3ebab7f4", GitTreeState:"clean", BuildDate:"2021-12-07T18:08:39Z", GoVersion:"go1.17.3", Compiler:"gc", Platform:"darwin/arm64"}
Server Version: version.Info{Major:"1", Minor:"20", GitVersion:"v1.20.7", GitCommit:"132a687512d7fb058d0f5890f07d4121b3f0a2e2", GitTreeState:"clean", BuildDate:"2021-05-12T12:32:49Z", GoVersion:"go1.15.12", Compiler:"gc", Platform:"linux/amd64"}

What environment is this in?: GCP

What did you expect to happen?:

What happened instead?: The autoscaler panicked and crashed with a segmentation fault.

How to reproduce it (as minimally and precisely as possible): Nil

Anything else we need to know?:

panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x58 pc=0x30c0ca3]

goroutine 1 [running]:
k8s.io/autoscaler/cluster-autoscaler/core.(*StaticAutoscaler).deleteCreatedNodesWithErrors(0xc000e023c0)
	/gopath/src/k8s.io/autoscaler/cluster-autoscaler/core/static_autoscaler.go:646 +0x2e3
k8s.io/autoscaler/cluster-autoscaler/core.(*StaticAutoscaler).RunOnce(0xc000e023c0, 0xc083fd7c206a37f6, 0xccd4488903, 0x62adf80, 0x0, 0x0)
	/gopath/src/k8s.io/autoscaler/cluster-autoscaler/core/static_autoscaler.go:339 +0x1177
main.run(0xc00009e140)
	/gopath/src/k8s.io/autoscaler/cluster-autoscaler/main.go:368 +0x361
main.main()
	/gopath/src/k8s.io/autoscaler/cluster-autoscaler/main.go:408 +0x8b7

jkl373 added the kind/bug label on Mar 16, 2022
Author

jkl373 commented Apr 17, 2022

Another instance:

W0417 09:56:17.472161       1 clusterstate.go:447] Failed to find readiness information for https://content.googleapis.com/compute/v1/projects/<<project>>/zones/us-east4-b/instanceGroups/b-es-large-nodes-spot-pg-005-b-us-east4-gcp-reai-io
I0417 09:56:17.668884       1 cache.go:244] Regenerating MIG information for <<project>>/us-east4-b/b-es-large-nodes-pg-005-b-us-east4-gcp-reai-io
W0417 09:56:17.811378       1 cache.go:198] instance <<project>>/us-east4-b/es-large-nodes-spot-n2tc belongs to unknown mig
W0417 09:56:17.811461       1 clusterstate.go:586] Nodegroup is nil for gce://<<project>>/us-east4-b/es-large-nodes-spot-n2tc
I0417 09:56:17.811614       1 clusterstate.go:994] Found 1 instances with errorCode OutOfResource.RESOURCE_POOL_EXHAUSTED in nodeGroup https://content.googleapis.com/compute/v1/projects/<<project>>/zones/us-east4-b/instanceGroups/b-es-gpu-nodes-pg-005-b-us-east4-gcp-reai-io
I0417 09:56:17.811641       1 clusterstate.go:1012] Failed adding 1 nodes (1 unseen previously) to group https://content.googleapis.com/compute/v1/projects/<<project>>/zones/us-east4-b/instanceGroups/b-es-gpu-nodes-pg-005-b-us-east4-gcp-reai-io due to OutOfResource.RESOURCE_POOL_EXHAUSTED; errorMessages=[]string{"Instance 'es-gpu-nodes-t9bf' creation failed: The zone 'projects/<<project>>/zones/us-east4-b' does not have enough resources available to fulfill the request.  '(resource type:compute)'."}
I0417 09:56:17.811699       1 clusterstate.go:994] Found 1 instances with errorCode OutOfResource.QUOTA_EXCEEDED in nodeGroup https://content.googleapis.com/compute/v1/projects/<<project>>/zones/us-east4-b/instanceGroups/b-es-large-nodes-spot-pg-005-b-us-east4-gcp-reai-io
I0417 09:56:17.811715       1 clusterstate.go:1012] Failed adding 1 nodes (1 unseen previously) to group https://content.googleapis.com/compute/v1/projects/<<project>>/zones/us-east4-b/instanceGroups/b-es-large-nodes-spot-pg-005-b-us-east4-gcp-reai-io due to OutOfResource.QUOTA_EXCEEDED; errorMessages=[]string{"Instance 'es-large-nodes-spot-n2tc' creation failed: Quota 'C2_CPUS' exceeded.  Limit: 24.0 in region us-east4."}
W0417 09:56:17.811729       1 clusterstate.go:447] Failed to find readiness information for https://content.googleapis.com/compute/v1/projects/<<project>>/zones/us-east4-b/instanceGroups/b-es-large-nodes-spot-pg-005-b-us-east4-gcp-reai-io
W0417 09:56:17.811801       1 clusterstate.go:621] Readiness for node group https://content.googleapis.com/compute/v1/projects/<<project>>/zones/us-east4-b/instanceGroups/b-es-large-nodes-spot-pg-005-b-us-east4-gcp-reai-io not found
I0417 09:56:17.811877       1 static_autoscaler.go:319] 2 unregistered nodes present
W0417 09:56:17.812003       1 clusterstate.go:385] Failed to find readiness information for https://content.googleapis.com/compute/v1/projects/<<project>>/zones/us-east4-b/instanceGroups/b-es-large-nodes-spot-pg-005-b-us-east4-gcp-reai-io
W0417 09:56:17.812018       1 clusterstate.go:447] Failed to find readiness information for https://content.googleapis.com/compute/v1/projects/<<project>>/zones/us-east4-b/instanceGroups/b-es-large-nodes-spot-pg-005-b-us-east4-gcp-reai-io
W0417 09:56:17.812024       1 clusterstate.go:385] Failed to find readiness information for https://content.googleapis.com/compute/v1/projects/<<project>>/zones/us-east4-b/instanceGroups/b-es-large-nodes-spot-pg-005-b-us-east4-gcp-reai-io
panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x58 pc=0x30c0ca3]

goroutine 1 [running]:
k8s.io/autoscaler/cluster-autoscaler/core.(*StaticAutoscaler).deleteCreatedNodesWithErrors(0xc001e71b80)
	/gopath/src/k8s.io/autoscaler/cluster-autoscaler/core/static_autoscaler.go:646 +0x2e3
k8s.io/autoscaler/cluster-autoscaler/core.(*StaticAutoscaler).RunOnce(0xc001e71b80, 0xc08f16f051980039, 0xe8bea56622, 0x62adf80, 0x0, 0x0)
	/gopath/src/k8s.io/autoscaler/cluster-autoscaler/core/static_autoscaler.go:339 +0x1177
main.run(0xc000094550)
	/gopath/src/k8s.io/autoscaler/cluster-autoscaler/main.go:368 +0x361
main.main()
	/gopath/src/k8s.io/autoscaler/cluster-autoscaler/main.go:408 +0x8b7

@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle stale
  • Mark this issue or PR as rotten with /lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

k8s-ci-robot added the lifecycle/stale label on Jul 16, 2022
@k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

k8s-ci-robot added the lifecycle/rotten label and removed the lifecycle/stale label on Aug 15, 2022

drmorr0 commented Sep 8, 2022

/remove-lifecycle rotten

k8s-ci-robot removed the lifecycle/rotten label on Sep 8, 2022

drmorr0 commented Sep 8, 2022

We've also observed this same behaviour on 1.19.2, running on AWS.


drmorr0 commented Sep 8, 2022

Aha. This issue was resolved in #4926, and that fix is present in 1.21.3.
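
For context, the crash is a nil pointer dereference inside deleteCreatedNodesWithErrors, hit when an instance's node group resolves to nil (matching the "Nodegroup is nil for gce://..." warning in the log above). Below is a minimal, illustrative Go sketch of the kind of nil guard that avoids the panic; the type and function names are stand-ins for illustration only, not the real cluster-autoscaler cloudprovider API nor the exact patch in #4926.

```go
package main

import "fmt"

// NodeGroup is a stand-in for the autoscaler's node group abstraction
// (illustrative only, not the real cloudprovider interface).
type NodeGroup struct{ ID string }

// nodeGroupForInstance mimics a lookup that can return nil when an
// instance belongs to an unknown MIG, as seen in the logs above.
func nodeGroupForInstance(instance string) *NodeGroup {
	if instance == "gce://<<project>>/us-east4-b/es-large-nodes-spot-n2tc" {
		return nil // unknown MIG -> no node group
	}
	return &NodeGroup{ID: "b-es-gpu-nodes-pg-005"}
}

func main() {
	instances := []string{
		"gce://<<project>>/us-east4-b/es-gpu-nodes-t9bf",
		"gce://<<project>>/us-east4-b/es-large-nodes-spot-n2tc",
	}
	for _, inst := range instances {
		ng := nodeGroupForInstance(inst)
		if ng == nil {
			// Guard: skip instances without a node group instead of
			// dereferencing nil, which is what produced the SIGSEGV.
			fmt.Printf("skipping %s: no matching node group\n", inst)
			continue
		}
		fmt.Printf("would clean up failed instance %s in group %s\n", inst, ng.ID)
	}
}
```
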


drmorr0 commented Sep 8, 2022

/close

@k8s-ci-robot

@drmorr0: Closing this issue.

In response to this:

/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
