Autoscaler getting into segmentation faults #4741

Closed
jkl373 opened this issue Mar 16, 2022 · 8 comments
Labels: area/cluster-autoscaler, kind/bug

Comments


jkl373 commented Mar 16, 2022

Which component are you using?: cluster-autoscaler

What version of the component are you using?: v1.21.2

Component version: v1.21.2

What k8s version are you using (kubectl version)?:

kubectl version Output
$ kubectl version
Client Version: version.Info{Major:"1", Minor:"23", GitVersion:"v1.23.0", GitCommit:"ab69524f795c42094a6630298ff53f3c3ebab7f4", GitTreeState:"clean", BuildDate:"2021-12-07T18:08:39Z", GoVersion:"go1.17.3", Compiler:"gc", Platform:"darwin/arm64"}
Server Version: version.Info{Major:"1", Minor:"20", GitVersion:"v1.20.7", GitCommit:"132a687512d7fb058d0f5890f07d4121b3f0a2e2", GitTreeState:"clean", BuildDate:"2021-05-12T12:32:49Z", GoVersion:"go1.15.12", Compiler:"gc", Platform:"linux/amd64"}

What environment is this in?: GCP

What did you expect to happen?:

What happened instead?: The autoscaler panicked and crashed with a segmentation fault.

How to reproduce it (as minimally and precisely as possible): Nil

Anything else we need to know?:

panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x58 pc=0x30c0ca3]

goroutine 1 [running]:
k8s.io/autoscaler/cluster-autoscaler/core.(*StaticAutoscaler).deleteCreatedNodesWithErrors(0xc000e023c0)
	/gopath/src/k8s.io/autoscaler/cluster-autoscaler/core/static_autoscaler.go:646 +0x2e3
k8s.io/autoscaler/cluster-autoscaler/core.(*StaticAutoscaler).RunOnce(0xc000e023c0, 0xc083fd7c206a37f6, 0xccd4488903, 0x62adf80, 0x0, 0x0)
	/gopath/src/k8s.io/autoscaler/cluster-autoscaler/core/static_autoscaler.go:339 +0x1177
main.run(0xc00009e140)
	/gopath/src/k8s.io/autoscaler/cluster-autoscaler/main.go:368 +0x361
main.main()
	/gopath/src/k8s.io/autoscaler/cluster-autoscaler/main.go:408 +0x8b7

jkl373 added the kind/bug label on Mar 16, 2022
Author

jkl373 commented Apr 17, 2022

Another instance:

W0417 09:56:17.472161       1 clusterstate.go:447] Failed to find readiness information for https://content.googleapis.com/compute/v1/projects/<<project>>/zones/us-east4-b/instanceGroups/b-es-large-nodes-spot-pg-005-b-us-east4-gcp-reai-io
I0417 09:56:17.668884       1 cache.go:244] Regenerating MIG information for <<project>>/us-east4-b/b-es-large-nodes-pg-005-b-us-east4-gcp-reai-io
W0417 09:56:17.811378       1 cache.go:198] instance <<project>>/us-east4-b/es-large-nodes-spot-n2tc belongs to unknown mig
W0417 09:56:17.811461       1 clusterstate.go:586] Nodegroup is nil for gce://<<project>>/us-east4-b/es-large-nodes-spot-n2tc
I0417 09:56:17.811614       1 clusterstate.go:994] Found 1 instances with errorCode OutOfResource.RESOURCE_POOL_EXHAUSTED in nodeGroup https://content.googleapis.com/compute/v1/projects/<<project>>/zones/us-east4-b/instanceGroups/b-es-gpu-nodes-pg-005-b-us-east4-gcp-reai-io
I0417 09:56:17.811641       1 clusterstate.go:1012] Failed adding 1 nodes (1 unseen previously) to group https://content.googleapis.com/compute/v1/projects/<<project>>/zones/us-east4-b/instanceGroups/b-es-gpu-nodes-pg-005-b-us-east4-gcp-reai-io due to OutOfResource.RESOURCE_POOL_EXHAUSTED; errorMessages=[]string{"Instance 'es-gpu-nodes-t9bf' creation failed: The zone 'projects/<<project>>/zones/us-east4-b' does not have enough resources available to fulfill the request.  '(resource type:compute)'."}
I0417 09:56:17.811699       1 clusterstate.go:994] Found 1 instances with errorCode OutOfResource.QUOTA_EXCEEDED in nodeGroup https://content.googleapis.com/compute/v1/projects/<<project>>/zones/us-east4-b/instanceGroups/b-es-large-nodes-spot-pg-005-b-us-east4-gcp-reai-io
I0417 09:56:17.811715       1 clusterstate.go:1012] Failed adding 1 nodes (1 unseen previously) to group https://content.googleapis.com/compute/v1/projects/<<project>>/zones/us-east4-b/instanceGroups/b-es-large-nodes-spot-pg-005-b-us-east4-gcp-reai-io due to OutOfResource.QUOTA_EXCEEDED; errorMessages=[]string{"Instance 'es-large-nodes-spot-n2tc' creation failed: Quota 'C2_CPUS' exceeded.  Limit: 24.0 in region us-east4."}
W0417 09:56:17.811729       1 clusterstate.go:447] Failed to find readiness information for https://content.googleapis.com/compute/v1/projects/<<project>>/zones/us-east4-b/instanceGroups/b-es-large-nodes-spot-pg-005-b-us-east4-gcp-reai-io
W0417 09:56:17.811801       1 clusterstate.go:621] Readiness for node group https://content.googleapis.com/compute/v1/projects/<<project>>/zones/us-east4-b/instanceGroups/b-es-large-nodes-spot-pg-005-b-us-east4-gcp-reai-io not found
I0417 09:56:17.811877       1 static_autoscaler.go:319] 2 unregistered nodes present
W0417 09:56:17.812003       1 clusterstate.go:385] Failed to find readiness information for https://content.googleapis.com/compute/v1/projects/<<project>>/zones/us-east4-b/instanceGroups/b-es-large-nodes-spot-pg-005-b-us-east4-gcp-reai-io
W0417 09:56:17.812018       1 clusterstate.go:447] Failed to find readiness information for https://content.googleapis.com/compute/v1/projects/<<project>>/zones/us-east4-b/instanceGroups/b-es-large-nodes-spot-pg-005-b-us-east4-gcp-reai-io
W0417 09:56:17.812024       1 clusterstate.go:385] Failed to find readiness information for https://content.googleapis.com/compute/v1/projects/<<project>>/zones/us-east4-b/instanceGroups/b-es-large-nodes-spot-pg-005-b-us-east4-gcp-reai-io
panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x58 pc=0x30c0ca3]

goroutine 1 [running]:
k8s.io/autoscaler/cluster-autoscaler/core.(*StaticAutoscaler).deleteCreatedNodesWithErrors(0xc001e71b80)
	/gopath/src/k8s.io/autoscaler/cluster-autoscaler/core/static_autoscaler.go:646 +0x2e3
k8s.io/autoscaler/cluster-autoscaler/core.(*StaticAutoscaler).RunOnce(0xc001e71b80, 0xc08f16f051980039, 0xe8bea56622, 0x62adf80, 0x0, 0x0)
	/gopath/src/k8s.io/autoscaler/cluster-autoscaler/core/static_autoscaler.go:339 +0x1177
main.run(0xc000094550)
	/gopath/src/k8s.io/autoscaler/cluster-autoscaler/main.go:368 +0x361
main.main()
	/gopath/src/k8s.io/autoscaler/cluster-autoscaler/main.go:408 +0x8b7

@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle stale
  • Mark this issue or PR as rotten with /lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

k8s-ci-robot added the lifecycle/stale label on Jul 16, 2022
@k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

k8s-ci-robot added the lifecycle/rotten label and removed the lifecycle/stale label on Aug 15, 2022

drmorr0 commented Sep 8, 2022

/remove-lifecycle rotten

k8s-ci-robot removed the lifecycle/rotten label on Sep 8, 2022

drmorr0 commented Sep 8, 2022

We've also observed this same behaviour on 1.19.2, running on AWS.


drmorr0 commented Sep 8, 2022

Aha. This issue was resolved in #4926, and that fix is present in 1.21.3.
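
For context, the crash is a nil pointer dereference inside deleteCreatedNodesWithErrors, hit when an instance's node group resolves to nil (matching the "Nodegroup is nil for gce://..." warning in the log above). Below is a minimal, illustrative Go sketch of the kind of nil guard that avoids the panic; the type and function names are stand-ins for illustration only, not the real cluster-autoscaler cloudprovider API nor the exact patch in #4926.

```go
package main

import "fmt"

// NodeGroup is a stand-in for the autoscaler's node group abstraction
// (illustrative only, not the real cloudprovider interface).
type NodeGroup struct{ ID string }

// nodeGroupForInstance mimics a lookup that can return nil when an
// instance belongs to an unknown MIG, as seen in the logs above.
func nodeGroupForInstance(instance string) *NodeGroup {
	if instance == "gce://<<project>>/us-east4-b/es-large-nodes-spot-n2tc" {
		return nil // unknown MIG -> no node group
	}
	return &NodeGroup{ID: "b-es-gpu-nodes-pg-005"}
}

func main() {
	instances := []string{
		"gce://<<project>>/us-east4-b/es-gpu-nodes-t9bf",
		"gce://<<project>>/us-east4-b/es-large-nodes-spot-n2tc",
	}
	for _, inst := range instances {
		ng := nodeGroupForInstance(inst)
		if ng == nil {
			// Guard: skip instances without a node group instead of
			// dereferencing nil, which is what produced the SIGSEGV.
			fmt.Printf("skipping %s: no matching node group\n", inst)
			continue
		}
		fmt.Printf("would clean up failed instance %s in group %s\n", inst, ng.ID)
	}
}
```
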


drmorr0 commented Sep 8, 2022

/close

@k8s-ci-robot

@drmorr0: Closing this issue.

In response to this:

/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
