
Potential memory leak in OpenSSL #7647

Closed
Lyt99 opened this issue Sep 16, 2021 · 18 comments
Labels
kind/bug Categorizes issue or PR as related to a bug. lifecycle/frozen Indicates that an issue or PR should not be auto-closed due to staleness. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. priority/critical-urgent Highest priority. Must be actively worked on as someone's top priority right now. triage/needs-information Indicates an issue needs more information in order to work on it.

Comments

@Lyt99
Contributor

Lyt99 commented Sep 16, 2021

NGINX Ingress controller version (exec into the pod and run nginx-ingress-controller --version.): 0.44.0 & 0.49.0
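For reference, the version check above can be run from outside the pod like this (a hedged example; the pod name is a placeholder, not one from this report):

kubectl -n ingress-nginx exec -it <controller-pod> -- /nginx-ingress-controller --version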

Kubernetes version (use kubectl version): 1.18.8

Environment:

  • Cloud provider or hardware configuration: AlibabaCloud
  • OS (e.g. from /etc/os-release): Alibaba Cloud Linux (Aliyun Linux)
  • Kernel (e.g. uname -a): 4.19.91-23.al7.x86_64
  • Install tools:
    • From AlibabaCloud console

What happened:

We've encountered memory issues in both 0.44.0 and 0.49.0.
Some of the ingress pods show high memory usage, while others stay at a normal level.

[screenshot]

We ran some diagnostics on the pod, and they show that one of the nginx workers had accumulated a large amount of memory.

[screenshot]

The incoming traffic is balanced at about 100 requests per second, and the connection counts across pods are of the same order of magnitude (10k+ to 100k+).

We then used pmap -x <pid> to get details of the memory. There were lots of tiny anonymous blocks in the memory map.

[screenshot]
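A sketch of how that inspection can be reproduced (these are not the exact commands from this report; the pod name is a placeholder):

# exec into the affected controller pod
kubectl -n ingress-nginx exec -it <controller-pod> -- sh

# inside the pod: find the worker with the largest resident set, then dump its memory map
ps aux | grep 'nginx: worker'
pmap -x <worker-pid>    # look for large numbers of small anonymous ("anon") mappings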

We made a core dump and took a look at this memory area; most of its content seems to be related to TLS certificates. We also ran memleak on the process, with the results below:

[16:18:49] Top 10 stacks with outstanding allocations:
	300580 bytes in 15029 allocations from stack
		CRYPTO_strdup+0x30 [libcrypto.so.1.1]
		[unknown]
	462706 bytes in 375 allocations from stack
		[unknown] [libcrypto.so.1.1]
	507864 bytes in 9069 allocations from stack
		CRYPTO_zalloc+0xa [libcrypto.so.1.1]
		[unknown]
	536576 bytes in 131 allocations from stack
		[unknown] [libcrypto.so.1.1]
	848638 bytes in 333 allocations from stack
		ngx_alloc+0xf [nginx]
		[unknown]
	2100720 bytes in 22253 allocations from stack
		CRYPTO_zalloc+0xa [libcrypto.so.1.1]
	3074792 bytes in 888 allocations from stack
		BUF_MEM_grow+0x81 [libcrypto.so.1.1]
	3496960 bytes in 4398 allocations from stack
		posix_memalign+0x1a [ld-musl-x86_64.so.1]
	5821440 bytes in 9096 allocations from stack
		[unknown] [libssl.so.1.1]
		[unknown]
	9060080 bytes in 22605 allocations from stack
		CRYPTO_zalloc+0xa [libcrypto.so.1.1]
[16:18:58] Top 10 stacks with outstanding allocations:
	287280 bytes in 14364 allocations from stack
		CRYPTO_strdup+0x30 [libcrypto.so.1.1]
		[unknown]
	393216 bytes in 96 allocations from stack
		[unknown] [libcrypto.so.1.1]
	396428 bytes in 322 allocations from stack
		[unknown] [libcrypto.so.1.1]
	486080 bytes in 8680 allocations from stack
		CRYPTO_zalloc+0xa [libcrypto.so.1.1]
		[unknown]
	724916 bytes in 286 allocations from stack
		ngx_alloc+0xf [nginx]
		[unknown]
	1949832 bytes in 20300 allocations from stack
		CRYPTO_zalloc+0xa [libcrypto.so.1.1]
	2032380 bytes in 727 allocations from stack
		BUF_MEM_grow+0x81 [libcrypto.so.1.1]
	3760256 bytes in 5049 allocations from stack
		posix_memalign+0x1a [ld-musl-x86_64.so.1]
	5575680 bytes in 8712 allocations from stack
		[unknown] [libssl.so.1.1]
		[unknown]
	8525968 bytes in 20572 allocations from stack
		CRYPTO_zalloc+0xa [libcrypto.so.1.1]
[16:19:06] Top 10 stacks with outstanding allocations:
	716420 bytes in 35821 allocations from stack
		CRYPTO_strdup+0x30 [libcrypto.so.1.1]
		[unknown]
	782336 bytes in 191 allocations from stack
		[unknown] [libcrypto.so.1.1]
	885218 bytes in 721 allocations from stack
		[unknown] [libcrypto.so.1.1]
	1233680 bytes in 22030 allocations from stack
		CRYPTO_zalloc+0xa [libcrypto.so.1.1]
		[unknown]
	1761982 bytes in 775 allocations from stack
		ngx_alloc+0xf [nginx]
		[unknown]
	3814396 bytes in 1525 allocations from stack
		BUF_MEM_grow+0x81 [libcrypto.so.1.1]
	4298576 bytes in 48880 allocations from stack
		CRYPTO_zalloc+0xa [libcrypto.so.1.1]
	11922816 bytes in 15455 allocations from stack
		posix_memalign+0x1a [ld-musl-x86_64.so.1]
	14005760 bytes in 21884 allocations from stack
		[unknown] [libssl.so.1.1]
		[unknown]
	21036912 bytes in 49333 allocations from stack
		CRYPTO_zalloc+0xa [libcrypto.so.1.1]

Here are more samples: m.log
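The exact invocations are not shown above; a hedged sketch of how such data can be collected, assuming bcc-tools on the node that hosts the pod and <worker-pid> being the leaking worker as seen in the host PID namespace:

gcore <worker-pid>                                        # writes core.<pid> for offline inspection
strings core.<worker-pid> | less                          # the TLS-certificate-related content was spotted this way
/usr/share/bcc/tools/memleak -p <worker-pid> --top 10 10  # top 10 outstanding stacks every 10s; path varies by distro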

Finally, we moved the certificate to the cloud provider's load balancer, and it's working fine now, but we still have no clue why this happens.

The leak happens in nginx on TLS connections. We tried rebuilding the image to upgrade the libraries to the newest versions (for openssl, 1.1.1l-r0), but that didn't help.
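For reference, a rebuild of that kind can look roughly like this (a hedged sketch, not the actual Dockerfile from this report; the base tag and package names are illustrative for the Alpine-based controller image):

FROM k8s.gcr.io/ingress-nginx/controller:v0.49.0
USER root
# pull the latest OpenSSL 1.1 packages available in the Alpine repositories
RUN apk add --no-cache --upgrade openssl libssl1.1 libcrypto1.1
USER www-data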

What you expected to happen:

No memory leak with TLS.

How to reproduce it:

I have no idea what makes the issue happen, and I can't reproduce it on another cluster.

Anything else we need to know:

So far, we haven't seen this issue with 0.30.0 (openssl 1.1.1d-r3); I don't know whether it's a problem in newer OpenSSL.

/kind bug

@Lyt99 Lyt99 added the kind/bug Categorizes issue or PR as related to a bug. label Sep 16, 2021
@k8s-ci-robot k8s-ci-robot added needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. needs-priority labels Sep 16, 2021
@longwuyuan
Contributor

/remove-kind bug
Hi, let us wait until we get some helpful information that hints at a bug.
Also, please provide the information asked in the issue template.

We have been making changes for performance, and very soon we will release a build with changed controller components. But if you test the current latest release and update the issue as per the template, it will help give a better perspective.

/triage needs-information

@k8s-ci-robot k8s-ci-robot added triage/needs-information Indicates an issue needs more information in order to work on it. needs-kind Indicates a PR lacks a `kind/foo` label and requires one. and removed kind/bug Categorizes issue or PR as related to a bug. labels Sep 16, 2021
@lvauvillier

Hi, I have the same issue:

[Screenshot 2021-09-25 at 19:50:09]

nginx -s reload temporarily solves the issue.
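A hedged example of applying that workaround from outside the pod (the pod name is a placeholder):

kubectl -n ingress-nginx exec <controller-pod> -- nginx -s reload   # replaces the workers, so their accumulated memory is freed until it builds up again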

Here is my info:

NGINX Ingress controller version (exec into the pod and run nginx-ingress-controller --version.):


NGINX Ingress controller
Release: v0.47.0
Build: 7201e37
Repository: https://github.com/kubernetes/ingress-nginx
nginx version: nginx/1.20.1


Kubernetes version (use kubectl version):

Client Version: version.Info{Major:"1", Minor:"21", GitVersion:"v1.21.2", GitCommit:"092fbfbf53427de67cac1e9fa54aaa09a28371d7", GitTreeState:"clean", BuildDate:"2021-06-16T12:59:11Z", GoVersion:"go1.16.5", Compiler:"gc", Platform:"darwin/amd64"}
Server Version: version.Info{Major:"1", Minor:"20+", GitVersion:"v1.20.9-gke.1001", GitCommit:"1fe18c314ed577f6047d2712a9d1c8e498e22381", GitTreeState:"clean", BuildDate:"2021-08-23T23:06:28Z", GoVersion:"go1.15.13b5", Compiler:"gc", Platform:"linux/amd64"}

Environment:

  • Cloud provider or hardware configuration:
    GCP
  • Kernel (e.g. uname -a):
Linux ingress-nginx-controller-788c5f7f88-d94pj 5.4.120+ #1 SMP Tue Jun 22 14:53:20 PDT 2021 x86_64 Linux

Helm:
helm -n ingress-nginx get values ingress-nginx
USER-SUPPLIED VALUES:

controller:
  affinity:
    podAntiAffinity:
      preferredDuringSchedulingIgnoredDuringExecution:
      - podAffinityTerm:
          labelSelector:
            matchExpressions:
            - key: app.kubernetes.io/name
              operator: In
              values:
              - nginx-ingress
          topologyKey: kubernetes.io/hostname
        weight: 100
  config:
    use-gzip: true
  metrics:
    enabled: true
    serviceMonitor:
      additionalLabels:
        release: kube-prometheus-stack
      enabled: true
      namespace: monitoring
  replicaCount: 2
  resources:
    requests:
      memory: 800Mi
  service:
    externalTrafficPolicy: Local

kubectl describe po -n ingress-nginx ingress-nginx-controller-788c5f7f88-d94pj

Name:         ingress-nginx-controller-788c5f7f88-d94pj
Namespace:    ingress-nginx
Priority:     0
Node:         gke-production-pool-1-66bb3111-sldn/10.132.0.4
Start Time:   Sat, 18 Sep 2021 17:17:13 +0200
Labels:       app.kubernetes.io/component=controller
              app.kubernetes.io/instance=ingress-nginx
              app.kubernetes.io/name=ingress-nginx
              pod-template-hash=788c5f7f88
Annotations:  kubectl.kubernetes.io/restartedAt: 2021-09-18T17:17:13+02:00
Status:       Running
IP:           10.52.3.39
IPs:
  IP:           10.52.3.39
Controlled By:  ReplicaSet/ingress-nginx-controller-788c5f7f88
Containers:
  controller:
    Container ID:  containerd://74fb58bce33d84fb54fb61a3a16772d6edf8858cc14a05c21d0feb79a90e8157
    Image:         k8s.gcr.io/ingress-nginx/controller:v0.47.0@sha256:a1e4efc107be0bb78f32eaec37bef17d7a0c81bec8066cdf2572508d21351d0b
    Image ID:      k8s.gcr.io/ingress-nginx/controller@sha256:a1e4efc107be0bb78f32eaec37bef17d7a0c81bec8066cdf2572508d21351d0b
    Ports:         80/TCP, 443/TCP, 10254/TCP, 8443/TCP
    Host Ports:    0/TCP, 0/TCP, 0/TCP, 0/TCP
    Args:
      /nginx-ingress-controller
      --publish-service=$(POD_NAMESPACE)/ingress-nginx-controller
      --election-id=ingress-controller-leader
      --ingress-class=nginx
      --configmap=$(POD_NAMESPACE)/ingress-nginx-controller
      --validating-webhook=:8443
      --validating-webhook-certificate=/usr/local/certificates/cert
      --validating-webhook-key=/usr/local/certificates/key
    State:          Running
      Started:      Sat, 18 Sep 2021 17:17:14 +0200
    Ready:          True
    Restart Count:  0
    Requests:
      cpu:      100m
      memory:   800Mi
    Liveness:   http-get http://:10254/healthz delay=10s timeout=1s period=10s #success=1 #failure=5
    Readiness:  http-get http://:10254/healthz delay=10s timeout=1s period=10s #success=1 #failure=3
    Environment:
      POD_NAME:       ingress-nginx-controller-788c5f7f88-d94pj (v1:metadata.name)
      POD_NAMESPACE:  ingress-nginx (v1:metadata.namespace)
      LD_PRELOAD:     /usr/local/lib/libmimalloc.so
    Mounts:
      /usr/local/certificates/ from webhook-cert (ro)
      /var/run/secrets/kubernetes.io/serviceaccount from ingress-nginx-token-cn2nx (ro)
Conditions:
  Type              Status
  Initialized       True 
  Ready             True 
  ContainersReady   True 
  PodScheduled      True 
Volumes:
  webhook-cert:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  ingress-nginx-admission
    Optional:    false
  ingress-nginx-token-cn2nx:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  ingress-nginx-token-cn2nx
    Optional:    false
QoS Class:       Burstable
Node-Selectors:  kubernetes.io/os=linux
Tolerations:     node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
                 node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:          <none>

kubectl describe svc -n ingress-nginx ingress-nginx-controller

Name:                     ingress-nginx-controller
Namespace:                ingress-nginx
Labels:                   app.kubernetes.io/component=controller
                          app.kubernetes.io/instance=ingress-nginx
                          app.kubernetes.io/managed-by=Helm
                          app.kubernetes.io/name=ingress-nginx
                          app.kubernetes.io/version=0.47.0
                          helm.sh/chart=ingress-nginx-3.34.0
Annotations:              cloud.google.com/neg: {"ingress":true}
                          meta.helm.sh/release-name: ingress-nginx
                          meta.helm.sh/release-namespace: ingress-nginx
Selector:                 app.kubernetes.io/component=controller,app.kubernetes.io/instance=ingress-nginx,app.kubernetes.io/name=ingress-nginx
Type:                     LoadBalancer
IP Families:              <none>
IP:                       10.56.2.89
IPs:                      10.56.2.89
LoadBalancer Ingress:     xxx.xxx.xxx.xxx
Port:                     http  80/TCP
TargetPort:               http/TCP
NodePort:                 http  31463/TCP
Endpoints:                10.52.3.39:80,10.52.4.31:80
Port:                     https  443/TCP
TargetPort:               https/TCP
NodePort:                 https  30186/TCP
Endpoints:                10.52.3.39:443,10.52.4.31:443
Session Affinity:         None
External Traffic Policy:  Local
HealthCheck NodePort:     30802
Events:                   <none>

@rikatz
Contributor

rikatz commented Sep 30, 2021

/priority critical-urgent
I will look at this together with another possible "leak" that is happening.

I have received the suggestion to test building the image with BoringSSL instead of OpenSSL (for FIPS compliance, etc.); maybe we can try that as well.

@k8s-ci-robot k8s-ci-robot added priority/critical-urgent Highest priority. Must be actively worked on as someone's top priority right now. and removed needs-priority labels Sep 30, 2021
@lvauvillier

I have the same memory leak issue with the latest version:

bash-5.1$ /nginx-ingress-controller --version
-------------------------------------------------------------------------------
NGINX Ingress controller
  Release:       v1.0.2
  Build:         2b8ed4511af75a7c41e52726b0644d600fc7961b
  Repository:    https://github.com/kubernetes/ingress-nginx
  nginx version: nginx/1.19.9

-------------------------------------------------------------------------------

[Screenshot 2021-09-30 at 20:42:18]

[Screenshot 2021-09-30 at 20:44:31]

@rikatz
Contributor

rikatz commented Oct 1, 2021

Folks,

In case I generate an image of 0.49.3 (to be released) with the OpenResty OpenSSL patch applied, would you be able to test it and provide some feedback?
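If such a test image gets published, a hedged sketch of how it could be swapped in via the Helm chart (the registry and tag below are placeholders, not a real published image):

helm upgrade ingress-nginx ingress-nginx/ingress-nginx -n ingress-nginx \
  --reuse-values \
  --set controller.image.registry=<test-registry> \
  --set controller.image.image=ingress-nginx/controller \
  --set controller.image.tag=<candidate-tag> \
  --set controller.image.digest=   # clear the pinned digest so the test tag is actually pulled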

@strongjz
Member

/kind bug
/triage accepted

@k8s-ci-robot k8s-ci-robot added kind/bug Categorizes issue or PR as related to a bug. triage/accepted Indicates an issue or PR is ready to be actively worked on. and removed needs-kind Indicates a PR lacks a `kind/foo` label and requires one. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Oct 11, 2021
@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle stale
  • Mark this issue or PR as rotten with /lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jan 9, 2022
@k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

@k8s-ci-robot k8s-ci-robot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Feb 8, 2022
@k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Reopen this issue or PR with /reopen
  • Mark this issue or PR as fresh with /remove-lifecycle rotten
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close

@k8s-ci-robot
Contributor

@k8s-triage-robot: Closing this issue.

In response to this:

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Reopen this issue or PR with /reopen
  • Mark this issue or PR as fresh with /remove-lifecycle rotten
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@rosscdh

rosscdh commented Aug 31, 2022

+1 still happening

@strongjz
Member

/reopen
/lifecycle frozen

@k8s-ci-robot k8s-ci-robot reopened this Aug 31, 2022
@k8s-ci-robot
Contributor

@strongjz: Reopened this issue.

In response to this:

/reopen
/lifecycle frozen

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot k8s-ci-robot added lifecycle/frozen Indicates that an issue or PR should not be auto-closed due to staleness. and removed lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. labels Aug 31, 2022
@k8s-triage-robot

This issue is labeled with priority/critical-urgent but has not been updated in over 30 days, and should be re-triaged.
Critical-urgent issues must be actively worked on as someone's top priority right now.

You can:

  • Confirm that this issue is still relevant with /triage accepted (org members only)
  • Deprioritize it with /priority {important-soon, important-longterm, backlog}
  • Close this issue with /close

For more details on the triage process, see https://www.kubernetes.dev/docs/guide/issue-triage/

/remove-triage accepted

@k8s-ci-robot k8s-ci-robot removed the triage/accepted Indicates an issue or PR is ready to be actively worked on. label Feb 7, 2023
@k8s-ci-robot
Contributor

This issue is currently awaiting triage.

If Ingress contributors determine this is a relevant issue, they will accept it by applying the triage/accepted label and provide further guidance.

The triage/accepted label can be added by org members by writing /triage accepted in a comment.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot k8s-ci-robot added the needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. label Feb 7, 2023
@rikatz
Contributor

rikatz commented Oct 11, 2023

/close

@k8s-ci-robot
Contributor

@rikatz: Closing this issue.

In response to this:

/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@jozenstar

@rikatz So, how did this story end?
