
Potential memory leak in OpenSSL #7647

Closed
Lyt99 opened this issue Sep 16, 2021 · 18 comments
Labels
kind/bug Categorizes issue or PR as related to a bug. lifecycle/frozen Indicates that an issue or PR should not be auto-closed due to staleness. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. priority/critical-urgent Highest priority. Must be actively worked on as someone's top priority right now. triage/needs-information Indicates an issue needs more information in order to work on it.

Comments

@Lyt99
Contributor

Lyt99 commented Sep 16, 2021

NGINX Ingress controller version (exec into the pod and run nginx-ingress-controller --version.): 0.44.0 & 0.49.0
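For reference, the version check above can be run from outside the pod like this (a hedged example; the pod name is a placeholder, not one from this report):

kubectl -n ingress-nginx exec -it <controller-pod> -- /nginx-ingress-controller --version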

Kubernetes version (use kubectl version): 1.18.8

Environment:

  • Cloud provider or hardware configuration: AlibabaCloud
  • OS (e.g. from /etc/os-release): Alibaba Cloud Linux (Aliyun Linux)
  • Kernel (e.g. uname -a): 4.19.91-23.al7.x86_64
  • Install tools:
    • From AlibabaCloud console

What happened:

We've encountered memory issues in both 0.44.0 and 0.49.0.
Some of the ingress pods show high memory usage, while others stay at a normal level.

[screenshot]

We ran some diagnostics on the pod, and they show that one of the nginx workers had accumulated a large amount of memory.

[screenshot]

The incoming traffic is balanced at about 100 requests per second, and the connection counts across pods are of the same order of magnitude (10k+ to 100k+).

We then used pmap -x <pid> to get details of the memory. There were lots of tiny anonymous blocks in the memory map.

[screenshot]
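A sketch of how that inspection can be reproduced (these are not the exact commands from this report; the pod name is a placeholder):

# exec into the affected controller pod
kubectl -n ingress-nginx exec -it <controller-pod> -- sh

# inside the pod: find the worker with the largest resident set, then dump its memory map
ps aux | grep 'nginx: worker'
pmap -x <worker-pid>    # look for large numbers of small anonymous ("anon") mappings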

We made a core dump and took a look at this memory area; most of its content seems to be related to TLS certificates. We also ran memleak on the process, with the results below:

[16:18:49] Top 10 stacks with outstanding allocations:
	300580 bytes in 15029 allocations from stack
		CRYPTO_strdup+0x30 [libcrypto.so.1.1]
		[unknown]
	462706 bytes in 375 allocations from stack
		[unknown] [libcrypto.so.1.1]
	507864 bytes in 9069 allocations from stack
		CRYPTO_zalloc+0xa [libcrypto.so.1.1]
		[unknown]
	536576 bytes in 131 allocations from stack
		[unknown] [libcrypto.so.1.1]
	848638 bytes in 333 allocations from stack
		ngx_alloc+0xf [nginx]
		[unknown]
	2100720 bytes in 22253 allocations from stack
		CRYPTO_zalloc+0xa [libcrypto.so.1.1]
	3074792 bytes in 888 allocations from stack
		BUF_MEM_grow+0x81 [libcrypto.so.1.1]
	3496960 bytes in 4398 allocations from stack
		posix_memalign+0x1a [ld-musl-x86_64.so.1]
	5821440 bytes in 9096 allocations from stack
		[unknown] [libssl.so.1.1]
		[unknown]
	9060080 bytes in 22605 allocations from stack
		CRYPTO_zalloc+0xa [libcrypto.so.1.1]
[16:18:58] Top 10 stacks with outstanding allocations:
	287280 bytes in 14364 allocations from stack
		CRYPTO_strdup+0x30 [libcrypto.so.1.1]
		[unknown]
	393216 bytes in 96 allocations from stack
		[unknown] [libcrypto.so.1.1]
	396428 bytes in 322 allocations from stack
		[unknown] [libcrypto.so.1.1]
	486080 bytes in 8680 allocations from stack
		CRYPTO_zalloc+0xa [libcrypto.so.1.1]
		[unknown]
	724916 bytes in 286 allocations from stack
		ngx_alloc+0xf [nginx]
		[unknown]
	1949832 bytes in 20300 allocations from stack
		CRYPTO_zalloc+0xa [libcrypto.so.1.1]
	2032380 bytes in 727 allocations from stack
		BUF_MEM_grow+0x81 [libcrypto.so.1.1]
	3760256 bytes in 5049 allocations from stack
		posix_memalign+0x1a [ld-musl-x86_64.so.1]
	5575680 bytes in 8712 allocations from stack
		[unknown] [libssl.so.1.1]
		[unknown]
	8525968 bytes in 20572 allocations from stack
		CRYPTO_zalloc+0xa [libcrypto.so.1.1]
[16:19:06] Top 10 stacks with outstanding allocations:
	716420 bytes in 35821 allocations from stack
		CRYPTO_strdup+0x30 [libcrypto.so.1.1]
		[unknown]
	782336 bytes in 191 allocations from stack
		[unknown] [libcrypto.so.1.1]
	885218 bytes in 721 allocations from stack
		[unknown] [libcrypto.so.1.1]
	1233680 bytes in 22030 allocations from stack
		CRYPTO_zalloc+0xa [libcrypto.so.1.1]
		[unknown]
	1761982 bytes in 775 allocations from stack
		ngx_alloc+0xf [nginx]
		[unknown]
	3814396 bytes in 1525 allocations from stack
		BUF_MEM_grow+0x81 [libcrypto.so.1.1]
	4298576 bytes in 48880 allocations from stack
		CRYPTO_zalloc+0xa [libcrypto.so.1.1]
	11922816 bytes in 15455 allocations from stack
		posix_memalign+0x1a [ld-musl-x86_64.so.1]
	14005760 bytes in 21884 allocations from stack
		[unknown] [libssl.so.1.1]
		[unknown]
	21036912 bytes in 49333 allocations from stack
		CRYPTO_zalloc+0xa [libcrypto.so.1.1]

Here are more samples: m.log
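The exact invocations are not shown above; a hedged sketch of how such data can be collected, assuming bcc-tools on the node that hosts the pod and <worker-pid> being the leaking worker as seen in the host PID namespace:

gcore <worker-pid>                                        # writes core.<pid> for offline inspection
strings core.<worker-pid> | less                          # the TLS-certificate-related content was spotted this way
/usr/share/bcc/tools/memleak -p <worker-pid> --top 10 10  # top 10 outstanding stacks every 10s; path varies by distro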

Finally, we moved the certificate to the cloud provider's load balancer, and it's working fine now, but we still have no clue why this happens.

The leak happens in nginx on TLS connections. We tried rebuilding the image to upgrade the libraries to the newest versions (for openssl, 1.1.1l-r0), but that didn't help.
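For reference, a rebuild of that kind can look roughly like this (a hedged sketch, not the actual Dockerfile from this report; the base tag and package names are illustrative for the Alpine-based controller image):

FROM k8s.gcr.io/ingress-nginx/controller:v0.49.0
USER root
# pull the latest OpenSSL 1.1 packages available in the Alpine repositories
RUN apk add --no-cache --upgrade openssl libssl1.1 libcrypto1.1
USER www-data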

What you expected to happen:

No memory leak with TLS.

How to reproduce it:

I have no idea what makes the issue happen, and I can't reproduce it on another cluster.

Anything else we need to know:

So far, we haven't seen this issue with 0.30.0 (openssl 1.1.1d-r3); I don't know whether it's a problem in newer OpenSSL.

/kind bug

@Lyt99 Lyt99 added the kind/bug Categorizes issue or PR as related to a bug. label Sep 16, 2021
@k8s-ci-robot k8s-ci-robot added needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. needs-priority labels Sep 16, 2021
@longwuyuan
Contributor

/remove-kind bug
Hi, let us wait until we get some helpful information that hints at a bug.
Also, please provide the information asked in the issue template.

We have been making changes for performance, and very soon we will release a build with changed controller components. But if you test the current latest release and update the issue as per the template, it will help give a better perspective.

/triage needs-information

@k8s-ci-robot k8s-ci-robot added triage/needs-information Indicates an issue needs more information in order to work on it. needs-kind Indicates a PR lacks a `kind/foo` label and requires one. and removed kind/bug Categorizes issue or PR as related to a bug. labels Sep 16, 2021
@lvauvillier

Hi, I have the same issue:

[Screenshot 2021-09-25 at 19:50:09]

nginx -s reload temporarily solves the issue.
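A hedged example of applying that workaround from outside the pod (the pod name is a placeholder):

kubectl -n ingress-nginx exec <controller-pod> -- nginx -s reload   # replaces the workers, so their accumulated memory is freed until it builds up again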

Here is my info:

NGINX Ingress controller version (exec into the pod and run nginx-ingress-controller --version.):


NGINX Ingress controller
Release: v0.47.0
Build: 7201e37
Repository: https://github.com/kubernetes/ingress-nginx
nginx version: nginx/1.20.1


Kubernetes version (use kubectl version):

Client Version: version.Info{Major:"1", Minor:"21", GitVersion:"v1.21.2", GitCommit:"092fbfbf53427de67cac1e9fa54aaa09a28371d7", GitTreeState:"clean", BuildDate:"2021-06-16T12:59:11Z", GoVersion:"go1.16.5", Compiler:"gc", Platform:"darwin/amd64"}
Server Version: version.Info{Major:"1", Minor:"20+", GitVersion:"v1.20.9-gke.1001", GitCommit:"1fe18c314ed577f6047d2712a9d1c8e498e22381", GitTreeState:"clean", BuildDate:"2021-08-23T23:06:28Z", GoVersion:"go1.15.13b5", Compiler:"gc", Platform:"linux/amd64"}

Environment:

  • Cloud provider or hardware configuration:
    GCP
  • Kernel (e.g. uname -a):
Linux ingress-nginx-controller-788c5f7f88-d94pj 5.4.120+ #1 SMP Tue Jun 22 14:53:20 PDT 2021 x86_64 Linux

Helm:
helm -n ingress-nginx get values ingress-nginx
USER-SUPPLIED VALUES:

controller:
  affinity:
    podAntiAffinity:
      preferredDuringSchedulingIgnoredDuringExecution:
      - podAffinityTerm:
          labelSelector:
            matchExpressions:
            - key: app.kubernetes.io/name
              operator: In
              values:
              - nginx-ingress
          topologyKey: kubernetes.io/hostname
        weight: 100
  config:
    use-gzip: true
  metrics:
    enabled: true
    serviceMonitor:
      additionalLabels:
        release: kube-prometheus-stack
      enabled: true
      namespace: monitoring
  replicaCount: 2
  resources:
    requests:
      memory: 800Mi
  service:
    externalTrafficPolicy: Local

kubectl describe po -n ingress-nginx ingress-nginx-controller-788c5f7f88-d94pj

Name:         ingress-nginx-controller-788c5f7f88-d94pj
Namespace:    ingress-nginx
Priority:     0
Node:         gke-production-pool-1-66bb3111-sldn/10.132.0.4
Start Time:   Sat, 18 Sep 2021 17:17:13 +0200
Labels:       app.kubernetes.io/component=controller
              app.kubernetes.io/instance=ingress-nginx
              app.kubernetes.io/name=ingress-nginx
              pod-template-hash=788c5f7f88
Annotations:  kubectl.kubernetes.io/restartedAt: 2021-09-18T17:17:13+02:00
Status:       Running
IP:           10.52.3.39
IPs:
  IP:           10.52.3.39
Controlled By:  ReplicaSet/ingress-nginx-controller-788c5f7f88
Containers:
  controller:
    Container ID:  containerd://74fb58bce33d84fb54fb61a3a16772d6edf8858cc14a05c21d0feb79a90e8157
    Image:         k8s.gcr.io/ingress-nginx/controller:v0.47.0@sha256:a1e4efc107be0bb78f32eaec37bef17d7a0c81bec8066cdf2572508d21351d0b
    Image ID:      k8s.gcr.io/ingress-nginx/controller@sha256:a1e4efc107be0bb78f32eaec37bef17d7a0c81bec8066cdf2572508d21351d0b
    Ports:         80/TCP, 443/TCP, 10254/TCP, 8443/TCP
    Host Ports:    0/TCP, 0/TCP, 0/TCP, 0/TCP
    Args:
      /nginx-ingress-controller
      --publish-service=$(POD_NAMESPACE)/ingress-nginx-controller
      --election-id=ingress-controller-leader
      --ingress-class=nginx
      --configmap=$(POD_NAMESPACE)/ingress-nginx-controller
      --validating-webhook=:8443
      --validating-webhook-certificate=/usr/local/certificates/cert
      --validating-webhook-key=/usr/local/certificates/key
    State:          Running
      Started:      Sat, 18 Sep 2021 17:17:14 +0200
    Ready:          True
    Restart Count:  0
    Requests:
      cpu:      100m
      memory:   800Mi
    Liveness:   http-get http://:10254/healthz delay=10s timeout=1s period=10s #success=1 #failure=5
    Readiness:  http-get http://:10254/healthz delay=10s timeout=1s period=10s #success=1 #failure=3
    Environment:
      POD_NAME:       ingress-nginx-controller-788c5f7f88-d94pj (v1:metadata.name)
      POD_NAMESPACE:  ingress-nginx (v1:metadata.namespace)
      LD_PRELOAD:     /usr/local/lib/libmimalloc.so
    Mounts:
      /usr/local/certificates/ from webhook-cert (ro)
      /var/run/secrets/kubernetes.io/serviceaccount from ingress-nginx-token-cn2nx (ro)
Conditions:
  Type              Status
  Initialized       True 
  Ready             True 
  ContainersReady   True 
  PodScheduled      True 
Volumes:
  webhook-cert:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  ingress-nginx-admission
    Optional:    false
  ingress-nginx-token-cn2nx:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  ingress-nginx-token-cn2nx
    Optional:    false
QoS Class:       Burstable
Node-Selectors:  kubernetes.io/os=linux
Tolerations:     node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
                 node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:          <none>

kubectl describe svc -n ingress-nginx ingress-nginx-controller

Name:                     ingress-nginx-controller
Namespace:                ingress-nginx
Labels:                   app.kubernetes.io/component=controller
                          app.kubernetes.io/instance=ingress-nginx
                          app.kubernetes.io/managed-by=Helm
                          app.kubernetes.io/name=ingress-nginx
                          app.kubernetes.io/version=0.47.0
                          helm.sh/chart=ingress-nginx-3.34.0
Annotations:              cloud.google.com/neg: {"ingress":true}
                          meta.helm.sh/release-name: ingress-nginx
                          meta.helm.sh/release-namespace: ingress-nginx
Selector:                 app.kubernetes.io/component=controller,app.kubernetes.io/instance=ingress-nginx,app.kubernetes.io/name=ingress-nginx
Type:                     LoadBalancer
IP Families:              <none>
IP:                       10.56.2.89
IPs:                      10.56.2.89
LoadBalancer Ingress:     xxx.xxx.xxx.xxx
Port:                     http  80/TCP
TargetPort:               http/TCP
NodePort:                 http  31463/TCP
Endpoints:                10.52.3.39:80,10.52.4.31:80
Port:                     https  443/TCP
TargetPort:               https/TCP
NodePort:                 https  30186/TCP
Endpoints:                10.52.3.39:443,10.52.4.31:443
Session Affinity:         None
External Traffic Policy:  Local
HealthCheck NodePort:     30802
Events:                   <none>

@rikatz
Contributor

rikatz commented Sep 30, 2021

/priority critical-urgent
I will look at this together with another possible "leak" that is happening.

I have received the suggestion to test building the image with BoringSSL instead of OpenSSL (for FIPS compliance, etc.); maybe we can try that as well.

@k8s-ci-robot k8s-ci-robot added priority/critical-urgent Highest priority. Must be actively worked on as someone's top priority right now. and removed needs-priority labels Sep 30, 2021
@lvauvillier

I have the same memory leak issue with the latest version:

bash-5.1$ /nginx-ingress-controller --version
-------------------------------------------------------------------------------
NGINX Ingress controller
  Release:       v1.0.2
  Build:         2b8ed4511af75a7c41e52726b0644d600fc7961b
  Repository:    https://github.com/kubernetes/ingress-nginx
  nginx version: nginx/1.19.9

-------------------------------------------------------------------------------

[Screenshot 2021-09-30 at 20:42:18]

[Screenshot 2021-09-30 at 20:44:31]

@rikatz
Contributor

rikatz commented Oct 1, 2021

Folks,

In case I generate an image of 0.49.3 (to be released) with the OpenResty OpenSSL patch applied, would you be able to test it and provide some feedback?
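If such a test image gets published, a hedged sketch of how it could be swapped in via the Helm chart (the registry and tag below are placeholders, not a real published image):

helm upgrade ingress-nginx ingress-nginx/ingress-nginx -n ingress-nginx \
  --reuse-values \
  --set controller.image.registry=<test-registry> \
  --set controller.image.image=ingress-nginx/controller \
  --set controller.image.tag=<candidate-tag> \
  --set controller.image.digest=   # clear the pinned digest so the test tag is actually pulled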

@strongjz
Member

/kind bug
/triage accepted

@k8s-ci-robot k8s-ci-robot added kind/bug Categorizes issue or PR as related to a bug. triage/accepted Indicates an issue or PR is ready to be actively worked on. and removed needs-kind Indicates a PR lacks a `kind/foo` label and requires one. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Oct 11, 2021
@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle stale
  • Mark this issue or PR as rotten with /lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jan 9, 2022
@k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

@k8s-ci-robot k8s-ci-robot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Feb 8, 2022
@k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Reopen this issue or PR with /reopen
  • Mark this issue or PR as fresh with /remove-lifecycle rotten
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close

@k8s-ci-robot
Contributor

@k8s-triage-robot: Closing this issue.

In response to this:

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Reopen this issue or PR with /reopen
  • Mark this issue or PR as fresh with /remove-lifecycle rotten
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@rosscdh

rosscdh commented Aug 31, 2022

+1 still happening

@strongjz
Member

/reopen
/lifecycle frozen

@k8s-ci-robot k8s-ci-robot reopened this Aug 31, 2022
@k8s-ci-robot
Contributor

@strongjz: Reopened this issue.

In response to this:

/reopen
/lifecycle frozen

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot k8s-ci-robot added lifecycle/frozen Indicates that an issue or PR should not be auto-closed due to staleness. and removed lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. labels Aug 31, 2022
@k8s-triage-robot

This issue is labeled with priority/critical-urgent but has not been updated in over 30 days, and should be re-triaged.
Critical-urgent issues must be actively worked on as someone's top priority right now.

You can:

  • Confirm that this issue is still relevant with /triage accepted (org members only)
  • Deprioritize it with /priority {important-soon, important-longterm, backlog}
  • Close this issue with /close

For more details on the triage process, see https://www.kubernetes.dev/docs/guide/issue-triage/

/remove-triage accepted

@k8s-ci-robot k8s-ci-robot removed the triage/accepted Indicates an issue or PR is ready to be actively worked on. label Feb 7, 2023
@k8s-ci-robot
Contributor

This issue is currently awaiting triage.

If Ingress contributors determine this is a relevant issue, they will accept it by applying the triage/accepted label and provide further guidance.

The triage/accepted label can be added by org members by writing /triage accepted in a comment.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot k8s-ci-robot added the needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. label Feb 7, 2023
@rikatz
Contributor

rikatz commented Oct 11, 2023

/close

@k8s-ci-robot
Contributor

@rikatz: Closing this issue.

In response to this:

/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@jozenstar

@rikatz So, how did this story end?
