
bug: APISIX3.5 has high memory usage problems #10349

Closed
Nobilta opened this issue Oct 17, 2023 · 14 comments
@Nobilta

Nobilta commented Oct 17, 2023

Current Behavior

Since upgrading to APISIX 3.5, memory usage has been growing continuously, as shown in the screenshot below.
(screenshot: iShot_2023-10-17_11 48 09)
Using the slabtop command, you can see that the radix_tree_node slab occupies a large amount of space.
(screenshot: slabtop output)
There is no such problem on APISIX 2.9 or APISIX 3.3.
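
For reference, the inspection described above can be reproduced with standard tools (slabtop comes from procps; reading /proc/slabinfo needs root, so sudo is assumed):

# one-shot snapshot of the slab caches, sorted by cache size
sudo slabtop -o -s c | head -n 20
# raw counters for the radix_tree_node cache only
sudo grep radix_tree_node /proc/slabinfo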

Expected Behavior

No response

Error Logs

No response

Steps to Reproduce

Freshly install and run APISIX 3.5, then use etcd as the data source.

Environment

  • APISIX version (run apisix version): 3.5
  • Operating system (run uname -a):
  • OpenResty / Nginx version (run openresty -V or nginx -V): openresty/1.21.4.1
  • etcd version, if relevant (run curl http://127.0.0.1:9090/v1/server_info):
  • APISIX Dashboard version, if relevant:
  • Plugin runner version, for issues related to plugin runners:
  • LuaRocks version, for installation issues (run luarocks --version):
@Sn0rt
Contributor

Sn0rt commented Oct 17, 2023

Thanks for the report. Please assign it to me.

reproduction steps

Successfully reproduced

design

user@ubuntu -- http request --> APISIX(3.5) on centos -- http request --> upstream.

install APISIX 3.5.0

Run the following commands on CentOS:

sudo yum install -y https://repos.apiseven.com/packages/centos/apache-apisix-repo-1.0-1.noarch.rpm
sudo yum-config-manager --add-repo https://repos.apiseven.com/packages/centos/apache-apisix.repo
sudo yum install apisix-base-1.21.4.1.8
sudo yum install apisix-3.5.0

install route

[root@localhost ~]# cat httpbin-anything.json
{
  "methods": [
    "GET"
  ],
  "uri": "/anything",
  "upstream": {
    "type": "roundrobin",
    "nodes": {
      "192.168.31.191:38080": 1
    }
  }
}


[root@localhost ~]# curl http://127.0.0.1:9180/apisix/admin/routes/1 -H  'X-API-KEY: edd1c9f034335f136f87ad84b625c8f1' -X PUT -d @/root/httpbin-anything.json |jq
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100   419    0   276  100   143   4307   2231 --:--:-- --:--:-- --:--:--  4451
{
  "key": "/apisix/routes/1",
  "value": {
    "methods": [
      "GET"
    ],
    "update_time": 1697698947,
    "uri": "/anything",
    "upstream": {
      "type": "roundrobin",
      "nodes": {
        "192.168.31.191:38080": 1
      },
      "hash_on": "vars",
      "scheme": "http",
      "pass_host": "pass"
    },
    "create_time": 1697685866,
    "priority": 0,
    "status": 1,
    "id": "1"
  }
}

install benchmark tools

Install https://github.com/giltene/wrk2 as the benchmark tool.

send requests to /anythinga (a route that does not exist)

./wrk -t2 -c100 -d10000s -R200000 --latency http://192.168.122.145:9080/anythinga

get the slab info from slabtop

11886  11886 100%    0.57K    849	 14	 6792K radix_tree_node
13692  13692 100%    0.57K    978	 14	 7824K radix_tree_node

and then press Ctrl+C:

ubuntu@ubuntu-server:~/wrk2$ ./wrk -t2 -c100 -d10000s -R200000 --latency http://192.168.122.145:9080/anythinga
Running 167m test @ http://192.168.122.145:9080/anythinga
....
  166199.295     1.000000      5290366          inf
#[Mean    =    77988.014, StdDeviation   =    43686.962]
#[Max     =   166068.224, Total count    =      5290366]
#[Buckets =           27, SubBuckets     =         2048]
----------------------------------------------------------
  5634121 requests in 3.17m, 1.18GB read
  Non-2xx or 3xx responses: 5634121
Requests/sec:  29613.49
Transfer/sec:      6.35MB

Inspect the slab info again; the slab objects are not reclaimed:

13692  13692 100%    0.57K    978	 14	 7824K radix_tree_node
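
A quick check (not part of the original report; run as root) of whether these radix_tree_node objects are merely cached page-cache metadata rather than leaked memory is to drop the kernel caches and see if the count shrinks:

grep radix_tree_node /proc/slabinfo
sync
echo 3 > /proc/sys/vm/drop_caches    # drop page cache, dentries and inodes
grep radix_tree_node /proc/slabinfo  # the count should fall if the objects were reclaimable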

switch to the correct route

ubuntu@ubuntu-server:~/wrk2$ ./wrk -t2 -c100 -d10000s -R200000 --latency http://192.168.122.145:9080/anything
Running 167m test @ http://192.168.122.145:9080/anything
  2 threads and 100 connections

inspect:

13720  13720 100%    0.57K    980	 14	 7840K radix_tree_node

Stop the benchmark against the correct URL; the OBJS count still does not decrease:

13748  13748 100%    0.57K    982	 14	 7856K radix_tree_node

The memory did not decrease quickly, so switch to the wrong URL for stress testing.

ubuntu@ubuntu-server:~/wrk2$ ./wrk -t2 -c100 -d10000s -R200000 --latency http://192.168.122.145:9080/anythinga
Running 167m test @ http://192.168.122.145:9080/anythinga
  2 threads and 100 connections
  Thread calibration: mean lat.: 4034.417ms, rate sampling interval: 14827ms
  Thread calibration: mean lat.: 4074.969ms, rate sampling interval: 14852ms

Memory increases very quickly

14574  14574 100%    0.57K   1041	 14	 8328K radix_tree_node

about 10 minutes later

16870  16870 100%    0.57K   1205	 14	 9640K radix_tree_node
16884  16884 100%    0.57K   1206	 14	 9648K radix_tree_node

@Sn0rt
Contributor

Sn0rt commented Oct 20, 2023

Stage information:

  1. From the slab usage observed in Linux alone, it cannot be concluded that APISIX itself has a memory leak.
  2. APISIX's radixtree-based route matching runs entirely in user space and has nothing to do with the kernel data structure radix_tree_node.
  3. APISIX is built on NGINX. Because NGINX does its own memory management, from the operating system's perspective the nginx memory has not been released (a quick way to compare kernel slab usage with worker memory is sketched below).
  4. User feedback: the reason APISIX 3.3 looks fine is that the operating system kernel is different; 3.3 runs on the tlinux kernel.
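
A minimal sketch of that comparison (standard Linux tooling, root assumed; the nginx process name matches the OpenResty workers that APISIX starts):

# kernel side: slab objects such as radix_tree_node are owned by the kernel, not by APISIX
sudo grep radix_tree_node /proc/slabinfo
# user side: resident memory of the nginx/APISIX worker processes
ps -C nginx -o pid,rss,vsz,cmd | grep 'worker process'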

further processing

  1. Wait for the user to switch to tlinux and retest.

@shreemaan-abhishek
Contributor

@Sn0rt thanks for the detailed information.

@shreemaan-abhishek shreemaan-abhishek added the wait for update wait for the author's response in this issue/PR label Oct 23, 2023
@Nobilta
Author

Nobilta commented Oct 24, 2023

After upgrading the kernel to 5.4.119-19.0009.32, the phenomenon is the same as before: the OpenResty worker memory keeps growing after receiving traffic. In the same environment as APISIX 3.3, with the same plugin configuration (different data plane), version 3.5 has a memory leak problem while version 3.3 works normally.
(screenshots attached)

@github-actions github-actions bot added user responded and removed wait for update wait for the author's response in this issue/PR labels Oct 24, 2023
@Nobilta
Author

Nobilta commented Oct 24, 2023

According to the most recent memory dump, after string processing the largest share of address space is occupied by the upstream node information related to the traffic-split plugin. The upstream node information that appears in the dump file is all from part 1 in the figure, while node 2 is almost absent from the dump file, so we suspect this is related to the changes made to the traffic-split plugin in version 3.5.
(screenshot attached)

@Sn0rt
Contributor

Sn0rt commented Oct 31, 2023

Combining your information, I strongly suspect the problem was introduced in 3.4; based on the changelog, my guess is that it is caused by the etcd-related changes. Can you help me test it? (I tried to reproduce it on CentOS but failed.)

Use the 3.3 etcd file to replace the 3.5 one; see wget https://raw.githubusercontent.com/apache/apisix/release/3.3/apisix/core/config_etcd.lua
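
A rough sketch of that swap (the /usr/local/apisix install path is the default for the RPM packages and is an assumption here; adjust it to your layout):

wget https://raw.githubusercontent.com/apache/apisix/release/3.3/apisix/core/config_etcd.lua
# back up the 3.5 file, drop in the 3.3 version, then reload APISIX
cp /usr/local/apisix/apisix/core/config_etcd.lua /usr/local/apisix/apisix/core/config_etcd.lua.bak
cp config_etcd.lua /usr/local/apisix/apisix/core/config_etcd.lua
apisix reload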

@TakiJoe

TakiJoe commented Oct 31, 2023

OK, I'll try to reproduce & test it

@deluxor

deluxor commented Nov 3, 2023

We also spotted the same issue after the recent update to version 3.5.
(screenshot: Screenshot 2023-11-03 at 17 34 09)

We only changed the APISIX version; the kernel version and OS version are the same.

Memory kept rising until OOM.
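
For anyone hitting the same OOM, the kernel-log side can be confirmed with standard commands (not specific to APISIX):

dmesg -T | grep -iE 'out of memory|oom-killer|killed process'
# or, on systemd hosts
journalctl -k | grep -i oom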

@monkeyDluffy6017
Contributor

@deluxor we are tracking this problem, do you have any more clues?

@TakiJoe

TakiJoe commented Nov 20, 2023

Here are the captured flame graphs and memory usage:
(screenshots: iShot_2023-11-20_11 32 43, iShot_2023-11-20_11 13 29, iShot_2023-11-20_11 12 46, iShot_2023-11-20_11 15 38)
(flame graphs: apisix-leaks-202311201050, apisix-leaks-202311200930)

@monkeyDluffy6017
Contributor

monkeyDluffy6017 commented Dec 8, 2023

We have located this problem: #10614 and #10671. It affects all versions between 3.4.0 and 3.7.0.

@monkeyDluffy6017
Contributor

monkeyDluffy6017 commented Dec 12, 2023

I will close this issue since it has been fixed

@github-project-automation github-project-automation bot moved this from 🏗 In progress to ✅ Done in Apache APISIX backlog Dec 12, 2023
@TakiJoe

TakiJoe commented Dec 27, 2023

I did a grayscale (canary) test in the production environment and it looks fine so far (APISIX 3.5 patched with these two fixes: #10614 and #10671).

@zsmlinux

Excuse me, could you tell me the steps to generate a memory flame graph?
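
For reference (no one confirms this in the thread): flame graphs like the ones above are typically produced with the SystemTap-based OpenResty tools, openresty/stapxx (lj-lua-stacks.sxx for Lua-land stacks) or openresty/openresty-systemtap-toolkit (sample-bt-leaks for leak flame graphs), combined with brendangregg/FlameGraph. A rough sketch, with flags that should be double-checked against each tool's README:

git clone https://github.com/openresty/stapxx.git
git clone https://github.com/brendangregg/FlameGraph.git
cd stapxx && export PATH=$PWD:$PATH
# sample the Lua-land stacks of one nginx worker for ~30 seconds (<worker-pid> is a placeholder)
./samples/lj-lua-stacks.sxx -x <worker-pid> --arg time=30 > lua.bt
../FlameGraph/stackcollapse-stap.pl lua.bt > lua.cbt
../FlameGraph/flamegraph.pl lua.cbt > lua.svg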
