
help request: 3.2.1 memory leak #10618

Closed
wklken opened this issue Dec 8, 2023 · 24 comments


wklken commented Dec 8, 2023

Description

After being deployed online for about 2 weeks, we rescheduled the pods and got the chart below: memory grew from 3.7 GB to 6 GB.

We have no ext-plugins.

About 45,000 routes.

image


I suspect it is caused by the prometheus plugin, but once all routes have been hit, shouldn't the set of Prometheus keys be stable?

Is there any tool to analyze this? We don't have XRay.

Environment

  • APISIX version (run apisix version): 3.2.1
  • Operating system (run uname -a):
  • OpenResty / Nginx version (run openresty -V or nginx -V): openresty/1.21.4.1
  • etcd version, if relevant (run curl http://127.0.0.1:9090/v1/server_info):
  • APISIX Dashboard version, if relevant:
  • Plugin runner version, for issues related to plugin runners:
  • LuaRocks version, for installation issues (run luarocks --version):

wklken commented Dec 8, 2023

The /metrics response data after being online for about two weeks:

  • size: 7.5M
  • lines: 49323
  • Data distribution
16098 api_request_duration_milliseconds_bucket
3674 api_request_duration_milliseconds_count
3674 api_request_duration_milliseconds_sum
4990 api_requests_total
7948 bapp_requests_total
12846 bandwidth
11 etcd_modify_indexes
1 etcd_reachable 1
1 http_requests_total 460517663
6 nginx_http_current_connections
1 nginx_metric_errors_total 0
1 node_info
24 shared_dict_capacity_bytes
24 shared_dict_free_space_bytes
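
For reference, a per-metric line count like the one above can be produced straight from the exporter output. A minimal sketch, assuming the default APISIX Prometheus export address (127.0.0.1:9091, uri /apisix/prometheus/metrics); adjust the URL to your deployment:

# Count exported lines per metric name (HELP/TYPE comment lines are skipped).
curl -s http://127.0.0.1:9091/apisix/prometheus/metrics \
  | grep -v '^#' \
  | awk -F'{' '{print $1}' \
  | sort | uniq -c | sort -rn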


boekkooi-lengoo commented Dec 8, 2023

Hey @wklken

I have noticed a similar issue and was able to eliminate it by forcing the consumer label to always be empty.

I use the following in my Dockerfile to patch the issue.

# Patch https://github.com/apache/apisix/blob/3.7.0/apisix/plugins/prometheus/exporter.lua#L228 to avoid metrics per consumer.
RUN sed -i \
    -e 's/ctx.consumer_name or ""/""/g' \
    /usr/local/apisix/apisix/plugins/prometheus/exporter.lua

Hope this helps.
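
A quick way to confirm the patch actually landed in the image (the path is the one used in the RUN command above; if the sed matched every occurrence, the grep finds nothing and "patched" is printed):

# Verify the substitution was applied.
grep -n 'ctx.consumer_name' /usr/local/apisix/apisix/plugins/prometheus/exporter.lua || echo "patched"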


wklken commented Dec 8, 2023

Thanks @boekkooi-lengoo

We have patched some settings to disable the official metrics, which caused 100% CPU when too many records were present.

https://github.com/TencentBlueKing/blueking-apigateway-apisix/blob/master/src/build/patches/001_change_prometheus_default_buckets.patch

Currently only the bandwidth metric is left.


I’m not certain whether the increasing memory usage is caused by the Prometheus plugin or not, nor do I understand why it is consuming so much memory.

@moonming moonming moved this to 🏗 In progress in Apache APISIX backlog Dec 10, 2023
@monkeyDluffy6017

@wklken have you solved your problem?


wklken commented Dec 28, 2023

image

Not yet; we are waiting for the line (memory usage) to stabilize (about 4 weeks). If it does not show an increase, then perhaps the Prometheus plugin is the cause. Otherwise, we will need to investigate other plugins.

Any advice or tools for measuring the memory usage of each part of APISIX?

@monkeyDluffy6017

@monkeyDluffy6017

Please check whether the memory leak happens in Lua or in C. You can create a diagnostic route like this:

curl http://127.0.0.1:9180/apisix/admin/routes/test \
  -H 'X-API-KEY: edd1c9f034335f136f87ad84b625c8f1' -X PUT -d '
{
    "uri": "/lua_memory_stat",
    "plugins": {
        "serverless-pre-function": {
            "phase": "rewrite",
            "functions" : ["return function() local mem = collectgarbage(\"count\") ngx.say(\"the memory allocated by lua is \", mem, \" kb\"); end"]
        }
    }
}'
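
Once the route is created it can be queried through the data plane. A minimal usage sketch, assuming the default proxy port 9080; note that collectgarbage("count") only reports the Lua VM of the single worker that serves the request, so it is worth calling it a few times:

# Repeat a few times: each request may land on a different worker process.
curl http://127.0.0.1:9080/lua_memory_stat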


wklken commented Dec 28, 2023

@monkeyDluffy6017

  • the memory usage of the pod is 4.12 GB
  • the /lua_memory_stat result from the APISIX in that pod: the memory allocated by lua is 291922.04101562 kb (about 285 MB)
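
One thing to keep in mind when comparing these two numbers: collectgarbage("count") only covers the LuaJIT VM of the one worker that served the request, while the pod RSS also includes the other workers, the nginx shared memory zones (lua_shared_dict), and allocations made outside the Lua VM. A companion route for the shared dicts can be created the same way as /lua_memory_stat above; a minimal sketch, where the route id and the /shm_stat path are illustrative assumptions:

curl http://127.0.0.1:9180/apisix/admin/routes/shm_stat \
  -H 'X-API-KEY: edd1c9f034335f136f87ad84b625c8f1' -X PUT -d '
{
    "uri": "/shm_stat",
    "plugins": {
        "serverless-pre-function": {
            "phase": "rewrite",
            "functions" : ["return function() for name, dict in pairs(ngx.shared) do ngx.say(name, \": capacity \", dict:capacity(), \" bytes, free \", dict:free_space(), \" bytes\") end end"]
        }
    }
}'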

@monkeyDluffy6017 monkeyDluffy6017 self-assigned this Dec 29, 2023
@Vacant2333

You can assign the issue to me; I will follow it.


Vacant2333 commented Jan 2, 2024

Do you think it's possible that this is the issue? Would you try this method to resolve it? @wklken


theweakgod commented Jan 15, 2024

@wklken I think this will help you: #9545 (nginx-lua-prometheus memory leak fix). You can solve this problem by upgrading the version.
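
If you go the upgrade route, a rough sketch for checking and bumping the bundled library (the target version 0.20230607 is an assumption: it is the nginx-lua-prometheus release that includes the lookup-table limit; confirm the exact version pinned by #9545 before applying):

# Show the nginx-lua-prometheus rock currently installed alongside APISIX.
luarocks list nginx-lua-prometheus
# Install the release that contains the "limit lookup table size" fix
# (version is an assumption; verify against the PR / upstream changelog).
luarocks install nginx-lua-prometheus 0.20230607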


wklken commented Jan 15, 2024

Do you think it's possible that this is the issue? Would you try this method to resolve it? @wklken

@Vacant2333 we don't use service discovery in production.


wklken commented Jan 15, 2024

@wklken I think this will help you: https://github.com/apache/apisix/pull/9545 (nginx-lua-prometheus memory leak fix: https://github.com/knyar/nginx-lua-prometheus/pull/151). You can solve this problem by upgrading the version.

Thanks @theweakgod, I will check that. (APISIX 3.2.1 uses nginx-lua-prometheus = 0.20220527.)

@Vacant2333

It does seem to be one of the reasons. Is it possible to test this possibility (upgrade nginx-lua-prometheus) and see if memory continues to grow? How long will this take?

@theweakgod


Has the problem been solved?


wenj91 commented Jan 22, 2024

@wklken Is there any progress on this issue?


wklken commented Jan 22, 2024

@wklken Is there any progress on this issue?

We cannot verify this in production, and for now we don't have an equivalent environment to verify it in. We need to find a way to replay production-like traffic for a while (I may not have time to work on this in the near future; I will update this issue once it is verified).
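
For anyone who does get a staging environment, a very rough reproduction sketch: spread traffic across many distinct routes (and consumers, if auth plugins are enabled) so that the Prometheus label-value combinations keep growing, then watch the size of the exported metrics and the worker RSS over time. The route paths, ports and loop bound below are purely illustrative assumptions:

# Illustrative only: hit many different routes to grow label cardinality.
for i in $(seq 1 45000); do
  curl -s -o /dev/null "http://127.0.0.1:9080/route-$i"
done
# Then check how many metric lines the exporter is now holding.
curl -s http://127.0.0.1:9091/apisix/prometheus/metrics | grep -vc '^#'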

@theweakgod

@wklken Has the problem been solved?


wenj91 commented Feb 23, 2024


Here's a clue: this phenomenon is especially likely to occur on image-upload and file-upload endpoints.


wklken commented Feb 23, 2024

image

We rolled out another release (rolling update), and memory did not increase for about 1 week afterwards.


@theweakgod I still can't reproduce the memory increase on my own cluster yet; I will try again later.

@theweakgod

@theweakgod I still can't reproduce the memory increase on my own cluster yet; I will try again later.

👌


theweakgod commented Feb 23, 2024

@theweakgod I still can't reproduce the memory increase.

You need a huge number of metrics to reproduce it.


wklken commented Feb 23, 2024


From the provided chart:

  • During the first 7 days, there is an increase in memory usage, with all Prometheus label value combinations present.
  • Over the following 20 days, there are no new label value combinations, and the memory usage remains stable.

If the pull request "bugfix: limit lookup table size" is effective, the memory usage should not exceed 5.59 GB and should only show an increase during roughly the first 7 days.
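
To make the suspected mechanism concrete, here is a purely illustrative Lua sketch (not the library's actual code) of why unbounded label cardinality keeps growing per-worker memory before the lookup-table limit: each new label-value combination adds an entry to a cache table that is never evicted, so memory only stabilizes once no new combinations appear.

-- Illustrative sketch only: a per-worker cache keyed by label values.
local lookup = {}

local function full_metric_name(metric, route, consumer, status)
  local key = table.concat({ metric, route, consumer, status }, "\0")
  local name = lookup[key]
  if not name then
    name = string.format('%s{route="%s",consumer="%s",status="%s"}',
                         metric, route, consumer, status)
    -- Before the fix this table grows without bound; the fix caps its size
    -- so it cannot accumulate forever.
    lookup[key] = name
  end
  return name
end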



github-actions bot commented Feb 7, 2025

This issue has been marked as stale due to 350 days of inactivity. It will be closed in 2 weeks if no further activity occurs. If this issue is still relevant, please simply write any comment. Even if closed, you can still revive the issue at any time or discuss it on the [email protected] list. Thank you for your contributions.

@github-actions github-actions bot added the stale label Feb 7, 2025

This issue has been closed due to lack of activity. If you think that is incorrect, or the issue requires additional review, you can revive the issue at any time.

@github-actions github-actions bot closed this as not planned Won't fix, can't repro, duplicate, stale Feb 21, 2025
@github-project-automation github-project-automation bot moved this from 🏗 In progress to ✅ Done in Apache APISIX backlog Feb 21, 2025