bug: APISIX memory management - expired monitoring keys are not released #9627
Comments
The metrics are (eventually) flushed to shared memory, and that shared memory is a fixed-size LRU cache (i.e. entries are evicted when the cache is full). It is not counted as part of the nginx worker memory, so there is no need to worry about OOM. Which version of APISIX do you use?
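For background, nginx-lua-prometheus keeps metric samples in an nginx lua_shared_dict rather than in per-worker Lua memory. Below is a minimal sketch of how such an exporter is wired up; the dict name, metric, and labels are illustrative, not APISIX's actual plugin code:

```lua
-- nginx.conf must declare the shared dict backing the metrics, e.g.:
--   lua_shared_dict prometheus_metrics 800m;
local prometheus = require("prometheus").init("prometheus_metrics")

-- Every distinct label combination becomes its own key in the shared dict,
-- so high-cardinality labels (such as upstream node IPs) grow the key set.
local http_requests = prometheus:counter(
    "http_requests_total", "Number of HTTP requests", {"route", "node"})

-- Called per request (e.g. in the log phase); a new node IP creates a new key.
local function record(route_id, node_ip)
    http_requests:inc(1, {route_id, node_ip})
end
```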
version: 3.3.0
apisix: # universal configurations
  node_listen:
    - port: 9080                  # APISIX listening port
      enable_http2: false
    - port: 9081
      enable_http2: true
  enable_heartbeat: true
  enable_admin: true
  enable_admin_cors: true
  enable_debug: false
  enable_dev_mode: false          # Sets nginx worker_processes to 1 if set to true
  enable_reuseport: true          # Enable nginx SO_REUSEPORT switch if set to true.
  enable_ipv6: true               # Enable nginx IPv6 resolver
  enable_server_tokens: false     # Whether the APISIX version number should be shown in the Server header
  # proxy_protocol:               # Proxy Protocol configuration
  #   listen_http_port: 9181      # The port with proxy protocol for http; it differs from node_listen and admin_listen.
  #                               # This port can only receive http requests with proxy protocol, while node_listen & admin_listen
  #                               # can only receive plain http requests. If you enable proxy protocol, you must use this port to
  #                               # receive http requests with proxy protocol.
  #   listen_https_port: 9182     # The port with proxy protocol for https
  #   enable_tcp_pp: true         # Enable the proxy protocol for tcp proxy; it works for the stream_proxy.tcp option
  #   enable_tcp_pp_to_upstream: true  # Enables the proxy protocol to the upstream server
  proxy_cache:                    # Proxy Caching configuration
    cache_ttl: 10s                # The default caching time if the upstream does not specify the cache time
    zones:                        # The parameters of a cache
      - name: disk_cache_one      # The name of the cache; the administrator can specify
                                  # which cache to use by name in the Admin API
        memory_size: 50m          # The size of shared memory, used to store the cache index
        disk_size: 1G             # The size of disk, used to store the cache data
        disk_path: "/tmp/disk_cache_one"  # The path to store the cache data
        cache_levels: "1:2"       # The hierarchy levels of a cache
      # - name: disk_cache_two
      #   memory_size: 50m
      #   disk_size: 1G
      #   disk_path: "/tmp/disk_cache_two"
      #   cache_levels: "1:2"
  router:
    http: radixtree_uri           # radixtree_uri: match route by uri (based on radixtree)
                                  # radixtree_host_uri: match route by host + uri (based on radixtree)
                                  # radixtree_uri_with_parameter: match route by uri with parameters
    ssl: 'radixtree_sni'          # radixtree_sni: match route by SNI (based on radixtree)
  stream_proxy:                   # TCP/UDP proxy
    only: false
    tcp:                          # TCP proxy port list
      - 8001
  # dns_resolver:
  #   - 127.0.0.1
  #   - 172.20.0.10
  #   - 114.114.114.114
  #   - 223.5.5.5
  #   - 1.1.1.1
  #   - 8.8.8.8
  dns_resolver_valid: 30
  resolver_timeout: 5
  ssl:
    enable: true
    listen:
      - port: 9443
        enable_http2: true
    ssl_protocols: "TLSv1.2 TLSv1.3"
    ssl_ciphers: "xxxxx"
    ssl_trusted_certificate: "/etcd-ssl/ca.pem"

nginx_config:                     # config to render the template and generate nginx.conf
  http_server_configuration_snippet: |
    proxy_ignore_client_abort on;
  error_log: "/dev/stderr"
  error_log_level: "error"        # warn, error
  worker_processes: "8"
  enable_cpu_affinity: true
  worker_rlimit_nofile: 102400    # the number of files a worker process can open; should be larger than worker_connections
  event:
    worker_connections: 65535
  http:
    enable_access_log: true
    access_log: "/dev/stdout"
    access_log_format: '{\"timestamp\":\"$time_iso8601\",\"server_addr\":\"$server_addr\",\"remote_addr\":\"$remote_addr\",\"remote_port\":\"$realip_remote_port\",\"all_cookie\":\"$http_cookie\",\"http_host\":\"$http_host\",\"query_string\":\"$query_string\",\"request_method\":\"$request_method\",\"uri\":\"$uri\",\"service\":\"apisix_backend\",\"request_uri\":\"$request_uri\",\"status\":\"$status\",\"body_bytes_sent\":\"$body_bytes_sent\",\"request_time\":\"$request_time\",\"upstream_response_time\":\"$upstream_response_time\",\"upstream_addr\":\"$upstream_addr\",\"upstream_status\":\"$upstream_status\",\"http_referer\":\"$http_referer\",\"http_user_agent\":\"$http_user_agent\",\"http_x_forwarded_for\":\"$http_x_forwarded_for\",\"spanId\":\"$http_X_B3_SpanId\",\"http_token\":\"$http_token\",\"http_authorizationv2\":\"$http_authorizationv2\",\"content-type\":\"$content_type\",\"content-length\":\"$content_length\",\"traceId\":\"$http_X_B3_TraceId\"}'
    access_log_format_escape: json
    lua_shared_dict:
      prometheus-metrics: 800m
      discovery: 300m
      kubernetes: 200m
    keepalive_timeout: 60s        # timeout during which a keep-alive client connection will stay open on the server side
    client_header_timeout: 60s    # timeout for reading the client request header, after which a 408 (Request Time-out) error is returned to the client
    client_body_timeout: 60s      # timeout for reading the client request body, after which a 408 (Request Time-out) error is returned to the client
    send_timeout: 10s             # timeout for transmitting a response to the client, after which the connection is closed
    underscores_in_headers: "on"  # enables the use of underscores in client request header fields by default
    real_ip_header: "X-Forwarded-For"  # http://nginx.org/en/docs/http/ngx_http_realip_module.html#real_ip_header
    real_ip_recursive: on         # http://nginx.org/en/docs/http/ngx_http_realip_module.html#set_real_ip_from
    # real_ip_from:               # http://nginx.org/en/docs/http/ngx_http_realip_module.html#set_real_ip_from
    #   - 127.0.0.1
    #   - 'unix:'
    real_ip_from:
      - 127.0.0.1/24
      - 'unix:'
      - 10.28.0.0/14
      - 10.32.0.0/17

discovery:
  kubernetes:
    client:
      token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
    service:
      host: ${KUBERNETES_SERVICE_HOST}
      port: ${KUBERNETES_SERVICE_PORT}
      schema: https

plugins:                          # plugin list
  - api-breaker
  - authz-keycloak
  - basic-auth
  - batch-requests
  - consumer-restriction
  - cors
  - client-control
  - echo
  - fault-injection
  - file-logger
  - grpc-transcode
  - grpc-web
  - hmac-auth
  - http-logger
  - ip-restriction
  - ua-restriction
  - jwt-auth
  - kafka-logger
  - key-auth
  - limit-conn
  - limit-count
  - limit-req
  - node-status
  - openid-connect
  - authz-casbin
  - prometheus
  - proxy-cache
  - proxy-mirror
  - proxy-rewrite
  - redirect
  - referer-restriction
  - request-id
  - request-validation
  - response-rewrite
  - serverless-post-function
  - serverless-pre-function
  - sls-logger
  - syslog
  - tcp-logger
  - udp-logger
  - uri-blocker
  - wolf-rbac
  - zipkin
  - traffic-split
  - gzip
  - real-ip
  - ext-plugin-pre-req
  - ext-plugin-post-req

stream_plugins:
  - mqtt-proxy
  - ip-restriction
  - limit-conn

plugin_attr:
  prometheus:
    enable_export_server: true
    export_addr:
      ip: 0.0.0.0
      port: 9091
    export_uri: /apisix/prometheus/metrics
    metric_prefix: apisix_

deployment:
  role: traditional
  role_traditional:
    config_provider: etcd
  admin:
    allow_admin:                  # http://nginx.org/en/docs/http/ngx_http_access_module.html#allow
      - 127.0.0.1/24
      - 172.16.174.0/24
      # - "::/64"
    admin_listen:
      ip: 0.0.0.0
      port: 9180
    # Default token used when calling the Admin API.
    # *NOTE*: It is highly recommended to modify this value to protect APISIX's Admin API.
    # Disabling this configuration item means that the Admin API does not
    # require any authentication.
    admin_key:
      # admin: can do everything with configuration data
      - name: "admin"
        key: xxxxx
        role: admin
      # viewer: can only view configuration data
      - name: "viewer"
        key: xxxxx
        role: viewer
    https_admin: false
    admin_api_mtls:
      admin_ssl_ca_cert: "/etcd-ssl/ca.pem"
      admin_ssl_cert: "/etcd-ssl/etcd.pem"
      admin_ssl_cert_key: "/etcd-ssl/etcd-key.pem"
  etcd:
    host:                         # it's possible to define multiple etcd host addresses of the same etcd cluster
      - "https://xx.xx:2379"      # multiple etcd addresses
    prefix: "/apisix"             # configuration prefix in etcd
    timeout: 30                   # 30 seconds
    tls:
      ssl_trusted_certificate: "/etcd-ssl/ca.pem"
      cert: "/etcd-ssl/etcd.pem"
      key: "/etcd-ssl/etcd-key.pem"
      verify: true
      sni: "xxx.com"
It is obvious that there are problems with the mechanism of this Prometheus exporter, which can be seen from four aspects:
In fact, nginx-lua-prometheus provides counter:del() and gauge:del() methods to delete labels. The APISIX Prometheus plugin may need to delete Prometheus metric data at certain times. Our current approach is similar, but more aggressive: we retain only the type-level and route-level data and remove everything else.

before:

    metrics.latency = prometheus:histogram("http_latency",
        "HTTP request latency in milliseconds per service in APISIX",
        {"type", "route", "service", "consumer", "node", unpack(extra_labels("http_latency"))},
        buckets)

after:

    metrics.latency = prometheus:histogram("http_latency",
        "HTTP request latency in milliseconds per service in APISIX",
        {"type", "route", unpack(extra_labels("http_latency"))},
        buckets)
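Since the comment above leans on counter:del() and gauge:del(), here is a minimal sketch of deleting one label combination with nginx-lua-prometheus; the metric name and label values are made up for illustration:

```lua
local prometheus = require("prometheus").init("prometheus_metrics")

local http_status = prometheus:counter(
    "http_status", "HTTP status codes per route", {"code", "route", "node"})

-- record a sample for a node that later disappears from the upstream
http_status:inc(1, {"200", "route-1", "10.0.0.15"})

-- remove exactly that label combination from the shared dict
http_status:del({"200", "route-1", "10.0.0.15"})

-- reset() is the heavier alternative: it drops all label combinations at once
-- http_status:reset()
```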
@hansedong Well said 👍 We are trying to find a general proposal, for example: set the TTL of these Prometheus metrics in the LRU to 10 minutes (it can of course be adjusted; this is just an example), so that this memory issue can be solved. What do you think?
This is a good idea. As I understand it, the TTL mechanism can preserve data for specific metrics (which are updated regularly) while also allowing expired metrics to be deleted.
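A rough sketch of the TTL idea under discussion, assuming the plugin tracks a last-update timestamp per label combination and sweeps expired ones with del(); this is only an illustration, not the implementation that was eventually merged:

```lua
local ngx_now = ngx.now

local TTL = 600  -- 10 minutes, as suggested above; would need to be configurable

-- last-update timestamp per label combination, keyed by a joined label string
local last_seen = {}

-- call this whenever a metric is updated for a given label combination
local function touch(key)
    last_seen[key] = ngx_now()
end

-- run periodically, e.g. via ngx.timer.every(60, ...), to drop expired keys;
-- `tracked` maps each key to the label-value table originally passed to inc()
local function sweep(metric, tracked)
    local now = ngx_now()
    for key, labels in pairs(tracked) do
        if now - (last_seen[key] or 0) > TTL then
            metric:del(labels)          -- release the expired shared-dict entry
            last_seen[key] = nil
            tracked[key] = nil
        end
    end
end
```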
Hi, this case has been reproduced by a test case. Please take a look.
I think this TTL solution is a bit troublesome, because the upstream Prometheus library does not provide a place to set a TTL parameter when setting the value. If you want the TTL solution, you need to change the upstream library. The more important point is that latency is a histogram data type, and a TTL cannot be used to automatically reclaim its resources.
I have inspected Kong, and it looks like it has the same problem.
Thank you for your continued attention. After discussion with @membphis, I found that my previous understanding of the metrics was wrong. We use the TTL scheme to recycle metrics that have not been updated for a long time, which has no impact on Grafana's display.
Do we have a plan for the TTL?
APISIX uses the knyar/nginx-lua-prometheus library to record the metrics. The TTL solution would be better if it were supported by the underlying library. This is currently being discussed with the maintainer of knyar/nginx-lua-prometheus in knyar/nginx-lua-prometheus#164; in any case, this issue is already being advanced.
@hansedong The TTL feature is merged; would you like to do some testing?
Yes, I'd love to. I plan to upgrade one APISIX gateway in a microservice scenario to test the effect of the new feature.
@moonming Hello, this problem has been fixed in version 3.9.0. When will the fix be available in version 3.2.2?
No, we will keep new features and bug fixes in the master branch.
If it is only fixed in the new version, how should we, who run a long-term support version, deal with this kind of problem that affects production stability? Should we consider upgrading?
APISIX version 3.9 still has the issue of the http_status / upstream_status keys not being updated. When can this be resolved?
Current Behavior
If the prometheus plugin is enabled, and the upstream uses Kubernetes service discovery or the upstream IPs change with each release, APISIX accumulates too many monitoring keys. Memory keeps growing, and if APISIX is not restarted it will eventually OOM.
Expected Behavior
I expect an automatic detection mechanism that, when the upstream IPs change, releases the keys of node entries in the in-memory metrics that no longer exist.
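One way the expected behaviour could be sketched, assuming the plugin knows both the label sets it has recorded and the current upstream node set (the function and table names here are hypothetical):

```lua
-- Release node-level metric keys whose node no longer exists in the upstream.
-- `recorded` maps a key to its label-value table, e.g. {"200", "route-1", "10.0.0.15"};
-- `alive_nodes` is the current set of upstream IPs, e.g. { ["10.0.0.15"] = true }.
local function release_stale_node_keys(metric, recorded, alive_nodes)
    for key, labels in pairs(recorded) do
        local node_ip = labels[3]       -- the "node" label in this example layout
        if not alive_nodes[node_ip] then
            metric:del(labels)          -- free the shared-dict entry for the removed node
            recorded[key] = nil
        end
    end
end
```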
Error Logs
No response
Steps to Reproduce
I worked around the problem of too many keys being generated by upstream IP changes during releases by disabling the node dimension in the metrics.

Environment
- apisix version:
- uname -a:
- openresty -V or nginx -V:
- curl http://127.0.0.1:9090/v1/server_info:
- luarocks --version: