Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

bug: openresty child process cpu user space occupies 100% #7550

Closed
chenqiaorui opened this issue Jul 27, 2022 · 18 comments
Closed

bug: openresty child process cpu user space occupies 100% #7550

chenqiaorui opened this issue Jul 27, 2022 · 18 comments
Labels

Comments

@chenqiaorui
Copy link

Current Behavior

openresty其中一个子进程cpu100%,使用ltrace追踪该异常进程,结果如下:
...
lua_gettop(0x7fda9c61f508, 0x7fdaa930300e, 0x7fda9c0e2b68, 0x7fda9c0e2c10) = 1
lua_getexdata(0x7fda9c61f508, 0x7fdaa930300e, 0x7fda9c0e2b68, 0x7fda9c0e2c10) = 0x91a650
luaL_checktype(0x7fda9c61f508, 1, 5, 0x7fda9c0e2c10) = 5
lua_rawgeti(0x7fda9c61f508, 1, 1, 44) = 0x7fda9c0e2b78
lua_touserdata(0x7fda9c61f508, 0xffffffff, 0xfff9ffda9c16b090, 44) = 0x7fda9c16b0c0
lua_pushnil(0x7fda9c61f508, -1, 69, -13) = 0x7fda9c0e2b80
lua_pushlstring(0x7fda9c61f508, 0x54cf9b, 6, -13) = 0xfffdffdaa9302ff0
lua_gettop(0x7fda9c61f508, 0x7fdaa930300e, 0x7fda9c0e2b68, 0x7fda9c0e2c10) = 1
lua_getexdata(0x7fda9c61f508, 0x7fdaa930300e, 0x7fda9c0e2b68, 0x7fda9c0e2c10) = 0x91a650
luaL_checktype(0x7fda9c61f508, 1, 5, 0x7fda9c0e2c10) = 5
lua_rawgeti(0x7fda9c61f508, 1, 1, 44) = 0x7fda9c0e2b78
lua_touserdata(0x7fda9c61f508, 0xffffffff, 0xfff9ffda9c16b090, 44) = 0x7fda9c16b0c0
lua_pushnil(0x7fda9c61f508, -1, 69, -13) = 0x7fda9c0e2b80
lua_pushlstring(0x7fda9c61f508, 0x54cf9b, 6, -13) = 0xfffdffdaa9302ff0
...

Expected Behavior

No response

Error Logs

No response

Steps to Reproduce

1.apisix是源码编译,部署在centos
2.不稳定复现,apisix运行两个月左右就出现一个nginx子进程cpu占用100%。

Environment

  • APISIX version (run apisix version): 2.9
  • Operating system (run uname -a): Linux test-181-102-abjc 3.10.0-514.10.2.el7.x86_64 change: added doc of how to load plugin. #1 SMP Fri Mar 3 00:04:05 UTC 2017 x86_64 x86_64 x86_64 GNU/Linux
  • OpenResty / Nginx version (run openresty -V or nginx -V): nginx version: openresty/1.19.9.1
  • etcd version, if relevant (run curl http://127.0.0.1:9090/v1/server_info): 3.4.0
  • APISIX Dashboard version, if relevant: 2.7
  • Plugin runner version, for issues related to plugin runners:
  • LuaRocks version, for installation issues (run luarocks --version): /usr/local/bin/luarocks 3.4.0
@tzssangglass
Copy link
Member

It looks like the rawget is caused dead loop.

Can you provide more information?

@spacewander
Copy link
Member

You can use lbt to get the Lua stack:
https://github.com/spacewander/openresty-gdb-utils/#lbt

@moonming moonming changed the title bug: openresty 子进程cpu用户空间占用100% bug: openresty child process cpu user space occupies 100% Aug 8, 2022
@chenqiaorui
Copy link
Author

hi, spacewander.
After i type lbt, it return "No global L located (tried globalL and ngx_cycle)".

The details:

[root@ricky apisix-debug]# ps aux|grep nginx
root       1748  0.0  0.0  27660  1584 ?        Ss   14:14   0:00 nginx: master process openresty -p /opt/apisix-debug
nobody     1749  0.0  0.0  28108  2336 ?        S    14:14   0:00 nginx: worker process
root       1751  0.0  0.0 112812   980 pts/0    R+   14:14   0:00 grep --color=auto nginx
[root@ricky apisix-debug]# gdb -p 1749
GNU gdb (GDB) 8.2.1
Copyright (C) 2018 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.
Type "show copying" and "show warranty" for details.
This GDB was configured as "x86_64-pc-linux-gnu".
Type "show configuration" for configuration details.
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>.
Find the GDB manual and other documentation resources online at:
    <http://www.gnu.org/software/gdb/documentation/>.

For help, type "help".
Type "apropos word" to search for commands related to "word".
Attaching to process 1749
Reading symbols from /usr/local/openresty/nginx/sbin/nginx...(no debugging symbols found)...done.
Reading symbols from /lib64/libdl.so.2...(no debugging symbols found)...done.
Reading symbols from /lib64/libpthread.so.0...(no debugging symbols found)...done.
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib64/libthread_db.so.1".
Reading symbols from /lib64/libcrypt.so.1...(no debugging symbols found)...done.
Reading symbols from /usr/local/openresty/luajit/lib/libluajit-5.1.so.2...(no debugging symbols found)...done.
Reading symbols from /lib64/libm.so.6...(no debugging symbols found)...done.
Reading symbols from /usr/local/openresty/pcre/lib/libpcre.so.1...(no debugging symbols found)...done.
Reading symbols from /usr/local/openresty/openssl111/lib/libssl.so.1.1...(no debugging symbols found)...done.
Reading symbols from /usr/local/openresty/openssl111/lib/libcrypto.so.1.1...(no debugging symbols found)...done.
Reading symbols from /usr/local/openresty/zlib/lib/libz.so.1...(no debugging symbols found)...done.
Reading symbols from /lib64/libc.so.6...(no debugging symbols found)...done.
Reading symbols from /lib64/ld-linux-x86-64.so.2...(no debugging symbols found)...done.
Reading symbols from /lib64/libfreebl3.so...Reading symbols from .gnu_debugdata for /lib64/libfreebl3.so...(no debugging symbols found)...done.
(no debugging symbols found)...done.
Reading symbols from /lib64/libgcc_s.so.1...(no debugging symbols found)...done.
Reading symbols from /lib64/libnss_files.so.2...(no debugging symbols found)...done.
0x00007f143fac20c3 in __epoll_wait_nocancel () from /lib64/libc.so.6
(gdb) source luajit21.py
(gdb) lbt
No global L located (tried globalL and ngx_cycle)

What should i do?

added ~/.gdbinit configuration:

directory /opt/apisix-debug/openresty-gdb-utils-lbt_gc64
  
py import sys
py sys.path.append("/opt/apisix-debug/openresty-gdb-utils-lbt_gc64")

source luajit20.gdb
source ngx-lua.gdb
source luajit21.py
source ngx-raw-req.py
set python print-stack full

@chenqiaorui
Copy link
Author

I try to install openresty-debuginfo-1.19.9.1-1.el7 but lbt return nothing:

(gdb) source luajit21.py
(gdb) lbt
(gdb) 
(gdb)

@spacewander
Copy link
Member

What does bt output?

@chenqiaorui
Copy link
Author

(gdb) bt
#0 0x00007fc2124b70c3 in __epoll_wait_nocancel () from /lib64/libc.so.6
#1 0x000000000044ce08 in ngx_epoll_process_events (cycle=0x20acee0, timer=100, flags=1) at src/event/modules/ngx_epoll_module.c:800
#2 0x0000000000443c65 in ngx_process_events_and_timers (cycle=0x20acee0) at src/event/ngx_event.c:257
#3 0x000000000044b192 in ngx_worker_process_cycle (cycle=cycle@entry=0x20acee0, data=data@entry=0x1) at src/os/unix/ngx_process_cycle.c:782
#4 0x0000000000449977 in ngx_spawn_process (cycle=0x20acee0, proc=0x44b120 <ngx_worker_process_cycle>, data=0x1, name=0x550323 "worker process", respawn=-3) at src/os/unix/ngx_process.c:199
#5 0x000000000044a6b4 in ngx_start_worker_processes (cycle=cycle@entry=0x20acee0, n=8, type=type@entry=-3) at src/os/unix/ngx_process_cycle.c:382
#6 0x000000000044bb4c in ngx_master_process_cycle (cycle=0x20acee0) at src/os/unix/ngx_process_cycle.c:135
#7 0x0000000000423cb0 in main (argc=, argv=) at src/core/nginx.c:386
(gdb)

@spacewander
Copy link
Member

You may need to generate a C level on-CPU framegraph. __epoll_wait_nocancel is a blocking syscall, a process can't have 100% CPU usage in user space when it is blocked on this.

@tzssangglass
Copy link
Member

tzssangglass commented Aug 8, 2022

Here is the CI related to generating framegraph: https://github.com/apache/apisix/blob/master/.github/workflows/performance.yml

@tzssangglass
Copy link
Member

tzssangglass commented Aug 8, 2022

I want to make sure of one question: do you deploy private code code, such as custom plugins?

@tzssangglass
Copy link
Member

And you also need to provide some configuration information. Since you are using APISIX version 2.9, we have fixed some CPU usage 100% bugs after 2.9, so your information will help us to determine if it is a fixed issue.

@chenqiaorui
Copy link
Author

You may need to generate a C level on-CPU framegraph. __epoll_wait_nocancel is a blocking syscall, a process can't have 100% CPU usage in user space when it is blocked on this.

@spacewander the bt message just from a nomal process. 100% CPU usage child process is killed because i need to recover the production enviroment. I don't know what things cause 100% CPU, so i want to find someways to debug the exception when it happens.

@chenqiaorui
Copy link
Author

I want to make sure of one question: do you deploy private code code, such as custom plugins?

@tzssangglass yes, i have a custom plugin. The plugin use redis subscribe and ngx.share. The code can be see below:

local redis_new = require("resty.redis").new
local core = require("apisix.core")
local setmetatable = setmetatable
local tostring = tostring

local _M = {version = 0.1}

local mt = {
    __index = _M
}

function _M.new(plugin_name, conf)
    local self = {conf = conf, plugin_name = plugin_name}
    return setmetatable(self, mt)
end -- redis conf

local function red_conn(conf)
    local red = redis_new() 
    local timeout = conf.redis_timeout or 1000   

    red:set_timeouts(timeout, timeout, timeout) 

    local ok, err = red:connect(conf.redis_host, conf.redis_port or 6379) -- redis conn
    if not ok then
        return false, err
    end

    local count
    count, err = red:get_reused_times()
    if 0 == count then
        if conf.redis_password and conf.redis_password ~= '' then
            local ok, err = red:auth(conf.redis_password) -- redis auth
            if not ok then
                return nil, err
            end
        end
    elseif err then
        return nil, err
    end
    return red, false
end

function _M.token(self, key)
    local auth_token = ngx.shared["auth_token"] 
    if auth_token == ngx.null then
        return false, 'index auth_token error.'
    end

    if auth_token:get(key) then      
		return true, nil -- token vaild, return true
    end

    local conf = self.conf
    local red, err = red_conn(conf) -- redis conn
    if err then
        return false, err
    end
  
    core.log.info("***connect to redis ***")
    local ok, err = red:select(conf.redis_db)
    if not ok then
        return false, err
    end
    local ret, err = red:exists(key) 
    
    if not ret then
        return false, err 
    end
    local ttl, err = red:ttl(key)
    if ttl == -1 then
        ttl = 604800
    end

    ok, err = red:set_keepalive(10000, 100) -- keep redis conn
    if not ok then
        return false, err
    end

    if ret == 0 then
        return false, "rejected" -- token not exists ,return reject
    end

    auth_token:set(key,'',ttl) -- set shared memory

    return true, ret -- token vaild, return true
end

function _M.subscribe(self, func)
    local co = coroutine.create(function()
        local conf = self.conf
        local red, err = red_conn(conf)
        if err then
            core.log.error("failed to conn redis: ", err)
        end
        
        local ok, err = red:subscribe('__keyevent@'..conf.redis_db..'__:del') -- subscribe delete command
        if not ok then
            core.log.error("failed to create subscribe: ", err)
        end
        
        red:subscribe('__keyevent@'..conf.redis_db..'__:expired') -- subscribe expired command
        while true do
            local ret,err = red:read_reply()
            if err then
                core.log.info("failed to read_reply at subscribe: ", err)
            else
                func(ret) -- update shared memeory
            end
            red:set_keepalive(100, 100) -- keep redis conn
        end
    end)
    coroutine.resume(co) 
end

return _M

@chenqiaorui
Copy link
Author

added config.yaml configuration:

...
lua_shared_dicts:             # add custom shared cache to nginx.conf
      auth_token: 10m
...
redis:
  redis_host: "xx"
  redis_port: 6379
  redis_password: "xx"
  redis_db: 1
  redis_timeout: 3000
...

@tzssangglass
Copy link
Member

If you can generate a flamgraph at 100% CPU, the problem would be obvious.

@chenqiaorui
Copy link
Author

I captured the perf.data using perf at 100% CPU, maybe it can provide some helpful message?

perf record -g -p 21515 -- sleep 30   

and then,

git clone https://github.com/brendangregg/FlameGraph
mv perf.data ./FlameGraph
perf script > out.perf
./stackcollapse-perf.pl out.perf > out.folded
./flamegraph.pl out.folded > flame.svg

Here is flamgraph:
flame

@tzssangglass
Copy link
Member

This is a useless flamgraph without any valid stack information. Perhaps you are using perf incorrectly, or you are running APISIX and apisix-base with missing debug information (the official releases of APISIX and apisix-base do have debug information).

@github-actions
Copy link

This issue has been marked as stale due to 350 days of inactivity. It will be closed in 2 weeks if no further activity occurs. If this issue is still relevant, please simply write any comment. Even if closed, you can still revive the issue at any time or discuss it on the [email protected] list. Thank you for your contributions.

@github-actions github-actions bot added the stale label Jul 26, 2023
@github-actions
Copy link

github-actions bot commented Aug 9, 2023

This issue has been closed due to lack of activity. If you think that is incorrect, or the issue requires additional review, you can revive the issue at any time.

@github-actions github-actions bot closed this as not planned Won't fix, can't repro, duplicate, stale Aug 9, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

3 participants