I'm trying to fix this issue on the live demo server. The tracker container restarts every 2 hours because of the healthcheck. I'm still trying to figure out what is happening. However, I've noticed a lot of zombie processes. This may or may not be related to the periodic restart.
A few minutes after restarting the server, you start seeing a lot of zombie processes.
This is the server 3 hours after restarting the tracker container:
docker ps
CONTAINER ID   IMAGE                       COMMAND                  CREATED       STATUS                   PORTS                                                                                                                                      NAMES
ef72b037bf26   nginx:mainline-alpine       "/docker-entrypoint.…"   3 hours ago   Up 3 hours               0.0.0.0:80->80/tcp, :::80->80/tcp, 0.0.0.0:443->443/tcp, :::443->443/tcp                                                                  proxy
d7618a22d425   torrust/index-gui:develop   "/usr/local/bin/entr…"   3 hours ago   Up 3 hours (unhealthy)   0.0.0.0:3000->3000/tcp, :::3000->3000/tcp                                                                                                  index-gui
3f34f41514bb   torrust/index:develop       "/usr/local/bin/entr…"   3 hours ago   Up 3 hours (healthy)     0.0.0.0:3001->3001/tcp, :::3001->3001/tcp                                                                                                  index
e938bf65ea02   torrust/tracker:develop     "/usr/local/bin/entr…"   3 hours ago   Up 3 hours (unhealthy)   0.0.0.0:1212->1212/tcp, :::1212->1212/tcp, 0.0.0.0:7070->7070/tcp, :::7070->7070/tcp, 1313/tcp, 0.0.0.0:6969->6969/udp, :::6969->6969/udp   tracker
As you can see, the tracker is unhealthy. Running top showed 87 zombie processes, and I've seen more in other cases. That output was captured when the server was already too busy swapping. Before reaching that point, you get output like this:
top -U torrust
top - 14:59:08 up 21:33,  1 user,  load average: 13.99, 13.41, 11.21
Tasks: 184 total,   5 running, 116 sleeping,   0 stopped,  63 zombie
%Cpu(s): 14.6 us, 73.5 sy,  0.0 ni,  0.0 id,  4.6 wa,  0.0 hi,  7.0 si,  0.3 st
MiB Mem :    957.4 total,     79.9 free,    823.5 used,     54.1 buff/cache
MiB Swap:      0.0 total,      0.0 free,      0.0 used.     30.6 avail Mem

    PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND
 494706 torrust   20   0  815552 538000      0 S   7.6  54.9  20:33.65 torrust-tracker
 495006 torrust   20   0   21.0g  30616      0 S   0.3   3.1   0:34.54 node
 599470 torrust   20   0   11040   3136   2244 R   0.3   0.3   0:00.07 top
 598211 torrust   20   0   17068   2580    856 S   0.0   0.3   0:00.29 systemd
 598212 torrust   20   0  169404   4000      0 S   0.0   0.4   0:00.00 (sd-pam)
 598290 torrust   20   0   17224   3100    548 S   0.0   0.3   0:01.17 sshd
 598291 torrust   20   0    9980   4656   2108 S   0.0   0.5   0:00.83 bash
You can see how the number of zombie processes keeps increasing over time. I have also listed the zombie processes and their parents: the zombies are children of the main torrust-tracker, index, and index-gui processes. In the past, we had a similar problem, and it was solved by adding timeouts. That could be the reason for the healthcheck zombies, but I'm not sure; in this case, the problem seems to be with the node webserver (495006 494983 /nodejs/bin/node /app/.output/server/index.mjs). I guess the webserver is spawning child processes to handle requests, but they are not finishing correctly.
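For reference, this is a rough Rust sketch of the timeout-plus-reap pattern that the earlier fix was based on. The command path /usr/local/bin/health_check is a placeholder, not the actual binary used in the containers, and the 5-second deadline is an arbitrary example: the point is that whoever spawns a helper (a container healthcheck or the node webserver) has to both bound its runtime and reap it, otherwise every hung or finished helper leaves a zombie behind.

use std::process::Command;
use std::thread::sleep;
use std::time::{Duration, Instant};

fn main() -> std::io::Result<()> {
    // Placeholder command; the real healthcheck binary/path may differ.
    let mut child = Command::new("/usr/local/bin/health_check").spawn()?;

    let deadline = Instant::now() + Duration::from_secs(5);
    loop {
        // `try_wait` reaps the child if it has already exited, without blocking.
        if let Some(status) = child.try_wait()? {
            println!("healthcheck exited with: {status}");
            break;
        }
        if Instant::now() >= deadline {
            // Timed out: kill the child, then still `wait` so it does not
            // linger as a zombie in the process table.
            child.kill()?;
            child.wait()?;
            eprintln!("healthcheck timed out");
            break;
        }
        sleep(Duration::from_millis(100));
    }
    Ok(())
}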
A zombie process, also known as a defunct process, occurs in a Unix-like operating system when a process finishes execution but its entry remains in the process table. In simpler terms, it's a process that has completed its execution but still has an entry in the process table because its parent process hasn't yet retrieved its exit status.
When a process finishes its execution, it typically sends an exit status to its parent process, indicating its completion. The parent process is then responsible for reading this exit status via system calls like wait() or waitpid(). Once the parent process retrieves the exit status, the zombie process is removed from the process table, and its resources are released.
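As a minimal illustration (not taken from the issue itself), this Rust sketch shows the reaping described above: `Child::wait` is the standard-library wrapper over the underlying `waitpid` call, and retrieving the exit status is what removes the child's entry from the process table.

use std::process::Command;

fn main() -> std::io::Result<()> {
    // Spawn a short-lived child process.
    let mut child = Command::new("true").spawn()?;

    // `wait` blocks until the child exits and retrieves its exit status,
    // which removes the child's entry from the process table (reaping).
    let status = child.wait()?;
    println!("child exited with: {status}");

    Ok(())
}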
However, if the parent process fails to retrieve the exit status of its child processes (perhaps because it's busy with other tasks or has terminated without cleaning up its child processes), the child process enters a zombie state. In this state, the process table entry remains, but the process itself is essentially defunct; it occupies virtually no system resources, except for its entry in the process table.
Zombie processes are usually harmless by themselves and don't consume significant system resources. However, having too many zombie processes can indicate a problem with process management, such as a bug in the parent process or a resource exhaustion issue. Therefore, while individual zombie processes are not a cause for concern, a large number of them may require investigation and remediation.
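To make that failure mode concrete, here is a small sketch (again an assumption-laden example, not code from the project) that deliberately leaves a child unreaped. While the program sleeps, ps or top on the host shows the exited child in state Z (<defunct>) until `wait` is finally called or the parent itself exits.

use std::process::Command;
use std::thread::sleep;
use std::time::Duration;

fn main() -> std::io::Result<()> {
    // Spawn a child that exits almost immediately.
    let mut child = Command::new("true").spawn()?;

    // The parent does not call `wait` yet, so once the child has exited it
    // stays in the process table as a zombie (`Z` / <defunct> in ps/top).
    sleep(Duration::from_secs(30));

    // Reaping the child removes the zombie entry.
    child.wait()?;
    Ok(())
}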
ChatGPT
I think we should check that the healthcheck binaries exit correctly in all cases. However, it looks like, in this case, the reason could be that the parent process "fails to retrieve the exit status".
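One way to make a healthcheck binary "exit correctly in all cases" is to bound every operation it performs and always terminate with an explicit status code. A rough Rust sketch of that idea follows; the address and the 5-second timeout are example values (port 1212 happens to be the tracker API port exposed above), not the demo's actual healthcheck configuration.

use std::net::TcpStream;
use std::process::exit;
use std::time::Duration;

fn main() {
    // Example endpoint; the real healthcheck may target a different address.
    let addr = "127.0.0.1:1212".parse().expect("valid socket address");

    // A bounded connect attempt: the check can never hang indefinitely,
    // and it always terminates with an explicit exit code for Docker.
    match TcpStream::connect_timeout(&addr, Duration::from_secs(5)) {
        Ok(_) => exit(0), // healthy
        Err(e) => {
            eprintln!("healthcheck failed: {e}");
            exit(1) // unhealthy
        }
    }
}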
Relates to: #1