This repository has been archived by the owner on Apr 26, 2024. It is now read-only.

Synapse not starting, runaway memory/cpu usage when using incorrect logging configuration #1888

Closed
rrix opened this issue Feb 6, 2017 · 4 comments

Comments

rrix commented Feb 6, 2017

As of this morning, I'm unable to start my synapse homeserver.

Until today I was running 0.18.2, and this morning I woke up unable to send messages. The logs showed:

Jan 23 00:05:44 hypervisor01 docker[64033]: Logged from file matrixfederationclient.py, line 181
Jan 23 00:05:45 hypervisor01 docker[64033]: Traceback (most recent call last):
Jan 23 00:05:45 hypervisor01 docker[64033]: File "/usr/lib64/python2.7/logging/__init__.py", line 861, in emit
Jan 23 00:05:45 hypervisor01 docker[64033]: msg = self.format(record)
Jan 23 00:05:45 hypervisor01 docker[64033]: File "/usr/lib64/python2.7/logging/__init__.py", line 734, in format
Jan 23 00:05:45 hypervisor01 docker[64033]: return fmt.format(record)
Jan 23 00:05:45 hypervisor01 docker[64033]: File "/usr/lib64/python2.7/logging/__init__.py", line 469, in format
Jan 23 00:05:45 hypervisor01 docker[64033]: s = self._fmt % record.__dict__
Jan 23 00:05:45 hypervisor01 docker[64033]: KeyError: 'request'
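
(For reference, a minimal sketch of what this traceback seems to imply, independent of synapse itself: a handler whose format string references %(request)s will raise this same KeyError on every record if nothing injects a request attribute, and logging then prints the traceback via Handler.handleError each time.)

# Minimal, hypothetical reproduction -- not synapse code.
import logging

handler = logging.StreamHandler()
# The format string expects a `request` field on every LogRecord...
handler.setFormatter(logging.Formatter(
    '%(asctime)s - %(name)s - %(levelname)s - %(request)s - %(message)s'))

log = logging.getLogger('demo')
log.addHandler(handler)

# ...but no filter adds it, so `self._fmt % record.__dict__` raises
# KeyError: 'request' and logging dumps the traceback above to stderr.
log.warning("hello")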

This repeated in a tight loop, over and over. After a restart the errors continued, so I updated
to 0.19.0. After that there was no exception, but my homeserver began chewing through CPU and
memory pretty aggressively, without logging anything to file or stdout. I `strace`d the process,
which showed the same syscalls in a tight loop:

stat("/usr/lib64/python2.7/logging/__init__.py", {st_mode=S_IFREG|0644, st_size=61264, ...}) = 0
mmap(NULL, 262144, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7f2231470000
stat("/usr/lib64/python2.7/logging/__init__.py", {st_mode=S_IFREG|0644, st_size=61264, ...}) = 0
stat("/etc/localtime", {st_mode=S_IFREG|0644, st_size=127, ...}) = 0
mmap(NULL, 262144, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7f2231430000
brk(0)                                  = 0x7f22a6b70000
brk(0x7f22a6b91000)                     = 0x7f22a6b91000

I would guess that implies it's stuck in the same tight loop as above, but is now dying before it
even logs the exception. Digging a bit more, I discovered some OOM kills:

[Mon Feb  6 09:56:44 2017] postgres invoked oom-killer: gfp_mask=0x2000d0, order=2, oom_score_adj=0
[Mon Feb  6 09:56:44 2017] postgres cpuset=13cdc1a32895a63fbd8a94bf859c4a3cb8d4f578af0f9063372eeec61a35301e mems_allowed=0
...
[Mon Feb  6 09:56:44 2017] Out of memory: Kill process 29061 (python) score 844 or sacrifice child
[Mon Feb  6 09:56:44 2017] Killed process 29061 (python) total-vm:13481740kB, anon-rss:8917388kB, file-rss:0kB

[Mon Feb  6 10:27:22 2017] Out of memory: Kill process 44763 (python) score 845 or sacrifice child
[Mon Feb  6 10:27:22 2017] Killed process 44763 (python) total-vm:12339696kB, anon-rss:9074316kB, file-rss:0kB

My dmesg timestamps are inexplicably ~1:30 off (`[Mon Feb 6 11:58:55 2017]` vs. `Mon Feb 6 10:23:53 PST 2017`
from `echo $(date) > /dev/kmsg`), but the timeline basically works out: after my update, at
~10:00 UTC, my synapse went off the rails. That made me look back at my logs to see when that first
exception started to crop up, and it has been happening for a long time, so maybe this is all a
red herring.

This is very reminiscent of the issues I had in 0.16.x involving the events table, but it's
acting like this right out of the gate: there is no /sync call to be made, and it doesn't even bind
to a port.

root@hypervisor01:~/synapse# ss -tlnp | grep 51465
root@hypervisor01:~/synapse# ps aux | grep synapse.app
root     51465 61.1 80.0 11096112 8181140 ?    Ds   10:30   1:31 python -m synapse.app.homeserver --config-path /data/homeserver.yaml

Interestingly, it sits in disk-sleep for a while there, but I was eventually able to kill it. But, yeah,
basically I don't know what my homeserver is doing, and I'm unsure where to start with it. I've attached
a pyflame trace which implies a loop somewhere; it's missing the startup, since I can't sanely make
pyflame trace synapse as a child process while synapse is inside my Docker container. I'm going to turn
on postgres logging tonight to see if I can figure out where in startup it's going south.

pyflame.zip

rrix commented Feb 8, 2017

It's not even getting to the point where it's running any database queries, so I'm not sure at all what's going on. I'm going to try to strace it from startup and see if I can make sense of it.

rrix commented Feb 8, 2017

So it turns out that if your console handler is missing its filter, synapse just TOTALLY eats shit:

  console:
    class: logging.StreamHandler
    formatter: precise

instead of

  console:
    class: logging.StreamHandler
    formatter: precise
    filters: [context]
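
For context, the relevant bits of the log config (reconstructed from memory of the 0.19-era default, so treat the exact names as an assumption) tie these together: the context filter injects the request attribute that the precise formatter then references, so dropping filters: [context] from a handler leaves %(request)s unresolvable on every record.

formatters:
  precise:
    # `request` only exists on records that have passed through the context filter
    format: '%(asctime)s - %(name)s - %(lineno)d - %(levelname)s - %(request)s - %(message)s'

filters:
  context:
    (): synapse.util.logcontext.LoggingContextFilter
    request: ""

handlers:
  console:
    class: logging.StreamHandler
    formatter: precise
    filters: [context]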

So that's cool.

rrix changed the title from "Synapse not starting, runaway memory/cpu usage" to "Synapse not starting, runaway memory/cpu usage when using incorrect logging configuration" on Feb 8, 2017
rrix commented Feb 8, 2017

I'm getting a new exception which I'll put into a new task; this whole situation should probably be untangled, though.

rrix commented Feb 8, 2017

This is #1884 ...

rrix closed this as completed on Feb 8, 2017