Constant crashing with signal SEGV (segmentation fault) #5101

Closed
rreiner opened this issue Mar 11, 2021 · 20 comments
Assignees
Labels
Bug Ready to Test a feedback is needed on a proposal or implementation

Comments

@rreiner

rreiner commented Mar 11, 2021

ntopng is constantly crashing and restarting with a segmentation violation. It did this all night long, and all day yesterday, while the UI was rarely in use or open on any browser.

In journalctl I see entries like:

Mar 11 08:29:19 host89 systemd[1]: ntopng.service: Main process exited, code=killed, status=11/SEGV
Mar 11 08:29:19 host89 systemd[1]: ntopng.service: Failed with result 'signal'.
Mar 11 08:29:25 host89 systemd[1]: ntopng.service: Service RestartSec=5s expired, scheduling restart.
Mar 11 08:29:25 host89 systemd[1]: ntopng.service: Scheduled restart job, restart counter is at 38.

The ntopng version data is:
Version: 4.3.210307 [Enterprise/Professional/Embedded build]
GIT rev: dev:c80dc8af000ece2358518758f2a7177d8e9427b4:20210307
Pro rev: r3624
Built on: Raspbian GNU/Linux 10 (buster)
System Id: 130FA343499602D2
Platform: armv7l
Edition: Enterprise Embedded
License Type: Time-Limited [Empty license file]
Validity: Until Thu Mar 11 09:06:03 2021

@rreiner
Author

rreiner commented Mar 11, 2021

Looks like this may be the same issue as #5090 and/or #5093

@rreiner rreiner changed the title Constant crashing with SEGV - every 10-20 minutes Constant crashing with signal SEGV - every 10-20 minutes Mar 12, 2021
@rreiner rreiner changed the title Constant crashing with signal SEGV - every 10-20 minutes Constant crashing with signal SEGV (segmentation fault) - every 10-20 minutes Mar 14, 2021
@rreiner
Author

rreiner commented Mar 15, 2021

This is still happening after the 4.3.210314 update. I have also removed all previous data in the data directory and restarted again, but that didn't help either.

$ ntopng --version
Version: 4.3.210314 [Enterprise/Professional/Embedded build]
GIT rev: dev:8915a98a8a2ba436cd8d71b3fd456fb7ef5a8977:20210314
Pro rev: r3630
Built on: Raspbian GNU/Linux 10 (buster)
System Id: 130FA343499602D2
Platform: armv7l
Edition: Enterprise Embedded
License Type: Time-Limited [Empty license file]
Validity: Until Mon Mar 15 09:52:32 2021

$ journalctl -f --system --lines=50000 --unit=ntopng | grep -i 'segv\|restart'
Mar 14 23:35:32 nghost systemd[1]: ntopng.service: Main process exited, code=killed, status=11/SEGV
Mar 14 23:35:37 nghost systemd[1]: ntopng.service: Service RestartSec=5s expired, scheduling restart.
Mar 14 23:35:37 nghost systemd[1]: ntopng.service: Scheduled restart job, restart counter is at 1.
Mar 14 23:35:55 nghost systemd[1]: ntopng.service: Main process exited, code=killed, status=11/SEGV
Mar 14 23:36:00 nghost systemd[1]: ntopng.service: Service RestartSec=5s expired, scheduling restart.
Mar 14 23:36:00 nghost systemd[1]: ntopng.service: Scheduled restart job, restart counter is at 2.
Mar 15 00:38:12 nghost systemd[1]: ntopng.service: Main process exited, code=killed, status=11/SEGV
Mar 15 00:38:17 nghost systemd[1]: ntopng.service: Service RestartSec=5s expired, scheduling restart.
Mar 15 00:38:17 nghost systemd[1]: ntopng.service: Scheduled restart job, restart counter is at 3.
Mar 15 00:38:42 nghost systemd[1]: ntopng.service: Main process exited, code=killed, status=11/SEGV
Mar 15 00:38:47 nghost systemd[1]: ntopng.service: Service RestartSec=5s expired, scheduling restart.
Mar 15 00:38:47 nghost systemd[1]: ntopng.service: Scheduled restart job, restart counter is at 4.
Mar 15 08:30:12 nghost systemd[1]: ntopng.service: Main process exited, code=killed, status=11/SEGV
Mar 15 08:30:17 nghost systemd[1]: ntopng.service: Service RestartSec=5s expired, scheduling restart.
Mar 15 08:30:17 nghost systemd[1]: ntopng.service: Scheduled restart job, restart counter is at 5.
Mar 15 08:30:34 nghost systemd[1]: ntopng.service: Main process exited, code=killed, status=11/SEGV
Mar 15 08:30:39 nghost systemd[1]: ntopng.service: Service RestartSec=5s expired, scheduling restart.
Mar 15 08:30:39 nghost systemd[1]: ntopng.service: Scheduled restart job, restart counter is at 6.
Mar 15 08:30:47 nghost systemd[1]: ntopng.service: Main process exited, code=killed, status=11/SEGV
Mar 15 08:30:52 nghost systemd[1]: ntopng.service: Service RestartSec=5s expired, scheduling restart.
Mar 15 08:30:52 nghost systemd[1]: ntopng.service: Scheduled restart job, restart counter is at 7.
Mar 15 08:31:08 nghost systemd[1]: ntopng.service: Main process exited, code=killed, status=11/SEGV
Mar 15 08:31:13 nghost systemd[1]: ntopng.service: Service RestartSec=5s expired, scheduling restart.
Mar 15 08:31:13 nghost systemd[1]: ntopng.service: Scheduled restart job, restart counter is at 8.
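
The crashes in this log tend to arrive in short bursts, with a new SEGV seconds after each restart. A minimal sketch of how such bursts could be counted from the journal output follows. The function names, the 5-minute gap threshold, and the year 2021 are assumptions made for illustration (the journal's short timestamp format omits the year); nothing here is part of ntopng.

```python
from datetime import datetime, timedelta

def parse_crash_times(lines, year=2021):
    """Return the timestamps of lines that report status=11/SEGV.

    The short journal format has no year, so `year` is an assumption.
    """
    times = []
    for line in lines:
        if "status=11/SEGV" not in line:
            continue
        # First three whitespace-separated fields: month, day, HH:MM:SS
        month, day, clock = line.split()[:3]
        stamp = f"{year} {month} {day} {clock}"
        times.append(datetime.strptime(stamp, "%Y %b %d %H:%M:%S"))
    return times

def group_bursts(times, max_gap=timedelta(minutes=5)):
    """Group timestamps into bursts; a gap larger than max_gap starts a new burst."""
    bursts = []
    for t in sorted(times):
        if bursts and t - bursts[-1][-1] <= max_gap:
            bursts[-1].append(t)
        else:
            bursts.append([t])
    return bursts

# A subset of the lines posted above (restart lines are ignored by the parser).
LOG = """\
Mar 14 23:35:32 nghost systemd[1]: ntopng.service: Main process exited, code=killed, status=11/SEGV
Mar 14 23:35:37 nghost systemd[1]: ntopng.service: Service RestartSec=5s expired, scheduling restart.
Mar 14 23:35:55 nghost systemd[1]: ntopng.service: Main process exited, code=killed, status=11/SEGV
Mar 15 00:38:12 nghost systemd[1]: ntopng.service: Main process exited, code=killed, status=11/SEGV
Mar 15 00:38:42 nghost systemd[1]: ntopng.service: Main process exited, code=killed, status=11/SEGV
Mar 15 08:30:12 nghost systemd[1]: ntopng.service: Main process exited, code=killed, status=11/SEGV
Mar 15 08:30:34 nghost systemd[1]: ntopng.service: Main process exited, code=killed, status=11/SEGV
Mar 15 08:30:47 nghost systemd[1]: ntopng.service: Main process exited, code=killed, status=11/SEGV
Mar 15 08:31:08 nghost systemd[1]: ntopng.service: Main process exited, code=killed, status=11/SEGV
""".splitlines()

if __name__ == "__main__":
    for burst in group_bursts(parse_crash_times(LOG)):
        print(burst[0].strftime("%b %d %H:%M"), len(burst), "crash(es)")
```

With the eight SEGV lines above this yields three bursts of 2, 2 and 4 crashes, which suggests that whatever triggers the fault is still present right after each restart.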

@rreiner
Author

rreiner commented Mar 15, 2021

Now monitoring it with the 4.3.210315 update

$ ntopng --version
Version: 4.3.210315 [Enterprise/Professional/Embedded build]
GIT rev: dev:6580aa2ac41ba2fe9e55e05bcb792496be5f010e:20210315
Pro rev: r3631
Built on: Raspbian GNU/Linux 10 (buster)
System Id: 130FA343499602D2
Platform: armv7l
Edition: Enterprise Embedded
License Type: Time-Limited [Empty license file]
Validity: Until Mon Mar 15 10:14:07 2021

@rreiner
Author

rreiner commented Mar 15, 2021

It's still crashing periodically in the same way with version 4.3.210315

@cardigliano
Member

@rreiner can I see your configuration file?

@rreiner
Author

rreiner commented Mar 15, 2021

@cardigliano my config file contents follow:

# ntopng.conf for local collection on eth0 only
# eth0 is dedicated to packet capture
# eth1 is our management interface
#

#         The  configuration  file is similar to the command line, with the exception that an equal
#        sign '=' must be used between key and value. Example:  -i=p1p2  or  --interface=p1p2  For
#        options with no value (e.g. -v) the equal is also necessary. Example: "-v=" must be used.
#
#
#       -G|--pid-path
#        Specifies the path where the PID (process ID) is saved. This option is ignored when
#        ntopng is controlled with systemd (e.g., service ntopng start).
#
-G=/var/run/ntopng.pid
#
#       -e|--daemon
#        This  parameter  causes ntop to become a daemon, i.e. a task which runs in the background
#        without connection to a specific terminal. To use ntop other than as a casual  monitoring
#        tool, you probably will want to use this option. This option is ignored when ntopng is
#        controlled with systemd (e.g., service ntopng start)
#
# -e=
#
#       -i|--interface
#        Specifies  the  network  interface or collector endpoint to be used by ntopng for network
#        monitoring. On Unix you can specify both the interface name  (e.g.  lo)  or  the  numeric
#        interface id as shown by ntopng -h. On Windows you must use the interface number instead.
#        Note that you can specify -i multiple times in order to instruct ntopng to create  multi-
#        ple interfaces.
#
# -i=eth1
# -i=eth2
-i=eth0
#-i=tcp://127.0.0.1:5556
#
#       -w|--http-port
#        Sets the HTTP port of the embedded web server.
#
# -w=3000
#
#       -m|--local-networks
#        ntopng determines the ip addresses and netmasks for each active interface. Any traffic on
#        those  networks  is considered local. This parameter allows the user to define additional
#        networks and subnetworks whose traffic is also considered local in  ntopng  reports.  All
#        other hosts are considered remote. If not specified the default is set to 192.168.1.0/24.
#
#        Commas  separate  multiple  network  values.  Both netmask and CIDR notation may be used,
#        even mixed together, for instance "131.114.21.0/24,10.0.0.0/255.0.0.0".
#
# -m=10.10.123.0/24
# -m=10.10.124.0/24
-m=10.200.200.0/24
#
#       -n|--dns-mode
#        Sets the DNS address resolution mode: 0 - Decode DNS responses  and  resolve  only  local
#        (-m)  numeric  IPs  1  -  Decode DNS responses and resolve all numeric IPs 2 - Decode DNS
#        responses and don't resolve numeric IPs 3 - Don't decode DNS responses and don't  resolve
#
# -n=1
-n=1
#
#       -S|--sticky-hosts
#        ntopng  periodically purges idle hosts. With this option you can modify this behaviour by
#        telling ntopng not to purge the hosts specified by -S. This parameter requires  an  argu-
#        ment  that  can  be  "all"  (Keep  all hosts in memory), "local" (Keep only local hosts),
#        "remote" (Keep only remote hosts), "none" (Flush hosts when idle).
#
# -S=
#
#       -d|--data-dir
#        Specifies the data directory (it must be writable by the user that is executing ntopng).
#
# -d=/var/lib/ntopng
-d=/mnt/ntopngdata
#
#       -q|--disable-autologout
#        Disable web interface logout for inactivity.
#
# -q=
#
# Where should the nDPI custom rules file be stored?
# (see https://www.ntop.org/guides/ntopng/web_gui/categories.html#custom-applications for where to put and how to create such a file)
#
# -p=/var/lib/ntopng/protos.txt
-p=/var/lib/ntopng/protos.txt
#
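
Since most of the file above is commented-out documentation, it can be hard to see which options are actually in effect. The following is a rough sketch, not ntopng's own parser, that lists only the active `key=value` lines. The function name `parse_ntopng_conf` is hypothetical, and the `-v=` line in the sample is taken from the file's header comment rather than from the posted settings.

```python
from collections import defaultdict

def parse_ntopng_conf(text):
    """Collect active (uncommented) key=value options from ntopng.conf-style text.

    Options that may repeat, such as -i, are kept as lists in file order.
    """
    options = defaultdict(list)
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # blank line or comment
        key, sep, value = line.partition("=")
        if not sep:
            # The file's header says '=' is required even for valueless options.
            raise ValueError(f"missing '=' in option line: {raw!r}")
        options[key.strip()].append(value.strip())
    return dict(options)

# Abbreviated from the configuration above; "-v=" is illustrative only.
SAMPLE = """\
-G=/var/run/ntopng.pid
# -i=eth1
-i=eth0
#-i=tcp://127.0.0.1:5556
# -w=3000
-m=10.200.200.0/24
-n=1
-d=/mnt/ntopngdata
-p=/var/lib/ntopng/protos.txt
-v=
"""

if __name__ == "__main__":
    for key, values in parse_ntopng_conf(SAMPLE).items():
        print(key, values)
```

For the posted file, the active settings reduce to `-G`, `-i=eth0`, `-m=10.200.200.0/24`, `-n=1`, `-d=/mnt/ntopngdata` and `-p`; everything else is left at its default.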

@cardigliano
Member

@rreiner could you try adding --community to the configuration? Please let me know if it still crashes. Thank you.

@rreiner
Author

rreiner commented Mar 15, 2021

Done, and will monitor for crashes. But I will miss those ten minutes of extended features!

@rreiner
Author

rreiner commented Mar 15, 2021

@cardigliano Still happens with --community

$ journalctl -f --system --lines=50 --unit=ntopng | grep -i 'segv\|restart'
Mar 15 15:23:43 nghost systemd[1]: ntopng.service: Main process exited, code=killed, status=11/SEGV
Mar 15 15:23:48 nghost systemd[1]: ntopng.service: Service RestartSec=5s expired, scheduling restart.
Mar 15 15:23:48 nghost systemd[1]: ntopng.service: Scheduled restart job, restart counter is at 1.

@rreiner
Author

rreiner commented Mar 15, 2021

One thing that's probably unusual about my config is that I have rather long caching intervals:

Midnight Stats Reset == off
Idle Local Hosts Cache == on
Local Hosts Cache Duration == 30 days
Active Local Hosts Cache == on
Active Local Host Cache Interval == 4 hours
Local Host Idle Timeout == 168 hours
Remote Host Idle Timeout == 5 mins
Hosts Statistics Update Frequency == 5

But the problem isn't memory exhaustion... I see

System
CPU Load 0.34
CPU States: iowait: 0% / active: 3% / idle: 97%
RAM: Used: 16.14% / Available: 1.57 GB / Total: 1.87 GB

ntopng
Process PID: 21625
RAM Used: 249.26 MB

@simonemainardi
Contributor

A bit OT: may I ask why you are using such long caching intervals? What is your use case? It seems you are pushing ntopng to the maximum caching values it allows you to configure.

@rreiner
Author

rreiner commented Mar 16, 2021

@simonemainardi The use case is simple:

The network is not large or busy (about 60 hosts total, average throughput seen on the SPAN port of under 1Mbit/sec, no more than about 40 hosts active at one time.)

BUT some hosts come and go at long intervals (1 week or more), and if we do not set the cache intervals high then they disappear from the Hosts displays and it becomes impossible to answer questions like "what was host X, which has been idle for 6 days, doing last Tuesday?", which we sometimes do need to answer.

Anyway stress testing is a good thing, right :-)?

@simonemainardi
Contributor

Thank you for reporting this.

Anyway stress testing is a good thing, right :-)?

Totally. I was just curious to see if your use case could have been resolved differently.

@rreiner
Author

rreiner commented Mar 17, 2021

It almost looks like there's some periodicity in the crash times -- midnight and 8am seem like the most common (but not the only) times for the SEGVs:

$ journalctl -f --system --lines=50000 --unit=ntopng | grep -i killed
Mar 14 23:35:32 nghost systemd[1]: ntopng.service: Main process exited, code=killed, status=11/SEGV
Mar 14 23:35:55 nghost systemd[1]: ntopng.service: Main process exited, code=killed, status=11/SEGV
Mar 15 00:38:12 nghost systemd[1]: ntopng.service: Main process exited, code=killed, status=11/SEGV
Mar 15 00:38:42 nghost systemd[1]: ntopng.service: Main process exited, code=killed, status=11/SEGV
Mar 15 08:30:12 nghost systemd[1]: ntopng.service: Main process exited, code=killed, status=11/SEGV
Mar 15 08:30:34 nghost systemd[1]: ntopng.service: Main process exited, code=killed, status=11/SEGV
Mar 15 08:30:47 nghost systemd[1]: ntopng.service: Main process exited, code=killed, status=11/SEGV
Mar 15 08:31:08 nghost systemd[1]: ntopng.service: Main process exited, code=killed, status=11/SEGV
Mar 15 12:15:16 nghost systemd[1]: ntopng.service: Main process exited, code=killed, status=11/SEGV
Mar 15 15:23:43 nghost systemd[1]: ntopng.service: Main process exited, code=killed, status=11/SEGV
Mar 16 00:28:59 nghost systemd[1]: ntopng.service: Main process exited, code=killed, status=11/SEGV
Mar 16 00:32:22 nghost systemd[1]: ntopng.service: Main process exited, code=killed, status=11/SEGV
Mar 16 00:34:22 nghost systemd[1]: ntopng.service: Main process exited, code=killed, status=11/SEGV
Mar 16 00:42:00 nghost systemd[1]: ntopng.service: Main process exited, code=killed, status=11/SEGV
Mar 16 00:44:33 nghost systemd[1]: ntopng.service: Main process exited, code=killed, status=11/SEGV
Mar 16 00:45:00 nghost systemd[1]: ntopng.service: Main process exited, code=killed, status=11/SEGV
Mar 16 01:10:40 nghost systemd[1]: ntopng.service: Main process exited, code=killed, status=11/SEGV
Mar 16 05:48:16 nghost systemd[1]: ntopng.service: Main process exited, code=killed, status=11/SEGV
Mar 16 08:44:36 nghost systemd[1]: ntopng.service: Main process exited, code=killed, status=11/SEGV
Mar 16 20:51:37 nghost systemd[1]: ntopng.service: Main process exited, code=killed, status=11/SEGV
Mar 16 20:53:28 nghost systemd[1]: ntopng.service: Main process exited, code=killed, status=11/SEGV
Mar 16 23:53:06 nghost systemd[1]: ntopng.service: Main process exited, code=killed, status=11/SEGV
Mar 16 23:53:44 nghost systemd[1]: ntopng.service: Main process exited, code=killed, status=11/SEGV
Mar 16 23:53:57 nghost systemd[1]: ntopng.service: Main process exited, code=killed, status=11/SEGV
Mar 17 00:20:34 nghost systemd[1]: ntopng.service: Main process exited, code=killed, status=11/SEGV
Mar 17 03:18:33 nghost systemd[1]: ntopng.service: Main process exited, code=killed, status=11/SEGV
Mar 17 08:35:49 nghost systemd[1]: ntopng.service: Main process exited, code=killed, status=11/SEGV
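
One way to check the suspected periodicity is to bucket the SEGV lines by hour of day. The sketch below assumes the short journal timestamp format shown in this listing; the regular expression and the function name `crashes_by_hour` are illustrative choices.

```python
import re
from collections import Counter

# Month, day, HH:MM:SS, then anything up to the SEGV status; captures the hour.
SEGV = re.compile(r"^\w{3}\s+\d{1,2}\s+(\d{2}):\d{2}:\d{2}\s.*status=11/SEGV")

def crashes_by_hour(lines):
    """Count SEGV exits per hour of day (0-23), ignoring all other lines."""
    hours = Counter()
    for line in lines:
        match = SEGV.match(line)
        if match:
            hours[int(match.group(1))] += 1
    return hours

# A few lines from the listing above, plus one restart line that must be skipped.
SAMPLE = [
    "Mar 14 23:35:32 nghost systemd[1]: ntopng.service: Main process exited, code=killed, status=11/SEGV",
    "Mar 15 00:38:12 nghost systemd[1]: ntopng.service: Main process exited, code=killed, status=11/SEGV",
    "Mar 15 08:30:12 nghost systemd[1]: ntopng.service: Main process exited, code=killed, status=11/SEGV",
    "Mar 15 08:30:17 nghost systemd[1]: ntopng.service: Scheduled restart job, restart counter is at 5.",
    "Mar 15 08:30:34 nghost systemd[1]: ntopng.service: Main process exited, code=killed, status=11/SEGV",
]

if __name__ == "__main__":
    for hour, count in sorted(crashes_by_hour(SAMPLE).items()):
        print(f"{hour:02d}:00  {count}")
```

Counted over the full listing above, 9 of the 27 crashes fall in hour 00, 6 in hour 08 and 5 in hour 23, which is consistent with the clustering around midnight and 8am.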

@rreiner rreiner changed the title Constant crashing with signal SEGV (segmentation fault) - every 10-20 minutes Constant crashing with signal SEGV (segmentation fault) Mar 18, 2021
@cardigliano
Member

@rreiner please drop an email to cardigliano at ntop.org and I will send you a binary you can use to generate a trace. Thank you.

@cardigliano cardigliano self-assigned this Apr 21, 2021
@cardigliano
Member

This is the stack trace from @rreiner. It seems the ndpiFlow pointer is not valid; a new debug session will be scheduled to dig further into this.

0x0008a2a4 in Flow::processDNSPacket (this=0x9b7f7fd8, ip_packet=0xa6f2fe "E", ip_len=46, packet_time=1619128522623) at src/Flow.cpp:725
(gdb) bt
#0 0x0008a2a4 in Flow::processDNSPacket (this=0x9b7f7fd8, ip_packet=0xa6f2fe "E", ip_len=46, packet_time=1619128522623) at src/Flow.cpp:725
#1 0x00132200 in NetworkInterface::processPacket (this=0xac3bd8, bridge_iface_idx=1, ingressPacket=true, when=0xa6b9e4, packet_time=1619128522623, eth=0xa6f2f0, vlan_id=0, iph=0xa6f2fe, ip6=0x0, ip_offset=14, len_on_wire=60, h=0xa6b9e4, packet=0xa6f2f0 "\246ݼ\300s\205\244Lg\225\060\b", ndpiProtocol=0x9ede93b2, srcHost=0x9ede93ac, dstHost=0x9ede93a8, hostFlow=0x9ede93a4) at src/NetworkInterface.cpp:1535
#2 0x001353c0 in NetworkInterface::dissectPacket (this=0xac3bd8, bridge_iface_idx=1, ingressPacket=true, sender_mac=0x0, h=0xa6b9e4, packet=0xa6f2f0 "\246ݼ\300s\205\244Lg\225\060\b", ndpiProtocol=0x9ede93b2, srcHost=0x9ede93ac, dstHost=0x9ede93a8, flow=0x9ede93a4) at src/NetworkInterface.cpp:2142
#3 0x000de84c in packetPollLoop (ptr=0xac3bd8) at src/PcapInterface.cpp:334
#4 0xb666e494 in start_thread (arg=0x9ede9ca0) at pthread_create.c:486
#5 0xb647d578 in ?? () at ../sysdeps/unix/sysv/linux/arm/clone.S:73 from /lib/arm-linux-gnueabihf/libc.so.6
Backtrace stopped: previous frame identical to this frame (corrupt stack?)

@lucaderi
Member

@rreiner I have made a fix you can try. Packages are being rebuilt and will be available in about one hour from now. Please upgrade and report. Thank you.

@lucaderi lucaderi added the Ready to Test a feedback is needed on a proposal or implementation label Apr 24, 2021
@rreiner
Author

rreiner commented Apr 24, 2021 via email

@rreiner
Author

rreiner commented Apr 25, 2021 via email

@cardigliano
Member

@rreiner this is great news. Let's close this; please reopen if you experience other crashes. Thank you.
