Constant crashing with signal SEGV (segmentation fault) #5101

rreiner · 2021-03-11T13:57:54Z

ntopng is constantly crashing and restarting with a segmentation violation. It did this all night long, and all day yesterday, while the UI was rarely in use or open on any browser.

In journalctl I see entries like:

Mar 11 08:29:19 host89 systemd[1]: ntopng.service: Main process exited, code=killed, status=11/SEGV
Mar 11 08:29:19 host89 systemd[1]: ntopng.service: Failed with result 'signal'.
Mar 11 08:29:25 host89 systemd[1]: ntopng.service: Service RestartSec=5s expired, scheduling restart.
Mar 11 08:29:25 host89 systemd[1]: ntopng.service: Scheduled restart job, restart counter is at 38.

The ntopng version data is:
Version: 4.3.210307 [Enterprise/Professional/Embedded build]
GIT rev: dev:c80dc8af000ece2358518758f2a7177d8e9427b4:20210307
Pro rev: r3624
Built on: Raspbian GNU/Linux 10 (buster)
System Id: 130FA343499602D2
Platform: armv7l
Edition: Enterprise Embedded
License Type: Time-Limited [Empty license file]
Validity: Until Thu Mar 11 09:06:03 2021

rreiner · 2021-03-11T19:46:10Z

Looks like this may be the same issue as #5090 and/or #5093

rreiner · 2021-03-15T13:44:58Z

This is still happening after the 4.3.210314 update. I have also removed all previous data in the data directory and restarted again; that didn't help either.

$ ntopng --version
Version: 4.3.210314 [Enterprise/Professional/Embedded build]
GIT rev: dev:8915a98a8a2ba436cd8d71b3fd456fb7ef5a8977:20210314
Pro rev: r3630
Built on: Raspbian GNU/Linux 10 (buster)
System Id: 130FA343499602D2
Platform: armv7l
Edition: Enterprise Embedded
License Type: Time-Limited [Empty license file]
Validity: Until Mon Mar 15 09:52:32 2021

$ journalctl -f --system --lines=50000 --unit=ntopng | grep -i segv\|restart
Mar 14 23:35:32 nghost systemd[1]: ntopng.service: Main process exited, code=killed, status=11/SEGV
Mar 14 23:35:37 nghost systemd[1]: ntopng.service: Service RestartSec=5s expired, scheduling restart.
Mar 14 23:35:37 nghost systemd[1]: ntopng.service: Scheduled restart job, restart counter is at 1.
Mar 14 23:35:55 nghost systemd[1]: ntopng.service: Main process exited, code=killed, status=11/SEGV
Mar 14 23:36:00 nghost systemd[1]: ntopng.service: Service RestartSec=5s expired, scheduling restart.
Mar 14 23:36:00 nghost systemd[1]: ntopng.service: Scheduled restart job, restart counter is at 2.
Mar 15 00:38:12 nghost systemd[1]: ntopng.service: Main process exited, code=killed, status=11/SEGV
Mar 15 00:38:17 nghost systemd[1]: ntopng.service: Service RestartSec=5s expired, scheduling restart.
Mar 15 00:38:17 nghost systemd[1]: ntopng.service: Scheduled restart job, restart counter is at 3.
Mar 15 00:38:42 nghost systemd[1]: ntopng.service: Main process exited, code=killed, status=11/SEGV
Mar 15 00:38:47 nghost systemd[1]: ntopng.service: Service RestartSec=5s expired, scheduling restart.
Mar 15 00:38:47 nghost systemd[1]: ntopng.service: Scheduled restart job, restart counter is at 4.
Mar 15 08:30:12 nghost systemd[1]: ntopng.service: Main process exited, code=killed, status=11/SEGV
Mar 15 08:30:17 nghost systemd[1]: ntopng.service: Service RestartSec=5s expired, scheduling restart.
Mar 15 08:30:17 nghost systemd[1]: ntopng.service: Scheduled restart job, restart counter is at 5.
Mar 15 08:30:34 nghost systemd[1]: ntopng.service: Main process exited, code=killed, status=11/SEGV
Mar 15 08:30:39 nghost systemd[1]: ntopng.service: Service RestartSec=5s expired, scheduling restart.
Mar 15 08:30:39 nghost systemd[1]: ntopng.service: Scheduled restart job, restart counter is at 6.
Mar 15 08:30:47 nghost systemd[1]: ntopng.service: Main process exited, code=killed, status=11/SEGV
Mar 15 08:30:52 nghost systemd[1]: ntopng.service: Service RestartSec=5s expired, scheduling restart.
Mar 15 08:30:52 nghost systemd[1]: ntopng.service: Scheduled restart job, restart counter is at 7.
Mar 15 08:31:08 nghost systemd[1]: ntopng.service: Main process exited, code=killed, status=11/SEGV
Mar 15 08:31:13 nghost systemd[1]: ntopng.service: Service RestartSec=5s expired, scheduling restart.
Mar 15 08:31:13 nghost systemd[1]: ntopng.service: Scheduled restart job, restart counter is at 8.

rreiner · 2021-03-15T14:04:29Z

Now monitoring it with the 4.3.210315 update

$ ntopng --version
Version: 4.3.210315 [Enterprise/Professional/Embedded build]
GIT rev: dev:6580aa2ac41ba2fe9e55e05bcb792496be5f010e:20210315
Pro rev: r3631
Built on: Raspbian GNU/Linux 10 (buster)
System Id: 130FA343499602D2
Platform: armv7l
Edition: Enterprise Embedded
License Type: Time-Limited [Empty license file]
Validity: Until Mon Mar 15 10:14:07 2021

rreiner · 2021-03-15T17:18:41Z

It's still crashing periodically in the same way with version 4.3.210315

cardigliano · 2021-03-15T17:22:55Z

@rreiner can I see your configuration file?

rreiner · 2021-03-15T17:28:13Z

@cardigliano my config file contents follow:

# ntopng.conf for local collection on eth0 only
# eth0 is dedicated to packet capture
# eth1 is our management interface
#

#         The  configuration  file is similar to the command line, with the exception that an equal
#        sign '=' must be used between key and value. Example:  -i=p1p2  or  --interface=p1p2  For
#        options with no value (e.g. -v) the equal is also necessary. Example: "-v=" must be used.
#
#
#       -G|--pid-path
#        Specifies the path where the PID (process ID) is saved. This option is ignored when
#        ntopng is controlled with systemd (e.g., service ntopng start).
#
-G=/var/run/ntopng.pid
#
#       -e|--daemon
#        This  parameter  causes ntop to become a daemon, i.e. a task which runs in the background
#        without connection to a specific terminal. To use ntop other than as a casual  monitoring
#        tool, you probably will want to use this option. This option is ignored when ntopng is
#        controlled with systemd (e.g., service ntopng start)
#
# -e=
#
#       -i|--interface
#        Specifies  the  network  interface or collector endpoint to be used by ntopng for network
#        monitoring. On Unix you can specify both the interface name  (e.g.  lo)  or  the  numeric
#        interface id as shown by ntopng -h. On Windows you must use the interface number instead.
#        Note that you can specify -i multiple times in order to instruct ntopng to create  multi-
#        ple interfaces.
#
# -i=eth1
# -i=eth2
-i=eth0
#-i=tcp://127.0.0.1:5556
#
#       -w|--http-port
#        Sets the HTTP port of the embedded web server.
#
# -w=3000
#
#       -m|--local-networks
#        ntopng determines the ip addresses and netmasks for each active interface. Any traffic on
#        those  networks  is considered local. This parameter allows the user to define additional
#        networks and subnetworks whose traffic is also considered local in  ntopng  reports.  All
#        other hosts are considered remote. If not specified the default is set to 192.168.1.0/24.
#
#        Commas  separate  multiple  network  values.  Both netmask and CIDR notation may be used,
#        even mixed together, for instance "131.114.21.0/24,10.0.0.0/255.0.0.0".
#
# -m=10.10.123.0/24
# -m=10.10.124.0/24
-m=10.200.200.0/24
#
#       -n|--dns-mode
#        Sets the DNS address resolution mode: 0 - Decode DNS responses  and  resolve  only  local
#        (-m)  numeric  IPs  1  -  Decode DNS responses and resolve all numeric IPs 2 - Decode DNS
#        responses and don't resolve numeric IPs 3 - Don't decode DNS responses and don't  resolve
#
# -n=1
-n=1
#
#       -S|--sticky-hosts
#        ntopng  periodically purges idle hosts. With this option you can modify this behaviour by
#        telling ntopng not to purge the hosts specified by -S. This parameter requires  an  argu-
#        ment  that  can  be  "all"  (Keep  all hosts in memory), "local" (Keep only local hosts),
#        "remote" (Keep only remote hosts), "none" (Flush hosts when idle).
#
# -S=
#
#       -d|--data-dir
#        Specifies the data directory (it must be writable by the user that is executing ntopng).
#
# -d=/var/lib/ntopng
-d=/mnt/ntopngdata
#
#       -q|--disable-autologout
#        Disable web interface logout for inactivity.
#
# -q=
#
# Where should the nDPI custom rules file be stored?
# (see https://www.ntop.org/guides/ntopng/web_gui/categories.html#custom-applications for where to put and how to create such a file)
#
# -p=/var/lib/ntopng/protos.txt
-p=/var/lib/ntopng/protos.txt
#
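For readers skimming the file above, the `key=value` convention described in its header comments can be illustrated with a small sketch that lists only the uncommented options. This is a hypothetical helper (the name `active_options` is made up here), not ntopng's own parser, and it assumes every line starting with `#` is a comment.

```python
def active_options(text):
    """Return (key, value) pairs for uncommented lines such as '-i=eth0'."""
    options = []
    for raw in text.splitlines():
        line = raw.strip()
        # Skip blank lines and comments (assumption: '#' starts a comment).
        if not line or line.startswith("#"):
            continue
        # The file header says '=' is required, even for options with no value.
        key, sep, value = line.partition("=")
        if not sep:
            raise ValueError(f"missing '=' in option line: {line!r}")
        options.append((key, value))
    return options


# A few lines copied from the configuration above.
sample = """\
# -i=eth1
-i=eth0
#-i=tcp://127.0.0.1:5556
-m=10.200.200.0/24
-n=1
# -e=
"""

for key, value in active_options(sample):
    print(key, value)
```

Read the same way, the full file above has six active options: `-G`, `-i=eth0`, `-m=10.200.200.0/24`, `-n=1`, `-d=/mnt/ntopngdata` and `-p=/var/lib/ntopng/protos.txt`.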

cardigliano · 2021-03-15T17:39:12Z

@rreiner could you try adding --community to the configuration? Please let me know if it still crashes. Thank you.

rreiner · 2021-03-15T17:44:56Z

Done, and will monitor for crashes. But I will miss those ten minutes of extended features!

rreiner · 2021-03-15T19:30:56Z

@cardigliano Still happens with --community

$ journalctl -f --system --lines=50 --unit=ntopng | grep -i segv\|restart
Mar 15 15:23:43 nghost systemd[1]: ntopng.service: Main process exited, code=killed, status=11/SEGV
Mar 15 15:23:48 nghost systemd[1]: ntopng.service: Service RestartSec=5s expired, scheduling restart.
Mar 15 15:23:48 nghost systemd[1]: ntopng.service: Scheduled restart job, restart counter is at 1.

rreiner · 2021-03-15T22:39:33Z

One thing that's probably unusual about my config is that I have rather long caching intervals:

Midnight Stats Reset == off
Idle Local Hosts Cache == on
Local Hosts Cache Duration == 30 days
Active Local Hosts Cache == on
Active Local Host Cache Interval == 4 hours
Local Host Idle Timeout == 168 hours
Remote Host Idle Timeout == 5 mins
Hosts Statistics Update Frequency == 5

But the problem isn't memory exhaustion... I see

System
CPU Load 0.34
CPU States: iowait: 0% / active: 3% / idle: 97%
RAM: Used: 16.14% / Available: 1.57 GB / Total: 1.87 GB

ntopng
Process PID: 21625
RAM Used: 249.26 MB

simonemainardi · 2021-03-16T07:57:07Z

A bit OT: may I ask why you are using such long caching intervals? What is your use case? It seems you are pushing ntopng to the maximum caching values it allows.

rreiner · 2021-03-16T13:22:47Z

@simonemainardi The use case is simple:

The network is not large or busy (about 60 hosts total, average throughput seen on the SPAN port of under 1Mbit/sec, no more than about 40 hosts active at one time.)

BUT some hosts come and go at long intervals (1 week or more), and if we do not set the cache intervals high then they disappear from the Hosts displays and it becomes impossible to answer questions like "what was host X, which has been idle for 6 days, doing last Tuesday?", which we sometimes do need to answer.

Anyway stress testing is a good thing, right :-)?

simonemainardi · 2021-03-16T17:48:24Z

Thank you for reporting this.

> Anyway stress testing is a good thing, right :-)?

Totally. I was just curious to see if your use case could have been resolved differently.

rreiner · 2021-03-17T15:59:20Z

It almost looks like there's some periodicity in the crash times -- midnight and 8am seem like the most common (but not the only) times for the SEGVs:

$ journalctl -f --system --lines=50000 --unit=ntopng | grep -i killed
Mar 14 23:35:32 nghost systemd[1]: ntopng.service: Main process exited, code=killed, status=11/SEGV
Mar 14 23:35:55 nghost systemd[1]: ntopng.service: Main process exited, code=killed, status=11/SEGV
Mar 15 00:38:12 nghost systemd[1]: ntopng.service: Main process exited, code=killed, status=11/SEGV
Mar 15 00:38:42 nghost systemd[1]: ntopng.service: Main process exited, code=killed, status=11/SEGV
Mar 15 08:30:12 nghost systemd[1]: ntopng.service: Main process exited, code=killed, status=11/SEGV
Mar 15 08:30:34 nghost systemd[1]: ntopng.service: Main process exited, code=killed, status=11/SEGV
Mar 15 08:30:47 nghost systemd[1]: ntopng.service: Main process exited, code=killed, status=11/SEGV
Mar 15 08:31:08 nghost systemd[1]: ntopng.service: Main process exited, code=killed, status=11/SEGV
Mar 15 12:15:16 nghost systemd[1]: ntopng.service: Main process exited, code=killed, status=11/SEGV
Mar 15 15:23:43 nghost systemd[1]: ntopng.service: Main process exited, code=killed, status=11/SEGV
Mar 16 00:28:59 nghost systemd[1]: ntopng.service: Main process exited, code=killed, status=11/SEGV
Mar 16 00:32:22 nghost systemd[1]: ntopng.service: Main process exited, code=killed, status=11/SEGV
Mar 16 00:34:22 nghost systemd[1]: ntopng.service: Main process exited, code=killed, status=11/SEGV
Mar 16 00:42:00 nghost systemd[1]: ntopng.service: Main process exited, code=killed, status=11/SEGV
Mar 16 00:44:33 nghost systemd[1]: ntopng.service: Main process exited, code=killed, status=11/SEGV
Mar 16 00:45:00 nghost systemd[1]: ntopng.service: Main process exited, code=killed, status=11/SEGV
Mar 16 01:10:40 nghost systemd[1]: ntopng.service: Main process exited, code=killed, status=11/SEGV
Mar 16 05:48:16 nghost systemd[1]: ntopng.service: Main process exited, code=killed, status=11/SEGV
Mar 16 08:44:36 nghost systemd[1]: ntopng.service: Main process exited, code=killed, status=11/SEGV
Mar 16 20:51:37 nghost systemd[1]: ntopng.service: Main process exited, code=killed, status=11/SEGV
Mar 16 20:53:28 nghost systemd[1]: ntopng.service: Main process exited, code=killed, status=11/SEGV
Mar 16 23:53:06 nghost systemd[1]: ntopng.service: Main process exited, code=killed, status=11/SEGV
Mar 16 23:53:44 nghost systemd[1]: ntopng.service: Main process exited, code=killed, status=11/SEGV
Mar 16 23:53:57 nghost systemd[1]: ntopng.service: Main process exited, code=killed, status=11/SEGV
Mar 17 00:20:34 nghost systemd[1]: ntopng.service: Main process exited, code=killed, status=11/SEGV
Mar 17 03:18:33 nghost systemd[1]: ntopng.service: Main process exited, code=killed, status=11/SEGV
Mar 17 08:35:49 nghost systemd[1]: ntopng.service: Main process exited, code=killed, status=11/SEGV

cardigliano · 2021-04-21T12:34:41Z

@rreiner please drop an email to cardigliano at ntop.org and I will send you a binary you can use to generate a trace. Thank you.

cardigliano · 2021-04-23T07:42:34Z

This is the stack trace from @rreiner. It seems the ndpiFlow pointer is not valid; a new debug session will be scheduled to dig further into this.

0x0008a2a4 in Flow::processDNSPacket (this=0x9b7f7fd8, ip_packet=0xa6f2fe "E", ip_len=46, packet_time=1619128522623) at src/Flow.cpp:725
(gdb) bt
#0 0x0008a2a4 in Flow::processDNSPacket (this=0x9b7f7fd8, ip_packet=0xa6f2fe "E", ip_len=46, packet_time=1619128522623) at src/Flow.cpp:725
#1 0x00132200 in NetworkInterface::processPacket (this=0xac3bd8, bridge_iface_idx=1, ingressPacket=true, when=0xa6b9e4, packet_time=1619128522623, eth=0xa6f2f0, vlan_id=0, iph=0xa6f2fe, ip6=0x0, ip_offset=14, len_on_wire=60, h=0xa6b9e4, packet=0xa6f2f0 "\246ݼ\300s\205\244Lg\225\060\b", ndpiProtocol=0x9ede93b2, srcHost=0x9ede93ac, dstHost=0x9ede93a8, hostFlow=0x9ede93a4) at src/NetworkInterface.cpp:1535
#2 0x001353c0 in NetworkInterface::dissectPacket (this=0xac3bd8, bridge_iface_idx=1, ingressPacket=true, sender_mac=0x0, h=0xa6b9e4, packet=0xa6f2f0 "\246ݼ\300s\205\244Lg\225\060\b", ndpiProtocol=0x9ede93b2, srcHost=0x9ede93ac, dstHost=0x9ede93a8, flow=0x9ede93a4) at src/NetworkInterface.cpp:2142
#3 0x000de84c in packetPollLoop (ptr=0xac3bd8) at src/PcapInterface.cpp:334
#4 0xb666e494 in start_thread (arg=0x9ede9ca0) at pthread_create.c:486
#5 0xb647d578 in ?? () at ../sysdeps/unix/sysv/linux/arm/clone.S:73 from /lib/arm-linux-gnueabihf/libc.so.6
Backtrace stopped: previous frame identical to this frame (corrupt stack?)

lucaderi · 2021-04-24T07:48:22Z

@rreiner I have made a fix you can try. Packages are being rebuilt and will be available in about one hour from now. Please upgrade and report. Thank you.

rreiner · 2021-04-24T15:16:18Z

OK, got it installed and running, and will monitor for crashes.


rreiner · 2021-04-25T20:58:29Z

No crashes in 36 hours since I installed the update. This compares to 5-6 per day prior. So tentatively this appears to be fixed.
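As a rough sanity check on "tentatively fixed", one can ask how likely 36 crash-free hours would be if the old crash rate still applied. The sketch below assumes crashes arrive as a Poisson process at 5 or 6 per day; that is an illustrative simplification, since the logs earlier in this thread show crashes arriving in bursts, so the real probability of a quiet 36 hours under the old behaviour would be somewhat higher than these figures.

```python
import math


def prob_no_crash(rate_per_day, hours):
    """P(zero events in `hours`) for a Poisson process with the given daily rate."""
    expected = rate_per_day * hours / 24.0  # expected number of crashes in the window
    return math.exp(-expected)


for rate in (5, 6):
    print(f"{rate}/day: expected {rate * 36 / 24:.1f} crashes, "
          f"P(none in 36 h) = {prob_no_crash(rate, 36):.2e}")
```

At those rates, 7.5 to 9 crashes would have been expected in the window, so a clean 36 hours is hard to explain by chance alone under this model.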


cardigliano · 2021-04-26T07:07:28Z

@rreiner this is great news. Let's close this; please reopen it if you experience other crashes. Thank you.

rreiner changed the title from "Constant crashing with SEGV - every 10-20 minutes" to "Constant crashing with signal SEGV - every 10-20 minutes" on Mar 12, 2021

rreiner changed the title from "Constant crashing with signal SEGV - every 10-20 minutes" to "Constant crashing with signal SEGV (segmentation fault) - every 10-20 minutes" on Mar 14, 2021

simonemainardi added the Bug label Mar 15, 2021

cardigliano added the RPI label Mar 15, 2021

rreiner changed the title from "Constant crashing with signal SEGV (segmentation fault) - every 10-20 minutes" to "Constant crashing with signal SEGV (segmentation fault)" on Mar 18, 2021

cardigliano self-assigned this Apr 21, 2021

lucaderi added a commit that referenced this issue Apr 24, 2021

Added check to avoid crash when dissecing DNS packets (#5101)

cce92c4

lucaderi added the Ready to Test label (feedback is needed on a proposal or implementation) Apr 24, 2021

cardigliano closed this as completed Apr 26, 2021
