CCGX v2.20~48 stuck while shutting down #312
Why didn't the watchdog intervene? A device hanging itself up during reboot is stupid. So, as a precaution, it would be best to stop the watchdog first, and only thereafter stop all other services and do whatever else needs doing. Right now the watchdog is only stopped after various other things are stopped (remember: init first Kills, then Starts).

Wouldn't stopping the watchdog risk also tripping systems that are still rebooting? No: the kernel driver expects a write to the watchdog device every minute. The userland watchdog process updates the kernel every 10 seconds; stopping it therefore leaves at least 50 seconds for the device to reboot.

Proposed change: rename the watchdog script to K00watchdog, so that when entering runlevel 6 the first thing done is stopping the watchdog.
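The K00watchdog rename works because sysvinit runs the K* symlinks of a runlevel directory in plain lexical order. A minimal sketch of that ordering, using a few of the symlink names from this issue (only the sort behaviour matters here):

```shell
# sysvinit kills services in the lexical order of the K* names in
# /etc/rc6.d; a K00 prefix therefore sorts before every K15/K20 script.
printf '%s\n' K15svscanboot.sh K20syslog K00watchdog | sort | head -n 1
# → K00watchdog
```

On the device itself the change would amount to renaming the symlink in the runlevel-6 directory, so the watchdog service is the very first thing stopped.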
A device hanging itself up during shutdown as part of a reboot is stupid. Therefore, the first thing to do when initiating a reboot is to stop the watchdog.

Wouldn't this risk interrupting a reboot that was otherwise OK? No: the kernel driver expects a write to the watchdog device every minute. The userland watchdog process updates the kernel every 10 seconds; stopping it therefore leaves at least 50 seconds for the device to reboot.

Also add a message to the syslog, to make the log file clear:

Aug 17 11:07:07 ccgx user.notice shutdown[2127]: shutting down for system reboot
Aug 17 11:07:07 ccgx daemon.info init: Switching to runlevel: 6
Aug 17 11:07:07 ccgx user.notice root: Stopping watchdog (keeping hw watchdog alive)
Aug 17 11:07:07 ccgx user.crit kernel: [  236.577026] omap_wdt: Unexpected close, not stopping!

victronenergy/venus#312
A device hanging itself up during shutdown as part of a reboot is stupid. Therefore, the first thing to do when initiating a reboot is to stop the watchdog. The processes stopped before the watchdog were:

K15svscanboot.sh -> ../init.d/svscanboot.sh
K20dbus-1 -> ../init.d/dbus-1
K20dnsmasq -> ../init.d/dnsmasq
K20hwclock.sh -> ../init.d/hwclock.sh
K20resolv-watch -> ../init.d/resolv-watch
K20syslog -> ../init.d/syslog
K20watchdog -> ../init.d/watchdog

With this commit it's the first thing.

Wouldn't this risk interrupting a reboot that was otherwise OK? No: the kernel driver expects a write to the watchdog device every minute. The userland watchdog process updates the kernel every 10 seconds; stopping it therefore leaves at least 50 seconds for the device to reboot.

Also add a message to the syslog, to make the log file clear:

Aug 17 11:07:07 ccgx user.notice shutdown[2127]: shutting down for system reboot
Aug 17 11:07:07 ccgx daemon.info init: Switching to runlevel: 6
Aug 17 11:07:07 ccgx user.notice root: Stopping watchdog (keeping hw watchdog alive)
Aug 17 11:07:07 ccgx user.crit kernel: [  236.577026] omap_wdt: Unexpected close, not stopping!

closes victronenergy/venus#312
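The 50-second claim in both commit messages follows directly from the two intervals given above; a minimal sketch of the arithmetic (the 60 s kernel timeout and 10 s userland ping interval are taken from the text):

```shell
# Worst case: the userland watchdog was stopped immediately after it
# last pinged the kernel, so up to one full ping interval of the kernel
# timeout is already "used up" when the reboot begins.
kernel_timeout=60   # seconds the kernel driver waits for a write
ping_interval=10    # seconds between userland watchdog updates
echo $(( kernel_timeout - ping_interval ))   # → 50 seconds of margin
```

The "omap_wdt: Unexpected close, not stopping!" line is the expected counterpart of this: per the Linux watchdog API, closing /dev/watchdog without first writing the magic character 'V' leaves the hardware watchdog armed, which is exactly the "keeping hw watchdog alive" behaviour the syslog message describes. If the shutdown still hangs, the hardware watchdog resets the box.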
Another one got stuck on reboot, idSite 12032. Note: this is with the watchdog "fix" in place.
Closing this, as it is the broken USB pendrive causing it: the pendrive is mounted again after a watchdog reset. Just use a better pendrive for now. We can always open a separate issue for that if we get many complaints about it.
Reopening this, since hub2 got stuck during the USB pendrive update test on hub2 groningen:
Closing again, since the watchdog is simply not included in the test image...
https://vrm.victronenergy.com/installation/14404/advanced
My conclusion so far: it got stuck during the shutdown process after installing v2.20~50. The good news is that the issue doesn't seem to be in booting v2.20~50.
Analysis
The VRM database shows no data after 17/08/2018 02:20; the next expected entry would have been at 02:35. This matches the logs on the device itself: the reboot was initiated by swupdate at 02:23. The next entry in the VRM database is at 05:46, showing a booting v2.20~50. This gap means that the device cannot have been running during that time with merely the display off and/or a non-working network interface, because in that case the data would have been backlogged and sent out once the connection was restored.
/log/messages doesn't show a complete shutdown. You see the shutdown starting, but then it goes silent and there are only some ntp messages. Then at 05:45:16 there is a flurry of networking messages: before giving the device a hard reset, the customer confirmed to me that he had unplugged and replugged the network cable. After that, at 05:45:32, there is a fresh boot, right at the time at which the customer confirms he gave the device a hard reset.
Herewith the excerpt; the full log is further down below:
For comparison a normal reboot sequence:
And then there are some other files, /data/dmesg and /data/boot, which don't contain the boot one would expect after the restart initiated at 02:32.
vrmlogger shows what it always shows during a restart, except for the gap of two hours and a few minutes, of course:
The gui log is empty around that time:
Customer email
Hello Matthijs,
One of my local customers who is kind enough to be guinea pig for me, and who is running on release candidates, had their CCGX hang today.
By ‘hang’ I mean the system dropped off of the network (e.g. VRM) while the customer was out of their house.
When the customer came home - about 3 hours after the system dropped off of VRM logging - he found the CCGX was unresponsive (screen dark, no response to keypresses).
The system is set to turn its display off when idle so the black screen is normal - but clearly the hard hang isn’t.
I dropped by his place (I happened to be nearby) and popped the unit off the wall, observed it still had power (LEDs lit on the ethernet port), so I tried a forced restart (both round buttons pressed and released).
The system rebooted perfectly happily in response to that hard reset - and it is running again right now.
I guess it could be a hardware issue, but because it's running 2.20~50 I thought it best to report it to you as a potential release-candidate bug. It's not had any history of such things until now; it's always worked perfectly well, with no history of just 'stopping' like this in the past.
This is the system:
https://vrm.victronenergy.com/installation/14404/diagnostics
2 x VEDirect MPPT, 2 x MultiGrid-I, grid interactive, ESS system
Kind Regards, Simon
Log files