Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Old issue back again? PiGPIOd hanging count down on shutdown? #519

Closed
Jibun-no-Kage opened this issue Feb 11, 2022 · 23 comments
Closed

Old issue back again? PiGPIOd hanging count down on shutdown? #519

Jibun-no-Kage opened this issue Feb 11, 2022 · 23 comments

Comments

@Jibun-no-Kage
Copy link

Old issue back again? PiGPIOd hanging count down on shutdown? Suspect the default unit file from the pigpiod master zip (download) is not correct?

Pi OS 11 Bullseye, updated and upgraded as of this post date https://www.raspberrypi.com/news/raspberry-pi-os-debian-bullseye/
Downloaded pigpiod-master.zip from website https://abyz.me.uk/rpi/pigpio/pigpiod.html

Noticed that the old issue of pigpiod.service stop is hanging the system shutdown/restart process.

Unit file from master zip, that does the shutdown/restart hang countdown...
`[Unit]
Description=Pigpio daemon

[Service]
Type=forking
PIDFile=pigpio.pid
ExecStart=/usr/local/bin/pigpiod

[Install]
WantedBy=multi-user.target
`

Unit file that does not exhibit the issue...
`[Unit]
Description=Pigpio daemon

[Service]
Type=forking
ExecStart=/usr/local/bin/pigpiod -l
ExecStop=/bin/systemctl kill pigpiod

[Install]
WantedBy=multi-user.target
`

Pretty much the key line is the ExecuteStop entry of course.

@guymcswain
Copy link
Collaborator

This is weird because shutdown causes SIGTERM to be sent to all service processes. kill pigpiod also sends SIGTERM to the process.

@Jibun-no-Kage
Copy link
Author

Right, completely weird. I am doing more testing. But it seems so far, that the something is being lost, ignored, or what have you during the shutdown sequence. I even wondered if it was maybe a queue issue or something, where the daemon was just busy and ignoring things, but SIGTERM is more aggressive than SIGINT, no? The only thing that comes to mind, is that maybe some change between buster and bullseye is part of the scenario. Once we finalized the unit file (on buster) I never saw the issue again. It was only after I started creating Pi OS 11 images and installing the daemon, I notice the shutdown/reboots where do the hang/time count thing.

@Jibun-no-Kage
Copy link
Author

Still seeing the long shutdown time, should I have the execute stop task generate an explicit SIGINT or SIGTERM as a test or simply do a pid kill? Or try something else?

@guymcswain
Copy link
Collaborator

I've spent a good deal of time looking into this behavior and these are my observations:

Running the (first) service unit without ExecStop, pigpiod receives both SIGTERM and SIGCONT signals. The SIGCONT is an "unhandled" signal, treated in the sigHandler() by calling: gpioTerminate(); exit(-1); The resulting behavior is inconsistent but more often than not results in the hang for the 90 second timeout period (at which point systemd sends SIGKILL).

The signals sent by systemd are consistent with the documentation:

KillSignal=
Specifies which signal to use when stopping a service. This controls the signal that is sent as first step of shutting down a unit (see above), and is usually followed by SIGKILL (see above and below). For a list of valid signals, see signal(7). Defaults to SIGTERM.

Note that, right after sending the signal specified in this setting, systemd will always send SIGCONT, to ensure that even suspended tasks can be terminated cleanly.

In the second case, the command ExecStop=/bin/systemctl kill pigpiod sends a single SIGTERM signal while systemd waits for the process to terminate. pigpiod exits successfully.

In limited testing so far, if I configure sigHandler to handle SIGCONT, the timeout is avoided with either unit file.

@Jibun-no-Kage
Copy link
Author

Wow, this is really cool. Which scenario, such would you consider the more graceful? That would be the one I would suggest. I am not that familiar with SIGCONT, so I had to look up the details on it.

"...You can send a SIGCONT signal to a process to make it continue. This signal is special—it always makes the process continue if it is stopped, before the signal is delivered. The default behavior is to do nothing else."

It is the comment 'Note that, right after sending the signal specified in this setting, systemd will always send SIGCONT, to ensure that even suspended tasks can be terminated cleanly.' That I struggle with, that feels like a potential race condition? Send SIGTERM and while that is happening, SIGCONT is sent... and the result is the process may not actaully stop?

Given your testing... I don't recall ever having to handle SIGCONT explicitly, I have several python3 scripts that handle SIGINT, SIGTERM running as defined in unit files under systemd, and I have never had them hang thus far. I don't handle SIGCONT at all (as yet, maybe I should look at that as well).

Really good work, truly appreciated it. :)

@guymcswain
Copy link
Collaborator

I too am concerned about a potential race between two signals (SIGTERM, SIGCONT) calling gpioTerminate(). But the window of time is short. A flag is used early on in the function to exit if pigpio is already in a termination state. In my tests that have been successful, both signals are calling gpioTerminate(), so while a race may exist, so far, I've not been able to hit it.

I have more testing to do ...

@Jibun-no-Kage
Copy link
Author

Yup. If you need anything, that I can help with, let me know. I have almost every model of Pi (excluding compute modules, and still trying to find Pi Zero 2).

@guymcswain
Copy link
Collaborator

Yup. If you need anything, that I can help with, let me know.

Thanks, I will tag your name when I need your input. In the meantime, I'm just using this thread to document some of my findings and to remind myself later what I need to do.

@guymcswain
Copy link
Collaborator

Observation: Sometimes systemd only sends SIGCONT (without SIGTERM). In this situation, pigpio daemon will terminate in an unclean state - /dev/pigerr may be left open,

Hypothesis: systemd is confused as to which process is the main PID.

Type=
If set to forking, it is expected that the process configured with ExecStart= will call fork() as part of its start-up. The parent process is expected to exit when start-up is complete and all communication channels are set up. The child continues to run as the main service process, and the service manager will consider the unit started when the parent process exits. This is the behavior of traditional UNIX services. If this setting is used, it is recommended to also use the PIDFile= option, so that systemd can reliably identify the main process of the service. systemd will proceed with starting follow-up units as soon as the parent process exits.

The pigpiod parent process exits immediately after forking. On occasion, I have seen a journal entry complaining that systemd can't open/read the pid file. Need to delay exit of parent process until pid file has been created.

@guymcswain
Copy link
Collaborator

@Jibun-no-Kage , in my limited testing, I have observed positive results with the change described below. Can you try this on your system and report your findings?

On version 79 in file pigpio.c, replace line 5630 with:
_exit(-1);

@Jibun-no-Kage
Copy link
Author

Sure, will do, will get back to you in a few days.

@Jibun-no-Kage
Copy link
Author

Downloaded the latest master zip. I appears the code change is already in place? Line 5630 is "_exit(-1);"

@guymcswain
Copy link
Collaborator

v79?

@SlySven
Copy link
Contributor

SlySven commented Apr 24, 2022

I am seeing a few issues that have been opened over the past few years about the miss-handling of signals. E,g this one and #449 and fundamentally IMVHO there are some short-comings in the code that (referring to, e.g. signal(7)):

  • allows the user to (try) to assign handlers to SIG_KILL and SIG_STOP - these are uncatchable so trying to catch them (assign a handler) should be rejected
  • allows the user to (try) to assign handlers to (likely two) signals in the range SIG_SYS + 1 to SIGRTMIN which are reserved purely for the kernel for the "Native POSIX Threads Library" which has been used since kernel 2.4.
  • allows the user to (try) to assign a handler for a signal with the value of 0 - which is a non-existent one that is reserved purely to test whether a process exists and that you can send signals to it!
  • terminates on unhandled signals - when the default for some of them is supposed to be:
    • to ignore them, these are
      • SIGCHILD {A child process has been stopped or terminated}
      • SIGWINCH {(Terminal) window size has changed}
      • SIGURG {Urgent condition on a socket}
    • to do nothing (by default) as they stop the process
      • SIGSTOP - as noted already this one is uncatchable anyhow
      • SIGTSTOP - paused by (usually) the user - what Cntl-Z does to the foreground process when typed in a bash shell!
      • SIGTTIN - background process paused, awaiting to receive standard input
      • SIGTTOUT - background process paused, awaiting to send standard output
    • to terminate but ALSO dump core:
      • SIGQUIT - {Quit from keyboard}
      • SIGILL - {Illegal instruction}
      • SIGTRAP - {Trace/breakpoint trap}
      • SIGABRT - {Abort signal, from abort(3)}
      • SIGBUS - {Bus error, bad memory address}
      • SIGFPE - {Floating-point exception}
      • SIGSEGV - {Invalid memory reference}
      • SIGALRM - {Timer signal from alarm(2)}
      • SIGXCPU - {CPU time limit exceeded, see setrlimit(2)}
      • SIGXFSZ - {File size limit exceeded, see setrlimit(2)}
      • SIGSYS - {Bad system call}
  • Using non-safe functions when handling signals, including anything from the stdio library functions - see signal-safety!

I had a go at trying to fix things in #58 in a "safe" manner - but that wasn't used (by anyone other than myself!) - apart from anything else it needed to move some of the unsafe things out of the signal handling into a separate function that needed to be inserted into the daemon's main event loop so it wasn't perfect either...

@Jibun-no-Kage
Copy link
Author

Yeah, per the -v results, 79. version.

@guymcswain
Copy link
Collaborator

guymcswain commented Apr 24, 2022

@SlySven,
Thank you for you input, now and in the past, on signal handling. I am committing myself to better understanding linux signal handling in c so that I can (somewhat) intelligently resolve signal issues as related to this library. In doing so, I have to take a strong position of 'do no harm' to the underlying hardware that pigpio exposes for direct manipulations. In regard to general best practice of signal handling, I think pigpio stands out uniquely.

I agree that pigpio should be more discriminating in assigning handlers - currently all signals in range 0-63 are treated equally. (There is, of course, the option for no signal handling.) Using the signal(7) signal disposition categories for default actions,I have the following recommendations:

Disposition / Action
Term / gpioTerminate; _exit()
Ign / SIG_DFL
Core / gpioTerminate; _exit()
Stop / gpioTerminate; _exit()
Cont / gpioTerminate; _exit()

For "Term" disposition, pigpio must use gpioTerminate() to ensure the hardware is properly stopped. Let's assume for now that gpioTerminate() is async-signal-safe. Notable exceptions are SIGUSR1, SIGUSR2.

Likewise, for "Core" we want to stop the hardware. What I'm unclear on is without the default handler will the core dump happen without taking some other explicit action. And does this matter?

For "Stop", I think the correct course of action is to stop the hardware.

For "Cont", I would otherwise have taken the position of ignoring but now that I have seen systemd send SIGCONT when stopping a service, I think it should also be to terminate.

Avoid any stream operations in the signal handler. (ie DBG)

I am also mulling over refactoring pigpiod so that it simply polls a sig_atomic flag and then runs gpioTerminate(). That would make the daemon signal safe but still leaves users who call pigpio directly exposed.

Your comments are appreciated.

@Jibun-no-Kage
Copy link
Author

Would looking at the code logic under RPi.GPIO cleanup() not be applicable? Just a thought. For python the best practice for closing/releasing GPIO state is calling RPi.GPIO cleanup(), either per pin or to return all pins to a default state. Calling cleanup is considered safe end state.

@SlySven
Copy link
Contributor

SlySven commented Apr 25, 2022

I'm not in a position to give a full reply right now, but in passing I would point out that for the "stop" case behaviour - I'm not sure you can do anything then - because I think you (the process) are stopped (maybe that happens when execution leaves the handler or maybe before it is entered - I'm not sure), I think you won't be able to do any thing until you get SIGCONT to continue - and you do not want to terminate on THAT precisely because it is used to recover from those signals that stopped you. However you might want to handle then, the fact that you may have lost data in the intervening period...

Also, bear in mind that if you do get some of the term/core type signals you may not be even able to shutdown gracefully because, for instance, the memory assigned to your process has been corrupted, or you have invalid pointers being dereferenced (SIGSEGV springs to mind in that scope!) - indeed the purpose of dumping core is so that a post-mortem can be carried out on what was in the memory at the time - so trying to clean things up could be counter productive. In my attempt I tried to handle the "core" cases by restoring the default - kernel built-in - behaviour to do that cored dumping and then returning - so the signal would be immediately re-raised and then fire that default behaviour.

@Jibun-no-Kage
Copy link
Author

What are the odds or frequency of such corruption? It should be rare, no?

@guymcswain
Copy link
Collaborator

@SlySven , please comment on this darft PR #532 to improve signal handling. Note, I'll be adding comments so you don't have to go through the source line by line change.

@Jibun-no-Kage , can you test the latest develop branch. This contains a fix for the specific time out issue on systemd.

@Jibun-no-Kage
Copy link
Author

On it... I should be able to test this tomorrow.

@guymcswain
Copy link
Collaborator

@Jibun-no-Kage can this issue now be closed?

@Jibun-no-Kage
Copy link
Author

Oh, I did not realize it was still open. Yes.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants