Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Linking with Lwt_unix can cause unexpected EINTR #738

Closed
samwgoldman opened this issue Oct 21, 2019 · 20 comments
Closed

Linking with Lwt_unix can cause unexpected EINTR #738

samwgoldman opened this issue Oct 21, 2019 · 20 comments
Labels
Milestone

Comments

@samwgoldman
Copy link

The Lwt_unix module's top-level code installs a sigchld handler, which can cause blocking system calls to be interrupted. Note that the default behavior is to ignore the signal, so even if sigchld is received, blocking system calls will not be interrupted.

It's easy to write some code that does not check for EINTR which, while arguably wrong, works well enough because the program does not use signals. This happened to us in the Flow/Hack codebase where a shared module became linked with Lwt_unix and then linked into a binary that uses fork and select, but does not handle EINTR on the select.

The behavior was difficult to reproduce, since it depends on receiving the sigchld while in the select call. The issue was further difficult to diagnose since no relevant code had recently changed. Once diagnosed, the issue was trivially resolved by separating the Lwt-using module from the non-Lwt code.

While this is an edge case, and I think it's fair to say that Lwt_unix is doing nothing wrong, it would be nice to defer setting the sigchld handler until later, so that code which is linked with Lwt_unix but does not actually use it does not have its blocking system calls interrupted by sigchld.

@aantron aantron added the bug label Oct 21, 2019
@aantron
Copy link
Collaborator

aantron commented Oct 21, 2019

We should indeed be able to delay this, until either of waitpid or wait4 is used.

@samwgoldman
Copy link
Author

Thanks for the quick response!

@aantron aantron added this to the 4.4.1 milestone Oct 21, 2019
@aantron
Copy link
Collaborator

aantron commented Oct 21, 2019

I will make sure this is resolved by the next release. I understood that this isn't an emergency for you presently, since you have worked around it. Is that right? If so, I'll do it on Lwt's ordinary release schedule :)

@samwgoldman
Copy link
Author

No emergency, and absolutely no rush on our end.

@aantron
Copy link
Collaborator

aantron commented Oct 21, 2019

Great, and thanks for diagnosing that!

@aantron
Copy link
Collaborator

aantron commented Oct 21, 2019

I have a minor concern, that fixing this will make the problem even more surprising and/or difficult to diagnose, since EINTR will now be restricted further, to when linking into a larger Lwt program that uses waitpid/wait4 and only after those functions are used. I don't think we can easily warn library authors about EINTR from the docs of Lwt, since a library can't know if the larger program will use Lwt. So, this might harm the next person that needs to figure out what is happening.

I'm still leaning toward fixing this, but would you mind commenting on the above?

For example, while diagnosing, did you quickly find that Lwt is now being linked in, but (reasonably) could have no easy way of knowing that it is Lwt which is causing EINTR? If so, the right way to fix this might be to only document it in Lwt instead.

@samwgoldman
Copy link
Author

That is a fair concern.

For what it's worth, I opened this issue to make you aware of something that happened internally, but I'm not convinced that any change to Lwt is actually necessary. In the abstract, I feel that library code should always be careful to handle EINTR, since the programmer can not know whether the program will use signals, as you noted. Application code can probably be a bit less careful, but in our case being lax about EINTR combined with multiple programmers working with little coordination led to issues.

I was not the one who connected the unexpected EINTR to a change that only linked in Lwt_unix, I only connected a few dots to show that it was the sigchld handler which caused the change in behavior. I am not certain how the initial connection to Lwt_unix was made.

As I understand, your concern is that people might write library code that does not properly handle EINTR, but wouldn't necessarily figure it out during testing, since their test harness might not call waitpid/wait4. Only when linked into a program that does would the bug be discovered. Do I misunderstand?

@aantron
Copy link
Collaborator

aantron commented Oct 21, 2019

As I understand, your concern is that people might write library code that does not properly handle EINTR, but wouldn't necessarily figure it out during testing, since their test harness might not call waitpid/wait4.

It's more specific. When EINTR first appears for a user that has EINTR-unsafe code somewhere in their program (whether in their project or something it is linked with):

  1. Without fixing this issue, it is because Lwt has also been linked in.
  2. If this issue is fixed, it is because some code called waitpid or wait4.

(1) seems better because...

  • It should be easier to discover a new dependency on Lwt when bisecting, etc.
  • It is closer to being static, since it requires a less specific runtime behavior.
  • It fails faster.

In my mind, the main argument for delayed signal handler installation at this point is that the developers of a project that is suffering from EINTR might not have control over the EINTR-unsafe code that is linked into their project. However, in OCaml and Reason, I think most developers ultimately are in control of all the code, since it is generally open source or they have source access.

...which then suggests not to change the code of Lwt, but maybe to add a note to the docs of Lwt_unix, in hopes it will help people to diagnose this more quickly.

@samwgoldman
Copy link
Author

Yeah, I see what you mean. Both ways have problems, so perhaps the best course right now is to just document the behavior.

I think the confusing part of the existing behavior is that people assume that module initialization alone will not affect runtime behavior. Is it the case that Lwt_main.run will always be called before Lwt_unix.waitpid? Maybe the handler could be installed lazily when the event loop is initialized?

@aantron
Copy link
Collaborator

aantron commented Oct 21, 2019

That's a good idea.

It's not necessary to call Lwt_main.run before wait4/waitpid, but the Lwt API deliberately doesn't guarantee that the returned promises can resolve until after Lwt_main.run is called. So, we can insert the lazy initialization into all ofwait4, waitpid, and Lwt_main.run. This will both avoid the module side-effect, and provide a sort of fail-fast behavior for programs that do actually use Lwt.

I'm still not 100% sure this is better than failing if Lwt is simply linked, but I think I will go with this approach.

Could you get a comment from the person who connected this to Lwt_unix, about what would have helped them to diagnose the issue more quickly?

@arxanas
Copy link

arxanas commented Oct 21, 2019

Some change was introduced that caused this behavior. I found the problem by bisecting to that commit. Then I reduced the changes in that commit until I got a repro. Here's my original comment on the bug:

I tried making a minimal change, and just adding the necessary dependencies and creating a stub like this in serverCommand.ml:

let wait_for_rpc_response_lwt () =
  Lwt.return_unit

This did not trigger the problem. Then I tried this instead:

let wait_for_rpc_response_lwt () =
  Lwt_unix.yield ()

This reproduced the problem. Note that in neither case are these functions called anywhere else in the code; however, OCaml will not actually run a library's top-level code unless you make a reference to it, so what's happening here is that Lwt_unix contains some initialization code which in itself breaks our regular Unix.select calls.

I was only able to bisect it to the commit easily and narrow down the cause because it very reliably triggered EINTR issues during the startup sequence for the executable. So for debugging purposes, the current behavior is fine. That being said, the binary in question never invoked Lwt_main.run, nor any other Lwt functions, so deferring the signal handler installation until Lwt_main.run-time or similar would also be fine, as the signal handler would never be installed.

I do think it would be more intuitive if we started getting EINTRs only once we inserted a call to Lwt_main.run into the binary. It would be much more obvious of a cause than just "we happened to link some Lwt code against this other binary". I can't see any disadvantages as far as debuggability is concerned, except that with the way it is currently, you might become aware that your code isn't EINTR-safe earlier in the development cycle, which seems like a marginal benefit.

@aantron
Copy link
Collaborator

aantron commented Oct 21, 2019

Excellent, thanks!

@arxanas
Copy link

arxanas commented Oct 21, 2019

To this day, that binary doesn't use Lwt. But there was a time when I was considering running Lwt_main.run on a single function just to reuse some existing code (which had to be done synchronously with respect to the rest of the program anyways). I suppose if Lwt_main.run had installed the signal handler and didn't uninstall it after it was done (is that possible?), then we would get very confusing behavior where calling Lwt_main.run affects later calls to Unix.select. In that respect, if the signal handlers aren't removed when Lwt_main.run is done, then it could be considered worse than the current behavior, where the effect lasts for the entire program duration and isn't dependent on the particular control flow.

@aantron
Copy link
Collaborator

aantron commented Oct 21, 2019

If we make the change to defer installation, we would also document Lwt_main.run installs a signal handler. In your experience, would that be visible enough to mitigate the effect of not removing the signal handler?

@aantron
Copy link
Collaborator

aantron commented Oct 21, 2019

In particular, we would mention the need for the whole program to be EINTR-safe. Hopefully, when developers later get EINTR, they, or someone on the team, would recall mention of it in the Lwt_main.run docs.

@arxanas
Copy link

arxanas commented Oct 22, 2019

If the text EINTR were written in the documentation, then I think it would have been enough for me to have debugged the issue also.

@aantron
Copy link
Collaborator

aantron commented Oct 31, 2019

The attached commit defers installation of the SIGCHLD handler until Lwt_unix.waitpid, Lwt_unix.wait4, or Lwt_main.run is called. The SIGCHLD handler is also installed immediately before process exit if Lwt_io is linked into the program, because Lwt_io includes a module side effect that installs an Lwt exit hook, which causes Lwt_main.run to run on process exit. I hope this won't affect anyone, but I'm noting it for future reference.

I also mentioned EINTR in the docs of Lwt_main.run.

@rvantonder
Copy link

Just want to say: thank you for this fix @aantron. I was dealing with issues with the signal handler being registered too early, which would interfere with other worker processes and signals. Can't want for the next release 🎉

@aantron
Copy link
Collaborator

aantron commented Nov 21, 2019

@rvantonder Great :) The release should be next week.

Just want to confirm: you have tried your application with Lwt from master, and it indeed fixes your issue, the same as your modification had before? The reason I am asking is that if not, maybe we should sneak additional fixes in before the release :)

@rvantonder
Copy link

Yes, I removed the special casing and compiled against Lwt master and confirmed the issue is fixed :-)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

4 participants