Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Workaround for sentinel-related race condition #53

Open
josephmturner opened this issue May 25, 2024 · 9 comments
Open

Workaround for sentinel-related race condition #53

josephmturner opened this issue May 25, 2024 · 9 comments
Assignees
Labels
bug Something isn't working
Milestone

Comments

@josephmturner
Copy link
Contributor

plz-sentinel-race-condition-debug.txt

@alphapapa
Copy link
Owner

alphapapa commented May 25, 2024

Thanks, Joseph!

For our future reference, what seems to be happening is that this call to accept-process-output is sometimes not resulting in the process sentinel being called before the plz function returns.

When that is so, the "hack" workaround starting in this form calls the sentinel first (currently, with the "finished\n" value hard-coded), and after the plz function returns, Emacs finally calls the sentinel with the real status (which, in our testing, happened when curl exited with code 7, a connection failure (as it so happens, an immediate one, since it was trying to connect to a port on localhost that was not being listened on)).

Then, since the sentinel was already called, and the process's buffer was already killed (by plz--respond, which the sentinel calls), when the sentinel calls plz--respond again for the same process, the with-current-buffer form, which calls set-buffer on the killed buffer, signals an error.

In our testing, this was only reproducible on Joseph's system, and then only 5-10% of the time. So it appears to be a race condition in Emacs's process-handling code.

With respect to plz, this seems to indicate that the aforementioned hack is incomplete, i.e. plz--sentinel does not check whether the process's result property has already been set, and it ends up calling plz--respond a second time.

Also, the hack of calling plz--sentinel with the hard-coded "finished\n" value presents a problem if the process did not finish successfully.

All together, the fixes are twofold:

  1. In plz--respond, to check the process-status and process-exit-status rather than the status string (which may be misleading). Since, when the hack is necessary, we don't know what status string Emacs would send to the sentinel, we can't rely on it to know the process's status. Luckily, the combination of process-status and process-exit-status seem equivalent to it, at least for our purposes.
  2. In plz--sentinel, to check whether the process's :plz-result property has already been set; and if so, to not call plz--respond again (because we've already called the sentinel and plz--respond for that request's process).

As well, this should probably be reported in an Emacs bug report (perhaps the one made a few years ago about this problem, which the hack was devised to work around). Unfortunately we still don't have a way to reproduce it consistently, but it seems that we can at least add to the problem description (i.e. that not only may it be that the process sentinel may not be called after the process has exited and all of its output has been accepted, but that it may be called after the top-level Elisp function returns--which may be fine for async requests, but not for synchronous ones (which may explain why we noticed and were able to reproduce the problem specifically with a synchronous request)).

alphapapa added a commit that referenced this issue May 25, 2024
See #53.
@alphapapa alphapapa self-assigned this May 25, 2024
@alphapapa alphapapa added this to the v0.9 milestone May 25, 2024
@alphapapa alphapapa added the bug Something isn't working label May 25, 2024
alphapapa added a commit that referenced this issue May 25, 2024
See #53.
@josephmturner
Copy link
Contributor Author

Thank you!! I'm no longer able to reproduce the issue on my machine in normal hyperdrive.el usage or when letting Emacs hang for a few seconds with

(while t
  (ignore-errors (plz 'get "http://localhost:1337/")))

No error signaled!

The ":PLZ-RESULT ALREADY CHANGED" warning cropped up three times in that span, each time like so:

⛔ Debug (plz--sentinel [interpreted]): (FLOAT-TIME):1716621664.5090535 ":PLZ-RESULT ALREADY CHANGED" PROCESS:#<process plz-request-curl> STATUS:"exited abnormally with code 7
" (PROCESS-GET...):#s(plz-error (7 . "Failed to connect to host.") nil nil) 

Each time, it appears that the :plz-result process variable is being set to the plz-error struct.

This is great!! Thank you!!!

@alphapapa
Copy link
Owner

That's great news! Really, it's thanks to you for your help debugging it, and your persistence in wanting it fixed, despite the difficulty in reproducing it.

Quick question, to be sure I didn't make a mistake: that warning appeared because you bound warning-minimum-log-level to :debug and then re-evaluated the plz.el file, right?

Also, had you used the (while t (ignore-errors (plz 'get "http://localhost:1337/"))) form to reproduce the problem before?

@josephmturner
Copy link
Contributor Author

josephmturner commented May 25, 2024

Quick question, to be sure I didn't make a mistake: that warning appeared because you bound warning-minimum-log-level to :debug and then re-evaluated the plz.el file, right?

Yes. Please also see #54. I'm not sure that submitting a PR from my remote's wip/53-sentinel-race-condition-workaround branch to your remote's wip/53-sentinel-race-condition-workaround branch was the correct action. Feel free to close it.

Also, had you used the (while t (ignore-errors (plz 'get "http://localhost:1337/"))) form to reproduce the problem before?

No. Previously, I had repeatedly run eval-last-sexp manually. After applying the workaround, I was not able to trigger the warning by running eval-last-sexp manually many times, so I pulled out all the stops. ;)

Just now, I evaluted plz.el at the master branch (without the workaround) and then ran the same while loop. Emacs hung for ~20s and then became responsive again without my having to press C-g. The *Messages* buffer displayed the following message 14 times, where NNN was a different number each time:

Buffer " *plz-request-curl*-NNN" has a running process; kill it? (y or n) n

@alphapapa
Copy link
Owner

Also, had you used the (while t (ignore-errors (plz 'get "http://localhost:1337/"))) form to reproduce the problem before?

No. Previously, I had repeatedly run eval-last-sexp manually. After applying the workaround, I was not able to trigger the warning by running eval-last-sexp manually many times, so I pulled out all the stops. ;)

Just now, I evaluted plz.el at the master branch (without the workaround) and then ran the same while loop. Emacs hung for ~20s and then became responsive again without my having to press C-g. The *Messages* buffer displayed the following message 14 times, where NNN was a different number each time:

Buffer " *plz-request-curl*-NNN" has a running process; kill it? (y or n) n

Ok, so when you run that while loop with this branch's code, does Emacs stay in the loop until you press C-g? (Not that it necessarily matters, but I am curious to see how it performs; ISTM that it should stay in the loop until manually interrupted.)

alphapapa added a commit that referenced this issue May 26, 2024
See #53.
@josephmturner
Copy link
Contributor Author


Ok, so when you run that while loop with this branch's code, does Emacs stay in the loop until you press C-g? (Not that it necessarily matters, but I am curious to see how it performs; ISTM that it should stay in the loop until manually interrupted.)

Yes, I have to manually stop the loop with C-g.

@josephmturner
Copy link
Contributor Author

I'll close this issue now, since the workaround is doing the right thing!

@alphapapa
Copy link
Owner


Ok, so when you run that while loop with this branch's code, does Emacs stay in the loop until you press C-g? (Not that it necessarily matters, but I am curious to see how it performs; ISTM that it should stay in the loop until manually interrupted.)

Yes, I have to manually stop the loop with C-g.

That's good news!

Leaving the issue open to remind me to do something about the Emacs bug report. :)

@alphapapa alphapapa reopened this May 26, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

No branches or pull requests

2 participants