Workaround for sentinel-related race condition #53

josephmturner · 2024-05-25T05:21:19Z

alphapapa · 2024-05-25T05:55:46Z

Thanks, Joseph!

For our future reference, what seems to be happening is that this call to accept-process-output is sometimes not resulting in the process sentinel being called before the plz function returns.

When that is so, the "hack" workaround starting in this form calls the sentinel first (currently, with the "finished\n" value hard-coded), and after the plz function returns, Emacs finally calls the sentinel with the real status (which, in our testing, happened when curl exited with code 7, a connection failure (as it so happens, an immediate one, since it was trying to connect to a port on localhost that was not being listened on)).

Then, since the sentinel was already called, and the process's buffer was already killed (by plz--respond, which the sentinel calls), when the sentinel calls plz--respond again for the same process, the with-current-buffer form, which calls set-buffer on the killed buffer, signals an error.

In our testing, this was only reproducible on Joseph's system, and then only 5-10% of the time. So it appears to be a race condition in Emacs's process-handling code.

With respect to plz, this seems to indicate that the aforementioned hack is incomplete, i.e. plz--sentinel does not check whether the process's result property has already been set, and it ends up calling plz--respond a second time.

Also, the hack of calling plz--sentinel with the hard-coded "finished\n" value presents a problem if the process did not finish successfully.

All together, the fixes are twofold:

In plz--respond, to check the process-status and process-exit-status rather than the status string (which may be misleading). Since, when the hack is necessary, we don't know what status string Emacs would send to the sentinel, we can't rely on it to know the process's status. Luckily, the combination of process-status and process-exit-status seem equivalent to it, at least for our purposes.
In plz--sentinel, to check whether the process's :plz-result property has already been set; and if so, to not call plz--respond again (because we've already called the sentinel and plz--respond for that request's process).

As well, this should probably be reported in an Emacs bug report (perhaps the one made a few years ago about this problem, which the hack was devised to work around). Unfortunately we still don't have a way to reproduce it consistently, but it seems that we can at least add to the problem description (i.e. that not only may it be that the process sentinel may not be called after the process has exited and all of its output has been accepted, but that it may be called after the top-level Elisp function returns--which may be fine for async requests, but not for synchronous ones (which may explain why we noticed and were able to reproduce the problem specifically with a synchronous request)).

See #53.

alphapapa · 2024-05-25T06:51:33Z

@josephmturner The tests pass! https://github.com/alphapapa/plz.el/actions/runs/9233649184/job/25406336668

FYI, the changes: https://github.com/alphapapa/plz.el/compare/wip/53-sentinel-race-condition-workaround

josephmturner · 2024-05-25T07:39:07Z

Thank you!! I'm no longer able to reproduce the issue on my machine in normal hyperdrive.el usage or when letting Emacs hang for a few seconds with

(while t
  (ignore-errors (plz 'get "http://localhost:1337/")))

No error signaled!

The ":PLZ-RESULT ALREADY CHANGED" warning cropped up three times in that span, each time like so:

⛔ Debug (plz--sentinel [interpreted]): (FLOAT-TIME):1716621664.5090535 ":PLZ-RESULT ALREADY CHANGED" PROCESS:#<process plz-request-curl> STATUS:"exited abnormally with code 7
" (PROCESS-GET...):#s(plz-error (7 . "Failed to connect to host.") nil nil)

Each time, it appears that the :plz-result process variable is being set to the plz-error struct.

This is great!! Thank you!!!

alphapapa · 2024-05-25T09:07:14Z

That's great news! Really, it's thanks to you for your help debugging it, and your persistence in wanting it fixed, despite the difficulty in reproducing it.

Quick question, to be sure I didn't make a mistake: that warning appeared because you bound warning-minimum-log-level to :debug and then re-evaluated the plz.el file, right?

Also, had you used the (while t (ignore-errors (plz 'get "http://localhost:1337/"))) form to reproduce the problem before?

josephmturner · 2024-05-25T18:23:37Z

Quick question, to be sure I didn't make a mistake: that warning appeared because you bound warning-minimum-log-level to :debug and then re-evaluated the plz.el file, right?

Yes. Please also see #54. I'm not sure that submitting a PR from my remote's wip/53-sentinel-race-condition-workaround branch to your remote's wip/53-sentinel-race-condition-workaround branch was the correct action. Feel free to close it.

Also, had you used the (while t (ignore-errors (plz 'get "http://localhost:1337/"))) form to reproduce the problem before?

No. Previously, I had repeatedly run eval-last-sexp manually. After applying the workaround, I was not able to trigger the warning by running eval-last-sexp manually many times, so I pulled out all the stops. ;)

Just now, I evaluted plz.el at the master branch (without the workaround) and then ran the same while loop. Emacs hung for ~20s and then became responsive again without my having to press C-g. The *Messages* buffer displayed the following message 14 times, where NNN was a different number each time:

Buffer " *plz-request-curl*-NNN" has a running process; kill it? (y or n) n

alphapapa · 2024-05-26T00:49:49Z

Also, had you used the (while t (ignore-errors (plz 'get "http://localhost:1337/"))) form to reproduce the problem before?

No. Previously, I had repeatedly run eval-last-sexp manually. After applying the workaround, I was not able to trigger the warning by running eval-last-sexp manually many times, so I pulled out all the stops. ;)

Just now, I evaluted plz.el at the master branch (without the workaround) and then ran the same while loop. Emacs hung for ~20s and then became responsive again without my having to press C-g. The *Messages* buffer displayed the following message 14 times, where NNN was a different number each time:
Buffer " *plz-request-curl*-NNN" has a running process; kill it? (y or n) n

Ok, so when you run that while loop with this branch's code, does Emacs stay in the loop until you press C-g? (Not that it necessarily matters, but I am curious to see how it performs; ISTM that it should stay in the loop until manually interrupted.)

See #53.

josephmturner · 2024-05-26T06:51:05Z

Ok, so when you run that while loop with this branch's code, does Emacs stay in the loop until you press C-g? (Not that it necessarily matters, but I am curious to see how it performs; ISTM that it should stay in the loop until manually interrupted.)

Yes, I have to manually stop the loop with C-g.

josephmturner · 2024-05-26T06:53:54Z

I'll close this issue now, since the workaround is doing the right thing!

alphapapa · 2024-05-26T13:09:53Z

Ok, so when you run that while loop with this branch's code, does Emacs stay in the loop until you press C-g? (Not that it necessarily matters, but I am curious to see how it performs; ISTM that it should stay in the loop until manually interrupted.)
Yes, I have to manually stop the loop with C-g.

That's good news!

Leaving the issue open to remind me to do something about the Emacs bug report. :)

alphapapa added a commit that referenced this issue May 25, 2024

WIP

a40d46c

See #53.

alphapapa self-assigned this May 25, 2024

alphapapa added this to the v0.9 milestone May 25, 2024

alphapapa added the bug Something isn't working label May 25, 2024

alphapapa added a commit that referenced this issue May 25, 2024

WIP

0eab331

See #53.

alphapapa added a commit that referenced this issue May 26, 2024

WIP

46d0c54

See #53.

josephmturner closed this as completed May 26, 2024

alphapapa reopened this May 26, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Workaround for sentinel-related race condition #53

Workaround for sentinel-related race condition #53

josephmturner commented May 25, 2024

alphapapa commented May 25, 2024 •

edited

Loading

alphapapa commented May 25, 2024 •

edited

Loading

josephmturner commented May 25, 2024

alphapapa commented May 25, 2024

josephmturner commented May 25, 2024 •

edited

Loading

alphapapa commented May 26, 2024

josephmturner commented May 26, 2024

josephmturner commented May 26, 2024

alphapapa commented May 26, 2024

Workaround for sentinel-related race condition #53

Workaround for sentinel-related race condition #53

Comments

josephmturner commented May 25, 2024

alphapapa commented May 25, 2024 • edited Loading

alphapapa commented May 25, 2024 • edited Loading

josephmturner commented May 25, 2024

alphapapa commented May 25, 2024

josephmturner commented May 25, 2024 • edited Loading

alphapapa commented May 26, 2024

josephmturner commented May 26, 2024

josephmturner commented May 26, 2024

alphapapa commented May 26, 2024

alphapapa commented May 25, 2024 •

edited

Loading

alphapapa commented May 25, 2024 •

edited

Loading

josephmturner commented May 25, 2024 •

edited

Loading