Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

macOS - fix deadlock on reqrep socket close #1824

Merged
merged 1 commit into from
Apr 24, 2024

Conversation

alzix
Copy link
Contributor

@alzix alzix commented Apr 24, 2024

when an aio has no a_cancel_fn and the task is in task_prepstate abort it on nni_aio_stop call

fixes #1813 Deadlock during nng_close() - multi platform

@gdamore
Copy link
Contributor

gdamore commented Apr 24, 2024

This looks like a good find. Furthermore, it looks like nni_aio_abort also suffers from the same flaw.

I want to look at this in more detail later today before I move forward.

@gdamore
Copy link
Contributor

gdamore commented Apr 24, 2024

I think I've convinced myself that this is precisely the right fix, and we just need to add the same change to the implementation of nni_aio_abort.

@alzix
Copy link
Contributor Author

alzix commented Apr 24, 2024

i'm on it

@gdamore
Copy link
Contributor

gdamore commented Apr 24, 2024

So there are other callers.

Basically we also need the same logic in nni_aio_cancel, and I think nni_aio_fini. It looks like it was missed in all the paths where we tear down or abort an aio.

when an `aio` has no `a_cancel_fn` and the task is in `task_prep` abort it on `nni_aio_stop` call
Copy link
Contributor

@gdamore gdamore left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This was an excellent find, and probably tricky to do so as well. Thank you for your contribution!

@alzix
Copy link
Contributor Author

alzix commented Apr 24, 2024

it does not solve the Windows issue though :(
On Windows the issue seems to be different
image

it seems that on Windows it is a TOCTOU issue in the nni_sock_shutdown

it is stuck on

// We have to wait for pipes to be removed.
while (!nni_list_empty(&sock->s_pipes)) {
=>	nni_cv_wait(&sock->s_cv);
}

while s_pipes is already empty
image

the issue predates the fix - so it is better to address it in another PR

@alzix alzix changed the title fix deadlock on reqrep socket close macOS - fix deadlock on reqrep socket close Apr 24, 2024
@gdamore
Copy link
Contributor

gdamore commented Apr 24, 2024

I'm merging this... the hang waiting for pipes to be empty feels like a missed cv_wake somewhere. I'll look for it later. I'm out of time for today.

@gdamore gdamore merged commit aac4dc3 into nanomsg:master Apr 24, 2024
7 of 8 checks passed
@alzix
Copy link
Contributor Author

alzix commented Apr 24, 2024

created a new ticket for this #1827

shikokuchuo added a commit to shikokuchuo/nng that referenced this pull request Apr 28, 2024
@gdamore
Copy link
Contributor

gdamore commented Dec 4, 2024

Coming back to this ... I'm now thinking that this change is responsible for a use-after-free.

Essentially, we cannot simply "abort" the task, if there is no cancellation then the task has to run to completion.

I'm now quite interesting to understand what the original hang was that this was supposed to fix.

@gdamore
Copy link
Contributor

gdamore commented Dec 4, 2024

Ah, I think I see one possible problem.

If we call nni_aio_abort() between a prep, and a schedule, that could lead to problems.
Its better to leave a flag like we do for nni_aio_stop / nni_aio_close so that schedule can see that the operation has been canceled between the begin and the schedule call.

Then schedule can return an error properly.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Deadlock during nng_close() - multi platform
2 participants