Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Problem: calling randombytes_close with libsodium can crash Contexts in other threads #4242

Merged
merged 2 commits into from
Aug 13, 2021

Conversation

minrk
Copy link
Member

@minrk minrk commented Aug 13, 2021

Problem: When using libsodium, calling zmq::random_close() closes a shared resources that may still be in use.

See #4241 for details and repro, but this is called unconditionally on every curve_keypair() and every context destructor, since #3001. Closing any context and/or issuing a new curve keypair can crash another context if it is working with curve-enabled sockets at the time, as it is still a consumer of a now-closed resource. libsodium will try to reinitialize the source, so it doesn't crash every time; instead, certain race conditions must be met to result in a crash.

Solution: Do not close this shared resource while it is still in use, as suggested by the libsodium documentation.

For implementations using /dev/[u]random, this can leave up to one leftover FD per process (restores #2632, but should be bounded to one).

If we want to call randombytes_close(), we must restore the use of refcounting in #2636 when using libsodium to avoid crashes.

closes #4241

randombytes_close is not threadsafe and calling it while still in use by a Context can cause a crash.

For implementations using /dev/[u]random, this can leave up to one leftover FD per process.

The libsodium docs suggest that this function rarely needs to be called explicitly.
@bluca
Copy link
Member

bluca commented Aug 13, 2021

Unfortunately this is not a solution, because then we go back to leaking file descriptors, which trigger a cascade of other problems (selinux errors, etc etc)

@minrk
Copy link
Member Author

minrk commented Aug 13, 2021

OK, then the fix is to revert #3001

@bluca
Copy link
Member

bluca commented Aug 13, 2021

Eh I wish - that breaks a whole other host of cases due to C++ broken global initialization mechanisms...

@minrk
Copy link
Member Author

minrk commented Aug 13, 2021

I don't fully understand how the refcounting broke things, but to be correct, libsodium must have a refcounter to protect the call to randombytes_close(), just like tweetnacl. I don't know how to make it properly global to avoid #2991, but this means refcounting is clearly required with libsodium, unfortunately.

As it is now, one must ensure that once a context is active dealing with curve messages, no other contexts can be terminated and no new keypairs issued.

@bluca
Copy link
Member

bluca commented Aug 13, 2021

It's explained in the comments - it's not possible to do refcounting via globals in C++. The only solution is to get a kernel with getrandom(), everything else fixes one things and breaks another, which is not acceptable, as we really need to keep things stable.

@minrk
Copy link
Member Author

minrk commented Aug 13, 2021

The only solution is to get a kernel with getrandom()

That's not something folks usually have control over, so not really a solution.

everything else fixes one things and breaks another, which is not acceptable, as we really need to keep things stable.

Yeah, I get that. Even fixing wrong behavior can break things.

WDYT about a runtime and/or compile-time opt-out of the random_close? When getrandom is available, removing it changes nothing. When it's not, you trade a crash for a (bounded to one process-wide) leak. If the leak were opt-in, it seems reasonable to expect the calling code to manage closing the shared resource, in the rare cases where that's useful.

The 'wrong' behavior of the leak is easy to workaround (close the FD) and of little consequence (leak up to one FD in the lifetime of the process), whereas the current behavior is not possible to workaround except with "get a different kernel".

@bluca
Copy link
Member

bluca commented Aug 13, 2021

The only solution is to get a kernel with getrandom()

That's not something folks usually have control over, so not really a solution.

everything else fixes one things and breaks another, which is not acceptable, as we really need to keep things stable.

Yeah, I get that. Even fixing wrong behavior can break things.

WDYT about a runtime and/or compile-time opt-out of the random_close? When getrandom is available, removing it changes nothing. When it's not, you trade a crash for a (bounded to one process-wide) leak. If the leak were opt-in, it seems reasonable to expect the calling code to manage closing the shared resource, in the rare cases where that's useful.

The 'wrong' behavior of the leak is easy to workaround (close the FD) and of little consequence (leak up to one FD in the lifetime of the process), whereas the current behavior is not possible to workaround except with "get a different kernel".

A build option would be fine. Note that the leak is not just cosmetic, things like selinux will raise an audit and so on.

@minrk
Copy link
Member Author

minrk commented Aug 13, 2021

I added an opt-in to cmake and autotools (my m4 is rusty, so I hope it's right). As you might guess, I would prefer the non-crashing behavior be the default, but I assume no change by default is the right choice for consistency.

I wonder if a runtime zmq_random_open() / zmq_random_close() that allowed applications to call sodium_init / randombytes_close() and take it out of the hands of zmq::random_open/close that have refcounting troubles would be best?

things like selinux will raise an audit and so on.

Do you have a link to more information on that? If I have application code that does:

zmq_term() // doesn't call randombytes_close() anymore
randombytes_close()

would that avoid the audit issue, or no? I don't have experience with selinux, so don't know exactly what's being audited there.

@minrk minrk force-pushed the no-close-libsodium-randombytes branch from da20a13 to e9b1292 Compare August 13, 2021 11:45
CMakeLists.txt Outdated
@@ -258,6 +258,7 @@ endif()

option(WITH_LIBSODIUM "Use libsodium instead of built-in tweetnacl" ON)
option(WITH_LIBSODIUM_STATIC "Use static libsodium library" OFF)
option(ENABLE_LIBSODIUM_RANDOMBYTES_CLOSE "Automatically closing libsodium randombytes (not threadsafe)" ON)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If you want to list the explanation, please be precise "not threadsafe without getrandom()"

@minrk
Copy link
Member Author

minrk commented Aug 13, 2021

Do you think it's better to restore the refcounting instead of disabling entirely (i.e. exactly reverting #3001 if the new flag is set)?

@bluca
Copy link
Member

bluca commented Aug 13, 2021

Do you think it's better to restore the refcounting instead o disabling entirely (i.e. exactly reverting #3001 if the new flag is set)?

Current patch is best - the workaround to call randombytes_close manually is trivial, while safely dealing with the global init shenanigans is extremely convoluted, if at all possible

Problem: randombytes_close is either a no-op or unsafe when a Context is running.

Unfortunately, there does not appear to be a single always correct choice,
so let builders pick between two not-great options.

Opting out can leak an FD on /dev/urandom which may need to be closed explicitly.
However, with the default behavior,
using multiple contexts with CURVE can crash with no application-level workaround available.
@minrk minrk force-pushed the no-close-libsodium-randombytes branch from e9b1292 to e3da163 Compare August 13, 2021 11:56
@bluca bluca merged commit bcb659e into zeromq:master Aug 13, 2021
@minrk minrk deleted the no-close-libsodium-randombytes branch August 16, 2021 11:10
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

threadsafety issue with curve_keypair, libsodium, randombytes_close
2 participants