std::sync::Once has way more barriers than necessary #27610
Comments
cc @aturon |
(mentioning this later, but load(SeqCst) doesn't actually emit a barrier on x86 since C11 is weird, so while things can be relaxed, it's far less of a problem than I thought) |
In some situations it's possible to implement Once with no memory barrier at all in the common case (except for a compiler barrier, to prevent the compiler from reordering instructions across it). This doesn't matter on x86, since, as talchas noted, SeqCst loads don't require a barrier on x86 anyway, but on ARM it could help. Since typical usage of Once executes the no-op case far more often than the initialization case, improving performance of the former may be justified even at the expense of adding quite a bit of overhead to the latter.

The first way to do this is to just get the OS to execute barriers on all other cores of the system. Windows has had a syscall to do this since Vista (and thus on all ARM editions); I don't believe iOS has such a syscall. But in some cases it's possible to do something different, because iOS only runs on Apple CPUs, and Apple has a trick: https://www.mikeash.com/pyblog/friday-qa-2014-06-06-secrets-of-dispatch_once.html

(The linked blog post refers to OS X and Intel CPUs, but as I said, the whole thing is redundant on x86; iOS, however, uses the same code.)

I'm not suggesting that libstd implement this trick itself, at least on the write side: the library where Apple implements it is always dynamically linked, so Apple retains the option of having future CPUs require something else and updating the OS to suit. But the read side of `dispatch_once` looks like this:

```c
DISPATCH_INLINE DISPATCH_ALWAYS_INLINE DISPATCH_NONNULL_ALL DISPATCH_NOTHROW
void
_dispatch_once(dispatch_once_t *predicate,
		DISPATCH_NOESCAPE dispatch_block_t block)
{
	if (DISPATCH_EXPECT(*predicate, ~0l) != ~0l) {
		dispatch_once(predicate, block);
	} else {
		dispatch_compiler_barrier();
	}
	DISPATCH_COMPILER_CAN_ASSUME(*predicate == ~0l);
}
```

...Except for one huge caveat that would make actually implementing this extremely difficult. The documentation for
And of course |
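The read side described above can be sketched in Rust. This is a hypothetical illustration only (`PREDICATE` and `fast_path_done` are invented names, not anything from libstd or libdispatch): a Relaxed load followed by a compiler fence compiles to zero barrier instructions, and on a weakly ordered CPU it is only sound if the write side forces real barriers on every other core, e.g. via the Windows syscall or Apple's OS-level trick mentioned in the comment above.

```rust
use std::sync::atomic::{compiler_fence, AtomicUsize, Ordering};

// Mirrors dispatch_once_t: !0 (all bits set) means "initialized".
static PREDICATE: AtomicUsize = AtomicUsize::new(0);

/// Barrier-free fast path: returns true if initialization is visibly done.
/// NOT portable acquire semantics on its own; illustration only.
fn fast_path_done() -> bool {
    if PREDICATE.load(Ordering::Relaxed) == !0 {
        // Analogue of dispatch_compiler_barrier(): stops the compiler from
        // hoisting reads of the initialized data above the check, but emits
        // no CPU barrier instruction.
        compiler_fence(Ordering::Acquire);
        true
    } else {
        false
    }
}

fn main() {
    assert!(!fast_path_done());
    // Stand-in for the heavyweight write side (which would also have to
    // broadcast barriers to other cores before this store is trustworthy).
    PREDICATE.store(!0, Ordering::Relaxed);
    assert!(fast_path_done());
}
```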
I'm not sure what you mean. This is the signature: `fn call_once<F>(&'static self, f: F)` |
@briansmith Hrm... my mistake, I didn't notice that signature. That's still technically a weaker guarantee than what the documentation requires, since functions like |
If I understand you correctly, you are saying that there are some kinds of |
We've since landed #52349, which moves the fast path to an acquire load (which, according to the discussion on that PR, is indeed likely the best possible); the cold path is intentionally left as SeqCst to avoid something going wrong unintentionally. As such, I'm going to close this issue as done, with further optimization not being worth it. |
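The shape that landed can be sketched roughly as follows. This is a simplified illustration under assumed names (`MiniOnce`, `INCOMPLETE`, etc. are invented here): the hot path is a single Acquire load, the rarely taken initializing path stays SeqCst, and unlike the real `std::sync::Once` this sketch spins instead of maintaining a queue of parked waiter threads.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};

const INCOMPLETE: usize = 0;
const RUNNING: usize = 1;
const COMPLETE: usize = 2;

pub struct MiniOnce {
    state: AtomicUsize,
}

impl MiniOnce {
    pub const fn new() -> Self {
        MiniOnce { state: AtomicUsize::new(INCOMPLETE) }
    }

    pub fn call_once<F: FnOnce()>(&self, f: F) {
        // Fast path: one Acquire load, which synchronizes with the store
        // made by whichever thread completed initialization. No barrier
        // instruction on x86, and only a light one on ARM.
        if self.state.load(Ordering::Acquire) == COMPLETE {
            return;
        }
        self.call_once_slow(f);
    }

    #[cold]
    fn call_once_slow<F: FnOnce()>(&self, f: F) {
        // Cold path: kept at SeqCst, mirroring the conservative choice
        // described above, since it runs at most once per Once anyway.
        match self.state.compare_exchange(
            INCOMPLETE, RUNNING, Ordering::SeqCst, Ordering::SeqCst,
        ) {
            Ok(_) => {
                f();
                self.state.store(COMPLETE, Ordering::SeqCst);
            }
            Err(_) => {
                // Another thread won the race: wait until it finishes.
                // (The real Once parks the thread instead of spinning.)
                while self.state.load(Ordering::Acquire) != COMPLETE {
                    std::hint::spin_loop();
                }
            }
        }
    }
}

fn main() {
    let once = MiniOnce::new();
    let mut calls = 0;
    once.call_once(|| calls += 1);
    once.call_once(|| calls += 1); // no-op: fast path returns immediately
    assert_eq!(calls, 1);
}
```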
I believe that (comments removed) is the correct set of barriers (I'm not certain whether the lock_cnt barriers can be Relaxed instead, but it's pretty irrelevant, and these are safe). At the very least I am *certain* that the very first load can be Acquire instead of SeqCst, which removes the barrier from the fast path on x86.