
std::sync::Once has way more barriers than necessary #27610

Closed
talchas opened this issue Aug 8, 2015 · 7 comments
Labels
C-enhancement: An issue proposing an enhancement or a PR with one.
I-slow: Problems and improvements with respect to performance of generated code.
T-libs-api: Relevant to the library API team, which will review and decide on the PR/issue.

Comments


talchas commented Aug 8, 2015

I believe that (comments removed)

    pub fn call_once<F>(&'static self, f: F) where F: FnOnce() {
        if self.cnt.load(Ordering::Acquire) < 0 {
            return
        }

        let prev = self.cnt.fetch_add(1, Ordering::Acquire);
        if prev < 0 {
            self.cnt.store(isize::MIN, Ordering::Release);
            return
        }

        let guard = self.mutex.lock();
        if self.cnt.load(Ordering::Relaxed) > 0 {
            f();
            let prev = self.cnt.swap(isize::MIN, Ordering::Release);
            self.lock_cnt.store(prev, Ordering::Release);
        }
        drop(guard);

        if self.lock_cnt.fetch_add(-1, Ordering::AcqRel) == 1 {
            unsafe { self.mutex.destroy() }
        }
    }

is the correct set of barriers (I'm not certain if the lock_cnt barriers can be Relaxed instead, but it's pretty irrelevant and these are safe).

At the very least I am /certain/ that the very first load can be Acquire instead of SeqCst, which removes the barrier from the fast-path on x86.
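To make the fast-path claim concrete, here is a deliberately simplified sketch of a once primitive (illustrative only; this is neither the std implementation nor the proposal above): the fast path is a single Acquire load, which synchronizes with the Release store performed after initialization completes.

```rust
use std::sync::atomic::{AtomicBool, Ordering};
use std::sync::Mutex;

// Simplified sketch of a once primitive; names are invented for illustration.
struct SimpleOnce {
    done: AtomicBool,
    lock: Mutex<()>,
}

impl SimpleOnce {
    const fn new() -> Self {
        SimpleOnce { done: AtomicBool::new(false), lock: Mutex::new(()) }
    }

    fn call_once<F: FnOnce()>(&self, f: F) {
        // Fast path: an Acquire load suffices here; it pairs with the
        // Release store below, so no full barrier is needed.
        if self.done.load(Ordering::Acquire) {
            return;
        }
        let _guard = self.lock.lock().unwrap();
        // Re-check under the lock: another thread may have initialized
        // while we were blocked acquiring it.
        if !self.done.load(Ordering::Relaxed) {
            f();
            self.done.store(true, Ordering::Release);
        }
    }
}
```

The Acquire/Release pair is what makes the writes performed inside `f()` visible to every thread that later observes `done == true` on the fast path.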


Gankra commented Aug 8, 2015

cc @aturon

@Gankra Gankra added I-slow Issue: Problems and improvements with respect to performance of generated code. A-libs labels Aug 8, 2015

talchas commented Oct 1, 2015

(mentioning this later, but load(SeqCst) doesn't actually emit a barrier on x86 since C11 is weird, so while things can be relaxed, it's far less of a problem than I thought)


comex commented Aug 17, 2016

In some situations it's possible to implement Once with no memory barrier at all in the common case (except for a compiler barrier to prevent the compiler from reordering instructions across it). This doesn't matter on x86, since as talchas noted, SeqCst loads don't require a barrier on x86 anyway, but on ARM it could help. Since typical usage of Once executes the no-op case far more often than the initialization case, improving performance of the former may be justified even at the expense of adding quite a bit of overhead to the latter.

The first way to do this is to just get the OS to execute barriers on all other cores of the system. Windows has had a syscall to do this since Vista (and thus on all ARM editions), FlushProcessWriteBuffers. Linux has had a similar syscall, membarrier, only since late 2015, but there are hacky ways to accomplish the same on older versions. (The linked mailing list discussion occurred before the syscall was added; there is some skepticism of the hack there because of the potential for it to stop working on future versions of Linux, but since all future versions will now support the syscall, the hack should be fine as a fallback for old versions.)

I don't believe iOS has such a syscall. But in some cases it's possible to do something different, because iOS only runs on Apple CPUs and Apple has a trick:

https://www.mikeash.com/pyblog/friday-qa-2014-06-06-secrets-of-dispatch_once.html

(The linked blog post refers to OS X and Intel CPUs, but as I said, the whole thing is redundant on x86; iOS, however, uses the same code.)

I'm not suggesting that libstd implement this trick itself, at least on the write side: the library where Apple implements it is always dynamically linked and so Apple retains the option of having future CPUs require something else, and updating the OS to suit. But the read side of dispatch_once is present as an inlineable (indeed, always_inline) function in a system C header, which more or less means it must keep working forever, and (combined with the reasonable storage requirement, typedef long dispatch_once_t;) also means that using it would be as efficient as the current reimplementation - so Once on OS X could just be a wrapper around dispatch_once. Rust code can't properly inline it (without some fancy LTO stuff), but we could just reimplement the body in Rust to simulate the effect of inlining. It looks like this:

DISPATCH_INLINE DISPATCH_ALWAYS_INLINE DISPATCH_NONNULL_ALL DISPATCH_NOTHROW
void
_dispatch_once(dispatch_once_t *predicate,
        DISPATCH_NOESCAPE dispatch_block_t block)
{
    if (DISPATCH_EXPECT(*predicate, ~0l) != ~0l) {
        dispatch_once(predicate, block);
    } else {
        dispatch_compiler_barrier();
    }
    DISPATCH_COMPILER_CAN_ASSUME(*predicate == ~0l);
}
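The read-side shape above can be rendered portably in Rust roughly as follows (a sketch only: `!0` stands in for the C code's `~0l` sentinel, and the fast path keeps an Acquire load, since the barrier-free read in the real dispatch_once depends on Apple's write-side trick):

```rust
use std::sync::atomic::{AtomicIsize, Ordering};
use std::sync::Mutex;

// "All bits set" sentinel, mirroring ~0l in the C header.
const DONE: isize = !0;

// Illustrative only; not the real dispatch_once, which avoids even the
// Acquire on the read side via an Apple-specific write-side mechanism.
struct DispatchOnceLike {
    predicate: AtomicIsize,
    lock: Mutex<()>,
}

impl DispatchOnceLike {
    const fn new() -> Self {
        DispatchOnceLike { predicate: AtomicIsize::new(0), lock: Mutex::new(()) }
    }

    fn once<F: FnOnce()>(&self, f: F) {
        // Fast path: corresponds to DISPATCH_EXPECT(*predicate, ~0l) != ~0l.
        if self.predicate.load(Ordering::Acquire) == DONE {
            return; // corresponds to dispatch_compiler_barrier()
        }
        // Slow path: corresponds to the out-of-line dispatch_once() call.
        let _guard = self.lock.lock().unwrap();
        if self.predicate.load(Ordering::Relaxed) != DONE {
            f();
            self.predicate.store(DONE, Ordering::Release);
        }
    }
}
```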

...Except for one huge caveat that would make actually implementing this extremely difficult. The documentation for dispatch_once_t says:

Variables of this type must have global or static scope. The result of using this type with automatic or dynamic allocation is undefined.

And of course Once has no such requirement. I suppose the restriction was added out of fear that another core could load a stale cached value of the status word from before the dispatch_once_t was initialized; in many cases that is impossible for some other reason, but not in all. Therefore, this whole technique would only be useful on iOS if the compiler could statically detect whether a given Once object is stored as a global, and fall back to a memory barrier if not, and I can't think of any existing mechanism that would allow such a thing. Oh well.

briansmith (Contributor) commented:

And of course Once has no such requirement.

I'm not sure what you mean. This is the signature for call_once:

fn call_once<F>(&'static self, f: F) 


comex commented Aug 25, 2016

@briansmith Hrm... my mistake, I didn't notice that signature. That's still technically a weaker guarantee than what the documentation requires, since functions like leak<T: 'static>(Box<T>) -> &'static T are considered to be safe, but using dispatch_once_t would probably be safe anyway.
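To make this point concrete (using today's std names, for illustration): a `&'static` reference can point at heap memory, so `&'static self` does not by itself guarantee "global or static scope" in the dispatch_once_t sense.

```rust
use std::sync::Once;

fn main() {
    // 'static promises the reference lives for the rest of the program,
    // not that the Once sits in static storage: this one lives on the heap.
    let heap_once: &'static Once = Box::leak(Box::new(Once::new()));
    let mut ran = false;
    heap_once.call_once(|| ran = true);
    assert!(ran);
}
```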

briansmith (Contributor) commented:

That's still technically a weaker guarantee than what the documentation requires, since functions like leak<T: 'static>(Box<T>) -> &'static T are considered to be safe, but using dispatch_once_t would probably be safe anyway.

If I understand you correctly, you are saying that there are some kinds of 'static references to things that don't actually have "global or static" lifetimes as dispatch_once_t requires, right? If so, perhaps we should define a new lifetime that has the "global or static" meaning, or perhaps we should replace the current use of 'static for those lifetimes with something else, like '_?

@steveklabnik steveklabnik added T-libs-api Relevant to the library API team, which will review and decide on the PR/issue. and removed A-libs labels Mar 24, 2017
@Mark-Simulacrum Mark-Simulacrum added the C-enhancement Category: An issue proposing an enhancement or a PR with one. label Jul 22, 2017
Mark-Simulacrum (Member) commented:

We've since landed #52349, which moves the fast path to an acquire load (which, according to the discussion on that PR, is likely the best possible); the cold path is intentionally left as SeqCst to avoid anything going wrong unintentionally. As such, I'm going to close this issue as done, with further optimization not being worth it.
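For reference, a minimal usage example of the result: with std::sync::Once today, every call after the first completed one returns via that acquire-load fast path (`INIT`, `RUNS`, and `init_once` are hypothetical names for illustration).

```rust
use std::sync::Once;
use std::sync::atomic::{AtomicUsize, Ordering};

static INIT: Once = Once::new();
static RUNS: AtomicUsize = AtomicUsize::new(0);

fn init_once() {
    // Only the first caller executes the closure; every later caller
    // returns after the fast-path acquire load.
    INIT.call_once(|| {
        RUNS.fetch_add(1, Ordering::SeqCst);
    });
}
```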
