
std::sync::Once has way more barriers than necessary #27610

Closed
talchas opened this issue Aug 8, 2015 · 7 comments
Labels
C-enhancement: An issue proposing an enhancement or a PR with one.
I-slow: Problems and improvements with respect to performance of generated code.
T-libs-api: Relevant to the library API team, which will review and decide on the PR/issue.

Comments


talchas commented Aug 8, 2015

I believe that (comments removed)

    pub fn call_once<F>(&'static self, f: F) where F: FnOnce() {
        if self.cnt.load(Ordering::Acquire) < 0 {
            return
        }

        let prev = self.cnt.fetch_add(1, Ordering::Acquire);
        if prev < 0 {
            self.cnt.store(isize::MIN, Ordering::Release);
            return
        }

        let guard = self.mutex.lock();
        if self.cnt.load(Ordering::Relaxed) > 0 {
            f();
            let prev = self.cnt.swap(isize::MIN, Ordering::Release);
            self.lock_cnt.store(prev, Ordering::Release);
        }
        drop(guard);

        if self.lock_cnt.fetch_add(-1, Ordering::AcqRel) == 1 {
            unsafe { self.mutex.destroy() }
        }
    }

is the correct set of barriers (I'm not certain if the lock_cnt barriers can be Relaxed instead, but it's pretty irrelevant and these are safe).

At the very least I am /certain/ that the very first load can be Acquire instead of SeqCst, which removes the barrier from the fast-path on x86.
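To make the fast-path claim concrete, here is a deliberately simplified sketch of a once primitive (illustrative only; this is neither the std implementation nor the proposal above): the fast path is a single Acquire load, which synchronizes with the Release store performed after initialization completes.

```rust
use std::sync::atomic::{AtomicBool, Ordering};
use std::sync::Mutex;

// Simplified sketch of a once primitive; names are invented for illustration.
struct SimpleOnce {
    done: AtomicBool,
    lock: Mutex<()>,
}

impl SimpleOnce {
    const fn new() -> Self {
        SimpleOnce { done: AtomicBool::new(false), lock: Mutex::new(()) }
    }

    fn call_once<F: FnOnce()>(&self, f: F) {
        // Fast path: an Acquire load suffices here; it pairs with the
        // Release store below, so no full barrier is needed.
        if self.done.load(Ordering::Acquire) {
            return;
        }
        let _guard = self.lock.lock().unwrap();
        // Re-check under the lock: another thread may have initialized
        // while we were blocked acquiring it.
        if !self.done.load(Ordering::Relaxed) {
            f();
            self.done.store(true, Ordering::Release);
        }
    }
}
```

The Acquire/Release pair is what makes the writes performed inside `f()` visible to every thread that later observes `done == true` on the fast path.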


Gankra commented Aug 8, 2015

cc @aturon

@Gankra Gankra added I-slow Issue: Problems and improvements with respect to performance of generated code. A-libs labels Aug 8, 2015

talchas commented Oct 1, 2015

(mentioning this later, but load(SeqCst) doesn't actually emit a barrier on x86 since C11 is weird, so while things can be relaxed, it's far less of a problem than I thought)


comex commented Aug 17, 2016

In some situations it's possible to implement Once with no memory barrier at all in the common case (except for a compiler barrier to prevent the compiler from reordering instructions across it). This doesn't matter on x86, since as talchas noted, SeqCst loads don't require a barrier on x86 anyway, but on ARM it could help. Since typical usage of Once executes the no-op case far more often than the initialization case, improving performance of the former may be justified even at the expense of adding quite a bit of overhead to the latter.

The first way to do this is to just get the OS to execute barriers on all other cores of the system. Windows has had a syscall to do this since Vista (and thus on all ARM editions), FlushProcessWriteBuffers. Linux has had a similar syscall, membarrier, only since late 2015, but there are hacky ways to accomplish the same on older versions. (The linked mailing list discussion occurred before the syscall was added; there is some skepticism of the hack there because of the potential for it to stop working on future versions of Linux, but since all future versions will now support the syscall, the hack should be fine as a fallback for old versions.)

I don't believe iOS has such a syscall. But in some cases it's possible to do something different, because iOS only runs on Apple CPUs and Apple has a trick:

https://www.mikeash.com/pyblog/friday-qa-2014-06-06-secrets-of-dispatch_once.html

(The linked blog post refers to OS X and Intel CPUs, but as I said, the whole thing is redundant on x86; iOS, however, uses the same code.)

I'm not suggesting that libstd implement this trick itself, at least on the write side: the library where Apple implements it is always dynamically linked and so Apple retains the option of having future CPUs require something else, and updating the OS to suit. But the read side of dispatch_once is present as an inlineable (indeed, always_inline) function in a system C header, which more or less means it must keep working forever, and (combined with the reasonable storage requirement, typedef long dispatch_once_t;) also means that using it would be as efficient as the current reimplementation - so Once on OS X could just be a wrapper around dispatch_once. Rust code can't properly inline it (without some fancy LTO stuff), but we could just reimplement the body in Rust to simulate the effect of inlining. It looks like this:

DISPATCH_INLINE DISPATCH_ALWAYS_INLINE DISPATCH_NONNULL_ALL DISPATCH_NOTHROW
void
_dispatch_once(dispatch_once_t *predicate,
        DISPATCH_NOESCAPE dispatch_block_t block)
{
    if (DISPATCH_EXPECT(*predicate, ~0l) != ~0l) {
        dispatch_once(predicate, block);
    } else {
        dispatch_compiler_barrier();
    }
    DISPATCH_COMPILER_CAN_ASSUME(*predicate == ~0l);
}
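The read-side shape above can be rendered portably in Rust roughly as follows (a sketch only: `!0` stands in for the C code's `~0l` sentinel, and the fast path keeps an Acquire load, since the barrier-free read in the real dispatch_once depends on Apple's write-side trick):

```rust
use std::sync::atomic::{AtomicIsize, Ordering};
use std::sync::Mutex;

// "All bits set" sentinel, mirroring ~0l in the C header.
const DONE: isize = !0;

// Illustrative only; not the real dispatch_once, which avoids even the
// Acquire on the read side via an Apple-specific write-side mechanism.
struct DispatchOnceLike {
    predicate: AtomicIsize,
    lock: Mutex<()>,
}

impl DispatchOnceLike {
    const fn new() -> Self {
        DispatchOnceLike { predicate: AtomicIsize::new(0), lock: Mutex::new(()) }
    }

    fn once<F: FnOnce()>(&self, f: F) {
        // Fast path: corresponds to DISPATCH_EXPECT(*predicate, ~0l) != ~0l.
        if self.predicate.load(Ordering::Acquire) == DONE {
            return; // corresponds to dispatch_compiler_barrier()
        }
        // Slow path: corresponds to the out-of-line dispatch_once() call.
        let _guard = self.lock.lock().unwrap();
        if self.predicate.load(Ordering::Relaxed) != DONE {
            f();
            self.predicate.store(DONE, Ordering::Release);
        }
    }
}
```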

...Except for one huge caveat that would make actually implementing this extremely difficult. The documentation for dispatch_once_t says:

Variables of this type must have global or static scope. The result of using this type with automatic or dynamic allocation is undefined.

And of course Once has no such requirement. I suppose the restriction was added out of fear that another core could load a stale cached value of the status word from before the dispatch_once_t was initialized; in many cases that is impossible for some other reason, but not in all. Therefore, this whole technique would only be useful on iOS if the compiler could statically detect whether a given Once object is stored as a global, and fall back to a memory barrier if not, and I can't think of any existing mechanism that would allow such a thing. Oh well.

briansmith (Contributor) commented:

And of course Once has no such requirement.

I'm not sure what you mean. This is the signature for call_once:

fn call_once<F>(&'static self, f: F) 


comex commented Aug 25, 2016

@briansmith Hrm... my mistake, I didn't notice that signature. That's still technically a weaker guarantee than what the documentation requires, since functions like leak<T: 'static>(Box<T>) -> &'static T are considered to be safe, but using dispatch_once_t would probably be safe anyway.
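To make this point concrete (using today's std names, for illustration): a `&'static` reference can point at heap memory, so `&'static self` does not by itself guarantee "global or static scope" in the dispatch_once_t sense.

```rust
use std::sync::Once;

fn main() {
    // 'static promises the reference lives for the rest of the program,
    // not that the Once sits in static storage: this one lives on the heap.
    let heap_once: &'static Once = Box::leak(Box::new(Once::new()));
    let mut ran = false;
    heap_once.call_once(|| ran = true);
    assert!(ran);
}
```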

briansmith (Contributor) commented:

That's still technically a weaker guarantee than what the documentation requires, since functions like leak<T: 'static>(Box<T>) -> &'static T are considered to be safe, but using dispatch_once_t would probably be safe anyway.

If I understand you correctly, you are saying that there are some kinds of 'static references to things that don't actually have "global or static" lifetimes as dispatch_once_t requires, right? If so, perhaps we should define a new lifetime that has the "global or static" meaning, or perhaps we should replace the current use of 'static for those lifetimes with something else, like '_?

@steveklabnik steveklabnik added T-libs-api Relevant to the library API team, which will review and decide on the PR/issue. and removed A-libs labels Mar 24, 2017
@Mark-Simulacrum Mark-Simulacrum added the C-enhancement Category: An issue proposing an enhancement or a PR with one. label Jul 22, 2017
Mark-Simulacrum (Member) commented:

We've since landed #52349, which moves the fast path to an acquire load (which, according to the discussion on that PR, is likely the best possible); the cold path is intentionally left as SeqCst to avoid anything going wrong unintentionally. As such, I'm going to close this issue as done, with further optimization not being worth it.
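For reference, a minimal usage example of the result: with std::sync::Once today, every call after the first completed one returns via that acquire-load fast path (`INIT`, `RUNS`, and `init_once` are hypothetical names for illustration).

```rust
use std::sync::Once;
use std::sync::atomic::{AtomicUsize, Ordering};

static INIT: Once = Once::new();
static RUNS: AtomicUsize = AtomicUsize::new(0);

fn init_once() {
    // Only the first caller executes the closure; every later caller
    // returns after the fast-path acquire load.
    INIT.call_once(|| {
        RUNS.fetch_add(1, Ordering::SeqCst);
    });
}
```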
