`AtomicCell::load()` can return a value that was modified by a concurrent racing `store()` during the loader's critical section #859

cbeuw · 2022-07-05T23:02:12Z

It is well known that SeqLocks - the underlying algorithm of AtomicCell - rely on data races which is technically always UB, though in practice it is assumed that a volatile read not ordered with a write is fine to perform, only that the read value must be discarded if it is determined after-the-fact that a writer was concurrently running with the volatile read.

However, crossbeam's SeqLock cannot reliably detect the presence of a writer running concurrently while a reader is in its critical section, this is due to the atomic load of the timestamp in validate_read() being too weakly ordered/synchronised, and can miss a writer's modification to the timestamp.

The following test fails with a modified Miri. We need to stop Miri from eagerly throwing a data-race UB when the unordered volatile read or write happens (by simply changing the throw ub macro to println), since we need to keep it running. You can clone the modified Miri here: https://github.com/cbeuw/miri/tree/no-throw-on-data-race. Running ./miri install will compile and install Miri and cargo-miri binaries under a new rustup toolchain called miri

MIRIFLAGS="-Zmiri-seed=1" cargo +miri miri test --lib -- atomic::seq_lock::tests::test_racy_read_discarded

  use core::ptr;
  use std::thread::{self, yield_now};

  #[derive(Copy, Clone)]
  struct EvilSend<T>(T);

  unsafe impl<T> Send for EvilSend<T> {}
  unsafe impl<T> Sync for EvilSend<T> {}

  #[test]
  fn test_racy_read_discarded() {
      static LK: SeqLock = SeqLock::new();

      let mut a = 0u32;
      let b = ptr::addr_of_mut!(a);
      let shared = EvilSend(b);

      let writer = thread::spawn(move || {
          // Same as AtomicCell::atomic_store()
          let _guard = LK.write();
          unsafe { ptr::write(shared.0, 1) };
      });

      {
          // Same as AtomicCell::atomic_load()
          if let Some(stamp) = LK.optimistic_read() {
              if stamp != 0 {
                  // Writer has already run
                  return;
              }

              // We entered the critical section before the writer did, so
              // *shared.0 == 0 (although we shouldn't be asserting it here since
              // we could be technically using a value obtained from a data-racing read)

              for _ in 0..100 {
                  // Give the writer some chance to ram in...
                  yield_now();
              }

              // If the writer did come into our critical section, this val will be 1
              // but then validate_read(stamp) should fail because it'll see the new stamp
              // written by the writer
              let val = unsafe { ptr::read_volatile(&*shared.0) };

              if LK.validate_read(stamp) {
                  if val == 1 {
                      panic!(
                      "The value was modified by a racing writer, and validate_read thought it was ok!"
                  );
                  } else {
                      // There was no concurrent writer while we are in the critical section
                      // the lock worked
                  }
              } else {
                  // validate_read() detected a concurrent writer, the lock worked
              }
          }
      }
      writer.join().unwrap();
  }

This is what happened




                  shared = 0 //Initial value of shared variable
          ┌───────state = 0 //Initial timestamp
          │
          │       Reader                                            │  Writer
          │                                                         │
          │       state.load(Acquire) in optimistic_read()          │
          │                                                         │  // Now a writer starts running in our critial section
          │                                                         │
          │                                                         │  state.swap(1, Acquire) in write()
 read-from│                                                         │  fence(Release)         in write()
          │                                                         │
          │                                                         │  *shared = 1
          │                                                         │
          │       val = *shared // Read 1 which must be discarded   │
          │                                                         │
          │                                                         │
          │       fence(Acquire) // in validate_read(), does nothing! 
          └──────►state.load(Relaxed)                               │
                                                                    │
                  // So stamp is still 0, the writer is completely  │
                  // missed!                                        │

As we can see, the crux of the issue is that while the volatile read on the shared variable could see the writer's modification, the atomic load on the timestamp may not. To ensure the correctness of validate_read(), if the value of the shared variable read comes from a concurrent writer (hence a "real" data race happened), then state.load() must read-from the modification to state of a (not necessarily the same, if there were multiple writers that ran in the critical section) writer.

Proposed solutions

I could think of three ways to fix it, in two categories

Ensure `validate_read()` always observes the latest timestamp

If the state.load() in validate_read() can somehow always see the globally latest timestamp (a total modification order exists for each memory location that has been atomically accessed), then it'll always catch a racing writer. There are two ways to do it

SeqCst

A SeqCst load cannot see anything earlier than the latest SeqCst store to the same location. If all stores and RMWs to state in writers are SeqCst, then a state.load(SeqCst) in validate_read() will see the latest writer's modification. ~~The fence can be removed.~~ optimistic_read() can still be relaxed though.

I do not like this solution, since we only care about reading the latest value on one location, but there is a Single Total Order of all SeqCst accesses on all locations - it's likely very damaging in performance, and also expresses the wrong intention in our code.

Edit: I was wrong. The fence between the volatile read and SeqCst timestamp read cannot be removed since the volatile read can actually be reordered (by the compiler or CPU) after the timestamp read: https://pdfs.semanticscholar.org/ac35/455b128baf4e280f2571160c242b67b3f85e.pdf Slide 19

RMW

The R in RMW is special - it always reads the latest value, regardless of the memory order argument. No other atomic access has this property (I believe there should be a load_latest() though that's off-topic). Though we also have to buy-in to the Modify and Write. Thankfully state is an integral type and fetch_add(0) is idempotent. So we can replace state.load(Relaxed) with state.fetch_add(0, Acquire). The fence can be removed as this acquires the releasing swap in write().

Create a synchronisation edge between the start of `validate_read()` and the end of `write()`

I believe this was the original intention of the code, but was thwarted by volitile reads not helping in any synchronisation effects.

The fence(Release) in write() and fence(Acquire) in validate_read() can create a synchronisation edge, if an atomic write to a location is done after write(), and an atomic read sees that atomic write's value before validate_read() (fence-fence synchronisation). The memory orders of the accesses do not matter for the purpose of release fence -> acquire fence synchronisation.

We can in fact replace the ptr::write() and ptr::volatile_read() on the shared variable with byte-wise atomic writes and reads. Note that if there were multiple writers executing during a reader's critical section, then each byte-wise atomic read may read from a different writer. The resulting value is garbage and must be discarded - but validate_read() can detect this as the acquring fence would create synchronisation edges with all writers from whom the byte-wise atomic read has read value. The timestamp read will note that certain writer executed and fail the check - it doesn't matter which specific writer that modified the timestamp.

This will also solve the issue of unordered ptr::write() and ptr::volatile_read() being technically UB. SeqLock is in fact the motivating example of the proposal P1478 for adding byte-wise atomic memcpy to C++.

We can actually go one step further and ditch the fences, as the proposal said

The explicit fence in the current seqlock idiom is confusing; this will usually eliminate it.

Indeed, the byte-wise atomic write and read on the shared variable can be used for synchronisation all by themselves. Again, if there were multiple writers executing during a reader's critical section, then each byte-wise atomic read may read from a different writer. But the timestamp load in validate_read() happens-after the timestamp modification in at least one of the writers, so it'll fail the check.

cc @RalfJung who first pointed out this might be an issue

The text was updated successfully, but these errors were encountered:

taiki-e · 2022-07-08T04:34:55Z

Thanks for investigating and writing this!

In the middle term, I plan to use inline-asm-based byte-wise atomic memcpy (basically taiki-e/atomic-memcpy#7).
In the long term, I expect that equivalent of it (but no inline-asm) will be available in the standard library.

I'm not sure what is better to do in the short term:

std atomic integer based byte-wise atomic memcpy (basically https://github.com/taiki-e/atomic-memcpy)
This is UB on the type with padding, but AtomicCell already depends on that UB, so it may still be valuable in terms of reducing the kinds of UBs it depends on.
SeqCst
I agree with your opinion that you do not like this solution.
RMW
LLVM can optimize such a RMW to atomic-load (godbolt: x86_64, arm, aarch64, powerpc). This means that the cost of such an operation would not be very high, but it may not have the effect we are hoping for.
Ignore this in the short term as part of the known problem of being dependent on implementation details.

RalfJung · 2022-07-08T20:57:51Z

LLVM can optimize such a RMW to atomic-load

Does an ARM sequence like that guarantee that it reads the latest value, i.e., the value that is "current at the time of the barrier? If not that would sound like a wrong compilation to me...

RalfJung · 2022-07-08T20:58:44Z

In the middle term, I plan to use inline-asm-based byte-wise atomic memcpy (basically taiki-e/atomic-memcpy#7).
In the long term, I expect that equivalent of it (but no inline-asm) will be available in the standard library.

Since C++ is also getting bytewise atomic memcpy, adding it to the language and stdlib should hopefully not be too bad. So maybe we won't need an inline assembly hack that makes Miri unhappy? :D

cbeuw · 2022-07-08T21:09:52Z

LLVM can optimize such a RMW to atomic-load (

I agree with @RalfJung that it sounds like a miscompilation. In Can Seqlocks Get Along with Programming Language Memory Models? the author stated

In the seqlock case, there is no relevant store preceding the fetch add(0, ...) operation. Hence a MOV would be an acceptable translation. However, the compiler normally has no way to determine the absence of such an atomic store
operation, especially since it could occur much earlier, in a different compilation unit. Hence it appears impossible to avoid generating a fence as part of the fetch add(0, ...) operation.

I have an idea for a simple test case that might break on real hardware. Will file a LLVM bug report if it works

RalfJung · 2022-07-08T21:21:53Z

Hence it appears impossible to avoid generating a fence

There's a dmb ish there, which is a fence, isn't it?

cbeuw · 2022-07-08T21:24:09Z

There's a dmb ish there, which is a fence, isn't it?

But if you change the ordering to relaxed it becomes a normal ldr. The x86 case is also just a mov for Release and Acquire

cbeuw · 2022-07-08T22:24:09Z

This is definitely a miscompilation. This panics on arm64. I thought at least I had to separate out GTE and LTE to different cache lines, but apparently that's not even needed!

use std::sync::atomic::AtomicU64;
use std::sync::atomic::Ordering::Relaxed;
use std::thread;

#[no_mangle]
pub fn writer(gte: &AtomicU64, lte: &AtomicU64) {
    for i in 0.. {
        // Since there is only one writer, at any point of the program execution,
        // the latest value of gte is either equal or 1 greater than lte.
        gte.store(i, Relaxed);
        lte.store(i, Relaxed);
    }
}

#[no_mangle]
pub fn reader(gte: &AtomicU64, lte: &AtomicU64) {
    loop {
        let smaller = lte.fetch_add(0, Relaxed);
        // The writer could increment lte and gte by any number of times
        // between the two fetch_add(0). This does not matter since smaller is still <= larger
        let larger = gte.fetch_add(0, Relaxed);
        assert!(larger >= smaller, "the latest gte value {larger} is less than the latest or earlier lte value {smaller}, this is not possible at any point in the program!");
    }
}

pub fn main() {
    static GTE: AtomicU64 = AtomicU64::new(0);
    static LTE: AtomicU64 = AtomicU64::new(0);
    let t_writer = thread::spawn(|| {
        writer(&GTE, &LTE);
    });

    let t_reader = thread::spawn(|| {
        reader(&GTE, &LTE);
    });

    t_writer.join().unwrap();
    t_reader.join().unwrap();
}

andy@Andys-MacBook-Pro src % rustc --edition 2021 -C opt-level=3 main.rs && ./main 
thread '<unnamed>' panicked at 'the latest gte value 4146 is less than the latest or earlier lte value 4182, this is not possible at any point in the program!', main.rs:22:9
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace

But if you change the two rmws to fetch_add(0, Acquire), it doesn't panic anymore - even though semantically there should be no distinction since Acquire does not interact with Relaxed. The reason is what you noted - a fence is correctly emitted that's missing in the Relaxed case. (Edit: actually it emits a ldar on arm64 which doesn't panic on Apple Silicon but does on Cortex-A72, so it's still incorrect)

The equivalent C++ code does the same: https://gist.github.com/cbeuw/e5cb76b1d01b51a56fc35ff40248ab27#file-repro-cpp

andy@Andys-MacBook-Pro src % clang++ -O2 -Wall -Wextra -std=c++20 main.cpp && ./a.out
Assertion failed: (larger >= smaller), function reader, file main.cpp, line 21.
zsh: abort      ./a.out

I'll try to trigger this on X86_64 - it doesn't currently though I think it should. Cacheline may be at play here

taiki-e · 2022-07-08T22:24:43Z

LLVM can optimize such a RMW to atomic-load

According to the llvm patch that implemented this behavior

For a given atomicrmw <op>, this is possible when:

The ordering of that operation is compatible with a load (i.e., anything that doesn't have a release semantic).

<op> does not modify the value being stored

So it seems that such acquire/relaxed RMWs are simply converted to acquire/relaxed loads (without considering other properties provided by RMW).

EDIT: I agree that this is a miscompilation.

cbeuw · 2022-07-08T23:27:09Z

Issue filed with LLVM^

RalfJung · 2022-07-09T01:39:49Z

Can't relaxed stores to different locations be reordered? I don't understand why this is a miscompilation.

cbeuw · 2022-07-09T09:47:16Z

Can't relaxed stores to different locations be reordered

That's a good point, though if relaxed loads can be freely reordered how do this still work...

T1:                                                 T2:
X.store(1, relaxed)
X.store(2, relaxed)
fence(sc)
                                                     fence(sc)
                                                     X.load(relaxed) // must load 2

The authors of the Weak Memory Emulation paper clearly also had the same understanding

The compiler here can't prevent reordering by seeing a fence, since the fence could be in a different compilation unit (that's the pattern I used for futex).

I just noticed that cppreference stated this doesn't hold anymore in C++20

For a pair of atomic operations on M called A and B, where A writes and B reads M's value, if there are two memory_order_seq_cst std::atomic_thread_fences X and Y, and if A is sequenced-before X, Y is sequenced-before B, and X appears before Y in the Single Total Order, then B observes either:

the effect of A (until C++20)

some unrelated modification of M that appears after A in M's modification order

So I guess this is one of the changes they made in C++20's P0668 scfix and called it "Strengthen SC fences", but this clearly takes away the guarantee on a pattern that was explicitly stated in the standard, and therefore a weakening and can very subtly break stuff....

It's also possible though that the above pattern never worked due to people thinking relaxed stores can be freely reordered

In any case, my test case still panics with this writer

pub fn writer(gte: &AtomicU64, lte: &AtomicU64) {
    for i in 0.. {
        gte.store(i, SeqCst);
        lte.store(i, SeqCst);
    }
}

Now that all stores to gte and lte are in a Single Total Order. Observing the latest value of lte has implications on the latest value of gte. I updated the test case in the gist

RalfJung · 2022-07-09T11:36:41Z

That's a good point, though if relaxed loads can be freely reordered how do this still work...

There's no guarantee that that fence in the 2nd thread is SC-after the one in the first thread. Thread 2 might go first. Then it can load 0 or 1 as well.

cbeuw · 2022-07-09T11:48:05Z

But then you can have this

T1                                          T2
X.store(1, relaxed)
X.store(2, relaxed)
fence(sc)
DONE.store(true, ...)           
                                              while !DONE.load(...) {}
                                              fence(sc)
                                              X.load(relaxed)

The above will create a happens before edge between store and load on DONE. With Release/Acquire fences I suppose X can read either 1 or 2 in T2 due to reordering, but SC fences guarantees reading 2, which may not be true after reordering. Again, everything could be in a different compilation unit in T1 from the fence

RalfJung · 2022-07-09T14:35:31Z

With Release/Acquire fences I suppose X can read either 1 or 2 in T2 due to reordering

No. You can only reorder stores to different locations.

cbeuw · 2022-07-09T16:11:46Z

reorder stores to different locations.

Ah yes that's where I was very mistaken...

I hope my updated test case correctly demonstrates the miscompilation. There's also the question about what cppreference meant - the same guarantee could still be provided with the new wording

RalfJung · 2022-07-09T16:24:33Z

SC writes can definitely not be reordered, so at least this particular concern no longer applies to the example in the LLVM issue.

cbeuw changed the title ~~AtomicCell::load() can return a value that was modified by a concurrent racing store() inside the loader's critical section~~ AtomicCell::load() can return a value that was modified by a concurrent racing store() during the loader's critical section Jul 5, 2022

taiki-e added bug crossbeam-utils labels Jul 8, 2022

cbeuw mentioned this issue Jul 8, 2022

Invalid optimisation: atomicrmw <op>, 0 is not always the same as load atomic llvm/llvm-project#56450

Closed

cbeuw mentioned this issue Aug 19, 2022

[RFC] AtomicPerByte (aka "atomic memcpy") rust-lang/rfcs#3301

Open

taiki-e mentioned this issue Aug 13, 2023

AtomicCell: Use atomic-maybe-uninit #1015

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

`AtomicCell::load()` can return a value that was modified by a concurrent racing `store()` during the loader's critical section #859

`AtomicCell::load()` can return a value that was modified by a concurrent racing `store()` during the loader's critical section #859

cbeuw commented Jul 5, 2022 •

edited

Loading

taiki-e commented Jul 8, 2022 •

edited

Loading

RalfJung commented Jul 8, 2022

RalfJung commented Jul 8, 2022

cbeuw commented Jul 8, 2022

RalfJung commented Jul 8, 2022

cbeuw commented Jul 8, 2022

cbeuw commented Jul 8, 2022 •

edited

Loading

taiki-e commented Jul 8, 2022 •

edited

Loading

cbeuw commented Jul 8, 2022

RalfJung commented Jul 9, 2022

cbeuw commented Jul 9, 2022 •

edited

Loading

RalfJung commented Jul 9, 2022

cbeuw commented Jul 9, 2022

RalfJung commented Jul 9, 2022

cbeuw commented Jul 9, 2022

RalfJung commented Jul 9, 2022

AtomicCell::load() can return a value that was modified by a concurrent racing store() during the loader's critical section #859

AtomicCell::load() can return a value that was modified by a concurrent racing store() during the loader's critical section #859

Comments

cbeuw commented Jul 5, 2022 • edited Loading

Proposed solutions

Ensure validate_read() always observes the latest timestamp

SeqCst

RMW

Create a synchronisation edge between the start of validate_read() and the end of write()

taiki-e commented Jul 8, 2022 • edited Loading

RalfJung commented Jul 8, 2022

RalfJung commented Jul 8, 2022

cbeuw commented Jul 8, 2022

RalfJung commented Jul 8, 2022

cbeuw commented Jul 8, 2022

cbeuw commented Jul 8, 2022 • edited Loading

taiki-e commented Jul 8, 2022 • edited Loading

cbeuw commented Jul 8, 2022

RalfJung commented Jul 9, 2022

cbeuw commented Jul 9, 2022 • edited Loading

RalfJung commented Jul 9, 2022

cbeuw commented Jul 9, 2022

RalfJung commented Jul 9, 2022

cbeuw commented Jul 9, 2022

RalfJung commented Jul 9, 2022

`AtomicCell::load()` can return a value that was modified by a concurrent racing `store()` during the loader's critical section #859

`AtomicCell::load()` can return a value that was modified by a concurrent racing `store()` during the loader's critical section #859

cbeuw commented Jul 5, 2022 •

edited

Loading

Ensure `validate_read()` always observes the latest timestamp

Create a synchronisation edge between the start of `validate_read()` and the end of `write()`

taiki-e commented Jul 8, 2022 •

edited

Loading

cbeuw commented Jul 8, 2022 •

edited

Loading

taiki-e commented Jul 8, 2022 •

edited

Loading

cbeuw commented Jul 9, 2022 •

edited

Loading