This analysis shows how Rust atomic operations with different memory orderings (Acquire, Release, Relaxed) translate through LLVM IR to actual ARM64 assembly instructions.
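For context, here is a minimal sketch of the kind of spinlock these snippets are assumed to come from. The type and method names (`SpinLock`, `new`, `lock`, `unlock`) are illustrative; only the `locked` field, the orderings, and the spin hint are taken from the snippets analyzed below.

```rust
use std::sync::atomic::{AtomicBool, Ordering};

pub struct SpinLock {
    locked: AtomicBool,
}

impl SpinLock {
    pub const fn new() -> Self {
        SpinLock { locked: AtomicBool::new(false) }
    }

    /// Spin until the flag flips from false to true (Acquire on success).
    pub fn lock(&self) {
        while self
            .locked
            .compare_exchange(false, true, Ordering::Acquire, Ordering::Relaxed)
            .is_err()
        {
            std::hint::spin_loop();
        }
    }

    /// Publish all writes made inside the critical section (Release store).
    pub fn unlock(&self) {
        self.locked.store(false, Ordering::Release);
    }
}
```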
Rust Source:

```rust
while self
    .locked
    .compare_exchange(false, true, Ordering::Acquire, Ordering::Relaxed)
    .is_err()
{
    std::hint::spin_loop();
}
```

LLVM IR:

```llvm
bb8.i:
%8 = cmpxchg ptr %_4.i, i8 0, i8 1 acquire monotonic, align 1
%9 = extractvalue { i8, i1 } %8, 1
br i1 %9, label %bb11.i, label %bb4.i3.i
bb4.i3.i: ; spin loop
tail call void @llvm.aarch64.isb(i32 noundef 15)
%10 = cmpxchg ptr %_4.i, i8 0, i8 1 acquire monotonic, align 1
%11 = extractvalue { i8, i1 } %10, 1
br i1 %11, label %bb11.i, label %bb4.i3.i
```

ARM64 Assembly:

```asm
LBB5_2: ; spin loop
isb ; Instruction Synchronization Barrier
LBB5_3:
mov w13, #0 ; expected value (false)
add x14, x10, #16 ; address of lock
casab w13, w12, [x14] ; Compare And Swap Acquire Byte
cmp w13, #0 ; check if we got the lock
b.ne LBB5_2 ; if not, spin again
```
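As an aside, spin loops like this often use `compare_exchange_weak`, which is allowed to fail spuriously. With LSE atomics (the `casab` seen here) both forms can compile to the same CAS instruction, but on cores that fall back to `ldaxr`/`stxr` pairs, the weak form avoids an extra retry loop. A sketch of that variant, written as a free function for illustration:

```rust
use std::sync::atomic::{AtomicBool, Ordering};

// Hypothetical free-standing version of the acquire loop using the weak CAS.
fn lock(locked: &AtomicBool) {
    while locked
        .compare_exchange_weak(false, true, Ordering::Acquire, Ordering::Relaxed)
        .is_err()
    {
        std::hint::spin_loop();
    }
}
```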
Rust Source:

```rust
self.locked.store(false, Ordering::Release);
```

LLVM IR:

```llvm
store atomic i8 0, ptr %_4.i release, align 1
```

ARM64 Assembly:

```asm
stlurb wzr, [x10, #16] ; STore reLease Unscaled Register Byte
```
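One way to guarantee that this Release store always runs is to tie the unlock to a guard's `Drop`. The sketch below is not part of the analyzed program; the `Guard` and `with_lock` names are made up, and only the orderings match the snippets above.

```rust
use std::sync::atomic::{AtomicBool, Ordering};

// Hypothetical guard: unlocking happens in Drop, so the Release store
// runs even on early returns or panics inside the critical section.
struct Guard<'a> {
    locked: &'a AtomicBool,
}

impl Drop for Guard<'_> {
    fn drop(&mut self) {
        self.locked.store(false, Ordering::Release);
    }
}

fn with_lock<R>(locked: &AtomicBool, f: impl FnOnce() -> R) -> R {
    while locked
        .compare_exchange(false, true, Ordering::Acquire, Ordering::Relaxed)
        .is_err()
    {
        std::hint::spin_loop();
    }
    let _guard = Guard { locked };
    f() // the guard's Drop releases the lock after f returns
}
```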
Rust Source:

```rust
let current = counter.load(Ordering::Relaxed);
counter.store(current + 1, Ordering::Relaxed);
```

LLVM IR:

```llvm
%12 = load atomic i64, ptr %_18.i monotonic, align 8
%_6.i = add i64 %12, 1
store atomic i64 %_6.i, ptr %_18.i monotonic, align 8
```

ARM64 Assembly:

```asm
ldr x13, [x11, #16] ; regular load (no barriers)
add x13, x13, #1 ; increment
str x13, [x11, #16] ; regular store (no barriers)
```
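Note that the relaxed load/store pair is not an atomic read-modify-write; it is only correct here because the spinlock serializes access to the counter. If the increment had to stand on its own, a single RMW such as `fetch_add` would be used instead. A small sketch of the difference, with the counter declared locally for illustration:

```rust
use std::sync::atomic::{AtomicU64, Ordering};

fn main() {
    let counter = AtomicU64::new(0);

    // Inside the critical section, a separate load + store is fine:
    let current = counter.load(Ordering::Relaxed);
    counter.store(current + 1, Ordering::Relaxed);

    // Outside a lock, the increment must be a single atomic RMW,
    // which on ARM64 with LSE can lower to a single ldadd:
    counter.fetch_add(1, Ordering::Relaxed);

    assert_eq!(counter.load(Ordering::Relaxed), 2);
}
```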
casab:
- Full Name: Compare And Swap Acquire Byte
- Memory Ordering: Acquire semantics
- Effect:
  - Atomically compares the value at `[x14]` with `w13` (0/false)
  - If equal, writes `w12` (1/true) to `[x14]`
  - Returns the original value in `w13`
  - Ensures all subsequent loads/stores see memory state after this operation
- x86-64 equivalent: `lock cmpxchg` followed by load barrier
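The "returns the original value in `w13`" behavior is visible at the Rust level too: `compare_exchange` hands back the previous value in both the `Ok` and `Err` cases. A small illustration, not taken from the analyzed program:

```rust
use std::sync::atomic::{AtomicBool, Ordering};

fn main() {
    let locked = AtomicBool::new(false);

    // First CAS succeeds: Ok carries the previous value (false).
    assert_eq!(
        locked.compare_exchange(false, true, Ordering::Acquire, Ordering::Relaxed),
        Ok(false)
    );

    // Second CAS fails: Err carries the value that was actually there (true),
    // just as casab leaves the observed value in w13.
    assert_eq!(
        locked.compare_exchange(false, true, Ordering::Acquire, Ordering::Relaxed),
        Err(true)
    );
}
```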
stlurb:
- Full Name: Store Release Unscaled Register Byte
- Memory Ordering: Release semantics
- Effect:
  - Stores the byte value to memory
  - Ensures all prior loads/stores complete before this store becomes visible
  - Makes modifications visible to other threads
- x86-64 equivalent: Regular `mov` (x86-64 has strong ordering)
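The "prior stores become visible" guarantee is exactly what makes the classic publish pattern work: data written before a Release store is visible to any thread that observes the flag with an Acquire load. A sketch of that pattern, with the `DATA` and `READY` names assumed for illustration:

```rust
use std::sync::atomic::{AtomicBool, AtomicU64, Ordering};
use std::thread;

static DATA: AtomicU64 = AtomicU64::new(0);
static READY: AtomicBool = AtomicBool::new(false);

fn main() {
    let producer = thread::spawn(|| {
        DATA.store(42, Ordering::Relaxed);     // payload write
        READY.store(true, Ordering::Release);  // publish: prior write ordered before this
    });

    let consumer = thread::spawn(|| {
        while !READY.load(Ordering::Acquire) { // acquire pairs with the release store
            std::hint::spin_loop();
        }
        assert_eq!(DATA.load(Ordering::Relaxed), 42); // guaranteed to see the payload
    });

    producer.join().unwrap();
    consumer.join().unwrap();
}
```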
isb:
- Full Name: Instruction Synchronization Barrier
- From: `std::hint::spin_loop()`
- Effect:
  - Flushes the pipeline and ensures subsequent instructions see any context changes
  - Optimizes spin loops by reducing power consumption
  - Prevents speculative execution during spinning
- x86-64 equivalent: `pause` instruction
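`std::hint::spin_loop` is only a hint meant for short waits; for longer waits the usual advice is to fall back to `std::thread::yield_now` or a blocking primitive. A sketch of that pattern, with the spin bound chosen arbitrarily:

```rust
use std::sync::atomic::{AtomicBool, Ordering};

// Spin briefly with the CPU hint, then yield the thread to the scheduler.
fn wait_until_set(flag: &AtomicBool) {
    let mut spins = 0u32;
    while !flag.load(Ordering::Acquire) {
        if spins < 100 {
            std::hint::spin_loop(); // isb on ARM64, pause on x86-64
            spins += 1;
        } else {
            std::thread::yield_now(); // stop burning cycles on long waits
        }
    }
}
```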
dmb ishld:
- Full Name: Data Memory Barrier, Inner Shareable, Load
- Memory Ordering: Acquire fence
- Effect:
  - Ensures all prior loads complete before subsequent operations
  - Used when decrementing reference counts (Arc drops); see the Arc-style sketch below
- x86-64 equivalent: `lfence` or implicit in loads
ldadd (release):
- Full Name: Load and Add with Release ordering
- Used for: Arc reference counting (`atomicrmw sub`)
- Effect:
  - Atomically adds a value to a memory location
  - Returns the original value
  - Release ordering ensures prior operations complete first
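These last two entries come together in the drop path of Arc-style reference counting: the count is decremented with a Release RMW, and only the thread that releases the last reference executes an Acquire fence before freeing, which is where the `dmb ishld` comes from. A simplified sketch of that pattern, not the actual std implementation:

```rust
use std::sync::atomic::{fence, AtomicUsize, Ordering};

struct RefCounted {
    strong: AtomicUsize,
    // ... payload would live here ...
}

// Simplified drop logic in the spirit of Arc::drop.
fn release(rc: &RefCounted) {
    // Release decrement: our prior writes happen-before the payload is dropped.
    if rc.strong.fetch_sub(1, Ordering::Release) != 1 {
        return; // other references remain; nothing else to do
    }
    // Last reference: the Acquire fence pairs with every earlier Release decrement,
    // so the payload is only freed after all threads' writes are visible.
    fence(Ordering::Acquire);
    // ... deallocate the payload here ...
}

fn main() {
    let rc = RefCounted { strong: AtomicUsize::new(1) };
    release(&rc);
}
```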
| Rust Ordering | LLVM Ordering | ARM64 Instruction | Memory Barrier |
|---|---|---|---|
| `Acquire` | `acquire` | `casab` (CAS with acquire) | Load barrier after |
| `Release` | `release` | `stlurb` (store with release) | Store barrier before |
| `Relaxed` | `monotonic` | `ldr` / `str` (regular) | None |
| Arc fence | `fence acquire` | `dmb ishld` | Full load barrier |
The atomic operations ensure cache coherency across cores:
- Lock Acquisition (`casab`):
  - Triggers cache line acquisition in Exclusive state
  - Other cores' cache lines are invalidated
  - Acquire semantics prevent reordering of subsequent loads
- Lock Release (`stlurb`):
  - Flushes the cache line to Shared state
  - Makes modifications visible to other cores
  - Release semantics prevent reordering of prior stores
- Spin Loop (`isb`):
  - Reduces contention by yielding CPU resources
  - Prevents aggressive polling that would cause cache line ping-ponging (see the sketch below)
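One common way to limit that ping-ponging is test-and-test-and-set: spin on a plain load, which keeps the line in Shared state, and only attempt the CAS once the lock looks free. A sketch of how the lock loop above could be adapted, written against a bare `AtomicBool` for illustration:

```rust
use std::sync::atomic::{AtomicBool, Ordering};

fn lock(locked: &AtomicBool) {
    loop {
        // Try to take the lock with the same CAS as before.
        if locked
            .compare_exchange(false, true, Ordering::Acquire, Ordering::Relaxed)
            .is_ok()
        {
            return;
        }
        // Lock is held: spin on a read-only load so the cache line stays Shared
        // instead of bouncing between cores on every failed CAS.
        while locked.load(Ordering::Relaxed) {
            std::hint::spin_loop();
        }
    }
}
```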
- Uncontended lock: ~2-3 cycles for `casab` + minimal overhead
- Contended lock: Spin loop with `isb` reduces power consumption
- Memory barriers:
  - Acquire: ~4-10 cycles depending on cache state
  - Release: ~4-10 cycles depending on cache state
  - Full barrier (`dmb`): ~10-20 cycles
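These cycle counts are rough estimates and depend on the core and cache state. A crude way to sanity-check the uncontended cost is to time a lock/unlock pair in a tight loop; the sketch below uses wall-clock time rather than cycle counters, so treat the result as an order-of-magnitude figure only.

```rust
use std::sync::atomic::{AtomicBool, Ordering};
use std::time::Instant;

fn main() {
    let locked = AtomicBool::new(false);
    let iterations = 10_000_000u32;

    let start = Instant::now();
    for _ in 0..iterations {
        // Uncontended acquire: the CAS succeeds on the first try.
        while locked
            .compare_exchange(false, true, Ordering::Acquire, Ordering::Relaxed)
            .is_err()
        {
            std::hint::spin_loop();
        }
        // Release the lock again.
        locked.store(false, Ordering::Release);
    }
    let elapsed = start.elapsed();

    println!(
        "~{:.1} ns per uncontended lock/unlock pair",
        elapsed.as_nanos() as f64 / iterations as f64
    );
}
```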
Running the program with 2 threads × 10,000 iterations:
```text
Final counter value: 20000
Expected value: 20000
```
The spinlock successfully synchronizes access to the shared counter, demonstrating correct implementation of acquire-release semantics.
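For reference, here is a self-contained sketch of the kind of test program that produces this output, with 2 threads each performing 10,000 locked increments. The structure and names are assumptions; only the orderings and the thread/iteration counts come from the analysis above.

```rust
use std::sync::atomic::{AtomicBool, AtomicU64, Ordering};
use std::thread;

static LOCKED: AtomicBool = AtomicBool::new(false);
static COUNTER: AtomicU64 = AtomicU64::new(0);

fn lock() {
    while LOCKED
        .compare_exchange(false, true, Ordering::Acquire, Ordering::Relaxed)
        .is_err()
    {
        std::hint::spin_loop();
    }
}

fn unlock() {
    LOCKED.store(false, Ordering::Release);
}

fn main() {
    const THREADS: usize = 2;
    const ITERATIONS: usize = 10_000;

    let handles: Vec<_> = (0..THREADS)
        .map(|_| {
            thread::spawn(|| {
                for _ in 0..ITERATIONS {
                    lock();
                    // Relaxed load + store is safe here because the lock
                    // serializes access to the counter.
                    let current = COUNTER.load(Ordering::Relaxed);
                    COUNTER.store(current + 1, Ordering::Relaxed);
                    unlock();
                }
            })
        })
        .collect();

    for handle in handles {
        handle.join().unwrap();
    }

    println!("Final counter value: {}", COUNTER.load(Ordering::Relaxed));
    println!("Expected value: {}", THREADS * ITERATIONS);
}
```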