LLVM doesn't optimize redundant atomics

Given the following examples, I would expect LLVM to be able to remove the redundant atomics, since one possible valid execution ordering is that they run back to back with no other thread observing the intermediate value:

```llvm
define ptr @redundant_store(ptr align 8 dereferenceable(8) %"x") {
top:
  store atomic i64 2, ptr %"x" seq_cst, align 8
  store atomic i64 3, ptr %"x" seq_cst, align 8
  ret ptr %"x"
}
```

```llvm
define i64 @redundant_load(ptr align 8 dereferenceable(8) %"x") {
top:
  %1 = load atomic i64, ptr %"x" seq_cst, align 8
  %2 = load atomic i64, ptr %"x" seq_cst, align 8
  %3 = sub i64 %1, %2
  ret i64 %3
}
```
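For reference, here is a hypothetical C++ source that clang lowers to essentially this IR (the names mirror the IR functions; this is an illustration I wrote, not the original code):

```cpp
#include <atomic>
#include <cstdint>

// Two seq_cst stores: in one legal execution no other thread ever
// observes the intermediate value 2, so the first store is dead there.
std::atomic<int64_t>* redundant_store(std::atomic<int64_t>& x) {
    x.store(2, std::memory_order_seq_cst);
    x.store(3, std::memory_order_seq_cst);
    return &x;
}

// Two back-to-back seq_cst loads: in one legal execution no store
// intervenes, so the second load could reuse `a` and the subtraction
// could fold to 0.
int64_t redundant_load(std::atomic<int64_t>& x) {
    int64_t a = x.load(std::memory_order_seq_cst);
    int64_t b = x.load(std::memory_order_seq_cst);
    return a - b;
}
```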

However, LLVM removes neither the redundant store nor the redundant load:

```llvm
define noundef nonnull ptr @redundant_store(ptr returned writeonly align 8 captures(ret: address, provenance) dereferenceable(8) %x) local_unnamed_addr #0 {
  store atomic i64 2, ptr %x seq_cst, align 8
  store atomic i64 3, ptr %x seq_cst, align 8
  ret ptr %x
}
```

```llvm
define i64 @redundant_load(ptr readonly align 8 captures(none) dereferenceable(8) %x) local_unnamed_addr #0 {
  %0 = load atomic i64, ptr %x seq_cst, align 8
  %1 = load atomic i64, ptr %x seq_cst, align 8
  %2 = sub i64 %0, %1
  ret i64 %2
}
```

Is there a reason why it can’t?


LLVM is really conservative about eliminating atomic loads/stores. One way to think about it: removing an atomic allows code to move across the program point where the atomic used to be, which in general shouldn’t happen.

The whole reason this optimization should be valid is that it is already valid to move non-atomic code across an atomic, so long as that code doesn’t depend on the atomic.

Yes, removing the redundant load and the redundant store are both valid transforms per the C++ memory model, but LLVM is very conservative here: it doesn’t remove them for any memory ordering.

There are cases where you cannot remove a seq_cst operation entirely, but it’s always fine to remove redundant ones if you would otherwise be able to move them next to each other.
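A sketch of why merging two adjacent seq_cst stores is legal (my own illustration, not from the thread): the collapsed program only exhibits executions that the original program could already exhibit, so the transform shrinks the set of behaviors without adding any.

```cpp
#include <atomic>
#include <cstdint>

std::atomic<int64_t> x{0};

// An observer thread may see x go 0 -> 2 -> 3, or 0 -> 3 directly if it
// is never scheduled between the two stores.
void two_stores() {
    x.store(2, std::memory_order_seq_cst);
    x.store(3, std::memory_order_seq_cst);
}

// The merged version only produces the 0 -> 3 history, which is one of
// the legal schedules of two_stores(); an optimizer is allowed to pick
// a subset of the original program's behaviors.
void one_store() {
    x.store(3, std::memory_order_seq_cst);
}
```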

I don’t recall why we don’t though.

Also if you’re not already aware: N4455 No Sane Compiler Would Optimize Atomics
Maybe @jfb knows more.


It’s just conservative, and nobody has bothered to write the optimization since I wrote the paper, apart from Robin, who had some patches that got reverted.


I see that Poor codegen with atomics · Issue #37064 · llvm/llvm-project · GitHub is a pretty good writeup of some of the missing optimizations (although it focuses more on the relaxed case, which is easier).


To give a little context: this question was spawned by a question on the Julia Discourse, and was originally about store-to-load forwarding.

The pattern was: we want an init_once that can LICM, i.e., for

```c
tmp = get_lazy(lazy_var);
// something that llvm understands
tmp = get_lazy(lazy_var);
```

we want the second get_lazy removed.

There must be an initial load_acquire in get_lazy to check whether the thing has been initialized; and if it hasn’t, there must be a store_release to mark that it has been initialized.

We can make it so that the load_acquire and the conditional store_release are inlined (the slow-path initialization code doesn’t need to be inlined / visible to llvm).

The second get_lazy load-acquire and conditional check should be optimized away completely: the compiler knows the variable has been initialized, either because the first check already confirmed initialization and there was no intervening store, or because the first call did the initialization and the value can be forwarded from the store-release.

Mechanically, the second load_acquire should be moved upwards, duplicated at the branch, and then be removed at both points (once as a redundant load, and once due to store-load forwarding).
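The pattern described above might look like the following sketch. The names (Lazy, get_lazy, init_slow_path) are illustrative, not Julia’s actual runtime API, and this version deliberately ignores racing initializers (a real implementation would serialize them with a CAS or a lock):

```cpp
#include <atomic>

struct Lazy {
    std::atomic<int*> ptr{nullptr};
};

// Out-of-line slow path: performs the real initialization and publishes
// the result with a store-release. Need not be visible to LLVM.
int* init_slow_path(Lazy& l) {
    static int value = 42;  // stand-in for the expensive init
    l.ptr.store(&value, std::memory_order_release);
    return &value;
}

// Inlined fast path: a load-acquire checks whether the thing has
// already been initialized; only the check is visible to the optimizer.
inline int* get_lazy(Lazy& l) {
    int* p = l.ptr.load(std::memory_order_acquire);
    if (p != nullptr)
        return p;               // already initialized
    return init_slow_path(l);   // initialize and publish
}
```

Calling get_lazy twice back to back is exactly the case where the second load-acquire and branch ought to fold away.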

I think this optimization would be invalid if the load was seq_cst – we would not be able to move it across the intervening code. On the other hand, monotonic / relaxed is not strong enough for this construction to be safe.

So the original question was pretty exactly about duplicate load_acquire removal and about store → load_acquire forwarding.
