Lowering Atomic Load to Acquire and Load

I'm working on an experimental backend for an MCU that has heavy multithreading capabilities but lacks proper acquire/release semantics. This is okay, as the programmer can customize __cxa_guard_acquire and __cxa_guard_release to lower/raise the appropriate semaphores. The issue I'm having is that I can't figure out when to lower an atomic load into an acquire/load pair early enough that the call to __cxa_guard_acquire is still considered for optimization (most importantly, inlining). First, is this even the proper way to do this? And if it is, is there a "best time" to run a pass that catches these loads?

Thanks!

-Sam

Normally, the code clang generates for a guarded initialization looks like this:

entry:
  %0 = load atomic i8* bitcast (i64* @_ZGVZ3barvE1x to i8*) acquire, align 8
  %guard.uninitialized = icmp eq i8 %0, 0
  br i1 %guard.uninitialized, label %init.check, label %init.end

init.check: ; preds = %entry
  %1 = tail call i32 @__cxa_guard_acquire(i64* @_ZGVZ3barvE1x) #1
  %tobool = icmp eq i32 %1, 0
  br i1 %tobool, label %init.end, label %init

init: ; preds = %init.check
  %call = tail call i32 @_Z3foov() #1
  store i32 %call, i32* @_ZZ3barvE1x, align 4, !tbaa !0
  tail call void @__cxa_guard_release(i64* @_ZGVZ3barvE1x) #1
  br label %init.end
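
For reference, demangling the symbols in that IR (_Z3barv, _Z3foov, _ZZ3barvE1x) suggests source roughly like the sketch below. The return type of bar() and the body of foo() are not visible in the IR, so both are assumptions made only to keep the example self-contained; foo_calls is a hypothetical counter added for illustration.

```cpp
// Hypothetical stand-in for the real foo(), which is only declared in the IR.
// foo_calls exists purely to show that the initializer runs once.
static int foo_calls = 0;

int foo() {
  ++foo_calls;
  return 42;
}

// A function-local static with a dynamic initializer. In a thread-safe-statics
// build, clang emits the guard variable _ZGVZ3barvE1x plus the
// __cxa_guard_acquire / __cxa_guard_release calls around this initialization.
// The int return type is an assumption; the IR above does not show it.
int bar() {
  static int x = foo();
  return x;
}
```

Calling bar() repeatedly runs foo() only on the first call; every later call takes the branch straight to init.end.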

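Given this, there is no reason to inline the call to
__cxa_guard_acquire: the atomic load and compare in the entry block
already handle the common, already-initialized case inline, and the
call is only reached on the first-use path. Inlining it would bloat
code size for no performance benefit.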
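
To make the fast-path/slow-path split concrete, here is a rough C++ rendering of the same control flow. It is only a sketch: guard_acquire_slow and guard_release_slow are hypothetical stand-ins for __cxa_guard_acquire and __cxa_guard_release, built on a std::mutex rather than whatever semaphores the target would actually use, and foo() is a dummy initializer.

```cpp
#include <atomic>
#include <cstdint>
#include <mutex>

namespace sketch {

std::atomic<std::uint8_t> guard{0};  // plays the role of the guard variable
std::mutex init_lock;                // assumption: a mutex instead of a semaphore
int x = 0;                           // the guarded static
int init_calls = 0;                  // hypothetical counter for illustration

int foo() {  // dummy initializer
  ++init_calls;
  return 7;
}

// Stand-in for __cxa_guard_acquire: returns true if the caller must run the
// initializer, in which case the lock stays held until guard_release_slow().
bool guard_acquire_slow() {
  init_lock.lock();
  if (guard.load(std::memory_order_relaxed) != 0) {
    init_lock.unlock();  // another thread finished initialization first
    return false;
  }
  return true;
}

// Stand-in for __cxa_guard_release: publish the initialized value and unlock.
void guard_release_slow() {
  guard.store(1, std::memory_order_release);
  init_lock.unlock();
}

int bar() {
  // Fast path: a single acquire load and compare, matching the entry block.
  if (guard.load(std::memory_order_acquire) == 0) {
    // Slow path: only reached until the first initialization completes.
    if (guard_acquire_slow()) {
      x = foo();
      guard_release_slow();
    }
  }
  return x;
}

}  // namespace sketch
```

After the first call, bar() never reaches guard_acquire_slow() again, which is why inlining that function would add code to every call site without speeding up the common case.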

What does the IR you are working with look like?

-Eli