While working on an LLVM backend for SBCL (a Lisp compiler), I've found that certain sequences of code must be atomic with respect to async signals. For example, on x86, a single SUB on a memory location should be used, not a load/sub/store sequence. LLVM's IR doesn't currently have any way to express this kind of constraint (and really, a general mechanism is essentially impossible, since different architectures have different possibilities, so I'm not asking for one).
All I would really like is to be able to specify the exact instruction sequence to emit there. I'd hoped that inline asm would be the way to do so, but LLVM doesn't appear to support emitting it when using the JIT compiler. Is there any hope of inline asm being supported with the JIT anytime soon? Or is there an alternative suggested way of doing this? I'm using llvm.atomic.load.sub.i64.p0i64 for the moment, but that is both more expensive than I need, since it emits an unnecessary LOCK prefix, and theoretically incorrect: while it currently generates correct code on x86-64, LLVM doesn't actually *guarantee* that it generates a single instruction; that's just "luck".
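To make the constraint concrete, here is a minimal C sketch (GCC/Clang extended asm, x86-64 only) of the single-instruction subtract described above. The helper name signal_atomic_sub is made up for illustration, and the fallback branch is only there so the sketch compiles elsewhere; it does not provide the guarantee.

```c
#include <stdint.h>

/* Hypothetical helper: subtract `amount` from *slot using exactly one
   read-modify-write instruction on x86-64. A signal handler running on the
   same thread can interrupt before or after the SUB, but never in the
   middle of it. No LOCK prefix is used, so this is NOT atomic with respect
   to other processors, only with respect to signals on this thread. */
static inline void signal_atomic_sub(int64_t *slot, int64_t amount)
{
#if defined(__x86_64__) && (defined(__GNUC__) || defined(__clang__))
    __asm__ volatile("subq %1, %0"
                     : "+m"(*slot)   /* memory operand, read and written */
                     : "er"(amount)  /* 32-bit immediate or register */
                     : "cc");        /* SUB modifies the flags */
#else
    /* Fallback so the sketch builds on other targets: the compiler may emit
       a load/sub/store sequence here, so the guarantee is lost. */
    *slot -= amount;
#endif
}
```

This is the effect I would like to get out of the JIT: the same SUB that llvm.atomic.load.sub currently happens to produce, minus the LOCK prefix.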
Additionally, I think there will be some situations where a particular ordering of memory operations is required. LLVM makes no guarantees about the order of stores, unless there's some way that you could tell the difference in a linear program. Unfortunately, I don't have a linear program; I have a program which can run signal handlers between arbitrary instructions. So I think I'll need something like an llvm.memory.barrier of type "ss", except one that affects only the codegen and doesn't actually insert a processor memory barrier.
Is there already some way to insert a codegen-barrier with no additional runtime cost (beyond the opportunity-cost of not being able to reorder/delete stores across the barrier)? If not, can such a thing be added? On x86, this is a non-issue, since the processor already implicitly has inter-processor store-store barriers, so using:
call void @llvm.memory.barrier(i1 0, i1 0, i1 0, i1 1, i1 0)
is fine: it's a no-op at runtime but ensures the correct sequence of stores. However, I'm thinking ahead here to other architectures, where that would actually require expensive instructions to be emitted.
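For comparison, the kind of codegen-only barrier asked about here roughly corresponds to what C11 calls atomic_signal_fence: it constrains compiler reordering between a thread and a signal handler running on that same thread, without emitting a hardware fence. The following is a minimal sketch in C rather than LLVM IR; the names payload, ready, observed, publish and on_signal are hypothetical.

```c
#include <signal.h>
#include <stdatomic.h>

/* Hypothetical state shared between mainline code and a signal handler
   that runs on the same thread. */
static atomic_int payload;
static atomic_int ready;
static atomic_int observed = -1;

/* Mainline code: store the data, then set the flag. The signal fence keeps
   the compiler from reordering the two stores, but emits no processor
   barrier instruction. */
static void publish(int value)
{
    atomic_store_explicit(&payload, value, memory_order_relaxed);
    atomic_signal_fence(memory_order_release);
    atomic_store_explicit(&ready, 1, memory_order_relaxed);
}

/* Handler: if the flag is set, the payload store is guaranteed to be
   visible, because the handler runs on the thread that published it. */
static void on_signal(int signo)
{
    (void)signo;
    if (atomic_load_explicit(&ready, memory_order_relaxed)) {
        atomic_signal_fence(memory_order_acquire);
        atomic_store_explicit(
            &observed,
            atomic_load_explicit(&payload, memory_order_relaxed),
            memory_order_relaxed);
    }
}
```

What I'm after is an IR-level equivalent of that fence: something the optimizer and code generator must respect, but which lowers to nothing on every target.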
Thanks,
James