I have recently started working on optimizing C11/C++11 atomics in LLVM, and plan to focus on that for the next two months as an intern on the PNaCl team. I've sent two patches on this topic to Phabricator that fix http://llvm.org/bugs/show_bug.cgi?id=17281:
The first patch is X86-specific: it applies operations with immediate operands to atomics directly, without going through a register. The main difficulty here is that the X86 backend respects the LLVM memory model rather than the stronger x86-TSO memory model, and may therefore reorder instructions. To prevent illegal reordering of atomic accesses, the backend converts them to pseudo-instructions in X86InstrCompiler.td (RELEASE_MOV* and ACQUIRE_MOV*) that are opaque to most of the rest of the backend, and only lowers them at the very end of the pipeline. I have decided to follow the same approach, simply adding more RELEASE_* pseudo-instructions rather than trying to find every possibly misbehaving part of the backend in order to do early lowering. This lowers the risk and complexity of the patch, at the cost of possibly missing some optimization opportunities.
Another problem I hit with this patch is a TableGen type-inference failure when adding negate/not to the scheme. As a result, I have left these two instructions commented out in the patch. Does anyone have an idea of how to proceed here?
The second patch is more straightforward: in the C11/C++11 memory model (which LLVM essentially borrows), optimizations like dead store elimination (DSE) can safely fire across atomic accesses in many cases; roughly, as long as they do not operate across a release-acquire pair (see the paper referenced in the comments). So I tweaked MemoryDependenceAnalysis to track such pairs and only return a clobber result when one is present.
My next step will probably be to improve the compilation of acquire atomics in the ARM backend. In particular, they are currently compiled to a load + dmb, while a load + useless dependent branch + isb is also valid (see http://www.cl.cam.ac.uk/~pes20/cpp/cpp0xmappings.html for example) and may be faster. Even better: if there is already a dependent branch (such as the loop in the lowering of CAS), only a cheap isb is needed. The main task will be switching off the InsertFencesForAtomic flag and doing the lowering of atomics in the backend, because once an acquire load has been transformed into an acquire fence, too much information has been lost to apply this mapping.
Longer term, I hope to improve fence elimination in the ARM backend with a PRE-like algorithm. Both of these ARM backend improvements should be fairly straightforward to port to the POWER architecture later, and I hope to do that as well.
Does this approach seem worthwhile to you? Can I do anything to help the review process?