Proposal: "load linked" and "store conditional" atomic instructions

Hi,

I've been looking at further improving atomicrmw & cmpxchg code
generation, particularly on architectures using the
load-linked/store-conditional model.

The summary is that the current expansion of cmpxchg seems to happen
too late for LLVM to make meaningful use of the opportunities it
provides.
I'd like to move it earlier and express it in terms of a first-class
pair of "load linked" and "store conditional" instructions.

More details are attached and inlined below. Currently the only code I
have is for one or two AtomicExpand peepholes, so any suggestions or
alternatives would be very welcome. It's possible this is entirely the
wrong approach.

What do people think?

Cheers.

Tim.

ll-sc.rst (6.04 KB)

Hi Tim,

1. Can invalid transformations be done on them? Separating an LL from its
   corresponding SC by too much almost guarantees failure (an OS task switch
   happens, clearing the monitor).
From my admittedly weak understanding, LICM shouldn't hoist atomic
loads unless they're unordered, and the same would apply to the new
load linked instruction. In what other cases could the LL and SC
become separated enough for OS context switches to be a problem?
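
For illustration, a minimal sketch (in the hypothetical "load
linked"/"store conditional" syntax proposed above) of the loop shape
where that hoisting must not happen:

    loop:
        ; a monitored load: hoisting it out of the loop, as LICM may do
        ; for an unordered load, would leave each retry without a monitor
        %oldval = load linked i32* %addr
        %newval = add i32 %oldval, 1
        %success = store conditional i32 %newval, i32* %addr
        br i1 %success, label %end, label %loop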

Amara

Hi Amara,

I've had a chat with Chandler on IRC since sending this, and it looks
like it's at the very least a less pressing issue.

The summary was that we may want LL/SC instructions in future
(possibly to represent situations currently handled by cmpxchg
loops), but they wouldn't actually solve my problem anyway, because
cmpxchg would still be the canonical representation when on its own.

Fortunately, he also suggested a very neat solution which deals with
my issues: just run EarlyCSE to tidy up my mess.

I have some reservations about this proposal. I don't have anything particularly concrete, but the idea of supporting both LL/SC and atomicrmw in the IR concerns me from a complexity perspective. If we find that there is significant profit in supporting both, we can add them, but I would like to see documentation of significant benefit before changing the IR.

Tim, for those of us not directly involved, could you share a selection of bugs or other background? I'd like to read through and try to get a better sense for the problem you're trying to solve.

Philip

Hi Philip,

I have some reservations about this proposal. I don't have anything
particularly concrete, but the idea of supporting both LL/SC and atomicrmw
in the IR concerns me from a complexity perspective.

Well, I'll start by saying my particular optimisation use case looks
like it's not enough to justify the addition. I've got something
basically working for my efficiency worries, with less effort than I
thought. So I'm a lot less keen on it myself than I was a few hours
ago.

But I'm still worried about how closely LLVM IR is tied to both C and
X86 in this matter. A weak cmpxchg would go a long way to resolving
this, but it's still difficult to see a path from an IR-level "cmpxchg
weak" to optimal "atomicrmw lambda" support in LL/SC backends.

Given C like

    void atomic_foo(int *addr) {
      int oldval = *addr;
      int newval;
      do {
        newval = foo(oldval);
      } while (!__c11_compare_exchange_weak(addr, &oldval, newval));
    }

The cmpxchg representation would be something like:

    define void @atomic_foo(i32* %addr) {
    entry:
        %firstval = load i32* %addr
        br label %loop
    loop:
        %oldval = phi i32 [%firstval, %entry], [%wrongval, %loop]
        %newval = call i32 @foo(i32 %oldval)
        %res = cmpxchg weak i32* %addr, i32 %oldval, i32 %newval seq_cst seq_cst
        %wrongval = extractvalue { i32, i1 } %res, 0
        %success = extractvalue { i32, i1 } %res, 1
        br i1 %success, label %end, label %loop
    end:
        ret void
    }

But the optimal LL/SC form would be more like:

    define void @atomic_foo(i32* %addr) {
    entry:
        br label %loop
    loop:
        %oldval = load linked i32* %addr
        %newval = call i32 @foo(i32 %oldval)
        %success = store conditional i32 %newval, i32* %addr
        br i1 %success, label %end, label %loop
    end:
        ret void
    }

That kind of analysis is a very big burden to put on any pass. On the
other hand, mapping the other way doesn't seem much simpler either.

I feel like there ought to be a good way to combine this with
Haswell's xbegin/xend functionality in an even more generic IR
construct too, but I can't quite come up with a representation. More
thought needed.

Tim, for those of us not directly involved, could you share a selection of
bugs or other background? I'd like to read through and try to get a better
sense for the problem you're trying to solve.

My immediate concern was C++11's compare_exchange, which inserts an
automatic and mostly redundant icmp after the result. Variants of the
IR in my original message are representative, though possibly not
exhaustive, of what might be seen.
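
As a hedged sketch of the pattern (names assumed; strong cmpxchg
shown): the front end materialises compare_exchange's boolean result
with its own icmp even though the cmpxchg already yields an
equivalent i1, which an EarlyCSE-style cleanup can fold away:

    %res = cmpxchg i32* %addr, i32 %expected, i32 %new seq_cst seq_cst
    %oldval = extractvalue { i32, i1 } %res, 0
    %success = extractvalue { i32, i1 } %res, 1
    ; the automatic compare inserted for the bool result; it always
    ; recomputes %success, so cleanup passes can remove it
    %again = icmp eq i32 %oldval, %expected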

Of course, it's all much more speculative now. Except, possibly, "how
should we handle compare_exchange_weak".

Cheers.

Tim.

Hi Philip,

I have some reservations about this proposal. I don't have anything
particularly concrete, but the idea of supporting both LL/SC and atomicrmw
in the IR concerns me from a complexity perspective.

Well, I'll start by saying my particular optimisation use case looks
like it's not enough to justify the addition. I've got something
basically working for my efficiency worries, with less effort than I
thought. So I'm a lot less keen on it myself than I was a few hours
ago.

Good to know.

But I'm still worried about how closely LLVM IR is tied to both C and
X86 in this matter. A weak cmpxchg would go a long way to resolving
this, but it's still difficult to see a path from an IR-level "cmpxchg
weak" to optimal "atomicrmw lambda" support in LL/SC backends.

I share your concerns, actually. It doesn't affect my current usage, so it's not a high priority for me, but from an idealist perspective it is worrying. On the other hand, overly generic IR is an evil in and of itself. So it's a delicate balance.

[C example and cmpxchg/LL-SC IR from the previous message snipped]

That kind of analysis is a very big burden to put on any pass. On the
other hand, mapping the other way doesn't seem much simpler either.

I feel like there ought to be a good way to combine this with
Haswell's xbegin/xend functionality in an even more generic IR
construct too, but I can't quite come up with a representation. More
thought needed.

I agree with both points, but particularly the more thought needed one. :)

While it's tempting to introduce scoped atomic constructs which map nicely to all three, this loses much of the power of the xbegin/xend scheme. Being able to spread transactions across function boundaries is essential.

I suspect we'll have to end up modelling the transaction boundaries as some form of memory fence. This doesn't get all of their semantics, but it does prevent a number of illegal transforms:
xbegin -> loadstore, storestore fence after (i.e. stores can't float out of the top of the atomic region)
xend -> storestore, storeload fence before (i.e. nor out of the bottom)
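
A rough sketch of that mapping, assuming x86's existing
@llvm.x86.xbegin/@llvm.x86.xend RTM intrinsics and approximating the
directional barriers above with LLVM's coarser acquire/release fences:

    %status = call i32 @llvm.x86.xbegin()  ; enter the transaction
    fence acquire   ; approximation: region accesses can't hoist above
    ; ... transactional loads and stores ...
    fence release   ; approximation: region stores can't sink below
    call void @llvm.x86.xend()             ; commit the transaction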

You probably do want to allow loads to be reordered into a transaction past an xend. Doing so past an xbegin is legal (I think?), but likely not profitable: it can turn a potentially succeeding transaction into an always failing one. (Or an always succeeding one.) There are a lot of cases to be explored here, both w.r.t. legality and profitability.

It would also be good to get input from folks who've built previous compilers with TM constructs. I just don't know enough about the prior art to propose a good design.

What I'm missing from this discussion is *when* LL/SC would be
emitted: is it Clang that lowers code differently when it knows about
the target; is it the user that calls these directly; or is it a pass
that does this IR transform early enough that more optimizations can
be applied? The latter seems like the best solution to me, but it still
requires adding this new construct to the IR and documenting it, and I
think doing so may uncover more issues that we'll need to discuss.

I'm somewhat wary of deviating further from LLVM's weird superset of
C++11's memory model and operations (yes, yes, I know there's history
there) before they get "fixed" in the language.

This particular example is a bit difficult, because the best representation depends on the complexity of foo(). If foo() is simple (and guaranteed not to contain any atomics) then you'd want to do the load-linked at the start and store-conditional at the end, something like this:

     int oldval, newval;
     do {
       oldval = load_linked(addr);  /* re-establish the monitor each retry */
       newval = foo(oldval);
     } while (!store_conditional(addr, newval));

If, however, foo() is not visible, is complex, or has some other memory accesses[1], then the correct lowering would be a load at the start and then a load-linked/store-conditional (but without a loop) at the end:

     int oldval = *addr;
     int newval, oldval2, changed;
     do {
       newval = foo(oldval);
       oldval2 = load_linked(addr);
       changed = (oldval != oldval2);
       oldval = oldval2;  /* on a mismatch, retry foo with the fresh value */
     } while (changed || !store_conditional(addr, newval));

The lowering that we currently generate, however, is more like this:

     int oldval = *addr;
     int newval, success;
     do {
       newval = foo(oldval);
       int oldval2;
       success = 1;
       do {
         oldval2 = load_linked(addr);
         if (oldval2 != oldval) {
           success = 0;
           break;  /* comparison failed: the cmpxchg reports failure */
         }
       } while (!store_conditional(addr, newval));
       oldval = oldval2;  /* retry foo with the value just observed */
     } while (!success);

This is clearly suboptimal. My problem with your current proposal is that we really need to preserve the weak cmpxchg semantics through optimisation to ensure correctness. The front end does not know (except in trivial cases) which of the two earlier forms is correct. It could always emit the second form for any weak cmpxchg, but that then makes optimisation harder because optimisers have to infer that it means a weak cmpxchg[2].

When we get to CodeGenPrepare, it's fairly easy to transform the original structure into the first correct output, if there are no memory accesses in the middle. Transforming the second version into the first is harder unless the structure is preserved.

In short, I believe that adding LL/SC to the IR will:

- Not make generating IR easier than adding weak cmpxchg
- Not make optimising IR easier than adding weak cmpxchg
- Not make code generation easier than adding weak cmpxchg
- Be more invasive than adding weak cmpxchg

David

[1] RISC-V, for example, requires that there be no memory accesses *at all* between the ll and the sc; some ARM implementations require that there be no other accesses to the same cache line between them (see a bug report from a year or so ago on Apple hardware).

[2] There's a lot of work being done by Peter Sewell's group and others currently on what are semantically valid transforms for the C11 memory model, so as long as we preserve those semantics in the IR we have lots of opportunities to improve optimisations.