Replacing Platform-Specific IR Codes with a Generic Implementation and Introducing Macro Facilities

Hi,

This might sound a bit controversial at this stage of maturity for LLVM, but could the community consider deprecating the architecture-specific features that have crept into the IR and replacing them with more generic IR constructs?

Could we also have some form of powerful macro facility, supporting platform-specific macros, which could be used to expand generic IR into a set of IR instructions with equivalent results and/or into the platform-specific implementation?

The aim would be that any code in LLVM IR could execute on any architecture, getting the best possible optimisation and the full set of functionality available on each architecture.

Suminda

> This might sound a bit controversial at this stage of maturity for LLVM,
> but could the community consider deprecating the architecture-specific
> features that have crept into the IR and replacing them with more generic
> IR constructs?

On a case-by-case basis, certainly. I'm very much in favour of general
mechanisms so that work doesn't have to be duplicated on all targets.

It doesn't make sense for everything though, particularly if you want
target-specific IR to simply not exist. What would you map ARM's
"ldrex" to on x86? Or an inline assembly block?

> Could we also have some form of powerful macro facility, supporting
> platform-specific macros, which could be used to expand generic IR into a
> set of IR instructions with equivalent results and/or into the
> platform-specific implementation?

LLVM IR isn't really for humans to write, so macros would likely not
be a popular idea. Reusable APIs to emit the correct target-specific
IR for features would probably be the LLVM equivalent.

We keep meaning to write one to handle the details of emitting
functions compliant with C ABIs, for example.

> The aim would be that any code in LLVM IR could execute on any
> architecture, getting the best possible optimisation and the full set of
> functionality available on each architecture.

Have you looked at what the PNaCl people are doing? As I recall they
have something close to a generic subset of LLVM IR which they then
compile to the various native object formats (they don't care about
matching any particular ABI except their own).

There's inevitably some kind of speed penalty, but I really like the idea.

Cheers.

Tim.

Some target-specific IR does not make sense on other platforms. Perhaps it
should be removed from LLVM and only generated during the compilation
process. Speed penalties should be handled by intelligent compilation. It
would be helpful if all of the IR were abstract rather than having
platform-specific IR: the final compilation step has more flexibility that
way, and more optimisation can be done. (There could be more efficient
alternatives than explicitly forcing the platform-specific assembly.) I see
platform-specific IR as being like "register" or "inline" in C++: with
clever compilation techniques they are not needed, and the compiler can do
a better job. Perhaps this is something to be explored in the long run.

This isn't a great example. Having load-linked / store-conditional in the IR would make a number of transforms related to atomics easier. We currently can't correctly model the weak compare-and-exchange from the C[++]11 memory model and we generate terrible code for a number of common atomic idioms on non-x86 platforms as a result.

David

> This isn't a great example. Having load-linked / store-conditional in the
> IR would make a number of transforms related to atomics easier. We
> currently can't correctly model the weak compare-and-exchange from
> the C[++]11 memory model and we generate terrible code for a number
> of common atomic idioms on non-x86 platforms as a result.

Actually, I really agree there. I considered it recently, but decided
to leave it as an intrinsic for now (the new IR expansion pass happens
after most optimisations so there wouldn't be much benefit, but if we
did it earlier and the mid-end understood what an ldrex/strex meant, I
could see code getting much better).

Load linked would be fairly easy (perhaps even written as "load
linked", a minor extension to "load atomic"). Store conditional would
be a bigger change since stores don't return anything at the moment;
passes may not be expecting to have to ReplaceAllUses on them.
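
To make that concrete, here's a purely hypothetical sketch (none of this is
valid IR today; the names and syntax are illustrative only) of an atomic add
built from a linked load and a flag-returning conditional store:

define i32 @atomic_add(i32* %p, i32 %v) {
entry:
  br label %loop

loop:
  ; Hypothetical: "load linked", a minor extension of "load atomic".
  %old = load linked i32* %p seq_cst, align 4
  %new = add i32 %old, %v
  ; Hypothetical: a store returning an i1 success flag; this is the part
  ; that breaks the current assumption that stores return nothing.
  %ok = store conditional i32 %new, i32* %p seq_cst
  br i1 %ok, label %done, label %loop

done:
  ret i32 %old
}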

I'm hoping to have some more time to spend on atomics soon, after this
merge business is done. Perhaps then.

I don't suppose you have any plans to port Mips to the IR-level LL/SC
expansion? Now that the infrastructure is present it's quite a
simplification (r206490 in ARM64 for example, though you need existing
target-specific intrinsics at the moment). It would be good to iron
out any ARM-specific assumptions I've made.

But it would still be a construct that probably just couldn't be used
on x86 efficiently, not really a step towards a target independent IR.

Cheers.

Tim.

What I meant by a macro facility is something along these lines. I must
think more on it, though.

http://luajit.org/dynasm_features.html

> Actually, I really agree there. I considered it recently, but decided
> to leave it as an intrinsic for now (the new IR expansion pass happens
> after most optimisations so there wouldn't be much benefit, but if we
> did it earlier and the mid-end understood what an ldrex/strex meant, I
> could see code getting much better).
>
> Load linked would be fairly easy (perhaps even written as "load
> linked", a minor extension to "load atomic"). Store conditional would
> be a bigger change since stores don't return anything at the moment;
> passes may not be expecting to have to ReplaceAllUses on them.

The easiest solution would be to extend the cmpxchg instruction with a weak variant. It is then trivial to map load, modify, weak-cmpxchg to load-linked, modify, store-conditional (that is what weak cmpxchg was intended for in the C[++]11 memory model).
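
As a sketch of what that might look like (illustrative syntax only, not an
existing instruction): since a weak cmpxchg may fail spuriously, comparing
the returned value against the expected value no longer detects success, so
it would need to return a separate success flag as well:

define void @mul(i32* %a, i32 %b) {
entry:
  %init = load atomic i32* %a seq_cst, align 4
  br label %loop

loop:
  %old = phi i32 [ %init, %entry ], [ %loaded, %loop ]
  %new = mul nsw i32 %old, %b
  ; Hypothetical weak form, returning a { loaded value, success } pair.
  %pair = cmpxchg weak i32* %a, i32 %old, i32 %new seq_cst
  %loaded = extractvalue { i32, i1 } %pair, 0
  %ok = extractvalue { i32, i1 } %pair, 1
  br i1 %ok, label %done, label %loop

done:
  ret void
}

On an LL/SC target the paired load then maps to ll / ldrex and the weak
cmpxchg to sc / strex, with no nested loop.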

> I'm hoping to have some more time to spend on atomics soon, after this
> merge business is done. Perhaps then.
>
> I don't suppose you have any plans to port Mips to the IR-level LL/SC
> expansion? Now that the infrastructure is present it's quite a
> simplification (r206490 in ARM64 for example, though you need existing
> target-specific intrinsics at the moment). It would be good to iron
> out any ARM-specific assumptions I've made.

I'd rather avoid it, because doing it that late precludes a lot of the optimisations that we're interested in. I'd much rather extend the IR to support them at a generic level.

We have a couple of plans for variations of atomic operations in our architecture, so we'll likely end up trying and throwing away a few approaches over the next couple of years.

> But it would still be a construct that probably just couldn't be used
> on x86 efficiently, not really a step towards a target independent IR.

On x86, we could map weak cmpxchg to the same thing as a strong cmpxchg, so it would still generate the same code. The same is true for all architectures with a non-blocking compare and exchange operation.

David

> The easiest solution would be to extend the cmpxchg instruction with a
> weak variant. It is then trivial to map load, modify, weak-cmpxchg to
> load-linked, modify, store-conditional (that is what weak cmpxchg was
> intended for in the C[++]11 memory model).

That would certainly be the easiest. But you'd get less scope for
optimising control flow around the instructions (say an early return
on failure or something). I think quite a bit can be done if LLVM
*really* knows what's going to be going on with these atomic ops on
LL/SC architectures.
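
For instance (reusing the hypothetical load-linked / store-conditional
syntax I sketched earlier in the thread; purely illustrative), a try-once
operation that inspects the loaded value and takes an early exit rather
than retrying becomes directly expressible:

define i1 @try_claim(i32* %p) {
entry:
  %old = load linked i32* %p seq_cst, align 4
  %free = icmp eq i32 %old, 0
  br i1 %free, label %try, label %fail

try:
  ; No retry loop: a failed store-conditional just reports failure.
  %ok = store conditional i32 1, i32* %p seq_cst
  ret i1 %ok

fail:
  ret i1 false
}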

>> I don't suppose you have any plans to port Mips to the IR-level LL/SC
>> expansion? Now that the infrastructure is present it's quite a
>> simplification (r206490 in ARM64 for example, though you need existing
>> target-specific intrinsics at the moment). It would be good to iron
>> out any ARM-specific assumptions I've made.

> I'd rather avoid it, because doing it that late precludes a lot of the
> optimisations that we're interested in. I'd much rather extend the IR to
> support them at a generic level.

I think you might be misinterpreting what the change actually is.
Currently the expansion happens post-ISel (emitAtomicBinary and
friends building the control flow and MachineInstrs directly).

This moves it to before ISel but still late in the pipeline (actually,
you could even put it earlier: I didn't because of fears of opaque
@llvm.arm.ldrex intrinsics pessimising mid-end optimisations).
Strictly earlier than what happens now, and a reasonable
stepping-stone to generic load-linked instructions or intrinsics.

In my experience, CodeGen has improved with the change. ISelDAG gets
to make use of more information when choosing how to do the operation:
values already known to be sign/zero extended, immediates, etc.

Tim.

>> The easiest solution would be to extend the cmpxchg instruction with a
>> weak variant. It is then trivial to map load, modify, weak-cmpxchg to
>> load-linked, modify, store-conditional (that is what weak cmpxchg was
>> intended for in the C[++]11 memory model).

> That would certainly be the easiest. But you'd get less scope for
> optimising control flow around the instructions (say an early return
> on failure or something). I think quite a bit can be done if LLVM
> *really* knows what's going to be going on with these atomic ops on
> LL/SC architectures.

I am not aware of any transforms that we'd want to do that aren't microarchitecture-specific and that need to know about the difference between ll-modify-sc and load-modify-weak-cmpxchg.

>>> I don't suppose you have any plans to port Mips to the IR-level LL/SC
>>> expansion? Now that the infrastructure is present it's quite a
>>> simplification (r206490 in ARM64 for example, though you need existing
>>> target-specific intrinsics at the moment). It would be good to iron
>>> out any ARM-specific assumptions I've made.

>> I'd rather avoid it, because doing it that late precludes a lot of the
>> optimisations that we're interested in. I'd much rather extend the IR
>> to support them at a generic level.

> I think you might be misinterpreting what the change actually is.
> Currently the expansion happens post-ISel (emitAtomicBinary and
> friends building the control flow and MachineInstrs directly).
>
> This moves it to before ISel but still late in the pipeline (actually,
> you could even put it earlier: I didn't because of fears of opaque
> @llvm.arm.ldrex intrinsics pessimising mid-end optimisations).
> Strictly earlier than what happens now, and a reasonable
> stepping-stone to generic load-linked instructions or intrinsics.

The problem is that the optimisations that we're most interested in should be done by the mid-level optimisers and are architecture agnostic.

> In my experience, CodeGen has improved with the change. ISelDAG gets
> to make use of more information when choosing how to do the operation:
> values already known to be sign/zero extended, immediates, etc.

Yes, it's definitely an improvement in the short term, but I'm not convinced by the approach in the long term. It's a useful hack that works around a shortcoming in the IR, not a solution.

David

>> In my experience, CodeGen has improved with the change. ISelDAG gets
>> to make use of more information when choosing how to do the operation:
>> values already known to be sign/zero extended, immediates, etc.
>
> Yes, it's definitely an improvement in the short term, but I'm not
> convinced by the approach in the long term. It's a useful hack that works
> around a shortcoming in the IR, not a solution.

Hmm, so it sounds like you're not actually after an IR-level LL/SC,
but a higher-level "cmpxchg weak". Fair enough, I suppose I'd
envisaged putting that burden on Clang.

Tim.

Yes. The weak cmpxchg is what the C[++]11 memory model provides, so there's a lot of existing work proving soundness for transforms involving it. Once it gets to the pre-codegen IR passes, it's trivial to map a load that's paired with a weak cmpxchg to an ll / ldrex and the cmpxchg to an sc / strex. This could be a generic IR pass parameterised with the names of the ll / sc intrinsics (or even some architecture-agnostic intrinsics for ll / sc, since they're fairly common), but ideally the optimisation would run on something that closely resembles the memory model of the source language. There are also microarchitectural optimisations that can happen later.
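
Roughly, using the ARM ldrex / strex intrinsics that exist today (and
omitting the fences needed for the seq_cst ordering), the lowered loop for
an atomic multiply would look something like:

declare i32 @llvm.arm.ldrex.p0i32(i32*)
declare i32 @llvm.arm.strex.p0i32(i32, i32*)

define void @mul(i32* %a, i32 %b) {
entry:
  br label %loop

loop:
  ; The paired load becomes a linked load...
  %old = call i32 @llvm.arm.ldrex.p0i32(i32* %a)
  %new = mul i32 %old, %b
  ; ...and the weak cmpxchg becomes a store-conditional, which returns
  ; 0 on success and 1 on failure.
  %failed = call i32 @llvm.arm.strex.p0i32(i32 %new, i32* %a)
  %retry = icmp ne i32 %failed, 0
  br i1 %retry, label %loop, label %done

done:
  ret void
}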

In clang currently, we approximate a weak cmpxchg with a strong cmpxchg, but that approximation is not quite semantically valid for all architectures (strong cmpxchg is permitted to block, weak is not) and is not ideal for optimisation either.

David

The IR is missing a weak variant of cmpxchg. But is there anything else missing at the IR level? My understanding was that LLVM’s atomic memory ordering constraints are complete, but that codegen is not highly optimized and may be conservative for some targets. Which idiom do you have trouble with on non-x86?

-Andy

The example from our EuroLLVM talk was this:

_Atomic(int) a; a *= b;

This is (according to the spec) equivalent to this (simplified slightly):

    int expected = a;
    int desired;
    do {
      desired = expected * b;
    } while (!atomic_compare_exchange_weak(&a, &expected, desired));

What clang generates is almost this, but with a strong compare and swap:

define void @mul(i32* %a, i32 %b) #0 {
entry:
  %atomic-load = load atomic i32* %a seq_cst, align 4, !tbaa !1
  br label %atomic_op

atomic_op:                                ; preds = %atomic_op, %entry
  %0 = phi i32 [ %atomic-load, %entry ], [ %1, %atomic_op ]
  %mul = mul nsw i32 %0, %b
  %1 = cmpxchg i32* %a, i32 %0, i32 %mul seq_cst
  %2 = icmp eq i32 %1, %0
  br i1 %2, label %atomic_cont, label %atomic_op

atomic_cont:                              ; preds = %atomic_op
  ret void
}

This maps trivially to x86:

LBB0_1:
  movl %ecx, %edx
  imull %esi, %edx
  movl %ecx, %eax
  lock
  cmpxchgl %edx, (%rdi)
  cmpl %ecx, %eax
  movl %eax, %ecx
  jne LBB0_1

For MIPS, what we *should* be generating is:

  sync 0          # Ensure all earlier loads / stores are globally visible
retry:
  ll $t4, 0($a0)  # Load-link the current value of the atomic int
  mult $t4, $a1   # Multiply by the other argument
  mflo $t4        # Get the result
  sc $t4, 0($a0)  # Try to write it back atomically ($t4 = 1 on success)
  beqz $t4, retry # If the store-conditional failed, try the whole thing again
  sync 0          # branch delay slot - ensure seq_cst behaviour here

What we actually generate is this:

# BB#0: # %entry
  daddiu $sp, $sp, -16
  sd $fp, 8($sp) # 8-byte Folded Spill
  move $fp, $sp
  addiu $3, $zero, 0
$BB0_1: # %entry
                                       # =>This Inner Loop Header: Depth=1
  ll $2, 0($4)
  bne $2, $3, $BB0_3
  nop
# BB#2: # %entry
                                       # in Loop: Header=BB0_1 Depth=1
  addiu $6, $zero, 0
  sc $6, 0($4)
  beqz $6, $BB0_1
  nop
$BB0_3: # %entry
  sync 0
$BB0_4: # %atomic_op
                                       # =>This Loop Header: Depth=1
                                       # Child Loop BB0_5 Depth 2
  move $3, $2
  mul $6, $3, $5
  sync 0
$BB0_5: # %atomic_op
                                       # Parent Loop BB0_4 Depth=1
                                       # => This Inner Loop Header: Depth=2
  ll $2, 0($4)
  bne $2, $3, $BB0_7
  nop
# BB#6: # %atomic_op
                                       # in Loop: Header=BB0_5 Depth=2
  move $7, $6
  sc $7, 0($4)
  beqz $7, $BB0_5
  nop
$BB0_7: # %atomic_op
                                       # in Loop: Header=BB0_4 Depth=1
  sync 0
  bne $2, $3, $BB0_4
  nop
# BB#8: # %atomic_cont
  move $sp, $fp
  ld $fp, 8($sp) # 8-byte Folded Reload
  jr $ra
  daddiu $sp, $sp, 16

For correctness, we *have* to implement the cmpxchg in the IR as an ll/sc loop, and so we end up with a nested loop for something that is a single line in the source.

The idiom of the weak compare and exchange loop is a fairly common one, but we generate spectacularly bad code for it.

David

> The IR is missing a weak variant of cmpxchg. But is there anything else
> missing at the IR level? My understanding was that LLVM’s atomic memory
> ordering constraints are complete, but that codegen is not highly
> optimized and may be conservative for some targets. Which idiom do you
> have trouble with on non-x86?

For myself, I don't like the fact that LLVM's atomicrmw & cmpxchg
instructions are so beholden to C. With suitable constraints, an
"atomicrmw (int x) { ... }" isn't unreasonable; but this can only be
mapped to a cmpxchg loop with the current IR.
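
Purely as a strawman (this syntax does not exist), I mean something like:

  %old = atomicrmw (i32 %x) {
    %new = mul i32 %x, %b
    ret i32 %new
  } i32* %p seq_cst

i.e. an arbitrary, suitably constrained update region rather than a fixed
list of operations, which an LL/SC target could lower directly instead of
going via a cmpxchg loop.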

Tim.

Another class of problems is ABI compatibility. While procedure calls
could generally be dealt with by an IR wrapper, different languages
have different meanings for structures, bitfields, unions, lists.

If the IR has only one type of "structure", padding and casts have to
be made to accommodate language ABIs. If the IR has only one type of
"bitfield", it'll behave differently on different architectures and
there will be no way to represent it in the IR, unless you introduce
target specific attributes.

LLVM IR is not the same as Java bytecode. It doesn't have the same
purpose, and making it more like bytecode will make it harder to
reason about intermediate level semantics.

There are a number of things that could be simplified with IR wrappers
to lower high-level (target-independent) IR into low-level (target-specific)
IR, which would make it a lot easier to port IR between
architectures, but I don't believe it's possible to do that for *all*
classes of architectural problems that we currently have without
transforming it into something like Java bytecode.

Even though "virtual machine" is part of the name, the focus on
running IR on a virtual machine is ancient history. There are still
many people doing it, but LLVM is now much more a compilation
infrastructure than anything else. Kind of "build your own compiler"
or "mix-and-match".

cheers,
--renato

LLVM is not presently an acronym, ergo, there is no "virtual machine" in its name.