Predication on SIMD architectures and LLVM

Hello,
I'm working on a compiler based on LLVM for a SIMD architecture that supports instruction predication. We would like to implement branching on this architecture using predication.
As you know, LLVM IR doesn't support instruction predication, so I'm not exactly sure what the best way to implement it is.
We came up with some ways to do it in LLVM:

- Do not add any predication in the IR (except for loads and stores through intrinsics), linearize the branches, and substitute PHI nodes with selects for merging values (a minimal sketch of this is given right after the list). In the backend, we would then custom lower the select instruction to produce a predicated mov that chooses the right version of the value. I think this option doesn't make use of the possible benefits of the architecture we are targeting at all.

- Another way could be adding intrinsics for all instructions in the target to make them support predication, still linearize all the branches, but use instruction predication instead of generating cmovs. The backend would then custom lower almost every instruction into predicated custom nodes that are matched through tablegen patterns. We could generate these intrinsics in the same IR pass that linearizes the branches.

- Make a custom backend that directly outputs predicated instructions (we really only need one type of predicate, so every instruction could use that kind of predicate), but I think this is a nasty solution.
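
For illustration, here is a minimal sketch of what the first option could look like (function names and types are purely illustrative): a scalar function with control flow, and the linearized, vectorized form in which the phi has become a select that the backend would lower to a predicated mov.

  ; Scalar code with control flow:
  define i32 @scalar_kernel(i32 %a, i32 %b) {
  entry:
    %p = icmp sgt i32 %a, %b
    br i1 %p, label %then, label %else
  then:
    %t = add i32 %a, %b
    br label %merge
  else:
    %e = sub i32 %a, %b
    br label %merge
  merge:
    %r = phi i32 [ %t, %then ], [ %e, %else ]
    ret i32 %r
  }

  ; Linearized, vectorized form: both paths execute for all lanes, the
  ; phi becomes a select, and the backend lowers the select to a
  ; predicated mov:
  define <4 x i32> @vectorized_kernel(<4 x i32> %a, <4 x i32> %b) {
  entry:
    %mask = icmp sgt <4 x i32> %a, %b
    %t = add <4 x i32> %a, %b
    %e = sub <4 x i32> %a, %b
    %r = select <4 x i1> %mask, <4 x i32> %t, <4 x i32> %e
    ret <4 x i32> %r
  }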

Has anyone already tried to do this in LLVM, and if so, what solution(s) did you use to solve the problem?

Regards,
Marcello

Hi Marcello,

I am sure I've seen some postings on the list concerning architectures that support predicated execution and how to map that to LLVM IR, I'm just not sure anymore when and who was involved :).

I have implemented your first suggestion for targets that do not have predicated instructions (where control-flow to data-flow conversion with explicitly maintained masks and blend operations is the only option you have for SIMD vectorization). However, I agree with you that this is not a good solution for you, since you would basically ignore one of the strengths of your platform. Should you still need some code or help with this, feel free to drop me a message.

Cheers,
Ralf

We are currently doing something similar to your third option in the Hexagon
backend. But it is a VLIW, so predication is not the only reason for that.

Sergei

Hello,
I'm working on a compiler based on LLVM for a SIMD architecture that
supports instruction predication. We would like to implement
branching on this architecture using predication.
As you know, LLVM IR doesn't support instruction predication, so
I'm not exactly sure what the best way to implement it is.
We came up with some ways to do it in LLVM:

- Do not add any predication in the IR (except for loads and stores
through intrinsics), linearize the branches, and substitute PHI nodes
with selects for merging values. In the backend, we would then
custom lower the select instruction to produce a predicated mov that
chooses the right version of the value. I think this option doesn't
make use of the possible benefits of the architecture we are
targeting at all.

You may want to look at the IfConversion pass
(lib/CodeGen/IfConversion.cpp). This converts branches to predicated
instructions, and you may be able to use it for all branching if you
teach it to maintain a predicate stack. I actually looked into doing
this for the newest generation of GPUs (Southern Islands) supported by
the R600[1] backend which use predication for all branching, but opted
to go with a target specific pass until the backend is more stable.

-Tom

[1] http://cgit.freedesktop.org/~tstellar/llvm/tree/lib/Target/AMDGPU

Hi Wen-Ren,

thank you for your link; it seems like an interesting read on the topic!

Marcello

Hi Ralf,

Yeah, I've checked whether there was any kind of reference to that on the list, but I only found scattered and incomplete information (maybe some good stuff is there, just very well hidden; I will try to check again anyway).

About your work on predication: I know of your IR-level if-conversion and vectorization, and a lot of my knowledge on the matter actually comes from your talk at the last Euro-LLVM conference in London, so thanks for that! Also, thank you for being willing to discuss the matter further!

Marcello

Hello Sergei,

I actually don't know the Hexagon platform very well; I only briefly checked its backend for reference on a few things. Our target also has some VLIW features, but I hope we will not have to end up with the third option; I would like to keep it as a last resort.

Marcello

Hello Tom,

So basically, what you are doing in your AMDGPU backend is generating machine code as if it were a normal target (with diverging branches and so on), and then, through a custom post-ISel machine pass, you do the if-conversion, linearizing and predicating the branches. Am I right? That seems like a much easier approach than doing it at the IR level (because you don't have to add intrinsics to predicate your instructions).

Marcello

Hi,

I’ve done work on predicated SIMD representations for LLVM.

If you search through the archives, you may find my “applymask” proposal, which is an attempt at representing predication in a very comprehensive way. I’ve since stopped pushing the proposal in part because Larrabee’s changing fortunes led to a decline of interest at the time, in part because the proposal doesn’t look intuitive to people who don’t have experience in SIMD programming, and in part because there were some technical issues with my actual proposal (although I believe solutions could be found).

And, in part because a popular trend seems to be to have SIMD units which don’t trap or raise exception flags on arithmetic and which don’t go faster when predicated, such that there’s no reason to predicate anything except stores and occasionally loads. On these architectures, simply having intrinsics for stores, and perhaps loads, is basically sufficient, and less invasive.

And, in part because predication is another wrinkle for SIMD performance portability. As people start caring more about SIMD performance, there will be more pressure to tune SIMD code in target-specific ways, and it erodes the benefit of a target-independent representation. This is a complex topic though, and there are multiple considerations, and not everyone agrees with me here.

One thing that’s initially counter-intuitive is that SIMD predication cannot be done in the same way as scalar or VLIW predication, where the majority of the compiler works as if it’s on a “normal” scalar machine and predication happens during codegen, where the optimizer doesn’t have to think about it. SIMD predication must be applied by whatever code is producing SIMD instructions, and in LLVM, that’s typically in the optimizer or earlier.

Dan

Ralf Karrenberg <Chareos@gmx.de> writes:

I am sure I've seen some postings on the list concerning architectures
that support predicated execution and how to map that to LLVM IR, I'm
just not sure anymore when and who was involved :).

I was one of them. I suggested adding general predication to the LLVM
IR but that doesn't look like it's going to happen. Dan Gohman had
another idea on how to represent predicate masks but that also didn't
really go anywhere.

None of your proposed solutions is ideal. We really should have
first-class predication in the IR. It's only going to get more
important.

                      -David

Dan Gohman <dan433584@gmail.com> writes:

And, in part because a popular trend seems to be to have SIMD units
which don't trap or raise exception flags on arithmetic and which don't
go faster when predicated, such that there's no reason to predicate
anything except stores and occasionally loads. On these architectures,
simply having intrinsics for stores, and perhaps loads, is basically
sufficient, and less invasive.

This is going to change. Intel recently released the ISA for Knights
Corner, a machine with general predication for SIMD.

http://software.intel.com/en-us/forums/topic/278102

And, in part because predication is another wrinkle for SIMD
performance portability. As people start caring more about SIMD
performance, there will be more pressure to tune SIMD code in
target-specific ways, and it erodes the benefit of a
target-independent representation. This is a complex topic though, and
there are multiple considerations, and not everyone agrees with me
here.

It's true that a target-independent predicated IR isn't going to
translate well to a target that doesn't have predication. However, for
targets that do it's a godsend.

One thing that's initially counter-intuitive is that SIMD predication
cannot be done in the same way as scalar or VLIW predication, where
the majority of the compiler works as if it's on a "normal" scalar
machine and predication happens during codegen, where the optimizer
doesn't have to think about it. SIMD predication must be applied by
whatever code is producing SIMD instructions, and in LLVM, that's
typically in the optimizer or earlier.

Yep. This is why I think IR support is essential.

                                 -David

Perhaps I am missing something, but isn't a predicated instruction effectively a single-instruction version of an arithmetic operation followed by a select? As we can already represent this in the IR, and already match other predicated instructions (e.g. on ARM) to this pattern, what is gained by adding predication directly to the IR?

David

David Chisnall <David.Chisnall@cl.cam.ac.uk> writes:

Perhaps I am missing something, but isn't a predicated instruction
effectively an single-instruction version of an arithmetic operation
followed by a select?

No, it is not. Among other things, predication is used to avoid traps.
A vector select is an entirely different operation.

As we can already represent this in the IR, and already match other
predicated instructions (e.g. on ARM) to this pattern, what is gained
by adding predication directly to the IR?

Predicated loads, stores, divides, sqrts, etc. are essential for
correctly vectorizing loops with conditionals due to safety concerns.
If the loop body has no dangerous operations, then yes, a vector select
can be used without problems but it is often slower than predication.
Usually the hardware can optimize instructions with certain values of
predicates.
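
To make the safety point concrete, here is a rough sketch (types and
values purely illustrative) for a loop body like
"if (b[i] != 0) a[i] = x[i] / b[i];":

  ; Compute-then-select: the divide executes in every lane, including
  ; the lanes the source code guarded, so it can trap:
  %m = icmp ne <4 x i32> %b, zeroinitializer
  %q = sdiv <4 x i32> %x, %b
  %r = select <4 x i1> %m, <4 x i32> %q, <4 x i32> %a
  ; A predicated/masked divide would evaluate only the lanes enabled
  ; by %m.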

                              -David

I am talking about the LLVM select instruction, not a vector select:

http://llvm.org/docs/LangRef.html#i_select

In any non-trapping case, an arithmetic operation (or sequence of operations) followed by a select is semantically equivalent to the predicated version. This is exactly how predicated instructions on ARM are handled. For example, the following IR:

  %cmp = icmp sgt i32 %c, %b
  %add = add nsw i32 %b, 1
  %add1 = add nsw i32 %c, 2
  %retval.0 = select i1 %cmp, i32 %add, i32 %add1

Becomes this ARM assembly:

  add r2, r1, #2
  cmp r1, r0
  addgt r2, r0, #1
  mov r0, r2

An equally valid form would be:

  cmp r1, r0
  addle r2, r1, #2
  addgt r2, r0, #1
  mov r0, r2

Separating the select, which embodies the predication, from the operations allows more choice in terms of the final representation. Unless the load or store is volatile, the compiler is free to elide it if its result is not used, and is most definitely free to fold it into a predicated load. The same is obviously true of any side-effect-free operations, such as divides and square roots: folding them into predicated instructions is no less valid than conditionally executing them in branches or removing them entirely via dead code elimination.

Just because the generated machine code must contain predicated instructions most definitely does not mean that the LLVM IR must contain it, or even that we would gain anything in terms of expressive power by permitting it.

David

It’s true that a target-independent predicated IR isn’t going to
translate well to a target that doesn’t have predication. However, for
targets that do it’s a godsend.

Even for MIC (Xeon Phi), the predicated IR is not necessary. The instructions that really benefit from predication are loads and stores. MIC masks are write masks, but even if they were to help the performance of predicated instructions, there are other ways to do this. One way would be to implement masked load and masked store intrinsics, and to place ‘select’ instructions in strategic locations: before instructions that may fault, before phi-nodes, etc. A pre-register-allocation pass can propagate the masks to all of the instructions that need them. But this is theoretical, since only loads and stores really benefit from predication.
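
As a rough sketch of that scheme (the masked store below is a
stand-in for a hypothetical target intrinsic, not an existing LLVM
intrinsic; names and types are purely illustrative):

  declare void @target.masked.store.v4i32(<4 x i32>, <4 x i32>*, <4 x i1>)

  define <4 x i32> @masked_update(<4 x i32>* %p, <4 x i32> %old,
                                  <4 x i32> %v, <4 x i1> %mask) {
    ; the add cannot fault, so it is left unmasked
    %sum = add <4 x i32> %v, %old
    ; masked store: only the lanes enabled in %mask touch memory
    call void @target.masked.store.v4i32(<4 x i32> %sum, <4 x i32>* %p, <4 x i1> %mask)
    ; select standing in for the phi that would merge the two paths
    %merge = select <4 x i1> %mask, <4 x i32> %sum, <4 x i32> %old
    ret <4 x i32> %merge
  }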

Yep. This is why I think IR support is essential.

I don't think that we need to change the IR, even for a predicated architecture such as MIC.

Nadav Rotem <nrotem@apple.com> writes:

One way would be to implement masked load and masked store
intrinsics, and to place 'select' instructions in strategic locations:
before instructions that may fault, before phi-nodes, etc. A
pre-register allocation pass can propagate the masks to all of the
instructions that need them.

How does this work if the load is not conditional but trapping
operations that use the loaded values are conditional?
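
For instance (a rough sketch, with illustrative types and values),
with an unconditional load feeding a conditional divide, a select
would have to guard the divide's operand (e.g. by substituting a safe
divisor), not just its result:

  %v = load <4 x i32>* %p        ; unconditional, always safe
  %m = icmp ne <4 x i32> %v, zeroinitializer
  %safe = select <4 x i1> %m, <4 x i32> %v, <4 x i32> <i32 1, i32 1, i32 1, i32 1>
  %q = sdiv <4 x i32> %x, %safe  ; conditional in the original source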

Yes, such propagation can probably be done but it's painful and every
predicated target would have to implement it. It's much easier to just
select the right operation in isel, I think, and that seems to require
IR support.

                              -David

David Chisnall <David.Chisnall@cl.cam.ac.uk> writes:

I am talking about the LLVM select instruction, not a vector select:

http://llvm.org/docs/LangRef.html#i_select

That is what I mean by a vector select.

In any non-trapping case, an arithmetic operation (or sequence of
operations) followed by a select is semantically equivalent to the
predicated version.

Yes.

Separating the select, which embodies the predication, from the
operations allows more choice in terms of the final representation.

Sure.

Just because the generated machine code must contain predicated
instructions most definitely does not mean that the LLVM IR must contain
it, or even that we would gain anything in terms of expressive power
by permitting it.

Certainly such transformations *can* be done, but is it the most
efficient/best way to do things? I wonder how many different passes of
"select to predication" we will end up having, one per target.

                           -David

I don't understand what the problem with this is. Different targets
have different predication rules, so they need different instruction
selection from the IR to produce code.

Whether this is an IR pass or a DAG selection step, I don't know. I'd
think it would be the latter, though, as that's where the target-specific
code is, but there might be other IR passes that need this information
before DAG selection.