Predicated Vector Operations

I’m trying to understand how predicated/masked instructions can be generated in LLVM, specifically an instruction where a set bit in the mask will write the new result into the corresponding vector lane in the destination and a clear bit will cause the lane in the destination to remain what it was before the instruction executed.

I’ve seen a few places that suggest ‘select’ is the proper way to implement predication. I believe the predicated form cannot be explicitly expressed in LLVM asm, because it is SSA. It can be done implicitly:

%sum = add <16 x i32> %x, %y
%newvalue = select <16 x i1> %mask, <16 x i32> %sum, <16 x i32> %oldvalue

The issue becomes how to match the instruction form above in a TableGen pattern. In order for this to emit a masked instruction, %newvalue and %oldvalue must be assigned to the same physical register (I’m assuming an instruction like ‘add %r0{%m0} %r1 %r2’). However, I don’t think there is even a notion of physical registers at the point that instruction selection is performed, and the virtual registers will be different because everything is still in SSA form.

I suspect I’m missing something fundamental.

Jeff Bush <jeffbush001@gmail.com> writes:

I'm trying to understand how predicated/masked instructions can be
generated in LLVM, specifically an instruction where a set bit in the
mask will write the new result into the corresponding vector lane in
the destination and a clear bit will cause the lane in the destination
to remain what it was before the instruction executed.

I've seen a few places that suggest 'select' is the proper way to
implement predication. I believe the predicated form cannot be
explicitly expressed in LLVM asm, because it is SSA. It can be done
implicitly:

%sum = add <16 x i32> %x, %y
%newvalue = select <16 x i1> %mask, <16 x i32> %sum, <16 x i32> %oldvalue

This is not necessarily sufficient in general. For any operation that
can trap (pretty much any fp operation) you also need to protect the
inputs with safe values to adhere to the semantics of the LLVM IR. For
an fadd, for example:

%tx = select %mask, %x, <0.0, 0.0, 0.0 ...>
%ty = select %mask, %y, <0.0, 0.0, 0.0 ...>
%sum = fadd %tx, %ty
%newvalue = select %mask, %sum, %oldvalue

Then this entire pattern of selects and fadd will have to be matched in
isel, which would throw away the safe value protection overhead (because
the hardware masking takes care of the safety).

In fact the problem is more general. Even in the integer add case, you
have to ensure that llvm won't move the definition of %sum before the
definition of %mask. Otherwise you're going to end up generating a
predicated add with a bad mask value. Selects on the inputs to the
integer add also solve this problem.
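Concretely, the fully guarded integer form mirrors the fadd example above (a minimal sketch; zeroinitializer is an arbitrary safe value, since an integer add cannot trap):

%tx = select <16 x i1> %mask, <16 x i32> %x, <16 x i32> zeroinitializer
%ty = select <16 x i1> %mask, <16 x i32> %y, <16 x i32> zeroinitializer
%sum = add <16 x i32> %tx, %ty
%newvalue = select <16 x i1> %mask, <16 x i32> %sum, <16 x i32> %oldvalue

The input selects make %sum data-dependent on %mask, so nothing can hoist the add above the mask definition.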

The issue becomes how to match the instruction form above in a
TableGen pattern. In order for this to emit a masked instruction,
%newvalue and %oldvalue must be assigned to the same physical register (I'm
assuming an instruction like 'add %r0{%m0} %r1 %r2'). However, I don't
think there is even a notion of physical registers at the point that
instruction selection is performed and the virtual registers will be
different because everything is still in SSA form.

Potentially you could use the "$src = $dst" constraint as in the
two-address x86 forms. I don't know that TableGen has been generalized
enough to do this, though. I think it's pretty highly specialized to
specific x86 two-address forms at the moment.

Another problem is how to express %oldvalue in LLVM IR. Presumably
%newvalue will be consumed, possibly by another arithmetic operation.
Presumably %oldvalue can similarly come from a previous arithmetic
operation feeding into the add. If that's true, then %oldvalue is
either %x or %y. Otherwise it is some other value that is highly
context-dependent.

The gpuocelot project ran into this problem and talks about it here:

http://code.google.com/p/gpuocelot/source/browse/wiki/LLVM.wiki?r=272

The bottom line is that it is probably easier to set this up before LLVM
IR goes into SSA form.

There is a lot of interest in predication and a lot of recent
discussions about handling it in LLVM. Personally I think that
long-term we will need some IR changes. It might be as simple as adding
an IR-level predicated load and predicated store, I'm not sure.

We're pretty heavily into trying to get predicated vector code working
well in LLVM here. Expect to see many questions, code examples and
hopefully upstream code submissions over the next couple of months.

                          -David

I think that selects on the inputs and outputs of instructions are a good abstraction, and I don’t think that we need to add a mask operand to every LLVM IR instruction. However, we do need support for masked loads/stores, and I think that we should implement them as target-independent intrinsics. I don’t want to change the generic LLVM IR Load/Store instructions because I don’t want to modify all of the existing optimizations, and I also don’t want other users of the compiler to pay for this complexity.
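For illustration only, such intrinsics might look like this (the names and signatures here are hypothetical, not an existing interface):

declare <16 x i32> @llvm.masked.load.v16i32(<16 x i32>*, <16 x i1>, <16 x i32>)
declare void @llvm.masked.store.v16i32(<16 x i32>, <16 x i32>*, <16 x i1>)

; the third operand of the load supplies the values of masked-off lanes
%v = call <16 x i32> @llvm.masked.load.v16i32(<16 x i32>* %ptr,
                                              <16 x i1> %mask,
                                              <16 x i32> %oldvalue)
call void @llvm.masked.store.v16i32(<16 x i32> %v, <16 x i32>* %ptr,
                                    <16 x i1> %mask)

To existing passes these look like opaque calls that may read or write memory, which is exactly the conservative default treatment described below.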

Nadav Rotem <nrotem@apple.com> writes:

First statement: I'm not disagreeing. :)

That said, wouldn't a new intrinsic necessarily require all of the
passes to handle it? I'm just curious what you see the tradeoffs being
here. I'd have thought adding a mask to Load/Store that just isn't
handled would be equivalent?

-eric

Hi Eric,

Most passes won’t have to handle the load/store intrinsics because they will look like regular function calls that read/write from memory. We don’t need to change Reg2Mem or other passes that really can’t do anything about masked memory operations. On the other hand, if we do change the Load/Store instruction we will have to review all of our existing optimizations. For example, some optimizations assume that a store instruction actually writes and invalidates the memory location. However, this assumption is incorrect if we are using a mask.

Thanks,
Nadav

Nadav Rotem <nrotem@apple.com> writes:

Most passes won't have to handle the load/store intrinsics because
they will look like regular function calls that read/write from
memory. We don't need to change Reg2Mem or other passes that really
can't do anything about masked memory operations. On the other hand,
if we do change the Load/Store instruction we will have to review all
of our existing optimizations. For example, some optimizations assume
that a store instruction actually writes and invalidates the memory
location. However, this assumption is incorrect if we are using a
mask.

Yes, this makes sense.

What happens if we add a new first-class IR instruction? I suppose
passes that have visitor-like behavior will have to add code to
handle it, but that code could simply treat it like an intrinsic,
an unknown black box.

I suppose the fact that intrinsics can be attributed so passes
automatically assume they read and/or write memory is an advantage in
that nothing special has to be done to passes to make them handle it
conservatively.

Just thinking out loud. There's something not quite sitting right with
me about making masked load and store be intrinsics but I'm not sure
what it is. It's likely due to not completely understanding the
semantics of generic intrinsics.

I would like to know how people determine whether a new instruction-like
IR change should take the form of an intrinsic or a first-class
Instruction.

                             -David

I can almost see that, but how is the intrinsic any different from a
conservative width for stores/loads where they're not handled by an
optimization pass? I'm assuming I'm missing something here.

-eric

I don’t understand what you mean by “conservative width”.

Thinking that a masked store is conservatively a store of the full
width of the store, right?

But Jim pointed out that anything merging loads would then need to
merge the masks: even if selection would otherwise work, any pass
that merges loads would need to learn how to deal with masks. Not
likely a deal killer since I don't think there are a lot of them, but
it does explain why it's more work than having them pass through.
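For example (sketched with hypothetical masked-load intrinsics), merging two adjacent <4 x i32> loads into one <8 x i32> load also has to concatenate the two masks:

%lo = call <4 x i32> @llvm.masked.load.v4i32(<4 x i32>* %p, <4 x i1> %m1, <4 x i32> undef)
%hi = call <4 x i32> @llvm.masked.load.v4i32(<4 x i32>* %q, <4 x i1> %m2, <4 x i32> undef)
; %q is %p plus four elements; a merged load needs the combined mask too:
%m = shufflevector <4 x i1> %m1, <4 x i1> %m2,
                   <8 x i32> <i32 0, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7>
%wide = call <8 x i32> @llvm.masked.load.v8i32(<8 x i32>* %pw, <8 x i1> %m, <8 x i32> undef)
; (%pw is %p reinterpreted as a pointer to the wider vector type)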

-eric

Thinking that a masked store is conservatively a store of the full
width of the store, right?

It depends on the optimization. Consider this example:

masked_store(Val, Ptr, M)
X = masked_load(Ptr, M2)

If you assume that your store actually overwrites everything in that memory location then you don’t need to load that memory location again. You can simply use the stored value. However, in our example X != Val.
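With concrete, disjoint masks the problem is easy to see (a sketch in the same pseudocode):

masked_store(Val, Ptr, <1, 1, 0, 0>)  ; writes lanes 0 and 1 only
X = masked_load(Ptr, <0, 0, 1, 1>)    ; reads lanes 2 and 3 only

A pass that forwards the stored value would conclude X == Val, but lanes 2 and 3 of X actually come from whatever was in memory before the store.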

But Jim pointed out that anything merging loads would then need to
merge the masks: even if selection would otherwise work, any pass
that merges loads would need to learn how to deal with masks. Not
likely a deal killer since I don’t think there are a lot of them, but
it does explain why it’s more work than having them pass through.

I actually think that masks disrupt everything from alias analysis to SROA. If you ignore the mask, bad things will happen. You can’t be conservative, because you never know which way is ‘conservative’. Should you assume that the mask is all zeros or all ones?

I was thinking that they'd not be optimized, but you are correct that
it'll be invasive - examples are illuminating. :)

I'm less convinced it's everything, but it's definitely enough that
care should be taken before adding a mask field to load/stores. It
will involve quite a bit of work and the tradeoffs should be well
known; i.e. we should know that we're going to get something out of it
more so than we would if we added it as a "target independent
intrinsic" or some such. Maybe in the long run the ability to treat
vector converts+masked loads/stores as shuffles could be useful, I
don't know either.

-eric

Thinking that a masked store is conservatively a store of the full
width of the store, right?

It depends on the optimization. Consider this example:

masked_store(Val, Ptr, M)
X = masked_load(Ptr, M2)

If you assume that your store actually overwrites everything in that
memory location then you don't need to load that memory location again. You
can simply use the stored value. However, in our example X != Val.

I'm not sure I understand the full impact of this example, and I would like
to.

What are the desired memory model semantics for a masked store?
Specifically, let me suppose a simplified vector model of <2 x i64> on an
i64-word-size platform.

  masked_store(<42, 42>, Ptr, <true, false>)

Does this write to the entire <2 x i64> object stored at Ptr or not? Put
another way, consider:

  thread A:
    ...
    masked_store(<42, 42>, Ptr, <true, false>)
    ...

  thread B:
    ...
    masked_store(<42, 42>, Ptr, <false, true>)
    ...

Assuming there is no specific synchronization relevant to Ptr between these
two threads and their masked stores, does this form a data race or not?

From a memory model perspective, if this does *not* form a data race,
that makes this tremendously more complex to implement, analyze, and
optimize... I'm somewhat hopeful that the desired semantics are for
this to form a data race (and thus require synchronization when
occurring in different threads like this).

The issue becomes how to match the instruction form above in a
TableGen pattern. In order for this to emit a masked instruction,
%newvalue and %oldvalue must be assigned to the same physical register (I'm
assuming an instruction like 'add %r0{%m0} %r1 %r2'). However, I don't
think there is even a notion of physical registers at the point that
instruction selection is performed and the virtual registers will be
different because everything is still in SSA form.

Potentially you could use the "$src = $dst" constraint as in the
two-address x86 forms. I don't know that TableGen has been generalized
enough to do this, though. I think it's pretty highly specialized to
specific x86 two-address forms at the moment.

I'm not familiar with the two-address constraint, but it seems like
this would be challenging to express generally. Consider the previous
example:

%tx = select %mask, %x, <0.0, 0.0, 0.0 ...>
%ty = select %mask, %y, <0.0, 0.0, 0.0 ...>
%sum = fadd %tx, %ty
%newvalue = select %mask, %sum, %oldvalue

I believe the generated instructions depend on whether %oldvalue is
still live after the last instruction. If it is, you need to generate
two instructions: a copy into a new physical register, then a
predicated write to it. If it is not used, then it is just a predicated
write to the same register.

  move r1, r0
  fadd r1{m0}, r2, r3

(r0 is now %oldvalue and r1 is %newvalue)

vs.

  fadd r0{m0}, r2, r3

(r0 was %oldvalue and is now %newvalue)

The bottom line is that it is probably easier to set this up before LLVM
IR goes into SSA form.

That makes sense, but it's unclear to me how you would preserve that
information after going into SSA form.

There is a lot of interest in predication and a lot of recent
discussions about handling it in LLVM. Personally I think that
long-term we will need some IR changes. It might be as simple as adding
an IR-level predicated load and predicated store, I'm not sure.

It seems to me that these are not really LLVM issues as much as the
fact that SSA doesn't cleanly map to predicated instructions. For
example, if predicates were hypothetically added universally to the IR
(which I don't think anyone wants to do), it's not clear to me how
that would even work. How would you specify what value the result
would receive for non-enabled lanes? Perhaps another parameter:

  %newvalue = fadd %x, %y, %mask, %previousvalue

Yuck.

Hi Chandler,

I brought the example in this email thread to show that the optimizations that we currently have won’t work on masked load/store operations, because they don’t take the mask into consideration. The memory model is an interesting question, but I am not sure how it is related. In our example you can see the problem with a single thread. Both MIC and AVX[1] have masked store operations, and they have different memory models.

Thanks,
Nadav

[1] http://software.intel.com/sites/products/documentation/studio/composer/en-us/2011Update/compiler_c/intref_cls/common/intref_avx_maskstore_pd.htm

Chandler Carruth <chandlerc@google.com> writes:

What are the desired memory model semantics for a masked store?
Specifically, let me suppose a simplified vector model of <2 x i64> on
an i64-word-size platform.

masked_store(<42, 42>, Ptr, <true, false>)

Does this write to the entire <2 x i64> object stored at Ptr or not?

No. It writes one element.
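Under those semantics, the example is equivalent to a scalar store of the single active element - a sketch:

%p0 = bitcast <2 x i64>* %Ptr to i64*
store i64 42, i64* %p0  ; lane 0 (mask bit true) is written;
                        ; lane 1 (mask bit false) is not accessed at all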

Put another way, consider:

thread A:
...
masked_store(<42, 42>, Ptr, <true, false>)
...

thread B:
...
masked_store(<42, 42>, Ptr, <false, true>)
...

Assuming there is no specific synchronization relevant to Ptr between
these two threads and their masked stores, does this form a data race
or not?

It entirely depends on the hardware implementation. In most cases I
would say yes, due to cache coherence issues. On a purely theoretical
machine that doesn't have false sharing, there would be no data race.

Of course this assumes that thread B won't access the element stored by
thread A and vice versa.

From a memory model perspective, if this does *not* form a data race,
that makes this tremendously more complex to implement, analyze, and
optimize... I'm somewhat hopeful that the desired semantics are for
this to form a data race (and thus require synchronization when
occurring in different threads like this).

Most of the time the compiler will not know the mask value and will have
to be conservative. As Nadav has pointed out, what constitutes
"conservative" is entirely context-dependent.

But I don't understand why defining this as not being a data race would
complicate things. I'm assuming the mask values are statically known.
Can you explain a bit more?

                                 -David

It's an interesting question for autovectorization, for example.

Thread A:
   for (i=0;i<n;++i)
      if (i&1)
        X[i] = 0;

Thread B:
   for (i=0;i<n;++i)
      if (!(i&1))
        X[i] = 1;

The threads run concurrently without synchronization. As written, there is
no race. Can you vectorize either of these loops? If masked-out elements of
a predicated store are "in play" for racing, then vectorizing would
introduce a race. And, it'd be hard for an optimizer to prove that this
doesn't happen.
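For concreteness, a masked-store vectorization of thread A's loop body (a sketch with a hypothetical intrinsic, assuming X is an array of i32) would issue vector stores that span elements owned by thread B:

; four iterations of thread A at a time; the mask enables odd lanes only
call void @llvm.masked.store.v4i32(<4 x i32> zeroinitializer,
                                   <4 x i32>* %chunk,  ; covers X[i..i+3], i even
                                   <4 x i1> <i1 false, i1 true, i1 false, i1 true>)

If the masked-out (even) lanes count as written, each such store races with thread B's stores to those elements.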

Dan

p.s. Yes, you could also vectorize these with a strided store or a scatter,
but then it raises a different question, of the memory semantics for
strided or scatter stores.

Dan Gohman <dan433584@gmail.com> writes:

    But I don't understand why defining this as not being a data race
    would complicate things. I'm assuming the mask values are
    statically known. Can you explain a bit more?

It's an interesting question for autovectorization, for example.

Thread A:
   for (i=0;i<n;++i)
      if (i&1)
        X[i] = 0;

Thread B:
   for (i=0;i<n;++i)
      if (!(i&1))
        X[i] = 1;

The threads run concurrently without synchronization. As written,
there is no race.

There is no race *if* the hardware cache coherence says so. :) There
are false sharing issues here and different machines have behaved very
differently in the past.

The result entirely depends on the machine's consistency model.

LLVM is a virtual machine and the IR should define a consistency model.
Everything flows from that. I think ideally we'd define the model such
that there is no race in the scalar code and the compiler would be free
to vectorize it. This is a very strict consistency model and for
targets with relaxed semantics, LLVM would have to insert
synchronization operations or choose not to vectorize.

Presumably if the scalar code were problematic on a machine with relaxed
consistency, the user would have added synchronization primitives and
vectorization would not be possible.

Can you vectorize either of these loops? If masked-out elements of a
predicated store are "in play" for racing, then vectorizing would
introduce a race. And, it'd be hard for an optimizer to prove that
this doesn't happen.

Same answer. I don't think scalar vs. vector matters. This is mostly a
cache coherence issue.

There is one twist that our vectorization guy pointed out to me. If,
when vectorizing, we have threads A and B read the entire vector, update
the values under mask and then write the entire vector, clearly there
will be a data race introduced. The Cray compiler has switches for
users to balance safety and performance, since a stride-one load and
store is generally much faster than a masked load and store.
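In IR terms, that style of vectorization is an unmasked read-modify-write, which touches even the lanes it does not own (a sketch):

%old = load <4 x i32>* %p                ; read the whole vector
%new = select <4 x i1> %mask, <4 x i32> %val, <4 x i32> %old
store <4 x i32> %new, <4 x i32>* %p      ; write the whole vector back

If another thread stores to a masked-off lane between the load and the store, that update is silently overwritten - exactly the introduced race.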

So for vectorization, the answer is, "it depends on the target
consistency model and the style of vectorization chosen."

p.s. Yes, you could also vectorize these with a strided store or a
scatter, but then it raises a different question, of the memory
semantics for strided or scatter stores.

And again, the same answer. :)

I'm no vectorization expert, but I believe what I said is correct. :)

                         -David

Jeff Bush <jeffbush001@gmail.com> writes:

%tx = select %mask, %x, <0.0, 0.0, 0.0 ...>
%ty = select %mask, %y, <0.0, 0.0, 0.0 ...>
%sum = fadd %tx, %ty
%newvalue = select %mask, %sum, %oldvalue

I believe the generated instructions depend on whether %oldvalue is
still live after the last instruction. If it is, you need to generate
two instructions: a copy into a new physical register, then a
predicated write to it. If it is not used, then it is just a predicated
write to the same register.

  move r1, r0
  fadd r1{m0}, r2, r3

(r0 is now %oldvalue and r1 is %newvalue)

vs.

  fadd r0{m0}, r2, r3

(r0 was %oldvalue and is now %newvalue)

I'm assuming some parts of %oldvalue are still used. The masked fadd
could preserve them for false values of the mask, depending on how
masking is defined. In that case, there's no need for a register copy.
If the masked operation does not preserve the old values in r0, then we
do need a register copy.

Preserving old values does complicate things for SSA, as you note.

The bottom line is that it is probably easier to set this up before LLVM
IR goes into SSA form.

That makes sense, but it's unclear to me how you would preserve that
information after going into SSA form.

I should think the semantics of select would handle that. After a
select all vector elements of the result are defined. There is no
preservation of old values. There cannot be, by definition of SSA.

It seems to me that these are not really LLVM issues as much as the
fact that SSA doesn't cleanly map to predicated instructions.

It entirely depends on how the predication is defined to work.

Good point. I was thinking of it narrowly as preserving the old value
in the register. I guess I'd amend my previous statement to say that
it actually does map just fine to SSA, but instruction selection
becomes more complex.

It sounds like the current LLVM instruction selection algorithm can't
really handle the use case I described cleanly (generating predicated
arithmetic instructions that preserve the old register value). Is
that a fair statement?

For example, if predicates were hypothetically added universally to
the IR (which I don't think anyone wants to do), it's not clear to me
how that would even work. How would you specify what value the result
would receive for non-enabled lanes? Perhaps another parameter:

  %newvalue = fadd %x, %y, %mask, %previousvalue

It depends on how you define the mask operation.

On the Cray X1, result elements mapped to false mask values are
undefined after the operation.

I assume the only point of the mask is to avoid traps/exceptions in
this case (otherwise it doesn't really do anything, right?).

Jeff Bush <jeffbush001@gmail.com> writes:

%tx = select %mask, %x, <0.0, 0.0, 0.0 ...>
%ty = select %mask, %y, <0.0, 0.0, 0.0 ...>
%sum = fadd %tx, %ty
%newvalue = select %mask, %sum, %oldvalue

I believe the generated instructions depend on whether %oldvalue is
still live after the last instruction. If it is, you need to generate
two instructions: a copy into a new physical register, then a
predicated write to it. If it is not used, then it is just a predicated
write to the same register.

move r1, r0
fadd r1{m0}, r2, r3

(r0 is now %oldvalue and r1 is %newvalue)

vs.

fadd r0{m0}, r2, r3

(r0 was %oldvalue and is now %newvalue)

I'm assuming some parts of %oldvalue are still used. The masked fadd
could preserve them for false values of the mask, depending on how
masking is defined. In that case, there's no need for a register copy.
If the masked operation does not preserve the old values in r0, then we
do need a register copy.

Preserving old values does complicate things for SSA, as you note.

The bottom line is that it is probably easier to set this up before LLVM
IR goes into SSA form.

That makes sense, but it's unclear to me how you would preserve that
information after going into SSA form.

I should think the semantics of select would handle that. After a
select all vector elements of the result are defined. There is no
preservation of old values. There cannot be, by definition of SSA.

It seems to me that these are not really LLVM issues as much as the
fact that SSA doesn't cleanly map to predicated instructions.

It entirely depends on how the predication is defined to work.

Good point. I was thinking of it narrowly as preserving the old value
in the register. I guess I'd amend my previous statement to say that
it actually does map just fine to SSA, but instruction selection
becomes more complex.

It sounds like the current LLVM instruction selection algorithm can't
really handle the use case I described cleanly (generating predicated
arithmetic instructions that preserve the old register value). Is
that a fair statement?

I don’t think this is a fair statement. Tied register operands should handle this use case just fine. This problem is similar to that of two-address constraints. Two-address instructions work as follows: when we match an instruction, we “tie” the input and output registers.

Say you had an LLVM-IR add:

%x = add i32 %y, %z

for x86 we generate the following machine IR instruction during ISel:

vr0<def, tied1> = ADD32rr vr1<use, tied0>, vr2<use>

Once we go out of SSA during CodeGen, we have to replace the two-address constraint with copies:

vr0 = vr1
vr0 = ADD32rr vr0, vr2

Coalescing and allocation will then take care of removing unnecessary copies. I think that predicated instructions would be handled similarly (for the sake of making the example shorter, I replaced your sequence of IR instructions with one “virtual” IR instruction):

%newvalue = predicated_add %mask, %x, %y, %oldvalue

This (actually, your sequence of selects and add) would be matched during ISel to:

vr0<def, tied1> = PRED_ADD mask_vr, vr1<use>, vr2<use>, vr3<use, tied0>

From here on the machinery is the same, the two-address pass would translate such instructions to:

vr0 = vr3
vr0 = PRED_ADD mask_vr, vr1, vr2, vr0

If vr3 is not live after the PRED_ADD, coalescing will coalesce vr0 and vr3. I don’t think there is a fundamental difficulty in handling predicated instructions this way - at least with respect to register constraints.