Handling Masked Vector Operations

We're looking at how to handle masked vector operations in architectures
like Knights Corner. In our case, we have to translate from a fully
vectorized IR that has mask support to llvm IR which does not have mask
support.

For non-trapping instructions this is fairly straightforward:

; Input
t1 = add t2, t3, mask

; llvm IR -- assuming we want zeros in the false positions, which is not
; always the case
tt = add t2, t3
t1 = select mask, tt, 0

It's easy enough to write a TableGen pattern to match add+select and
emit a masked instruction. Alternatively, we can always resort to manual
lowering (ugh).
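For concreteness, the add+select sequence above in actual LLVM IR might
look something like this (just a sketch; the <8 x i32> element type,
<8 x i1> mask type, and value names are assumptions for illustration):

```llvm
; Unmasked add, then blend zeros into the false lanes.
%tt = add <8 x i32> %t2, %t3
%t1 = select <8 x i1> %mask, <8 x i32> %tt, <8 x i32> zeroinitializer
```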

For trapping operations this is problematic. Take a load. Here's the
same attempt with a load:

tt = load [addr]
t1 = select mask, tt, 0

The problem is that the load is unconditional. Originally it was masked
presumably because the original scalar load was under a condition,
probably protecting against traps. However, since llvm has no concept
of a masked load, to stay within the IR we need to use an unconditional
load. We can get the same result vector after the select, but that's
too late. The load has already executed and the llvm passes will assume
that load cannot fault (otherwise it's undefined behavior). The llvm IR
does not convey the same semantics as the fully predicated IR.

It seems the only solution is to create an intrinsic:

llvm_int_load_masked mask, [addr]

But this unnecessarily shuts down optimization.
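For illustration, such an intrinsic might be declared something like
this (purely hypothetical; the name and signature below are made up and
do not correspond to any existing LLVM intrinsic):

```llvm
; Hypothetical masked-load intrinsic: loads only the lanes whose mask
; bit is set; false lanes are zero (or undef, depending on the design).
declare <8 x i32> @llvm.masked.load.v8i32(<8 x i32>* %addr, <8 x i1> %mask)

%t1 = call <8 x i32> @llvm.masked.load.v8i32(<8 x i32>* %p, <8 x i1> %mask)
```

The call is opaque to most of the optimizer unless each pass is taught
about it, which is the concern above.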

Similar problems exist with any trapping instruction (div, mod, etc.).
It gets even worse when you consider that any floating point operation
can trap on a signalling NaN input.

The gpuocelot project is essentially trying to do the same thing but I
haven't dived deep enough into their notes and implementation to see how
they handle this issue. Perhaps because current GPUs don't trap it's a
non-issue. But that will likely change in the future.

So are there any ideas out there for how to efficiently handle this?
We've talked about llvm and masks before and it's clear that there is
strong resistance to adding masks to the IR. Perhaps an alternative is
to mark an instruction as "may trap" so that llvm will not perform
certain transformations on it. Of course that involves teaching all of
the passes about a new "may trap" attribute, or whatever mechanism is
devised.

I would very much appreciate thoughts and ideas on this. As it is, it
doesn't seem like it's possible to generate efficient llvm IR for fully
predicated instruction sets.

                           -David

Hi David,

    It seems the only solution is to create an intrinsic:

    llvm_int_load_masked mask, [addr]

    But this unnecessarily shuts down optimization.

I think that using intrinsics is the right solution. I imagine that most interesting load/store optimizations happen before vectorization, so I am not sure how much we can gain by optimizing masked load/stores.

    Similar problems exist with any trapping instruction (div, mod, etc.).
    It gets even worse when you consider that any floating point operation
    can trap on a signalling NaN input.

For DIV/MOD you can blend the inputs BEFORE the operation. You can place ones or zeros depending on the operation.
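Spelled out for a divide, the blend-before-the-op idea might look like
this (a sketch; the <4 x i32> types and value names are assumed, and
ones are blended into the false lanes of the divisor so that no lane
can divide by zero):

```llvm
; Make the false lanes of the divisor safe before the sdiv executes.
%den.safe = select <4 x i1> %mask, <4 x i32> %den, <4 x i32> <i32 1, i32 1, i32 1, i32 1>
%q = sdiv <4 x i32> %num, %den.safe
; Blend the final result; false lanes get zero here.
%t1 = select <4 x i1> %mask, <4 x i32> %q, <4 x i32> zeroinitializer
```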

    So are there any ideas out there for how to efficiently handle this?
    We’ve talked about llvm and masks before and it’s clear that there is
    strong resistance to adding masks to the IR.

Yes. I think that the consensus is that we don’t need to predicate the IR itself to support MIC-like processors.

Thanks,
Nadav

Nadav Rotem <nrotem@apple.com> writes:

    It seems the only solution is to create an intrinsic:
    
    llvm_int_load_masked mask, [addr]
    
    But this unnecessarily shuts down optimization.
    
    I think that using intrinsics is the right solution. I imagine that
    most interesting load/store optimizations happen before vectorization,
    so I am not sure how much we can gain by optimizing masked
    load/stores.

Perhaps that is true. If this is the only intrinsic we need (well, a
store too), maybe it's not too bad.

    Similar problems exist with any trapping instruction (div, mod,
    etc.). It gets even worse when you consider that any floating
    point operation can trap on a signalling NaN input.
    
    For DIV/MOD you can blend the inputs BEFORE the operation. You can
    place ones or zeros depending on the operation.

That's true but it's inefficient. I suppose we can write patterns to
match the input selects as well and just drop them, opting for the
masked operation. But this all requires that these
select/select/op/select sequences stay intact throughout llvm so isel
can match it. I'm not totally confident that's possible.

    So are there any ideas out there for how to efficiently handle
    this? We've talked about llvm and masks before and it's clear
    that there is strong resistance to adding masks to the IR.

    Yes. I think that the consensus is that we don't need to predicate the
    IR itself to support MIC-like processors.

Perhaps not but I think we need a little more than we have right now.
I'll ponder this some more but in the meantime, please continue to add
thoughts and ideas.

                                -David

Nadav Rotem <nrotem@apple.com> writes:

    For DIV/MOD you can blend the inputs BEFORE the operation. You can
    place ones or zeros depending on the operation.

Quick follow-up on this. What about using "undef" as the input for
false items:

tv1 = select mask, v1, undef
tv2 = select mask, v2, undef
tv3 = div tv1, tv2
v3 = select mask, tv3, undef

I'm always confused about the semantics of undef. Is the above safe
code? It would simplify things a bit not to have to track which input
values are safe based on the context of an operation.

                             -David

        For DIV/MOD you can blend the inputs BEFORE the operation. You can
        place ones or zeros depending on the operation.

    Quick follow-up on this. What about using “undef” as the input for
    false items:

    tv1 = select mask, v1, undef
    tv2 = select mask, v2, undef
    tv3 = div tv1, tv2
    v3 = select mask, tv3, undef

    I’m always confused about the semantics of undef. Is the above safe
    code? It would simplify things a bit not to have to track which input
    values are safe based on the context of an operation.

    -David

This is not a correct use of undef. I think that InstCombine/DagCombine will optimize tv1 into ‘undef’.

Nadav Rotem <nrotem@apple.com> writes:

        For DIV/MOD you can blend the inputs BEFORE the operation. You
        can place ones or zeros depending on the operation.

    Quick follow-up on this. What about using "undef" as the input for
    false items:

    tv1 = select mask, v1, undef
    tv2 = select mask, v2, undef
    tv3 = div tv1, tv2
    v3 = select mask, tv3, undef

    I'm always confused about the semantics of undef. Is the above safe
    code? It would simplify things a bit not to have to track which
    input values are safe based on the context of an operation.

    -David

    This is not a correct use of undef. I think that
    InstCombine/DagCombine will optimize tv1 into 'undef'.

But is that really a legal transformation? Clearly not all elements are
known to be undef.

But thanks for the feedback! We'll avoid this route for now.

                          -David

Hi David,

This blog post explains the motivation for modeling undefined behavior: http://blog.llvm.org/2011/05/what-every-c-programmer-should-know.html

The idea is that you want to optimize your code based on the premise that undefined behavior in C allows you to do anything you like (to the undefined value).

Thanks,
Nadav

Nadav Rotem <nrotem@apple.com> writes:

>
> For DIV/MOD you can blend the inputs BEFORE the operation. You can
> place ones or zeros depending on the operation.
>
> Quick follow-up on this. What about using "undef" as the input for
> false items:
>
> tv1 = select mask, v1, undef
> tv2 = select mask, v2, undef
> tv3 = div tv1, tv2
> v3 = select mask, tv3, undef
>
> I'm always confused about the semantics of undef. Is the above safe
> code? It would simplify things a bit not to have to track which input
> values are safe based on the context of an operation.

The selects with undef do indeed do what you intend here. They return
vectors where the elements corresponding to true bits in the mask have
defined values, and the other elements have undef values. Undef can indeed
exist on a per-element basis, and LLVM is quite careful to implement this
[0].

The real problem with this approach is that there's nothing guarding those
undef elements from the div. If the undef elements happen to be 0, you'll
divide by zero, which is undefined behavior. For a divide, it's either
necessary to define the behavior of divide-by-zero (not what LLVM does
today), or mask the divide itself.

> -David
>
> This is not a correct use of undef. I think that
> InstCombine/DagCombine will optimize tv1 into 'undef'.

It is not valid to optimize tv1 to undef here. tv3 is the problem.

Dan

[0] undef can also exist on a per-bit basis: and(1, undef) is a value where
the least significant bit is undef and the remaining bits are zero.
Amusingly, undef can also exist on a non-bitwise basis: add(mul(urem(undef,
3), 17), 5) is a value which fluctuates among just the values 5, 22, and 39.

Nadav Rotem <nrotem@apple.com> writes:

    The idea is that you want to optimize your code based on the premise
    that undefined behavior in C allows you to do anything you like (to
    the undefined value).

To the undefined value, sure, but how does that get extrapolated to
treating the entire vector as undef?

                                 -David

Dan Gohman <dan433584@gmail.com> writes:

    The selects with undef do indeed do what you intend here. They return
    vectors where the elements corresponding to true bits in the mask have
    defined values, and the other elements have undef values. Undef can
    indeed exist on a per-element basis, and LLVM is quite careful to
    implement this [0].

Ok, that's what I would expect.

    The real problem with this approach is that there's nothing guarding
    those undef elements from the div. If the undef elements happen to be
    0, you'll divide by zero, which is undefined behavior. For a divide,
    it's either necessary to define the behavior of divide-by-zero (not
    what LLVM does today), or mask the divide itself.

The intent is to match the select/select/div/select as one masked divide
instruction. I totally understand what you're saying and I know there
is no guarantee that we will even see that construct way down in isel.
But it seems to be the best we've got right now.

Using safe values for false items is...er...safer. :-) We'll probably
do that and hope we can avoid the overhead at isel time.

                            -David

Hi David,

    We're looking at how to handle masked vector operations in architectures
    like Knights Corner. In our case, we have to translate from a fully
    vectorized IR that has mask support to llvm IR which does not have mask
    support.

    For non-trapping instructions this is fairly straightforward:

    ; Input
    t1 = add t2, t3, mask

    ; llvm IR -- assuming we want zeros in the false positions, which is not
    ; always the case
    tt = add t2, t3
    t1 = select mask, tt, 0

there seems to be a plan to get rid of the select instruction and just use
branches and phi nodes instead. Amongst other things this requires boosting
the power of codegen so that branches+phi nodes can be turned into cmov or
whatever when appropriate.

    It's easy enough to write a TableGen pattern to match add+select and
    emit a masked instruction. Alternatively, we can always resort to manual
    lowering (ugh).

    For trapping operations this is problematic. Take a load. Here's the
    same attempt with a load:

    tt = load [addr]
    t1 = select mask, tt, 0

This would not be problematic at the IR level if it was done by branching to
one of two basic blocks based on the condition, and doing the load in the
appropriate basic block. Codegen would however need to become powerful enough
to turn this construct into your target's predicated load.
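In IR, that control-flow form might look something like the following
for a single scalar lane (a sketch; the names are made up, and a
per-element vector mask has no direct CFG equivalent, so each lane
would conceptually need its own guard):

```llvm
entry:
  br i1 %cond, label %do_load, label %merge

do_load:
  %tt = load i32, i32* %addr    ; only executed when %cond is true
  br label %merge

merge:
  %t1 = phi i32 [ %tt, %do_load ], [ 0, %entry ]
```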

Ciao, Duncan.

Hi Duncan,

Thanks for commenting on this. I completely disagree with the proposal to remove SELECTS from the IR, and David’s email is a perfect example of how useful Selects are. I’d be happy to comment about the proposal when it is presented to the community.

Thanks,
Nadav

Duncan Sands <baldrick@free.fr> writes:

    there seems to be a plan to get rid of the select instruction and just use
    branches and phi nodes instead. Amongst other things this requires boosting
    the power of codegen so that branches+phi nodes can be turned into cmov or
    whatever when appropriate.

This is a very BAD idea. Are you telling me that every predicated
instruction will need to be in its own basic block so it can be
represented with a phi? Certainly this will be true for any
instructions that do not share masks.

PLEASE do not do this! This would be a huge step backward in vector
support.

        It's easy enough to write a TableGen pattern to match add+select and
        emit a masked instruction. Alternatively, we can always resort to manual
        lowering (ugh).

        For trapping operations this is problematic. Take a load. Here's the
        same attempt with a load:

        tt = load [addr]
        t1 = select mask, tt, 0

    This would not be problematic at the IR level if it was done by branching to
    one of two basic blocks based on the condition, and doing the load in the
    appropriate basic block. Codegen would however need to become powerful enough
    to turn this construct into your target's predicated load.

How will that ever happen? isel has never known much about control flow
at all.

Please do NOT remove select until we have a solid replacement in place,
something that's tested and known to work.

I cannot object strongly enough. I've bit my tongue at a few IR
changes, but not this one.

Who proposed this change? Why has it not been discussed on the list?

                             -David

FWIW, I don’t think that removing select has ever been really proposed. I would also be pretty against such a thing.

That said, there is a question about how much if conversion should happen on IR vs at the machine level. There are interesting design problems to be answered there, but AFAIK, no design would remove select entirely.

-Chris

Chris Lattner <clattner@apple.com> writes:

    FWIW, I don't think that removing select has ever been really
    proposed. I would also be pretty against such a thing.

Whew. Good to know.

    That said, there is a question about how much if conversion should
    happen on IR vs at the machine level. There are interesting design
    problems to be answered there, but AFAIK, no design would remove
    select entirely.

Ok. Yes, I think there are some interesting design choices in this
area. I look forward to a good discussion!

                           -David

Has anyone done a comparison between the "fully vectorized IR" and "LLVM IR"?
If someone has already invented a "fully vectorized IR", it might be
beneficial to not re-invent it for LLVM.
For example, if you are optimizing a loop, and splitting it into 3
loops, one of which can then be fully vectorized, it would be useful
to represent that optimization/translation at the IR level. Adding
mask support to LLVM IR would therefore seem a sensible course to me.
It might be a short term pain, but would possibly benefit the longterm
optimization goals of LLVM.

James Courtier-Dutton <james.dutton@gmail.com> writes:

        We're looking at how to handle masked vector operations in architectures
        like Knights Corner. In our case, we have to translate from a fully
        vectorized IR that has mask support to llvm IR which does not have mask
        support.

    Has anyone done a comparison between the "fully vectorized IR" and "LLVM IR"?
    If someone has already invented a "fully vectorized IR", it might be
    beneficial to not re-invent it for LLVM.
    For example, if you are optimizing a loop, and splitting it into 3
    loops, one of which can then be fully vectorized, it would be useful
    to represent that optimization/translation at the IR level. Adding
    mask support to LLVM IR would therefore seem a sensible course to me.
    It might be a short term pain, but would possibly benefit the long-term
    optimization goals of LLVM.

The vectorized IR we are translating from has explicit masking at the
leaf nodes and implied masking at the inner nodes. For example:

              ___MERGE___
             /           \
            +             -
           / \           / \
      [a#m1] [b#m1] [a#m2] [b#m2]

So the add is assumed to operate under #m1 and the subtract is assumed
to operate under #m2. Then there is an explicit merge operation to form
the final vector. In this case #m2 = ~#m1.

I believe we can represent this in LLVM IR with selects as long as we
have predication at the leaves. The trick is to have isel match all of
these selects and generate an efficient predicated operation. I'm
working on trying that experiment to see if it will suffice.
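Since #m2 = ~#m1 in this example, the whole tree collapses to one
select in LLVM IR (a sketch with assumed <4 x i32> operands; trapping
leaves such as loads would still need the intrinsic treatment discussed
earlier):

```llvm
%add = add <4 x i32> %a, %b
%sub = sub <4 x i32> %a, %b
; MERGE: take the add lanes where #m1 is true, the sub lanes elsewhere.
%res = select <4 x i1> %m1, <4 x i32> %add, <4 x i32> %sub
```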

So I don't know that a fully predicated IR would be any better than
selects + predication at the leaves. That's why I'm doing the
experiment. :-)

                                -David