Ideas for representing vector gather/scatter and masks in LLVM IR

Here are some ideas that came out of brainstorming during and after
the dev meeting. Disclaimer: they're just ideas :-).

* Vector Gather/Scatter

One way to do gather/scatter would be to extend vector types to
support pointer element types, and extend load, store, and getelementptr
to operate on vectors of pointers.

A typical gather sequence would then look like this:

  %vi = load <2 x i64>* %indices ; load a vector of indices
  %vp = gep <2 x f32*> %base, <2 x i64> %vi ; compute an address vector
  %vx = load <2 x f32*> %vp ; gather

This is significantly less ugly than a build-it-yourself gather, and
would be much friendlier to masks (below) than the "individually extract
each i1 from the mask" approach.
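
For symmetry, a scatter would presumably be the store form; a minimal
sketch in the same notation (the store syntax is my extrapolation, not
part of the proposal):

  %vi = load <2 x i64>* %indices ; load a vector of indices
  %vp = gep <2 x f32*> %base, <2 x i64> %vi ; compute an address vector
  store <2 x f32> %vx, <2 x f32*> %vp ; scatter %vx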

Note that this wouldn't support multiple alignments or multiple
address spaces in a single gather/scatter. Similarly, volatile
would be all-or-nothing. These don't seem like show-stoppers though.

This would complicate analyses that look at load and store addresses,
but if we really want to do gather/scatter without messes, this might be
an acceptable tradeoff.

* Masks

While adding a mask operand to every instruction that needs it would
serve the intended purpose, it would also enlarge and complicate IR,
even in code that doesn't need masks. It's a common use-case to have a
single mask used by many adjacent instructions, so this would also be
highly redundant.

An alternative that exploits this common use-case is to add a new
applymask instruction:

  %w = applymask <2 x f32> %v, <2 x i1> %m

The semantics would be to copy %v into %w, and implicitly apply mask %m
to all users (recursively) of %w, unless overridden by another
applymask. For example:

  %p = applymask <2 x f32*> %q, <2 x i1> %m
  %x = load <2 x f32*> %p ; implicitly masked by %m
  %y = add <2 x f32> %x, %w ; implicitly masked by %m
  %z = mul <2 x f32> %y, %y ; implicitly masked by %m
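
To illustrate the override rule, a sketch assuming a second mask %m2:

  %y2 = applymask <2 x f32> %y, <2 x i1> %m2
  %z2 = mul <2 x f32> %y2, %y2 ; implicitly masked by %m2, not %m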

This approach minimizes enlargement of IR in code that doesn't use
masks, and it reduces the impact on complexity -- a pass like
instcombine that wants to combine an instruction and one of its operands
into a single instruction could safely do so without knowing anything
about masks.

The applymask instruction could apply to scalars as well as vectors.
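
In the scalar case the mask would presumably be a single i1; a sketch,
with both operands under the same mask per the rule below:

  %a1 = applymask i32 %a, i1 %c
  %b1 = applymask i32 %b, i1 %c
  %q = sdiv i32 %a1, %b1 ; executes (and can trap) only where %c is true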

The set of instructions requiring mask legalization would be found by
recursively traversing the uses of all applymask instructions.

Extractelement on a masked-out element yields undef. Overriding a value's
mask with a new applymask instruction that unmasks any elements leaves
those elements set to undef.
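
For example, a sketch with constant masks:

  %w = applymask <2 x f32> %v, <2 x i1> <i1 1, i1 0>
  %u = applymask <2 x f32> %w, <2 x i1> <i1 1, i1 1> ; element 1 of %u is undef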

It would be illegal for an instruction with multiple operands to have
multiple distinct masks. This includes PHI nodes.

Insertelement and shufflevector with masked-out values would not be
immune to the mask; assigning a value to a masked-out element of a
vector does not make that element non-undef. Code wanting to fill in
masked-out values must apply a new mask to the operands first.

Masks would not themselves be maskable. That is, the result of an
applymask could not be used (even indirectly) as the second operand
of another applymask.
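
That is, a sketch of what the Verifier would reject:

  ; illegal: the mask operand %mm is itself a masked value
  %mm = applymask <2 x i1> %m1, <2 x i1> %m0
  %w = applymask <2 x f32> %v, <2 x i1> %mm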

Dan

Hi,

The set of instructions requiring mask legalization would be found by
recursively traversing the uses of all applymask instructions.

what if the applymask is in one function, but the use of it is in another?

Ciao,

Duncan.

I think we could just disallow this.

If we get to the point of wanting mask-aware library routines, it might
make sense to introduce an unmask instruction, which would be a sort of
inverse to applymask that would return the original value, with any
masked-out portion set to undef. Front-ends would then explicitly
pass the mask along with the unmasked arguments to library routines,
which would reapply the mask as needed. It's conceivable that we might
want something even fancier than that in the future, but we can figure
that out later.
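
As a sketch of how that might look (the unmask syntax and @vec_sqrt are
hypothetical):

  %u = unmask <2 x f32> %w ; masked-out elements of %w become undef in %u
  %r = call <2 x f32> @vec_sqrt(<2 x f32> %u, <2 x i1> %m) ; callee reapplies %m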

Thanks,

Dan

* Vector Gather/Scatter

One way to do gather/scatter would be to extend vector types to
support pointer element types, and extend load, store, and getelementptr
to operate on vectors of pointers.

A typical gather sequence would then look like this:

  %vi = load <2 x i64>* %indices ; load a vector of indices
  %vp = gep <2 x f32*> %base, <2 x i64> %vi ; compute an address vector
  %vx = load <2 x f32*> %vp ; gather

This looks very good to me. It will make vector gather/scatter code much
cleaner than the "extract data and mask bits" stuff we talked about at the
meeting.

Note that this wouldn't support multiple alignments or multiple
address spaces in a single gather/scatter. Similarly, volatile
would be all-or-nothing. These don't seem like show-stoppers though.

Nope. If alignment is a concern, my assumption would be that the same
alignment would be required on all elements, otherwise one would not be
able to vectorize the gather/scatter. One doesn't generally mix data types
in gather/scatter operations.

This would complicate analyses that look at load and store addresses,
but if we really want to do gather/scatter without messes, this might be
an acceptable tradeoff.

By "complicate" do you mean "need to look at multiple addresses from a
single instruction?" Or is there more than that? I'm trying to understand
all the implications.

While adding a mask operand to every instruction that needs it would
serve the intended purpose, it would also enlarge and complicate IR,
even in code that doesn't need masks. It's a common use-case to have a
single mask used by many adjacent instructions, so this would also be
highly redundant.

But explicit is better than implicit in my experience. It's also the LLVM
philosophy to be as explicit as possible.

An alternative that exploits this common use-case is to add a new
applymask instruction:

  %w = applymask <2 x f32> %v, <2 x i1> %m

The semantics would be to copy %v into %w, and implicitly apply mask %m
to all users (recursively) of %w, unless overridden by another
applymask. For example:

  %p = applymask <2 x f32*> %q, <2 x i1> %m
%x = load <2 x f32*> %p ; implicitly masked by %m
  %y = add <2 x f32> %x, %w ; implicitly masked by %m
  %z = mul <2 x f32> %y, %y ; implicitly masked by %m

Yuck. I don't like this at all. It makes reading the IR harder because now
you need to worry about context. Not all dependencies are readily expressed
in the instructions. How would one express TableGen patterns for such
things?

My understanding is that we came away with a general agreement to add
mask support to operations that can trap and to memory operations. That
would mean adding masks to floating-point arithmetic and memory operations.
As I recall, Chris expressed some interest in creating separate integer and fp
arithmetic instructions anyway, so it doesn't seem to be a lot of additional
work to add masks to the fp side, since instcombine et al. will need to know
about entirely new operations anyway.

We concluded that operation results would be undefined for vector elements
corresponding to a zero mask bit.

We also talked about adding a vector select, which is crucial for any code
that uses masks.
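
A vector select would presumably generalize the scalar select
elementwise; a sketch:

  %r = select <2 x i1> %m, <2 x f32> %a, <2 x f32> %b ; %a where %m is 1, else %b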

                                                                -Dave

I agree with David. I don't like the implicit use of masks. That seems to unnecessarily complicate LLVM IR.

Evan

* Vector Gather/Scatter

This would complicate analyses that look at load and store addresses,
but if we really want to do gather/scatter without messes, this might be
an acceptable tradeoff.

By "complicate" do you mean "need to look at multiple addresses from a
single instruction?" Or is there more than that? I'm trying to understand
all the implications.

I mean just that -- we have a fair amount of code built around looking
at the addresses of load and store nodes that in some cases would need
to be restructured to cope with multiple addresses at a time.

While adding a mask operand to every instruction that needs it would
serve the intended purpose, it would also enlarge and complicate IR,
even in code that doesn't need masks. It's a common use-case to have a
single mask used by many adjacent instructions, so this would also be
highly redundant.

But explicit is better than implicit in my experience. It's also the LLVM
philosophy to be as explicit as possible.

An alternative that exploits this common use-case is to add a new
applymask instruction:

%w = applymask <2 x f32> %v, <2 x i1> %m

The semantics would be to copy %v into %w, and implicitly apply mask %m
to all users (recursively) of %w, unless overridden by another
applymask. For example:

%p = applymask <2 x f32*> %q, <2 x i1> %m
%x = load <2 x f32*> %p ; implicitly masked by %m
%y = add <2 x f32> %x, %w ; implicitly masked by %m
%z = mul <2 x f32> %y, %y ; implicitly masked by %m

Yuck. I don't like this at all. It makes reading the IR harder because now
you need to worry about context.

I don't disagree with these. I think it's a trade-off, with LLVM
design philosophy and IR cleanliness arguments on both sides.

The applymask approach leverages use-def information rather than
what can be thought of as duplicating a subset of it, making the IR
less cluttered. And, it makes it trivially straightforward to write
passes that work correctly on both masked and unmasked code.

Not all dependencies are readily expressed
in the instructions. How would one express TableGen patterns for such
things?

The syntax above is an idea for LLVM IR. SelectionDAG doesn't necessarily
have to use the same approach.

My understanding is that we came away with a general agreement to add
mask support to operations that can trap and to memory operations. That
would mean adding masks to floating-point arithmetic and memory operations.
As I recall, Chris expressed some interest in creating separate integer and fp
arithmetic instructions anyway, so it doesn't seem to be a lot of additional
work to add masks to the fp side, since instcombine et al. will need to know
about entirely new operations anyway.

I think we all recognize the need, and in the absence of better
alternatives are willing to accept the mask operand approach. It would
have a significant impact on everyone, even those that don't use masks.
I don't want to stand in the way of progress, but this alternative
approach seems promising enough to be worth consideration.

We concluded that operation results would be undefined for vector elements
corresponding to a zero mask bit.

We also talked about adding a vector select, which is crucial for any code
that uses masks.

Right. This applymask idea doesn't conflict with these.

Dan

> By "complicate" do you mean "need to look at multiple addresses from a
> single instruction?" Or is there more than that? I'm trying to
> understand
> all the implications.

I mean just that -- we have a fair amount of code built around looking
at the addresses of load and store nodes that in some cases would need
to be restructured to cope with multiple addresses at a time.

Ok. I should think that this would be feasible to do. In the worst case it's
an N^2 loop looking at all pairs. And N is usually going to be small.

>> %p = applymask <2 x f32*> %q, <2 x i1> %m
>> %x = load <2 x f32*> %p ; implicitly masked by %m
>> %y = add <2 x f32> %x, %w ; implicitly masked by %m
>> %z = mul <2 x f32> %y, %y ; implicitly masked by %m
>
> Yuck. I don't like this at all. It makes reading the IR harder
> because now
> you need to worry about context.

I don't disagree with these. I think it's a trade-off, with LLVM
design philosophy and IR cleanliness arguments on both sides.

The applymask approach leverages use-def information rather than
what can be thought of as duplicating a subset of it, making the IR

I don't understand what you mean by "duplicating" here. You need some
kind of use-def information for the masks themselves because at some
point they need to be register-allocated.

less cluttered. And, it makes it trivially straightforward to write
passes that work correctly on both masked and unmasked code.

I had a thought on this, actually. Let's say the mask is the very last
operand on masked instructions. Most passes don't care about the mask
at all. They can just ignore it. Since they don't look at the extra operand
right now, there shouldn't be many changes necessary (some asserts
may need to be fixed, etc.).
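
For instance, a masked add under that scheme might look like this (purely
hypothetical syntax for the trailing mask operand):

  %y = add <2 x f32> %a, %b, mask %m1

and a pass that never inspects the trailing operand still sees an
ordinary two-operand add.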

Think about instcombine. It's matching patterns. If the matcher doesn't
look at masks, that may be ok most of the time (mod corner cases which
I fully appreciate can be a real pain to track down). If we want fancy
instcombine tricks that understand masks, we can add those later.

> Not all dependencies are readily expressed
> in the instructions. How would one express TableGen patterns for such
> things?

The syntax above is an idea for LLVM IR. SelectionDAG doesn't
necessarily
have to use the same approach.

What do you mean by "ideal for LLVM IR?" This looks very much _not_ ideal to
me from a debugging standpoint. It's difficult to understand. It took me
reading through the proposal a few times to grok what you are talking about.

I think we all recognize the need, and in the absence of better
alternatives are willing to accept the mask operand approach. It would
have a significant impact on everyone, even those that don't use masks.

How do you define "significant impact?" Compile time? Development effort?
Transition pain? All of the above? More?

For architectures that don't use masks, either the mask gets set to all 1's or
we have non-masked versions of operators. I honestly don't know which is
the desirable route to take. My guess is that the optimizers will have to
understand whether or not the target architecture supports masks and not
generate them (e.g. no if-conversion) if the target doesn't support them.

I wonder if there is some way to un-if-convert to eliminate masks if
necessary. I'm thinking about code portability and JIT issues when
reading in LLVM IR that was produced at some earlier time. Perhaps
this isn't an issue we need to worry about right now.

I don't want to stand in the way of progress, but this alternative
approach seems promising enough to be worth consideration.

Alternatives are always welcome and worth considering. I'm looking at the
kind of things the LLVM community is going to want to support and I'm
pretty sure masks are going to be a very big part of architectures in the
future. We're done with clock speed improvements, so we need to rely on
architecture more. Vectorization is a well-known technique to improve
single thread performance and masks are critical to producing efficient vector
code.

If y'all agree with this premise, it seems to me that we want to support
such architectures in as straightforward a way as possible so as to minimize
future pain when we're all writing complex and beautiful vector hacks. :-)

What can we learn from the IA64 and ARM backends? How do they handle
their masks (scalar predication)? Is all the if-conversion done in
target-specific passes?

> We concluded that operation results would be undefined for vector
> elements
> corresponding to a zero mask bit.
>
> We also talked about adding a vector select, which is crucial for
> any code
> that uses masks.

Right. This applymask idea doesn't conflict with these.

Yep. I just wanted to be thorough.

                                                  -Dave

The applymask approach leverages use-def information rather than
what can be thought of as duplicating a subset of it, making the IR

I don't understand what you mean by "duplicating" here.

If you look just at the case where every instruction in a
given use-def sub-dag uses the same mask, adding that mask as
an operand to all of them is largely just duplicating the
information about them all being connected. This is the common
case that applymask is aimed at.

In the case where multiple masks are used, applymask can still
cope, but the neat thing is that in this case it serves to
mark the dataflow edges where masks change.

You need some
kind of use-def information for the masks themselves because at some
point they need to be register-allocated.

What I'm talking about here is just in LLVM IR. I agree that we want
mask registers as operands during register allocation, and probably
also instruction selection.

less cluttered. And, it makes it trivially straightforward to write
passes that work correctly on both masked and unmasked code.

I had a thought on this, actually. Let's say the mask is the very last
operand on masked instructions. Most passes don't care about the mask
at all. They can just ignore it. Since they don't look at the extra
operand
right now, there shouldn't be many changes necessary (some asserts
may need to be fixed, etc.).

Think about instcombine. It's matching patterns. If the matcher doesn't
look at masks, that may be ok most of the time (mod corner cases which
I fully appreciate can be a real pain to track down). If we want fancy
instcombine tricks that understand masks, we can add those later.

If masks are operands, instcombine will need to check if all the
relevant masks match before many of the transformations it does,
and it'll need to take care to put the mask operand in the
instructions it creates.
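
For instance, under a hypothetical mask-operand syntax:

  %t = add <2 x i32> %x, %c1, mask %m1
  %r = add <2 x i32> %t, %c2, mask %m2 ; to fold these into one add,
                                       ; instcombine must check %m1 against %m2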

With applymask, I believe instcombine wouldn't require any
modifications, except things like "case ApplyMaskInst: break;" in
a few places. Applymask makes masks in the IR so easy to reason
about, most passes won't need to do any special reasoning.

> Not all dependencies are readily expressed
> in the instructions. How would one express TableGen patterns for such
> things?

The syntax above is an idea for LLVM IR. SelectionDAG doesn't
necessarily
have to use the same approach.

What do you mean by "ideal for LLVM IR?" This looks very much _not_ ideal
to
me from a debugging standpoint. It's difficult to understand. It took me
reading through the proposal a few times to grok what you are talking
about.

I said "idea", not "ideal" :-). But I just meant that LLVM IR
and SelectionDAG don't have to do the same thing.

I think we all recognize the need, and in the absence of better
alternatives are willing to accept the mask operand approach. It would
have a significant impact on everyone, even those that don't use masks.

How do you define "significant impact?" Compile time? Development
effort?
Transition pain? All of the above? More?

With mask operands, many passes will need to explicitly check for
masks even if they don't care and just want to be conservatively
correct.

With applymask, passes will often be able to operate on masked IR
just as aggressively as non-masked IR.

I don't want to stand in the way of progress, but this alternative
approach seems promising enough to be worth consideration.

Alternatives are always welcome and worth considering. I'm looking at the
kind of things the LLVM community is going to want to support and I'm
pretty sure masks are going to be a very big part of architectures in the
future. We're done with clock speed improvements, so we need to rely on
architecture more. Vectorization is a well-known technique to improve
single thread performance and masks are critical to producing efficient
vector
code.

If y'all agree with this premise, it seems to me that we want to support
such architectures in as straightforward a way as possible so as to
minimize
future pain when we're all writing complex and beautiful vector hacks. :-)

I think we basically agree here :-). For me, that applymask
simplifies the reasoning that optimizers must do for masked
instructions is a large part of what motivates it for
consideration.

What can we learn from the IA64 and ARM backends? How do they handle
their masks (scalar predication)? Is all the if-conversion done in
target-specific passes?

It's in lib/CodeGen/IfConversion.cpp, but it wouldn't be usable for
vectors. If-conversion for vectors must be done as part of the
vectorization (whether that's the user/front-end or the optimizer).

Dan

>> The applymask approach leverages use-def information rather than
>> what can be thought of as duplicating a subset of it, making the IR
>
> I don't understand what you mean by "duplicating" here.

If you look just at the case where every instruction in a
given use-def sub-dag uses the same mask, adding that mask as
an operand to all of them is largely just duplicating the
information about them all being connected. This is the common
case that applymask is aimed at.

But why couldn't one argue the same thing for other operands and make
LLVM IR a stack machine? It just seems strange to me to treat a
fundamental part of the IR differently simply because it came along
later.

In the case where multiple masks are used, applymask can still
cope, but the neat thing is that in this case it serves to
mark the dataflow edges where masks change.

What advantage does that give over explicit masks in SSA form? PHI nodes do
the same thing.

> You need some
> kind of use-def information for the masks themselves because at some
> point they need to be register-allocated.

What I'm talking about here is just in LLVM IR. I agree that we want
mask registers as operands during register allocation, and probably
also instruction selection.

So how does isel go from an implicit-masked LLVM IR to an
explicit-masked MachineInstruction? And then wouldn't all Machine
passes need to understand mask operands?

> Think about instcombine. It's matching patterns. If the matcher doesn't
> look at masks, that may be ok most of the time (mod corner cases which
> I fully appreciate can be a real pain to track down). If we want fancy
> instcombine tricks that understand masks, we can add those later.

If masks are operands, instcombine will need to check if all the
relevant masks match before many of the transformations it does,

I'm not sure about "many." Many of instcombine's tricks are based on
expression reorderings which are valid with or without masks. We might
need to do some mask manipulation (ANDing or ORing them, for example)
but we'd have to do that anyway with applymask if the incoming masks
to expressions being combined don't match. We'd have to introduce a new
applymask that takes the result of the logical mask operation.

So, it seems to me that instcombine needs to check and manipulate masks
either way.

We'd have to be careful about not introducing traps but again instcombine will
have to worry about that with applymask as well in the general case.

It strikes me that you're indicating that with applymask, instcombine can just
ignore the implicit mask operand on every instruction (because it isn't
explicit and instcombine doesn't look at it right now). I'm in fact saying
the exact same thing, except masks would be explicit and instcombine would
just not look at those operands.

Neither solution eliminates the need for instcombine to be careful and consult
masks from time to time.

Perhaps I'm totally missing something. Concrete examples would be helpful.

and it'll need to take care to put the mask operand in the
instructions it creates.

Or create a new applymask if it needs to generate new masks as the result of
combining.

With applymask, I believe instcombine wouldn't require any
modifications, except things like "case ApplyMaskInst: break;" in
a few places. Applymask makes masks in the IR so easy to reason
about, most passes won't need to do any special reasoning.

I don't think this is quite right. See the comment above.

>> The syntax above is an idea for LLVM IR. SelectionDAG doesn't
>> necessarily
>> have to use the same approach.
>
> What do you mean by "ideal for LLVM IR?" This looks very much _not_
> ideal to
> me from a debugging standpoint. It's difficult to understand. It took
> me reading through the proposal a few times to grok what you are talking
> about.

I said "idea", not "ideal" :-). But I just meant that LLVM IR
and SelectionDAG don't have to do the same thing.

Oops, sorry about the misread. :-)

I don't think SelectionDAG is really the issue. The issue is how to reason
about vector code and debug compilers. In my experience, explicit is
usually the better way to go for maintainability.

> How do you define "significant impact?" Compile time? Development
> effort?
> Transition pain? All of the above? More?

With mask operands, many passes will need to explicitly check for
masks even if they don't care and just want to be conservatively
correct.

If passes need to worry about masks for correctness, that doesn't change
with the representation of masks. That's a program dataflow issue and applies
no matter how it's represented.

With applymask, passes will often be able to operate on masked IR
just as aggressively as non-masked IR.

And they should be able to do so with any other representation as well. The
compiler needs to maintain semantics, not syntax.

> If y'all agree with this premise, it seems to me that we want to support
> such architectures in as straightforward a way as possible so as to
> minimize
> future pain when we're all writing complex and beautiful vector hacks.
> :-)

I think we basically agree here :-). For me, that applymask
simplifies the reasoning that optimizers must do for masked
instructions is a large part of what motivates it for
consideration.

I'm not convinced that there is a simplification of reasoning but I'm open to
changing my mind. Again, concrete examples would be really helpful.

I do agree that in some cases code rewrites might be avoided because existing
Instructions will retain the same number of operands they have now. But
that's not a "reasoning" issue. That's a mechanics issue (how you implement
the algorithm).

Avoiding the one-time pain of converting this kind of code doesn't seem worth
the cost to maintainability to me.

The key with all of this is going to be introducing it incrementally. The
first step is to support vector of i1 all the way through isel. Next is
vector select. Then we can start selectively adding masks to Instructions.
As we do this, if a vector operation can trap and a mask isn't available for
that Instruction yet, then we just can't vectorize that code yet. There's no
loss from what we have right now, which is total inability to vectorize
conditional code in a reasonable way, save for a few tricks for corner cases.

If instcombine isn't ready to handle masks on some Instructions, then we can't
vectorize in those cases. We don't *need* to convert instcombine or any other
pass all at once. We can do it gradually and introduce more vector capability
as we go.

> What can we learn from the IA64 and ARM backends? How do they handle
> their masks (scalar predication)? Is all the if-conversion done in
> target-specific passes?

It's in lib/CodeGen/IfConversion.cpp, but it wouldn't be usable for
vectors. If-conversion for vectors must be done as part of the
vectorization (whether that's the user/front-end or the optimizer).

I meant using them as examples of representation. I know that scalar
if-conversion is different than vector if-conversion. I'm curious about how
IfConversion.cpp represents the predicates it creates. Since it's in CodeGen
I guess the MachineInstruction representation just includes them as extra
operands.

                                                            -Dave

>> The set of instructions requiring mask legalization would be found by
>> recursively traversing the uses of all applymask instructions.
>
> what if the applymask is in one function, but the use of it is in
> another?

I think we could just disallow this.

Likewise for phi nodes where some incoming values have an applymask and others do not?

Ciao,

Duncan.

Ok, so I took my own advice and thought about CSE and instcombine a bit.
I wrote the code by hand in a sort of pseudo-llvm language, so don't
crucify me for mistakes. :-)

CSE is an optimization that needs to pay attention to masks. You can't
(easily) CSE an expression with different masks.

Using a mask-per-operation setup, code might look like this:

CSE Mask Example

Mask-per-operation:
%m1 = ...
%m2 = ...
%r1 = add %r2, %r3, %m1
%r4 = add %r2, %r3, %m2

We can only CSE the adds if %m1 == %m2. I'd think it would be
sufficient to just check whether the mask registers are the same, though
you could imagine walking back through the use-def chains to do more
complex analysis. Essentially, the mask operand becomes part of the
expression you're trying to match. CSE definitely needs to be
changed to understand masks.

With applymask, things look something like this:

applymask:
%m1 = ...
%m2 = ...
%r5 = applymask %r2, %m1
%r6 = applymask %r3, %m1
%r7 = applymask %r2, %m2
%r8 = applymask %r3, %m2
%r1 = add %r5, %r6
%r4 = add %r7, %r8

This is interesting because CSE will not do the optimization since the
operands of the adds are different. CSE does not have to change to
understand masks. Value numbering will have to understand applymask to
make sure it doesn't do something bad, since it doesn't care about the
names of operands; still, the renaming of registers required by applymask
is somewhat attractive.

Now let's look at an instcombine example:

Instcombine example: -A + -B --> -(A + B)

Mask-per-operation:
%m1 = ...
%m2 = ...
%m3 = ...
%r1 = neg %r2, %m1
%r1 = select %r1, %r2, %m1 # Define all elements
%r3 = neg %r4, %m2
%r3 = select %r3, %r4, %m2 # Define all elements
%r5 = add %r1, %r3, %m3

Since the masks on the negs are different we have to be careful in how
we do this. In the easiest case, we simply disallow combining here,
which requires instcombine to examine the masks and, as in the CSE
example above, potentially just look at the mask register names to
determine equality.

One could imagine doing the combining this way:

%m1 = ...
%m2 = ...
%m3 = ...
# Do the combining where legal
%m4 = and %m1, %m2
%r6 = add %r2, %r4, %m4
%r7 = neg %r6, %m4
%r7 = select %r7, %r6, %m4 # Define all elements for final merge
# Take care of the rest
%m6 = xor %m1, %m4
%m7 = xor %m2, %m4
%m8 = xor %m3, %m4
%r1 = neg %r2, %m6
%r1 = select %r1, %r2, %m6 # Define all elements for add
%r3 = neg %r4, %m7
%r3 = select %r3, %r4, %m7 # Define all elements for add
%r5 = add %r1, %r3, %m8
%r5 = select %r5, %r1, %m7 # Define all elements for final merge
# Do the final merge
%r5 = select %r7, %r5, %m4

But this is obviously a loss. We may be able to get away with
eliminating some merges (I'm being conservative about undefined elements
here) but it would still be a loss.

applymask:

%m1 = ...
%m2 = ...
%m3 = ...
%r6 = applymask %r2, %m1
%r7 = applymask %r4, %m2
%r1 = neg %r6
%r3 = neg %r7
%r8 = applymask %r1, %m3
%r9 = applymask %r3, %m3
%r5 = add %r8, %r9

This is really no different than the mask-per-operation case.
Instcombine still has to consider masks when evaluating the legality and
profitability of combining. Here instcombine would have to walk back
through the def-use lists on the operands of the neg and add to find if
they are masked and then determine the mask value (or mask register name
for simpler equality testing).

Again, doing the combining in this case would be detrimental. My guess
is that most combining turns out worse code when operations are masked
and the masks are not equal.

Given these two examples, I see some benefit to applymask for
transformations like CSE and other pattern-matching algorithms where the
renaming of registers that applymask requires converts previously
matching patterns into patterns that don't match. For something like
instcombine or value numbering that is based on value flow and/or pure
semantics, I don't see any benefit.

Of course, these are just two examples. One could repeat the exercise
for any transformation we're interested in.

I hope this is helpful in moving us along in the discussion. Feedback
and comments are welcome!

                                                -Dave

This thread is already churning, so I apologize if by being late to the party I have missed important information.

The “apply_mask” approach is very familiar to me, in that I spent a lot of time thinking about how masking could be added pervasively to LLVM without disrupting the currently nice property that most of the standard arithmetic instructions are “overloaded” to work on all types.

The space of options that I saw available (and please don’t judge these options just yet, I am well aware that some of them are abhorrent):

1 - Make masking explicit and change every potentially side-effecting operation to include an i1 mask in the scalar case and i1 vector mask in the vector case.
2 - Add specialized masked versions of these operations (distinct from the unmasked versions) either as intrinsics or new instructions
3 - Add an implicit mask that all vector operations occur “under” along with operations to set/get (or push/pop) this mask.
4 - Like 3 but add some notion of a “vector branch” - that is a conditional branch on a vector of i1 values instead of a single i1, which would implicitly do “the right thing” for masking.
5 - Add a new type for “partial” vectors that combines a vector and same-sized mask. Operations overloaded for normal vectors would also work on partial vectors (and not operate on “missing” elements), and would produce partial results.

1 and 2 are the more-or-less straightforward approaches being considered, and involve either plumbing through new operations, or new semantics for old operations. There is nontrivial (I expect) but straightforward (again, I expect) work involved in either. The rest of the options try to avoid some or all of that work and/or provide niceties for end users at the expense of being “clever”.

Options 3 and 4 are almost certainly outside of the realm of what LLVM should consider. One of the best things about SSA is how explicit it makes dependencies, so adding any kind of implicit mask that is carried behind people’s backs is really scary.

Option 5 is (to my understanding) equivalent to this whole “apply_mask” approach being discussed, although the latter tries to avoid introducing a new type. Even though “apply_mask” doesn’t explicitly introduce a new type, there are all kinds of special rules about when and where a value that has been “apply_mask”'d can be used. Adding a type to represent partiality allows these rules to be made explicit as part of type-checking the IR.

Such partial vectors are still fairly non-orthogonal, so it is unclear whether the end-user experience is worth whatever complexity it adds to the IR.
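
To make option 5 concrete, here is a purely illustrative sketch (no such type exists; the partial-vector notation is invented for this example):

  %pv = make.partial <2 x f32> %v, <2 x i1> %m ; pack a vector with its mask
  %pr = add <2 x partial f32> %pv, %pw ; operates only on present elements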

Anyway, that’s my breakdown of the situation. I welcome any feedback from those with a better understanding of the consequences of any of these approaches…

- Tim

Good contribution, thanks. I've read the thread but don't have much insight. Dan's suggestion sounds plausible, but not especially elegant.

- Mark

Neither solution eliminates the need for instcombine to be careful and
consult masks from time to time.

Perhaps I'm totally missing something. Concrete examples would be helpful.

Ok, so I took my own advice and thought about CSE and instcombine a bit.
I wrote the code by hand in a sort of pseudo-llvm language, so don't
crucify me for mistakes. :-)

CSE is an optimization that needs to pay attention to masks. You can't
(easily) CSE an expression with different masks.

Using a mask-per-operation setup, code might look like this:

CSE Mask Example

Mask-per-operation:
%m1 = ...
%m2 = ...
%r1 = add %r2, %r3, %m1
%r4 = add %r2, %r3, %m2

We can only CSE the adds if %m1 == %m2. I'd think it would be
sufficient to just check whether the mask registers are the same, though
you could imagine walking back through the use-def chains to do more
complex analysis. Essentially, the mask operand becomes part of the
expression you're trying to match. CSE definitely needs to be
changed to understand masks.

With applymask, things look something like this:

applymask:
%m1 = ...
%m2 = ...
%r5 = applymask %r2, %m1
%r6 = applymask %r3, %m1
%r7 = applymask %r2, %m2
%r8 = applymask %r3, %m2
%r1 = add %r5, %r6
%r4 = add %r7, %r8

This is interesting because CSE will not do the optimization since the
operands of the adds are different. CSE does not have to change to
understand masks. Value numbering will have to understand applymask to
make sure it doesn't do something bad, since it doesn't care about the
names of operands; still, the renaming of registers required by applymask
is somewhat attractive.

Right. One way to think about it is that applymask can conservatively be
treated as a plain binary operator, for register-oriented passes at least.

Now let's look at an instcombine example:

Instcombine example: -A + -B --> -(A + B)

Mask-per-operation:
%m1 = ...
%m2 = ...
%m3 = ...
%r1 = neg %r2, %m1
%r1 = select %r1, %r2, %m1 # Define all elements
%r3 = neg %r4, %m2
%r3 = select %r3, %r4, %m2 # Define all elements
%r5 = add %r1, %r3, %m3

Since the masks on the negs are different we have to be careful in how
we do this. In the easiest case, we simply disallow combining here,
which requires instcombine to examine the masks and, as in the CSE
example above, potentially just look at the mask register names to
determine equality.

One could imagine doing the combining this way:

%m1 = ...
%m2 = ...
%m3 = ...
# Do the combining where legal
%m4 = and %m1, %m2
%r6 = add %r2, %r4, %m4
%r7 = neg %r6, %m4
%r7 = select %r7, %r6, %m4 # Define all elements for final merge
# Take care of the rest
%m6 = xor %m1, %m4
%m7 = xor %m2, %m4
%m8 = xor %m3, %m4
%r1 = neg %r2, %m6
%r1 = select %r1, %r2, %m6 # Define all elements for add
%r3 = neg %r4, %m7
%r3 = select %r3, %r4, %m7 # Define all elements for add
%r5 = add %r1, %r3, %m8
%r5 = select %r5, %r1, %m7 # Define all elements for final merge
# Do the final merge
%r5 = select %r7, %r5, %m4

But this is obviously a loss. We may be able to get away with
eliminating some merges (I'm being conservative about undefined elements
here) but it would still be a loss.

applymask:

%m1 = ...
%m2 = ...
%m3 = ...
%r6 = applymask %r2, %m1
%r7 = applymask %r4, %m2
%r1 = neg %r6
%r3 = neg %r7
%r8 = applymask %r1, %m3
%r9 = applymask %r3, %m3
%r5 = add %r8, %r9

This is really no different than the mask-per-operation case.
Instcombine still has to consider masks when evaluating the legality and
profitability of combining. Here instcombine would have to walk back
through the def-use lists on the operands of the neg and add to find if
they are masked and then determine the mask value (or mask register name
for simpler equality testing).

Again, doing the combining in this case would be detrimental. My guess
is that most combining turns out worse code when operations are masked
and the masks are not equal.

Given these two examples, I see some benefit to applymask for
transformations like CSE and other pattern-matching algorithms where the
renaming of registers that applymask requires converts previously
matching patterns into patterns that don't match. For something like
instcombine or value numbering that is based on value flow and/or pure
semantics, I don't see any benefit.

I agree about value numbering. I think instcombine is a little
different. But only with the help of a pretty non-trivial
assumption.

One of the cornerstone assumptions of the applymask approach is
that code like what's in your examples is relatively uncommon,
and that the common case is larger groups of instructions
sharing mask values. If this assumption isn't true, many of
the advantages of applymask no longer hold. I should encourage
people to question this assumption first, because pretty much
everything related to applymask depends on it :-).

But if it is true, then we can say that instcombine's job is
made easier by applymask. Instcombine primarily does very
localized analysis, so it can look at something like this:

   %r0 = neg %r888
   %r1 = neg %r999
   %r2 = add %r0, %r1

and without knowing anything about r888 or r999 and whether
or not there were applymasks involved in producing their
value, it can safely do the transformation.

A key rule in my initial email is that the operands of any
operator must be under the same mask. The Verifier can enforce
this constraint. (One problem, though, is that this places an
interesting burden on front-ends and vectorizers. I'll try to
write more about this soon.)

There are things instcombine can do when the masks are different.
I think you mentioned earlier it could do clever tricks ANDing
masks and such. However, based on the assumption above, this
wouldn't be used very often, so a conservative implementation of
instcombine that doesn't do any of this is mostly good enough.

Of course, these are just two examples. One could repeat the exercise
for any transformation we're interested in.

I hope this is helpful in moving us along in the discussion. Feedback
and comments are welcome!

Thanks for the examples! I think they have been helpful. I'll
try to add a few more here.

First, dead store elimination. It clearly needs to know if a
store is predicated. Mask operands would make this pretty
obvious. Applymask would make it less obvious, but still
doable, and would probably want memoization to be
efficient. This is a case where mask operands are favored.
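
A sketch of the applymask case (hypothetical code): the predication is on
the pointer's def rather than on the store itself, so DSE has to look
through the applymask to see that the first store is only partially
overwritten:

   %p = applymask <2 x f32*> %q, <2 x i1> %m
   %w = applymask <2 x f32> %v2, <2 x i1> %m
   store <2 x f32> %v1, <2 x f32*> %q ; killed only where %m is 1
   store <2 x f32> %w, <2 x f32*> %p ; masked store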

Here's one more example, or at least a skeleton of several
examples. A loop to be vectorized:

   for (...) {
     A
     if (...) {
       B
     } else {
       C
     }
     D
   }

Assuming there's a bunch of code in B and a bunch in C, then
we have four bunches of code and three mask conditions -
A and D are unmasked, B is masked, and C is masked with the
inverse mask value. This code could easily have more if
branches, nested loops, early exits, and so on, but the main
idea is that there are blocks of instructions grouped together
under the same mask, cross-block optimization opportunities
exist but are limited, and that this is assumed to be
basically representative.
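
A sketch of how the three mask conditions might be materialized in the
vectorized body (hypothetical, for a vector width of 4):

   %m = icmp slt <4 x i32> %t, %u ; the vectorized if-condition
   %notm = xor <4 x i1> %m, <i1 1, i1 1, i1 1, i1 1>
   ; A and D run unmasked; B runs under applymask with %m, C under %notm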

There are some interesting cross-block cases here.
If you can prove that something in B and/or C can be run
unmasked, it could be CSE'd/LICM'd/PRE'd/etc.
If D has a PHI merging a value from B and C, it might be
nice to do a vector merge. It's interesting to look at
how these kinds of cases work under both
mask operands and applymask.

Dan

Dan,

A key rule in my initial email is that the operands of any
operator must be under the same mask. The Verifier can enforce
this constraint. (One problem, though, is that this places an
interesting burden on front-ends and vectorizers. I'll try to
write more about this soon.)

I totally missed that restriction. I think this is error-prone. I certainly
wouldn't have caught the problem in my code even if I had known about the
restriction. I suppose the fix in this case is to set the mask to all 1's
before the add. This could get rather complicated in interesting cases.

There are things instcombine can do when the masks are different.
I think you mentioned earlier it could do clever tricks ANDing
masks and such. However, based on the assumption above, this
wouldn't be used very often, so a conservative implementation of
instcombine that doesn't do any of this is mostly good enough.

Since we don't support ANY vectorization of conditional code right now, any
addition is a win. That's true for whatever implementation of masks we
go with.

Here's one more example, or at least a skeleton of several
examples. A loop to be vectorized:

   for (...) {
     A
     if (...) {
       B
     } else {
       C
     }
     D
   }

Assuming there's a bunch of code in B and a bunch in C, then
we have four bunches of code and three mask conditions -
A and D are unmasked, B is masked, and C is masked with the
inverse mask value. This code could easily have more if
branches, nested loops, early exits, and so on, but the main
idea is that there are blocks of instructions grouped together
under the same mask, cross-block optimization opportunities
exist but are limited, and that this is assumed to be
basically representative.

I think that's right.

There are some interesting cross-block cases here.
If you can prove that something in B and/or C can be run
unmasked, it could be CSE'd/LICM'd/PRE'd/etc.
If D has a PHI merging a value from B and C, it might be
nice to do a vector merge. It's interesting to look at
how these kinds of cases work under both
mask operands and applymask.

Yes. Unfortunately, I'm out of spare cycles at the moment.

I was thinking about this more yesterday and came up with what I think is a
useful question to ask.

If LLVM had had masks way back when it started, what would it look like?

I realize that there are constraints on what kind of transition pain we want
to go through but it gets back to my earlier point that core IR infrastructure
should be first-class no matter when it is added. And "what it would have
looked like from the start" is a reasonable definition of "first-class."

                                                 -Dave

It also seems hard to enforce and completely arbitrary :-)

Not to mention value numbering could easily be taught to combine masks
when value numbering, to simplify dependent applymasks, etc.

Neither solution eliminates the need for instcombine to be careful and
consult masks from time to time.

Perhaps I'm totally missing something. Concrete examples would be helpful.

Ok, so I took my own advice and thought about CSE and instcombine a bit.
I wrote the code by hand in a sort of pseudo-llvm language, so don't
crucify me for mistakes. :-)

CSE is an optimization that needs to pay attention to masks. You can't
(easily) CSE an expression with different masks.

If you are using a value-based CSE, you can easily CSE expressions
with different masks.
I'd agree that with a lexical, expression-based CSE you can't, but that's
a pretty weak form of CSE.

Using a mask-per-operation setup, code might look like this:

...

...

With applymask, things look something like this:

applymask:
%m1 = ...
%m2 = ...
%r5 = applymask %r2, %m1
%r6 = applymask %r3, %m1
%r7 = applymask %r2, %m2
%r8 = applymask %r3, %m2
%r1 = add %r5, %r6
%r4 = add %r7, %r8

This is interesting because CSE will not do the optimization since the
operands of the adds are different.

LLVM's main CSE these days is really GVN based.
GVN should happily compute these to be the same value, and if it
doesn't, you have much larger problems.

GCC's value numberer already combines shifts and masks during value
numbering over multiple binary operations.
It is not hard to make LLVM's do the same to the small degree it
doesn't (the value numbering does not value number phi nodes/cycles
iteratively so it misses identities between induction variables and
things based on them in some cases; SCEV saves it in others).
You don't need to do it to the level instcombine does to get good results.

This is all with my "person who implemented SCC-based value numbering
and value-based PRE in GCC" hat on :-)
--Dan

This is basically preventing the applymask instruction itself from being
masked, which doesn't really make
sense -- "conditionally use this mask using this other mask".

Mask values are just vectors of i1, so the way to compute the
intersection of two mask values is to use an 'and' instruction.
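
For example:

  %m3 = and <2 x i1> %m1, %m2 ; intersection of the two masks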

Dan

> A key rule in my initial email is that the operands of any
> operator must be under the same mask. The Verifier can enforce
> this constraint. (One problem, though, is that this places an
> interesting burden on front-ends and vectorizers. I'll try to
> write more about this soon.)

I totally missed that restriction. I think this is error-prone. I certainly
wouldn't have caught the problem in my code even if I had known about the
restriction. I suppose the fix in this case is to set the mask to all 1's
before the add. This could get rather complicated in interesting cases.

I'm studying this problem and I have some ideas, but I don't
have any clear answers yet.

> There are things instcombine can do when the masks are different.
> I think you mentioned earlier it could do clever tricks ANDing
> masks and such. However, based on the assumption above, this
> wouldn't be used very often, so a conservative implementation of
> instcombine that doesn't do any of this is mostly good enough.

Since we don't support ANY vectorization of conditional code right now, any
addition is a win. That's true for whatever implementation of masks we
go with.

> Here's one more example, or at least a skeleton of several
> examples. A loop to be vectorized:
>
> for (...) {
> A
> if (...) {
> B
> } else {
> C
> }
> D
> }
>
> Assuming there's a bunch of code in B and a bunch in C, then
> we have four bunches of code and three mask conditions -
> A and D are unmasked, B is masked, and C is masked with the
> inverse mask value. This code could easily have more if
> branches, nested loops, early exits, and so on, but the main
> idea is that there are blocks of instructions grouped together
> under the same mask, cross-block optimization opportunities
> exist but are limited, and that this is assumed to be
> basically representative.

I think that's right.

> There are some interesting cross-block cases here.
> If you can prove that something in B and/or C can be run
> unmasked, it could be CSE'd/LICM'd/PRE'd/etc.
> If D has a PHI merging a value from B and C, it might be
> nice to do a vector merge. It's interesting to look at
> how these kinds of cases work under both
> mask operands and applymask.

Yes. Unfortunately, I'm out of spare cycles at the moment.

I understand :-). And I should clarify for everyone following along
here that there are some more basic projects that need to be completed
before LLVM is ready to take on generalized predication, so we have
some time.

I was thinking about this more yesterday and came up with what I think is a
useful question to ask.

If LLVM had had masks way back when it started, what would it look like?

I realize that there are constraints on what kind of transition pain we want
to go through but it gets back to my earlier point that core IR infrastructure
should be first-class no matter when it is added. And "what it
would have
looked like from the start" is a reasonable definition of
"first-class."

One thing I want to clarify is that my interest in applymask is actually
not so much motivated by one-time transition pain avoidance here. I'm
more interested in keeping the IR simple and easy to understand, as much
for the benefit of code that isn't yet written as code that is. I've
gotten some good feedback and there are some outstanding problems, but
I'm hoping that generally acceptable solutions can be found.

Dan