Predicated Vector Operations

Ah, I think I get it now. This was mentioned earlier in the thread,
but it didn't click at the time. It sounds like I can do instruction
selection with a pattern like (omitting selection of the sources):

let Constraints = "$dst = $oldvalue" in {
    def MASKEDARITH : MyInstruction<
        (outs VectorReg:$dst),
        (ins MaskReg:$mask, VectorReg:$src1, VectorReg:$src2,
VectorReg:$oldvalue),
        "add $dst {$mask}, $src1, $src2",
        [(set v16i32:$dst, (vselect v16i1:$mask, (add v16i32:$src1,
v16i32:$src2), v16i32:$oldvalue))]>;
}

That's actually pretty clean.

Thanks

Jeff Bush <jeffbush001@gmail.com> writes:

It seems to me that these are not really LLVM issues as much as the
fact that SSA doesn't cleanly map to predicated instructions.

It entirely depends on how the predication is defined to work.

Good point. I was thinking of it narrowly as preserving the old value
in the register. I guess I'd amend my previous statement to say that
it actually does map just fine to SSA, but instruction selection
becomes more complex.

Yes, if you want to take advantage of the hardware preserving old
values. There's nothing that requires you to do that, however. You can
always write to a new hardware register and effectively get the
"undefined for false mask" behavior.

Maybe a later pass could even clean things up and opportunistically use
the old value preserving behavior by rewriting register names and
combining instructions. Just thinking off the top of my head.

It sounds like the current LLVM instruction selection algorithm can't
really handle the use case I described cleanly (generating predicated
arithmetic instructions that preserve the old register value). Is
that a fair statement?

I actually don't know yet. I'll let you know in a few weeks once I've
tested some things. :)

For example, if predicates were hypothetically added universally to
the IR (which I don't think anyone wants to do), it's not clear to me
how that would even work. How would you specify what value the result
would have for non-enabled lanes? Perhaps another parameter:

  %newvalue = fadd %x, %y, %mask, %previousvalue

It depends on how you define the mask operation.

On the Cray X1, result elements mapped to false mask values are
undefined after the operation.

I assume the only point of the mask is to avoid traps/exceptions in
this case (otherwise it doesn't really do anything, right?).

On the X1 it also potentially increases performance as masked elements
can be skipped and the operation can finish early.

On a machine like Knights Corner, the mask is the only way to have a
dynamic vector length. Traditional vector machines have a vector length
register that controls how many elements an operation should affect.
This is used to avoid the need for a remainder loop after the main
vectorized loop. A mask can be used for the same purpose. So a vector
mask is used for more than just traditional if-conversion.
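
For instance, a remainder mask might be computed in IR along these
lines (a rough sketch; a 4-wide vector is used only for brevity, and
%rem stands for the number of elements left to process):

    %rem.ins = insertelement <4 x i32> undef, i32 %rem, i32 0
    %rem.splat = shufflevector <4 x i32> %rem.ins, <4 x i32> undef, <4 x i32> zeroinitializer
    ; lane i is enabled iff i < %rem
    %mask = icmp ult <4 x i32> <i32 0, i32 1, i32 2, i32 3>, %rem.splat

Every vector operation in the final iteration is then predicated on
%mask instead of relying on a vector length register.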

                           -David

Jeff Bush <jeffbush001@gmail.com> writes:

Ah, I think I get it now. This was mentioned earlier in the thread,
but it didn't click at the time. It sounds like I can do instruction
selection with a pattern like (omitting selection of the sources):

let Constraints = "$dst = $oldvalue" in {
    def MASKEDARITH : MyInstruction<
        (outs VectorReg:$dst),
        (ins MaskReg:$mask, VectorReg:$src1, VectorReg:$src2,
VectorReg:$oldvalue),
        "add $dst {$mask}, $src1, $src2",
        [(set v16i32:$dst, (vselect v16i1:$mask, (add v16i32:$src1,
v16i32:$src2), v16i32:$oldvalue))]>;
}

Ok, but where does $oldvalue come from? That is the tricky part as far
as I can see and is why this isn't quite the same as handling
two-address instructions.

I agree that the pattern itself is straightforward. It's basically what
I've written here.

                                -David

%tx = select %mask, %x, <0.0, 0.0, 0.0 ...>
%ty = select %mask, %y, <0.0, 0.0, 0.0 ...>
%sum = fadd %tx, %ty
%newvalue = select %mask, %sum, %oldvalue << From here?

If you had a designated predicated instruction you would have the same issue. The %oldvalue has to come from somewhere (or be undefined).

%oldval = ... | undef
%newvalue = predicated_fadd %mask, %left, %right, %oldval

I guess I don’t understand your question.

Instcombine might remove the select %mask, %sum, undef but that is another issue ...

Arnold Schwaighofer <aschwaighofer@apple.com> writes:

Ok, but where does $oldvalue come from? That is the tricky part as far
as I can see and is why this isn't quite the same as handling
two-address instructions.

From the semantics of your program?

%tx = select %mask, %x, <0.0, 0.0, 0.0 ...>
%ty = select %mask, %y, <0.0, 0.0, 0.0 ...>
%sum = fadd %tx, %ty
%newvalue = select %mask, %sum, %oldvalue << From here?

Well yes. The issue isn't really in codegen, it's in IR generation.
I don't think it's unsolvable, just tricky.

If you had a designated predicated instruction you would have the same
issue. The %oldvalue has to come from somewhere (or be undefined).

%oldval = ... | undef
%newvalue = predicated_fadd %mask, %left, %right, %oldval

Certainly. This is an SSA issue, not a predication issue, really.

                             -David

I may be missing some important detail here, but I assumed $oldvalue
and $dst were just SSA names for the same variable. For example,
given the following snippet for a compute kernel:

   if (x > 10)
       x = x - 10

If you wanted to run a bunch of parallel instances with each vector
lane representing an instance, I assume the IR would be something
roughly like (ignoring source selects for brevity):

   %mask = cmp gt %x1, 10
   %diff = sub %x1, 10
   %x2 = select %mask, %diff, %x1

At this point, %x1 is dead. %x1 and %x2 represent 'x' in the program above.
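
With explicit vector types (narrowed to 4 lanes here just to keep the
constants readable), the same thing might look like:

    %mask = icmp sgt <4 x i32> %x1, <i32 10, i32 10, i32 10, i32 10>
    %diff = sub <4 x i32> %x1, <i32 10, i32 10, i32 10, i32 10>
    %x2 = select <4 x i1> %mask, <4 x i32> %diff, <4 x i32> %x1

The final select has the same vselect-over-arithmetic shape that the
MASKEDARITH-style pattern earlier in the thread matches (with sub in
place of add), and its false operand, %x1, is what supplies $oldvalue.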

Even in architectures that support masks, like MIC, using masks is not always beneficial because it adds a register dependency.
We analyzed these aspects and came to the conclusion that the vectorizer can use masked load/store and gather/scatter intrinsics to protect memory accesses.
Adding masks to instructions is partially possible in a target-specific machine pass.

- Elena

As I understand it, one of the tricks for converting to SSA form in LLVM
is to make all variable references be loads/stores, then use Mem2Reg
to convert those back to registers. It seems like using intrinsics
would preclude this and require the front end of the compiler to do
it, which is quite a bit of work.
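
Roughly, that trick looks like this for the scalar form of the x > 10
example from earlier in the thread (just a sketch; the names are made
up):

    define i32 @kernel(i32 %x) {
    entry:
      %x.addr = alloca i32
      ; every reference to 'x' goes through memory
      store i32 %x, ptr %x.addr
      %v = load i32, ptr %x.addr
      %cmp = icmp sgt i32 %v, 10
      br i1 %cmp, label %then, label %done
    then:
      %v2 = load i32, ptr %x.addr
      %dec = sub i32 %v2, 10
      store i32 %dec, ptr %x.addr
      br label %done
    done:
      %r = load i32, ptr %x.addr
      ret i32 %r
    }

Mem2Reg promotes %x.addr back to SSA values and inserts a phi in
%done. If masked operations were only reachable through intrinsics,
the front end would have to emit them directly rather than leaning on
that pass.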

I'm not sure I understand the full impact of this example, and I would
like to.

What are the desired memory model semantics for a masked store?
Specifically, let me suppose a simplified vector model of <2 x i64> on an
i64-word-size platform.

Hi Chandler,

I brought the example in this email thread to show that the optimizations
that we currently have won't work on masked load/store operations because
they don't take the mask into consideration. The memory model is an
interesting question, but I am not sure how it is related.

I understand why you originally brought up the example, but I'm trying to
dig deeper. The reason why the memory model is related is because it
drastically changes the set of options available for representing masked
loads and stores. Masked loads and stores must participate in the memory
model (just as all other loads and stores do), and we will need to update
it to reflect them if they are added to the IR. I'm trying to understand
the potential requirements of the actual code in question.

In our example you can see the problem with a single thread.

Sure, but any solution is going to have to solve both single thread
problems, and concurrent problems. Also, the memory model isn't *only*
about concurrent issues. It's a fundamental description of how loads and
stores interact. That's why I presented the examples I did, and I would
very much appreciate you responding to the entirety of my email.

Both MIC and AVX[1] have masked stores operations and they have a
different memory model.

Ok, but you've only given me a link to one set of instructions, and it
doesn't really have a proper description of the memory model. If I assume
that this instruction's specification is the complete spec for AVX masked
stores, then they are atomic stores (!!!) which means that they do not form
a data race on *any* of the memory locations. This is, to say the least, a
deeply surprising and restrictive model. Certainly, it seems too strong of a
guarantee to enshrine as the *only* form of masked store in the IR.

Chandler Carruth <chandlerc@google.com> writes:

> What are the desired memory model semantics for a masked store?
> Specifically, let me suppose a simplified vector model of <2 x i64> on
> an i64-word-size platform.
>
> masked_store(<42, 42>, Ptr, <true, false>)
>
> Does this write to the entire <2 x i64> object stored at Ptr or not?

No. It writes one element.

Is this a statement about all of the existing hardware that supports masked
stores, or about the desired semantics in your mind for the IR model?

> Put another way, consider:
>
> thread A:
> ...
> masked_store(<42, 42>, Ptr, <true, false>)
> ...
>
> thread B:
> ...
> masked_store(<42, 42>, Ptr, <false, true>)
> ...
>
> Assuming there is no specific synchronization relevant to Ptr between
> these two threads and their masked stores, does this form a data race
> or not?

It entirely depends on the hardware implementation.

Well, in practice yes, but I'm asking how it should be modeled in the IR.
We aren't constrained to the option of having a *different* masked store
intrinsic for every hardware architecture that supports masked stores, and
it would seem strange to not look for a reasonable target independent
abstraction which we can teach the middle-end optimizers about (even if it
does take the form of intrinsics). Maybe there is no such abstraction? That
in and of itself would be surprising to me.

In most cases I
would say yes due to cache coherence issues. On a purely theoretical
machine that doesn't have false sharing, there would be no data race.

I think you're trying to reason about this from a hardware perspective, and
I'm trying to talk about what the right theoretical model for the memory
model is... While hardware is one constraint on that, there are others as
well, so it's worrisome to talk about both a theoretical machine without
false sharing and cache coherency issues when trying to determine if two
operations form a data race.

Of course this assumes that thread B won't access the element stored by
thread A and vice versa.

If we assume that, we have the conclusion -- there is no datarace. The
entire question is whether or not the masked element is notionally "stored"
(but without changing the value afterward), or not.

From Nadav's link, for example, it appears that AVX *does* actually do a
full store of the 256-bit vector, but it does it atomically which precludes
data races.

> From a memory model perspective, if this does *not* form a data race,
> that makes this tremendously more complex to implement, analyze, and
> optimize... I'm somewhat hopeful that the desired semantics are for
> this to form a datarace (and thus require synchronization when
> occurring in different threads like this).

Most of the time the compiler will not know the mask value and will have
to be conservative. As Nadav has pointed out, what constitutes
"conservative" is entirely context-dependent.

But I don't understand why defining this as not being a data race would
complicate things. I'm assuming the mask values are statically known.
Can you explain a bit more?

If my example formed a datarace, then when the optimizer sees such a
non-atomic store, *and doesn't know the mask statically* (as I agree, that
is the common case), it would know that the entire vector store was
independent from stores to the same memory locations in other threads. It
could even model the operation of a masked store as a load from that
address, masking both the loaded vector and the incoming vector
(literally), or-ing (or blending) them together, and storing the result
back out. This is a fairly simple model, and easy to reason about. I would
even suggest that perhaps this is how we should represent it in IR.
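
Concretely, that model would treat a masked store as roughly the
following (reusing the <2 x i64> example; %val is the vector being
stored and %p the destination):

    %old = load <2 x i64>, ptr %p
    %blend = select <2 x i1> %mask, <2 x i64> %val, <2 x i64> %old
    store <2 x i64> %blend, ptr %p

which is only legal if the full-width load and store cannot introduce
a race that the masked operation itself would not have had.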

However, if my example does not form a datarace, then when the optimizer
sees such a non-atomic store, *and doesn't know the mask statically*, it
has to assume that the mask may dynamically preclude storing to a memory
location that is being concurrently accessed. It cannot speculatively load
the vector stored there, and perform an explicit mask and blend and a full
store. It essentially means that if we see a masked store and don't know
the mask, then even if we know the address statically, that doesn't matter
because the mask could effectively index into that address and select a
single element to store to.

Dan Gohman <dan433584@gmail.com> writes:

> But I don't understand why defining this as not being a data race
> would complicate things. I'm assuming the mask values are
> statically known. Can you explain a bit more?
>
> It's an interesting question for autovectorization, for example.
>
> Thread A:
>   for (i=0;i<n;++i)
>     if (i&1)
>       X[i] = 0;
>
> Thread B:
>   for (i=0;i<n;++i)
>     if (!(i&1))
>       X[i] = 1;
>
> The threads run concurrently without synchronization. As written,
> there is no race.

There is no race *if* the hardware cache coherence says so. :) There
are false sharing issues here and different machines have behaved very
differently in the past.

Let's not conflate races with false sharing. They're totally different, and
false sharing is *not* what we're discussing here.

The result entirely depends on the machine's consistency model.

LLVM is a virtual machine and the IR should define a consistency model.
Everything flows from that. I think ideally we'd define the model such
that there is no race in the scalar code and the compiler would be free
to vectorize it. This is a very strict consistency model and for
targets with relaxed semantics, LLVM would have to insert
synchronization operations or choose not to vectorize.

LLVM already has a memory model. We don't need to add one. ;] It's here for
reference: the LLVM Language Reference Manual (https://llvm.org/docs/LangRef.html).

Also, cache coherency is *not* the right way to think of a memory model. It
makes it extremely hard to understand and define what optimization passes
are allowed to do. I think LLVM's memory model does a very good job of this
for both scalar and vector code today. If you spot problems with it, let's
start a thread to address them. I suspect myself, Jeffrey, and Owen will
all be extremely interested in discussing any such issues.

The only thing that isn't in the model that is relevant here is something
that isn't in LLVM today -- masked loads and stores. And that was what
inspired my original question. =D

Chandler Carruth <chandlerc@google.com> writes:

    > What are the desired memory model semantics for a masked store?
    > Specifically, let me suppose a simplified vector model of <2 x i64> on
    > an i64-word-size platform.
    >
    > masked_store(<42, 42>, Ptr, <true, false>)
    >
    > Does this write to the entire <2 x i64> object stored at Ptr or not?

    No. It writes one element.

Is this a statement about all of the existing hardware that supports
masked stores, or about the desired semantics in your mind for the IR
model?

I made the comment thinking about hardware. If it were to write all
elements, what would it write? The old value? That could be useful in
some cases (the performance issue I mentioned below). But it also
presents problems for mem2reg/SSA.

This discussion might lead us to wanting a couple of flavors of masked
load and store. I'm not sure.

    > Put another way, consider:
    >
    > thread A:
    > ...
    > masked_store(<42, 42>, Ptr, <true, false>)
    > ...
    >
    > thread B:
    > ...
    > masked_store(<42, 42>, Ptr, <false, true>)
    > ...
    >
    > Assuming there is no specific synchronization relevant to Ptr between
    > these two threads and their masked stores, does this form a data race
    > or not?
    
    It entirely depends on the hardware implementation.

Well, in practice yes, but I'm asking how it should be modeled in the
IR. We aren't constrained to the option of having a *different* masked
store intrinsic for every hardware architecture that supports masked
stores, and it would seem strange to not look for a reasonable target
independent abstraction which we can teach the middle-end optimizers
about (even if it does take the form of intrinsics). Maybe there is no
such abstraction? That in and of itself would be surprising to me.

I agree. We should choose the semantics that gives the optimizer the
most freedom.

From Nadav's link, for example, it appears that AVX *does* actually do
a full store of the 256-bit vector, but it does it atomically which
precludes data races.

By "atomically," you mean that all elements are written before any other
operation is allowed to read or store to them?

If my example formed a datarace, then when the optimizer sees such
a non-atomic store, *and doesn't know the mask statically* (as I
agree, that is the common case), it would know that the entire vector
store was independent from stores to the same memory locations in
other threads. It could even model the operation of a masked store as
a load from that address, masking both the loaded vector and the
incoming vector (literally), or-ing (or blending) them together, and
storing the result back out. This is a fairly simple model, and easy
to reason about. I would even suggest that perhaps this is how we
should represent it in IR.

Note that from a hardware perspective, the store may or may not cause a
data race depending on alignment and whether the store crosses a cache
line boundary.

However, if my example does not form a datarace, then when the
optimizer sees such a non-atomic store, *and doesn't know the mask
statically*, it has to assume that the mask may dynamically preclude
storing to a memory location that is being concurrently accessed. It
cannot speculatively load the vector stored there, and perform an
explicit mask and blend and a full store. It essentially means that if
we see a masked store and don't know the mask, then even if we know
the address statically, that doesn't matter because the mask could
effectively index into that address and select a single element to
store to.

You want to do a full vector load, update and store. So do we most of
the time, for performance. :)

I think you may be getting hung up a bit. What you describe in this
paragraph isn't a masked store at all. In fact you explicitly state
it's "a full store."

Given your code snippet, if we assume there is no data race in the
scalar code *and* we assume that vector stores are atomic, then the
compiler has two choices on how to translate the code. It can be unsafe
and do the fast full load, merge, full store sequence or it can do a
slower hardware masked store. The Cray compiler will do both depending
on switches or directives given by the user. It is too difficult to
know statically which is the best choice. I believe that by default we
do the fast code and the user can turn on switches to generate safe code
and see if that fixes problems. :)

I think we will need both options in LLVM. There's just no way to pick
one and always have the right answer. I think we only need one masked
store intrinsic. That intrinsic would *not* write to masked elements.
If the compiler wants to be entirely safe, it should use that.
Otherwise it should feel free to use the full load/merge/full store
sequence.
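
To make that concrete, the single intrinsic I have in mind might be
declared something like this (the name and signature are hypothetical,
purely for illustration; details such as an alignment operand are
omitted):

    declare void @masked.store.v2i64(<2 x i64>, ptr, <2 x i1>)

    ; masked-off elements are never written and never trap
    call void @masked.store.v2i64(<2 x i64> %val, ptr %p, <2 x i1> %mask)

Whether masked-off lanes may be written or may fault is the property
that distinguishes it from the full load/merge/full store expansion.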

I will run this by our vector expert to see what he thinks.

                                -David

Chandler Carruth <chandlerc@google.com> writes:

    There is no race *if* the hardware cache coherence says so. :)
    There
    are false sharing issues here and different machines have behaved
    very
    differently in the past.

Let's not conflate races with false sharing. They're totally
different, and false sharing is *not* what we're discussing here.

But in the real world false sharing exists and the compiler has to deal
with it. We can say, "make codegen deal with it," but these issues
bubble up to the target-independent optimizer nonetheless.

A theoretical memory model is good to have but it's often not
sufficient.

    The result entirely depends on the machine's consistency model.
    
    LLVM is a virtual machine and the IR should define a consistency
    model.
    Everything flows from that. I think ideally we'd define the model
    such
    that there is no race in the scalar code and the compiler would be
    free
    to vectorize it. This is a very strict consistency model and for
    targets with relaxed semantics, LLVM would have to insert
    synchronization operations or choose not to vectorize.

LLVM already has a memory model. We don't need to add one. ;] It's
here for reference: the LLVM Language Reference Manual (https://llvm.org/docs/LangRef.html).

I started to look at the LLVM Atomic Instructions and Concurrency Guide
(https://llvm.org/docs/Atomics.html) first for a gentler introduction and
immediately spotted a problem. Your first example precluding register
promotion for the update of x is hugely pessimistic. I don't particularly
care because our optimizer has already done the transformation before we
hit LLVM. :) But with that restriction you're leaving a ton of
performance on the table.

The same goes for vector code generation, in general. Our vectorizer
has already done it. But let's get this right for everyone.

The only thing that isn't in the model that is relevant here is
something that isn't in LLVM today -- masked loads and stores. And
that was what inspired my original question. =D

FWIW, informally, the Cray compiler ignores any concurrency it did not
itself create. It won't generally introduce loads and stores that
weren't there, but it will certainly eliminate any loads and stores it
can. We do have atomic operations which generally behave like the LLVM
atomics. The memory model looks a lot like the C abstract machine. We
generally give the compiler free rein.

We let the Cray compiler do some unsafe optimization from time to time.
Turning a masked load/operation/masked store into a full load/blend/full
store is a common case. Users can disable it if they want to be extra
careful. We worry about false sharing, but only after a certain point
in translation. These have proven to be very practical and effective
techniques.

I wrote about masked stores vs. full stores in a previous message. I
believe masked stores should only write to unmasked elements. They should
not trap on masked elements. If a developer needs something more
flexible for performance, he or she can do an unsafe transformation,
knowing the implications of doing so.

                                  -David

Jeff Bush <jeffbush001@gmail.com> writes: