Adding masked vector load and store intrinsics

Hi,

We would like to add support for masked vector loads and stores by introducing new target-independent intrinsics. The loop vectorizer will then be enhanced to optimize loops containing conditional memory accesses by generating these intrinsics for targets that support them, such as x86 with AVX2 or AVX-512. The vectorizer will first ask the target about the availability of masked vector loads and stores. The SLP vectorizer can potentially be enhanced to use these intrinsics as well.

The intrinsics would be legal for all targets; targets that do not support masked vector loads or stores will scalarize them.
The addressed memory will not be touched for masked-off lanes. In particular, if all lanes are masked off no address will be accessed.

call void @llvm.masked.store (i32* %addr, <16 x i32> %data, i32 4, <16 x i1> %mask)

%data = call <8 x i32> @llvm.masked.load (i32* %addr, <8 x i32> %passthru, i32 4, <8 x i1> %mask)

where %passthru is used to fill the elements of %data that are masked-off (if any; can be zeroinitializer or undef).
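For targets that must scalarize, the per-lane semantics might look like the following for lane 0 of the store (a sketch only; the block and value names are illustrative, not a prescribed lowering, and the pattern repeats for each lane):

  %m0 = extractelement <16 x i1> %mask, i32 0
  br i1 %m0, label %cond.store, label %cont
cond.store:
  %e0 = extractelement <16 x i32> %data, i32 0
  %p0 = getelementptr i32* %addr, i32 0
  store i32 %e0, i32* %p0, align 4
  br label %cont
cont:
  ; ... and likewise for lanes 1 through 15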

Comments so far, before we dive into more details?

Thank you.

- Elena and Ayal

From: "Elena Demikhovsky" <elena.demikhovsky@intel.com>
To: llvmdev@cs.uiuc.edu
Cc: dag@cray.com
Sent: Friday, October 24, 2014 6:24:15 AM
Subject: [LLVMdev] Adding masked vector load and store intrinsics

Hi,

We would like to add support for masked vector loads and stores by
introducing new target-independent intrinsics. The loop vectorizer
will then be enhanced to optimize loops containing conditional
memory accesses by generating these intrinsics for existing targets
such as AVX2 and AVX-512. The vectorizer will first ask the target
about availability of masked vector loads and stores. The SLP
vectorizer can potentially be enhanced to use these intrinsics as
well.

The intrinsics would be legal for all targets; targets that do not
support masked vector loads or stores will scalarize them.
The addressed memory will not be touched for masked-off lanes. In
particular, if all lanes are masked off no address will be accessed.

call void @llvm.masked.store (i32* %addr, <16 x i32> %data, i32 4,
<16 x i1> %mask)

%data = call <8 x i32> @llvm.masked.load (i32* %addr, <8 x i32>
%passthru, i32 4, <8 x i1> %mask)

where %passthru is used to fill the elements of %data that are
masked-off (if any; can be zeroinitializer or undef).

Comments so far, before we dive into more details?

For the stores, I think this is a reasonable idea. The alternative is to represent them in scalar form with a lot of control flow, and I think that expecting the backend to properly pattern match that after isel is not realistic.

For the loads, I'm much less sure. Why can't we represent the loads as select(mask, load(addr), passthru)? It is true that the load might get separated from the select so that isel might not see it (because isel is basic-block local), but we could add some code in CodeGenPrep to fix that for targets on which it is useful to do so (which is a more general solution than the intrinsic anyhow). What do you think?
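In IR, that formulation might look like this (a sketch reusing the operands of the masked.load example above; the bitcast is needed because %addr is an i32 pointer):

  %vecptr = bitcast i32* %addr to <8 x i32>*
  %wide = load <8 x i32>* %vecptr, align 4
  %data = select <8 x i1> %mask, <8 x i32> %wide, <8 x i32> %passthru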

Thanks again,
Hal

> For the loads, I'm much less sure. Why can't we represent the loads as select(mask, load(addr), passthru)? [...] What do you think?

We generate the masked vector intrinsics in an IR-to-IR pass, which is too far from instruction selection. We would need to guarantee that all subsequent IR-to-IR passes do not break the sequence. And only for one or two specific targets. Then we would keep the logic in the type legalizer, which may split or extend operations, and then take care of it again in DAG combine.
In my opinion, this is just unsafe.

- Elena

This looks to be a reasonable proposal. However, might the native instructions that support such masked ld/st have a high latency? Also, it would be good to state some workloads where this will have a positive impact.

-dibyendu

I wrote a loop with a conditional load and store and measured performance on AVX2, where masking support is very basic relative to AVX-512.

I got 2x speedup with vpmaskmovd.

The maskmov instruction is slower than one vector load or store, but much faster than 8 scalar memory operations and 8 branches.

Using masked instructions on AVX-512 will give much more; there, a masked memory operation has no additional latency compared to a regular vector memory operation.
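For illustration only (the benchmark kernel itself was not posted, so the loop shape and names here are hypothetical; %trigger.ptr is assumed to be a vector pointer), a conditional loop of roughly this form, vectorized by eight, is the kind of code in question, and on AVX2 the two intrinsic calls would lower to vpmaskmovd:

  ; hypothetical source: if (trigger[i] < 100) out[i] = in[i] + 5;
  %t = load <8 x i32>* %trigger.ptr, align 4
  %mask = icmp slt <8 x i32> %t, <i32 100, i32 100, i32 100, i32 100, i32 100, i32 100, i32 100, i32 100>
  %in = call <8 x i32> @llvm.masked.load (i32* %in.ptr, <8 x i32> undef, i32 4, <8 x i1> %mask)
  %sum = add <8 x i32> %in, <i32 5, i32 5, i32 5, i32 5, i32 5, i32 5, i32 5, i32 5>
  call void @llvm.masked.store (i32* %out.ptr, <8 x i32> %sum, i32 4, <8 x i1> %mask)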

From: "Elena Demikhovsky" <elena.demikhovsky@intel.com>
To: "Hal Finkel" <hfinkel@anl.gov>
Cc: dag@cray.com, llvmdev@cs.uiuc.edu, "Ayal Zaks" <ayal.zaks@intel.com>
Sent: Friday, October 24, 2014 8:07:18 AM
Subject: RE: [LLVMdev] Adding masked vector load and store intrinsics

> For the loads, I'm must less sure. Why can't we represent the loads
> as select(mask, load(addr), passthru)? It is true, that the load
> might get separated from the select so that isel might not see it
> (because isel if basic-block local), but we can add some code in
> CodeGenPrep to fix that for targets on which it is useful to do so
> (which is a more-general solution than the intrinsic anyhow). What
> do you think?

> We generate the masked vector intrinsics in an IR-to-IR pass, which is too far from instruction selection. We would need to guarantee that all subsequent IR-to-IR passes do not break the sequence.

I'm fully aware of this issue. This needs to be weighed against the cost of updating all other optimizations that operate on loads to also understand this intrinsic.

> And only for one or two specific targets.

Regardless, they're certainly targets many users care about ;)

> Then we would keep the logic in the type legalizer, which may split or extend operations, and then take care of it again in DAG combine.
> In my opinion, this is just unsafe.

If this were really a question of safety, I'd agree. And if we were talking about gather loads, I'd agree. For regular vector loads, I don't see this as a safety issue. We should outline what the downside of emitting a regular load would actually be if some optimization were done to the select. Can you please elaborate on this?

Thanks again,
Hal

From: "Hal Finkel" <hfinkel@anl.gov>
To: "Elena Demikhovsky" <elena.demikhovsky@intel.com>
Cc: dag@cray.com, llvmdev@cs.uiuc.edu
Sent: Friday, October 24, 2014 8:39:56 AM
Subject: Re: [LLVMdev] Adding masked vector load and store intrinsics

> From: "Elena Demikhovsky" <elena.demikhovsky@intel.com>
> To: "Hal Finkel" <hfinkel@anl.gov>
> Cc: dag@cray.com, llvmdev@cs.uiuc.edu, "Ayal Zaks"
> <ayal.zaks@intel.com>
> Sent: Friday, October 24, 2014 8:07:18 AM
> Subject: RE: [LLVMdev] Adding masked vector load and store
> intrinsics
>
> > For the loads, I'm must less sure. Why can't we represent the
> > loads
> > as select(mask, load(addr), passthru)? It is true, that the load
> > might get separated from the select so that isel might not see it
> > (because isel if basic-block local), but we can add some code in
> > CodeGenPrep to fix that for targets on which it is useful to do
> > so
> > (which is a more-general solution than the intrinsic anyhow).
> > What
> > do you think?
>
> We generate the vector-masked-intrinsic on IR-to-IR pass. It is too
> far from instruction selection. We'll need to guarantee that all
> subsequent IR-to-IR passes will not break the sequence.

I'm fully aware of this issue. This needs to be weighed against the
cost of updating all other optimizations that operate on loads to
also understand this intrinsic.

> And only for
> one or two specific targets.

Regardless, they're certainly targets many users care about :wink:

> Then we'll keep the logic in type
> legalizer, which may split or extend operations. Then we are taking
> care in DAG-combine.
> In my opinion, this is just unsafe.

If this were really a question of safety, I'd agree. And if we were
talking about gather loads, I'd agree. For a regular vector loads, I
don't see this as a safety issue. We should outline what the
downside of emitting a regular load would actually be should some
optimization be done to the select. Can you please elaborate on
this?

Never mind ;) -- I changed my mind: the safety issue is with non-aligned loads that might cross page boundaries. Is that right? If so, I think this proposal is good (although obviously the docs need to make clear what the faulting behavior of these intrinsics is).

Thanks again,
Hal

> Why can't we represent the loads as select(mask, load(addr), passthru)?

This suggests that masked-off lanes are free to speculatively load from memory, whereas the proposed semantics is that:

> The addressed memory will not be touched for masked-off lanes. In particular, if all lanes are masked off no address will be accessed.

Ayal.

From: "Ayal Zaks" <ayal.zaks@intel.com>
To: "Hal Finkel" <hfinkel@anl.gov>, "Elena Demikhovsky" <elena.demikhovsky@intel.com>
Cc: dag@cray.com, llvmdev@cs.uiuc.edu
Sent: Friday, October 24, 2014 9:46:01 AM
Subject: RE: [LLVMdev] Adding masked vector load and store intrinsics

> Why can't we represent the loads as select(mask, load(addr),
> passthru)?

This suggests masked-off lanes are free to speculatively load from
memory. Whereas proposed semantics is that:

> The addressed memory will not be touched for masked-off lanes. In
> particular, if all lanes are masked off no address will be
> accessed.

Agreed -- as I said in an e-mail that you probably did not see before you wrote this ;) -- but we should make sure to explicitly state this in the rationale. "touched" is not really the right term here. The underlying issue is that it allows us to deal with unaligned loads that cross page boundaries, i.e. whether a masked-off load is safe to speculate.
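A concrete sketch of the hazard (the memory layout here is hypothetical): suppose %addr is 4-byte aligned and points at the last mapped i32 of a page, with the following page unmapped.

  %vecptr = bitcast i32* %addr to <4 x i32>*
  %v = load <4 x i32>* %vecptr, align 4    ; may fault: bytes 4..15 fall in the unmapped page
  ; the masked form with only lane 0 enabled must not fault,
  ; since masked-off lanes never access memory:
  %v.m = call <4 x i32> @llvm.masked.load (i32* %addr, <4 x i32> undef, i32 4, <4 x i1> <i1 true, i1 false, i1 false, i1 false>)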

On a related note, I presume that the 'i32 4' in the provided example is the alignment. Is that correct?

Thanks again,
Hal

> On a related note, I presume that the 'i32 4' in the provided example is the alignment. Is that correct?

Yes.

- Elena

Hal Finkel <hfinkel@anl.gov> writes:

> For the loads, I'm much less sure. Why can't we represent the loads as select(mask, load(addr), passthru)?

Because that does not specify the correct semantics. This formulation
expects the load to happen before the mask is applied. The load could
trap. The operation needs to be presented as an atomic unit.

The same problem exists with any potentially trapping instruction
(e.g. all floating point computations). The need for intrinsics goes
way beyond loads and stores.
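A division example illustrates the point (a sketch, not from the original message): if-converting "if (b[i] != 0) c[i] = a[i] / b[i]" with an unmasked vector divide is wrong, because the divide itself can trap on the masked-off lanes; masking only the result write comes too late:

  %mask = icmp ne <4 x i32> %b, zeroinitializer
  %q = sdiv <4 x i32> %a, %b                                  ; undefined (may trap) if any lane of %b is zero
  %c = select <4 x i1> %mask, <4 x i32> %q, <4 x i32> %c.old  ; protects the write, not the operation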

                             -David

Hal Finkel <hfinkel@anl.gov> writes:

> I'm fully aware of this issue. This needs to be weighed against the cost of updating all other optimizations that operate on loads to also understand this intrinsic.

In my experience, LLVM's behavior of treating unknown intrinsics conservatively works just fine.

> If this were really a question of safety, I'd agree. And if we were talking about gather loads, I'd agree. For regular vector loads, I don't see this as a safety issue.

It absolutely is a safety issue. Not only could loop control flow cause some vector elements to be skipped that would otherwise trap if loaded, but there are also vector optimizations that assume masking behavior will handle overindexing and other such problems.

Masking is an extremely powerful concept and the sooner LLVM understands
it, the better.

                           -David

Hal Finkel <hfinkel@anl.gov> writes:

>> If this were really a question of safety, I'd agree. And if we were talking about gather loads, I'd agree. For regular vector loads, I don't see this as a safety issue. We should outline what the downside of emitting a regular load would actually be if some optimization were done to the select. Can you please elaborate on this?

> Never mind ;) -- I changed my mind: the safety issue is with non-aligned loads that might cross page boundaries. Is that right?

That's just one safety issue. There are others.

> If so, I think this proposal is good (although obviously the docs need to make clear what the faulting behavior of these intrinsics is).

The behavior should be to never fault on an element whose mask bit is false, and to behave as a regular load (with respect to trapping) for any element whose mask bit is true.

                              -David

> That's just one safety issue. There are others.

Can you be more specific? You mentioned overindexing in your other e-mail; what exactly do you mean by that?

Thanks again,
Hal

One can at least imagine using a masked load to access device memory which might have access granularity smaller than the vector size (this seems like a terrible idea to me, but at least I can conceive of cases where the semantics would matter beyond just page-crossing loads).

That said, page-crossing loads are a good-enough reason to support this on their own.
– Steve

"Demikhovsky, Elena" <elena.demikhovsky@intel.com> writes:

> %data = call <8 x i32> @llvm.masked.load (i32* %addr, <8 x i32> %passthru, i32 4, <8 x i1> %mask)
>
> where %passthru is used to fill the elements of %data that are masked-off (if any; can be zeroinitializer or undef).

So %passthru can *only* be undef or zeroinitializer? If that's the case, it might make more sense to have two intrinsics, one that fills with undef and one that fills with zero. Using a general vector operand with a restriction on valid values seems odd and potentially misleading.

Another option is to always fill with undef and require a select on top
of the load to fill with zero. The load + select would be easily
matchable to a target instruction.
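Expressed with the intrinsic proposed above (undef as the fill value) plus a select, that alternative might look like:

  %v = call <8 x i32> @llvm.masked.load (i32* %addr, <8 x i32> undef, i32 4, <8 x i1> %mask)
  %data = select <8 x i1> %mask, <8 x i32> %v, <8 x i32> zeroinitializer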

I'm trying to think beyond just AVX-512 to what other future architectures might want. It's not a given that future architectures will fill with zero *or* undef, though those are the two most likely fill values.

                             -David

"Das, Dibyendu" <Dibyendu.Das@amd.com> writes:

> This looks to be a reasonable proposal. However, might the native instructions that support such masked ld/st have a high latency? Also, it would be good to state some workloads where this will have a positive impact.

Any significant vector workload will see a giant gain from this.

The masked operations really shouldn't have any more latency. The time
of the memory operation itself dominates.

                            -David

Hal Finkel <hfinkel@anl.gov> writes:

>>> Never mind ;) -- I changed my mind: the safety issue is with non-aligned loads that might cross page boundaries. Is that right?
>>
>> That's just one safety issue. There are others.
>
> Can you be more specific? You mentioned overindexing in your other e-mail; what exactly do you mean by that?

Accessing past the end of an array. Some vector optimizations do that
and assume the masking will prevent traps. Aggressive vectorizers can
do all kinds of "unsafe" transformations that are safe in the presence
of masks.

Any time there is control flow in the loop protecting a dereference of a NULL pointer, a mask is needed, and it needs to be applied at the time of the load, not at the time of the write to the loaded-to register. That's why select doesn't work. The same issue extends to any trap situation, like a divide-by-zero or use of a NaN. It's not only the write to the register that needs protection, it's the operation itself.
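A sketch of the contrast (value names hypothetical): the if-converted select form executes the load unconditionally, while the masked form gates the memory access itself, with the old value of the destination doubling as the passthru:

  ; unsafe if-conversion: the full-width load runs even for masked-off lanes
  %vecptr = bitcast i32* %addr to <4 x i32>*
  %v = load <4 x i32>* %vecptr, align 4
  %x = select <4 x i1> %mask, <4 x i32> %v, <4 x i32> %x.old
  ; safe form: masked-off lanes never touch memory
  %x.m = call <4 x i32> @llvm.masked.load (i32* %addr, <4 x i32> %x.old, i32 4, <4 x i1> %mask)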

                              -David

> select(mask, load(addr), passthru)?

David is right: "select(mask, load(addr), passthru)" is like a vector load plus blending, which involves memory access speculation and is not safe in some cases, so it does not have the same semantics as masking off lanes.

Xinmin

> We would like to add support for masked vector loads and stores by introducing new target-independent intrinsics. [...] The SLP vectorizer can potentially be enhanced to use these intrinsics as well.
>
> The intrinsics would be legal for all targets; targets that do not support masked vector loads or stores will scalarize them.

I do agree that we would like to have one IR node to capture these so that they survive until ISel and so that their specific semantics can be expressed. However, can you discuss the other options (new IR instructions, target-specific intrinsics) and why you went with target-independent intrinsics?

My intuition would have been to go with target-specific intrinsics until we have something solid implemented and then potentially turn this into native IR instructions as the next step (for other targets, etc.). I am particularly worried whether we really want to generate these for targets that don’t have vector predication support.

There is also the related question of vector-predicating instructions beyond just loads and stores, which AVX-512 supports. This is probably a smaller gain but should probably be part of the plan as well.

Adam