`llvm.vp.*` mask semantics not being honored when targeting x86?

Hi all,

I’m hitting a correctness issue when lowering an llvm.vp.fadd to x86 with AVX512 support. It looks like the mask is dropped by the expandvp pass even though the intrinsic is masked:

*** IR Dump Before Expand vector predication intrinsics (expandvp) ***
define <16 x float> @test(<16 x float> %a, <16 x float> %b, <16 x i1> %mask) #0 {
  %c = call <16 x float> @llvm.vp.fadd.v16f32(<16 x float> %a, <16 x float> %b, <16 x i1> %mask, i32 16)
  ret <16 x float> %c
}
*** IR Dump After Expand vector predication intrinsics (expandvp) ***
define <16 x float> @test(<16 x float> %a, <16 x float> %b, <16 x i1> %mask) #0 {
  %c1 = fadd <16 x float> %a, %b
  ret <16 x float> %c1
}

llc test.ll -mcpu=cascadelake

I know that a vectorizer may decide to unmask non-side-effecting instructions when it’s safe to do so, but in this particular case we are explicitly generating a masked intrinsic and the backend doesn’t seem to honor the mask semantics. Is this expected behavior of VP intrinsics?

Thanks!
Diego

@simoll, @rofirrim, @topperc

To support the mask we would need to properly support VP intrinsics in X86 instead of expanding them before SelectionDAG. Or the expansion pass would need to fully scalarize it.

What are your reasons for wanting it masked? The two I can think of are:
- suppressing exceptions
- not triggering a denormal microcode assist on garbage inputs.

To support the mask we would need to properly support VP intrinsics in X86 instead of expanding them before SelectionDAG. Or the expansion pass would need to fully scalarize it.

Would it make sense to expand it to operation + select? I’ve seen this pattern around sometimes.
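
For reference, a rough sketch of what I mean in IR (the value picked for the masked-off lanes, %a here, is an arbitrary choice for illustration):

define <16 x float> @test_expanded(<16 x float> %a, <16 x float> %b, <16 x i1> %mask) {
  ; Unmasked fadd followed by a select on the mask; masked-off lanes
  ; fall back to %a in this sketch.
  %sum = fadd <16 x float> %a, %b
  %c = select <16 x i1> %mask, <16 x float> %sum, <16 x float> %a
  ret <16 x float> %c
}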

What are your reasons for wanting it masked?

We sometimes vectorize two dimensions in MLIR and then “legalize” one of them by unrolling it when there is no 2-D vector support for that operation. We may mask both vector dimensions and then keep the mask also for the unrolled dimension (i.e., effectively, we unroll using masking!). This significantly reduces code size in some cases where code size is a major concern, at the expense of potentially executing some no-ops if the mask along the unrolled dimension becomes all-zero.

In this scenario, if we have two dimensions [i, j] and there is an f32 reduction over j, we can:

  1. unroll(i, UF), vectorize(j, VF): we generate UF x <VFxf32> llvm.vp.reduce.fadd intrinsics with masks.
  2. vectorize(i, VF), unroll(j, UF): we generate UF x <VFxf32> llvm.vp.fadd intrinsics with masks.

In #1 we perform “horizontal” reductions with masks, as we are vectorizing along the reduction dimension. This works because the mask is honored for llvm.vp.reduce.* intrinsics. In #2, each vector computes VF independent reductions (one per lane) and each unrolled llvm.vp.fadd accumulates one element into each of these VF reductions. If we drop the mask of llvm.vp.fadd, we end up adding garbage to each of the VF reductions.
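
To make #2 concrete, here is a sketch with VF = 16 and UF = 2 (names and the way the per-step masks are materialized are illustrative, not the actual MLIR output):

define <16 x float> @unrolled_reduction_step(<16 x float> %acc0, <16 x float> %x0, <16 x float> %x1, <16 x i1> %mask0, <16 x i1> %mask1) {
  ; Each vp.fadd accumulates one unrolled element of j into the 16
  ; lane-wise reductions (one reduction per i lane). If the masks are
  ; dropped, lanes that should be skipped pick up garbage values.
  %acc1 = call <16 x float> @llvm.vp.fadd.v16f32(<16 x float> %acc0, <16 x float> %x0, <16 x i1> %mask0, i32 16)
  %acc2 = call <16 x float> @llvm.vp.fadd.v16f32(<16 x float> %acc1, <16 x float> %x1, <16 x i1> %mask1, i32 16)
  ret <16 x float> %acc2
}

declare <16 x float> @llvm.vp.fadd.v16f32(<16 x float>, <16 x float>, <16 x i1>, i32)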

I was discussing this further with @rofirrim, and just honoring the mask wouldn’t work for this case because the VP intrinsics don’t have a passthru value and masked-out elements are set to poison! I would need the masked-out elements to preserve the previous value of the accumulator register. I guess I can emulate that with an unmasked vector add + select instruction…
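
Something along these lines, where the masked-out lanes keep the previous accumulator value (just a sketch of that emulation, with illustrative names):

define <16 x float> @masked_accumulate(<16 x float> %acc, <16 x float> %x, <16 x i1> %mask) {
  ; Unmasked add, then a select that keeps the previous accumulator
  ; value in the masked-out lanes.
  %sum = fadd <16 x float> %acc, %x
  %acc.next = select <16 x i1> %mask, <16 x float> %sum, <16 x float> %acc
  ret <16 x float> %acc.next
}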

I’m wondering, though, why VP intrinsics don’t have a passthru value and default to poison. Is that something that was left out of the picture for some reason, or just not needed for now?

Thanks!
Diego

The intrinsics were designed based on how the loop vectorizer works. When control flow is vectorized, the instructions inside an “if” are vectorized but there is no passthru value. The phi at the merge point is vectorized into a select. That select is the only point where there is a passthru value.
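
A rough sketch of that model, assuming a scalar body like “if (cond) acc += a;” is vectorized with VP intrinsics (names are illustrative):

define <16 x float> @if_converted(<16 x float> %acc, <16 x float> %a, <16 x i1> %cond) {
  ; The masked operation itself has no passthru; merging with the old
  ; accumulator value happens in the select that replaces the scalar phi.
  %sum = call <16 x float> @llvm.vp.fadd.v16f32(<16 x float> %acc, <16 x float> %a, <16 x i1> %cond, i32 16)
  %acc.next = select <16 x i1> %cond, <16 x float> %sum, <16 x float> %acc
  ret <16 x float> %acc.next
}

declare <16 x float> @llvm.vp.fadd.v16f32(<16 x float>, <16 x float>, <16 x i1>, i32)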
