Use Galois field New Instructions (GFNI) to combine affine instructions

Hello everyone,

Over the last couple of days, I have been experimenting with teaching LLVM how to combine a
set of affine instructions into an instruction from the GFNI [1] AVX512 extension, in
particular GF2P8AFFINEQB [2]. While the general idea seems to work, I have some questions
about my current implementation (see below). FTR, I have named this transformation
AffineCombineExpr (ACE).

Let's first introduce the general idea, which is to transform code like this:

static uint8_t rol2(uint8_t v) {
  return (v<<2)^(v>>6);
}
uint8_t func(uint8_t v) {
  v = rol2(v);
  v ^= 0xAA;
  return v;
}

into this:

define zeroext i8 @func(i8 zeroext %v) local_unnamed_addr #0 {
  %0 = insertelement <16 x i8> undef, i8 %v, i64 0
  %1 = call <16 x i8> @llvm.x86.vgf2p8affineqb.128(<16 x i8> %0, <16 x i8> bitcast (<2 x
i64> <i64 4647715923615551520, i64 undef> to <16 x i8>), i8 -86)
  %2 = extractelement <16 x i8> %1, i64 0
  ret i8 %2
}

(if that's profitable, which might not be the case here, see below)
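To make the mapping concrete, here is a small scalar model of what one byte lane of
GF2P8AFFINEQB computes, based on my reading of the documented semantics (this is an
illustration, not the ACE code itself). The i64 constant in the IR above,
4647715923615551520, is 0x4080010204081020, and the imm8 of -86 is 0xAA as an unsigned
byte:

```c
#include <stdint.h>

/* Parity (XOR-reduction) of the bits of an 8-bit value. */
int parity8(uint8_t x) {
    x ^= x >> 4;
    x ^= x >> 2;
    x ^= x >> 1;
    return x & 1;
}

/* Scalar model of one byte lane of GF2P8AFFINEQB: result bit i is the
   parity of (matrix byte [7-i] AND src), XORed with bit i of imm8. */
uint8_t affine_byte(uint64_t matrix, uint8_t src, uint8_t imm8) {
    uint8_t out = 0;
    for (int i = 0; i < 8; i++) {
        uint8_t row = (uint8_t)(matrix >> (8 * (7 - i)));
        out |= (uint8_t)((parity8(row & src) ^ ((imm8 >> i) & 1)) << i);
    }
    return out;
}

/* The affine expression being combined above. */
uint8_t rol2(uint8_t v) {
    return (uint8_t)((v << 2) ^ (v >> 6));
}
```

With this model, affine_byte(0x4080010204081020, v, 0xAA) matches rol2(v) ^ 0xAA for all
256 byte values, which is exactly the rewrite shown in the IR above.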

Another, more interesting example, where the potential benefits are easier to see, is this one:
https://github.com/aguinet/llvm-project/commit/9ed424cbac0fe3566f801167e2190fad5ad07507#diff-21dd247f3b8aa49860ae8122fe3ea698R22

This gets even more interesting with vectorized code, with an example here:

* original C code: https://pastebin.com/4JjF7DPu
* LLVM IR after clang -O2 -mgfni -mavx2: https://pastebin.com/Ti0Vm2gj [3]
* LLVM IR after ACE (using opt -aggressive-instcombine -S): https://pastebin.com/2zFU7J6g
(the interesting part happens at line 67)

If, like me, you don't have a GFNI-enabled CPU, you can use Intel SDE [4] to run the
compiled code.

The code of the pass is available here:

https://github.com/aguinet/llvm-project/blob/feature/gfni_combine/llvm/lib/Transforms/AggressiveInstCombine/AffineExprCombine.cpp

And there are test cases here:
https://github.com/aguinet/llvm-project/tree/feature/gfni_combine/llvm/test/Transforms/AggressiveInstCombine
(aec_*.ll)

Questions

> I can tell you that your avx512 issue is that v64i8 GFNI instructions also require
> avx512bw to be enabled to make v64i8 a supported type. The C intrinsics handling in the
> front end knows this rule, but since you generated your own intrinsics, you bypassed that.

Indeed, that's the issue... I stuck with what Intel documents here
(https://software.intel.com/sites/landingpage/IntrinsicsGuide/#text=gf2p&expand=2907), but
I guess I should have checked the C intrinsics.

I will fix my code to verify the presence of avx512bw if I ever need v64i8.

Thanks for the hint!

I'm guessing AggressiveInstCombine only runs once in the pipeline, and it's probably before the vectorizers. Its existing transforms probably output things the vectorizer can understand and vectorize. In your case you're fully dependent on vectorized code.

We don't like to form target-specific intrinsics in the middle-end pipeline. We'd prefer to do something in the X86-specific IR pipeline or the Machine IR pipeline run by llc. Or have a generic concept in IR that we can express, like llvm.ctlz, llvm.cttz, llvm.ctpop or llvm.bitreverse. We have methods in TargetTransformInfo to query for targets supporting them, or in the worst case we're able to generate reasonable code if the target doesn't support it natively.

I'll try to point some more people here at Intel towards this thread.

> I'm guessing AggressiveInstCombine only runs once in the pipeline and it's
> probably before the vectorizers. Its existing transforms probably output
> things the vectorizer can understand and vectorize. In your case you're
> fully dependent on vectorized code.

Yes. I think it would generate more efficient code if I did the combination within the
loop vectorization algorithm (that is, if I understood correctly, in VPlan). It would, for
instance, give more opportunities to "hide" the latency of GF2P8AFFINEQB, e.g. by
fine-tuning the loop unrolling factor. But I have to take more time to figure out how to
make this happen.

> We don't like to form target-specific intrinsics in the middle-end
> pipeline. We'd prefer to do something in the X86-specific IR pipeline or
> the Machine IR pipeline run by llc. Or have a generic concept in IR that we
> can express, like llvm.ctlz, llvm.cttz, llvm.ctpop or llvm.bitreverse. We
> have methods in TargetTransformInfo to query for targets supporting them,
> or in the worst case we're able to generate reasonable code if the target
> doesn't support it natively.

I thought about putting that in the X86 pipeline, but it might remove some opportunities:

* supporting equivalent instructions from another vendor without rewriting everything
* fine-tuning the loop vectorization process, as mentioned above

The approach of defining a generic intrinsic in the IR is an interesting one; the
remaining question is which API to expose, so that it is generic enough (to deserve being
a generic intrinsic) but does not prevent optimization opportunities. That being said,
something like:

llvm.gf2p8.XX(<i8, XX> value, <i8, XX> matrix, i8 cst)

seems reasonable to me.
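For illustration, here is a hedged sketch of what overloaded declarations could look like
(the names and overload scheme are hypothetical, modeled on the llvm.ctlz-style generic
intrinsics mentioned above):

```llvm
; Hypothetical: one overload per legal byte-vector type; the matrix operand
; carries one 8x8 bit-matrix (a qword) per 8-byte group, as in GF2P8AFFINEQB.
declare <16 x i8> @llvm.gf2p8affine.v16i8(<16 x i8> %value, <16 x i8> %matrix, i8 immarg %cst)
declare <32 x i8> @llvm.gf2p8affine.v32i8(<32 x i8> %value, <32 x i8> %matrix, i8 immarg %cst)
```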

I guess that's the eternal debate about where optimization X should happen, and how... :)

I'll try to point some more people here at Intel towards this thread.

Thanks a lot!

Adrien.