[RFC] Starting an AVX512 Target-Specific Dialect - Rebooted

What is the overall goal of the dialect?

The Vector Dialect document discusses the vector abstractions that MLIR supports and the tradeoffs involved. One of the layers that is missing in OSS at the moment is the Hardware Vector Ops (HWV) level. This proposal is for adding a new Targets/AVX512 Dialect that would directly model AVX512-specific intrinsics.

This proposal will allow trading off HW-specific vs generic abstractions in MLIR.

What is the first implementation milestone?

The first implementation milestone will consist of adding the dialect and implementing some basic operations (say `rndscale` and `scalef`). Like other intrinsics in the LLVM dialect, they would be very lightweight and represented as custom ops.

The Tablegen specification would resemble:

def LLVM_x86_avx512_mask_rndscale_ps_512 :
    AVX512_IntrOp<"mask.rndscale.ps.512">,
    Arguments<(ins LLVM_Type, LLVM_Type, LLVM_Type, LLVM_Type, LLVM_Type)>;

def LLVM_x86_avx512_mask_scalef_ps_512 :
  AVX512_IntrOp<"mask.scalef.ps.512">,
  Arguments<(ins LLVM_Type, LLVM_Type, LLVM_Type, LLVM_Type, LLVM_Type)>;

The LLVM dialect form would resemble:

llvm.func @LLVM_x86_avx512_mask_ps_512(%a: !llvm<"<16 x float>">,
                                       %b: !llvm.i32,
                                       %c: !llvm.i16) -> (!llvm<"<16 x float>">) {
  %0 = "avx512.mask.rndscale.ps.512"(%a, %b, %a, %c, %b) :
    (!llvm<"<16 x float>">, !llvm.i32, !llvm<"<16 x float>">, !llvm.i16, !llvm.i32) -> !llvm<"<16 x float>">
  %1 = "avx512.mask.scalef.ps.512"(%a, %a, %a, %c, %b) :
    (!llvm<"<16 x float>">, !llvm<"<16 x float>">, !llvm<"<16 x float>">, !llvm.i16, !llvm.i32) -> !llvm<"<16 x float>">
  llvm.return %1: !llvm<"<16 x float>">
}
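For reference, a hedged sketch of what the translated LLVM IR could look like after exiting MLIR. The declaration below follows the operand types used in the snippet above (the exact signatures are defined by LLVM's x86 intrinsic tables, so treat this as illustrative):

```llvm
; Sketch (not authoritative): the LLVM dialect ops above translate into
; calls of the corresponding target intrinsics, roughly:
declare <16 x float> @llvm.x86.avx512.mask.rndscale.ps.512(
    <16 x float>, i32, <16 x float>, i16, i32)

define <16 x float> @example(<16 x float> %a, i32 %b, i16 %c) {
  %0 = call <16 x float> @llvm.x86.avx512.mask.rndscale.ps.512(
      <16 x float> %a, i32 %b, <16 x float> %a, i16 %c, i32 %b)
  ret <16 x float> %0
}
```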

They would have counterpart operations specified on MLIR 1-D vector types for the purpose of type checking and progressive lowering. For instance, the Tablegen specification for MaskRndScaleOp would resemble:

def MaskRndScaleOp : AVX512_Op<"mask.rndscale", [NoSideEffect,
    AllTypesMatch<["src", "a", "dst"]>]>,
    // Supports vector<16xf32> and vector<8xf64>.
    Arguments<(ins VectorOfLengthAndType<[16, 8], [F32, F64]>:$src,
                   I32:$k,
                   VectorOfLengthAndType<[16, 8], [F32, F64]>:$a,
                   AnyTypeOf<[I16, I8]>:$imm,
                   // TODO(ntv): figure rounding out (optional operand?).
                   I32:$rounding
              )>,
    Results<(outs VectorOfLengthAndType<[16, 8], [F32, F64]>:$dst)>  {
  let summary = "Masked roundscale op";
  let description = [{
    The mask.rndscale is an AVX512 specific op that can lower to the proper
    `llvm::Intrinsic::x86_avx512_mask_rndscale_ps_512` or
    `llvm::Intrinsic::x86_avx512_mask_rndscale_pd_512` instruction depending on
    the type of MLIR vectors it is applied to.

    From the Intel Intrinsics Guide:
    ================================
    Round packed floating-point elements in `a` to the number of fraction bits
    specified by `imm`, and store the results in `dst` using writemask `k`
    (elements are copied from src when the corresponding mask bit is not set).
  }];
  // Fully specified by traits.
  let verifier = ?;
  let assemblyFormat =
    // TODO(riverriddle, ntv): type($imm) should be dependent on type($dst).
    "$src `,` $k `,` $a `,` $imm `,` $rounding attr-dict `:` type($dst) `,` type($imm)";
}

And the avx512 operation on MLIR vector types would resemble:

func @avx512_mask_rndscale(%a: vector<16xf32>, %i32: i32, %i16: i16, %i8: i8) -> vector<16xf32>
{
  %0 = avx512.mask.rndscale %a, %i32, %a, %i16, %i32 : vector<16xf32>, i16
  return %0: vector<16xf32>
}

This is a good starting point to support mixed target-agnostic and target-specific lowering.
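As a hypothetical sketch of what such mixing could look like (the operand names `%k`, `%imm`, `%r` are illustrative, not from the proposal), a function could combine a retargetable elementwise op with an AVX512-specific op in the same body:

```mlir
// Sketch: mixing a retargetable vector op with a target-specific one.
func @mixed(%a: vector<16xf32>, %b: vector<16xf32>,
            %k: i32, %imm: i16, %r: i32) -> vector<16xf32> {
  // Retargetable abstraction: lowered via the generic vector-to-LLVM path
  // and left to LLVM's instruction selection.
  %0 = addf %a, %b : vector<16xf32>
  // Target-specific abstraction: guaranteed to select the AVX512 intrinsic.
  %1 = avx512.mask.rndscale %0, %k, %a, %imm, %r : vector<16xf32>, i16
  return %1 : vector<16xf32>
}
```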

How does it fit into the MLIR dialect ecosystem?

Connection: how does it connect to the existing dialects in a compilation pipeline(s)?

The AVX-512 dialect would be the first OSS dialect for the HWV layer in the following diagram (extracted from the Vector Dialect document):

At the moment, we rely exclusively on LLVM’s peephole optimizer to generate good code from small `insertelement`/`extractelement`/`shufflevector` sequences. This proposal will allow targeting AVX512-specific instructions directly and mixing them with retargetable abstractions that rely on peephole optimizations. This is expected to create opportunities for better code generation with non-surprising performance.

In the limit, such an abstraction can be used as a form of intrinsic programming in MLIR and arbitrarily mixed with other abstractions. In the future, as LLVM VP intrinsics are developed, we expect the mix of target-specific and retargetable abstractions, that are required for good performance, to evolve.

Consolidation: is there already a dialect with a similar goal or matching abstractions; if so, can it be improved instead of adding a new one?

There is precedent in target-specific MLIR dialects for internal Google projects involving xPUs. This is an abstraction that has been shown to work well but is not yet present in OSS.

Reuse: how does it generalize to similar but slightly different use-cases?

This proposal is for a target-specific abstraction; as such, it does not generalize to other targets. It is, however, expected that other target-specific dialects such as SVE will follow a similar approach of defining a new dialect.

An alternative would be to cram all intrinsics together in the LLVM and Vector dialects and continue extending them, but target-specific abstractions that can be enabled/disabled more globally are expected to be useful.

Who are the future contributors/maintainers beyond those who propose the dialect?

It is expected that Target-specific dialects, and AVX512 in particular, will be a generally useful abstraction layer to the MLIR community and that the community itself will contribute to extending and maintaining the abstractions.

Remark

This RFC is related to the previous, abandoned, proposal that was deemed to conflate too many things: both the definition of the dialect and its use in an implementation of XNNPack in MLIR. This RFC focuses on the AVX512-specific parts as outlined above. An experimental XNNPack dialect which targets AVX512 is also in early development.

With the clean split out of XNNPack and the focus on matching the LLVM intrinsic, this makes a lot of sense to me!

I assume that the lowering from the vector level to the LLVM intrinsic will be very mechanical? Could it be auto-generated maybe?
Also, could you have both the variant operating on vector types and the one operating on LLVM types with a single entry in TableGen?

Thank you @mehdi_amini.
I have sent out some code to go along with this RFC: https://reviews.llvm.org/D75987.

Everything that can be auto-generated should be; I would definitely welcome help in that area too, as I have not yet found enough time to invest in writing Tablegen myself.

SVE? (perhaps that was in the previous discussion, but can’t recall :slight_smile: )

Thanks, makes it easier to see. Agreed with Mehdi, seems much clearer now given mapping to LLVM intrinsics.

And is the difference here that the HWV ops wouldn’t make sense as generic vector ops? E.g., avx512.mask.rndscale does not have a 1:1 mapping to a generic op that would be usable or useful?

Yes, it was already in the mentioned abandoned RFC.

https://developer.arm.com/docs/dui0965/latest/sve-overview/introducing-sve

This is related to ongoing discussions with ARM, which is contemplating starting an SVE dialect for the scalable vector extensions that we don’t support at the moment.

There may be generic ops in some / many cases, but it cannot be assumed that the generic ops are hooked up to the HW-specific ops in LLVM. This allows people to metaprogram / SIMD-intrinsics program particular HW with guarantees about instruction selection, for which LLVM is otherwise a pure black box. This gives more control and achieves unsurprising performance on key hardware.

Note that some of the best libraries are developed using SIMD intrinsics (despite these often being C macros around IR abstractions (yuck… but that is beside the point)) to get close to peak, for instance XNNPack.

If you look a bit deeper, the programming model is very close to what you’d get with EDSCs. In fact it is so close that I am considering automatically turning these kernels into rewrites with a bit of sed … but that story is for another day :slight_smile:

If you extrapolate a bit, all this is related to ModelBuilder + vectors that we are putting more effort on these days to build close to peak bottom-up abstractions, that we will reuse in various top-down Linalg → X scenarios.

Did you consider using the LLVM dialect for this? It should already model all the target specific intrinsics.

I believe this falls into this bucket:

An alternative would be to cram all intrinsics together in the LLVM and Vector dialects and continue extending them, but target-specific abstractions that can be enabled/disabled more globally are expected to be useful.

Yes, about half of this is reusing the existing LLVM intrinsics infrastructure (see the RFC implementation) but grouped in an AVX512-specific dialect; the other half makes these ops more useful for MLIR progressive lowerings.

From 10k miles a dialect is “just a namespace”.
So a part of this is effectively “just” isolating HW-specific LLVM intrinsics in a namespace.

There are also some modifications to layer the ModuleTranslation and solve e.g. name mangling issues (some intrinsics must be called with a type, some without), courtesy of @ftynse and @aartbik.
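To illustrate the name mangling issue (a hedged sketch of the naming convention, not the actual ModuleTranslation code): overloaded LLVM intrinsics encode their operand types in the symbol name, while the AVX512 intrinsics here have fixed names, so the translation has to know which case it is dealing with:

```llvm
; Overloaded intrinsic: the vector type is mangled into the symbol name.
declare <16 x float> @llvm.fmuladd.v16f32(
    <16 x float>, <16 x float>, <16 x float>)

; Non-overloaded AVX512 intrinsic: the name is fixed, no type suffix is appended.
declare <16 x float> @llvm.x86.avx512.mask.scalef.ps.512(
    <16 x float>, <16 x float>, <16 x float>, i16, i32)
```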

I think the NVVM and ROCM dialects were created to express the LLVM intrinsics for these targets already, so this wouldn’t be a first.

What is specific here is the part that would make these ops work on non-LLVM types (MLIR vector types), which is indeed interesting.

Makes sense to me.

As @mehdi_amini mentions, we already have NVVM and ROCm dialects that are similarly derived from LLVM IR intrinsics. We discussed several ideas about dialect relations back in the day we were introducing NVVM, and decided that a separate dialect that uses the types defined in the LLVM dialect made most sense for target-specific intrinsics. We are unlikely to need all of the target-specific intrinsic sets at the same time, and dialects are MLIR’s way of logically grouping operations for any purpose from repository structuring to specifying the legalization target.

A minor note, I would strongly encourage to place this dialect similarly to NVVM and ROCm in the tree. That is, do not introduce lib/Dialect/Target unless you intend to also move the two existing dialects there. In practice, I am not convinced that we need further separation under lib/Dialect and, if we do, we should think about the whole ecosystem, not just one new dialect.

It seems people are generally in agreement over this rebooted RFC.
Could we start reviewing https://reviews.llvm.org/D75987 ?

Thank you!

+1 on adding this as a dialect