What is the overall goal of the dialect?
The Vector Dialect document discusses the vector abstractions that MLIR supports and their tradeoffs. One of the layers currently missing in OSS is the Hardware Vector Ops (HWV) level. This proposal is for adding a new `Targets/AVX512` dialect that would directly model AVX512-specific intrinsics.
This proposal will allow trading off HW-specific vs generic abstractions in MLIR.
What is the first implementation milestone?
The first implementation milestone will consist of adding the dialect and implementing a few basic operations (say `rndscale` and `scalef`). Like other intrinsics in the LLVM dialect, they would be very lightweight and represented as custom ops.
The Tablegen specification would resemble:
```tablegen
def LLVM_x86_avx512_mask_rndscale_ps_512 :
    AVX512_IntrOp<"mask.rndscale.ps.512">,
    Arguments<(ins LLVM_Type, LLVM_Type, LLVM_Type, LLVM_Type, LLVM_Type)>;

def LLVM_x86_avx512_mask_scalef_ps_512 :
    AVX512_IntrOp<"mask.scalef.ps.512">,
    Arguments<(ins LLVM_Type, LLVM_Type, LLVM_Type, LLVM_Type, LLVM_Type)>;
```
The LLVM dialect form would resemble:
```mlir
llvm.func @LLVM_x86_avx512_mask_ps_512(%a: !llvm<"<16 x float>">,
                                       %b: !llvm.i32,
                                       %c: !llvm.i16) -> (!llvm<"<16 x float>">) {
  %0 = "avx512.mask.rndscale.ps.512"(%a, %b, %a, %c, %b) :
    (!llvm<"<16 x float>">, !llvm.i32, !llvm<"<16 x float>">, !llvm.i16, !llvm.i32) -> !llvm<"<16 x float>">
  %1 = "avx512.mask.scalef.ps.512"(%a, %a, %a, %c, %b) :
    (!llvm<"<16 x float>">, !llvm<"<16 x float>">, !llvm<"<16 x float>">, !llvm.i16, !llvm.i32) -> !llvm<"<16 x float>">
  llvm.return %1 : !llvm<"<16 x float>">
}
```
They would have a counterpart operation specified on MLIR 1-D vector types for the purpose of type checking and progressive lowering. For instance, the Tablegen specification for `MaskRndScaleOp` would resemble:
```tablegen
def MaskRndScaleOp : AVX512_Op<"mask.rndscale", [NoSideEffect,
    AllTypesMatch<["src", "a", "dst"]>]>,
  // Supports vector<16xf32> and vector<8xf64>.
  Arguments<(ins VectorOfLengthAndType<[16, 8], [F32, F64]>:$src,
                 I32:$k,
                 VectorOfLengthAndType<[16, 8], [F32, F64]>:$a,
                 AnyTypeOf<[I16, I8]>:$imm,
                 // TODO(ntv): figure rounding out (optional operand?).
                 I32:$rounding
            )>,
  Results<(outs VectorOfLengthAndType<[16, 8], [F32, F64]>:$dst)> {
  let summary = "Masked roundscale op";
  let description = [{
    The mask.rndscale op is an AVX512-specific op that can lower to the proper
    `llvm::Intrinsic::x86_avx512_mask_rndscale_ps_512` or
    `llvm::Intrinsic::x86_avx512_mask_rndscale_pd_512` instruction depending on
    the type of MLIR vectors it is applied to.

    From the Intel Intrinsics Guide:
    ================================
    Round packed floating-point elements in `a` to the number of fraction bits
    specified by `imm`, and store the results in `dst` using writemask `k`
    (elements are copied from `src` when the corresponding mask bit is not set).
  }];
  // Fully specified by traits.
  let verifier = ?;
  let assemblyFormat =
    // TODO(riverriddle, ntv): type($imm) should be dependent on type($dst).
    "$src `,` $k `,` $a `,` $imm `,` $rounding attr-dict `:` type($dst) `,` type($imm)";
}
```
And the avx512 operation on MLIR vector types would resemble:
```mlir
func @avx512_mask_rndscale(%a: vector<16xf32>, %i32: i32, %i16: i16, %i8: i8) -> vector<16xf32> {
  %0 = avx512.mask.rndscale %a, %i32, %a, %i16, %i32 : vector<16xf32>, i16
  return %0 : vector<16xf32>
}
```
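To make the progressive lowering concrete, the op on MLIR vector types would rewrite to the LLVM-dialect intrinsic op shown earlier. The sketch below is illustrative only: the operand order and types simply mirror the two examples above, and the exact conversion is an assumption, not a committed design.

```mlir
// Before lowering: the AVX512 op on MLIR 1-D vector types.
%0 = avx512.mask.rndscale %src, %k, %a, %imm, %rounding : vector<16xf32>, i16

// After lowering to the LLVM dialect (operand order assumed preserved; the
// %ll* values are the type-converted counterparts of the operands above).
%1 = "avx512.mask.rndscale.ps.512"(%llsrc, %llk, %lla, %llimm, %llrounding) :
  (!llvm<"<16 x float>">, !llvm.i32, !llvm<"<16 x float>">, !llvm.i16,
   !llvm.i32) -> !llvm<"<16 x float>">
```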
This is a good starting point to support mixed target-agnostic and target-specific lowering.
How does it fit into the MLIR dialect ecosystem?
Connection: how does it connect to the existing dialects in a compilation pipeline(s)?
The AVX-512 dialect would be the first OSS dialect for the HWV layer in the following diagram (extracted from the Vector Dialect document):

At the moment, we rely exclusively on LLVM’s peephole optimizer to do a good job on small `insertelement`/`extractelement`/`shufflevector` patterns. This proposal will allow targeting AVX512-specific instructions directly and mixing them with retargetable abstractions that rely on peephole optimizations. This is expected to create opportunities for better code generation with non-surprising performance.
In the limit, such an abstraction can be used as a form of intrinsic programming in MLIR and arbitrarily mixed with other abstractions. In the future, as LLVM VP intrinsics are developed, we expect the mix of target-specific and retargetable abstractions, that are required for good performance, to evolve.
Consolidation: is there already a dialect with a similar goal or matching abstractions; if so, can it be improved instead of adding a new one?
There is precedent in target-specific MLIR dialects for internal Google projects involving xPUs. This is an abstraction that has been shown to work well but is not yet present in OSS.
Reuse: how does it generalize to similar but slightly different use-cases?
This proposal is for a target-specific abstraction; as such, it does not generalize to other targets. It is, however, expected that other target-specific dialects, such as SVE, will follow a similar approach of defining a new dialect.
An alternative would be to cram all intrinsics into the LLVM and Vector dialects and continue extending them, but target-specific abstractions that can be enabled/disabled more globally are expected to be useful.
Who are the future contributors/maintainers beyond those who propose the dialect?
It is expected that Target-specific dialects, and AVX512 in particular, will be a generally useful abstraction layer to the MLIR community and that the community itself will contribute to extending and maintaining the abstractions.
Remark
This RFC is related to a previous, abandoned, proposal that was deemed to conflate too many things: both the definition of the dialect and its use in an implementation of XNNPack in MLIR. This RFC focuses on the AVX512-specific parts as outlined above. An experimental XNNPack dialect that targets AVX512 is also in early development.