[Abandoned][RFC] AVX512-specific Dialect for implementing and benchmarking XNNPack in MLIR

Hello everyone,

The Vector Dialect document discusses the vector abstractions that MLIR supports and their tradeoffs. One of the layers that is missing in OSS at the moment is the Hardware Vector Ops (HWV) level.

I am interested in experimenting in core with an AVX512-specific dialect for the specific purpose of implementing portions of XNNPack in MLIR and benchmarking them.

I am proposing to add a new Dialect/Targets/AVX512 dialect that would directly target the intrinsics useful for implementing XNNPack. The first function I am interested in implementing is exp-avx-512.

I think it is time for such dialects because, at the moment, we rely too much on LLVM’s peephole optimizer to do a good job with sequences of small insertelement/extractelement/shufflevector operations. We have some intrinsics defined and used in the LLVM dialect, but these are all “portable” intrinsics; I am looking to define the layering needed to target the right AVX512 instructions directly.

I think iterating at this level of abstraction in core will be useful scouting work toward getting the abstractions and layering right, and will pave the way for a future ARM SVE dialect and other non-generic CPU dialects. Of course, generic abstractions should be preferred when possible. We also expect to learn more about when HW-specific vs. generic abstractions should be used and how they compose in MLIR.

Edit: It was pointed out that I should use the template for new dialects, so here goes.

  • What is the overall goal of the dialect?

Start filling the void in OSS between target-agnostic and target-specific vector operations.

  • What is the first implementation milestone?

MLIR vector<16xf32> to AVX512 LLVM intrinsics for the exp-avx-512 function.

  • How does it fit into the MLIR dialect ecosystem?

It is the first HWV dialect in OSS (see the Vector Dialect doc).

  • Connection: how does it connect to the existing dialects in a compilation pipeline(s)?

VectorOps -> AVX512-MLIR -> AVX512-LLVM -> LLVM
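As a sketch of that pipeline (the avx512 op names and operand lists below are hypothetical placeholders, not ops from the revision):

```mlir
// VectorOps level: target-agnostic ops on MLIR vector types.
%0 = vector.fma %x, %y, %z : vector<16xf32>

// AVX512-MLIR level: still on MLIR vector types, but the op now commits
// to an AVX512-specific instruction (hypothetical syntax).
%1 = "avx512.mask.scalef"(%0, %s) : (vector<16xf32>, vector<16xf32>) -> vector<16xf32>

// AVX512-LLVM level: the same op, 1-1, on MLIR LLVM dialect types,
// ready to convert to the LLVM intrinsic.
%2 = "avx512.intr.mask.scalef"(%a, %b)
    : (!llvm<"<16 x float>">, !llvm<"<16 x float>">) -> !llvm<"<16 x float>">
```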

  • Consolidation: is there already a dialect with a similar goal or matching abstractions; if so, can it be improved instead of adding a new one?


  • Reuse: how does it generalize to similar but slightly different use-cases?

There will be different HW-specific dialects we want to target. The union of all the ops in HW-independent and HW-specific dialects will represent the set of valid ops for a particular Target.

  • What is the community of users that it is serving?

CPU users who want performance with AVX512 intrinsics.

  • Who are the future contributors/maintainers beyond those who propose the dialect?

Anyone interested in AVX512 and making it a successful target for MLIR.

Please let me know if you have questions or concerns.

Thanks all!


Thanks Nicolas! That sounds interesting! I think it could give us a good idea of the challenges in mixing high-level dialects with low-level target-specific dialects.

On a side note, have you thought about having a low-level, VL-independent/target-independent dialect that provides a common ground for writing generic vector code and that can be “instantiated” and reasonably lowered to a VL-specific/target-specific vector dialect later on? I know there would be quite a few challenges in some scenarios, but it could help reduce target-specific kernel versioning for some simple cases.

The intention is to continue extending the VectorOps dialect which plays that role.
Are you suggesting an extra dialect between the VectorOps and say AVX512?
Would you have some example of ops that you would see as interesting and that wouldn’t fit in either VectorOps or AVX512?

Side note: we are attacking this both top-down (i.e. Linalg → VectorOps → AVX512 → LLVM) and bottom-up (lifting intrinsics to the Vector level). One concrete use case, as I am building this end to end, is the need for a portable vector.fma. See D74075 ([mlir][VectorOps] Introduce a `vector.fma` op that works on n-D vectors and lowers to `llvm.intrin.fmuladd`) if you have cycles for reviewing :wink:
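For illustration (syntax as in the revision under review; treat this as a sketch, and the LLVM dialect type spelling as approximate), the portable op and its lowering would look along these lines:

```mlir
// A portable fused multiply-add on an n-D vector.
%0 = vector.fma %a, %b, %c : vector<8x16xf32>

// After lowering, each 1-D slice maps to the LLVM fmuladd intrinsic,
// on MLIR LLVM dialect types:
%1 = "llvm.intr.fmuladd"(%x, %y, %z)
    : (!llvm<"<16 x float>">, !llvm<"<16 x float>">, !llvm<"<16 x float>">)
    -> !llvm<"<16 x float>">
```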

VL independence is quite a challenge! Since MLIR’s vector shapes are fixed size (static shape), this would require an elaborate design change to the type system for vector types; it would also bring up the question of merging tensor and vector. An alternative design point could be to allow tensors as an element type of memrefs, i.e., allow memrefs of tensors (and, as a result, memrefs of dynamically shaped tensors); the tensor element types could later be materialized to fixed-size vector types once the vector widths were known.

After looking at D74056 ([mlir][AVX512] Start an AVX512 dialect), I think I misunderstood where you were heading with the AVX512 dialect :). I thought you were planning to add the basic AVX512 intrinsics (i.e., _mm512_set1_ps, _mm512_loadu_ps, _mm512_fmadd_ps, etc.) to this dialect so that you could write a function that implements the XNNPack exp function for AVX512. However, you plan to add XNNPack exp as a “kernel” operation of the AVX512 dialect, right?

That’s why I thought that having a VL/target independent dialect to write these kernels could be useful. For simple cases, we could write a single generic vector kernel and then lower it to SSE, AVX, AVX512, etc. I know this would be challenging, but it sounds like something that might be worth investigating :slight_smile:

For the simple cases this is already happening by virtue of just targeting LLVM and letting LLVM do the work. In the particular case of the first motivating example, I am indeed working through a set of revisions that will expose avx512.mask.rndscale.ps.512 and avx512.mask.scalef.ps.512 on MLIR LLVM Dialect types. I will probably have to expose more portable vector ops that may or may not have a vanilla LLVM lowering.

Ultimately, I think the question is “where do we want to put the switch?”.
VectorOps are VL/target independent and lower to LLVM + “portable” intrinsics.
AVX512 adds an intermediate layer that is target dependent and through which we can go to target specific intrinsics (i.e. when LLVM “portable” intrinsics + ISel do not/are not expected to work in a desired way).
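Concretely, the switch could look like this (the avx512 op’s operand list is sketched here, not the exact signature of the intrinsic):

```mlir
// Portable path: VectorOps lowers to LLVM + "portable" intrinsics and
// relies on ISel to pick good instructions.
%r = vector.fma %x, %p, %acc : vector<16xf32>

// Target-specific path: when ISel is not expected to produce the desired
// instruction mix, go through AVX512 ops that pin down the instruction,
// e.g. rounding via rndscale (operands sketched, not the real signature).
%n = "avx512.mask.rndscale"(%x, %imm, %src, %k)
    : (vector<16xf32>, i32, vector<16xf32>, i16) -> vector<16xf32>
```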

It feels like the name “AVX512” is misleading given the goal of the dialect. I naturally assumed it was intended to model AVX512 instruction set (and so did the others from what I can see), but it is not the case. Instead, it seems to model some slightly higher-level operation that can be rewritten to a sequence of AVX512 instructions. I would argue that what you propose should not be called “AVX512”. It may be “xnnpack” if you wish.

IMO, we can have an AVX512 that is derived from the corresponding LLVM IR intrinsics, potentially in an automated way that we now support. I would prefer it to only have operations derived from intrinsics and, if strongly necessary, support operations (similarly to llvm.constant that we need to fit one IR into the other). This can live in the same place as NVVM and ROCDL dialects that are similarly derived from LLVM IR intrinsics. Moving them under lib/Dialect/Target is a separate discussion.

I think it might be interesting to have a non instruction-level vector dialect between abstract nD vectors and platform-specific operations, but I’d like to better understand how it connects to the existing things on both sides, and how it can be generalized.

Atm, it is intended to be the place for AVX512-specific abstractions (i.e. that would only trigger if the AVX512 backend is specified). As usual when connecting things, there is a top-down and a bottom-up view of the world.

So far, I have been going at it top down and the op + type at this level are MLIR vector<16xf32>.
There is also a case to be made about op + type on mlir LLVM vector: llvm<16 x float>.

The ExpP5xxxOp in this dialect is an MLIR op on vector<16xf32> inspired by a concrete XNNPack use-case. I wouldn’t go as far as doing an XNNPack dialect because we will want many different types of MLIR vector<16xf32> ops that are AVX512-specific and need to know about the instruction mix.

In other words, I am interested in both AVX512-mlir-vector<16xf32> ops and AVX512-mlir-llvm<16 x float> ops and connecting them all the way to emitting this type of code:

avx512_exp_p5_scale:                    # @avx512_exp_p5_scale
# %bb.0:
        vfmadd231ps     .LCPI0_3(%rip){1to16}, %zmm1, %zmm0 # zmm0 = (zmm1 * mem) + zmm0
        vbroadcastss    .LCPI0_4(%rip), %zmm3 # zmm3 = [8.28929059E-3,8.28929059E-3,8.28929059E-3,8.28929059E-3,8.28929059E-3,8.28929059E-3,8.28929059E-3,8.28929059E-3,8.28929059E-3,8.28929059E-3,8.28929059E-3,8.28929059E-3,8.28929059E-3,8.28929059E-3,8.28929059E-3,8.28929059E-3]
        vfmadd213ps     %zmm2, %zmm0, %zmm3 # zmm3 = (zmm0 * zmm3) + zmm2
        vmulps  %zmm3, %zmm1, %zmm0
        .size   avx512_exp_p5_scale, .Lfunc_end0-avx512_exp_p5_scale
                                        # -- End function

As usual, we are hitting the hardest problem in computer science: how should we name the concepts (dialects?) for AVX512-mlir-vector<16xf32> ops and AVX512-mlir-llvm<16 x float> ops?

I am fine with having a pure LLVM autogenerated layer named AVX512, in which case how should we call its MLIR counterpart (that I am now calling AVX512)?

I was hoping that we could have both AVX512-mlir-vector<16xf32> ops and AVX512-mlir-llvm<16 x float> ops in a single dialect (after all, a dialect is just a namespace) so that things that talk 1-1 are closer together. I see the potential issues coming with conversions and blurry dialect boundaries, but I don’t think they are too problematic.

Thoughts / suggestions on name for AVX512-mlir-vector<16xf32> if people insist it must be a separate dialect?

I updated the original post using the common recommended template for a new Dialect.

I find this proposal confusing: like others, I assumed that you wanted to model exactly what Clang exposes as intrinsics for this instruction set. At this point it isn’t clear to me what this is really about: if this is about exposing the XNNPack operations, then I don’t see why it should be a core dialect, or why an implementation specific to one instruction set is particularly appealing and/or reusable.

There is also the part about “learn more about when HW-specific vs generic abstractions should be used and how they compose in MLIR”, which seems a bit scary to not have a plan for before adding this dialect in-tree.

I would argue that what you propose should not be called “AVX512”. It may be “xnnpack” if you wish.
I would prefer it to only have operations derived from intrinsics and, if strongly necessary, support operations (similarly to llvm.constant that we need to fit one IR into the other)

This sounds reasonable to me.

I wouldn’t go as far as doing an XNNPack dialect because we will want many different types of MLIR vector<16xf32> ops that are AVX512-specific and need to know about the instruction mix.

Not sure I follow why a vector<16xf32> XNNPack-related op needs to be AVX512-specific when we are not modeling the actual low level implementation of it at that particular level of abstraction. Wouldn’t it make sense to have a high level vector<16xf32> XNNPack op that we can later lower to AVX512 intrinsics or any other vector ISA?

For the simple cases this is already happening by virtue of just targeting LLVM and letting LLVM do the work.

Not really. AFAIK, if you pass LLVM vector types with a fixed vector length, LLVM won’t “re-vectorize” the code to better fit, for instance, a wider VL of the target ISA.

VectorOps are VL/target independent and lower to LLVM + “portable” intrinsics.

I think VectorOps would be VL-dependent unless we could use some kind of VL-independent vector types, as Uday mentioned, and then we would need to fill the semantic gap of how those ops would evolve given a specific VL instantiation. That’s why I thought exploring another dialect with different vector semantics could be interesting if the goal is to use it to write kernels in MLIR. Again, this was a random idea; maybe a discussion for another time.

Thanks for your comments all.
I can see where the conflation of concepts is too high for a dialect proposal and how XNNPack should have been kept out of the discussion.
The consensus is to not go forward with this in its current form; I will retitle the RFC to make it less confusing as to what was rejected.

I wouldn’t necessarily say it’s rejected (unless you want to rescind the proposal). We don’t fully understand the scope and the goals.

Orthogonally to the RFC, I’d like to finish the discussion on the following topics. Please feel free to move them to a separate thread if you think it is better.

This seems like you’re proposing a raising transformation here that would start from smaller vectors and rewrite them as larger vectors in a target-specific way? If so, this transformation is definitely out of the scope of what I am considering at the moment. The claim is that VectorOps answers a good part of “write a single generic vector kernel and then lower it to SSE, AVX, AVX512, etc.” if one uses vectors that are larger than the HW vector size (see the section “Hardware as vector Machines of Minimum Granularity” in the Vector Dialect doc). So I would say that a “wider VL of the target ISA” is a symptom that an undesirable decision has been made upstream.

In general it is my impression that LLVM does a good job at taking ops on llvm<123xf32> and slicing and dicing them to all the supported HW and I am not looking at duplicating that in MLIR. If an MLIR transformation (or a user) uses vector<3xf32> on a machine with wide vectors, then the only thing I plan to provide is functional correctness.

Does this make sense?

having a VL/target independent dialect to write these kernels could be useful.

Then I think I am missing what VL independent means. Let me state a few claims on VectorOps and please let me know whether you disagree:

  1. VectorOps are target independent by design.
  2. We lower VectorOps to the MLIR LLVM dialect and other HW-specific vector dialects (none in OSS as of today though…)
  3. If one targets a multiple of the HW vector size, then we can unroll to operations that fit the HW vector size in MLIR (or just let LLVM do it for the 1-D vector case)
  4. If one targets non-multiple sizes (either vector<123xf32> or vector<3xf32>), the mapping to proper HW vectors is fully offloaded to LLVM
  5. VectorOps is VL-independent: ops can be used with vectors of any length (and some with any rank).
  6. VectorOps is VL-independent but operates on static vector length.

Is my understanding correct that by VL-independent you mean something like vector<?xf32> with symbolic length?
If so then I have thought about this but did not come to a satisfactory conclusion.
In practice I think the tradeoffs would be quite complex to unpack, and I have not seen a concrete use case where this would be necessary yet (SVE vectors are more restricted than just an arbitrary ?, and I think they can and should be modeled differently, but I have not thought about it deeply enough yet).
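For concreteness, here is what such a symbolic-length vector type might look like next to today’s static shapes (the `vector<?xf32>` form is hypothetical, not something MLIR supports at the time of writing):

```mlir
// Today: vector lengths are static.
%0 = addf %a, %b : vector<16xf32>

// Hypothetical VL-independent form with a symbolic length, to be
// instantiated per target (e.g. 4 for SSE, 8 for AVX, 16 for AVX512):
%1 = addf %a, %b : vector<?xf32>
```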

But at this point I think I am over-speculating here. Could you please give a bit more insight in what you mean by VL-independent?

Thanks for raising these great points @dcaballe !

I am rescinding it in this form indeed.

I have been following the path of solving a top-down problem end-to-end and providing new features that are concerned with higher-level abstractions.

It seems there is agreement that it can be split into multiple things.
Doing the same exploration bottom-up will likely get me to the point where I can propose something closer to what people understood by AVX512 dialect.

I agree there is a concrete need for exposing target-specific intrinsics in MLIR so we can build on top of them. This RFC was trying to also build on top of them, which I can see is premature, at least as a core dialect.

Thanks for the feedback !

LLVM does provide a correct lowering, but it is not great at this on most targets. The issue is that this lowering happens in SelectionDAG, which works a single basic block at a time. This prevents a lot of algebraic simplifications and other things from happening across blocks after vectors get split.

There is some work to improve this (some folks from Apple are working on matrix support for LLVM with better lowering), but I’m not aware of the status. You could ask on llvm-dev if you’re curious.


Are you talking about @fhahn’s https://reviews.llvm.org/D70456?

Yes, that’s it! Thank you for finding that

There is a queue of other [Matrix] changes already under review.

Just as a reference: the vector predication RFC, also mentioned in the [Matrix] RFC, and the new doc under review: