[RFC] Simplify x86 intrinsic generation

This RFC proposes simplifying the lowering of x86vector operations, which target various CPU extensions (currently AVX and AVX512), through LLVM intrinsics.

Current state

Various MLIR dialects (x86vector, amx, arm_sve, nvgpu, etc.) target LLVM intrinsics through the following steps: MLIR ops -> MLIR LLVM ops -> LLVM IR

The infrastructure is streamlined through tablegen definitions, a slim LLVM legalization conversion, and autogenerated conversions to LLVM IR. This layering generally results in multiple named operations being defined with the following hierarchy (a concrete sketch follows the list):

  • 1 high-level named MLIR op - abstracting a whole intrinsic class, e.g., the multiple variants of the dpbf16ps instruction

  • 1 to N named MLIR LLVM intrinsic ops - providing an intermediate, 1-to-1 LLVM-compatible layer between MLIR and LLVM IR intrinsics

  • 1 generic LLVM IR op - the final target intrinsic, usually an LLVM function call; concrete instructions are materialized later
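
As a rough sketch, here is how this layering looks for the 512-bit dpbf16ps variant (op and intrinsic names as in the current dialect; assembly syntax and exact types are illustrative):

  // 1. High-level named MLIR op abstracting all dpbf16ps variants
  %dot = x86vector.avx512.dot %src, %a, %b : vector<32xbf16> -> vector<16xf32>
  // 2. Named MLIR LLVM intrinsic op - a 1-to-1 wrapper for the LLVM intrinsic
  %wrap = "x86vector.avx512.intr.dpbf16ps.512"(%src, %a, %b)
      : (vector<16xf32>, vector<32xbf16>, vector<32xbf16>) -> vector<16xf32>
  ; 3. Final LLVM IR intrinsic call materialized during translation
  %res = call <16 x float> @llvm.x86.avx512bf16.dpbf16ps.512(<16 x float> %src, <32 x bfloat> %a, <32 x bfloat> %b)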

CPU abstractions usually gather both types of MLIR operations under a single dialect, e.g., x86vector and arm_sve, while the GPU setup splits the two layers into separate dialects, e.g., nvgpu -> nvvm and amdgpu -> rocdl.

The problem

Note: This analysis focuses on x86 ecosystem issues, which may not apply to the other stacks mentioned above.

The existing structure has a scalability issue.

Each machine intrinsic requires at least two operations - an MLIR op and an MLIR LLVM wrapper - which generates a lot of boilerplate. While most of it is created automatically, the LLVM legalization pass still involves a manual conversion layer to glue the two abstractions together, plus a separate x86vector-to-LLVM-IR translation.

Right now, the x86vector dialect contains 22 operations - only 8 of them are actual MLIR operations useful for lowering from higher-level dialects. The rest is overhead introduced by the current LLVM lowering strategy.

There are thousands of x86 intrinsics covering different instruction sets, with multiple variants of the same core operation, different name mangling patterns, and extra extensions (e.g., x86.avx512 vs x86.avx512bf16). This diversity prevents simple generalization into named intrinsic ops or the use of existing tools like OneToOneConvertToLLVMPattern.
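
To illustrate the mangling diversity, the dpbf16ps family alone fans out into size-suffixed intrinsics under an extension-specific prefix (avx512bf16 rather than avx512):

  llvm.x86.avx512bf16.dpbf16ps.128
  llvm.x86.avx512bf16.dpbf16ps.256
  llvm.x86.avx512bf16.dpbf16ps.512

Other intrinsic families follow different suffix schemes, so no single mangling rule covers them all.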

Today, only a small fraction of these intrinsics is exposed. However, each new intrinsic significantly increases the dialect size, impacts documentation and code readability, and still requires mechanical copy-pasting of tablegen and conversion code.

Proposal

Remove the intermediate layer of intrinsic wrappers implemented as named MLIR LLVM ops.

Instead, gather all lowering logic (type conversion, name mangling, etc.) under the LLVM legalization step and use the existing llvm.call_intrinsic op, which allows reusing the generic lowering for emitting LLVM IR.
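
With this in place, legalizing a high-level op boils down to emitting a single generic intrinsic call. A minimal sketch for the 512-bit dpbf16ps variant (the mangled intrinsic name and the types would be computed by the conversion pattern):

  %res = llvm.call_intrinsic "llvm.x86.avx512bf16.dpbf16ps.512"(%src, %a, %b)
      : (vector<16xf32>, vector<32xbf16>, vector<32xbf16>) -> vector<16xf32>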

The goal is to:

  1. Minimize dialect size - expose only higher-level MLIR ops abstracting variants of the same intrinsic or even multiple intrinsics.

  2. Make dialect easier to use - cleaner docs, fewer choices (MLIR op vs specific MLIR LLVM intrinsic variant).

  3. Simplify adding new ops - now reduced to one tablegen entry and one conversion pattern.

The core motivation for removing the named MLIR LLVM ops (e.g., x86vector.avx512.intr.dpbf16ps.128|256|512) from the x86vector dialect is that today they only serve as a transient middle layer. Analyses and transformations should target the abstract MLIR ops (e.g., x86vector.avx512.dot), and further low-level optimizations are better left to the LLVM backend, which targets concrete instructions.

When possible, resolving overloaded intrinsics is left to the llvm.call_intrinsic lowering instead of separate tablegen op definitions.
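
As a sketch of what that buys: for an overloaded intrinsic, the base name alone can be emitted and the translation to LLVM IR derives the mangled symbol from the operand types. The example below uses the generic llvm.fmuladd intrinsic rather than an x86-specific one:

  // The overload suffix is resolved from the operand types during
  // translation, producing a call to @llvm.fmuladd.v8f32.
  %r = llvm.call_intrinsic "llvm.fmuladd"(%a, %b, %c)
      : (vector<8xf32>, vector<8xf32>, vector<8xf32>) -> vector<8xf32>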

The implementation PR:

Alternatives

Split into two dialects

Following the GPU approach, split the x86vector dialect into an abstract MLIR dialect and an MLIR LLVM-compatible one - for example, x86vector and x86llvm dialects.

This approach keeps two separate lowering steps (legalization plus x86llvm conversion) and results in at least two named operations per intrinsic class, assuming x86llvm applies name mangling.

Such a split addresses dialect usability. However, the named-op explosion remains.

Additionally, applying custom name mangling at the x86llvm level still breaks importing LLVM IR as named MLIR LLVM ops. As an example, see nvvm.mma.sync, which after a round-trip mlir-translate --mlir-to-llvmir | mlir-translate --import-llvm is materialized back as a generic intrinsic call:
llvm.call_intrinsic "llvm.nvvm.mma.MANGLED_VARIANT"(...)

Lower directly to LLVM IR

Similar to the original proposal, but lower x86vector ops directly to LLVM IR during translation.

While feasible, this is less gradual and breaks the current staging of MLIR ops -> MLIR LLVM-compatible ops -> LLVM IR. Also, llvm.call_intrinsic better separates the concerns of legalization vs egress translation.

Split into smaller dialects

Split the x86vector dialect into smaller per-extension dialects, e.g., x86.avx512, x86.avx, etc.

This addresses usability and dialect size to some extent, as only the necessary dialects can be loaded. It might become necessary in the future as more intrinsic ops are added and the x86vector dialect grows in size.

Out of scope for this RFC.

Load only target-specific ops

Selectively load parts of the x86vector ops.

Mentioned only for completeness; the implementation details of such a solution are unclear.

This addresses the runtime overhead of the dialect without a dialect explosion. However, user experience suffers as the dialect itself becomes more complex and contains a large number of ops.

Out of scope for this RFC.


This is great - thank you for working on this!

From what I’ve heard from folks who’ve been involved with hardware dialects much longer than I have, the guiding principle has generally been to keep them slim by either:

  • abstracting at a higher level (e.g. via the Vector dialect), or
  • offloading complexity to LLVM (which is essentially what you’re proposing here).

So this definitely feels like a step in the right direction.

Are there any downsides you’ve encountered? Just curious what your “lessons learned” have been so far.

The only potential issue that comes to mind is the reduced ability to optimize in MLIR itself. If LLVM is mature enough to handle that, then great - but in some cases (like ArmSME), MLIR still plays a big role because LLVM has some catching up to do. (I’m actually supposed to be reviewing some of that soon.)

Also curious: your proposal doesn’t mention AMX. Is that already considered “slim enough”?

-Andrzej

The AMX dialect would be a great candidate for the same cleanup. I leave that for a simple follow-up, depending on the outcome of this RFC.

I’ve chosen to focus on the x86vector dialect to limit the scope and to avoid opening the discussion about dialect grouping, i.e., whether AMX should merge into an x86 dialect or x86vector should be split into smaller per-extension (AVX, AVX512, etc.) dialects. I’m leaning toward the latter (when x86vector grows further) but let’s leave that for the future.

Nothing that really impacts the current usage, which treats the MLIR LLVM dialect as a straightforward last-mile egress step. This is also the core premise of this proposal - all major transformations should already have been performed at the higher abstraction level, i.e., on named MLIR ops. The moment specific intrinsics are materialized, the rest is left to the LLVM backend.

However, I see two potential minor limitations:

  • llvm.call_intrinsic is more opaque - named MLIR LLVM intrinsic ops can carry traits that expose different properties or side effects
  • llvm.call_intrinsic does not pass one-shot bufferization - the op does not model side effects and does not implement CallOpInterface

The first one reinforces the separation of concerns between the two abstractions. The MLIR ops can still attach all necessary traits, and verification predicates can be moved to the legalization pass.
AFAIK, in general none of the CPU intrinsic ops model any particular properties at the MLIR LLVM level, so there are no regressions here.

The second limitation could realistically only occur if IR is lowered “DFS” style. I’d argue one should stop at MLIR ops and apply further conversion only at the egress. However, this limitation could be addressed by improving the llvm.call_intrinsic op itself and/or the bufferization infrastructure.

I haven’t encountered this particular problem (yet?) for x86 but I’m eager to learn more here - it’d be great feedback.
Is there anything you would need to do specifically at the MLIR LLVM level that could not be handled earlier on “proper” MLIR ops?

I am supportive of this RFC.

I agree there’s not much value in having two distinct explicit operations for such target-specific intrinsics. As mentioned earlier, any optimizations that require understanding the operation semantics could be performed at the target dialect level. Should llvm.call_intrinsic prove to be problematic because of the missing side-effect modeling, we may extend it to model side-effects similar to how it has been done for the inline assembly operation (following how LLVM models side effects for intrinsics / calls).

I spoke too soon. We used to operate at that level, but later realized we needed higher-level abstractions. For whatever reason, I assumed some of the existing logic still worked at that level. Sorry for the confusion.

+1 (we can cross that bridge when we get there)

Overall, +1. Great write-up, Adam!