This RFC proposes simplifying the lowering of x86vector operations, which target various CPU extensions (currently AVX and AVX512), through LLVM intrinsics.
Current state
Various MLIR dialects (x86vector, amx, arm_sve, nvgpu, etc.) target LLVM intrinsics through the following steps: MLIR ops -> MLIR LLVM ops -> LLVM IR
The infrastructure is streamlined through tablegen definitions, a slim LLVM legalization conversion, and autogenerated conversions to LLVM IR. This layering generally results in multiple named operations being defined with the following hierarchy (see the sketch after the list):
- 1 high-level named MLIR op - abstracting whole intrinsic class, e.g., multiple variants of dpbf16ps instruction
- 1 to N named MLIR LLVM intrinsic ops - providing intermediate 1-to-1 LLVM-compatible layer between MLIR and LLVM IR intrinsics
- 1 generic LLVM IR op - final target intrinsic, usually an LLVM function call; concrete instructions are materialized later
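To make the layering concrete, here is a minimal sketch of the three stages for the bf16 dot product. The op names follow the current x86vector dialect; the exact assembly syntax is abbreviated for illustration:

```mlir
// Stage 1: high-level named MLIR op abstracting all dpbf16ps variants.
%dot = x86vector.avx512.dot %src, %a, %b
  : vector<32xbf16> -> vector<16xf32>

// Stage 2: named MLIR LLVM intrinsic op, 1-to-1 with one LLVM intrinsic;
// a separate op exists for each .128/.256/.512 variant.
%intr = "x86vector.avx512.intr.dpbf16ps.512"(%src, %a, %b)
  : (vector<16xf32>, vector<32xbf16>, vector<32xbf16>) -> vector<16xf32>

// Stage 3: generic LLVM IR intrinsic call materialized during translation:
//   %r = call <16 x float> @llvm.x86.avx512bf16.dpbf16ps.512(...)
```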
CPU abstractions usually gather both types of MLIR operations under a single dialect, e.g., x86vector and arm_sve, while the GPU setup splits the two layers into separate dialects, e.g., nvgpu -> nvvm and amdgpu -> rocdl.
The problem
Note: This analysis focuses on x86 ecosystem issues, which may not be applicable to the other stacks mentioned above.
The existing structure has a scalability issue.
Each machine intrinsic requires at least two operations - an MLIR op and an MLIR LLVM wrapper - which generates a lot of boilerplate. While most of it is generated automatically, the LLVM legalization pass still involves a manual conversion layer to glue the two abstractions together, plus a separate x86vector-to-LLVM IR translation.
Right now, the x86vector dialect contains 22 operations - only 8 of those are actual MLIR operations useful for lowering from higher-level dialects. The rest is overhead introduced by the current LLVM lowering strategy.
There are thousands of x86 intrinsics covering different instruction sets, with multiple variants of the same core operation, different name-mangling patterns, and extra extensions (e.g., x86.avx512 vs x86.avx512bf16). This diversity prevents simple generalization of named intrinsic ops or the use of existing tools like OneToOneConvertToLLVMPattern.
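A few concrete intrinsic names illustrate the inconsistency (shown as comments; the suffix scheme differs per family, so no single mangling rule covers them all):

```mlir
// Vector width as a trailing suffix (one intrinsic per width):
//   llvm.x86.avx512bf16.dpbf16ps.128 / .256 / .512
// Element type and width folded into the base name:
//   llvm.x86.avx.rsqrt.ps.256
// Overloaded intrinsic, type suffix mangled by LLVM itself:
//   llvm.x86.avx512.mask.compress -> llvm.x86.avx512.mask.compress.v16f32
```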
Today, only a small fraction of intrinsics is exposed. However, each new intrinsic significantly increases the dialect size, impacts documentation and code readability, and still requires mechanical copy-pasting of tablegen and conversion code.
Proposal
Remove the intermediate layer of intrinsic wrappers built from named MLIR LLVM ops.
Instead, gather all lowering logic (type conversion, name mangling, etc.) under the LLVM legalization step and use the existing llvm.call_intrinsic op, which allows reuse of the existing generic lowering for emitting LLVM IR.
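As a hypothetical before/after sketch of the proposed legalization (types as in the earlier example; the mangled intrinsic name is computed by the conversion pattern):

```mlir
// Before: the single high-level op exposed by the dialect.
%0 = x86vector.avx512.dot %src, %a, %b
  : vector<32xbf16> -> vector<16xf32>

// After LLVM legalization: a generic intrinsic call, no per-variant
// named wrapper op required.
%0 = llvm.call_intrinsic "llvm.x86.avx512bf16.dpbf16ps.512"(%src, %a, %b)
  : (vector<16xf32>, vector<32xbf16>, vector<32xbf16>) -> vector<16xf32>
```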
The goal is to:
- Minimize dialect size - expose only higher-level MLIR ops abstracting variants of the same intrinsic or even multiple intrinsics.
- Make dialect easier to use - cleaner docs, fewer choices (MLIR op vs specific MLIR LLVM intrinsic variant).
- Simplify adding new ops - now reduced to one tablegen entry and one conversion pattern.
The core motivation for removing the named MLIR LLVM ops (e.g., x86vector.avx512.intr.dpbf16ps.128|256|512) from the x86vector dialect is that today they only serve as a transient middle layer. Analyses and transformations should target the abstract MLIR ops (e.g., x86vector.avx512.dot), and further low-level optimizations are better left to the LLVM backend, which targets concrete instructions.
When possible, resolving overloaded intrinsics is left to llvm.call_intrinsic lowering instead of separate tablegen op definitions.
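For example, a sketch with the overloaded AVX512 compress intrinsic, where the type suffix can be omitted and is derived from the operand types during translation (operand order assumed to follow the LLVM intrinsic: data, pass-through, mask):

```mlir
// Base name only; LLVM resolves this to
// @llvm.x86.avx512.mask.compress.v16f32 from the argument types.
%0 = llvm.call_intrinsic "llvm.x86.avx512.mask.compress"(%a, %src, %k)
  : (vector<16xf32>, vector<16xf32>, vector<16xi1>) -> vector<16xf32>
```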
The implementation PR:
Alternatives
Split into two dialects
Following the GPU approach, split the x86vector dialect into an abstract MLIR dialect and an MLIR LLVM-compatible one - for example, x86vector and x86llvm dialects.
This approach keeps two separate lowering steps (legalization plus x86llvm conversion) and results in at least two named operations per intrinsic class, assuming x86llvm applies name mangling.
Such a split addresses dialect usability. However, the named-ops explosion remains.
Additionally, applying custom name mangling at the x86llvm level still breaks importing LLVM IR as named MLIR LLVM ops. As an example, see nvvm.mma.sync, which after a round trip through mlir-translate --mlir-to-llvmir | mlir-translate --import-llvm is materialized back as a generic MLIR intrinsic call:
llvm.call_intrinsic "llvm.nvvm.mma.MANGLED_VARIANT"(...)
Lower directly to LLVM IR
Similar to the original proposal but lowers directly to LLVM IR.
While feasible, this is less gradual and breaks the current staging of MLIR ops -> MLIR LLVM-compatible ops -> LLVM IR. Also, llvm.call_intrinsic better separates concerns: legalization vs egress translation.
Split into smaller dialects
Split the x86vector dialect into smaller dialects per extension, e.g., x86.avx512, x86.avx, etc.
This addresses usability and dialect size to some extent, as only the necessary dialects need to be loaded. Such a split might become necessary in the future as more intrinsic ops are added and the x86vector dialect grows in size.
Out of scope for this RFC.
Load only target-specific ops
Selectively load only parts of the x86vector ops.
Mentioned for completeness; the implementation details of such a solution are unclear.
This addresses the runtime overhead of the dialect without a dialect explosion. However, user experience is impacted, as the dialect itself becomes more complex and contains a large number of ops.
Out of scope for this RFC.