[RFC] Simplify x86 intrinsic generation

This RFC proposes simplifying the lowering of x86vector operations, which target various CPU extensions (currently AVX and AVX512), through LLVM intrinsics.

Current state

Various MLIR dialects (x86vector, amx, arm_sve, nvgpu, etc.) target LLVM intrinsics through the following steps: MLIR ops -> MLIR LLVM ops -> LLVM IR

The infrastructure is streamlined through tablegen definitions, a slim LLVM legalization conversion, and autogenerated conversions to LLVM IR. This layering generally results in multiple named operations being defined with the following hierarchy (a concrete sketch follows the list):

  • 1 high-level named MLIR op - abstracting a whole intrinsic class, e.g., the multiple variants of the dpbf16ps instruction

  • 1 to N named MLIR LLVM intrinsic ops - providing an intermediate, 1-to-1 LLVM-compatible layer between MLIR and LLVM IR intrinsics

  • 1 generic LLVM IR op - the final target intrinsic, usually an LLVM function call; concrete instructions are materialized later
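
As a rough sketch, here is how this layering looks for the 512-bit dpbf16ps variant (op and intrinsic names as in the current dialect; assembly syntax and exact types are illustrative):

  // 1. High-level named MLIR op abstracting all dpbf16ps variants
  %dot = x86vector.avx512.dot %src, %a, %b : vector<32xbf16> -> vector<16xf32>
  // 2. Named MLIR LLVM intrinsic op - a 1-to-1 wrapper for the LLVM intrinsic
  %wrap = "x86vector.avx512.intr.dpbf16ps.512"(%src, %a, %b)
      : (vector<16xf32>, vector<32xbf16>, vector<32xbf16>) -> vector<16xf32>
  ; 3. Final LLVM IR intrinsic call materialized during translation
  %res = call <16 x float> @llvm.x86.avx512bf16.dpbf16ps.512(<16 x float> %src, <32 x bfloat> %a, <32 x bfloat> %b)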

CPU abstractions usually gather both types of MLIR operations under a single dialect, e.g., x86vector and arm_sve, while the GPU setup splits the two layers into separate dialects, e.g., nvgpu -> nvvm and amdgpu -> rocdl.

The problem

Note: This analysis focuses on x86 ecosystem issues, which may not apply to the other stacks mentioned above.

The existing structure has a scalability issue.

Each machine intrinsic requires at least two operations - an MLIR op and an MLIR LLVM wrapper - which generates a lot of boilerplate. While most of it is created automatically, the LLVM legalization pass still involves a manual conversion layer to glue the two abstractions together, plus a separate x86vector-to-LLVM-IR translation.

Right now, the x86vector dialect contains 22 operations - only 8 of them are actual MLIR operations useful for lowering from higher-level dialects. The rest is overhead introduced by the current LLVM lowering strategy.

There are thousands of x86 intrinsics covering different instruction sets, with multiple variants of the same core operation, different name mangling patterns, and extra extensions (e.g., x86.avx512 vs x86.avx512bf16). This diversity prevents simple generalization into named intrinsic ops or the use of existing tools like OneToOneConvertToLLVMPattern.
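
To illustrate the mangling diversity, the dpbf16ps family alone fans out into size-suffixed intrinsics under an extension-specific prefix (avx512bf16 rather than avx512):

  llvm.x86.avx512bf16.dpbf16ps.128
  llvm.x86.avx512bf16.dpbf16ps.256
  llvm.x86.avx512bf16.dpbf16ps.512

Other intrinsic families follow different suffix schemes, so no single mangling rule covers them all.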

Today, only a small fraction of these intrinsics is exposed. However, each new intrinsic significantly increases the dialect size, impacts documentation and code readability, and still requires mechanical copy-pasting of tablegen and conversion code.

Proposal

Remove the intermediate layer of intrinsic wrappers implemented as named MLIR LLVM ops.

Instead, gather all lowering logic (type conversion, name mangling, etc.) under the LLVM legalization step and use the existing llvm.call_intrinsic op, which allows reusing the generic lowering for emitting LLVM IR.
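
With this in place, legalizing a high-level op boils down to emitting a single generic intrinsic call. A minimal sketch for the 512-bit dpbf16ps variant (the mangled intrinsic name and the types would be computed by the conversion pattern):

  %res = llvm.call_intrinsic "llvm.x86.avx512bf16.dpbf16ps.512"(%src, %a, %b)
      : (vector<16xf32>, vector<32xbf16>, vector<32xbf16>) -> vector<16xf32>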

The goal is to:

  1. Minimize dialect size - expose only higher-level MLIR ops abstracting variants of the same intrinsic or even multiple intrinsics.

  2. Make dialect easier to use - cleaner docs, fewer choices (MLIR op vs specific MLIR LLVM intrinsic variant).

  3. Simplify adding new ops - now reduced to one tablegen entry and one conversion pattern.

The core motivation for removing the named MLIR LLVM ops (e.g., x86vector.avx512.intr.dpbf16ps.128|256|512) from the x86vector dialect is that today they only serve as a transient middle layer. Analyses and transformations should target the abstract MLIR ops (e.g., x86vector.avx512.dot), and further low-level optimizations are better left to the LLVM backend, which targets concrete instructions.

When possible, resolving overloaded intrinsics is left to the llvm.call_intrinsic lowering instead of separate tablegen op definitions.
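
As a sketch of what that buys: for an overloaded intrinsic, the base name alone can be emitted and the translation to LLVM IR derives the mangled symbol from the operand types. The example below uses the generic llvm.fmuladd intrinsic rather than an x86-specific one:

  // The overload suffix is resolved from the operand types during
  // translation, producing a call to @llvm.fmuladd.v8f32.
  %r = llvm.call_intrinsic "llvm.fmuladd"(%a, %b, %c)
      : (vector<8xf32>, vector<8xf32>, vector<8xf32>) -> vector<8xf32>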

The implementation PR:

Alternatives

Split into two dialects

Following the GPU approach, split the x86vector dialect into an abstract MLIR dialect and an MLIR LLVM-compatible one - for example, x86vector and x86llvm dialects.

This approach keeps two separate lowering steps (legalization plus x86llvm conversion) and results in at least two named operations per intrinsic class, assuming x86llvm applies name mangling.

Such a split addresses dialect usability. However, the named-op explosion remains.

Additionally, applying custom name mangling at the x86llvm level still breaks importing LLVM IR as named MLIR LLVM ops. As an example, see nvvm.mma.sync, which after a round-trip mlir-translate --mlir-to-llvmir | mlir-translate --import-llvm is materialized back as a generic intrinsic call:
llvm.call_intrinsic "llvm.nvvm.mma.MANGLED_VARIANT"(...)

Lower directly to LLVM IR

Similar to the original proposal, but lower x86vector ops directly to LLVM IR during translation.

While feasible, this is less gradual and breaks the current staging of MLIR ops -> MLIR LLVM-compatible ops -> LLVM IR. Also, llvm.call_intrinsic better separates the concerns of legalization vs egress translation.

Split into smaller dialects

Split the x86vector dialect into smaller per-extension dialects, e.g., x86.avx512, x86.avx, etc.

This addresses usability and dialect size to some extent, as only the necessary dialects can be loaded. It might become necessary in the future as more intrinsic ops are added and the x86vector dialect grows in size.

Out of scope for this RFC.

Load only target-specific ops

Selectively load parts of the x86vector ops.

Mentioned only for completeness; the implementation details of such a solution are unclear.

This addresses the runtime overhead of the dialect without a dialect explosion. However, user experience suffers as the dialect itself becomes more complex and contains a large number of ops.

Out of scope for this RFC.


This is great - thank you for working on this!

From what I’ve heard from folks who’ve been involved with hardware dialects much longer than I have, the guiding principle has generally been to keep them slim by either:

  • abstracting at a higher level (e.g. via the Vector dialect), or
  • offloading complexity to LLVM (which is essentially what you’re proposing here).

So this definitely feels like a step in the right direction.

Are there any downsides you’ve encountered? Just curious what your “lessons learned” have been so far.

The only potential issue that comes to mind is the reduced ability to optimize in MLIR itself. If LLVM is mature enough to handle that, then great - but in some cases (like ArmSME), MLIR still plays a big role because LLVM has some catching up to do. (I’m actually supposed to be reviewing some of that soon.)

Also curious: your proposal doesn’t mention AMX. Is that already considered “slim enough”?

-Andrzej

The AMX dialect would be a great candidate for the same cleanup. I leave that for a simple follow-up, depending on the outcome of this RFC.

I’ve chosen to focus on the x86vector dialect to limit the scope and to avoid opening the discussion about dialect grouping, i.e., whether AMX should merge into an x86 dialect or x86vector should be split into smaller per-extension (AVX, AVX512, etc.) dialects. I’m leaning toward the latter (when x86vector grows further) but let’s leave that for the future.

Nothing that really impacts the current usage, which treats the MLIR LLVM dialect as a straightforward last-mile egress step. This is also the core premise of this proposal - all major transformations should already have been performed at the higher abstraction level, i.e., on named MLIR ops. The moment specific intrinsics are materialized, the rest is left to the LLVM backend.

However, I see two potential minor limitations:

  • llvm.call_intrinsic is more opaque - named MLIR LLVM intrinsic ops can carry traits that expose different properties or side effects
  • llvm.call_intrinsic does not pass one-shot bufferization - the op does not model side effects and does not implement CallOpInterface

The first one reinforces the separation of concerns between the two abstractions. The MLIR ops can still attach all necessary traits, and verification predicates can be moved to the legalization pass.
AFAIK, in general none of the CPU intrinsic ops model any particular properties at the MLIR LLVM level, so there are no regressions here.

The second limitation could realistically only occur if IR is lowered “DFS” style. I’d argue one should stop at MLIR ops and apply further conversion only at the egress. However, this limitation could be addressed by improving the llvm.call_intrinsic op itself and/or the bufferization infrastructure.

I haven’t encountered this particular problem (yet?) for x86 but I’m eager to learn more here - it’d be great feedback.
Is there anything you would need to do specifically at the MLIR LLVM level that could not be handled earlier on “proper” MLIR ops?

I am supportive of this RFC.

I agree there’s not much value in having two distinct explicit operations for such target-specific intrinsics. As mentioned earlier, any optimizations that require understanding the operation semantics could be performed at the target dialect level. Should llvm.call_intrinsic prove to be problematic because of the missing side-effect modeling, we may extend it to model side-effects similar to how it has been done for the inline assembly operation (following how LLVM models side effects for intrinsics / calls).

I spoke too soon. We used to operate at that level, but later realized we needed higher-level abstractions. For whatever reason, I assumed some of the existing logic still worked at that level. Sorry for the confusion.

+1 (we can cross that bridge when we get there)

Overall, +1. Great write-up, Adam!