Support for more hardware intrinsics in the ArmNeon dialect

I’m currently working on building a compiler that targets ARM devices, utilizing ARM Neon intrinsics.

I found that ARM Neon has many specialized intrinsics (especially for fixed-point arithmetic). Although these intrinsics can be implemented as a combination of other operations, I need to use them directly to get the best performance. However, I don’t see a way for MLIR to produce those intrinsics without user customization (perhaps adding a new dialect and providing conversion rules directly to LLVM IR).

For example:

  1. VQDMULHQ
    Vector Saturating Doubling Multiply Returning High Half (effectively a fusion of several simpler operations; see the sketch after this list)

  2. VBSLQ
    Vector bitwise select
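
To make the first one concrete, VQDMULH computes roughly `sat_i16((2 * a[i] * b[i]) >> 16)` per lane. Here is a sketch of that expansion in plain `arith` ops (4 lanes of i16; note that `arith` has no saturating truncate, so the final saturating narrow is only indicated in a comment):

```mlir
// Hedged sketch: per-lane semantics of VQDMULH spelled out with
// portable ops. (2*a*b) >> 16 is the same as (a*b) >> 15 in the
// widened arithmetic below.
func.func @vqdmulh_expanded(%a: vector<4xi16>, %b: vector<4xi16>) -> vector<4xi16> {
  %c15 = arith.constant dense<15> : vector<4xi32>
  %aw  = arith.extsi %a : vector<4xi16> to vector<4xi32>     // widen lhs
  %bw  = arith.extsi %b : vector<4xi16> to vector<4xi32>     // widen rhs
  %p   = arith.muli %aw, %bw : vector<4xi32>                 // full 32-bit product
  %hi  = arith.shrsi %p, %c15 : vector<4xi32>                // doubled high half
  // The real instruction saturates here (INT16_MIN * INT16_MIN would
  // overflow, and saturates to INT16_MAX); arith.trunci just wraps.
  %res = arith.trunci %hi : vector<4xi32> to vector<4xi16>
  return %res : vector<4xi16>
}
```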

Most commonly used vector intrinsics should be covered by the ‘vector’ dialect. However, some intrinsics are hardware specific, which is why hardware-specialized dialects such as ‘ArmNeon’ exist. Even so, they don’t seem to cover many of the intrinsics a user may want to utilize.

If these instructions can be generated by folding patterns written in the vector dialect, it would be nice to have documentation on how vector operations are folded to produce these specific hardware intrinsics.

I haven’t looked closely into other hardware-specific dialects such as ‘X86Vector’ or ‘SME’, but if the situation there is similar, it would be nice to extend them as well.

Are there plans to widen support for these kinds of intrinsics? If not, do you think widening intrinsic support in the hardware-specific dialects, or implementing pattern rewriters that convert to these intrinsics, would add value to MLIR?

Yes, it would be great to extend these target-specific dialects. At the moment you can consider them placeholders.

Note that there is a fine line between hardware-specific vector dialect ops and just using llvm.call_intrinsic. We should avoid duplicating ops for duplication’s sake: if higher-level semantic information needs to be represented, then a new hw-specific op makes sense; otherwise llvm.call_intrinsic is preferable.
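
For example, the VQDMULHQ from the original post could be reached today with something like this (a sketch; I’m assuming the `llvm.aarch64.neon.sqdmulh` overload naming, so double-check against the LLVM intrinsic definitions):

```mlir
// Calling the target intrinsic directly, without a dedicated op.
func.func @vqdmulhq(%a: vector<4xi32>, %b: vector<4xi32>) -> vector<4xi32> {
  %r = llvm.call_intrinsic "llvm.aarch64.neon.sqdmulh.v4i32"(%a, %b)
      : (vector<4xi32>, vector<4xi32>) -> vector<4xi32>
  return %r : vector<4xi32>
}
```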

I am not aware of anyone actively working on or extending the ArmNEON dialect, but we are indeed doing something similar to what you suggested for scalable vectors, i.e. the ArmSME and ArmSVE dialects.

I think that we need to be careful to avoid feature creep - there’s little point in adding new intrinsics to these hardware dialects if there’s no mechanism (i.e. a lowering path) to generate them from higher-level abstractions.

Also, there are a lot of widening instructions that would require support for operations with mixed types, but I think it is still unclear what exactly that would entail. See the recent discussion here:

Just raising some points for consideration.

-Andrzej

+1 to what Nicolas said. We have been following that approach for all the target-specific dialects. The introduction of new target-specific ops may also be justified if you plan to do heavy transformations on those ops within the dialect. SME was a case that started as “let’s try to reuse LLVM intrinsics” and then we realized that we needed a better abstraction than LLVM to implement some transformations at that level.

+1 as well. I would also add that if that folding is already happening in the LLVM backend and it’s not really needed at the MLIR level, let’s keep things simple. Introducing intrinsics too early also has significant drawbacks, as the LLVM middle-end will treat them as black boxes most of the time, which would prevent some optimizations from happening.

This one looks interesting if you have an end-to-end story for it :slight_smile:

Vector bitwise select

Is this something that we could model using a plain i1 vector select?
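
Something along these lines (a sketch): VBSL strictly selects per *bit*, but for lane masks produced by comparisons the two are equivalent, and the AArch64 backend can already turn the resulting `llvm.select` into a BSL:

```mlir
// Per-lane select on an i1 mask; for all-ones/all-zeros lane masks
// this matches the bitwise-select semantics of VBSL.
func.func @bitwise_select(%a: vector<8xi16>, %b: vector<8xi16>) -> vector<8xi16> {
  %mask = arith.cmpi slt, %a, %b : vector<8xi16>               // vector<8xi1>
  %res  = arith.select %mask, %a, %b : vector<8xi1>, vector<8xi16>
  return %res : vector<8xi16>
}
```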

I was designing my own language (called ‘Opus’), specialized for writing HPC algorithms using MLIR, and wrote ML layers with it (such as convolution). While compiling the Opus implementations of operations using quantized data types, I found that an efficient C++ implementation would contain the intrinsics I mentioned above, and was thinking of ways to generate those intrinsics from Opus.

Then I thought that, using MLIR’s strong pattern-matching capabilities, I could perhaps generate those intrinsics automatically (via a lowering pass) without having to write them explicitly in code.
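
Roughly, such a pass would recognize the widen/multiply/shift chain I sketched in my first post and collapse it into a single intrinsic call. A hypothetical before/after (the shapes and intrinsic name are illustrative):

```mlir
// Before: the expanded form (widen, multiply, shift, saturating narrow).
//   %aw = arith.extsi %a ...   %bw = arith.extsi %b ...
//   %p  = arith.muli %aw, %bw ...   %hi = arith.shrsi %p, %c15 ...
//   ... saturating truncate back to i16 ...
// After: one call the backend maps straight to VQDMULH.
%r = llvm.call_intrinsic "llvm.aarch64.neon.sqdmulh.v4i16"(%a, %b)
    : (vector<4xi16>, vector<4xi16>) -> vector<4xi16>
```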

If putting those features inside MLIR makes it too complicated, maybe we could add that pattern matching at the LLVM level, if possible. But we would have to be aware of some trade-offs: transforming to specific intrinsics might not be optimal in all cases.

I understand your concerns, and I see your point. For now, I will rely on lowering directly to LLVM intrinsics and think about better ways to overcome this issue. I feel that ‘getting the most out of hardware intrinsics without exposing them to the programmer’ is quite an interesting problem.

There is good advice upthread on how to do arch-specific codegen pipeline construction.

For completeness, there are other options being pursued as well. For example, on x86 and ARM, IREE uses a library of high-performance “microkernels” written in C for a variety of common cases that turn up and have a high degree of microarchitecture-specific variation: https://github.com/openxla/iree/tree/main/runtime/src/iree/builtins/ukernel (see the arch directory for the non-generic versions).

Note that despite having “kernel” in the name, these are treated more like compiler intrinsics. They are statically available to the compiler as standalone bitcode, and when used, the compiler inlines them. Normal compiler optimizations take care of the rest, specializing them and eliminating statically known special branches.

Basically, we often use C as a form of codegen pipeline generator rather than open-coding very microarch-specific intrinsic variants directly in MLIR pipelines.