Support for more hardware intrinsics in the ArmNeon dialect

I’m currently working on building a compiler that targets ARM devices, utilizing ARM Neon intrinsics.

I found that ARM Neon has many specialized intrinsics (especially for fixed-point arithmetic). Although these intrinsics can be implemented as a combination of other operations, I need to use them directly to get the best performance. However, I don’t see a way for MLIR to produce those intrinsics without user customization (perhaps adding a new dialect and providing conversion rules directly to LLVM IR).

For example:

  1. VQDMULHQ
    Vector Saturating Doubling Multiply Returning High Half (effectively a fusion of several simpler operations; see the sketch after this list)

  2. VBSLQ
    Vector bitwise select
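
To make the first one concrete, VQDMULH computes roughly `sat_i16((2 * a[i] * b[i]) >> 16)` per lane. Here is a sketch of that expansion in plain `arith` ops (4 lanes of i16; note that `arith` has no saturating truncate, so the final saturating narrow is only indicated in a comment):

```mlir
// Hedged sketch: per-lane semantics of VQDMULH spelled out with
// portable ops. (2*a*b) >> 16 is the same as (a*b) >> 15 in the
// widened arithmetic below.
func.func @vqdmulh_expanded(%a: vector<4xi16>, %b: vector<4xi16>) -> vector<4xi16> {
  %c15 = arith.constant dense<15> : vector<4xi32>
  %aw  = arith.extsi %a : vector<4xi16> to vector<4xi32>     // widen lhs
  %bw  = arith.extsi %b : vector<4xi16> to vector<4xi32>     // widen rhs
  %p   = arith.muli %aw, %bw : vector<4xi32>                 // full 32-bit product
  %hi  = arith.shrsi %p, %c15 : vector<4xi32>                // doubled high half
  // The real instruction saturates here (INT16_MIN * INT16_MIN would
  // overflow, and saturates to INT16_MAX); arith.trunci just wraps.
  %res = arith.trunci %hi : vector<4xi32> to vector<4xi16>
  return %res : vector<4xi16>
}
```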

Most commonly used vector intrinsics should be covered by the ‘vector’ dialect. However, some intrinsics are hardware specific, which is why hardware-specialized dialects such as ‘ArmNeon’ exist. Even so, they don’t seem to cover many of the intrinsics a user may want to utilize.

If these instructions can be generated by folding patterns written in the vector dialect, it would be nice to have documentation on how vector operations are folded to produce these specific hardware intrinsics.

I haven’t looked closely into other hardware-specific dialects such as ‘X86Vector’ or ‘SME’, but if the situation there is similar, it would be nice to extend them as well.

Are there plans to widen support for these kinds of intrinsics? If not, do you think widening intrinsic support in the hardware-specific dialects, or implementing pattern rewriters that convert to these intrinsics, would add value to MLIR?

Yes, it would be great to extend these target-specific dialects. At the moment you can consider them placeholders.

Note that there is a fine line between hardware-specific vector dialect ops and just using llvm.call_intrinsic. We should avoid duplicating ops for duplication’s sake: if higher-level semantic information needs to be represented, then a new hw-specific op makes sense; otherwise llvm.call_intrinsic is preferable.
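
For example, the VQDMULHQ from the original post could be reached today with something like this (a sketch; I’m assuming the `llvm.aarch64.neon.sqdmulh` overload naming, so double-check against the LLVM intrinsic definitions):

```mlir
// Calling the target intrinsic directly, without a dedicated op.
func.func @vqdmulhq(%a: vector<4xi32>, %b: vector<4xi32>) -> vector<4xi32> {
  %r = llvm.call_intrinsic "llvm.aarch64.neon.sqdmulh.v4i32"(%a, %b)
      : (vector<4xi32>, vector<4xi32>) -> vector<4xi32>
  return %r : vector<4xi32>
}
```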

I am not aware of anyone actively working on or extending the ArmNEON dialect, but we are indeed doing something similar to what you suggested for scalable vectors, i.e. the ArmSME and ArmSVE dialects.

I think that we need to be careful to avoid feature creep - there’s little point in adding new intrinsics to these hardware dialects if there’s no mechanism (i.e. a lowering path) to generate them from higher-level abstractions.

Also, there are a lot of widening instructions that would require support for operations with mixed types, but I think it is still unclear what exactly that would entail. See the recent discussion here:

Just raising some points for consideration.

-Andrzej

+1 to what Nicolas said. We have been following that approach for all the target-specific dialects. The introduction of new target-specific ops may also be justified if you plan to do heavy transformations on those ops within the dialect. SME was a case that started as “let’s try to reuse LLVM intrinsics” and then we realized that we needed a better abstraction than LLVM to implement some transformations at that level.

+1 as well. I would also add that if that folding is already happening in the LLVM backend and it’s not really needed at the MLIR level, let’s keep things simple. Introducing intrinsics too early also has significant drawbacks, as the LLVM middle-end will treat them as black boxes most of the time, which would prevent some optimizations from happening.

This one looks interesting if you have an end-to-end story for it :slight_smile:

Vector bitwise select

Is this something that we could model using a plain i1 vector select?
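
Something along these lines (a sketch): VBSL strictly selects per *bit*, but for lane masks produced by comparisons the two are equivalent, and the AArch64 backend can already turn the resulting `llvm.select` into a BSL:

```mlir
// Per-lane select on an i1 mask; for all-ones/all-zeros lane masks
// this matches the bitwise-select semantics of VBSL.
func.func @bitwise_select(%a: vector<8xi16>, %b: vector<8xi16>) -> vector<8xi16> {
  %mask = arith.cmpi slt, %a, %b : vector<8xi16>               // vector<8xi1>
  %res  = arith.select %mask, %a, %b : vector<8xi1>, vector<8xi16>
  return %res : vector<8xi16>
}
```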

I was designing my own language (called ‘Opus’), specialized for writing HPC algorithms using MLIR, and wrote ML layers with it (such as convolution). While compiling the Opus implementations of operations using quantized data types, I found that an efficient C++ implementation would contain the intrinsics I mentioned above, and was thinking of ways to generate those intrinsics from Opus.

Then I thought that, using MLIR’s strong pattern-matching capabilities, I could perhaps generate those intrinsics automatically (via a lowering pass) without having to write them explicitly in code.
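
Roughly, such a pass would recognize the widen/multiply/shift chain I sketched in my first post and collapse it into a single intrinsic call. A hypothetical before/after (the shapes and intrinsic name are illustrative):

```mlir
// Before: the expanded form (widen, multiply, shift, saturating narrow).
//   %aw = arith.extsi %a ...   %bw = arith.extsi %b ...
//   %p  = arith.muli %aw, %bw ...   %hi = arith.shrsi %p, %c15 ...
//   ... saturating truncate back to i16 ...
// After: one call the backend maps straight to VQDMULH.
%r = llvm.call_intrinsic "llvm.aarch64.neon.sqdmulh.v4i16"(%a, %b)
    : (vector<4xi16>, vector<4xi16>) -> vector<4xi16>
```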

If putting those features inside MLIR makes it too complicated, maybe we could add that pattern matching at the LLVM level, if possible. But we would have to be aware of some trade-offs: transforming to specific intrinsics might not be optimal in all cases.

I understand your concerns, and I see your point. For now, I will rely on lowering directly to LLVM intrinsics and think about better ways to overcome this issue. I feel that ‘getting the most out of hardware intrinsics without exposing them to the programmer’ is quite an interesting problem.

There is good advice upthread on how to do arch-specific codegen pipeline construction.

For completeness, there are other options being pursued as well. For example, on x86 and ARM, IREE uses a library of high-performance “microkernels” written in C for a variety of common cases that turn up and have a high degree of microarchitecture-specific variation: https://github.com/openxla/iree/tree/main/runtime/src/iree/builtins/ukernel (see the arch directory for the non-generic versions).

Note that despite having “kernel” in the name, these are treated more like compiler intrinsics. They are statically available to the compiler as standalone bitcode, and when used, the compiler inlines them. Normal compiler optimizations take care of the rest, specializing them and eliminating statically known special branches.

Basically, we often use C as a form of codegen pipeline generator rather than open-coding very microarch-specific intrinsic variants directly in MLIR pipelines.