Implementing the ARM NEON Intrinsics for PowerPC

Hello LLVM Devs,

Thanks for helping me previously to cross-compile for ARM, I managed to get a working toolchain and am currently having fun compiling different toy problems and running them on a pandaboard.

As part of my research I am trying to implement the ARM NEON Intrinsics in the PowerPC LLVM backend. I am still at the beginning of my efforts and am not yet familiar with either the ARM or the PowerPC backends. After I started investigating the code and found out that in total it is more than 100 kloc for the two backends I thought it is a good idea to ask you for some hints of where I should start from.

I have written a small unrelated experimental backend for LLVM before, so I have some experience with the topic.

Thanks,

  • Stan

Stan,

Do you mean that you want to emulate the ARM NEON intrinsics on PowerPC?

-Hal

Hello Hal,

I am not very familiar with the DSP capabilities of PowerPC, but I imagine there will be instructions for simple vector operations like vector addition, multiplication, etc. so for these I imagine the implementation would consist of just outputting the correct instruction. However, for NEON instructions like the reciprocal step (see http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.dui0489c/CIHDIACI.html) it is unlikely that there is a corresponding PowerPC vector instruction, so these will need to be emulated, yes.

  • Stan

How does this make any sense? NEON intrinsics are there to support code
generation targeting the ARM NEON SIMD unit on the ARM architecture.
Power/PowerPC as it's own AltiVec/VSX SIMD units, which in turn has it's
own intrinsics.

If you want write code that explicitly targets CPU execution units it's
necessarily tied to that specific CPU architecture. If you just want to
test code for written for a different CPU on a development box your best
bet is to use a VM like QEMU with CPU emulation.

If you want to write code that will take advantage of whatever SIMD
hardware is available you might want to try abstracting your
implementation and use one of the many libraries which provide a higher
level API to SIMD optimized functionality.

(Note: these are personal opinions rather than anything from my employer.)

Although unusual, there might be circumstances in which it would make sense.

If you want write code that explicitly targets CPU execution units it's
necessarily tied to that specific CPU architecture. If you just want to
test code for written for a different CPU on a development box your best
bet is to use a VM like QEMU with CPU emulation.

It's possible to have either already written code to analyse, or be
intending
to write code that will eventually
be deployed on a particular mobile architecture but wish to develop that on
a desktop
machine. Using an architectural simulation will potentially incur more of a
cost than implementing as much optimization of the emulation via compiler
transformation at compile time. (Whether this is actually enough all the
work of writing
an LLVM backend is another question of course.)

Cheers,
Dave

For example, Eigen library [1] supports both AltiVec and NEON.

[1] http://eigen.tuxfamily.org

Here is an example implementation of reciprocal square root with AltiVec intinsics:

http://web.archive.org/web/20090810124308/http://developer.apple.com/hardwaredrivers/ve/algorithms.html

I have to agree with you that this doesn't make much sense, but there is a
case where you would want something like that: when the original source
uses NEON intrinsics, and there is no alternative in AltiVec, AVX or even
plain C.

We encourage people to use NEON intrinsics, as opposed to writing inline
NEON assembly, when the compiler cannot vectorize your code properly. This
may fix the current problem of under-performing forward-incompatible inline
asm, and it does solve the portability issue across ARM sub-architectures
(ex. v7 vs v8), but it doesn't help on portability across entirely
different architectures. Since it's not easy to vectorize every code, and
not desired to have special cases hard-coded in the vectorizer, I don't see
another solution to this problem.

Before, you'd have assembly files with NEON specific code, another with
AltiVec specific and so on, and now you'd have C files with each
intrinsics, which is better. But, as you said yourself, the semantics of
NEON instructions are not the same as other SIMD ISAs, so if you only have
the NEON file and want to create an AltiVec version, you'll have to
understand both pretty well.

Stanislav,

If I got it right above, I think it would be better if you could do that
transformation in IR, with a mapping infrastructure between each SIMD ISA.
Something that could represent every possible SIMD instruction, and how
each target represents them, so in one side you read the intrinsics (and
possibly IR operations on vectors), translate to this meta-SIMD language,
then export on the SIMD language that you want.

A tool like this, possibly exporting back to C code (so you can add it to
your project as an one-off pass), would be valuable to all programs that
have legacy hard-coded SSE routines to run on any platform that support
SIMD operations.

I have no idea how easy would be to do that, let alone if it's at all
possible, but it seems that this is what you want. Correct me if I'm wrong.

cheers,
--renato

Or to compile existing code using NEON intrinsics and run it on PowerPC device
without changes.

How does this make any sense?

I have to agree with you that this doesn't make much sense, but there is a
case where you would want something like that: when the original source
uses NEON intrinsics, and there is no alternative in AltiVec, AVX or even
plain C.

This is exactly the case that I am in. I want to make DSP code written in
C, but with NEON intrinsics "portable" as it is less feasible to rewrite it.

Stanislav,

If I got it right above, I think it would be better if you could do that
transformation in IR, with a mapping infrastructure between each SIMD ISA.
Something that could represent every possible SIMD instruction, and how
each target represents them, so in one side you read the intrinsics (and
possibly IR operations on vectors), translate to this meta-SIMD language,
then export on the SIMD language that you want.

A tool like this, possibly exporting back to C code (so you can add it to
your project as an one-off pass), would be valuable to all programs that
have legacy hard-coded SSE routines to run on any platform that support
SIMD operations.

I have no idea how easy would be to do that, let alone if it's at all
possible, but it seems that this is what you want. Correct me if I'm wrong.

Again, the tool you describe is exactly what I ultimately want to create.
The translation to AltiVec would be a step towards understanding how to
manipulate the intrinsics, but it is not a goal on its own.

Do you have any ideas where in the whole LLVM structure would it fit
(should it be implemented as a separate optional pass)?

Thanks,
- Stan

I think there are two separate things:

1. A conversion tool, that will read specific SIMD-1 C files and produce
SIMD-2 C files. This will need the C back-end to be working well, or
implement its own SIMD-specific C backend, which is in itself, quite a big
task. This tool would have to use a function pass that would scan for
SIMD-1 intrinsics, and convert them to SIMD-2 in the IR level, so your tool
would read the SIMD-1 file as if it were targeting arch-2, and the pass
would convert automatically, using the function pass below.

2. A function pass, to do the conversion between SIMD-1 intrinsics to
SIMD-2, based on their original namespace inside LLVM (AVX, NEON, etc) and
the target parameter (for SIMD-2 output). This FP should be off by default,
of course, but could be turned on (say -convert-simd-intrinsics) when
compiling legacy code.

I'd start with just cataloguing all NEON and AltiVec intrinsics, and trying
to map them. You'll probably hit cases where NEON A == AltiVec A + op1 +
op2, so you'll have to take head and tail operations around the intrinsics
as possible part of an interchangeable SIMD operation.

As a first example, you could write a function pass to get only the ones
that map nicely 1-to-1 and see if the concept works, and if people are
happy with your changes. It should be able to read a (very simple) NEON C
file and produce compatible PowerPC AltiVec assembly code. After the
infrastructure is in place, you can continue incrementing it by adding
support for more intrinsics, more SIMD ISAs, and more complex patterns
(involving surrounding instructions, etc). In parallel, you could try to
create the tool that would do the source-to-source transformation, using
the pass that you have written.

Of course, adding tests for all known supported conversions to/from would
be critical to the success of your project.

cheers,
--renato

I'm sure Sean (CC'd) would agree, that adding some documentation would be
equally valuable. :wink:

--renato

How does this make any sense?

I have to agree with you that this doesn't make much sense, but there
is a case where you would want something like that: when the
original source uses NEON intrinsics, and there is no alternative in
AltiVec, AVX or even plain C.

This is exactly the case that I am in. I want to make DSP code
written in C, but with NEON intrinsics "portable" as it is less
feasible to rewrite it.

Are you using Clang as the frontend? If so, my recommendation would be to start by creating a header file that implements the NEON intrinsics in terms of generic functionality and the Altivec ones. The header file would need to look kind of like this:

#if defined(__powerpc__) || defined(__ppc__)

#define neon_intrinsic1 ppc_neon_intrinsic1
static __inline__ vec_type __attribute__((__always_inline__, __nodebug__))
ppc_neon_intrinsic1(vec_type a1, vec_type a2) {
  ...
}

...

#endif

If you look in tools/clang/lib/Headers you'll see lots of example intrinsics header files, and if you look in your build directory in tools/clang/lib/Headers you'll find the arm_neon.h.inc file.

You can certainly do this in terms of an LLVM transformation, but I think that creating some kind of header file would be, at least, where I'd start prototyping this.

Also, you'll want to make sure that the endianness of the ARM and PPC environments agree (or that the code is endian-neutral), otherwise you'll likely have bigger problems :wink:

-Hal

Yes, this is a good approach to understanding the problem. But I wouldn't
use this as a final solution, as it scales quadratically with the number of
supported SIMD architectures, including all variations (like NEON v7, v8
and CPU dependent choices).

cheers,
--renato

Thank you all for the help.

Here is my plan of action:

  1. Read up on NEON and AltiVec

  2. Write ((small) parts of) arm_neon.h using AltiVec intrinsics

  3. Write a function pass to convert simple (vector arithmetic) NEON C code to PowerPC AltiVec assembly code and submit for review.

  4. Add NEON intrinsics that map to multiple AltiVec instructions

  5. Add patterns involving surrounding instructions in order to support single complex AltiVec instructions

  6. (not necessarily after 4 and 5, but maybe during): Try producing C code with AltiVec intrinsics as output, when given C code with NEON intrinsics.
    Things to be aware of:

  7. Endian-ness

  8. Importance of tests and documentation
    I will update you once I have some progress.

Cheers,

  • Stan

As crazy as this is, the reverse (AltiVec intrinsics on ARM hardware) was working in tree for a while for the common functions.

Another approach would be to develop a libTooling tool that helps rewrite processor-specific SIMD code to use some generic SIMD library (a C++1y one?) and provide ports of that library.

Alex