Target intrinsics and translation

LLVM (via clang) currently translates target intrinsics to generic IR
whenever it can. For example, on x86 it translates _mm_loadu_pd to a
simple load instruction with an alignment of 1. The backend is then
responsible for translating the load back to the corresponding
machine instruction.

The advantage of this is that it opens up such code to LLVM's
optimizers, which can theoretically speed it up.

The disadvantage is that it's pretty surprising when intrinsics
designed for the sole purpose of giving programmers access to specific
machine instructions are translated to something other than those
instructions. LLVM's optimizers aren't perfect, and there are many
aspects of performance which they don't understand, so they can also
pessimize code.

If the user has gone through the trouble of using target-specific
intrinsics to ask for a specific sequence of machine instructions,
is it really appropriate for the compiler to emit different
instructions, using its own heuristics?

Dan

There are several benefits to doing it this way:

1. Fewer intrinsics in the compiler, fewer patterns in the targets, less redundancy.

2. The compiler should know better than the user, because code is often written and forgotten about. The compiler can add value when building hand-tuned and highly optimized SSE2 code for an SSE4 chip, for example.

3. If the compiler is pessimizing (e.g.) unaligned loads, then it is a serious bug that should be fixed, not something that should be worked around by adding intrinsics. Adding intrinsics just makes it much less likely that we'd find out about it and then be able to fix it.

4. In practice, if we had intrinsics for everything, I strongly suspect that a lot of generic patterns wouldn't get written. This would pessimize "portable" code using standard IR constructs.

-Chris

Hi Dan,

> LLVM (via clang) currently translates target intrinsics to generic IR
> whenever it can. For example, on x86 it translates _mm_loadu_pd to a
> simple load instruction with an alignment of 1. The backend is then
> responsible for translating the load back to the corresponding
> machine instruction.
>
> The advantage of this is that it opens up such code to LLVM's
> optimizers, which can theoretically speed it up.
>
> The disadvantage is that it's pretty surprising when intrinsics
> designed for the sole purpose of giving programmers access to specific
> machine instructions are translated to something other than those
> instructions.

gcc only supports a limited set of vector expressions. If you want to
shuffle a vector, how do you do that? The only way (AFAIK) is to use
a target intrinsic. Thus people can end up using target intrinsics
because it's the only way they have to express vector operations, not
because they absolutely want to have that particular instruction.

> LLVM's optimizers aren't perfect, and there are many
> aspects of performance which they don't understand, so they can also
> pessimize code.

Such cases should be improved. They would never be noticed if everyone
was using target intrinsics rather than generic IR.

> If the user has gone through the trouble of using target-specific
> intrinsics to ask for a specific sequence of machine instructions,
> is it really appropriate for the compiler to emit different
> instructions, using its own heuristics?

This same question might come up in the future with inline asm.
Thanks to the MC project I guess it may become feasible to parse
people's inline asm and do optimizations on it. Personally I'm
in favour of that, but indeed there are dangers.

Ciao, Duncan.

> LLVM (via clang) currently translates target intrinsics to generic IR
> whenever it can. For example, on x86 it translates _mm_loadu_pd to a
> simple load instruction with an alignment of 1. The backend is then
> responsible for translating the load back to the corresponding
> machine instruction.
>
> The advantage of this is that it opens up such code to LLVM's
> optimizers, which can theoretically speed it up.
>
> The disadvantage is that it's pretty surprising when intrinsics
> designed for the sole purpose of giving programmers access to specific
> machine instructions are translated to something other than those
> instructions. LLVM's optimizers aren't perfect, and there are many
> aspects of performance which they don't understand, so they can also
> pessimize code.
>
> If the user has gone through the trouble of using target-specific
> intrinsics to ask for a specific sequence of machine instructions,
> is it really appropriate for the compiler to emit different
> instructions, using its own heuristics?

In my personal opinion, this should be controlled via a compiler option.
The default should be to emit the instructions as specified. This should
at least be true at low optimization levels.

> There are several benefits to doing it this way:
>
> 1. Fewer intrinsics in the compiler, fewer patterns in the targets, less redundancy.

I don't view limiting the number of intrinsics in LLVM as a worthwhile
goal unto itself. The fact that specifying intrinsics is currently a
fairly verbose procedure (requiring updates in several different files)
is something that we should fix via a more intelligent TableGen setup.

> 2. The compiler should know better than the user, because code is often written and forgotten about. The compiler can add value when building hand-tuned and highly optimized SSE2 code for an SSE4 chip, for example.

This is a good use case for '-O4': it means that I've asked for
something specific and the compiler may do something else instead. I
think that '-O3' (and below) should make what I've specified as fast as
possible. Since specifying '-O3' is a fairly standard default choice, I
think it should provide the safer behavior.

> 3. If the compiler is pessimizing (e.g.) unaligned loads, then it is a serious bug that should be fixed, not something that should be worked around by adding intrinsics. Adding intrinsics just makes it much less likely that we'd find out about it and then be able to fix it.

> 4. In practice, if we had intrinsics for everything, I strongly suspect that a lot of generic patterns wouldn't get written. This would pessimize "portable" code using standard IR constructs.

We should work on providing a comprehensive set of generic vector
builtins (similar to __builtin_shuffle) to cover other cases that can be
represented in the IR directly (either as-is, or suitably extended).
This has the benefit of working over many different architectures. And
we could make sure that patterns will be written to support these
generic builtins.

-Hal