Questions on LLVM 'vector' types and the resulting SIMD instructions

Suppose I generate IR using 'vector' types; for example, my code assembles IR like this:

    define <4 x float> @simd_mul(<4 x float>, <4 x float>) {
      %3 = fmul <4 x float> %0, %1
      ret <4 x float> %3
    }

I assume that when I JIT, it will generate the best SIMD instructions available on the host it's running on? For example, when running on a machine supporting SSE, it does seem to generate SSE instructions, and this successfully turns into a function callable from C with a signature that looks like

    __m128 simd_mul (__m128 a, __m128 b);
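
For reference, here is roughly how I fetch and call it through MCJIT (EE, a and b are stand-ins for my engine and operand values):

    #include <xmmintrin.h>  // __m128
    #include "llvm/ExecutionEngine/ExecutionEngine.h"

    // EE is the llvm::ExecutionEngine (MCJIT) built over the module above.
    typedef __m128 (*SimdMulFn)(__m128, __m128);
    SimdMulFn fn =
        reinterpret_cast<SimdMulFn>(EE->getFunctionAddress("simd_mul"));
    __m128 r = fn(a, b);  // one call, four float multiplies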

But the vector documentation is a little sketchy, and I am not sure about a few things:

* Will it really autodetect and use the best SIMD available on my machine? (For example, SSE4.2 vs SSE2, etc.?) Is there anything I need to tell the JIT or the ExecutionEngine to make it use a particular instruction set? (The only case I care about is to generate the best code for the host it's currently running on.)

* Is there any difference in vector functionality of old JIT versus MCJIT? (Yes, I know that starting in 3.6, it'll be only MCJIT.)

* What happens if it runs on a machine without SSE? Is using vectors an error, or will it just generate the equivalent scalar code automatically? If it generates scalar code, what is the function signature, as it would appear to be called from a C function, on a machine without __m128?

* What happens to vector types of length not equal to the machine's SIMD length? If I defined a <3 x float>, would it always generate scalar code, or would it pad it to a <4 x float> and generate SSE instructions? Or is it not even allowed?

Thanks, and apologies if I've missed the documentation where all this is spelled out.

Hi Larry,

I'll try to answer a few of your questions, but other folks will know more...

* Will it really autodetect and use the best SIMD available on my machine? (For example, SSE4.2 vs SSE2, etc.?) Is there anything I need to tell the JIT or the ExecutionEngine to make it use a particular instruction set? (The only case I care about is to generate the best code for the host it's currently running on.)

If you don't specify -mfpu/-mcpu, LLVM will try to guess the best it
can. Some archs (x86) are better at that than others (ARM), but it
should never generate bad code (i.e. AVX on an SSE-only machine). At
worst, it'll guess conservatively and generate SSE code on an AVX
machine, but never the other way around.

* Is there any difference in vector functionality of old JIT versus MCJIT? (Yes, I know that starting in 3.6, it'll be only MCJIT.)

I don't think so. Both use the same passes and back-ends, so I'd be
surprised if they did.

As obvious as it sounds, I'd strongly encourage you not to use the
old JIT. Not only has it been deleted for good, but it was never
that good on all architectures, so you'd be stuck with an ageing,
unsupported and possibly broken JIT technology.

* What happens if it runs on a machine without SSE? Is using vectors an error, or will it just generate the equivalent scalar code automatically? If it generates scalar code, what is the function signature, as it would appear to be called from a C function, on a machine without __m128?
* What happens to vector types of length not equal to the machine's SIMD length? If I defined a <3 x float>, would it always generate scalar code, or would it pad it to a <4 x float> and generate SSE instructions? Or is it not even allowed?

The answer to both questions is: it depends.

Obviously, <3 x float> is not a legal type on any machine, so LLVM
tends to either widen it to a larger vector or split it into
multiple vectors, etc. There are IR passes that do all of that,
including scalarization of vector code, but your mileage may vary
across back-ends, since not all of them support everything. Since
you're fiddling with IR and JIT, you should make your choices based
on what each back-end supports.
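
For example, the IR verifier is perfectly happy with the function
below, and on an SSE-capable x86 the type legalizer will normally
widen the <3 x float> operations to <4 x float>, leaving the extra
lane undefined:

    define <3 x float> @mul3(<3 x float> %a, <3 x float> %b) {
      %p = fmul <3 x float> %a, %b
      ret <3 x float> %p
    }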

Back-ends also have a late legalization phase, where they scan the
DAG (after IR lowering) and legalize types (e.g. i64 into i32+i32
on 32-bit archs), so depending on the IR you hand the back-end, it
may know how to legalize some types but not others. Be careful, and,
as usual, if you find any odd behaviour, please report it to the
list or on Bugzilla.
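
To make the i64 example concrete: given something like the function
below, a 32-bit x86 back-end legalizes the 64-bit add into a 32-bit
add plus an add-with-carry (adc) pair:

    define i64 @add64(i64 %a, i64 %b) {
      %sum = add i64 %a, %b
      ret i64 %sum
    }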

cheers,
--renato

> * Will it really autodetect and use the best SIMD available on my machine? (For example, SSE4.2 vs SSE2, etc.?) Is there anything I need to tell the JIT or the ExecutionEngine to make it use a particular instruction set? (The only case I care about is to generate the best code for the host it's currently running on.)

> If you don't specify -mfpu/-mcpu, LLVM will try to guess the best it
> can. Some archs (x86) are better at that than others (ARM), but it
> should never generate bad code (i.e. AVX on an SSE-only machine). At
> worst, it'll guess conservatively and generate SSE code on an AVX
> machine, but never the other way around.

LLVM does provide the facilities to do that, but it's not completely automatic. It's very easy to do, however: the createTargetMachine() method takes a CPU parameter, and there's a utility function, sys::getHostCPUName(), which returns a suitable value for that parameter. There's been discussion about making that easier to specify as the default behavior for the common case where the compilation host and the execution target are the same machine, but there's nothing firm yet.
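
A minimal sketch of what that looks like with the 3.6-era MCJIT API
(in 3.5 and earlier, EngineBuilder took a plain Module* rather than a
unique_ptr; createHostEngine is just an illustrative name):

    #include <memory>
    #include <string>
    #include "llvm/ExecutionEngine/ExecutionEngine.h"
    #include "llvm/ExecutionEngine/MCJIT.h"
    #include "llvm/IR/Module.h"
    #include "llvm/Support/Host.h"
    #include "llvm/Support/TargetSelect.h"

    // Build an MCJIT engine tuned for the host CPU, so the back-end can
    // use the best vector ISA it knows about (SSE4.2, AVX, ...).
    llvm::ExecutionEngine *createHostEngine(std::unique_ptr<llvm::Module> M,
                                            std::string &Err) {
      llvm::InitializeNativeTarget();
      llvm::InitializeNativeTargetAsmPrinter();
      return llvm::EngineBuilder(std::move(M))
          .setErrorStr(&Err)
          .setMCPU(llvm::sys::getHostCPUName())  // e.g. "corei7-avx"
          .create();
    }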

> * Is there any difference in vector functionality of old JIT versus MCJIT? (Yes, I know that starting in 3.6, it'll be only MCJIT.)

> I don't think so. Both use the same passes and back-ends, so I'd be
> surprised if they did.

The old JIT had somewhat spotty support for anything newer than SSSE3. MCJIT should be a strict improvement here.