Generate scalar SSE instructions instead of packed instructions

Hi,

I am interested in evaluating the performance of packed vs. scalar double-precision floating-point instructions on x86 Atom, and I was wondering if anyone knows more precisely where to modify LLVM to use one or the other. I know I probably need to change something in the type legalizer. Could anyone provide more details than that?

Thanks,

Tyler

Hey Tyler,

Nadav is correct. Un-vectorizing would best be done before the IR level.

If you split the vectors at the ISel level, you would incur unnecessary extracts, which would skew the timing data.

To digress a bit, I’ve found that it’s necessary to rewrite the scalar SSE patterns to accept true scalar operands rather than fake vector operands like those of the GNU built-ins. This topic was discussed a while back, and the popular belief is that partial register updates would cause a performance hit when operating on true scalars. However, my empirical evidence suggests that the extra memory traffic of stuffing vectors is more of a performance hit than the partial register updates. Unfortunately, this runs counter to the available documentation, and it may only be true for the benchmarks that hold my interest.

For completeness, I’m mainly interested in Interlagos and Sandy Bridge, so this conjecture may not hold for other processors such as Atom.
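As a rough illustration of the operand-shape distinction above, here is a minimal C sketch. It assumes the GCC/Clang `vector_size` extension; the type name `v2df` and the helper names `add_scalar` and `add_via_vector` are hypothetical and only mimic the shape of a scalar op that takes vector operands. It is not the actual SSE pattern or built-in definition.

```c
/* Assumption: GCC/Clang vector extension; two doubles in one 16-byte value. */
typedef double v2df __attribute__((vector_size(16)));

/* True scalar form: the operands are plain doubles. */
static double add_scalar(double a, double b)
{
    return a + b;
}

/* "Fake vector" form (hypothetical helper): each scalar is first stuffed
 * into lane 0 of a vector, only lane 0 is operated on, and the result is
 * extracted again. The stuffing and extracting are the extra work. */
static double add_via_vector(double a, double b)
{
    v2df va = {a, 0.0};   /* stuff a into lane 0 */
    v2df vb = {b, 0.0};   /* stuff b into lane 0 */
    va[0] = va[0] + vb[0];
    return va[0];         /* extract lane 0 */
}
```

Both helpers compute the same value; whether the stuffing costs more than a partial register update would have to be measured on the target in question.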

Hope this helps,
Cameron

Thanks for the replies, they were very helpful.

Is it enough to prevent BBVectorize from packing together double-precision instructions? If a non-clang frontend is used, such as ISPC, is it possible that the IR may contain packed double instructions?

Tyler

Yes, it is possible that the IR includes packed SSE instructions.

I am not familiar with the ISPC frontend or Atom. But, in the general case, a frontend could be using the SSE intrinsics, which can make use of packed operands. For example:

  def int_x86_sse_min_ps : GCCBuiltin<"__builtin_ia32_minps">,
              Intrinsic<[llvm_v4f32_ty], [llvm_v4f32_ty,
                         llvm_v4f32_ty], [IntrNoMem]>;
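To make the point concrete at the source level, the sketch below shows how packed double operations can reach the compiler without any vectorizer pass being involved. It assumes the GCC/Clang `vector_size` extension; `v2df`, `add_packed`, and `add_scalar2` are hypothetical names chosen for illustration. With Clang, the packed form would typically appear in the IR as an operation on `<2 x double>`, though this has not been checked against any particular version.

```c
/* Assumption: GCC/Clang vector extension; two doubles in one 16-byte value. */
typedef double v2df __attribute__((vector_size(16)));

/* Packed form: a single vector add written directly by the frontend/user. */
static v2df add_packed(v2df a, v2df b)
{
    return a + b;
}

/* Scalar form: the same arithmetic expressed as two independent double adds.
 * Note that an optimizer may still re-pack these unless vectorization is
 * disabled. */
static void add_scalar2(const double a[2], const double b[2], double out[2])
{
    out[0] = a[0] + b[0];
    out[1] = a[1] + b[1];
}
```

For a packed-vs-scalar timing comparison, both forms would need to be checked in the generated assembly, since disabling one vectorizer does not stop code like `add_packed` from producing packed instructions.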

The compiler I work on has a proprietary vectorizer that runs before the LLVM IR level. So, in our case, we have an extended set of proprietary packed intrinsics similar to the GNU SSE built-ins.

-Cameron