Bug with vectorization of transcendental functions

Hello all,

I am trying to optimize an inner loop in C++ in which nearly 100% of my runtime occurs. In this loop I have to make some exponential calls, and according to the docs at http://llvm.org/docs/Vectorizers.html#vectorization-of-function-calls , it seems the exp function should be vectorizable using SIMD instructions. I'm using the tip of SVN clang.

The example below shows that the floor function gets vectorized, but the exp function doesn't. Is this a bug or known behavior? I see the same problem if I use floats, for what it's worth.

By the way, there is a problem with the auto-vectorization docs: they don't make clear that vectorization is only enabled at -O3. It took me an hour to figure out why my -Rpass=loop-vectorize wasn't giving ANY output.

My code (loop_test2.cpp):

#include <cmath>
#include <vector>

typedef double numtype;

void foo(numtype *f) {
for (int i = 0; i != 1024; ++i)
f[i] = floor(f[i]);
}

void foo2(numtype *f) {
for (int i = 0; i != 1024; ++i)
f[i] = exp(f[i]);
}

int main()
{
std::vector<numtype> f(1024, 1.3);
foo(&f[0]);
return 0;
}

Compilation:
…/…/…/build/Debug+Asserts/bin/clang++ -c -Rpass=loop-vectorize -Rpass-analysis=loop-vectorize -O3 -gline-tables-only -gcolumn-info loop_test2.cpp

which yields

loop_test2.cpp:7:8: remark: unrolled with interleaving factor 2 (vectorization
not beneficial) [-Rpass=loop-vectorize]
for (int i = 0; i != 1024; ++i)
^
loop_test2.cpp:13:12: remark: loop not vectorized: call instruction cannot be
vectorized [-Rpass-analysis=loop-vectorize]
f[i] = exp(f[i]);
^
loop_test2.cpp:18:3: remark: vectorized loop (vectorization factor: 2, unrolling
interleave factor: 2) [-Rpass=loop-vectorize]
std::vector<numtype> f(1024, 1.3);
^
loop_test2.cpp:8:12: remark: unrolled with interleaving factor 2 (vectorization
not beneficial) [-Rpass=loop-vectorize]
f[i] = floor(f[i]);
^

Thanks,

Ian

> I am trying to optimize an inner loop in C++ in which nearly 100% of my
> runtime occurs. In this loop I have to make some exponential calls, and
> according to the docs at
> http://llvm.org/docs/Vectorizers.html#vectorization-of-function-calls ,
> it seems the exp function should be vectorizable using SIMD
> instructions. I'm using the tip of SVN clang.

To my knowledge, Clang does not vectorize arbitrary math functions; it only
vectorizes those functions for which a SIMD machine instruction exists,
such as floor.
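Erik's explanation can be checked directly by looking at the IR clang emits. A possible invocation (a sketch; the exact intrinsic names and whether a SIMD lowering exists depend on your build and target):

```shell
# Emit LLVM IR for the example. In the floor loop clang selects the
# llvm.floor.* intrinsic, which lowers to a SIMD instruction on targets
# that have one (e.g. SSE4.1 roundpd), while exp remains an opaque scalar
# libm call that the loop vectorizer cannot widen.
clang++ -O3 -S -emit-llvm loop_test2.cpp -o loop_test2.ll
grep -E 'llvm\.floor|@exp' loop_test2.ll
```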

There are libraries that provide vectorized exp() functions; you could try
using one of these. However, these libraries require that you rewrite your
code to use vectors instead of scalars. I maintain such a library
"vecmathlib" <https://bitbucket.org/eschnett/vecmathlib/wiki/Home> that is
used in pocl <http://pocl.sourceforge.net/>, an OpenCL implementation based
on LLVM.

-erik

> The example here shows that the floor function is getting vectorized, but
> the exp function isn't.

Perhaps the docs should be updated to remove exp? Are there architectures with a SIMD machine instruction for exp()? A link to which architectures support which SIMD instructions would also be swell; it's hard to know a priori what will or will not get vectorized. Though clang does at least tell you when it won't be able to vectorize.

That's a nice idea with vecmathlib, but for my application I need something portable that compiles under gcc/clang/msvc across a wide range of compiler versions and doesn't require too much hacking. That said, I did look at it before, but the docs are quite sparse. The example shows what not to do; it would be great if you could come up with an example of best practices. If you did, I might well use your library. What's the expected speedup for AVX, for instance? How about this as an example: take a std::vector<double> x and calculate exp(x). I think that would be a seductive application, and not just for me.

Something like gcc's -mveclibabi could avoid rewriting code with explicit vectors.
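For completeness, a possible gcc invocation of that flag (assuming an x86 target and that the chosen vector math library ABI, e.g. SVML, is actually available on the system):

```shell
# -mveclibabi tells gcc's vectorizer it may call out to an external
# vector math library (SVML or ACML ABI); it only takes effect together
# with unsafe math optimizations such as -ffast-math.
g++ -O3 -ffast-math -mveclibabi=svml -c loop_test2.cpp
```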

I believe some extensions like cilk were also supposed to offer something in that direction: annotate a function to generate a vector version (in addition to the scalar one) or advertise that a vector version is available.

Hi Ian,

> That's a nice idea with vecmathlib, but for my application I need
> something that is portable, compiles in gcc/clang/msvc for a wide range
> of compilers and doesn't require too much hacking.

Are you aware of Scout? scout.zih.tu-dresden.de
It is source-to-source and thus widely compatible (and configurable for SSE, AVX, and so on). However, currently it vectorizes C code only.
Regarding exp(), Intel provides _mm_exp_ps and the like. Back then (2-3 years ago), using that intrinsic slowed down the programs in my experiments.

Best regards
Olaf

You just cannot give a general answer to that question, not even an estimate. The only thing you can state for sure is that vectorization shifts performance from being compute-bound toward being memory-bound. So if your algorithm is already memory-bound, you will not see any effect at all (in rare circumstances it might even decrease performance). In other cases (e.g. reductions without any memory footprint) you might get practical speedups above the theoretical maximum (e.g. more than 8x for float on AVX).
It simply depends on too many factors to state expected speedups in a best-practices guide for a general-purpose library.

Best Olaf

From: "Olaf Krzikalla" <olaf.krzikalla@tu-dresden.de>
To: cfe-dev@cs.uiuc.edu
Sent: Thursday, August 21, 2014 8:29:56 AM
Subject: Re: [cfe-dev] Bug with vectorization of transcendental functions

Hi Ian,

> That's a nice idea with vecmathlib, but for my application I need
> something that is portable, compiles in gcc/clang/msvc for a wide
> range
> of compilers and doesn't require too much hacking.
> Are you aware of Scout? scout.zih.tu-dresden.de
> It is source-to-source thus widely compatible (and configurable for
> SSE, AVX aso.). However currently it vectorizes C code only.
> Regarding exp() Intel provides _mm_exp_ps and the like. Back then
> (2-3 years ago) using that intrinsic slowed down programs in my
> experiments.

FWIW, I've made use of the SLEEF library (http://shibatch.sourceforge.net/) for this (for automated vectorizer targeting). If there is interest in having some kind of vector math function targeting in the upstream LLVM/Clang, then we'll need to think seriously about exactly what to target and how to distribute any required library.

-Hal