SIMD trigonometry/logarithms?

Hi everyone,

I was looking at the loop vectorizer code and wondered if there was any current or planned effort to introduce SIMD implementations of the sin/cos/exp/log intrinsics (in particular for the x86-64 backend)?

Cheers,

Dimitri.

From: "Dimitri Tcaciuc" <dtcaciuc@gmail.com>
To: llvmdev@cs.uiuc.edu
Sent: Sunday, January 27, 2013 3:42:42 AM
Subject: [LLVMdev] SIMD trigonometry/logarithms?

Hi everyone,

I was looking at the loop vectorizer code and wondered if there was any current or planned effort to introduce SIMD implementations of the sin/cos/exp/log intrinsics (in particular for the x86-64 backend)?

Ralf Karrenberg had implemented some of these as part of his whole-function vectorization project:
https://github.com/karrenberg/wfv/blob/master/src/utils/nativeSSEMathFunctions.hpp
https://github.com/karrenberg/wfv/blob/master/src/utils/nativeAVXMathFunctions.hpp

Opinions on pulling these into the X86 backend?

-Hal

I’m wondering if it makes sense to instead supply a bc math library. I would think it would be easier to maintain and debug, and should still give you all of the benefits. You could just link with it early in the optimization pipeline to ensure inlining. This may also make it easier to maintain SIMD functions for multiple backends.
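
To make that concrete, here is a minimal sketch of what one entry in such a library's source could look like and how it would be built. The function name, the 4-wide-only shape, and the per-lane placeholder body are all invented for illustration; a real library would use a Pommier-style branch-free SIMD polynomial instead of looping over lanes.

```cpp
#include <cmath>

// clang/GCC vector extension: four packed floats.
typedef float v4sf __attribute__((vector_size(16)));

// Hypothetical 4-wide sine entry point of the .bc math library.
extern "C" v4sf __vmath_sinf4(v4sf x) {
  v4sf r = {0, 0, 0, 0};
  for (int i = 0; i < 4; ++i)   // placeholder body; real code would be branch-free SIMD
    r[i] = std::sin(x[i]);
  return r;
}

// Built once per target, e.g.:
//   clang++ -O2 -emit-llvm -c vmath.cpp -o vmath-x86_64.bc
// and linked into the module early in the optimization pipeline (llvm-link or
// the Linker API) so the definitions are visible to the inliner.
```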

Hi Justin,

I think having .bc math libraries for different backends makes perfect sense! For example, in the case of the NVPTX backend we have the following problem: many math functions are only available as CUDA C++ headers and cannot easily be used from, for instance, a GPU program written in Fortran. On our end we are currently doing exactly what you proposed: generating a math.bc module and then linking it at the IR level with the target application. There is no need for SIMD there, but having a .bc math library would still be very important!

- D.


From: "Dmitry Mikushin" <dmitry@kernelgen.org>
To: "Justin Holewinski" <justin.holewinski@gmail.com>
Cc: "Hal Finkel" <hfinkel@anl.gov>, "LLVM Developers Mailing List" <llvmdev@cs.uiuc.edu>
Sent: Sunday, January 27, 2013 10:19:42 AM
Subject: Re: [LLVMdev] SIMD trigonometry/logarithms?

Hi Justin,

I think having .bc math libraries for different backends makes perfect sense! For example, in the case of the NVPTX backend we have the following problem: many math functions are only available as CUDA C++ headers and cannot easily be used from, for instance, a GPU program written in Fortran. On our end we are currently doing exactly what you proposed: generating a math.bc module and then linking it at the IR level with the target application. There is no need for SIMD there, but having a .bc math library would still be very important!

I agree. I think that, essentially, all we need is some infrastructure for finding standard bc/ll include files (much like clang can add its own internal include directory).
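
As a rough illustration of the linking side, a front end or pass could pull the library in with something like the following (written against today's C++ API, which is not what existed in 2013; the helper name is invented):

```cpp
#include "llvm/IR/Module.h"
#include "llvm/IRReader/IRReader.h"
#include "llvm/Linker/Linker.h"
#include "llvm/Support/SourceMgr.h"

using namespace llvm;

// Parse a .bc/.ll math library found on the standard search path and link it
// into the module being compiled.  Returns true on success.
static bool linkMathLibrary(Module &M, StringRef Path) {
  SMDiagnostic Err;
  std::unique_ptr<Module> Lib = parseIRFile(Path, Err, M.getContext());
  if (!Lib)
    return false;
  // Linker::linkModules returns true on error.
  return !Linker::linkModules(M, std::move(Lib));
}
```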

-Hal

First let me say that I really like the notion of being able to plug .bc libraries into the compiler, and I think that there are many potential uses (e.g. vector saturation operations and the like). But even so, it is important to realize the limitations of this approach.

Generally, implementations of transcendental functions require platform-specific optimizations to get the best performance and accuracy. Additionally, the use cases for SIMD versions of such operations usually do not require as much accuracy as a general math library routine, which implies that if you just perform a blind vectorization of a math library function you will be giving up a lot of potential performance. If speed/accuracy is not an issue for you and you just want *SOMETHING*, I suppose it could work OK.

Something else to consider is the possibility of creating an interface for plugging in system SIMD libraries. Then one could use structural analysis or the like to recognize (let's say) an FFT or a matrix multiplication and just patch in the relevant routine. On OS X you would use Accelerate; on Linux you could use MKL or the like.

Just some thoughts.

Michael

From: "Michael Gottesman" <mgottesman@apple.com>
To: "Hal Finkel" <hfinkel@anl.gov>
Cc: "Dmitry Mikushin" <dmitry@kernelgen.org>, "LLVM Developers Mailing List" <llvmdev@cs.uiuc.edu>
Sent: Sunday, January 27, 2013 9:23:51 PM
Subject: Re: [LLVMdev] SIMD trigonometry/logarithms?

First let me say that I really like the notion of being able to plug .bc libraries into the compiler, and I think that there are many potential uses (e.g. vector saturation operations and the like). But even so, it is important to realize the limitations of this approach.

Generally, implementations of transcendental functions require platform-specific optimizations to get the best performance and accuracy. Additionally, the use cases for SIMD versions of such operations usually do not require as much accuracy as a general math library routine, which implies that if you just perform a blind vectorization of a math library function you will be giving up a lot of potential performance. If speed/accuracy is not an issue for you and you just want *SOMETHING*, I suppose it could work OK.

I imagined that these bc files would contain a mixture of generic IR and target-specific intrinsics. Completely-generic fallback versions also sound like a useful feature to support, but I'm not sure that bc files are the best way to do that (because we'd need to pre-select the supported vector sizes).

Something else to consider is the possibility of creating an interface for plugging in system SIMD libraries. Then one could use structural analysis or the like to recognize (let's say) an FFT or a matrix multiplication and just patch in the relevant routine. On OS X you would use Accelerate; on Linux you could use MKL or the like.

This would also be nice :)

-Hal

I've actually been considering doing the same for our project (using the same reference implementation by Julien Pommier). We use math functions a lot in our code, so any performance gain would be useful.

However, accuracy is quite important in our case, so we would want any implementation to at least match the accuracy of the current one. On x86 (at least in LLVM 2.8, which we currently use), the sin/cos intrinsics end up using the x87 fsin instruction. On Win32, the system-provided sin function also ends up using the same assembly instruction.

Peter N

Hi,

As Peter mentioned, most of the credit for the linked code goes to Julien Pommier - I basically took his code and ran the cpp-backend over it.

A few more comments:
- In the new implementation of WFV (which is not public yet, but is also under the LLVM license), I moved that code to a .ll file, which (as Justin also pointed out) is a lot easier to maintain because the IR format changes much more slowly than LLVM's internal API.
- I also found versions for AVX and NEON based on Julien's; I have the link somewhere if anybody is interested.
- There are quite a few interesting functions still missing, most importantly a vector powf (the implementation I have has precision issues).
- I don't think it would be too bad to keep different versions for different vector widths and architectures. After all, these are meant to provide highly optimized code, so sacrificing some portability and maintainability may be okay.

Best,
Ralf

I think that a bc library could be quite flexible. For example, it should be possible to create a wrapper in bc that calls libsvml functions.
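
For instance (the wrapper name is invented, and __svml_sinf4 is meant to stand for whatever 4-wide sine entry point libsvml actually exports), such a wrapper could be as small as:

```cpp
#include <xmmintrin.h>

// Provided by libsvml at link time.
extern "C" __m128 __svml_sinf4(__m128);

// Wrapper under the name the compiler-side lowering would emit calls to;
// compiled to bitcode with, e.g., clang++ -O2 -emit-llvm -c.
extern "C" __m128 __vmath_sinf4(__m128 x) {
  return __svml_sinf4(x);
}
```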

paul


Additionally, I think this relates to my suggestion to transform library calls into intrinsics.

Early on, you want to canonicalize math library calls into intrinsics. The vectorizers trivially widen the intrinsics, and before codegen you replace the intrinsics with calls.
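
A minimal sketch of that last replacement step, written against a recent LLVM tree (the API in 2013 differed) and using an invented library symbol name:

```cpp
#include "llvm/ADT/STLExtras.h"
#include "llvm/IR/Instructions.h"
#include "llvm/IR/Module.h"

using namespace llvm;

// Before codegen: rewrite calls to the widened sine intrinsic into calls to a
// vector math library routine with the same <4 x float> (<4 x float>) type.
static void lowerVectorSinToLibcall(Module &M) {
  for (Function &F : M)
    for (BasicBlock &BB : F)
      for (Instruction &I : make_early_inc_range(BB)) {
        auto *CI = dyn_cast<CallInst>(&I);
        if (!CI)
          continue;
        Function *Callee = CI->getCalledFunction();
        if (!Callee || Callee->getName() != "llvm.sin.v4f32")
          continue;
        FunctionCallee Lib =
            M.getOrInsertFunction("__vmath_sinf4", Callee->getFunctionType());
        CI->setCalledFunction(Lib);
      }
}
```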

If you insert calls to the bc math library too early, you won't benefit from intrinsic-aware optimizations.

Basically I think what you need is a module that provides intrinsic implementations for various vector widths.

paul


Hi all.
In fact, this is how we have implemented it in our compiler (Intel's OpenCL).
We have created a .bc file for every architecture. Each file contains all the SIMD versions of the functions to be vectorized.
To cope with the massive amount of code to be produced, we implemented a dedicated tblgen backend (BE) for that purpose.
We are willing to share that code with the LLVM community if there is mutual interest.
All the best, Elior

From: "Elior Malul" <elior.malul@intel.com>
To: "Michael Gottesman" <mgottesman@apple.com>, "Hal Finkel" <hfinkel@anl.gov>
Cc: "LLVM Developers Mailing List" <llvmdev@cs.uiuc.edu>
Sent: Thursday, February 14, 2013 8:33:42 AM
Subject: RE: [LLVMdev] SIMD trigonometry/logarithms?

Hi all.
In fact, this is how we have implemented it in our compiler (Intel's OpenCL).
We have created a .bc file for every architecture. Each file contains all the SIMD versions of the functions to be vectorized.
To cope with the massive amount of code to be produced, we implemented a dedicated tblgen backend (BE) for that purpose.

This sounds interesting. What exactly does the TableGen backend do? Is it for generating architecture -> file-name mappings? Does it drive the .bc file generation somehow?

-Hal

FWIW, llvmpipe has code (built on the LLVM C bindings) that generates (inlines) log/exp/pow/sin/cos functions on the fly, for an arbitrary number of elements (in most cases), and uses SSE/AVX intrinsics as available:

http://cgit.freedesktop.org/mesa/mesa/tree/src/gallium/auxiliary/gallivm/lp_bld_arit.c

The precision is 20 bits, which is enough for 3D. But it wouldn't be difficult to get more (or even variable) precision by using higher-degree approximating functions where appropriate.

The code is under a permissive license.

Jose

Hi,

Looks interesting, but the codebase appears to only handle SSE*/AVX (no ARM NEON, which obviously makes me sad, or other architectures). It might be useful if some of the tests that are "optimized" into things like "has_avx, because that implies we're capable of 256-bit vectorization" were deoptimized to first test in generic terms (e.g. available vector width) and only checked for a specific instruction set at the point where the string for an actual intrinsic is needed.

Cheers,

Dave

The tblgen BE generates LLVM IR, which is then passed through llvm-as to produce the final .bc code.
As for the architecture mapping, the answer is yes: the BE generates the file name to match the target arch.

Our design principle was for the BE itself to be arch-independent on the one hand, and for the generated code to be arch-specific on the other, so that intrinsics could be incorporated into it whenever needed.

All the best, Elior