[Discussion] Portable SIMD programming using LLVM?

Hello all,

I am currently working on building a portable SIMD programming model(SIMD within a register) for various platforms, like x86, ARM and so on. I know that LLVM supports SIMD programming on many architectures, so I am thinking of using LLVM to help build this model but I not quite sure whether LLVM can help.

The following describes the details of the model and also my current approach for this model.

Background
Almost modern processor families support SIMD instruction sets but the instruction set designs for each platform have different combinations of operations. The portable SIMD here is to make an uniform system of SIMD operations at all power-of-2 field widths.
For example, for simd_add on SSE2, I want to have all the following operations supported, simd<2^x>::add(a, b) for 2<=2^x<=register_size(register_size equals 128 in SSE2). However, SSE2 only supports simd_add on field widths of 8, 16, 32, or 64. Hence, simulation for simd_add on field widths of 2, 4 and 128 is needed. At mean time, I want the simulation to be as good as possible.

An iteration Approach
My current approach of getting the best simulation is a strategy-based iteration algorithm. For example, simd<4>::add can be implemented based on simd<8>::add in the following manner.

===strategy_add_doubling===

  1. hiMask = simd<8>::himask() //this is a constant with high 4 bits of each 8-bits field are 1111 and low 4 bits of each 8-bits field are 0000.
  2. loMask = simd<8>::lowmask() //this is a constant with high 4 bits of each 8-bits field are 0000 and low 4 bits of each 8-bits field are 1111.
  3. c = simd<8>::add(simd_and(loMask, a), simd_and(loMask, b))
  4. d = simd<8>::add(simd_and(hiMask, a), simd_and(hiMask, b))
  5. return simd_or(d, simd_and(loMask, c)) // get the answer for simd<4>::add(a, b)

Also, it is not hard to see that simd::add can be implemented based on simd<fw/2>::add. The key is getting the correct carry information and constructing that carry mask.
===strategy_add_halving===

  1. c = simd<fw/2>::add(a, b)
  2. get the carry mask d //the mask should look like …00010000…00000000…00010000…
  3. return simd<fw/2>::add(d, c)

As you can see, for a certain simd operation, there are many ways/strategies to simulate it, like simd::add, it can be simulated based on simd<2*fw>::add and also based on simd<fw/2>::add.

Then, we can construct as many strategies as possible for each simd operation we wanna support. Based on all the strategies we have, an iteration algorithm can be performed to get the best simulation for each simd operation. The criteria here for evaulation is the number of instructions used in a simulation, so the best simulation will have the least number of instructions.

Can I use LLVM to improve the above approach or even get a new approach?

Since LLVM supports various architectures and captures the system resources very well, I am wondering whether LLVM could be used to help the project. I went through the LLVM documentation, but I am still stucking here on how to incorporate LLVM into my approach. Below are some of my thoughts.

  1. The strategy-based approach doesn’t capture the target resources very well in terms of CPU cycles for operations, instruction selection, etc. There might be a case that two different simulations for a certain operation have the same number of instructions, the strategy-based approach can’t handle this case very well(It just randomly picks one).
  2. The strategies are written by human, so they are not carefully tuned to achieve best performance. There are quite a lot of room for improving the performance.
  3. Can LLVM be incorporated into the strategy-based approach so that LLVM provides additional information for getting the best simulation?
  4. Instead of using the strategy-based approach, can we utilize LLVM to get a new approach to achieve the goal?

I know that LLVM has auto-vectorization and can generate SIMD instructions for scalar codes. But I don’t know the details how LLVM captures the target resources and generates good instructions especially the SIMD instructions on different platforms.

Any suggestions or comments are appreciated.

I look forward to hearing from you.

Thanks a lot,
Kevin

Hi Kevin,

*Background*
Almost modern processor families support SIMD instruction sets but the
instruction set designs for each platform have different combinations of
operations. The portable SIMD here is to make an uniform system of SIMD
operations at all power-of-2 field widths.
For example, for simd_add on SSE2, I want to have all the following operations
supported, simd<2^x>::add(a, b) for 2<=2^x<=register_size(register_size equals
128 in SSE2). However, SSE2 only supports simd_add on field widths of 8, 16, 32,
or 64. Hence, simulation for simd_add on field widths of 2, 4 and 128 is needed.
*At mean time, I want the simulation to be as good as possible*.

LLVM already does this: if you use a vector type that is not natively supported
then operations on it are simulated using either vector operations in another
vector type or scalar operations if no suitable vector type exists.

Ciao, Duncan.

Right. LLVM supports the transformation from unsupported vector type to supported vector type or even scalar types.
According to the LLVM’s documentation, LLVM has two ways of doing the transformation, one is promoting the small vector type to larger vector type, the other is breaking up the large vector type into smaller ones.
But I am wondering, how good the transformation will be in LLVM’s implementation.

In my approach, I use strategies to capture all possible transformation. The transformation are not only just promoting and breaking up, but also other forms of transformation, like the transformation between vector types with same size.
My goal is to make sure the simulation to be as good as possible in terms of the number of instructions, CPU latency, etc. If LLVM has information about the latency information of supported architectures, maybe I can use them to optimize my approach.

Does LLVM have the latency information(such as CPU cycles) of SIMD instructions for each supported architecture?

Thanks.

Hi Kevin,

Right. LLVM supports the transformation from unsupported vector type to
supported vector type or even scalar types.
According to the LLVM's documentation, LLVM has two ways of doing the
transformation, one is promoting the small vector type to larger vector type,
the other is breaking up the large vector type into smaller ones.
But I am wondering, how good the transformation will be in LLVM's implementation.

Nadav is looking into implementing a wider range of transforms.

In my approach, I use strategies to capture all possible transformation. The
transformation are not only just promoting and breaking up, but also other forms
of transformation, like the transformation between vector types with same size.
My goal is to make sure the simulation to be as good as possible in terms of the
number of instructions, CPU latency, etc. If LLVM has information about the
latency information of supported architectures, maybe I can use them to optimize
my approach.

Does LLVM have the latency information(such as CPU cycles) of SIMD instructions
for each supported architecture?

I don't know.

Ciao, Duncan.