SIMD Projects with LLVM

Hi everyone. After lurking for a while, this is my
first post to the list.

I am working with some graduate students on the general
topic of compiler support for SIMD programming and specific
projects related to LLVM and my own Parabix technology
(parabix.costar.sfu.ca).

Right now we have a few course projects on the go and
already a question arising out of one of them (SSE2 Hoisting).
We're not sure how much of this has been tried before, or
even whether it makes sense, but we're eager to learn.

Briefly the projects are:

SSE2 Hoisting: translating programs that directly use SSE2
intrinsics into platform-independent code expressed with LLVM IR.

Long integer support: systematic support for i128, i256, ... targeting
SIMD registers.

Systematic strategies for the shufflevector operation. This
is a very powerful operation that can be used to code for arbitrary
rearrangement of data in SIMD registers. No architecture we
know of supports it in its full generality. But there are
many special cases that are recognized in code generation and
potentially many more that might be.

Systematic support for all power-of-2 field widths with
vector types. For example, we are interested in <64 x i2> being
a legal type with appropriate expansion operations. A student
has made a GSoC submission for this project.
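
As a point of reference for the shufflevector project above, here is a
minimal scalar model in C of what the operation computes for 4-element
i32 vectors: each result lane selects one element from the concatenation
of the two inputs. The name shuffle4_ref and the fixed width of 4 are
illustrative assumptions, and undef mask elements are not modelled.

```c
#include <stdint.h>

/* Hypothetical scalar model of
 *   shufflevector <4 x i32> %a, <4 x i32> %b, <4 x i32> <mask>
 * Mask values 0..3 select from a, 4..7 select from b.
 * Undef mask elements are not modelled in this sketch. */
void shuffle4_ref(const int32_t a[4], const int32_t b[4],
                  const int mask[4], int32_t out[4])
{
    for (int i = 0; i < 4; i++) {
        int m = mask[i];
        out[i] = (m < 4) ? a[m] : b[m - 4];
    }
}
```

A mask of <0, 4, 1, 5> interleaves the low halves of the two inputs,
which is one of the special cases SSE2 handles with a single unpack
instruction (punpckldq).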

The question I have right now actually relates to the i2 type.
In our SSE2 hoisting, we found an issue with the movemask_pd
operation, which extracts the sign bits of the 2 doubles in
a <2 x double> and returns them as an int32. We would
like to use icmp slt as the LLVM IR operation for this,
but we seem to run into a problem when we bitcast the resulting
<2 x i1> vector to i2. We use the following LLVM IR code.

define i32 @signmaskd(<2 x double> %a) alwaysinline #5
{
        %bits = bitcast <2 x double> %a to <2 x i64>
        %b = icmp slt <2 x i64> %bits, zeroinitializer
        %c = bitcast <2 x i1> %b to i2
        %result = zext i2 %c to i32
        ret i32 %result
}

Unfortunately, we only get 1 bit of data out; the assembly language
output seems to confirm that the individual bit extractions take
place, but the second one clobbers the first. We are using the
LLVM 3.4 toolchain.
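
For comparison, the result expected from signmaskd can be written as a
scalar reference in portable C. This is only a sketch of the intended
semantics (bit 0 is the sign of element 0, bit 1 is the sign of
element 1), not the code LLVM generates; the name signmaskd_ref is
hypothetical.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical reference for the intended result of signmaskd:
 * bit 0 = sign bit of a[0], bit 1 = sign bit of a[1]. */
int32_t signmaskd_ref(const double a[2])
{
    uint64_t bits[2];
    /* Reinterpret the doubles as 64-bit integers,
     * mirroring the bitcast to <2 x i64>. */
    memcpy(bits, a, sizeof bits);
    int32_t lane0 = (int32_t)(bits[0] >> 63);  /* sign of element 0 */
    int32_t lane1 = (int32_t)(bits[1] >> 63);  /* sign of element 1 */
    return lane0 | (lane1 << 1);
}
```

With both sign bits set the expected result is 3, which is the case
that a 1-bit result cannot represent.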

There is more detail at the following URL.
http://parabix.costar.sfu.ca/wiki/I2Result

Anyway the question is whether we should just try to treat
this as a bug to be fixed or whether our idea of working with
i2 types is misguided in a more fundamental way.

Hi Rob,

This is a codegen bug. At the moment we don’t support bitcasting, storing or loading vectors with 'illegal' element types that are smaller than i8.

Thanks,
Nadav

Rob,

If you care about SIMD performance, I advise you to generate LLVM IR that has an obvious mapping to a particular instruction set. What i2 is, what a vector of i2 is, and how these should be mapped to an instruction set are all a mystery to me. You should extend i2 to i8/i16/i32, depending on the desired code generation. In the ISPC project, for example, we typically map a vector of booleans to a vector of i32 when targeting the SSE2/4 and AVX1/2 instruction sets.
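
A scalar C model of the representation described above, as a sketch only: each boolean lane is widened to a full 32-bit lane holding all ones or all zeros, roughly what sext <4 x i1> to <4 x i32> produces in LLVM IR. The names lt_mask4 and blend4 are illustrative and are not taken from ISPC.

```c
#include <stdint.h>

/* Lane-wise a < b, with each result widened to a 32-bit mask:
 * 0xFFFFFFFF (-1) where true, 0 where false. */
void lt_mask4(const int32_t a[4], const int32_t b[4], int32_t mask[4])
{
    for (int i = 0; i < 4; i++)
        mask[i] = -(int32_t)(a[i] < b[i]);
}

/* Use the widened mask with plain bitwise operations:
 * take x where the mask is set, y elsewhere. */
void blend4(const int32_t mask[4], const int32_t x[4],
            const int32_t y[4], int32_t out[4])
{
    for (int i = 0; i < 4; i++)
        out[i] = (x[i] & mask[i]) | (y[i] & ~mask[i]);
}
```

Because the mask occupies whole lanes, selection needs only and/or/not, which map directly to SSE2 and AVX instructions.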

Also, representing SIMD registers as i128/i256/i512 is a bad idea IMO, as you are not able to perform element-wise arithmetic operations on them. A vector representation is a much better idea, and you can freely cast between vectors when you need to switch from one element type to another (from <4 x i32> to <4 x float>, for example).

-Dmitry.

Hi, Nadav.

Thanks for this. I am now wondering about the
treatment of i1 and <N x i1>. These are both
fundamental types for LLVM primitives like icmp.

My understanding right now is that i1 is probably
dealt with by "promoting" i1 to i8 in the type
legalizer phase. I haven't yet located where in the codebase
this happens, though.

Another option that occurs to me is to assert that
the types i1 and <N x i1> are legal, and require all
the i1 operations to be implemented. Certainly, this
would be straightforward for all the binary arithmetic
and bitwise logic operations. But perhaps it is
load/store where it becomes problematic. I am
wondering if there exists any discussion of this design
possibility that I might read.

i2 and <N x i2> should probably be treated
similarly (but not identically) to i1. I am guessing that
the codegen bug you mention is related to the promotion
of i2 to i8 in type legalization; I can certainly imagine
that this code has not been as thoroughly tested.