[PATCH] relational: Implement signbit

v2 Changes:
   - use __builtin_signbit instead of shifting by hand
   - significantly improve vector shuffling
   - Works correctly now for signbit(float16) on radeonsi

Signed-off-by: Aaron Watry <awatry@gmail.com>

FYI: I've introduced the _CLC_DEFINE_RELATIONAL_UNARY* macros here and
immediately undef'd them after use. I'm planning to move this set of
macros to clcmacro.h and then introduce a 2-arg BINARY version as well,
for things like isgreaterequal and friends.

By using the _CLC_DEFINE_RELATIONAL_BINARY* macros just mentioned, I'm
able to cut the isgreaterequal LLVM assembly to about a third of its
previous size (e.g. a maximum variable identifier of %41 vs. %138 in
the final dump for isgreaterequal(float16, float16)). signbit shrank
by roughly the same amount.

Yes, we should probably fix the __builtin_* scalar functions to accept
vector inputs... I don't have the necessary time at the moment to do
that, sadly.

Should I continue going down this road for now? The code as-is works
correctly on radeonsi; it just might be leaving some performance on
the table. Functional now, fast later? Or do we hold off on
implementing these built-ins until clang can handle them optimally
(and wait for distros to pick up the new version)? You can probably
tell which option I prefer.

--Aaron

LGTM.

Committed.

I'm going to send v2 of the other 3 patches of this series after I
whip up the unary/binary relational macro changes to simplify the
implementations.

--Aaron

Hi,

I hadn’t looked at the patch before, but I’m wondering if it does the right thing. Consider the following kernel (I’m compiling for the NVPTX target with LLVM 3.4):

#define id (get_group_id(0) * get_local_size(0) + get_local_id(0))

__kernel void foo(__global float* p, __global float3 *q, __global int3 *n)
{
  n[get_global_id(0)] = signbit(q[get_global_id(0)]);
}

I’m getting the following bit code:

define void @foo(float addrspace(1)* %p, <3 x float> addrspace(1)* %q, <3 x i32> addrspace(1)* %n) #2 {
  call void @llvm.dbg.value(metadata !{float addrspace(1)* %p}, i64 0, metadata !41), !dbg !42
  call void @llvm.dbg.value(metadata !{<3 x float> addrspace(1)* %q}, i64 0, metadata !43), !dbg !44
  call void @llvm.dbg.value(metadata !{<3 x i32> addrspace(1)* %n}, i64 0, metadata !45), !dbg !46
  %1 = call i32 @get_group_id(i32 0) #5
  %2 = call i32 @get_local_size(i32 0) #5
  %3 = mul i32 %2, %1
  %4 = call i32 @get_local_id(i32 0) #5
  %5 = add i32 %3, %4
  %6 = getelementptr inbounds <3 x float> addrspace(1)* %q, i32 %5, !dbg !47
  %7 = bitcast <3 x float> addrspace(1)* %6 to <4 x float> addrspace(1)*, !dbg !47
  %8 = load <4 x float> addrspace(1)* %7, !dbg !47
  %9 = shufflevector <4 x float> %8, <4 x float> undef, <3 x i32> <i32 0, i32 1, i32 2>, !dbg !47
  %10 = extractelement <3 x float> %9, i32 2
  %11 = bitcast float %10 to i32
  %.lobit.i.i = lshr i32 %11, 31
  %12 = insertelement <3 x i32> undef, i32 %.lobit.i.i, i32 0
  %13 = shufflevector <3 x i32> %12, <3 x i32> undef, <3 x i32> zeroinitializer
  %14 = icmp ne <3 x i32> %13, zeroinitializer
  %15 = sext <3 x i1> %14 to <3 x i32>
  %16 = call i32 @get_group_id(i32 0) #5
  %17 = call i32 @get_local_size(i32 0) #5
  %18 = mul i32 %17, %16
  %19 = call i32 @get_local_id(i32 0) #5
  %20 = add i32 %18, %19
  %21 = getelementptr inbounds <3 x i32> addrspace(1)* %n, i32 %20, !dbg !49
  %22 = shufflevector <3 x i32> %15, <3 x i32> undef, <4 x i32> <i32 0, i32 1, i32 2, i32 undef>, !dbg !49
  %23 = bitcast <3 x i32> addrspace(1)* %21 to <4 x i32> addrspace(1)*, !dbg !49
  store <4 x i32> %22, <4 x i32> addrspace(1)* %23, align 16, !dbg !49
  ret void, !dbg !50
}

The builtin is apparently turned into an lshr. However, I see only one of these, not the three I was expecting, since I’m working with vectors of length 3. I get the impression that each vector element should be handled separately.

There also seems to be a deeper problem with this bitcode: the vector of length 3 is written as a vector of length 4. Hence, although the kernel is data-race free, the bitcode isn’t if the vectors are tightly packed in the array, which seems to be the case given the way the getelementptr is used.

Yeah, I screwed that up, and the piglit tests didn't catch it because
the auto-generated tests use the same input value for all vector
elements.

In the definition of _CLC_DEFINE_RELATIONAL_UNARY_VEC3:
return (RET_TYPE)((FUNCTION(x.s0), FUNCTION(x.s1), FUNCTION(x.s2)) !=
(RET_TYPE)0); \

Should be:
return (RET_TYPE)( (RET_TYPE){FUNCTION(x.s0), FUNCTION(x.s1),
FUNCTION(x.s2)} != (RET_TYPE)0); \

Big difference between parentheses around the vector components and braces.

This was messed up for float3, float4, float8, and float16, but float2
was correct.

With the above change and the following kernel:
kernel void test_3_signbit_float(global int* out, global float* in0){
    vstore3(signbit(vload3(0, in0)), 0, out);
}

I get the following bitcode:
define void @test_3_signbit_float(i32 addrspace(1)* nocapture %out,
float addrspace(1)* nocapture readonly %in0) #0 {
  %1 = load float addrspace(1)* %in0, align 4, !tbaa !2
  %2 = getelementptr inbounds float addrspace(1)* %in0, i32 1
  %3 = load float addrspace(1)* %2, align 4, !tbaa !2
  %4 = getelementptr inbounds float addrspace(1)* %in0, i32 2
  %5 = load float addrspace(1)* %4, align 4, !tbaa !2
  %6 = bitcast float %1 to i32
  %.lobit.i.i = lshr i32 %6, 31
  %7 = insertelement <3 x i32> undef, i32 %.lobit.i.i, i32 0
  %8 = bitcast float %3 to i32
  %.lobit.i3.i = lshr i32 %8, 31
  %9 = insertelement <3 x i32> %7, i32 %.lobit.i3.i, i32 1
  %10 = bitcast float %5 to i32
  %.lobit.i2.i = lshr i32 %10, 31
  %11 = insertelement <3 x i32> %9, i32 %.lobit.i2.i, i32 2
  %12 = icmp ne <3 x i32> %11, zeroinitializer
  %13 = sext <3 x i1> %12 to <3 x i32>
  %14 = extractelement <3 x i32> %13, i32 0
  store i32 %14, i32 addrspace(1)* %out, align 4, !tbaa !6
  %15 = extractelement <3 x i32> %13, i32 1
  %16 = getelementptr inbounds i32 addrspace(1)* %out, i32 1
  store i32 %15, i32 addrspace(1)* %16, align 4, !tbaa !6
  %17 = extractelement <3 x i32> %13, i32 2
  %18 = getelementptr inbounds i32 addrspace(1)* %out, i32 2
  store i32 %17, i32 addrspace(1)* %18, align 4, !tbaa !6
  ret void
}

Does that look better?

--Aaron

<snip>

There also seems to be a deeper problem with this bitcode: the vector of length 3 is written as a vector of length 4. Hence, although the kernel is data-race free, the bitcode isn’t if the vectors are tightly packed in the array, which seems to be the case given the way the getelementptr is used.

Forget about this. It turned out to be an error in a tool chain of ours that builds on libclc: it doesn’t take into account the alignment requirement on vectors of length 3 stated in Section 6.1.5 of the spec.

Jeroen