[PATCH] Fix vload3/vstore3 to emit only one IR load

I found out from here how to finally do this correctly:
http://lists.llvm.org/pipermail/llvm-commits/Week-of-Mon-20150921/301818.html

You can combine ext_vector_type and the aligned to get a load of the
right vector type with the correct alignment. With this, the IR
for vload3 looks like:

     define <3 x i32> @81(i32 %offset, i32* nocapture readonly %x) #0 {
     entry:
       %mul = mul i32 %offset, 3
       %arrayidx = getelementptr inbounds i32, i32* %x, i32 %mul
       %castToVec4 = bitcast i32* %arrayidx to <4 x i32>*
       %loadVec4 = load <4 x i32>, <4 x i32>* %castToVec4, align 4
       %extractVec = shufflevector <4 x i32> %loadVec4, <4 x i32> %undef, <3 x i32> <i32 0, i32 1, i32 2>
       ret <3 x i32> %extractVec
     }

The load of <4 x i32> instead of <3 x i32> is somewhat surprising to me,
but this is much better than the previous mess from doing a load of
the first 2 components, a separate load of the 3rd and a sequence
to recombine them.

Old:

   define <3 x i32> @81(i32 %offset, i32* nocapture readonly %x) #0 {
   entry:
     %mul = mul i32 %offset, 3
     %arrayidx = getelementptr inbounds i32, i32* %x, i32 %mul
     %0 = bitcast i32* %arrayidx to <2 x i32>*
     %1 = load <2 x i32>, <2 x i32>* %0, align 4, !tbaa !1
     %2 = extractelement <2 x i32> %1, i32 0
     %3 = insertelement <3 x i32> undef, i32 %2, i32 0
     %4 = extractelement <2 x i32> %1, i32 1
     %5 = insertelement <3 x i32> %3, i32 %4, i32 1
     %add = add i32 %mul, 2
     %arrayidx3 = getelementptr inbounds i32, i32* %x, i32 %add
     %6 = load i32, i32* %arrayidx3, align 4, !tbaa !6
     %7 = insertelement <3 x i32> %5, i32 %6, i32 2
     ret <3 x i32> %7
   }

0001-Fix-vload3-vstore3-to-emit-only-one-IR-load.patch (5.54 KB)

Hi Matt,

The IR below seem fishy to me: if we have

vload3(get_global_id(0), A)

then the work item with the highest id is likely to access an element out of bounds of the array being passed in.

Also, does the store generate a store of 4 elements, or will that be precisely be 3 elements?

Jeroen

The store is also emitted as a <4 x i32>. I'm not sure why clang is avoiding direct load/store of 3 vectors, but this seems like a clang bug to me.

Right, then something like:

__kernel foo(__global int *A) {
  int3 tmp = vload3(get_global_id(0), A);
  tmp += 1;
  vstore3(tmp, get_global_id(0), A);
}

will have a data race, e.g., on the 4th element of A. And reading the spec, I think it should be data race free "vload3 and vload_half3 read x, y, z components from address (p + (offset * 3))”.

The reason why the load and store are turned in to 4 element loads and stores with your new code, is because something like:

int3 B[n];

will have an element of padding after each 3rd element. Hence, you’re just reading or writing an additional undefined value, which is not overlapping with one of the next element of the array (so no data race), and the reasoning is probably that a 4 element load or store is likely to be more efficient than a 3-element one. See also Section 6.1.5
"Alignment of Types" in the OpenCL 1.2 spec.

Jeroen

The code that does this was added in r162002: "Convert loads and stores of vec3 to vec4 to achieve better code generation. Add test case."

I don't think this is really something the frontend should be doing, but I guess if you can assume a 3 vector type pointer always points to the rounded up size, I guess it is correct.

I don't think there's any other solution than to write the vload3/vstore3 directly in IR then without adding some other kind of clang extension.

Hi Matt,

I generally think it makes sense to perform some optimisations in a compiler front-end (because it has certain information that later phases might not have). However, I appreciate that there are different schools of thought here.

Regarding the problem at hand: indeed the best approach might be to encode vload3/store3 in IR (unfortunately).

Best,

Jeroen

Yes, probably.