> > Does anyone else have feedback on this?
> >
>
> Hi Aaron,
>
> I've been testing this the last few days and trying to fix some vload and
> vstore bugs on SI. At this point I think the remaining bugs are in
> the LLVM backend, so you can go ahead and commit this patch.
>
Maybe I spoke too soon. The code generated for vstore3 looks wrong:
; Function Attrs: alwaysinline nounwind
define void @_Z7vstore3Dv3_ijPU3AS3i(<3 x i32> %vec, i32 %offset, i32 addrspace(3)* nocapture %mem) #0 {
entry:
  %mul = mul i32 %offset, 3
  %arrayidx = getelementptr inbounds i32 addrspace(3)* %mem, i32 %mul
  %extractVec2 = shufflevector <3 x i32> %vec, <3 x i32> undef, <4 x i32> <i32 0, i32 1, i32 2, i32 undef>
  %storetmp3 = bitcast i32 addrspace(3)* %arrayidx to <4 x i32> addrspace(3)*
  store <4 x i32> %extractVec2, <4 x i32> addrspace(3)* %storetmp3, align 4, !tbaa !1
  ret void
}
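It's storing a vec4 value with the last element undef. This would be legal
if mem were declared as <3 x i32>*, since in OpenCL a vec3 occupies the same
amount of memory as a vec4. However, in this case mem is declared as i32*,
so consecutive vectors are packed three elements apart and the fourth lane
overwrites mem[3 * offset + 3], which belongs to the next vector. I think we
should only be storing three values.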
I'm not sure yet if this is a bug in libclc or LLVM, but I'm looking into it.
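
To illustrate the problem, here is a rough host-side C model of what the generated store does. It is only a sketch, not libclc code: the function name vstore3_as_vec4 and the UNDEF_LANE sentinel are made up for illustration, and the sentinel merely stands in for the undef lane, whose real value is unspecified.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-in for the undef fourth lane in the IR. */
#define UNDEF_LANE 0xDEADBEEFu

/* Models the generated code: a 4-lane store at mem[3 * offset]. */
static void vstore3_as_vec4(const uint32_t vec[3], size_t offset, uint32_t *mem)
{
    assert(mem != NULL);
    uint32_t lanes[4] = { vec[0], vec[1], vec[2], UNDEF_LANE };
    /* 16 bytes are written, although the packed layout only owns 12. */
    memcpy(&mem[3 * offset], lanes, sizeof lanes);
}
```

If vector 1 is stored first and vector 0 afterwards, mem[3] ends up holding the sentinel instead of the first element of vector 1.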
I got it to work with this implementation of vstore3:
  typedef PRIM_TYPE##3 less_aligned_##ADDR_SPACE##PRIM_TYPE##3 __attribute__ ((aligned (sizeof(PRIM_TYPE))));\
  _CLC_OVERLOAD _CLC_DEF void vstore3(PRIM_TYPE##3 vec, size_t offset, ADDR_SPACE PRIM_TYPE *mem) { \
    *((ADDR_SPACE less_aligned_##ADDR_SPACE##PRIM_TYPE##2*) (&mem[3*offset])) = (PRIM_TYPE##2)(vec.s0, vec.s1); \
    mem[3 * offset + 2] = vec.s2;\
  } \
\
This generates the following LLVM IR:
; Function Attrs: alwaysinline nounwind
define void @_Z7vstore3Dv3_ijPU3AS1i(<3 x i32> %vec, i32 %offset, i32 addrspace(1)* nocapture %mem) #0 {
entry:
  %vecinit1 = shufflevector <3 x i32> %vec, <3 x i32> undef, <2 x i32> <i32 0, i32 1>
  %mul = mul i32 %offset, 3
  %0 = sext i32 %mul to i64
  %arrayidx = getelementptr inbounds i32 addrspace(1)* %mem, i64 %0
  %1 = bitcast i32 addrspace(1)* %arrayidx to <2 x i32> addrspace(1)*
  store <2 x i32> %vecinit1, <2 x i32> addrspace(1)* %1, align 4, !tbaa !2
  %2 = extractelement <3 x i32> %vec, i32 2
  %add = add i32 %mul, 2
  %3 = sext i32 %add to i64
  %arrayidx3 = getelementptr inbounds i32 addrspace(1)* %mem, i64 %3
  store i32 %2, i32 addrspace(1)* %arrayidx3, align 4, !tbaa !7
  ret void
}
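
For comparison, the behaviour I'd expect from this IR can be modelled in plain host-side C as a two-element store followed by a scalar store. Again this is only a sketch under my own assumptions (the name vstore3_split is hypothetical, and memcpy stands in for the less-aligned vec2 store):

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Models the split store: lanes 0-1 as one 8-byte store, lane 2 as a scalar. */
static void vstore3_split(const uint32_t vec[3], size_t offset, uint32_t *mem)
{
    assert(mem != NULL);
    /* Two-lane store; only element-size alignment is assumed. */
    memcpy(&mem[3 * offset], vec, 2 * sizeof(uint32_t));
    /* Scalar store of the third lane. */
    mem[3 * offset + 2] = vec[2];
}
```

With this version exactly three elements are written per call, so neighbouring vectors in a packed array are left intact regardless of the order in which they are stored.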
Does this look correct?
-Tom