[PATCH] vload/vstore: Use casts instead of scalarizing everything in CLC version

Does anyone else have feedback on this?

I'd like to finally put this saga to rest and get this committed, but I haven't actually gotten a reviewed-by or tested-by yet.

If I don't get any more feedback on it before then, I will probably push it on Monday.

--Aaron

Does anyone else have feedback on this?

Hi Aaron,

I've been testing this the last few days and trying to fix some vload and
vstore bugs on SI. At this point I think the remaining bugs are in
the LLVM backend, so you can go ahead and commit this patch.

-Tom

> Does anyone else have feedback on this?
>

Hi Aaron,

I've been testing this the last few days and trying to fix some vload and
vstore bugs on SI. At this point I think the remaining bugs are in
the LLVM backend, so you can go ahead and commit this patch.

Maybe I spoke too soon, the code generated for vstore3 looks wrong:

; Function Attrs: alwaysinline nounwind
define void @_Z7vstore3Dv3_ijPU3AS3i(<3 x i32> %vec, i32 %offset, i32 addrspace(3)* nocapture %mem) #0 {
entry:
  %mul = mul i32 %offset, 3
  %arrayidx = getelementptr inbounds i32 addrspace(3)* %mem, i32 %mul
  %extractVec2 = shufflevector <3 x i32> %vec, <3 x i32> undef, <4 x i32> <i32 0, i32 1, i32 2, i32 undef>
  %storetmp3 = bitcast i32 addrspace(3)* %arrayidx to <4 x i32> addrspace(3)*
  store <4 x i32> %extractVec2, <4 x i32> addrspace(3)* %storetmp3, align 4, !tbaa !1
  ret void
}

It's storing a vec4 value with the last element undef. This would be legal
if mem were declared as <3 x i32>*, since in OpenCL vec3 occupy the same
amount of memory as vec4. However, in this case, since mem is declared
as i32*, I think we should only be storing three values.

I'm not sure yet if this is a bug in libclc or LLVM, but I'm looking into it.

-Tom

> > Does anyone else have feedback on this?
> >
>
> Hi Aaron,
>
> I've been testing this the last few days and trying to fix some vload and
> vstore bugs on SI. At this point I think the remaining bugs are in
> the LLVM backend, so you can go ahead and commit this patch.
>

Maybe I spoke too soon, the code generated for vstore3 looks wrong:

; Function Attrs: alwaysinline nounwind
define void @_Z7vstore3Dv3_ijPU3AS3i(<3 x i32> %vec, i32 %offset, i32 addrspace(3)* nocapture %mem) #0 {
entry:
  %mul = mul i32 %offset, 3
  %arrayidx = getelementptr inbounds i32 addrspace(3)* %mem, i32 %mul
  %extractVec2 = shufflevector <3 x i32> %vec, <3 x i32> undef, <4 x i32> <i32 0, i32 1, i32 2, i32 undef>
  %storetmp3 = bitcast i32 addrspace(3)* %arrayidx to <4 x i32> addrspace(3)*
  store <4 x i32> %extractVec2, <4 x i32> addrspace(3)* %storetmp3, align 4, !tbaa !1
  ret void
}

It's storing a vec4 value with the last element undef. This would be legal
if mem were declared as <3 x i32>*, since in OpenCL vec3 occupy the same
amount of memory as vec4. However, in this case, since mem is declared
as i32*, I think we should only be storing three values.

I'm not sure yet if this is a bug in libclc or LLVM, but I'm looking into it.

I got it to work with this implementation of vstore3:

typedef PRIM_TYPE##3 less_aligned_##ADDR_SPACE##PRIM_TYPE##3 __attribute__ ((aligned (sizeof(PRIM_TYPE))));\
_CLC_OVERLOAD _CLC_DEF void vstore3(PRIM_TYPE##3 vec, size_t offset, ADDR_SPACE PRIM_TYPE *mem) { \
  *((ADDR_SPACE less_aligned_##ADDR_SPACE##PRIM_TYPE##2*) (&mem[3*offset])) = (PRIM_TYPE##2)(vec.s0, vec.s1); \
  mem[3 * offset + 2] = vec.s2;\
} \
\

Which generates the following LLVM IR:

; Function Attrs: alwaysinline nounwind
define void @_Z7vstore3Dv3_ijPU3AS1i(<3 x i32> %vec, i32 %offset, i32 addrspace(1)* nocapture %mem) #0 {
entry:
  %vecinit1 = shufflevector <3 x i32> %vec, <3 x i32> undef, <2 x i32> <i32 0, i32 1>
  %mul = mul i32 %offset, 3
  %0 = sext i32 %mul to i64
  %arrayidx = getelementptr inbounds i32 addrspace(1)* %mem, i64 %0
  %1 = bitcast i32 addrspace(1)* %arrayidx to <2 x i32> addrspace(1)*
  store <2 x i32> %vecinit1, <2 x i32> addrspace(1)* %1, align 4, !tbaa !2
  %2 = extractelement <3 x i32> %vec, i32 2
  %add = add i32 %mul, 2
  %3 = sext i32 %add to i64
  %arrayidx3 = getelementptr inbounds i32 addrspace(1)* %mem, i64 %3
  store i32 %2, i32 addrspace(1)* %arrayidx3, align 4, !tbaa !7
  ret void
}

Does this look correct?

-Tom

It looks correct, but it really should be a single store with the align 4. Does the attribute aligned thing not emit the right thing from clang?

I don't think it's possible to implement a single store version of vec3 using
OpenCL C, because if you cast a pointer as a vec3 type, clang will will
try to store a vec4 value to it, because sizeof(vec3) == sizeof(vec4) in
memory.

-Tom

I guess it would be appropriate to write this one directly in IR then

Actually that kind of seems like a clang bug. It should be emitting the load on the actual type with whatever alignment, not the type the rounded size happens to be

What clang is doing is legal as far as I can tell, it just is inefficient,
so I think we'll still need an IR version.

The problem with an IR version, though, is that it has to be target
specific, since the address spaces are different for different targets.

My recommendation is to commit the version I suggested which works,
and then we can figure out how to optimize it with IR in a follow
up commit.

-Tom

One final question which I've got, which I believe I know the answer
to, but I'd rather ask than have to do an extra revision:

Do we want to do the same thing (vec2+scalar load) for vload3?

From looking at the generated bitcode for char3 vload3(), I'm seeing:

; Function Attrs: alwaysinline nounwind readonly
define <3 x i8> @_Z6vload3jPKc(i32 %offset, i8* nocapture readonly %x) #0 {
  %1 = mul i32 %offset, 3
  %2 = getelementptr inbounds i8* %x, i32 %1
  %3 = bitcast i8* %2 to <4 x i8>*
  %4 = load <4 x i8>* %3, align 4
  %5 = shufflevector <4 x i8> %4, <4 x i8> undef, <3 x i32> <i32 0,
i32 1, i32 2>
  ret <3 x i8> %5
}

In theory, vec3 takes 4 elements worth of space

If you use a vload3 which is at the end of your allocated memory
space, would it be possible to get segfaults or VM faults since we'd
be reading 1 element past what we've allocated?

E.g. Allocated 48 bytes (4 x int3, or 3 x int4), followed by:
vload3(3, global int* input)

In theory, that should read integers input[9], input[10], and
input[11], but it will also read and then discard int[12]...

Or am I missing something? Is the allocation required to be padded to
prevent this from happening? The status of 3-element vectors being
occasionally treated as 4-element vectors for memory operations is
"fun".

--Aaron

One final question which I've got, which I believe I know the answer
to, but I'd rather ask than have to do an extra revision:

Do we want to do the same thing (vec2+scalar load) for vload3?

From looking at the generated bitcode for char3 vload3(), I'm seeing:
; Function Attrs: alwaysinline nounwind readonly
define <3 x i8> @_Z6vload3jPKc(i32 %offset, i8* nocapture readonly %x) #0 {
  %1 = mul i32 %offset, 3
  %2 = getelementptr inbounds i8* %x, i32 %1
  %3 = bitcast i8* %2 to <4 x i8>*
  %4 = load <4 x i8>* %3, align 4
  %5 = shufflevector <4 x i8> %4, <4 x i8> undef, <3 x i32> <i32 0,
i32 1, i32 2>
  ret <3 x i8> %5
}

In theory, vec3 takes 4 elements worth of space

If you use a vload3 which is at the end of your allocated memory
space, would it be possible to get segfaults or VM faults since we'd
be reading 1 element past what we've allocated?

I hadn't considered this before, but I think you are right that it could
be potentially unsafe to load a vec4. So, I think it would be a good
idea to make the same change for vload.

-Tom