[Patch][RFC] Change R600 data layout

Hi,

     I've prepared patches for both LLVM and Clang to change the
datalayout for R600. This may seem like a bold move, but I think it is
warranted. R600/SI is a strange architecture in that it uses 64-bit
pointers but does not support 64-bit arithmetic, except for the
addressing done by load/store operations, which roughly maps onto
getelementptr.

    The current datalayout for r600 includes n32:64, which is odd
because r600 cannot actually do any 64-bit arithmetic natively. This
causes particular problems in the optimizer with the following kernel:

__kernel void if_eq(__global unsigned long* out, unsigned arg0)
{
    int i = 0;
    for (i = 0; i < arg0; i++) {
        out[i] = i;
    }
}

Clang decides that instead of emitting a sext from i32 %i to i64
before the getelementptr, it would be best to just go ahead and
promote the variable i to i64. That would be fine if i64 were
actually a native type.
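
Roughly, the loop then takes a shape like the following (a hand-written
sketch, not actual clang output), with the induction variable, the
increment, and the exit compare all widened to i64:

for.body:
  %i = phi i64 [ %inc, %for.body ], [ 0, %entry ]
  %arrayidx = getelementptr inbounds i64 addrspace(1)* %out, i64 %i
  store i64 %i, i64 addrspace(1)* %arrayidx, align 8
  %inc = add nsw i64 %i, 1                 ; 64-bit add every iteration
  %exitcond = icmp eq i64 %inc, %wide      ; %wide = zext i32 %arg0 to i64
  br i1 %exitcond, label %for.end, label %for.body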

By changing the native types to n32 *only*, clang emits better code
for r600, such as:

; Function Attrs: nounwind
define void @if_eq(i64 addrspace(1)* nocapture %out, i32 %arg0) #0 {
entry:
  %cmp4 = icmp eq i32 %arg0, 0
  br i1 %cmp4, label %for.end, label %for.body

for.body: ; preds = %for.body, %entry
  %i.05 = phi i32 [ %inc, %for.body ], [ 0, %entry ]
  %conv = sext i32 %i.05 to i64
  %arrayidx = getelementptr inbounds i64 addrspace(1)* %out, i64 %conv
  store i64 %conv, i64 addrspace(1)* %arrayidx, align 8, !tbaa !2
  %inc = add nsw i32 %i.05, 1
  %exitcond = icmp eq i32 %inc, %arg0
  br i1 %exitcond, label %for.end, label %for.body

for.end: ; preds = %for.body, %entry
  ret void
}
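
For reference, the change itself is just to the native-integer part of
the data layout string. A simplified sketch (the real R600 layout also
carries alignment entries for the other scalar and vector types, which
the patch leaves alone):

target datalayout = "e-p:64:64:64-i32:32:32-i64:64:64-n32:64"   ; before
target datalayout = "e-p:64:64:64-i32:32:32-i64:64:64-n32"      ; after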

Another upside is that i64 addition on r600 can even be enabled,
since the lowering code then only needs to be called for arithmetic
and not for pointers, for which it actually works. In the future,
r600 will require more patches for this issue, as the old IR
generated by clang was perfectly valid but crashes llc.

For now, I think this patch is a good solution because it generates
better code and allows me to compile programs much longer than my
previous 3-line record :slight_smile: Not so much a bandaid as
something else that needed to be done.

Regards,

Jon Pry
jonpry@gmail.com

0001-R600-Fix-pointer-arithmetic.txt (1.48 KB)

0001-R600-Change-datalayout.txt (2.84 KB)

Mostly n64 is there for historical reasons, as the AMDGPU backend was originally derived from the AMDIL backend. The AMDIL backend did support 64-bit native types on some hardware chips of the r6XX-derived architectures, so for simplicity, software efficiently emulated them on the chips that did not.

Also, GCN does support 64-bit for many operations, so I wouldn't remove it as a native type there.

Basically, what you want in order to get clang to emit better code is per-pointer address spaces. In the address spaces where you want 32-bit computation, you specify that in the data layout, and in the address spaces where you need 64-bit computation, you do the same. LLVM produces the right code in these situations.
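
As a rough sketch of what that could look like (the address-space
numbers and the 32-bit default here are illustrative, not the actual
r600 layout), global pointers stay 64-bit while the default address
space uses 32-bit pointers:

target datalayout = "e-p:32:32:32-p1:64:64:64-i32:32:32-i64:64:64-n32"

Address computation through addrspace(1) pointers is then done in 64
bits, and everything else in 32 bits.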

Hi,

    I've prepared patches for both LLVM and Clang to change the
datalayout for R600. This may seem like a bold move, but I think it is
warranted. R600/SI is a strange architecture in that it uses 64-bit
pointers but does not support 64-bit arithmetic, except for the
addressing done by load/store operations, which roughly maps onto
getelementptr.

   The current datalayout for r600 includes n32:64, which is odd
because r600 cannot actually do any 64-bit arithmetic natively.

This isn't entirely true. 64-bit operations do exist for a small number of (mostly bitwise) operations, just not the normal arithmetic add and multiply. Most importantly, there are 64-bit loads and stores. I've seen various places that use the native integer type to decide on a good integer size for something like memcpy. It would be beneficial to use a 64-bit type for a series of loads/stores through global pointers.
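
As a rough illustration (hand-written, not output from any particular
pass), copying 16 bytes through global pointers could then be expanded
as two 64-bit accesses instead of four 32-bit ones:

  %v0 = load i64 addrspace(1)* %src, align 8
  store i64 %v0, i64 addrspace(1)* %dst, align 8
  %src1 = getelementptr i64 addrspace(1)* %src, i64 1
  %dst1 = getelementptr i64 addrspace(1)* %dst, i64 1
  %v1 = load i64 addrspace(1)* %src1, align 8
  store i64 %v1, i64 addrspace(1)* %dst1, align 8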

Mostly n64 is there for historical reasons, as the AMDGPU backend was originally derived from the AMDIL backend. The AMDIL backend did support 64-bit native types on some hardware chips of the r6XX-derived architectures, so for simplicity, software efficiently emulated them on the chips that did not.

Also, GCN does support 64-bit for many operations, so I wouldn't remove it as a native type there.

Basically, what you want in order to get clang to emit better code is per-pointer address spaces. In the address spaces where you want 32-bit computation, you specify that in the data layout, and in the address spaces where you need 64-bit computation, you do the same. LLVM produces the right code in these situations.

I don't believe this is a viable option. The address space in question
is (1), global, which is set to p64 as it must be. The address space
computation is not the issue; it's the senseless promotion of the
indexer to 64 bits, which then requires the i++ to be a 64-bit add and
so on.

This isn't entirely true. 64-bit operations do exist for a small number
of (mostly bitwise) operations, just not the normal arithmetic add and
multiply. Most importantly, there are 64-bit loads and stores. I've
seen various places that use the native integer type to decide on a
good integer size for something like memcpy. It would be beneficial to
use a 64-bit type for a series of loads/stores through global pointers.

I agree that it would be nice if user code could somehow tell that
64-bit load/store was optimal. IMHO the crux of the issue is that once
clang promotes something to i64, there is no hope of optimizing it
back out. It's impossible to tell that the indexer is bounded to
32-bit limits, so the 64-bit additions just have to stay. I.e.,
information is destroyed when clang promotes for us.

I'd also like to point out that the patch works really well,
especially combined with my previous one. I am able to compile all the
clang and clover OpenCL tests for r600/si now, whereas before ~90% of
them crashed. Not that my 9 lines of code are responsible for all that
much; just a little polish on the edges.

Hi,

     I've prepared patches for both LLVM and Clang to change the
datalayout for R600. This may seem like a bold move, but I think it is
warranted. R600/SI is a strange architecture in that it uses 64-bit
pointers but does not support 64-bit arithmetic, except for the
addressing done by load/store operations, which roughly maps onto
getelementptr.

    The current datalayout for r600 includes n32:64, which is odd
because r600 cannot actually do any 64-bit arithmetic natively. This
causes particular problems in the optimizer with the following kernel:

__kernel void if_eq(__global unsigned long* out, unsigned arg0)
{
    int i = 0;
    for (i = 0; i < arg0; i++) {
        out[i] = i;
    }
}

Clang decides that instead of emitting a sext from i32 %i to i64
before the getelementptr, it would be best to just go ahead and
promote the variable i to i64. That would be fine if i64 were
actually a native type.

By changing the native types to n32 *only*, clang emits better code
for r600, such as:

; Function Attrs: nounwind
define void @if_eq(i64 addrspace(1)* nocapture %out, i32 %arg0) #0 {
entry:
  %cmp4 = icmp eq i32 %arg0, 0
  br i1 %cmp4, label %for.end, label %for.body

for.body: ; preds = %for.body, %entry
  %i.05 = phi i32 [ %inc, %for.body ], [ 0, %entry ]
  %conv = sext i32 %i.05 to i64
  %arrayidx = getelementptr inbounds i64 addrspace(1)* %out, i64 %conv
  store i64 %conv, i64 addrspace(1)* %arrayidx, align 8, !tbaa !2
  %inc = add nsw i32 %i.05, 1
  %exitcond = icmp eq i32 %inc, %arg0
  br i1 %exitcond, label %for.end, label %for.body

for.end: ; preds = %for.body, %entry
  ret void
}

Another upside is that i64 addition on r600 can even be enabled,
since the lowering code then only needs to be called for arithmetic
and not for pointers, for which it actually works. In the future,
r600 will require more patches for this issue, as the old IR
generated by clang was perfectly valid but crashes llc.

For now, I think this patch is a good solution because it generates
better code and allows me to compile programs much longer than my
previous 3-line record :slight_smile: Not so much a bandaid as
something else that needed to be done.

I'm assuming you are using a Southern Islands GPU. Is this correct?
What errors are you seeing without this patch and with which tests?

SI supports 64-bit operations natively, so I don't think we should be
removing n64 from the DataLayout. I think you could achieve the same
results by adding a target-specific DAG combine that recognizes
some form of the (i64 add (i64 sext a), (i64 1)) pattern and replaces it
with a 32-bit add. I think there used to be a generic LLVM IR
optimization that did something like this (I'm not sure what it was
called, maybe "value range propagation"), so reviving this would be
another option.
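
In IR terms, the combine would rewrite something like

  %e = sext i32 %a to i64
  %r = add i64 %e, 1

into

  %t = add nsw i32 %a, 1
  %r = sext i32 %t to i64

whenever the narrow add is known not to overflow (a hand-written sketch
of the idea, not an existing pass or patch).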

It appears you are using tests from my opencl demos repo. If you are
interested, there are more mature and comprehensive tests in piglit:
http://piglit.freedesktop.org/; take a look at the tests/cl directory.

-Tom