[AMDGPU] non-hsa intrinsic with hsa target

Dear Developers,

I compiled a OpenCL kernel before (on Nov. last year) like

__kernel void g(__global float* array)
{
array[get_global_id(0)] = 1;
}

with libclc, which would originally use the instrinsics like llvm.r600.read.local.size.x().

I executed the generated object file with one version of the hsa-runtime [1] provided by Mr. Stellard, when there was more than one workgroup, the output of the program wasn’t correct at that time. I guessed this might be because get_group_id() always returned 1 (not quite sure what was going on at that time).

When I compile such cases using current llvm trunk, it uses a set of instrinsics starting with llvm.amdgcn, while it still uses llvm.r600.read.local.size.x(). The output LLVM IR code is like:

define void @g(float addrspace(1)* nocapture %array) #0 {

%x.i.i = tail call i32 @llvm.amdgcn.workgroup.id.x() #2
%x.i12.i = tail call i32 @llvm.r600.read.local.size.x() #1
%mul26.i = mul i32 %x.i12.i, %x.i.i
%x.i4.i = tail call i32 @llvm.amdgcn.workitem.id.x() #2, !range !7

%add.i = add i32 %x.i4.i, %mul26.i
%0 = sext i32 %add.i to i64
%arrayidx = getelementptr inbounds float, float addrspace(1)* %array, i64 %0
store float 1.000000e+00, float addrspace(1)* %arrayidx, align 4, !tbaa !8
ret void
}

which cannot be handled by llc with the message “the non-hsa instrinsic with hsa target shown”.

After looking into the log (r259297), my question is that is there other intrinsic that support this case when the target is amdgcn–amdhsa? In the log of r259297, it states that AMDGPUPromoteAlloca pass (a backend pass) will generate this intrinsic, but even when I just emit-llvm without going through llc, this intrinsic is still emitted.

[1] https://github.com/tstellarAMD/hsa-runtime

Regards,

李弘宇 (Li, Hong-Yu)
Department of Computer Science & Information Engineering
National Taiwan University

Li, Hong-Yu,

it’s because get_group_id() uses get_local_size

_CLC_DEF size_t get_global_id(uint dim) {
return get_group_id(dim)*get_local_size(dim) + get_local_id(dim);
}

in libclc/amdgcn, ‘get_local_size’ invokes r600-xxx intrinsics. I doubt that libclc ever supports hsa-runtime before.

thanks,
–lx

Hi Mr. Liu,

Thanks for your quick reply.

I compiled the code with the libclc_trunk and linked the bitcode file under $LIBCLC_DIR/built_libs/tahiti-amdgcn–.bc. After looking into the libclc, it is currently using the new workitem intrinsics (commit ba9858caa1e927a6fcc601e3466faa693835db5e). In the linked bitcode ($LIBCLC_DIR/built_libs/tahiti-amdgcn–.bc), it has the following code segment,

define linkonce_odr i32 @get_global_id(i32 %dim) #5 {
entry:
switch i32 %dim, label %get_local_id.exit [
i32 0, label %get_group_id.exit.thread
i32 1, label %get_group_id.exit.thread22
i32 2, label %get_group_id.exit.thread24
]

get_group_id.exit.thread: ; preds = %entry
%x.i = tail call i32 @llvm.amdgcn.workgroup.id.x() #13
%x.i12 = tail call i32 @llvm.r600.read.local.size.x() #3
%mul26 = mul i32 %x.i12, %x.i
%x.i4 = tail call i32 @llvm.amdgcn.workitem.id.x() #13, !range !1
br label %get_local_id.exit

}

So it shows that some intrinstics are still using llvm.r600.xxx. I have no idea if I ever missed something so that it doesn’t work.

Thanks.

Best regards,

李弘宇 (Li, Hong-Yu)
Department of Computer Science & Information Engineering
National Taiwan University

I think the “define linkonce_odr i32 @get_global_id(i32 %dim) #5” you dumped is llvm IR after inlining and opt.
The commit you mentioned(ba9858) doesn’t change get_local_size() at all.

I never worked on OpenCL+HSA. I just wonder if libclc supports HSA.

HSA RT uses ‘hsa_kernel_dispatch_packet_t’ to get know workgroup size and grid size. so far, I didn’t see hsa-specific implementation appears in libclc.

thanks,
–lx

I never worked on OpenCL+HSA. I just wonder if libclc supports HSA.

It does not currently, and only supports Clover’s ABI. The reason there are still r600 related intrinsics being used for workitems is mostly because I haven’t gotten around to fixing it. Clover should be reading these from an offset from the kernel argument pointer rather than having special case intrinsics, or it could start putting arguments somewhere else.

HSA RT uses ‘hsa_kernel_dispatch_packet_t’ to get know workgroup size and grid size. so far, I didn’t see hsa-specific implementation appears in libclc.

The Mesa ABI reads items out of a hidden kernel argument area before the true arguments, while HSA reads from the dispatch packet pointer, which has an intrinsic for it. You can see how these are implemented here:https://bitbucket.org/multicoreware/hcc/src/33432be0ab37668e55f1f534294d7525587518a4/lib/hsail-amdgpu-wrapper.ll?at=master&fileviewer=file-view-default

-Matt

Hi,

Thank you for your prompt response. I’ll see the implementation in the hcc.

Li