Assembling and linking AMD GPU kernel code

Not sure if this belongs in an LLVM forum or a gem5 one, but I’ll start here:

Goal

I’m trying to experiment with AMD GPU kernel GCN assembly code to run on the gem5 simulator. I’d like to be able to write and compile HIP code where the CPU host code is written in C/C++, but the kernel itself is written in GCN assembly code. Eventually, I’d like to make slight modifications to the GCN ISA in LLVM.

My Setup

I’m running the current stable build of gem5 using the gem5 Docker image. I am using hipcc, which invokes clang under the hood, in the corresponding Docker container to build a small benchmark to run in gem5. To start, I am using the following HIP code, which I save as square.cpp:

#include <stdio.h>
#include "hip/hip_runtime.h"

#define CHECK(cmd) \
{\
    hipError_t error  = cmd;\
    if (error != hipSuccess) { \
      fprintf(stderr, "error: '%s'(%d) at %s:%d\n", hipGetErrorString(error), error,__FILE__, __LINE__); \
    exit(EXIT_FAILURE);\
    }\
}

/*
 * Square each element in the array A and write to array C.
 */
__global__ void
vector_square(int *C_d, const int *A_d, size_t N)
{
    size_t offset = (hipBlockIdx_x * hipBlockDim_x + hipThreadIdx_x);
    size_t stride = hipBlockDim_x * hipGridDim_x ;

    for (size_t i=offset; i<N; i+=stride) {
        C_d[i] = A_d[i] * A_d[i];
    }
}

int main(int argc, char *argv[])
{
    int *A_d, *C_d;
    int *A_h, *C_h;
    size_t N = 100;
    size_t Nbytes = N * sizeof(int);
    static int device = 0;
    CHECK(hipSetDevice(device));
    hipDeviceProp_t props;
    CHECK(hipGetDeviceProperties(&props, device/*deviceID*/));
    printf ("info: running on device %s\n", props.name);
    #ifdef __HIP_PLATFORM_HCC__
      printf ("info: architecture on AMD GPU device is: %d\n",props.gcnArch);
    #endif
    printf ("info: allocate host mem (%6.2f MB)\n", 2*Nbytes/1024.0/1024.0);
    A_h = (int*)malloc(Nbytes);
    CHECK(A_h == 0 ? hipErrorMemoryAllocation : hipSuccess );
    C_h = (int*)malloc(Nbytes);
    CHECK(C_h == 0 ? hipErrorMemoryAllocation : hipSuccess );
    for (size_t i=0; i<N; i++)
    {
        A_h[i] = 1 + i;
    }

    printf ("info: allocate device mem (%6.2f MB)\n", 2*Nbytes/1024.0/1024.0);
    CHECK(hipMalloc(&A_d, Nbytes));
    CHECK(hipMalloc(&C_d, Nbytes));

    printf ("info: copy Host2Device\n");
    CHECK ( hipMemcpy(A_d, A_h, Nbytes, hipMemcpyHostToDevice));

    const unsigned blocks = 512;
    const unsigned threadsPerBlock = 256;

    printf ("info: launch 'vector_square' kernel\n");
    hipLaunchKernelGGL(vector_square, dim3(blocks), dim3(threadsPerBlock), 0, 0, C_d, A_d, N);

    printf ("info: copy Device2Host\n");
    CHECK ( hipMemcpy(C_h, C_d, Nbytes, hipMemcpyDeviceToHost));
}

I can use the -save-temps flag to generate assembly code, which gives me x86 host code and GCN device code.

I want to experiment with GPU kernel assembly code by adding and/or removing instructions, recompiling it, and testing the binary in the simulator.

What I Have Tried

For now, I’m having trouble linking with the host code. I’m working with the sample HIP program above, square.cpp. I also created a file square.host.cpp that contains the host code and only an extern declaration of the kernel.
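The kernel declaration in square.host.cpp looks something like this (the rest of the file is identical to square.cpp):

extern __global__ void vector_square(int *C_d, const int *A_d, size_t N); // body comes from square.kernel.o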

Based on documentation here, I’ve been using the following procedure:

# Compile square.cpp
/opt/rocm/bin/hipcc -save-temps -O1 --amdgpu-target=gfx801 square.cpp -o square

# Assemble generated assembly code
mv square-hip-amdgcn-amd-amdhsa-gfx801.s square.kernel.s
sed -i "s/\.amdgcn_target/#.amdgcn_target/" square.kernel.s # comment out target to prevent error
/opt/rocm/llvm/bin/clang++ -x assembler -target amdgcn--amdhsa -mcpu=gfx801 -c square.kernel.s -o square.kernel.o

# Re-compile host code with extern definition
/opt/rocm/bin/hipcc -save-temps -O1 --amdgpu-target=gfx801 -c square.host.cpp -o square.host.o

# Link host and kernel code
/opt/rocm/bin/hipcc -save-temps -O1 --amdgpu-target=gfx801 square.host.o square.kernel.o -lm -o square

At the linking step, I get the following error:

/usr/bin/ld: square.kernel.o: Relocations in generic ELF (EM: 224)
/usr/bin/ld: square.kernel.o: error adding symbols: file in wrong format
clang-12: error: linker command failed with exit code 1 (use -v to see invocation)

Is there something I’m doing wrong or should do differently with the compiler, assembler, or linker steps? My guess is that I’m doing something wrong in the assembler step even though I don’t get an error.

Is there some other way to achieve my goal of compiling a HIP program where the kernel is written in assembly?

square.kernel.o is an ELF binary for the GPU. You’re trying to link it with the host object file into a host executable. You just can’t do that.

On top of that, your square.kernel.o is a GPU object file, not a GPU executable. It still needs to be linked into a GPU executable before you would be able to execute it on a GPU.

I don’t know if hipcc does any magic with GPU object files during the linking phase, but I doubt it.

Mixing hipcc and clang may also add to the list of unnecessary trouble.

Your immediate challenges (the list is probably not exhaustive) are:

  • build a GPU executable
  • figure out how to embed it into the host object file in a way that would be compatible with what hipLaunchKernelGGL expects. I do not think you can use it with assembly kernels. Kernels compiled from hip/cuda source generate some glue on the host side to register the kernel with the runtime support library and that is missing for asm kernels.

Probably the easiest way to tinker with assembly would be to use inline asm in a HIP source file.

If you really want to launch a kernel written in assembly, the way to do it with CUDA is to embed the GPU executable (you would still need to link square.kernel.o for that) in the host executable, whichever way you want (e.g. as a plain byte array), and then use low-level APIs to load that GPU binary: for CUDA that would be cuModuleLoad(), then looking up the kernel entry by name, and then launching the kernel using the low-level API (e.g. cuLaunchKernel).

CUDA has a rough example of that approach in cuda-samples/Samples/0_Introduction/matrixMulDrv at master · NVIDIA/cuda-samples · GitHub

They still compile the kernel from CUDA source there, but the approach should work with asm kernels as well.
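Translated to HIP, a rough (untested) sketch of that flow would be the following. It assumes square.kernel.o has first been linked into a standalone code object, say square.kernel.hsaco, and that you look the kernel up by its mangled name (or use extern "C"):

#include <hip/hip_runtime.h>

// Sketch only: error checking omitted, and the kernarg struct layout must
// match what the kernel actually expects.
void launch_square(int *C_d, const int *A_d, size_t N) {
  struct { int *C_d; const int *A_d; size_t N; } args{C_d, A_d, N};
  size_t args_size = sizeof(args);
  void *config[] = {HIP_LAUNCH_PARAM_BUFFER_POINTER, &args,
                    HIP_LAUNCH_PARAM_BUFFER_SIZE, &args_size,
                    HIP_LAUNCH_PARAM_END};

  hipModule_t module;
  hipFunction_t kernel;
  hipModuleLoad(&module, "square.kernel.hsaco");        // load the GPU executable
  hipModuleGetFunction(&kernel, module,
                       "_Z13vector_squarePiPKim");      // mangled name of vector_square
  hipModuleLaunchKernel(kernel, /*grid*/ 512, 1, 1, /*block*/ 256, 1, 1,
                        /*sharedMemBytes*/ 0, /*stream*/ nullptr,
                        /*kernelParams*/ nullptr, /*extra*/ config);
}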

Good luck,

If you’re not married to HIP, you can use the JIT functionality in OpenMP offload.
Build clang with OpenMP offload (add “openmp” and “offload” to the runtimes if you work with main).

Then you can use a simple empty kernel:

// a.c
int main() {
  #pragma omp target teams
  ;
}

Then:

clang a.c -O3 -fopenmp --offload-arch=native -fopenmp-target-jit

a.out should now be an executable that offloads to your GPU. Check it with LIBOMPTARGET_INFO=16 ./a.out.

Next, get the IR for the (empty) kernel:

LIBOMPTARGET_JIT_POST_OPT_IR_MODULE=device.ll ./a.out

Now device.ll is the device IR. You can modify it, or replace it, but keep the symbols, especially the kernel name!

If you want to pipe different IR into the program, use:

LIBOMPTARGET_JIT_REPLACEMENT_MODULE=new_device.ll ./a.out

If you want to pipe in assembly, use:

LIBOMPTARGET_JIT_REPLACEMENT_OBJECT=new_device.s ./a.out

See LLVM/OpenMP Runtimes — LLVM/OpenMP 19.0.0git documentation for more information.

If you run hipcc with -save-temps -v, it will print the intermediate commands it invokes. Like @Artem-B said, it seems like an lld linker invocation is missing.

I made some (partial) notes last time I tried something similar: WIP · kuhar/iree@2958c14 · GitHub

It would be nice to have an in-tree test / sample that shows how to dis- and re-assemble an hsaco file and check that it roundtrips as expected. Not sure if this is on anyone’s radar. Maybe @jhuber6 would know?

If this is just for testing, you might find my GPU libc project compelling: Using libc for GPUs — The LLVM C Library. It lets you run C/C++ code as if it were a standard executable. I would personally just write a simple main function and use inline assembly. (Linking object code with AMDGCN isn’t completely functional, so linking with it is usually done with monolithic LTO.)

It’s been mentioned before, but you’re linking different architectures here. The GCN code is just a target like any other. You take some source code, get out some assembly like in Compiler Explorer, and then link that into an executable. The GPU runtime will then load this to execute it on the device, similarly to the loader on your system when you execute a regular CPU executable. The HIP / OpenMP / CUDA runtimes embed this GPU executable inside the host executable itself so the runtime can load it. If you want to load a custom one, you’ll need to override those runtime calls, or do some JIT stuff like Johannes mentioned.
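If you go the override route, HIP does have hipModuleLoadData for loading a code object image straight from memory. A sketch, where kernel_hsaco is a hypothetical byte array you embedded yourself (e.g. generated with xxd -i from a linked code object):

#include <hip/hip_runtime.h>

extern const unsigned char kernel_hsaco[]; // hypothetical embedded code object image

void load_embedded(hipModule_t *module) {
  // After this, look up and launch the kernel with hipModuleGetFunction /
  // hipModuleLaunchKernel as usual.
  hipModuleLoadData(module, kernel_hsaco);
}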

Thank you all for your responses!

I agree that inline assembly would probably be the easiest path to achieving my goals. To try this out, I created an empty kernel and copied the assembly instructions from the .s file generated using the -save-temps flag into an __asm statement, with all other code being the same as my original square.cpp example.

The problem with this solution is that the compiler does not correctly set all of the metadata at the end of the assembly code, including the number of SGPRs and VGPRs. Unsurprisingly, when I run the benchmark in gem5, I get an error that the SGPR index is out of range for one of the instructions. I tried simply adding the metadata within the __asm call, but the compiler still generates its own version, which causes an invalid symbol redefinition error.

Is anyone aware of any mechanism I can use to override these metadata values?

First, I think inline assembly is not the solution here. But failing that, if you add the correct register constraints, it should work (assuming the assembly actually fits within the function’s register budget; otherwise you’ll probably hit a similar error).

“simply adding the metadata within the __asm call”

How did you do this? Inline asm is expected to only include instructions.

Apparently LLVM doesn’t have the constraints documented, but GCC mentions them in Machine Constraints (Using the GNU Compiler Collection (GCC)). Using v and s should be what you’re after.

Also, if you really want to go down the road of manually linking in a custom-written kernel, I could probably hack together something that would work, but it’s non-trivial because the runtime needs to do all kinds of things like register the kernel by symbol name and set up a stub to associate it on the host.

I wish I could suggest doing something like using -Xoffload-linker <your obj> together with a device function declared extern in HIP or OpenMP offloading, but the problem with that is that we don’t really support ELF linking in AMDGCN right now: the metadata doesn’t know how to track the correct SGPR / VGPR count or LDS usage.

Are there some usage/syntax examples of these constraints?

Not sure if my understanding is correct, but I’m assuming I can replace specific register indices with a constraint. For instance, in the following instruction, using v instead of v1:

__asm(
"v_mov_b32_e32 v1, 0"
);

I tried this, but I get an “invalid operand for instruction” error at compile time.

Not sure how correct it is, but I just added the metadata lines after the last instruction, all within the string provided to the asm call.

Stuff like this works for me: Compiler Explorer.

This usage is incorrect. The compiler isn’t going to guess at what the registers in the string are. You have to mark them out explicitly, something like:

__asm("v_mov_b32 v1, 0" : "={v1}"(output_var));
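Or, using the generic constraints mentioned above, you can let the compiler pick the registers instead of naming them. A minimal sketch (pick_registers is just a placeholder function):

__device__ int pick_registers() {
  int out, in = 5;
  asm volatile("v_mov_b32 %0, 0" : "=v"(out));            // "v": the compiler chooses a VGPR for the output
  asm volatile("v_mov_b32 %0, %1" : "=v"(out) : "s"(in)); // "s": the compiler chooses an SGPR for the input
  return out;
}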

Good point.

I wrote a simple python script that reformats the assembly code into the inline syntax with explicit register constraints, which gives me something like the following:

// v_mov_b32_e32 v1, 0
// s_load_dword s0, s[4:5], 0x4
// s_waitcnt lgkmcnt(0)
// s_and_b32 s2, s0, 0xffff
// v_mov_b32_e32 v2, s8
// v_mad_u64_u32 v[0:1], s[0:1], s2, v2, v[0:1]
// s_load_dwordx4 s[8:11], s[6:7], 0x10
int x, y;
asm volatile ("v_mov_b32_e32 %0, 0" : "={v1}"(x));
asm volatile ("s_load_dword %0, %1, 0x4" : "={s0}"(x) : "{s[4:5]}"(y));
asm volatile ("s_waitcnt lgkmcnt(0)");
asm volatile ("s_and_b32 %0, %1, 0xffff" : "={s2}"(x) : "{s0}"(y));
asm volatile ("v_mov_b32_e32 %0, %1" : "={v2}"(x) : "{s8}"(y));
asm volatile ("v_mad_u64_u32 %0, %1, %2, %3, %4" : "={v[0:1]}"(x) : "{s[0:1]}"(y), "{s2}"(y), "{v2}"(y), "{v[0:1]}"(y));
asm volatile ("s_load_dwordx4 %0, %1, 0x10" : "={s[8:11]}"(x) : "{s[6:7]}"(y));
/* etc. for the rest of the kernel */

For some reason, when I get to the last instruction here in the simulator, the read of s[6:7] returns 0. With the original kernel, the register value is some non-zero address (0x7FFF5E7C1000). This results in an invalid memory access by the load instruction and a runtime error.

I wonder if there is some additional information missing that the compiler uses to specify how GPU registers are initialized. However, it’s a little weird that the first s_load_dword gets a non-zero address from s[4:5] and actually loads data from it. I’m not really sure where those register values are coming from.

These are the differences I still see in the generated assembly metadata:

[attached: diff of the generated kernel metadata; among other changes, the dispatch_ptr user SGPR is no longer enabled]

Are there still some problems or gaps with how I’m specifying the inline assembly, or is register initialization something I will just have to do manually with this approach?

Inline assembly support is not at the level where you can throw your entire assembler output into it and expect it to work like this. You certainly aren’t guaranteed to get the exact same output. The compiler doesn’t try to guess that you were directly reading the kernel ABI inputs in your assembly, so it didn’t detect that they were used. For example, the fact that the dispatch_ptr is now disabled is suspicious: all the subsequent SGPR ABI arguments are now off by 2.

Inline assembly is for “break glass in case of emergency” situations and some basic experimentation. I would not build any kind of major tooling on top of it.

We really need to improve our infrastructure in this area. The problem is that linking ELF code doesn’t fully work because the linker doesn’t know how to set up the LDS memory or the register / memory usage metadata across a foreign function call. Additionally, the mainstream languages don’t support registering a custom kernel (to my knowledge).

If you really want to, you can use the HIP / HSA API to directly call some named kernel and set up the arguments yourself. That’s basically what my amdgpu_loader utility does: llvm-project/libc/utils/gpu/loader/amdgpu/Loader.cpp at main · llvm/llvm-project · GitHub.

Alternatively, if you are okay with the register usage being potentially broken, you can just use a regular language like OpenMP or HIP to set up the kernel and then call your function externally. For OpenMP you can use -Xoffload-linker foo.o to pass your object file to the device linking job. You can do this with HIP if you use -fgpu-rdc --offload-new-driver in addition.

For OpenMP it looks something like this:

// main.c
extern int foo();

int main() {
  int val;
#pragma omp target map(from:val)
  val = foo();
  return val;
}

// foo.c
int foo() {
  return 42;
}

The foo function is just a stand-in for something to be defined by you. You can compile C or assembly for GCN directly using --target=amdgcn-amd-amdhsa -mcpu=... (pick your architecture). Getting that working with OpenMP looks something like this:

> clang foo.c --target=amdgcn-amd-amdhsa -mcpu=gfx1030 -c                                       
> clang main.c -fopenmp --offload-arch=gfx1030 -fopenmp-offload-mandatory -Xoffload-linker foo.o
> ./a.out; echo $?
42

But like I said, this technically doesn’t work because the ld.lld linker does not know how to calculate or propagate register usage across foreign function calls. Given this opaque function call, it will usually default to a sufficiently high count for trivial cases, and any use of LDS will likely be wrong. But apart from that it will “work”.
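For reference, the HIP variant of the same trick (a sketch, with the exact same register / LDS caveats) would look something like:

// main.hip -- built with something like:
//   hipcc main.hip -fgpu-rdc --offload-new-driver -Xoffload-linker foo.o
#include <hip/hip_runtime.h>

extern __device__ int foo(); // provided by foo.o at device link time

__global__ void kernel(int *out) { *out = foo(); }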

It should…eventually. The base infrastructure work for this is in progress.

For now, I’ll use some of the above methods and set up the registers and/or arguments manually. When the linker functionality is ready, I can hopefully switch to using that.

Is there a place where the linker work will be announced? I’d like to monitor it so I can be notified when it is ready.

I don’t think anyone’s working on it at the moment, but I’ve thought about it in the past. Basically, we’d need to embed resource usage metadata into each function, then embed callgraph information into the binary. The linker would then need to create a full callgraph once all the inputs are known and find the diameter of the graph. That would solve the resource usage, I feel.

LDS usage is more complicated, we would need some special symbol relocation to track references to LDS as some abstract symbol. LDS instructions are basically just integer offsets into a memory buffer, AKA an array. Each variable in LDS needs an integer index that gives it exclusive access at the needed alignment to that variable. The complicated bit is that each kernel has its own LDS buffer, so if a device function is shared between two kernels both of them will need to get the same integer index. I think we do something like this in the LDS lowering pass, but it would be a bit more difficult to do that in the linker than LLVM-IR. Unfortunately I’m not a linker expert and it’s somewhat low on the priority list because ABI-style linking is inherently slow on GPUs.