I'm trying to launch a GPU kernel that was compiled by the LLVM
AMDGPU backend. So far I've had no success, and I was hoping
someone here might have an idea.
It seems that TensorFlow does something similar, so I have been
reading the TensorFlow code on GitHub, and I believe the following
setup matches it closely in the vital parts:
1) Compile an LLVM IR module (see below) with AMDGPU backend to a
'module.o' file. Using this triple/CPU:
   llvm::Triple TheTriple;
   TheTriple.setArch(llvm::Triple::ArchType::amdgcn);
   TheTriple.setVendor(llvm::Triple::VendorType::AMD);
   TheTriple.setOS(llvm::Triple::OSType::AMDHSA);
   std::string CPUStr("gfx906");

   LLVM IR passes that I use:
   TargetLibraryInfoWrapperPass
   TargetMachine->addPassesToEmitFile with CGFT_ObjectFile
2) LLVM linker generates a shared lib using 'system()' call
   ld.lld -shared module.o -o module.so
3) Reading this shared module back into a 'vector<uint8> shared'
4) Using HIP to load this module:
   hipModule_t module;
   ret = hipModuleLoadData(&module, shared.data());
   (this returns hipSuccess)
5) Trying to get a HIP function:
   hipFunction_t kernel;
   ret = hipModuleGetFunction(&kernel, module, "kernel");
... and this fails with HIP error code 500!?
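
For reference, steps 2 and 3 can be sketched with the standard library only. This is a minimal sketch, not my exact code: the helper names read_binary_file and link_and_load are made up for illustration, the sketch uses std::uint8_t for the byte type, and it assumes ld.lld is on the PATH.

```cpp
#include <cstdint>
#include <cstdlib>
#include <fstream>
#include <iterator>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical helper for step 3: read a whole file into a byte vector.
// Throws if the file cannot be opened.
std::vector<std::uint8_t> read_binary_file(const std::string& path) {
    std::ifstream in(path, std::ios::binary);
    if (!in) {
        throw std::runtime_error("cannot open " + path);
    }
    // Copy every byte of the stream; char values convert to uint8_t.
    return std::vector<std::uint8_t>(std::istreambuf_iterator<char>(in),
                                     std::istreambuf_iterator<char>());
}

// Hypothetical helper for steps 2 and 3: run the linker through system()
// and read the resulting shared object back into memory.
// Assumption: ld.lld is on the PATH. The return value of std::system is
// implementation-defined; here any non-zero value is treated as failure.
std::vector<std::uint8_t> link_and_load(const std::string& object_path,
                                        const std::string& shared_path) {
    const std::string cmd =
        "ld.lld -shared " + object_path + " -o " + shared_path;
    if (std::system(cmd.c_str()) != 0) {
        throw std::runtime_error("link step failed: " + cmd);
    }
    return read_binary_file(shared_path);
}
```

With these helpers, 'shared' from step 3 would come from link_and_load("module.o", "module.so"), and shared.data() is what gets passed to hipModuleLoadData in step 4.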
I believe the vital steps here concerning ROCm are similar
(identical?) to what's in TensorFlow, but I can't get it to work.
I have to admit that I did not build TensorFlow to see if the AMD GPU
bits actually work. I read the comments, and some say that it
comes with some performance overhead. Performance isn't the point at
the moment, since I'm working on a proof of concept.
My test machine has an 'AMD gfx906' card installed.
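
One sanity check that could be run on the bytes from step 3 is whether they form a 64-bit little-endian ELF shared object for the AMDGPU machine. The sketch below reads a few fixed-offset fields from the ELF64 header. The names ElfSummary and summarize_elf are made up for illustration. Treating the low 8 bits of e_flags as the GPU target field is an assumption that depends on the code object version, and the value to compare against (EF_AMDGPU_MACH_AMDGCN_GFX906 in llvm/BinaryFormat/ELF.h) is not hard-coded here.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical summary of the ELF64 header fields of interest.
struct ElfSummary {
    bool is_elf64_le = false;     // magic OK, 64-bit class, little-endian
    std::uint16_t type = 0;       // e_type    (3 == ET_DYN, shared object)
    std::uint16_t machine = 0;    // e_machine (224 == EM_AMDGPU)
    std::uint32_t mach_bits = 0;  // e_flags & 0xff (assumed target field)
};

// Read an n-byte little-endian unsigned value (n <= 4) at offset off.
static std::uint32_t read_le(const std::vector<std::uint8_t>& b,
                             std::size_t off, std::size_t n) {
    std::uint32_t v = 0;
    for (std::size_t i = 0; i < n; ++i) {
        v |= static_cast<std::uint32_t>(b[off + i]) << (8 * i);
    }
    return v;
}

// Parse the fixed-offset fields of an ELF64 header. Returns a summary with
// is_elf64_le == false if the buffer is too short or not ELF64 LE.
ElfSummary summarize_elf(const std::vector<std::uint8_t>& b) {
    ElfSummary s;
    if (b.size() < 64) {  // an ELF64 header is 64 bytes long
        return s;
    }
    s.is_elf64_le = b[0] == 0x7f && b[1] == 'E' && b[2] == 'L' &&
                    b[3] == 'F' &&
                    b[4] == 2 &&  // EI_CLASS: ELFCLASS64
                    b[5] == 1;    // EI_DATA:  ELFDATA2LSB
    if (!s.is_elf64_le) {
        return s;
    }
    s.type = static_cast<std::uint16_t>(read_le(b, 16, 2));     // e_type
    s.machine = static_cast<std::uint16_t>(read_le(b, 18, 2));  // e_machine
    s.mach_bits = read_le(b, 48, 4) & 0xffu;                    // e_flags
    return s;
}
```

For the 'shared' vector from step 3, I would expect summarize_elf(shared) to report machine == 224 and type == 3. Anything else would suggest a problem in the compile or link step rather than in HIP.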
Digging deeper, hipModule_t is a pointer to ihipModule_t, and
printing out its values after loading the module gives:
ihip->hash = 3943538976062281088
ihip->kernargs.size() = 0
ihip->executable.handle = 42041072
It's not telling me much. 'Not sure what to do with the handle for the
Any ideas what could be tried next?