[OpenMP][GSoC 2024] Improve GPU First Framework

shiltian · February 18, 2024, 12:12am

Description

GPU First is a methodology and framework that can enable any existing host code to execute the entire program on a GPU without any modification from users. The goal of this project is two folded:

Port host code to handle RPC to the new plugin and rewrite it with the host RPC framework introduced in the GPU LibC project.
Explore the support for MPI among multiple thread blocks on a single GPU, or even multiple GPUs.

Expected outcomes

More efficient GPU First framework that can support both NVIDIA and AMD GPUs. Optionally, upstream the framework.

Confirmed mentors and their contacts

@shiltian @jdoerfert @jhuber6

Required Skills

Good understanding of C++ and GPU architecture
Familiarity with GPUs and LLVM IR

Desired Skills

Good understanding of the LLVM code base and OpenMP target offloading

Size of the project

medium

An easy, medium or hard rating if possible

medium

Polygonalr · February 27, 2024, 2:03pm

Hi, I have interest in the project as mentioned over DMs!

As mentioned before, I have prior experience with NVIDIA GPU architecture and MPI usage, and I have studied compiler design in school before, so I am familiar with LLVM IR. Bringing the conversation to public,

Do you have access to a GPU? It doesn’t matter too much if that’s an AMD or NVIDIA GPU. I’m not sure if GSoC or LLVM Foundation could provide basic resource, but it might be a good idea if you can have your own resources.

Yes, I have access to an Ada Lovelace GPU, so that isn’t an issue.

It might be a good start to try to build the LLVM repo (GitHub - shiltian/llvm-project at llvm-test-suite-gpu) with OpenMP target offloading enabled and run some basic OpenMP target offloading programs on a GPU. It is a fairly old branch but that’s gonna be the starting point of this project. After that, it will be a good idea to take a look at the compiler wrapper openmp/libomptarget/DirectGPUCompilation/clang-gpu, play with it (such as compiling a simple host program and run it on a GPU), and understand the compilation pipeline. We can then start from there.

Thank you for some pointers, I’ll be taking the time to do so over the week!

Edit: I’ve ran into the same issue mentioned here while trying to build OpenMP, I’m not sure how to proceed from here.

YuxinHao · February 28, 2024, 4:48pm

Hello, Shilei and Joseph,

I have already read several GPU First papers that you recommended to me, and I think they are very cool!

Now, I am using the following commands to build the LLVM project, which is the host code repository you provided:

cmake ../llvm -G Ninja -DLLVM_ENABLE_PROJECTS="libc"  \
   -DCMAKE_C_COMPILER=clang -DCMAKE_CXX_COMPILER=clang++ \
   -DCMAKE_BUILD_TYPE=Release \
   -DLLVM_ENABLE_PROJECTS="clang"

Then, I use the following command to compile hello.cpp:

clang hello.cpp -L /home/Projects/gso/llvm-project/build/projects/libc/lib -lllvmlibc

The content of hello.cpp is:

cppCopy code

#include <iostream>

using namespace std;

int main() {
    string s = "hello,llvm!";
    int len = s.length();
    cout << s << " length is " << len << endl;
    return 0;
}

However, there are some errors. Could you please give me some advice?

/usr/bin/ld: /tmp/hello-cd1da6.o: in function `main':
hello.cpp:(.text+0x18): undefined reference to `std::allocator<char>::allocator()'
/usr/bin/ld: hello.cpp:(.text+0x2c): undefined reference to `std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >::basic_string(char const*, std::allocator<char> const&)'
/usr/bin/ld: hello.cpp:(.text+0x3a): undefined reference to `std::allocator<char>::~allocator()'
/usr/bin/ld: hello.cpp:(.text+0x47): undefined reference to `std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >::length() const'
/usr/bin/ld: hello.cpp:(.text+0x55): undefined reference to `std::cout'
/usr/bin/ld: hello.cpp:(.text+0x5a): undefined reference to `std::basic_ostream<char, std::char_traits<char> >& std::operator<< <char, std::char_traits<char>, std::allocator<char> >(std::basic_ostream<char, std::char_traits<char> >&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&)'
/usr/bin/ld: hello.cpp:(.text+0x73): undefined reference to `std::basic_ostream<char, std::char_traits<char> >& std::operator<< <std::char_traits<char> >(std::basic_ostream<char, std::char_traits<char> >&, char const*)'
/usr/bin/ld: hello.cpp:(.text+0x88): undefined reference to `std::ostream::operator<<(int)'
/usr/bin/ld: hello.cpp:(.text+0x9c): undefined reference to `std::basic_ostream<char, std::char_traits<char> >& std::endl<char, std::char_traits<char> >(std::basic_ostream<char, std::char_traits<char> >&)'
/usr/bin/ld: hello.cpp:(.text+0xa1): undefined reference to `std::ostream::operator<<(std::ostream& (*)(std::ostream&))'
/usr/bin/ld: hello.cpp:(.text+0xb6): undefined reference to `std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >::~basic_string()'
/usr/bin/ld: hello.cpp:(.text+0xd4): undefined reference to `std::allocator<char>::~allocator()'
/usr/bin/ld: hello.cpp:(.text+0xee): undefined reference to `std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >::~basic_string()'
/usr/bin/ld: /tmp/hello-cd1da6.o: in function `__cxx_global_var_init':
hello.cpp:(.text.startup+0xc): undefined reference to `std::ios_base::Init::Init()'
/usr/bin/ld: hello.cpp:(.text.startup+0x13): undefined reference to `std::ios_base::Init::~Init()'
/usr/bin/ld: /tmp/hello-cd1da6.o:(.data.DW.ref.__gxx_personality_v0[DW.ref.__gxx_personality_v0]+0x0): undefined reference to `__gxx_personality_v0'
clang: error: linker command failed with exit code 1 (use -v to see invocation)
root@1ef1351cbf8e:/home/Projects/hello# ls
hello.cpp
root@1ef1351cbf8e:/home/Projects/hello# export BU=/home/Projects/gso/llvm-project/build/bin
root@1ef1351cbf8e:/home/Projects/hello# $BU/clang hello.cpp -L /home/Projects/gso/llvm-project/build/projects/libc/lib -lllvmlibc
/usr/bin/ld: /tmp/hello-948583.o: in function `main':
hello.cpp:(.text+0x18): undefined reference to `std::allocator<char>::allocator()'
/usr/bin/ld: hello.cpp:(.text+0x3a): undefined reference to `std::allocator<char>::~allocator()'
/usr/bin/ld: hello.cpp:(.text+0x47): undefined reference to `std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >::length() const'
/usr/bin/ld: hello.cpp:(.text+0x55): undefined reference to `std::cout'
/usr/bin/ld: hello.cpp:(.text+0x5a): undefined reference to `std::basic_ostream<char, std::char_traits<char> >& std::operator<< <char, std::char_traits<char>, std::allocator<char> >(std::basic_ostream<char, std::char_traits<char> >&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&)'
/usr/bin/ld: hello.cpp:(.text+0x73): undefined reference to `std::basic_ostream<char, std::char_traits<char> >& std::operator<< <std::char_traits<char> >(std::basic_ostream<char, std::char_traits<char> >&, char const*)'
/usr/bin/ld: hello.cpp:(.text+0x88): undefined reference to `std::ostream::operator<<(int)'
/usr/bin/ld: hello.cpp:(.text+0x9c): undefined reference to `std::basic_ostream<char, std::char_traits<char> >& std::endl<char, std::char_traits<char> >(std::basic_ostream<char, std::char_traits<char> >&)'
/usr/bin/ld: hello.cpp:(.text+0xa1): undefined reference to `std::ostream::operator<<(std::ostream& (*)(std::ostream&))'
/usr/bin/ld: hello.cpp:(.text+0xb6): undefined reference to `std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >::~basic_string()'
/usr/bin/ld: hello.cpp:(.text+0xd4): undefined reference to `std::allocator<char>::~allocator()'
/usr/bin/ld: hello.cpp:(.text+0xee): undefined reference to `std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >::~basic_string()'
/usr/bin/ld: /tmp/hello-948583.o: in function `__cxx_global_var_init':
hello.cpp:(.text.startup+0xc): undefined reference to `std::ios_base::Init::Init()'
/usr/bin/ld: hello.cpp:(.text.startup+0x13): undefined reference to `std::ios_base::Init::~Init()'
/usr/bin/ld: /tmp/hello-948583.o: in function `std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >::basic_string<std::allocator<char> >(char const*, std::allocator<char> const&)':
hello.cpp:(.text._ZNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEC2IS3_EEPKcRKS3_[_ZNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEC2IS3_EEPKcRKS3_]+0x1d): undefined reference to `std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >::_M_local_data()'
/usr/bin/ld: hello.cpp:(.text._ZNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEC2IS3_EEPKcRKS3_[_ZNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEC2IS3_EEPKcRKS3_]+0x2d): undefined reference to `std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >::_Alloc_hider::_Alloc_hider(char*, std::allocator<char> const&)'
/usr/bin/ld: /tmp/hello-948583.o: in function `void std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >::_M_construct<char const*>(char const*, char const*, std::forward_iterator_tag)':
hello.cpp:(.text._ZNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEE12_M_constructIPKcEEvT_S8_St20forward_iterator_tag[_ZNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEE12_M_constructIPKcEEvT_S8_St20forward_iterator_tag]+0x48): undefined reference to `std::__throw_logic_error(char const*)'
/usr/bin/ld: hello.cpp:(.text._ZNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEE12_M_constructIPKcEEvT_S8_St20forward_iterator_tag[_ZNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEE12_M_constructIPKcEEvT_S8_St20forward_iterator_tag]+0x75): undefined reference to `std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >::_M_create(unsigned long&, unsigned long)'
/usr/bin/ld: hello.cpp:(.text._ZNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEE12_M_constructIPKcEEvT_S8_St20forward_iterator_tag[_ZNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEE12_M_constructIPKcEEvT_S8_St20forward_iterator_tag]+0x81): undefined reference to `std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >::_M_data(char*)'
/usr/bin/ld: hello.cpp:(.text._ZNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEE12_M_constructIPKcEEvT_S8_St20forward_iterator_tag[_ZNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEE12_M_constructIPKcEEvT_S8_St20forward_iterator_tag]+0x8e): undefined reference to `std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >::_M_capacity(unsigned long)'
/usr/bin/ld: hello.cpp:(.text._ZNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEE12_M_constructIPKcEEvT_S8_St20forward_iterator_tag[_ZNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEE12_M_constructIPKcEEvT_S8_St20forward_iterator_tag]+0x97): undefined reference to `std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >::_M_data() const'
/usr/bin/ld: hello.cpp:(.text._ZNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEE12_M_constructIPKcEEvT_S8_St20forward_iterator_tag[_ZNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEE12_M_constructIPKcEEvT_S8_St20forward_iterator_tag]+0xb1): undefined reference to `std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >::_S_copy_chars(char*, char const*, char const*)'
/usr/bin/ld: hello.cpp:(.text._ZNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEE12_M_constructIPKcEEvT_S8_St20forward_iterator_tag[_ZNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEE12_M_constructIPKcEEvT_S8_St20forward_iterator_tag]+0xcb): undefined reference to `__cxa_begin_catch'
/usr/bin/ld: hello.cpp:(.text._ZNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEE12_M_constructIPKcEEvT_S8_St20forward_iterator_tag[_ZNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEE12_M_constructIPKcEEvT_S8_St20forward_iterator_tag]+0xd4): undefined reference to `std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >::_M_dispose()'
/usr/bin/ld: hello.cpp:(.text._ZNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEE12_M_constructIPKcEEvT_S8_St20forward_iterator_tag[_ZNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEE12_M_constructIPKcEEvT_S8_St20forward_iterator_tag]+0xde): undefined reference to `__cxa_rethrow'
/usr/bin/ld: hello.cpp:(.text._ZNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEE12_M_constructIPKcEEvT_S8_St20forward_iterator_tag[_ZNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEE12_M_constructIPKcEEvT_S8_St20forward_iterator_tag]+0xf4): undefined reference to `__cxa_end_catch'
/usr/bin/ld: hello.cpp:(.text._ZNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEE12_M_constructIPKcEEvT_S8_St20forward_iterator_tag[_ZNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEE12_M_constructIPKcEEvT_S8_St20forward_iterator_tag]+0x10b): undefined reference to `std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >::_M_set_length(unsigned long)'
/usr/bin/ld: /tmp/hello-948583.o: in function `std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >::_Alloc_hider::~_Alloc_hider()':
hello.cpp:(.text._ZNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEE12_Alloc_hiderD2Ev[_ZNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEE12_Alloc_hiderD2Ev]+0x11): undefined reference to `std::allocator<char>::~allocator()'
/usr/bin/ld: /tmp/hello-948583.o: in function `__clang_call_terminate':
hello.cpp:(.text.__clang_call_terminate[__clang_call_terminate]+0x5): undefined reference to `__cxa_begin_catch'
/usr/bin/ld: hello.cpp:(.text.__clang_call_terminate[__clang_call_terminate]+0xa): undefined reference to `std::terminate()'
/usr/bin/ld: /tmp/hello-948583.o:(.data.DW.ref.__gxx_personality_v0[DW.ref.__gxx_personality_v0]+0x0): undefined reference to `__gxx_personality_v0'
clang: error: linker command failed with exit code 1 (use -v to see invocation)

Best regards,
Harrison Hao

shiltian · February 28, 2024, 4:55pm

That is one of the limitation we have.We don’t have a good support for C++ yet. You might want to try a simple C program.

YuxinHao · February 28, 2024, 4:56pm

Thank you,Shilei,I will try it!

shiltian · February 28, 2024, 4:57pm

I guess you were trying to build OpenMP using another LLVM? That might be the cause of the issue. To work around it, you can build LLVM w/o OpenMP first, and then use that LLVM to build OpenMP standalone, aka something like cmake ... -S llvm-project/openmp -DCMAKE_C_COMPILER=llvm-build/bin/clang -DCMAKE_CXX_COMPILER=llvm-build/bin/clang++.

jdoerfert · February 28, 2024, 5:35pm

This is not gpu-first. Your build doesn’t even seem to have offloading enabled.
The issue you see here is that you compile a C++ program as if it was C, try clang++ instead of clang.

Polygonalr · February 29, 2024, 1:38pm

I ran into a few issues after debugging for quite a while.

Firstly, I was unable to build clang on the llvm-test-suite-gpu branch that was the initial link provided to me.

FAILED: lib/Transforms/IPO/CMakeFiles/LLVMipo.dir/OpenMPOpt.cpp.o 
/usr/bin/c++ -DGTEST_HAS_RTTI=0 -D_GNU_SOURCE -D__STDC_CONSTANT_MACROS -D__STDC_FORMAT_MACROS -D__STDC_LIMIT_MACROS -I/home/passiflora/llvm-project/build/lib/Transforms/IPO -I/home/passiflora/llvm-project/llvm/lib/Transforms/IPO -I/home/passiflora/llvm-project/build/include -I/home/passiflora/llvm-project/llvm/include -fPIC -fno-semantic-interposition -fvisibility-inlines-hidden -Werror=date-time -fno-lifetime-dse -Wall -Wextra -Wno-unused-parameter -Wwrite-strings -Wcast-qual -Wno-missing-field-initializers -pedantic -Wno-long-long -Wimplicit-fallthrough -Wno-maybe-uninitialized -Wno-class-memaccess -Wno-redundant-move -Wno-pessimizing-move -Wno-noexcept-type -Wdelete-non-virtual-dtor -Wsuggest-override -Wno-comment -Wno-misleading-indentation -Wctad-maybe-unsupported -fdiagnostics-color -ffunction-sections -fdata-sections -O3 -DNDEBUG  -fno-exceptions -funwind-tables -fno-rtti -std=c++17 -MD -MT lib/Transforms/IPO/CMakeFiles/LLVMipo.dir/OpenMPOpt.cpp.o -MF lib/Transforms/IPO/CMakeFiles/LLVMipo.dir/OpenMPOpt.cpp.o.d -o lib/Transforms/IPO/CMakeFiles/LLVMipo.dir/OpenMPOpt.cpp.o -c /home/passiflora/llvm-project/llvm/lib/Transforms/IPO/OpenMPOpt.cpp
/home/passiflora/llvm-project/llvm/lib/Transforms/IPO/OpenMPOpt.cpp: In member function ‘bool {anonymous}::OpenMPOpt::splitKernels()’:
/home/passiflora/llvm-project/llvm/lib/Transforms/IPO/OpenMPOpt.cpp:1135:20: error: ‘struct {anonymous}::OMPInformationCache’ has no member named ‘Kernels’
 1135 |       OMPInfoCache.Kernels.insert(K);
      |                    ^~~~~~~
/home/passiflora/llvm-project/llvm/lib/Transforms/IPO/OpenMPOpt.cpp:1180:34: error: ‘struct {anonymous}::OMPInformationCache’ has no member named ‘Kernels’
 1180 |       for (auto K : OMPInfoCache.Kernels) {
      |                                  ^~~~~~~
/home/passiflora/llvm-project/llvm/lib/Transforms/IPO/OpenMPOpt.cpp: In lambda function:
/home/passiflora/llvm-project/llvm/lib/Transforms/IPO/OpenMPOpt.cpp:1239:43: error: ‘struct {anonymous}::OMPInformationCache’ has no member named ‘Kernels’
 1239 |       collectReachingKernels(OMPInfoCache.Kernels, CI->getFunction(),
      |                                           ^~~~~~~
[1055/4767] Building CXX object lib/Tr...LLVMipo.dir/AttributorAttributes.cpp.o

I then went to try the direct_gpu_compilation branch that was linked in the original post here instead, and managed to build clang without OpenMP.

~/llvm-project/build
❯ cmake -DLLVM_ENABLE_PROJECTS="llvm" -DLLVM_ENABLE_RUNTIMES="" -DLLVM_TARGETS_TO_BUILD="host" ../llvm

...      
-- mlir project is disabled
-- openmp project is disabled
-- polly project is disabled
...

~/llvm-project/build
❯ make -j$(nproc)

However, I ran into the another issue when trying to compile OpenMP.

~/llvm-project/openmp/build
❯ CC=../../build/bin/clang CXX=../../build/bin/clang++ cmake -S .. \
	-DCMAKE_C_COMPILER=/home/passiflora/llvm-project/build/bin/clang \
	-DCMAKE_CXX_COMPILER=/home/passiflora/llvm-project/build/bin/clang++ \
	-DLLVM_DIR=/home/passiflora/llvm-project/build/lib/cmake/llvm/ \
	-DCMAKE_BUILD_TYPE=Release \
	-DLIBOMPTARGET_NVPTX_ENABLE_BCLIB=ON \
	-DLIBOMPTARGET_NVPTX_COMPUTE_CAPABILITIES=89

...
CMake Warning:
  Manually-specified variables were not used by the project:

    LIBOMPTARGET_NVPTX_COMPUTE_CAPABILITIES
    LIBOMPTARGET_NVPTX_ENABLE_BCLIB
...

~/llvm-project/openmp/build
❯ make -j$(nproc)

...
/home/passiflora/llvm-project/openmp/libomptarget/DeviceRTL/src/Synchronization.cpp:310:10: error: '__nvvm_atom_sys_add_gen_i' needs target feature sm_60|sm_61|sm_62|sm_70|sm_72|sm_75|sm_80|sm_86|sm_87|sm_89|sm_90
  return __nvvm_atom_sys_add_gen_i(Address, Val);
         ^
1 error generated
...

Ran into another road-block, and I’m not too sure what is causing the NVPTX options to be ignored. I’ve also tried other guides such as this one, but I ran into the same problem of the NVPTX options being ignored by cmake.

jhuber6 · February 29, 2024, 1:57pm

...
/home/passiflora/llvm-project/openmp/libomptarget/DeviceRTL/src/Synchronization.cpp:310:10: error: '__nvvm_atom_sys_add_gen_i' needs target feature sm_60|sm_61|sm_62|sm_70|sm_72|sm_75|sm_80|sm_86|sm_87|sm_89|sm_90
  return __nvvm_atom_sys_add_gen_i(Address, Val);
         ^
1 error generated
...

This is interesting, I don’t know where it’s coming from personally since I can’t seem to find that line in the source off of main.

These are totally deprecated, use -DLIBOMPTARGET_DEVICE_ARCHITECTURES=sm_89 instead. I’ll just copy-paste a build script I used recently, maybe that will help.

mkdir build && cd build
cmake ../llvm -DCMAKE_BUILD_TYPE=Release -DCMAKE_INSTALL_PREFIX=${PREFIX} -DCMAKE_EXPORT_COMPILE_COMMANDS=ON -DLLVM_CCACHE_BUILD=ON -DLLVM_ENABLE_PROJECTS="clang;clang-tools-extra;compiler-rt;lld" -DLLVM_ENABLE_RUNTIMES="openmp;libunwind" -DLIBOMPTARGET_DEVICE_ARCHITECTURES=sm_89 -DLLVM_TARGETS_TO_BUILD="X86;AMDGPU;NVPTX" -DLLVM_ENABLE_ASSERTIONS=ON -DLLVM_OPTIMIZED_TABLEGEN=ON -G Ninja
ninja install

Sh0g0-1758 · February 29, 2024, 2:49pm

I am a little confused. I have been facing this error since morning. In the README, it is written to build openmp separately as pointed out by @shiltian but @jhuber6 is suggesting to build llvm along with openmp by specifying the corresponding runtime. So in the latter, would I need to build openmp separately or not? And If so I would be grateful I you could copy-paste the build script for that also.

jhuber6 · February 29, 2024, 2:52pm

Where is that in the README? The problem is that the GPU runtime can only be compiled with an up-to-date clang. Hence, if you do a two-step build you would first build clang and then compile OpenMP with the newly built clang using -DCMAKE_C_COMPILER=clang. However, when you do -DLLVM_ENABLE_RUNTIMES it does that automatically for you. The runtimes builid will fire after the compiler is finished building and create a separate CMake invocation with the newly built compiler. My suggested CMake remains unchanged.

shiltian · February 29, 2024, 5:30pm

The branch is fairly out of date so you might need to look into those CMake files to determine which CMake variables need to be used.

That is from the system fence I added for host RPC to deal with some sort of inconsistency back in the day. The first task of this project is to rebase the patch and get rid of that RPC and use the Libc one.

llvm-test-suite-gpu branch is newer than direct-gpu-compilation and it has more code (such as the compiler driver) than the original one. The OMPInfoCache issue might be caused when I rebased it because we did go through some changes in OpenMPOpt at that time. I might have fixed the issue locally but didn’t push it into the branch. The fix is pretty straightforward though.

Polygonalr · March 1, 2024, 2:17pm

Yeap, turns out I just have to add one more line within OpenMPOpt.cpp to add the missing attribute within the llvm-test-suite-gpu branch

I’ve adapted Joseph’s build script and it worked out nicely now, thank you! I have managed to write a simple matrix multiplication program and compiled it with -fopenmp-targets=nvptx64.

#include <stdio.h>
#include <stdlib.h>
#include <omp.h>

#define N 1024

int main() {
  unsigned short A[N*N], B[N*N], C[N*N];
  for (int i = 0; i < N; i++) {
    for (int j = 0; j < N; j++) {
        A[N*i+j] = (5*i + 3*j) % 17;
        B[N*i+j] = (11*i + 2*j) % 13;
        C[N*i+j] = 0;
    }
  }
  #pragma omp target teams distribute parallel for collapse(2) map(to: A, B) map(from: C)
  for (int i = 0; i < N; i++) {
    for (int j = 0; j < N; j++) {
        C[N*i+j] = 0;
        for (int k = 0; k < N; k++) {
            C[N*i+j] += A[N*i+k] * B[N*k+j];
        }
    }
  }

  for (int i = 0; i < N; i++) {
    for (int j = 0; j < N; j++) {
        printf("%d ", C[N*i+j]);
    }
    printf("\n");
  }
  return 0;
}

I’ll be taking a look at the clang-gpu compiler wrapper next!

Polygonalr · March 18, 2024, 1:50pm

So far, I have taken a look at the GPU libc (and watched the talk by Joseph from last year’s LLVM Developers’ Meeting to understand it better, it’s an awesome talk!). I’ve also finally got around to reading the paper on GPU First and taken a look at the clang-gpu wrapper, which from my understanding is just a compilation wrapper to execute a program on a GPU with the help of libomptarget, while leveraging on DeviceRTL modifications made within the branch to support RPC between the host and device. Took me quite a few hours of looking at the codebase on and off to get this far, though the whole LLVM codebase is still kinda overwhelming and there are still many things I don’t understand :S (I’ll be putting in more effort to look at the codebase from now!)

So what I’m getting is that aside from upstreaming the framework, the ultimate goal of of this project is to reduce or get rid of reliance on DeviceRTL’s RPC and use the RPC code introduced in libc, while still ensuring that the OpenMP constructs still compile and work as intended?

If it is alright, I would like to start working on a proposal draft sometime this week.

shiltian · March 18, 2024, 2:58pm

Yes, that sounds correct. The current RPC used in that branch is just a PoC. The one in libc project is fine implemented. We’d like to implement the same functionality available in current rtl.cpp by using the new RPC. Since we already abandon the old plugins, the new implementation will be in plugin-nextgen.

shiltian · March 23, 2024, 1:33am

(Not sure why the original post can’t be modified now)

Due to last-minute change on our side, the goal of this project has to be updated:

Explore support of native I/O from the device instead of relying on host RPC.
Explore automatic reverse offloading, that is to automatically push certain regions back to host, such as those initialization part of a program that has heavy I/O.

The expected outcomes don’t change. We will still want a more efficient GPU First framework and upstream what we have.

Sorry for the inconvenience this might cause. Feel free to reach out if you need more information.

Polygonalr · March 23, 2024, 3:49pm

For native I/O from device, as far as I know there seems to be no way of reading and writing data from/to a GPU without the need of passing the data through the host (i.e. CPU/system memory). This should be mostly true for typical computers, with the exception of Magnum IO that seems to require specific hardware features to work. Feel free to enlighten me if I’m wrong!

Automatic reverse offloading seemed to be more plausible as a goal, though from what I know I believe OpenMP 5.0 reverse offloading has yet to be implemented in LLVM. If so, should we be writing our own reverse offloading mechanism (i.e. based off @jhuber6 's implementation here) as part of the project?

(Also it seems like this Discourse instance prevents edits of posts after a few hours of being published.)

jhuber6 · March 23, 2024, 3:51pm

Likely the goal here is to use support like io_uring - Wikipedia combined with unified memory and virtual addressing to provide things like fwrite natively without explicit RPC calls.

That implementation was basically just a proof-of-concept for a paper I wrote. That’s an earlier version of what’s upstream currently in the libc project.

Polygonalr · March 23, 2024, 3:56pm

Ah gotcha, so I’m assuming that the original goal of converting RPC calls to libc’s is still present, to not only support the upstreaming of the framework, but also to possibly enable us to work on automatic reverse offloading. Is that right?

Edit: since we’re on this topic, I would also like to know roughly what are the tasks required to be completed before upstreaming the framework, if possible

shiltian · March 23, 2024, 4:41pm

In automatic reverse offloading, the emphasis is on the “automatic” aspect. The compiler must analyze the code, identifying sections suitable for host execution, and generate the necessary code to coordinate execution via host RPC.

NVIDIA GPUs offer support for GPUDirect Storage, with the underlying implementation potentially leveraging io_uring.

This project is full of opportunities for exploration. We remain open to discovering and investigating additional topics as they arise.

Topic		Replies	Views
[libc][GSoC 2024] Performance and testing in the GPU libc GSoC gpu , libc , gsoc2024	55	1119	March 31, 2024
[GSoC 2024] Offloading libcxx GSoC gpu , libcxx , gsoc2024	10	707	April 6, 2025
[libc][GSoC 2025] Direct I/O from the GPU with io_uring GSoC gpu , libc , gsoc2025	13	753	May 11, 2025
[libc][GSoC 2025] Profiling and testing of the LLVM libc GPU math GSoC gpu , libc , offload , gsoc2025	17	712	April 6, 2025
Request for feedback for gsoc proposal Beginners	2	106	March 19, 2024