Clacc: how to compile basic example for AMD GPUs?

Hi!

I know that Clacc is not part of the LLVM project, but I hope someone here can answer my question.
Unfortunately, there is not much info on Clacc out there :frowning:

I’ve been trying to compile this program:

This is modified code from here.

When I compile with the following Clacc invocation, I get different output depending on whether OpenACC is enabled or disabled.

Command:

$ build/bin/clang -fopenacc -lm -ldl openacc-examples/1/laplace2d.c -o acc1

Output without OpenACC:

Jacobi relaxation Calculation: 4096 x 4096 mesh
    0, 0.250000
  100, 0.002397
  200, 0.001204
  300, 0.000804
  400, 0.000603
  500, 0.000483
  600, 0.000403
  700, 0.000345
  800, 0.000302
  900, 0.000269

Output with OpenACC:

Jacobi relaxation Calculation: 4096 x 4096 mesh
    0, 0.000000

Also, I tried to specify the OpenMP offload target explicitly (I didn’t find OpenACC-specific options for targets):

$ build/bin/clang -fopenacc -lm -ldl openacc-examples/1/laplace2d.c -o acc1  -fopenmp-targets=amdgcn-amd-amdhsa
clang-15: error: no library 'libomptarget-amdgpu-gfx908.bc' found in the default clang lib directory or in LIBRARY_PATH; use '--libomptarget-amdgpu-bc-path' to specify amdgpu bitcode library

What am I doing wrong? What should I fix?

Thanks in advance for your help!

CC @jdenny-ornl

Hi Daniil,

You said you didn’t find enough info on Clacc. Clacc is currently maintained as part of the LLVM DOE Fork. Its main branch is:

https://github.com/llvm-doe-org/llvm-project/tree/clacc/main

Is that where you found Clacc? If not, would you please tell me a bit about how you obtained and built it?

Clacc’s top-level README.md provides basic documentation on building and installing as well as pointers to status and design documents. Building and installing Clacc should be the same as for upstream LLVM’s OpenMP offload support. Please let me know if you find any discrepancies. Unless you’re planning to modify Clacc, I recommend that you work from an install directory instead of a build directory. There are fewer environment variables and Clang command-line options you have to set that way. All of this is discussed in the above documentation.

Let me know if you have any other questions. I’d also be interested in hearing more about how you plan to use Clacc.

Thanks.

Yes, I used clang from this fork of LLVM.

Ah, the README! I forgot to check it out… I was trying to Google the docs…
Thank you, it is really helpful!

I built the project for AMD GPUs as described in the README, but it still doesn’t work…

$ which clang
/home/d.dudkin/dev/llvm-project/install/bin/clang
$ cat test.c
#include <stdio.h>
int main() {
  #pragma acc parallel num_gangs(2)
  printf("Hello World\n");
  return 0;
}
$ clang -fopenacc test.c && ./a.out
Hello World
Hello World
$ clang -fopenacc -fopenmp-targets=amdgcn-amd-amdhsa test.c && ./a.out
$ 

The machine I’m using has AMD Instinct MI100 GPUs.
Output with the -### flag:

What should I consider in order to run the program successfully?

I want to reuse the OpenACC runtime library to run on AMD GPUs.

FWIW: the AMD GPU software stack doesn’t come with a built-in printf. HIP adds one; OpenMP/OpenACC does not right now. Your example might run fine except that nothing is ever printed.
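
One way to check that the region actually runs (just a sketch, not one of the Clacc examples) is to avoid printf in the offloaded region and copy a flag back to the host instead:

#include <stdio.h>

/* Sketch only: verify the offloaded region executed without relying on
 * device-side printf. The flag is copied back and printed on the host.
 * copy(ran) is needed because scalars are implicitly firstprivate on the
 * compute construct. */
int main(void) {
  int ran = 0;
  #pragma acc parallel num_gangs(2) copy(ran)
  ran = 1;
  printf(ran ? "offloaded region ran\n" : "offloaded region did not run\n");
  return 0;
}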

Yep, I pushed some Clacc changes yesterday that included an example for AMD GPU that’s broken for this reason.

@unterumarmung Sorry about that. I’ll push a fix soon and reply here.

I took a closer look at the code from your original post here. On the acc parallel, you need to add copy(error). Some OpenACC compilers will perform a dataflow analysis and add that implicitly for you, but the standard says it’s implicitly firstprivate because it’s a scalar. Clacc follows the standard here. This issue is discussed in, for example, sec. A.3.2 of OpenACC 3.2.

Edit: I just noticed that, when offloading, you also need data clauses for A and Anew because their sizes are not known at compile time. So, in summary, when I use #pragma acc parallel copy(error, A[0:n*m], Anew[0:n*m]), your example works for me in all cases.
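
To illustrate both points in isolation, here is a small sketch (made-up names and sizes, not your laplace2d.c): the scalar needs an explicit copy clause, and the heap-allocated array needs an explicit array section because its size isn’t visible at compile time.

#include <stdio.h>
#include <stdlib.h>

/* Small sketch (not the original laplace2d.c): 'error' is a scalar, so it
 * is implicitly firstprivate on 'acc parallel' unless a copy clause is
 * given; 'a' is dynamically allocated, so its size must be stated with an
 * explicit array section when offloading. */
int main(void) {
  int n = 8;
  double *a = malloc(n * sizeof *a);
  double error = 0.0;

  for (int i = 0; i < n; i++)
    a[i] = 0.5;

  #pragma acc parallel copy(error, a[0:n])
  {
    #pragma acc loop reduction(max:error)
    for (int i = 0; i < n; i++)
      error = error > a[i] ? error : a[i];
  }

  printf("error = %f\n", error); /* prints 0.500000 with the copy clause */
  free(a);
  return 0;
}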

Clacc’s OpenACC runtime library is a thin layer on top of the OpenMP runtime library, so don’t expect significant differences in, for example, performance.

Do you have a specific set of OpenACC applications you’re investigating? I’m always eager to find more examples to drive Clacc development.

Pushed.

Having a (very!) quick look at https://gist.github.com/unterumarmung/9f7f2ff46785c9204e1a64149d65f86a:

Actually, reduction(max:error) is missing on the first #pragma acc loop independent – so the way it’s currently written isn’t actually independent? (Also, the reduction clause on loop would then imply copy(error) on the compute construct?)

Also, there is no OpenACC gang-level synchronization between the two subsequent #pragma acc loops, so that constitutes a race condition (unless the compiler does some magic – or you’re lucky)?
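
Roughly, the structure I’d expect (just a sketch with a trivial loop body and a made-up size, not the actual stencil code): the max reduction stated on the first loop, and the second loop in its own compute construct so all gangs finish writing Anew before A is updated from it.

#include <stdio.h>
#include <stdlib.h>

/* Sketch only (trivial bodies, made-up size, not the real stencil). */
int main(void) {
  int n = 1024;
  double *A    = malloc(n * sizeof *A);
  double *Anew = malloc(n * sizeof *Anew);
  double error = 0.0;

  for (int i = 0; i < n; i++)
    A[i] = i;

  /* First loop: worksharing with an explicit max reduction. */
  #pragma acc parallel copy(error, A[0:n], Anew[0:n])
  #pragma acc loop reduction(max:error)
  for (int i = 1; i < n - 1; i++) {
    Anew[i] = 0.5 * (A[i - 1] + A[i + 1]);
    error = error > Anew[i] ? error : Anew[i];
  }

  /* Second loop: a separate compute construct, so there is a
   * synchronization point between the two loops. */
  #pragma acc parallel copy(A[0:n], Anew[0:n])
  #pragma acc loop
  for (int i = 1; i < n - 1; i++)
    A[i] = Anew[i];

  printf("error = %f\n", error);
  free(A);
  free(Anew);
  return 0;
}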

@tschwinge The issues you mentioned happen not to make much difference in the results when I try with this example, but I agree they are valid issues.

There are also other modifications to make that might improve performance, such as using an acc data around the outer while loop to avoid unnecessary data transfers.
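
For illustration, here is roughly what that structure looks like on a 1-D toy problem (my own sketch with made-up sizes, not the code from the gist):

#include <stdio.h>
#include <stdlib.h>

/* Sketch of the idea: the acc data region keeps A and Anew resident on
 * the device across the whole iteration loop, so the compute constructs
 * inside reuse the device copies instead of transferring the arrays on
 * every iteration. */
int main(void) {
  int n = 1 << 16, iter = 0, iter_max = 1000;
  double tol = 1.0e-6, error = 1.0;
  double *A    = malloc(n * sizeof *A);
  double *Anew = malloc(n * sizeof *Anew);

  for (int i = 0; i < n; i++)
    A[i] = (i == 0 || i == n - 1) ? 1.0 : 0.0;

  #pragma acc data copy(A[0:n]) create(Anew[0:n])
  while (error > tol && iter < iter_max) {
    error = 0.0;

    #pragma acc parallel copy(error) present(A[0:n], Anew[0:n])
    #pragma acc loop reduction(max:error)
    for (int i = 1; i < n - 1; i++) {
      Anew[i] = 0.5 * (A[i - 1] + A[i + 1]);
      double diff = Anew[i] - A[i];
      if (diff < 0.0) diff = -diff;
      error = error > diff ? error : diff;
    }

    #pragma acc parallel present(A[0:n], Anew[0:n])
    #pragma acc loop
    for (int i = 1; i < n - 1; i++)
      A[i] = Anew[i];

    iter++;
  }

  printf("%d iterations, final error = %g\n", iter, error);
  free(A);
  free(Anew);
  return 0;
}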

The original example was from NVIDIA, so it was probably developed using NVHPC. Moreover, it used acc kernels instead of acc parallel, and it targeted unified memory. These differences probably explain why these issues aren’t addressed in the original.

A loop reduction doesn’t imply a copy clause according to the spec, but some implementations add that anyway. Clacc produces a diagnostic to encourage portability.