I’m very new to the compiler world. The problem I am trying to solve (at least theoretically) is: given a single codebase, 1) Can I run one part of the code on an NVIDIA GPU and another part on an AMD GPU? 2) Can I run the “same” code (high level, say C++/Python) on an NVIDIA GPU as well as on an AMD GPU?
While searching for answers to these, I got to know about LLVM and MLIR. I see MLIR has the capability to support heterogeneous hardware. Can someone enlighten me more on this?
To answer your second question: Clang/LLVM supports OpenMP target offloading, a widely used parallel programming model. With unchanged source, you can run on any GPU.
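For example, a minimal offloaded loop looks like this (a sketch; the function and array names are just placeholders) and compiles unchanged for NVIDIA or AMD GPUs:
// Offload a SAXPY-style loop to the default device.
void saxpy(int n, float a, float *x, float *y) {
  #pragma omp target teams distribute parallel for map(to: x[0:n]) map(tofrom: y[0:n])
  for (int i = 0; i < n; ++i)
    y[i] = a * x[i] + y[i];
}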
To answer your first question: you will seldom find servers that have both NVIDIA and AMD GPUs.
But if you do, you can use Clang + OpenMP offload to target both at the same time; we have tried it.
Official multi-target support comes with OpenMP 6.0.
"and by adopting true support for using multiple devices in the same program.”.
No. It has been there since 4.5 at least (via the device clause).
What they mean by true is effectively “better” and less cumbersome. E.g. offload to multiple devices all at once rather than “looping over them”, stuff like that.
That should be legal but cumbersome since 4.5, when they started supporting teams on the host.
#pragma omp parallel
#pragma omp teams
#pragma omp target device(teamNumber)
for(...) {
}
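For comparison, the “looping over them” style that already works today could look roughly like this (a sketch; the target region body is a placeholder):
#include <omp.h>

void offload_to_all_devices(void) {
  int ndev = omp_get_num_devices();
  for (int d = 0; d < ndev; ++d) {
    // One asynchronous target region per device.
    #pragma omp target device(d) nowait
    { /* ... per-device work ... */ }
  }
  #pragma omp taskwait // wait until all devices are done
}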
@typsmit For the first question, I suppose you have two types of machines (such as a distributed system with multiple nodes) rather than two types of accelerator cards in one machine. In that case you need to build a parallel model (e.g. ring-all-reduce). If you also want to use code written in the same high-level language (such as nested for loops in C++), you can design a Dialect for this high-level language and lower it to MLIR’s Dialects. MLIR’s Dialects currently support NVIDIA GPUs and AMD GPUs, and you can do device-related optimizations with some Dialects and Passes.
But this still requires a lot of changes to the existing code.
You can compile the code that needs to be parallelized into libraries (CUDA libs, AMD libs) via MLIR and then call them; it’s up to you how to schedule them.
For the second question: as @jdoerfert and @tschuett said, OpenMP is a choice, and I think it is the most convenient way. Using MLIR will bring a lot of work (depending on the complexity of your current high-level language), but more optimizations can be done in MLIR (e.g. polyhedral). After all, you are writing the code in a high-level language, so I assume you don’t do tiling or stream overlapping for different GPU architectures yourself. It’s not clear to me whether OpenMP has these optimizations, but it should be possible to do them with MLIR, though with a fair amount of work.
As I said before, that is working just fine with Clang + OpenMP offload:
clang -fopenmp --offload-arch=sm_80,gfx90a,gfx906 in.c -o a.out
Every GPU that is compatible with sm_80, gfx90a, or gfx906 will be available to you at runtime, e.g., omp_get_num_devices() might return 3, and you can use device(0) to offload to the NVIDIA GPU and device(1) to offload to the MI250X from AMD, at the same time.
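Concretely, for question 1, a sketch (run_parts is a hypothetical function; the device numbering depends on your system and which architectures you compiled for):
void run_parts(float *a, float *b, int n) {
  // Part A on device 0 (e.g., the NVIDIA GPU):
  #pragma omp target teams distribute parallel for device(0) map(tofrom: a[0:n])
  for (int i = 0; i < n; ++i)
    a[i] += 1.0f;

  // Part B on device 1 (e.g., the AMD GPU):
  #pragma omp target teams distribute parallel for device(1) map(tofrom: b[0:n])
  for (int i = 0; i < n; ++i)
    b[i] *= 2.0f;
}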
IIRC, Clang understands the OpenMP tile directive, in case you want to apply tiling (@Meinersbur is the one who would know). You get asynchronous, overlapping computation if you use nowait on the target directives, and you can use depend to provide dependences.
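A sketch of the asynchronous pattern (assuming plain host variables; the producer/consumer bodies are placeholders):
void produce_consume(void) {
  int x = 0, y = 0;

  // Produce x on the device, asynchronously.
  #pragma omp target map(tofrom: x) depend(out: x) nowait
  x = 42;

  // Consume x; the depend clauses order this target task after the producer.
  #pragma omp target map(to: x) map(from: y) depend(in: x) nowait
  y = 2 * x;

  #pragma omp taskwait // wait for both asynchronous target tasks
}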
FWIW, teams on the host are barely a thing and do not provide anything conceptually new.
It does at 196 cores!
#pragma omp parallel
#pragma omp teams
#pragma omp parallel for
for (...) {
}
It nicely partitions the parallel construct. Without teams it is much harder to run k parallel for loops in parallel.
Teams does not partition, for my definition of partition.
Anyway, clang doesn’t seem to like it (Compiler Explorer).
Not all runtimes support this (= just nesting parallel) to begin with. Ours does, if you adjust the max-active-levels ICV, IIRC. taskloop is probably the right alternative, though.
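A taskloop-based sketch of running k loops concurrently (the inner work is a placeholder):
void run_k_loops(void) {
  #pragma omp parallel
  #pragma omp single
  #pragma omp taskloop num_tasks(32)
  for (int k = 0; k < 32; ++k) {
    // Each of the 32 tasks handles one outer iteration; the threads of the
    // enclosing parallel region execute these tasks concurrently.
    for (int i = 0; i < 1000; ++i) { /* ... work for loop k ... */ }
  }
}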
For OpenMP target, teams model SMs. Clang complains about a missing target directive, but OpenMP supports teams on the host device?!?
The standard doesn’t say. Implementations choose what makes sense wrt. result and complexity. For GPUs you get what you expect, namely thread blocks / work groups.
Yes. Teams need to be nested at the top level or directly inside a target. There was a plan to weaken that; I don’t remember if it passed. I do not believe there is any plan to allow nesting teams inside a parallel without a target in between. There is little point and lots of complexity.
Yes. Though, they are not really that interesting (on the host). The benefits of a “different” level of parallelism are not the same, and we, for now, do not make use of the minimal differences there might be. We support 5.X teams on the host, so on the outermost level, or in a target region offloaded to the host.
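For illustration, these are the legal placements (a sketch, not from the discussion above):
int main() {
  // Legal: teams nested only in the implicit parallel region of the program.
  #pragma omp teams num_teams(4)
  { /* ... */ }

  // Also legal: teams directly nested inside a target region.
  #pragma omp target teams num_teams(4)
  { /* ... */ }
}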
https://reviews.llvm.org/D154441
“A teams region can only be strictly nested within the implicit parallel region or a target region. If a teams construct is nested within a target construct, that target construct must contain no statements, declarations or directives outside of the teams construct.”
I do not follow. What are you trying to say?
#pragma omp parallel // 192 cores
#pragma omp teams // nr_of_teams(192/32)
/*
#pragma omp parallel for // 6 threads per **each** for
for (...) {
}
*/
I claim that the code is legal according to the standard. Furthermore, there is no easy mechanism to run 32 for loops with 6 threads each.
No, it’s not. You quoted the reason yourself in the post right before this one. The teams is not nested in the implicit parallel region (being nested in the implicit parallel region means being nested inside no explicit parallel region, which is not the case here), and it is not nested in a target region. Thus, it is not legal. OpenMP 5.2, page 339, lines 4-7:
A teams region must be strictly nested either within the implicit parallel region that surrounds the whole OpenMP program or within a target region. If a teams construct is nested within a target construct, that target construct must contain no statements, declarations or directives outside of the teams construct.
If you do not believe me, ask the OpenMP standard lawyer of your confidence.
// test.c
#include <stdio.h>
#include <omp.h>

int main() {
  int outer = 0, inner = 0;
  #pragma omp parallel
  {
    #pragma omp atomic
    ++outer;
    // Nested parallel region; it only forks if the max-active-levels ICV allows it.
    #pragma omp parallel
    {
      #pragma omp atomic
      ++inner;
    }
  }
  printf("Outer %i,inner: %i, max active levels: %i\n", outer, inner, omp_get_max_active_levels());
}
> clang -fopenmp -O3 test.c
> ./a.out
Outer 96,inner: 96, max active levels: 1
> OMP_NUM_THREADS=32,6 ./a.out
Outer 32,inner: 192, max active levels: 2147483647
You can also set the max active level ICV via the setter.
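E.g., a one-line sketch using the runtime API:
omp_set_max_active_levels(2); // allow the nested parallel region to actually fork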
I’ll take that option.
This is really a good conversation. Thank you all for your replies.
Right now I don’t get most of the technical parts.
In summary: 1) Yes, it is possible to run one part of the code on one accelerator card and another part on another type of card using Clang + OpenMP.
2) Yes, it is also possible to run the same code on different targets.
Would it be possible to show a toy example of how to achieve these? And a place from where I can start learning?
Thanks again.
Correct.
Toy example to show same code on different/all devices:
#include <omp.h>
#include <stdio.h>
#include <stdlib.h>

struct Info {
  int NumTeams;
  int NumThreads;
  int Device;
};

int main() {
  int ND = omp_get_num_devices();
  struct Info *Infos = (struct Info *)calloc(ND, sizeof(Infos[0]));
  for (int d = 0; d < ND; ++d) {
    // Offload one team-parallel region to each device and record its launch configuration.
    #pragma omp target teams device(d) map(from: Infos[d:1])
    {
      #pragma omp parallel
      {
        if (omp_get_thread_num() == 0)
          if (omp_get_team_num() == 0)
            Infos[d] = (struct Info){omp_get_num_teams(), omp_get_num_threads(), d};
      }
    }
  }
  for (int d = 0; d < ND; ++d)
    printf("Device %d: %d teams of %d threads\n", Infos[d].Device, Infos[d].NumTeams, Infos[d].NumThreads);
}
The machine:
llvm-omp-device-info | grep "Product Name"
Product Name AMD Instinct MI250X
Product Name AMD Instinct MI250X
Product Name AMD Instinct MI250X
Product Name AMD Instinct MI250X
Product Name AMD Instinct MI250X
Product Name AMD Instinct MI250X
Product Name AMD Instinct MI250X
Product Name AMD Instinct MI250X
llvm-omp-device-info | grep "Device Name"
Device Name gfx90a
Device Name gfx90a
Device Name gfx90a
Device Name gfx90a
Device Name gfx90a
Device Name gfx90a
Device Name gfx90a
Device Name gfx90a
Compile and run:
clang -O3 -fopenmp --offload-arch=gfx90a test.c
./a.out
Device 0: 440 teams of 256 threads
Device 1: 440 teams of 256 threads
Device 2: 440 teams of 256 threads
Device 3: 440 teams of 256 threads
Device 4: 440 teams of 256 threads
Device 5: 440 teams of 256 threads
Device 6: 440 teams of 256 threads
Device 7: 440 teams of 256 threads
If you have different device types in your system, make sure llvm-omp-device-info recognizes them; see also our FAQ (openmp.llvm.org). Then pass all subarchitectures to --offload-arch, e.g., sm_80,gfx906.
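For example, on a machine that has both an NVIDIA A100 and an MI250X (just an assumed combination; adjust the architectures to your GPUs):
clang -O3 -fopenmp --offload-arch=sm_80,gfx90a test.c
./a.out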
Can’t say enough thanks for the head start. Let me run and understand this.