I’m very new to the compiler world. The problem I am trying to solve (at least theoretically) is: given a single codebase, 1) Can I run one part of the code on an NVIDIA GPU and another part on an AMD GPU? 2) Can I run the “same” code (high level, say C++/Python) on an NVIDIA GPU as well as on an AMD GPU?
While searching for answers to these, I got to know about LLVM and MLIR. I see MLIR has the capability to support heterogeneous hardware. Can someone enlighten me more on this?
To answer your second question: Clang/LLVM supports OpenMP target offloading, a widely used parallel programming model. With unchanged source, you can run on any GPU.
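For example, a minimal offloaded loop looks like this (a sketch; the function and array names are just placeholders) and compiles unchanged for NVIDIA or AMD GPUs:
// Offload a SAXPY-style loop to the default device.
void saxpy(int n, float a, float *x, float *y) {
  #pragma omp target teams distribute parallel for map(to: x[0:n]) map(tofrom: y[0:n])
  for (int i = 0; i < n; ++i)
    y[i] = a * x[i] + y[i];
}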
To answer your first question: you will seldom find servers that have both NVIDIA and AMD GPUs.
But if you do, you can use Clang + OpenMP offload to target both at the same time; we have tried it.
Official multi-target support comes with OpenMP 6.0.
"and by adopting true support for using multiple devices in the same program.”.
No. It has been there since 4.5 at least (via the device clause).
What they mean by true is effectively “better” and less cumbersome. E.g. offload to multiple devices all at once rather than “looping over them”, stuff like that.
That should be legal but cumbersome since 4.5, when they started supporting teams on the host.
#pragma omp parallel
#pragma omp teams
#pragma omp target device(teamNumber)
for(...) {
}
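For comparison, the “looping over them” style that already works today could look roughly like this (a sketch; the target region body is a placeholder):
#include <omp.h>

void offload_to_all_devices(void) {
  int ndev = omp_get_num_devices();
  for (int d = 0; d < ndev; ++d) {
    // One asynchronous target region per device.
    #pragma omp target device(d) nowait
    { /* ... per-device work ... */ }
  }
  #pragma omp taskwait // wait until all devices are done
}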
@typsmit For the first question, I suppose you have two types of machines (such as a distributed system with multiple nodes) rather than two types of accelerator cards in one machine. In that case you need to build a parallel model (e.g. ring-all-reduce). If you also want to use code written in the same high-level language (such as nested for loops in C++), you can design a Dialect for this high-level language and lower it to MLIR’s Dialects. MLIR’s Dialects currently support NVIDIA GPUs and AMD GPUs, and you can do device-related optimizations with some Dialects and Passes.
But this still requires a lot of changes to the existing code.
You can compile the code that needs to be parallelized into libraries (CUDA libs, AMD libs) via MLIR and then call them; it’s up to you how to schedule them.
For the second question: as @jdoerfert and @tschuett said, OpenMP is a choice, and I think it is the most convenient way. Using MLIR will bring a lot of work (depending on the complexity of your current high-level language), but more optimizations can be done in MLIR (e.g. polyhedral). After all, you are writing the code in a high-level language, so I assume you don’t do tiling or stream overlapping for different GPU architectures yourself. It’s not clear to me whether OpenMP has these optimizations, but it should be possible to do them with MLIR, though with a fair amount of work.
As I said before, that is working just fine with Clang + OpenMP offload:
clang -fopenmp --offload-arch=sm_80,gfx90a,gfx906 in.c -o a.out
Every GPU that is compatible with sm_80, gfx90a, or gfx906 will be available to you at runtime, e.g., omp_get_num_devices() might return 3, and you can use device(0) to offload to the NVIDIA GPU and device(1) to offload to the MI250X from AMD, at the same time.
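Concretely, for question 1, a sketch (run_parts is a hypothetical function; the device numbering depends on your system and which architectures you compiled for):
void run_parts(float *a, float *b, int n) {
  // Part A on device 0 (e.g., the NVIDIA GPU):
  #pragma omp target teams distribute parallel for device(0) map(tofrom: a[0:n])
  for (int i = 0; i < n; ++i)
    a[i] += 1.0f;

  // Part B on device 1 (e.g., the AMD GPU):
  #pragma omp target teams distribute parallel for device(1) map(tofrom: b[0:n])
  for (int i = 0; i < n; ++i)
    b[i] *= 2.0f;
}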
IIRC, Clang understands the OpenMP tile directive, in case you want to apply tiling (@Meinersbur is the one who would know). You get asynchronous, overlapping computation if you use nowait on the target directives, and you can use depend to provide dependences.
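A sketch of the asynchronous pattern (assuming plain host variables; the producer/consumer bodies are placeholders):
void produce_consume(void) {
  int x = 0, y = 0;

  // Produce x on the device, asynchronously.
  #pragma omp target map(tofrom: x) depend(out: x) nowait
  x = 42;

  // Consume x; the depend clauses order this target task after the producer.
  #pragma omp target map(to: x) map(from: y) depend(in: x) nowait
  y = 2 * x;

  #pragma omp taskwait // wait for both asynchronous target tasks
}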
FWIW, teams on the host are barely a thing and do not provide anything conceptually new.
It does at 196 cores!
#pragma omp parallel
#pragma omp teams
#pragma omp parallel for
for (...) {
}
It nicely partitions the parallel construct. Without teams it is much harder to run k parallel for loops in parallel.
Teams does not partition, for my definition of partition.
Anyway, clang doesn’t seem to like it (Compiler Explorer).
Not all runtimes support this (= just nesting parallel) to begin with. Ours does, if you adjust the max-active-levels ICV, IIRC. taskloop is probably the right alternative, though.
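A taskloop-based sketch of running k loops concurrently (the inner work is a placeholder):
void run_k_loops(void) {
  #pragma omp parallel
  #pragma omp single
  #pragma omp taskloop num_tasks(32)
  for (int k = 0; k < 32; ++k) {
    // Each of the 32 tasks handles one outer iteration; the threads of the
    // enclosing parallel region execute these tasks concurrently.
    for (int i = 0; i < 1000; ++i) { /* ... work for loop k ... */ }
  }
}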
For OpenMP target, teams model SMs. Clang complains about a missing target directive, but OpenMP supports teams on the host device?!?
The standard doesn’t say. Implementations choose what makes sense wrt. result and complexity. For GPUs you get what you expect, namely thread blocks / work groups.
Yes. Teams need to be nested at the top level or directly inside a target. There was a plan to weaken that; I don’t remember if it passed. I do not believe there is any plan to allow nesting teams inside a parallel without a target in between. There is little point and lots of complexity.
Yes. Though, they are not really that interesting (on the host). The benefits of a “different” level of parallelism are not the same, and we, for now, do not make use of the minimal differences there might be. We support 5.X teams on the host, so on the outermost level, or in a target region offloaded to the host.
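For illustration, these are the legal placements (a sketch, not from the discussion above):
int main() {
  // Legal: teams nested only in the implicit parallel region of the program.
  #pragma omp teams num_teams(4)
  { /* ... */ }

  // Also legal: teams directly nested inside a target region.
  #pragma omp target teams num_teams(4)
  { /* ... */ }
}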
https://reviews.llvm.org/D154441
“A teams region can only be strictly nested within the implicit parallel region or a target region. If a teams construct is nested within a target construct, that target construct must contain no statements, declarations or directives outside of the teams construct.”
I do not follow. What are you trying to say?
#pragma omp parallel // 192 cores
#pragma omp teams // nr_of_teams(192/32)
/*
#pragma omp parallel for // 6 threads per **each** for
for (...) {
}
*/
I claim that the code is legal according to the standard. Furthermore, there is no easy mechanism to run 32 for loops with 6 threads each.
No, it’s not. You quoted the reason yourself in the post right before this one. The teams is not nested in the implicit parallel region (being nested in the implicit parallel region means being nested inside no explicit parallel region, which is not the case here), and it is not nested in a target region. Thus, it is not legal. OpenMP 5.2, page 339, lines 4-7:
A teams region must be strictly nested either within the implicit parallel region that surrounds the whole OpenMP program or within a target region. If a teams construct is nested within a target construct, that target construct must contain no statements, declarations or directives outside of the teams construct.
If you do not believe me, ask the OpenMP standard lawyer of your confidence.
// test.c
#include <stdio.h>
#include <omp.h>

int main() {
  int outer = 0, inner = 0;
  #pragma omp parallel
  {
    #pragma omp atomic
    ++outer;
    // Nested parallel region; it only forks if the max-active-levels ICV allows it.
    #pragma omp parallel
    {
      #pragma omp atomic
      ++inner;
    }
  }
  printf("Outer %i,inner: %i, max active levels: %i\n", outer, inner, omp_get_max_active_levels());
}
> clang -fopenmp -O3 test.c
> ./a.out
Outer 96,inner: 96, max active levels: 1
> OMP_NUM_THREADS=32,6 ./a.out
Outer 32,inner: 192, max active levels: 2147483647
You can also set the max active level ICV via the setter.
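E.g., a one-line sketch using the runtime API:
omp_set_max_active_levels(2); // allow the nested parallel region to actually fork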
I’ll take that option.
This is really a good conversation. Thank you all for your replies.
Right now I don’t get most of the technical parts.
In summary: 1) Yes, it is possible to run one part of the code on one accelerator card and another part on another type of card using Clang + OpenMP.
2) Yes, it is also possible to run the same code on different targets.
Would it be possible to show a toy example of how to achieve these? And a place from where I can start learning?
Thanks again.
Correct.
Toy example to show same code on different/all devices:
#include <omp.h>
#include <stdio.h>
#include <stdlib.h>

struct Info {
  int NumTeams;
  int NumThreads;
  int Device;
};

int main() {
  int ND = omp_get_num_devices();
  struct Info *Infos = (struct Info *)calloc(ND, sizeof(Infos[0]));
  for (int d = 0; d < ND; ++d) {
    // Offload one team-parallel region to each device and record its launch configuration.
    #pragma omp target teams device(d) map(from: Infos[d:1])
    {
      #pragma omp parallel
      {
        if (omp_get_thread_num() == 0)
          if (omp_get_team_num() == 0)
            Infos[d] = (struct Info){omp_get_num_teams(), omp_get_num_threads(), d};
      }
    }
  }
  for (int d = 0; d < ND; ++d)
    printf("Device %d: %d teams of %d threads\n", Infos[d].Device, Infos[d].NumTeams, Infos[d].NumThreads);
}
The machine:
llvm-omp-device-info | grep "Product Name"
Product Name AMD Instinct MI250X
Product Name AMD Instinct MI250X
Product Name AMD Instinct MI250X
Product Name AMD Instinct MI250X
Product Name AMD Instinct MI250X
Product Name AMD Instinct MI250X
Product Name AMD Instinct MI250X
Product Name AMD Instinct MI250X
llvm-omp-device-info | grep "Device Name"
Device Name gfx90a
Device Name gfx90a
Device Name gfx90a
Device Name gfx90a
Device Name gfx90a
Device Name gfx90a
Device Name gfx90a
Device Name gfx90a
Compile and run:
clang -O3 -fopenmp --offload-arch=gfx90a test.c
./a.out
Device 0: 440 teams of 256 threads
Device 1: 440 teams of 256 threads
Device 2: 440 teams of 256 threads
Device 3: 440 teams of 256 threads
Device 4: 440 teams of 256 threads
Device 5: 440 teams of 256 threads
Device 6: 440 teams of 256 threads
Device 7: 440 teams of 256 threads
If you have different device types in your system, make sure llvm-omp-device-info recognizes them; see also our FAQ (openmp.llvm.org). Then pass all subarchitectures to --offload-arch, e.g., sm_80,gfx906.
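For example, on a machine that has both an NVIDIA A100 and an MI250X (just an assumed combination; adjust the architectures to your GPUs):
clang -O3 -fopenmp --offload-arch=sm_80,gfx90a test.c
./a.out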
Can’t say enough thanks for the head start. Let me run and understand this.