#pragma omp target teams architecture

Dear all,

I have a question about how #pragma omp target teams is implemented in LLVM. On the one hand, __tgt_rtl_run_target_team_region() in omptargetplugin.h has a NumTeams parameter. When I write

#pragma omp target
#pragma omp teams num_teams(2)

to create two teams, the value 2 is passed to __tgt_rtl_run_target_team_region() at runtime, so at first I thought it was the task of the plugin or device RTL to create the given number of teams on the device.
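
For reference, the plugin entry point I am referring to has roughly the following shape in omptargetplugin.h (quoted from memory, so the parameter names may differ slightly in the actual header):

int32_t __tgt_rtl_run_target_team_region(int32_t DeviceId, void *TgtEntryPtr,
                                         void **TgtArgs, ptrdiff_t *TgtOffsets,
                                         int32_t ArgNum, int32_t NumTeams,
                                         int32_t ThreadLimit,
                                         uint64_t LoopTripCount);

With num_teams(2), libomptarget passes NumTeams = 2 to this function.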

However, in the generated IR, I see that the OpenMP offloading function itself creates the desired number of teams via __kmpc_fork_teams().
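
In pseudo-C, what I see in the outlined offloading function looks roughly like this (a simplified sketch, not the exact generated code; the real symbol name also encodes file and device IDs, and loc, gtid, and teams_outlined_fn stand for the usual ident_t argument, the global thread id, and the outlined teams body):

void __omp_offloading_main_l3(void) {
  // Push the requested team count, then fork the teams; each team then
  // executes the outlined teams region.
  __kmpc_push_num_teams(&loc, gtid, /*num_teams=*/2, /*num_threads=*/0);
  __kmpc_fork_teams(&loc, /*argc=*/0, (kmpc_micro)teams_outlined_fn);
}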

Now, my question is what the idea behind the NumTeams parameter in __tgt_rtl_run_target_team_region() is. Is it simply an “informative” value, so that something on the device can be prepared for the fork that is later done explicitly in the code? Or is the parameter also intended, in certain situations, to create the given number of teams without a later explicit fork in the started code? As far as I can see at the moment, the NumTeams parameter would not be strictly necessary.

Any hints on these questions would be greatly appreciated.

__kmpc_fork_teams is used for the host fallback.
__tgt_rtl_run_target_team_region is used for the device execution.
The 2 is used independently by both.
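
To illustrate with the example above (a conceptual sketch only; __tgt_target_teams is the older libomptarget entry point and the details differ across versions):

// Emitted in the host code, conceptually:
int rc = __tgt_target_teams(device_id, host_region_id, arg_num, args_base,
                            args, arg_sizes, arg_types,
                            /*num_teams=*/2, /*thread_limit=*/0);
if (rc != 0) {
  // Offloading failed: host fallback. The host version of the outlined
  // target function is called instead, and that one realizes the 2 via
  // __kmpc_push_num_teams()/__kmpc_fork_teams().
  __omp_offloading_main_l3();
}

// On the device path, libomptarget forwards the 2 to the plugin as the
// NumTeams argument of __tgt_rtl_run_target_team_region().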

Does that clear things up?

Thanks for your answer; however, I don’t think I completely understand what’s happening here yet. I compiled the following file test.cpp

int main (void)
{
  #pragma omp target teams num_teams(2)
  {
    float x = 1.0f;
  }
  
  return 0;
}

and looked at the generated IR for the host and the device, i.e., test-host-<triple>.bc and test-openmp-<triple>.bc. In both files, not only in the host one, the outlined target region function __omp_offloading_10306_969079_main_l3() contains a call to __kmpc_fork_teams(), which starts the actual code.
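
(For anyone who wants to reproduce this, something along the following lines works; the exact flags and resulting file names depend on the clang version and the offload triple:)

clang++ -fopenmp -fopenmp-targets=<triple> -save-temps -c test.cpp
llvm-dis test-host-<triple>.bc test-openmp-<triple>.bc
grep __kmpc_fork_teams test-host-<triple>.ll test-openmp-<triple>.ll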

Hence, if the plugin started two teams of threads on the device to execute the target region function, each of those teams would reach __kmpc_fork_teams() and create further teams, which, as far as I can see, would be inconsistent with the desired behavior. Or maybe I am misunderstanding something here.

My observation holds for our own accelerator as offload target, but also for generic x86_64 ELF offloading as target. The latter’s plugin ignores NumTeams and starts the target region function single-threaded, so the fork happens via the generated IR code. My question is whether this is due to special handling of that particular offload target, i.e., whether the IR is generated differently than normally intended because it is known that the plugin will ignore the NumTeams parameter.
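
For illustration, the x86_64 plugin behaves conceptually like this inside __tgt_rtl_run_target_team_region() (a simplified sketch of what I understand; the real code in openmp/libomptarget/plugins/generic-elf-64bit marshals the arguments through libffi and does more bookkeeping):

  // NumTeams and ThreadLimit are simply not used: the target entry point is
  // invoked once, single-threaded, and the teams are created later by the
  // __kmpc_fork_teams() call inside the generated code.
  auto *TargetFn = reinterpret_cast<void (*)()>(TgtEntryPtr);
  TargetFn();
  return OFFLOAD_SUCCESS;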

On the other hand, did we do something wrong when we integrated our hardware accelerator as an offload target, since for that target the __kmpc_fork_teams() call is also generated in the offload function for the device? Or is it possible that the __kmpc_fork_teams() call is supposed to be caught by some kind of stub library so that it has no effect?

I don’t know what your triple is, but what happens is that you are looking at two files generated for “host targets”. There is no __kmpc_fork_teams on anything that is not “the host”, at least not in the sense of a CPU using libomp.so. Our actual device runtime (openmp/libomptarget/DeviceRTL) does not provide this entry point. The device runtime is only used if the frontend determines the target is “a (virtual) gpu”, though. However, there is nothing preventing your new target from using the “gpu” code path.
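
To make the contrast concrete: on a “(virtual) gpu” target the offloaded entry point is itself the kernel that the plugin launches with NumTeams teams, so there is no fork inside it. Roughly (a sketch only; the actual setup calls from openmp/libomptarget/DeviceRTL and their signatures have changed over time):

// Launched by the plugin with NumTeams teams; no __kmpc_fork_teams() inside.
extern "C" void __omp_offloading_main_l3(void) {
  // kernel setup via the DeviceRTL (e.g. __kmpc_target_init)
  float x = 1.0f; // team-local body of the target teams region
  // kernel teardown (e.g. __kmpc_target_deinit)
}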

You are reusing the host offload code path for your accelerator, and that one is “different” from the device offload code path. It does, as you noted, ignore NumTeams in the __tgt_.. call and instead handles the team count later. This is neither necessary nor well designed; it grew that way because we didn’t have “teams” on the host in the beginning, and things were simply added as they became available. The right solution is to move __kmpc_fork_teams into the host offloading parts and use the NumTeams passed to __tgt_..., assuming you want to stick with the host offloading parts. We can/should do that for upstream host offloading as well.
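
As a rough sketch of that direction (purely illustrative, not existing upstream code; launch_team is a hypothetical helper), the host-style plugin would realize the team count itself from the NumTeams it already receives, and the generated device code would then no longer need the __kmpc_fork_teams() call:

int32_t __tgt_rtl_run_target_team_region(int32_t DeviceId, void *TgtEntryPtr,
                                         void **TgtArgs, ptrdiff_t *TgtOffsets,
                                         int32_t ArgNum, int32_t NumTeams,
                                         int32_t ThreadLimit,
                                         uint64_t LoopTripCount) {
  // Create NumTeams teams in the plugin/runtime instead of relying on a
  // fork inside the offloaded code.
  for (int32_t Team = 0; Team < NumTeams; ++Team)
    launch_team(DeviceId, TgtEntryPtr, TgtArgs, TgtOffsets, ArgNum, Team,
                ThreadLimit); // hypothetical helper
  return OFFLOAD_SUCCESS;
}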

Thanks a lot for the explanations, I was not aware of these details, and the explanations help me a lot! I found the corresponding code which switches between host and device OpenMP code generation and will investigate this.