Runtime error when executing multiple target regions within a target data region

Hi all,

we are seeing strange errors when we try to launch multiple target regions within a single data region; see the attached code. The result is similar when we use unstructured data mapping (target enter/exit data). We are using clang built from trunk this week.

When we map the data for each iteration instead (as in line 23 of the attached file), the whole code runs to completion. When we use a larger value for TEAMS, the execution falls back to the host in an earlier iteration (for 1024 teams it happens in the second iteration instead of the 7th as shown below).

So there seems to be an issue with the allocation of teams when the data region stays open. Any ideas how this can be fixed?
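
Since the test is attached rather than inlined, here is a rough sketch of the pattern I described above; the array names, sizes and exact clauses are placeholders and not a copy of target-test.c:

#include <omp.h>
#include <stdio.h>
#include <stdlib.h>

#define N     (64 * 1024 * 1024)   /* placeholder problem size */
#define TEAMS 256

int main(void)
{
    double *a = malloc(N * sizeof(double));
    double *b = malloc(N * sizeof(double));
    for (int i = 0; i < N; ++i) { a[i] = i; b[i] = 1.0; }

    /* The data region stays open across all iterations. */
    #pragma omp target data map(tofrom: a[0:N]) map(to: b[0:N])
    {
        for (int it = 0; it < 10; ++it) {
            /* No map clause here; mapping a/b per iteration instead
             * makes the whole code run through. */
            #pragma omp target teams distribute parallel for \
                    num_teams(TEAMS) thread_limit(992)
            for (int i = 0; i < N; ++i) {
                if (i == 0)
                    printf("%d, %d, %d\n", omp_get_num_teams(),
                           omp_get_thread_limit(), omp_is_initial_device());
                a[i] += b[i];
            }
            printf("%d\n", it);
        }
    }

    free(a);
    free(b);
    return 0;
}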

Best,
Joachim

Output when running the attached code (num_teams, thread_limit, is_initial_device):

256, 992, 0
0
256, 992, 0
1
256, 992, 0
2
256, 992, 0
3
256, 992, 0
4
256, 992, 0
5
256, 992, 0
OMP: Warning #96: Cannot form a team with 256 threads, using 48 instead.
OMP: Hint Consider unsetting KMP_DEVICE_THREAD_LIMIT (KMP_ALL_THREADS), KMP_TEAMS_THREAD_LIMIT, and OMP_THREAD_LIMIT (if any are set).
48, 2147483647, 1
6
48, 2147483647, 1
7
48, 2147483647, 1
8
48, 2147483647, 1
9

target-test.c (1.01 KB)

Hi Joachim,

from internal discussion I know that you are targeting NVPTX and not x86_64 (aka "host offloading"), which is important information in this case.

The arrays a and b are not mapped in the target region, so the following rule from OpenMP 4.5, section 2.15.5 (page 215) applies:

A variable that is of type pointer is treated as if it had appeared in a map clause as a zero-length array section.

However, Clang currently doesn't seem to do that; manually adding map(a[0:0], b[0:0]) leads to the expected output:
256, 992, 0
0
256, 992, 0
1
256, 992, 0
2
256, 992, 0
3
256, 992, 0
4
256, 992, 0
5
256, 992, 0
6
256, 992, 0
7
256, 992, 0
8
256, 992, 0
9
So the application really was executing the fallback code on the host before.
(Note that your commented-out combined directive works fine with the standalone data directives because it actually has the map clause.)
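
For reference, here is a rough sketch of the two variants that produce the expected output; array names, sizes and the exact clauses are assumptions on my side, not copied from the attached test:

#include <omp.h>

#define N     (1 << 20)   /* assumed problem size */
#define TEAMS 256

/* Variant 1: separate directives, with the implicit zero-length pointer
 * mapping spelled out explicitly on the target teams construct. */
void step_explicit_zero_length(double *a, double *b)
{
    #pragma omp target teams map(a[0:0], b[0:0]) num_teams(TEAMS) thread_limit(992)
    #pragma omp distribute parallel for
    for (int i = 0; i < N; ++i)
        a[i] += b[i];
}

/* Variant 2: the combined directive with a full map clause; inside the
 * enclosing target data region it just reuses the existing device copies. */
void step_combined(double *a, double *b)
{
    #pragma omp target teams distribute parallel for \
            map(tofrom: a[0:N]) map(to: b[0:N]) num_teams(TEAMS) thread_limit(992)
    for (int i = 0; i < N; ++i)
        a[i] += b[i];
}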

Follow-up questions:
1) Why doesn't the fallback printf as well?
2) Why does libomp have difficulties spawning teams after a while?

Cheers,
Jonas

The arrays a and b are not mapped in the target region

I am not an expert in the offloading constructs, but what I see from the spec is that the arrays
should be mapped in all target regions because of the outer "target data" region.
Correct me if I'm wrong here.

Why does libomp have difficulties spawning teams after a while?

I doubt we can create 256K threads, so the warning about this looks acceptable.
We also have an internal limit on the number of teams created: the available number of procs.
This can be overridden by the KMP_TEAMS_THREAD_LIMIT environment variable.
Though it is unclear to me why those zillions of threads are needed. Even if we were able
to create 256K threads on 48 procs, that is more than 5400 threads per proc,
so the huge oversubscription would cause awful performance for the test.

And the syntax of the test looks a bit broken to me, because "distribute", "parallel for"
or "distribute parallel for" should be followed by a loop, while here it is followed by
a compound statement.

Regards,
Andrey

The arrays a and b are not mapped in the target region

I am not an expert in the offloading constructs, but what I see from the spec is that the arrays
should be mapped in all target regions because of the outer "target data" region.
Correct me if I'm wrong here.

Why does libomp have difficulties spawning teams after a while?

I doubt we can create 256K threads, so the warning about this looks acceptable.
We also have an internal limit on the number of teams created: the available number of procs.
This can be overridden by the KMP_TEAMS_THREAD_LIMIT environment variable.
Though it is unclear to me why those zillions of threads are needed. Even if we were able
to create 256K threads on 48 procs, that is more than 5400 threads per proc,
so the huge oversubscription would cause awful performance for the test.

As Jonas already added, I am trying to offload to a GPU. To saturate the pipeline, I split the work into small pieces.

I was not complaining about the message coming from the host runtime when falling back onto the host. The problem is that the execution falls back to the host at all.

And the syntax of the test looks a bit broken to me, because "distribute", "parallel for"
or "distribute parallel for" should be followed by a loop, while here it is followed by
a compound statement.

You are right; for correct OpenMP code I should remove the extra curly brackets around the for loop.
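
For completeness, this is roughly what the corrected inner region would look like (identifiers are assumptions, not a copy of the attached test):

/* The for loop must be the statement immediately associated with the
 * combined loop directive, so the extra compound-statement braces go away. */
void step(double *a, double *b, int n)
{
    #pragma omp target teams distribute parallel for map(tofrom: a[0:n]) map(to: b[0:n])
    for (int i = 0; i < n; ++i)
        a[i] += b[i];
}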

Hi Alexey,

I pulled and rebuilt everything. I still see the same issue.

I am trying to offload to a Tesla P100 SXM2 16GB.
Also, we built everything with CUDA 9.1.

I did some experiments on the number of teams: 57 teams and more run into the issue, while with 56 teams the code runs through. We checked the documentation, and this specific card has 56 SMs, so starting with 57 teams some SMs need to be reused.

The execution with 57 teams shows a big variation in how many target regions complete successfully: I have seen 66 successful iterations before falling back to the host, but also runs where already the 19th iteration fell back.

-> This random behavior in particular suggests some kind of data race.

When the num_teams and thread_limit clauses are removed, the runtime chooses 128 teams and 96 threads. With that number of teams the execution reliably falls back during the 10th iteration.
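
As a possible interim workaround (just a sketch, not something we use; the CUDA runtime query and the helper name are only illustrative), one could cap num_teams at the SM count reported by the device, since runs with at most 56 teams did not show the failure:

#include <cuda_runtime.h>   /* needs -lcudart in addition to the OpenMP offload flags */

/* Query the number of multiprocessors of the given CUDA device. */
static int device_sm_count(int cuda_device)
{
    int sm = 1;
    if (cudaDeviceGetAttribute(&sm, cudaDevAttrMultiProcessorCount,
                               cuda_device) != cudaSuccess)
        sm = 1;   /* conservative fallback if the query fails */
    return sm;
}

void step_capped(double *a, double *b, int n)
{
    int teams = device_sm_count(0);   /* device 0, the default offload device here */
    #pragma omp target teams distribute parallel for \
            map(tofrom: a[0:n]) map(to: b[0:n]) num_teams(teams)
    for (int i = 0; i < n; ++i)
        a[i] += b[i];
}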

This is the libomptarget debug output; you can see the print from the target region (I changed the condition so that the last iteration prints) and then the error messages:

64, 992, 0
Target CUDA RTL --> Kernel execution error at 0x0000000000d93910!
Target CUDA RTL --> CUDA error(700) is: an illegal memory access was encountered
Libomptarget --> Looking up mapping(HstPtrBegin=0x00007fda4a068010, Size=0)...
Libomptarget --> Mapping exists with HstPtrBegin=0x00007fda4a068010, TgtPtrBegin=0x00007fd988000000, Size=0, updated RefCount=1
Libomptarget --> There are 0 bytes allocated at target address 0x00007fd988000000 - is not last
Libomptarget --> Looking up mapping(HstPtrBegin=0x00007fda4541c010, Size=0)...
Libomptarget --> Mapping exists with HstPtrBegin=0x00007fda4541c010, TgtPtrBegin=0x00007fd982000000, Size=0, updated RefCount=1
Libomptarget --> There are 0 bytes allocated at target address 0x00007fd982000000 - is not last
OMP: Warning #96: Cannot form a team with 64 threads, using 48 instead.
OMP: Hint Consider unsetting KMP_DEVICE_THREAD_LIMIT (KMP_ALL_THREADS), KMP_TEAMS_THREAD_LIMIT, and OMP_THREAD_LIMIT (if any are set).
48, 2147483647, 1
Libomptarget --> Unloading target library!
Libomptarget --> Image 0x00000000006020c0 is compatible with RTL 0x000000000063f8b0!
Libomptarget --> Unregistered image 0x00000000006020c0 from RTL 0x000000000063f8b0!
Libomptarget --> Done unregistering images!
Libomptarget --> Removing translation table for descriptor 0x000000000062ba70
Libomptarget --> Done unregistering library!
Target CUDA RTL --> Error when unloading CUDA module
Target CUDA RTL --> CUDA error(700) is: an illegal memory access was encountered

Best
Joachim

Hi Alexey,

I just realized that the pull from master had failed for the openmp repository. I rebuilt everything again and it works now.

Thanks
Joachim