nested parallelism in libomptarget-nvptx

Hi all,

I've started some cleanups in libomptarget-nvptx, the OpenMP runtime implementation on Nvidia GPUs. The ultimate motivation is reducing the memory overhead: at the moment the runtime statically allocates ~660 MiB of global memory, which then can't be used by applications. That might not sound like much, but wasting precious device memory doesn't seem wise.
I found that 448 MiB of this comes from buffers for data sharing. In particular, they are so large because the code is prepared to handle nested parallelism, where every thread would need to share data with its own nested worker threads.
From what I've seen so far, this doesn't seem to be necessary for Clang trunk: nested parallel regions are serialized, so only the initial thread needs to share data with a single set of worker threads. That's in line with comments in the code saying that there is no support for nested parallelism.

However, I found that my test applications compiled with clang-ykt support two levels of parallelism. My guess is that this is related to "convergent parallelism": parallel.cu explains that this is meant for a "team of threads in a warp only". And indeed, each nested parallel region seems to be executed by 32 threads.
I'm not really sure how this works because I seem to get one OpenMP thread per CUDA thread in the outer parallel region. So where are the nested worker threads coming from?
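For illustration, a probe along these lines (an untested sketch of mine, not something from the runtime) makes the difference visible: with trunk the inner region should report a single thread because it is serialized, while with clang-ykt I see 32.

#include <omp.h>
#include <stdio.h>

int main(void) {
  #pragma omp target teams num_teams(1)
  #pragma omp parallel                  // outer parallel region
  {
    #pragma omp parallel                // nested parallel region
    {
      #pragma omp master                // one report per inner region
      printf("inner parallel region has %d thread(s)\n",
             omp_get_num_threads());
    }
  }
  return 0;
}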

In any case: If my analysis is correct, I'd like to propose adding a CMake flag which disables this (seemingly) legacy support [1]. That would avoid the memory overhead for users of Clang trunk and enable future optimizations (I think).
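To sketch what I have in mind (all identifiers below are made up for illustration and are not the actual libomptarget-nvptx names), the flag would essentially guard how the data-sharing buffers are sized:

// Placeholder sizes and types, for illustration only.
#define MAX_TEAMS 1024
#define MAX_THREADS_PER_TEAM 1024
struct DataSharingSlot { char Buffer[256]; }; // stand-in for the real slot type

#ifdef OMPTARGET_NVPTX_NESTED_PARALLELISM // hypothetical macro set by the CMake flag
// Every thread may become a master of nested workers, so every thread needs
// its own data-sharing slot.
__device__ DataSharingSlot Slots[MAX_TEAMS][MAX_THREADS_PER_TEAM];
#else
// Nested regions are serialized: only the initial thread of each team shares
// data with its workers.
__device__ DataSharingSlot Slots[MAX_TEAMS];
#endif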
Thoughts, opinions?

Cheers,
Jonas

[1] Provided that IBM still wants to keep the code and we can't just go ahead and drop it. I guess that this can happen at some point in time, but I'm not sure if we are in that position right now.

Hi Jonas,

The second level of parallelism in clang-ykt uses a scheme where all the threads in each warp cooperate to execute the workload of the 1st thread in the warp, then the 2nd, and so on, until the workload of each of the 32 threads in the warp has been completed. The workload of each thread is always executed by the full warp.
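Roughly, the scheme looks like the following sketch (an illustration in CUDA-style pseudocode, not the actual parallel.cu implementation):

// Stand-in for the body of the user's nested parallel region: logical_master
// is the warp lane whose region is currently being executed, worker_id is this
// lane's position among the 32 nested workers.
__device__ void nested_parallel_region(int logical_master, int worker_id) {
  // ... user code ...
}

__device__ void run_second_level() {
  const int lane = threadIdx.x % 32;      // this CUDA thread's lane within its warp
  for (int turn = 0; turn < 32; ++turn) { // lane 0's workload first, then lane 1's, ...
    // The full warp executes the nested region encountered by lane `turn`;
    // each lane acts as one of its 32 workers, so the nested region sees
    // 32 OpenMP threads.
    nested_parallel_region(/*logical_master=*/turn, /*worker_id=*/lane);
    __syncwarp();                         // keep the warp converged between turns
  }
}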
You are correct: in trunk the additional memory that this scheme uses is not required. For now we would like to keep this functionality in place, so it would be good if you could hide it behind a flag. This will allow us to easily drop it in the future.

Thanks a lot,

–Doru

Hi, Doru,

What do you think we should do, upstream, for nested parallelism? Would it be desirable to have a clang-ykt-like scheme? Something else?

Thanks again,

Hal

Hi Hal,

At least as far as we are aware, the number of use cases for the nested parallelism scheme is quite small. Most of the use cases of OpenMP on GPUs have a single level of parallelism, which is typically SPMD-like, to achieve as much performance as possible. That said, there is some merit to having a nested parallelism scheme, because when it is helpful it typically is very helpful.

As a departure from ykt-clang, I would suggest that whichever scheme (or schemes) we decide to use should be applied only at the request of the user. This is because we can do a better code generation job for more OpenMP patterns when using the existing schemes (generic and SPMD) if we know at compile time that there will be no second level of parallelism in use. This is due to some changes in implementation in trunk compared to ykt-clang.

Regarding which scheme to use, two were floated based on discussions with users: (1) the current scheme in ykt-clang, which enables the code in both the inner and the outer parallel loops to be executed in parallel, and (2) a scheme where the outer loop code is executed by one thread and the innermost loop is executed by all threads (this was requested by users at one point; I assume this is still the case).
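To make the two options concrete, here is my rough reading in OpenMP form (illustrative only; work(), N and M are placeholders):

// Source pattern both schemes apply to: a nested parallel loop inside a
// target region.
#pragma omp target teams distribute parallel for
for (int i = 0; i < N; ++i) {       // outer parallel loop
  #pragma omp parallel for
  for (int j = 0; j < M; ++j)       // inner, i.e. nested, parallel loop
    work(i, j);
}

// Scheme (1), today's ykt-clang: the outer loop is spread over all threads of
// the team, and each inner loop is then executed warp-wise by 32 threads.
// Scheme (2), as requested by users: the outer loop body runs on one thread
// per team, and the inner loop fans out to all threads of the team.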

Since ykt-clang only supports the first scheme, when we ran performance tests comparing nested parallelism against no nested parallelism we got anywhere from a 4x slowdown to a 32x speedup, depending on the ratio of outer to inner iterations, the work size in the innermost loop, reductions, atomics, and memory coalescing. About 80% of the cases we tried showed speed-ups, with some showing significant speed-ups.
I would very much be in favour of having at least this scheme supported since it looks like it could be useful.

In terms of timing, we are still tied up with upstreaming at the moment so we won’t be attempting a new code generation scheme until we are feature complete on the current ones.

Thanks,

–Doru

Hi, Doru,

Thanks for explaining. I think that your suggestion of putting this
behind a flag makes a lot of sense. It sounds as though, later, we might
want different user-selectable schemes (although we might want to use
pragmas instead of command-line flags at that point?).

-Hal

Hi Doru,

Hi Hal,

At least as far as we are aware, the number of use cases where the
nested parallel scheme would be used is quite small. Most of the use
cases of OpenMP on GPUs have a single level of parallelism which is
typically SPMD-like to achieve as much performance as possible. That
said there is some merit to having a nested parallelism scheme because
when it is helpful it typically is very helpful.

As a novelty point to ykt-clang I would suggest that whichever scheme
(or schemes) we decide to use, they should be applied only at the
request of the user. This is because we can do a better code gen job
for more OpenMP patterns when using existing schemes (generic and
SPMD) if we know at compile time if there will be no second level
parallelism in use. This is due to some changes in implementation in
trunk compared to ykt-clang.

I agree: Even then we may be able to restructure the application to be more performant and portable without nested parallelism.
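For instance (an illustration of mine, not something from the applications discussed here), two nested loops can often be collapsed so that a single level of parallelism covers both iteration spaces:

// work(), N and M are placeholders.
#pragma omp target teams distribute parallel for collapse(2)
for (int i = 0; i < N; ++i)
  for (int j = 0; j < M; ++j)
    work(i, j);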

Regarding which scheme to use there were two which were floated around
based on discussions with users: (1) the current scheme in ykt-clang
which enables the code in both inner and outer parallel loops to be
executed in parallel and (2) a scheme where the outer loop code is
executed by one thread and the innermost loop is executed by all
threads (this was requested by users at one point, I assume this is
still the case).

Since ykt-clang only supports the first scheme when we ran performance
tests comparing nested parallelism against no nested parallelism we
got anywhere from 4x slowdown to 32x speedup depending on the: ratio
of outer:inner iterations, the work size in the innermost loop,
reductions, atomics and memory coalescing. About 80% of the number of
cases we tried showed speed-ups with some showing significant
speed-ups.
I would very much be in favour of having at least this scheme
supported since it looks like it could be useful.

Interesting. Are these experiments public? I'd be interested to see the codes that benefit from nested parallelism.
IIRC OpenACC doesn't have this feature, so I expect these to be corner cases.

Regards,
Jonas

Hi Hal,

At the extreme this might also mean having multiple runtime implementations. Without nested parallelism, many per-thread data structures can be removed (see my initial motivation): there will only be data sharing at the first level, so there's no need for buffers per worker thread, and so on.
Maybe (not there yet) this will make per-team data structures small enough to fit into shared memory, instead of having queues in global memory that need atomics (see state-queue{,i}.h). For SimpleThreadPrivateContext this seems to reduce the kernel execution time of an empty SPMD construct with 8192 teams (read: its overhead) from ~20us to ~14.5us. This might become noticeable for very small kernels (for example, a single axpy / xpay in a conjugate gradient solver with 1391349 elements takes around 60us with OpenACC, if I interpret my old measurements correctly).
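As a very rough sketch of the idea (hypothetical types and names, not the current sources): if the per-team state shrinks enough, it can live in __shared__ memory instead of being fetched from an atomics-guarded global queue.

// Trimmed-down per-team descriptor; fields are placeholders.
struct TeamState {
  int level;        // current parallel nesting level
  int num_threads;  // threads in the active region
};

__global__ void team_kernel() {
  __shared__ TeamState team_state; // one instance per CUDA block (= OpenMP team)
  if (threadIdx.x == 0) {          // the team master initializes it
    team_state.level = 0;
    team_state.num_threads = blockDim.x;
  }
  __syncthreads();                 // make it visible to all threads of the team
  // ... the team's work would read/update team_state from here on ...
}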

Regards,
Jonas

Hi Jonas,

The experiments are in this paper:
https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=8287767

In case you can’t access the paper I’m attaching it here.

Thanks,

–Doru

ForkJoinPaper.pdf (824 KB)

Hi Doru,

thanks for the link. However, I don't think these experiments looked at "nested parallelism", at least not as I understand it.
According to the OpenMP standard, the following snippet shows two parallel regions, the second of which is "nested":
#pragma omp parallel // (1)
{
   #pragma omp parallel // (2)
   { }
}
Assuming that (1) is the outermost parallel region (i.e. it's not nested in yet another parallel region), it will probably be "active", meaning that it's executed by more than one thread. For (2) the implementation can decide whether it supports "nested parallelism" or whether it serializes the parallel region (making it "inactive" in OpenMP's terminology).

For comparison the paper you linked evaluates implementations for something like the following:
#pragma omp target teams // (1)
{
   #pragma omp parallel // (2)
   { }
}
This is different in that (2) is the first "parallel" region on the device (even though it may be the second level of parallelism when mapped onto a GPU). From my understanding, Clang trunk already handles this, and I agree that this use case is important (see the performance comparison in the linked paper).

Back to my original question: Do we need to support "nested parallelism"?
#pragma omp target teams // (1)
#pragma omp parallel // (2)
#pragma omp parallel // (3)
{ }
This would be a third level of parallelism when executing on a GPU and would require data sharing from worker threads of (2) (master threads of (3)) to worker threads of (3).
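To make the data-sharing requirement concrete (an illustrative fragment of mine; use() is a placeholder):

#pragma omp target teams         // (1)
#pragma omp parallel             // (2)
{
   int x = omp_get_thread_num(); // x lives in a worker thread of (2)
   #pragma omp parallel          // (3)
   {
      use(x); // x must be shared from the worker of (2), which acts as the
              // master of (3), to all worker threads of (3)
   }
}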

Thanks,
Jonas

Hi Jonas,

Very legitimate observations but I think there may be a misunderstanding:

The pattern given as an example in the paper is:

#teams
{
    #parallel
}

This is actually just the first level of parallelism. The code in between the teams and the parallel directives is considered “sequential” since only one thread per team executes it.

The kernels that have been evaluated actually contain either the parallel+parallel or the parallel+simd patterns.

The other example you give is:

#teams
{
    #parallel
    {
        #parallel
        { }
    }
}

The innermost parallel would use the 2nd level of parallelism (or nested parallelism). You can also add this pattern to that list:

#teams
{
    #parallel
    {
        #simd
        { }
    }
}

More level-2 patterns:

target teams parallel
    parallel

or

target teams parallel
    simd

I hope this clarifies the experiments.

Thanks,

–Doru

Hi Jonas,

Very legitimate observations but I think there may be a
misunderstanding:

The pattern given as an example in the paper is:

#teams
{
    #parallel
}

This is actually just the first level of parallelism. The code in
between the parallel and the team directives is considered
"sequential" since only one thread per team executes that.

The kernels that have been evaluated actually contain either the
parallel+parallel or the parallel+simd patterns.

Hi Doru,

I've now carefully read through the experiments and I can't find this. In particular, section VI.d) says:
"Our second version uses the target teams distribute directive to exploit outer parallelism across teams and the parallel for directive on an inner loop to exploit nested parallelism within a team."

Jonas

P.S.: Maybe you can share the source code of a benchmark that uses nested parallel regions / 3 levels of parallelism?

Hi Jonas,

When it is safe to do so, “target teams distribute” would use all threads in the team, so when the “parallel for” is encountered, the second level of parallelism is activated. Any further nestings of “parallel for” or “simd” directives would be sequential.

Thanks,

–Doru

Hi Doru,

I'm getting more and more confused by your seemingly contradictory answers. In your first reply you wrote:

The second level of parallelism in clang-ykt uses a scheme where all
the threads in each warp cooperate to execute the workload of the 1st
thread in the warp then the 2nd and so on until the workload of each
of the 32 threads in the warp has been completed.

Accordingly, if you compile the following with clang-ykt:
#pragma omp teams
#pragma omp parallel
{
   /* ... */ = omp_get_num_threads();
}
the API call will return 32 because it's using "convergent parallelism" (see top of parallel.cu).

Hal's question was whether that makes sense to implement in Clang trunk, to which you answered:

Since ykt-clang only supports the fist scheme when we ran performance
tests comparing nested parallelism against no nested parallelism we
got anywhere from 4x slowdown to 32x speedup [...]

quoting a paper where you now seem to agree that the experiments only dealt with a single parallel in a teams construct.

I'm sure you know what you are writing about, but the loose ends don't match up for me right now.

Jonas

Hi Jonas,

The experiments in the paper that are under the nested parallelism section really do use the nested parallelism scheme. “teams distribute” activated all the threads in the team.

Nested parallelism is activated every time you have an outer region with all threads active, calling an inner region that needs to have all threads active. No matter which directives you assign the second level parallelism to, the scheme for it will use the warp-wise execution.

If you have:

# target teams distribute
{
    // all threads active

    # parallel for
    {
        // all threads active - this uses nested parallelism since it was
        // called from a region where all threads were active
    }
}

# target teams distribute
{
    // one thread per team active

    # parallel for
    {
        // all threads active

        # parallel for
        {
            // all threads active - this uses nested parallelism since it was
            // called from a region where all threads are active
        }
    }
}

Is this clearer?

Thanks,

–Doru

Hi Doru,

Hi Jonas,

The experiments in the paper that are under the nested parallelism
section really do use the nested parallelism scheme. "teams
distribute" activated all the threads in the team.

I disagree: Only the team master executes the loop body of a "teams distribute" region. CUDA activates all (CUDA) threads at kernel launch, but that's really not the point.

Nested parallelism is activated every time you have an outer region
with all threads active, calling an inner region that needs to have
all threads active. No matter which directives you assign the second
level parallelism to, the scheme for it will use the warp-wise
execution.

If you have:

#target teams distribute
{
    // all threads active

This looks like an error? It's the same directive as below, but exhibits a different behavior?

    # parallel for
    {
        // all threads active - this uses nested parallelism since it
        // was called from a region where all threads were active
    }
}

# target teams distribute
{
     // one thread per team active
     # parallel for
     {
        // all threads active
        # parallel for
        {
            // all threads active - this uses nested parallelism since
            // it was called from a region where all threads are active
        }
     }
}

That's the pattern I'm looking for. Can you link me to a benchmark that uses this scheme?

Jonas

Hi Jonas,

You have to remember that clang-ykt can decide to use more efficient code generation schemes when it deems it safe to do so, for example SPMD mode versus generic mode. Generic mode will use the master-worker scheme, whereas SPMD mode will just have all threads do the same thing, thus avoiding the master-worker scheme completely.

The activation of all threads in those two regions was regarded as an optimization. It is always safe to activate all threads if the code in the teams-distribute-only region does not contain side effects. For example, if all you're doing is declaring some local variables, you can go ahead and run fully parallel. This was kind of like an SPMD-ization of the nested parallelism code. Doing it this way is a lot faster, since you don't have to use the master-worker scheme for the first level of parallelism, which has an overhead that the experiments aim to avoid.
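For illustration (my own example, not one of the benchmarked kernels): a case where the teams-distribute-only part has no side effects, so all threads can safely execute it redundantly before the nested “parallel for” runs warp-wise:

// in, out, N and M are placeholders.
#pragma omp target teams distribute
for (int i = 0; i < N; ++i) {
  double scale = 1.0 / (i + 1);    // local declaration only: no side effects,
                                   // so every thread can run this redundantly
  #pragma omp parallel for
  for (int j = 0; j < M; ++j)
    out[i * M + j] = scale * in[i * M + j];
}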

In your second comment you are now circling back to exactly the point I made at the start of my first e-mail, when I was talking about the limited number of use cases for nested parallelism. As for the pattern you're really asking for (with the separate teams distribute), I don't have any benchmarks to suggest (which doesn't mean that someone somewhere doesn't have one).

Remember that you can combine directives, so there's no need to have a separate teams distribute. These patterns are far more common:

#pragma omp target teams distribute parallel for
{
    // all threads active

    # parallel for
    {
        // all threads active - second level parallelism
    }
}

or like this:

#pragma omp target teams distribute parallel for
{
    // all threads active

    # simd
    {
        // all threads active - second level parallelism
    }
}

Thanks,

–Doru

So if they are common, do you have benchmarks that use them? Is it possible to make some of the codes public, please?

Jonas