Re-scheduling implicit target tasks for non-blocking nowait execution
TL;DR
The current target nowait support in LLVM may limit the number of target regions that can be executed concurrently: we can only execute as many "target tasks" as we have hidden helper threads. Although this is not a problem for most GPU applications, other offloading devices allow multiple (hundreds of) target regions to be executed concurrently. To address that, this post proposes splitting target nowait regions into two stages: a dispatch stage and a non-blocking synchronization stage. The dispatch stage sets up all the device operations needed to run a target region asynchronously, while the synchronization stage allows the OpenMP runtime to re-enqueue the task associated with the target region while any operation is still pending, doing any post-processing afterward (e.g., device data map updates in libomptarget, task dependency resolution in OpenMP).
Introduction
For some time now, the OpenMP implementation inside the LLVM project has allowed users to offload target regions to accelerators using the full power of tasks and dependencies through the nowait and depend clauses. An application may decompose its problem into tasks that are manually mapped to, for example, different GPUs, letting the OpenMP runtime decide when each region may be executed (using LLVM's task dependency graph and stealing algorithm). As far as I know, LLVM's libomptarget library was primarily focused on offloading to GPUs, and the current "blocking" model of executing target nowait regions is enough for that use case. But new plugins are being added to LLVM where the program is no longer limited to offloading to just a handful of devices (e.g., 8 GPUs), but to hundreds of them (e.g., the remote plugin and my research team's work, OmpCluster). Moreover, some devices may allow multiple regions to be executed concurrently inside a single device instance. This means that, to feed all of them, hundreds of threads need to be spawned at the host (ideally, one for each concurrent task), which is not desirable at all (more threads competing for internal locks and resources). This RFC seeks to accomplish two things:
- Improve my understanding of the target nowait execution model inside LLVM.
- Propose a new execution model that allows target nowait tasks to be re-scheduled for execution without compromising the correctness of the program.
The ultimate goal is to enable the concurrent execution of hundreds of target nowait regions using just a small number of threads (even a single one). For that, I divided the RFC into three sections: a brief description of the current execution model, the proposed new one, and a set of questions. Please feel free to comment and point out any errors in this post; the more this helps me understand the libomptarget library, the better.
Current execution model
My current understanding of LLVM's execution model (using the current code and D77609, D78075 as guidelines) for target nowait regions is as follows:
for (int i = 0; i < 10; ++i) {
#pragma omp target nowait
{ /* Region i */ }
}
#pragma omp taskwait
Main Thread:
- The main program thread encounters the first target region and initializes the hidden helper thread (HHT) team.
- It then goes through the steps of creating each Region i context (task allocation and dispatch). Since this example has no dependencies, the regions are immediately put inside the main thread's task queue (making them available to be stolen by the HHT team).
- The main thread encounters the taskwait directive and starts to participate in the execution of the target regions alongside the HHT team.
Hidden Helper Threads (HHT):
- The thread acquires a task from its task queue (or steals one from another queue) and executes its outlined function, which calls the libomptarget interface for a target region.
- The thread goes through all the steps to execute the target region: allocating and submitting any mapped/needed buffers, updating device maps, and calculating the correct addresses of the needed data.
- The region is executed on the target device.
- The region is finalized, retrieving any mapped/needed data and removing any buffer with a reference count of 0 from the device's memory.
- The thread returns from the libomptarget code and keeps executing other tasks.
If the outlined steps and observed behavior are correct, this means that any thread that executes a target region will always be blocked inside the libomptarget library, just waiting for the operations to complete. This is also true for target nowait with dependencies. Since dependencies are abstracted by the OpenMP task runtime, they do not change the blocking behavior of a target execution: they only defer the target execution up to the point where all dependencies are resolved. When that happens, an HHT will still be blocked inside libomptarget waiting for the offloaded region to be executed. Data movement pragmas like target enter/exit/update data are also affected by this same behavior, even when paired with nowait and depend clauses.
Devices are allowed to implement _async variants of the plugin API, where multiple operations are called asynchronously and accumulated in a queue-like context structure, the AsyncInfo (e.g., a CUDA stream for the CUDA device or custom MPI events for OmpCluster). Unfortunately, such a context is not used outside of the libomptarget code, meaning the same libomptarget entry point that builds the queue will also synchronize it, blocking the current thread from executing other tasks. Moreover, the plugin API does not expose any mechanism that allows the operations on the queue to advance and their completion status to be checked without blocking the current thread (e.g., cudaStreamQuery), so even if the synchronization were deferred, it would eventually block a host thread. All this is to say that, once a thread calls a libomptarget entry point, it will be blocked there, even though it could be doing other useful work (e.g., executing/dispatching other target regions).
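To make the blocking behavior concrete, here is a rough sketch of the single-pass pattern described above (the AsyncInfoTy stand-in and the executeTargetRegion name are illustrative, not the exact libomptarget interfaces):

// Rough sketch of the current single-pass, blocking pattern inside a
// libomptarget entry point (illustrative names only).
struct AsyncInfoTy {
  // Accumulates asynchronous device operations (e.g., a CUDA stream).
  int synchronize(); // blocks until every queued operation has finished
};

int executeTargetRegion(/* device, mapped arguments, ... */) {
  AsyncInfoTy AsyncInfo;
  // 1. Dispatch H2D copies, the kernel launch, and D2H copies through the
  //    _async plugin entry points; they accumulate in AsyncInfo.
  // 2. Immediately block this thread until the whole queue drains.
  if (AsyncInfo.synchronize() != 0)
    return 1; // failure
  // 3. Only now run the post-processing: device map updates, buffer
  //    deallocation and, back in the OpenMP runtime, dependency resolution.
  return 0;   // success
}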
Furthermore, imagine that our previous example also dispatches each target region to a different device instance (so, a total of 10 devices). If our HHT team has only 2 threads, the program would only be able to use 2 out of the 10 accelerators at the same time (3 if we count that the main thread also executes tasks). This means that we must always set our HHT count at the host to be a one-to-one match with the number of devices that we want to use. The problem is even more aggravated when each device instance can execute multiple regions at the same time: now our host thread count needs to match the number of tasks that we want to execute concurrently. Although this behavior makes sense in the context of a multithreaded application, accelerators do not benefit from having a host thread waiting on the completion of each operation.
For these reasons, we need a mechanism that allows the HHT team to wait for dispatched target regions while doing useful work.
Proposal
Currently, the execution of a target region is done in a single pass: a thread dispatches many device-side operations, updating internal libomptarget state, and then waits for them immediately afterward, updating more internal state. By leveraging the already implemented _async plugin APIs, this RFC proposes splitting the execution model of target nowait regions into two stages: a dispatch stage and a synchronization stage. Here is an overview of each stage's responsibilities:
Dispatch stage:
- The HHT starts executing the task associated with a target nowait region.
- All needed device-side operations are dispatched using their _async variants, accumulating their execution context in the form of the AsyncInfo structure.
- Any procedures that would be executed after the call to the AsyncInfo::synchronize function are now stored inside the structure itself in the form of a vector-like container (i.e., std::vector<std::function<void(void)>>). Lambdas can be used to store all the needed local variables in their capture lists, allowing the post-processing procedures to be executed at the synchronization stage (see the sketch after this list).
- AsyncInfo is stored inside the OpenMP task structure in the form of an opaque handle, allowing the associated task to reference it at a later stage.
- The task is re-enqueued for execution.
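As a rough sketch of the dispatch stage, assuming an AsyncInfoTy extended as proposed (the PostProcessing container, the addPostProcessing helper, and the dispatchStage function are illustrative assumptions, not existing libomptarget code):

#include <functional>
#include <utility>
#include <vector>

// AsyncInfo extended with the proposed post-processing container.
struct AsyncInfoTy {
  std::vector<std::function<void(void)>> PostProcessing;
  void addPostProcessing(std::function<void(void)> Fn) {
    PostProcessing.emplace_back(std::move(Fn));
  }
};

// Dispatch stage (simplified): queue the device operations, capture the
// post-processing work, and hand the AsyncInfo back so the caller can store
// it in the task as an opaque handle and re-enqueue the task.
AsyncInfoTy *dispatchStage(/* task, device, mapped arguments */) {
  auto *AsyncInfo = new AsyncInfoTy();
  // Dispatch data submissions and the kernel launch through the _async
  // plugin APIs, accumulating them in AsyncInfo (omitted here).
  AsyncInfo->addPostProcessing([/* captured locals, e.g. mapping-table
                                   entries and reference counts */] {
    // Update device data maps and remove buffers whose refcount reached 0.
  });
  return AsyncInfo;
}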
Synchronization stage:
- The HHT starts re-executing the task associated with the previously dispatched target nowait region.
- The thread queries for any AsyncInfo handle stored in the task structure. When none is found, it executes the steps described in the dispatch stage. In this case, a valid handle is found and the thread recovers the original structure information.
- The AsyncInfo queue is synchronized using a new plugin API function that checks and advances the completion of the queue in a non-blocking manner, returning as soon as possible whether the queue is completed. Such a function allows libomptarget to only partially advance the execution of a target region without blocking any thread.
- The next step depends on whether the queue is completed (see the sketch after this list):
  - If the queue is not completed, the HHT re-enqueues the task once more, allowing another thread to re-execute the synchronization stage at another time.
  - If the queue is completed, all the post-processing functions are executed sequentially, the task's AsyncInfo handle is cleared and invalidated, and the task is finalized. Here, the task dependencies can be resolved as usual, allowing other tasks/target nowait regions to execute.
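Putting the two stages together, the flow on the libomptarget side could look roughly like the sketch below. All the names here (queryNonBlocking, getCurrentTaskAsyncHandle, setCurrentTaskAsyncHandle, dispatchTargetRegion) are hypothetical and only illustrate the proposed control flow:

#include <functional>
#include <vector>

struct AsyncInfoTy {
  std::vector<std::function<void(void)>> PostProcessing;
  bool queryNonBlocking(); // wraps the new non-blocking plugin query
};

// Hypothetical accessors for the opaque handle stored in the current task,
// plus the dispatch stage from the previous sketch.
AsyncInfoTy *getCurrentTaskAsyncHandle();
void setCurrentTaskAsyncHandle(AsyncInfoTy *Handle);
AsyncInfoTy *dispatchTargetRegion();

void executeTargetNowaitRegion() {
  AsyncInfoTy *AsyncInfo = getCurrentTaskAsyncHandle();
  if (!AsyncInfo) {
    // First execution of the task: run the dispatch stage and leave the
    // handle set so the OpenMP runtime re-enqueues the task.
    setCurrentTaskAsyncHandle(dispatchTargetRegion());
    return;
  }
  // Re-execution of the task: synchronization stage.
  if (!AsyncInfo->queryNonBlocking())
    return; // still pending; the handle stays set and the task is re-enqueued
  for (auto &Fn : AsyncInfo->PostProcessing)
    Fn(); // deferred map updates, deallocations, ...
  setCurrentTaskAsyncHandle(nullptr); // the task can now be finalized as usual
  delete AsyncInfo;
}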
Needed changes
To implement the proposed mechanism, the following changes need to be made across the OpenMP runtime, the libomptarget library, and the target plugins. A prototype of this new mechanism is in the works, and this RFC will soon be updated with a link to its patch in Phabricator.
OpenMP runtime
- Extend the task structure to include an opaque handle responsible for storing the AsyncInfo structure.
- Implement two new functions that allow libomptarget to set and get said opaque handle.
- Update the task execution loop to re-enqueue a task when the opaque handle is set (see the sketch after this list).
  - Note: the re-enqueueing process can be done in numerous ways (e.g., re-enqueue to the current thread's queue, push to another thread's queue, use an exclusive queue for deferred tasks), but one thing must be ensured: the task-stealing process must be able to steal tasks from other threads while deferred tasks are waiting for synchronization. If that is not guaranteed, load balancing cannot be assured since, currently, a thread only steals tasks when its own queue is empty, which would not happen if it contains even a single deferred task.
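A minimal sketch of the runtime-side change, assuming a simplified task structure and hypothetical helper names (the real change would live in libomp's task invocation/completion path):

// Simplified model of the task completion path: if the task's routine
// returned while its async handle is still set, the task is re-enqueued
// instead of being finished.
struct TaskTy {
  void (*Routine)(int ThreadId, TaskTy *Task);
  void *AsyncHandle = nullptr; // opaque handle holding the AsyncInfo
};

// Hypothetical entry points for libomptarget to set/get the handle of the
// task currently running on this thread (names invented for illustration).
extern "C" void __kmpc_set_task_async_handle(void *Handle);
extern "C" void *__kmpc_get_task_async_handle();

void pushTask(int ThreadId, TaskTy *Task);   // re-enqueue (strategy left open)
void finishTask(int ThreadId, TaskTy *Task); // release dependent tasks

void invokeTask(int ThreadId, TaskTy *Task) {
  Task->Routine(ThreadId, Task); // runs either the dispatch or the sync stage
  if (Task->AsyncHandle != nullptr) {
    // Device operations still pending: defer the task so this thread (or a
    // stealing thread) can pick up other work in the meantime.
    pushTask(ThreadId, Task);
    return;
  }
  finishTask(ThreadId, Task); // normal completion, dependencies resolved
}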
Libomptarget agnostic layer
- Extend the AsyncInfo structure:
  - New post-processing function vector container.
  - New non-blocking synchronization function.
- Split most, if not all, libomptarget paths to be executed in two stages. This will affect any interface path that may be called with a nowait clause. The synchronous functions may be implemented based on the asynchronous ones, reducing the pain of maintaining two different code paths (see the sketch after this list).
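For the last point, one possible shape for that reuse is sketched below; the names and types are simplified stand-ins rather than the current libomptarget entry points:

struct DeviceTy;
struct AsyncInfoTy; // extended as described above

// Dispatch stage only: queues the device operations and the post-processing.
int targetAsync(DeviceTy &Device, AsyncInfoTy &AsyncInfo);

// Blocking wait that also runs the stored post-processing actions.
int synchronizeAndPostProcess(AsyncInfoTy &AsyncInfo);

// The blocking (non-nowait) path reuses the split implementation by
// dispatching and then synchronizing in place.
int targetBlocking(DeviceTy &Device, AsyncInfoTy &AsyncInfo) {
  if (int RC = targetAsync(Device, AsyncInfo))
    return RC; // dispatch failed
  return synchronizeAndPostProcess(AsyncInfo);
}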
Device plugins
- Extend the plugin API with a new non-blocking synchronization function. Such a function would advance and query the execution state of the operations stored in AsyncInfo in a non-blocking manner, returning whether all of them are completed through the IsCompleted pointer. Error codes can be returned normally as the function result. As previously mentioned, for the CUDA plugin, this function can be implemented around the cudaStreamQuery function.
int32_t __tgt_rtl_synchronize_async(int32_t ID, __tgt_async_info *AsyncInfo, int8_t *IsCompleted);
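For reference, a possible CUDA-plugin implementation could look roughly like the following sketch. It assumes the CUDA stream is stored in the AsyncInfo's Queue member, uses a simplified stand-in for the __tgt_async_info declaration and the return codes, and omits the plugin's usual error reporting:

#include <cstdint>
#include <cuda_runtime.h>

// Stand-in for the real __tgt_async_info declaration from omptarget.h.
struct __tgt_async_info { void *Queue = nullptr; };

int32_t __tgt_rtl_synchronize_async(int32_t ID, __tgt_async_info *AsyncInfo,
                                    int8_t *IsCompleted) {
  cudaStream_t Stream = reinterpret_cast<cudaStream_t>(AsyncInfo->Queue);
  cudaError_t Err = cudaStreamQuery(Stream);
  if (Err == cudaSuccess) {
    *IsCompleted = 1; // every queued operation has finished
    return 0;         // OFFLOAD_SUCCESS
  }
  if (Err == cudaErrorNotReady) {
    *IsCompleted = 0; // operations still in flight; not an error
    return 0;         // OFFLOAD_SUCCESS
  }
  *IsCompleted = 0;
  return -1;          // OFFLOAD_FAIL: a real error was reported on the stream
}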
Questions
- Besides re-enqueueing the tasks, is there an already implemented way of re-executing a task in OpenMP? How about splitting the execution of a task into two stages? This cannot be done using any stack-based approach (i.e., the current implementation of taskyield), since that could make the tasks lower in the stack starve for too long. A mechanism akin to cooperative tasking would be desired.
- CUDA already has a library function to query for stream completion in a non-blocking manner (cudaStreamQuery). Do the other plugins have such functionality as well?
Thanks for your time!
OpenMP Cluster Project