Re-scheduling implicit target tasks for non-blocking nowait execution
TL;DR
The current target nowait support in LLVM may limit the number of target regions that can be executed concurrently: we can only execute as many "target tasks" as we have hidden helper threads. Although this is not a problem for most GPU applications, other offloading devices allow multiple (hundreds of) target regions to be executed concurrently. To address that, this post proposes splitting target nowait regions into two stages: a dispatch stage and a non-blocking synchronization stage. The dispatch stage sets up all the device operations needed to run a target region asynchronously, while the synchronization stage allows the OpenMP runtime to re-enqueue the task associated with the target region while any operation is still pending, doing any post-processing afterward (e.g., device data map updates in libomptarget, task dependency resolution in OpenMP).
Introduction
For some time now, the OpenMP implementation inside the LLVM project has allowed users to offload target regions to accelerators using the full power of tasks and dependencies through the nowait and depend clauses. An application may decompose its problem into tasks that are manually mapped to, for example, different GPUs, letting the OpenMP runtime decide when each region may be executed (using LLVM's task dependency graph and stealing algorithm). As far as I know, LLVM's libomptarget library was primarily focused on offloading to GPUs, and the current "blocking" model of executing target nowait regions is enough for that use case. But new plugins are being added to LLVM where the program is no longer limited to offloading to just a handful of devices (e.g., 8 GPUs), but to hundreds of them (e.g., the remote plugin and my research team's work, OmpCluster). Moreover, some devices may allow multiple regions to be executed concurrently inside a single device instance. This means that, to feed all of them, hundreds of threads need to be spawned at the host (ideally, one for each concurrent task), which is not desirable at all (more threads competing for internal locks and resources). This RFC seeks to accomplish two things:
- Improve my understanding of the target nowait execution model inside LLVM.
- Propose a new execution model that allows target nowait tasks to be re-scheduled for execution without compromising the correctness of the program.
The ultimate goal is to enable the concurrent execution of hundreds of target nowait regions using just a small number of threads (even a single one). For that, I divided the RFC into three sections: a brief description of the current execution model, the proposed new one, and a set of questions. Please feel free to comment and point out any errors in this post; the more this helps me understand the libomptarget library, the better.
Current execution model
My current understanding of LLVM's execution model (using the current code and D77609, D78075 as guidelines) for target nowait regions is as follows:
for (int i = 0; i < 10; ++i) {
#pragma omp target nowait
{ /* Region i */ }
}
#pragma omp taskwait
Main Thread:
- The main program thread encounters the first target region and initializes the hidden helper thread (HHT) team.
- It then goes through the steps of creating each Region i context (task allocation and dispatch). Since this example has no dependencies, the regions are immediately put inside the main thread's task queue (making them available to be stolen by the HHT team).
- The main thread encounters the taskwait directive and starts to participate in the execution of the target regions alongside the HHT team.
Hidden Helper Threads (HHT):
- The thread acquires a task from its task queue (or steals one from another queue) and executes its outlined function, which calls the libomptarget interface for a target region.
- The thread goes through all the steps to execute the target region: allocating and submitting any mapped/needed buffers, updating device maps, and calculating the correct addresses of the needed data.
- The region is executed on the target device.
- The region is finalized, retrieving any mapped/needed data and removing any buffer with a reference count of 0 from the device's memory.
- The thread returns from the libomptarget code and keeps executing other tasks.
If the outlined steps and observed behavior are correct, this means that any thread that executes a target region will always be blocked inside the libomptarget library, just waiting for the operations to complete. This is also true for target nowait with dependencies. Since dependencies are abstracted by the OpenMP task runtime, they do not change the blocking behavior of a target execution: they only defer the target execution up to the point where all dependencies are resolved. When that happens, an HHT will still be blocked inside libomptarget waiting for the offloaded region to be executed. Data movement pragmas like target enter/exit/update data are also affected by this same behavior, even when paired with nowait and depend clauses.
Devices are allowed to implement _async variants of the plugin API, where multiple operations are called asynchronously and accumulated in a queue-like context structure, the AsyncInfo (e.g., a CUDA stream for the CUDA device or custom MPI events for OmpCluster). Unfortunately, such a context is not used outside of the libomptarget code, meaning the same libomptarget entry point that builds the queue will also synchronize it, blocking the current thread from executing other tasks. Moreover, the plugin API does not expose any mechanism that allows the operations on the queue to advance and their completion status to be checked without blocking the current thread (e.g., cudaStreamQuery), so even if the synchronization were deferred, it would eventually block a host thread. All this is to say that, once a thread calls a libomptarget entry point, it will be blocked there, even though it could be doing other useful work (e.g., executing/dispatching other target regions).
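To make the blocking behavior concrete, here is a rough sketch of the single-pass pattern described above (the AsyncInfoTy stand-in and the executeTargetRegion name are illustrative, not the exact libomptarget interfaces):

// Rough sketch of the current single-pass, blocking pattern inside a
// libomptarget entry point (illustrative names only).
struct AsyncInfoTy {
  // Accumulates asynchronous device operations (e.g., a CUDA stream).
  int synchronize(); // blocks until every queued operation has finished
};

int executeTargetRegion(/* device, mapped arguments, ... */) {
  AsyncInfoTy AsyncInfo;
  // 1. Dispatch H2D copies, the kernel launch, and D2H copies through the
  //    _async plugin entry points; they accumulate in AsyncInfo.
  // 2. Immediately block this thread until the whole queue drains.
  if (AsyncInfo.synchronize() != 0)
    return 1; // failure
  // 3. Only now run the post-processing: device map updates, buffer
  //    deallocation and, back in the OpenMP runtime, dependency resolution.
  return 0;   // success
}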
Furthermore, imagine that our previous example also dispatches each target region to a different device instance (so, a total of 10 devices). If our HHT team has only 2 threads, the program would only be able to use 2 out of the 10 accelerators at the same time (3 if we count that the main thread also executes tasks). This means that we must always set our HHT count at the host to be a one-to-one match with the number of devices that we want to use. The problem is even more aggravated when each device instance can execute multiple regions at the same time: now our host thread count needs to match the number of tasks that we want to execute concurrently. Although this behavior makes sense in the context of a multithreaded application, accelerators do not benefit from having a host thread waiting on the completion of each operation.
For these reasons, we need a mechanism that allows the HHT team to wait for dispatched target regions while doing useful work.
Proposal
Currently, the execution of a target region is done in a single pass: a thread dispatches many device-side operations, updating internal libomptarget state, and then waits for them immediately afterward, updating more internal state. By leveraging the already implemented _async plugin APIs, this RFC proposes splitting the execution model of target nowait regions into two stages: a dispatch stage and a synchronization stage. Here is an overview of each stage's responsibilities:
Dispatch stage:
- The HHT starts executing the task associated with a target nowait region.
- All needed device-side operations are dispatched using their _async variants, accumulating their execution context in the form of the AsyncInfo structure.
- Any procedures that would be executed after the call to the AsyncInfo::synchronize function are now stored inside the structure itself in the form of a vector-like container (i.e., std::vector<std::function<void(void)>>). Lambdas can be used to store all the needed local variables in their capture lists, allowing the post-processing procedures to be executed at the synchronization stage (see the sketch after this list).
- AsyncInfo is stored inside the OpenMP task structure in the form of an opaque handle, allowing the associated task to reference it at a later stage.
- The task is re-enqueued for execution.
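As a rough sketch of the dispatch stage, assuming an AsyncInfoTy extended as proposed (the PostProcessing container, the addPostProcessing helper, and the dispatchStage function are illustrative assumptions, not existing libomptarget code):

#include <functional>
#include <utility>
#include <vector>

// AsyncInfo extended with the proposed post-processing container.
struct AsyncInfoTy {
  std::vector<std::function<void(void)>> PostProcessing;
  void addPostProcessing(std::function<void(void)> Fn) {
    PostProcessing.emplace_back(std::move(Fn));
  }
};

// Dispatch stage (simplified): queue the device operations, capture the
// post-processing work, and hand the AsyncInfo back so the caller can store
// it in the task as an opaque handle and re-enqueue the task.
AsyncInfoTy *dispatchStage(/* task, device, mapped arguments */) {
  auto *AsyncInfo = new AsyncInfoTy();
  // Dispatch data submissions and the kernel launch through the _async
  // plugin APIs, accumulating them in AsyncInfo (omitted here).
  AsyncInfo->addPostProcessing([/* captured locals, e.g. mapping-table
                                   entries and reference counts */] {
    // Update device data maps and remove buffers whose refcount reached 0.
  });
  return AsyncInfo;
}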
Synchronization stage:
- The HHT starts re-executing the task associated with the previously dispatched target nowait region.
- The thread queries for any AsyncInfo handle stored in the task structure. When none is found, it executes the steps described in the dispatch stage. In this case, a valid handle is found and the thread recovers the original structure information.
- The AsyncInfo queue is synchronized using a new plugin API function that checks and advances the completion of the queue in a non-blocking manner, returning as soon as possible whether the queue is completed. Such a function allows libomptarget to only partially advance the execution of a target region without blocking any thread.
- The next step depends on whether the queue is completed (see the sketch after this list):
  - If the queue is not completed, the HHT re-enqueues the task once more, allowing another thread to re-execute the synchronization stage at another time.
  - If the queue is completed, all the post-processing functions are executed sequentially, the task's AsyncInfo handle is cleared and invalidated, and the task is finalized. Here, the task dependencies can be resolved as usual, allowing other tasks/target nowait regions to execute.
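Putting the two stages together, the flow on the libomptarget side could look roughly like the sketch below. All the names here (queryNonBlocking, getCurrentTaskAsyncHandle, setCurrentTaskAsyncHandle, dispatchTargetRegion) are hypothetical and only illustrate the proposed control flow:

#include <functional>
#include <vector>

struct AsyncInfoTy {
  std::vector<std::function<void(void)>> PostProcessing;
  bool queryNonBlocking(); // wraps the new non-blocking plugin query
};

// Hypothetical accessors for the opaque handle stored in the current task,
// plus the dispatch stage from the previous sketch.
AsyncInfoTy *getCurrentTaskAsyncHandle();
void setCurrentTaskAsyncHandle(AsyncInfoTy *Handle);
AsyncInfoTy *dispatchTargetRegion();

void executeTargetNowaitRegion() {
  AsyncInfoTy *AsyncInfo = getCurrentTaskAsyncHandle();
  if (!AsyncInfo) {
    // First execution of the task: run the dispatch stage and leave the
    // handle set so the OpenMP runtime re-enqueues the task.
    setCurrentTaskAsyncHandle(dispatchTargetRegion());
    return;
  }
  // Re-execution of the task: synchronization stage.
  if (!AsyncInfo->queryNonBlocking())
    return; // still pending; the handle stays set and the task is re-enqueued
  for (auto &Fn : AsyncInfo->PostProcessing)
    Fn(); // deferred map updates, deallocations, ...
  setCurrentTaskAsyncHandle(nullptr); // the task can now be finalized as usual
  delete AsyncInfo;
}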
Needed changes
To implement the proposed mechanism, the following changes need to be made across the OpenMP runtime, the libomptarget library, and the target plugins. A prototype of this new mechanism is in the works, and this RFC will soon be updated with a link to its patch in Phabricator.
OpenMP runtime
- Extend the task structure to include an opaque handle responsible for storing the AsyncInfo structure.
- Implement two new functions that allow libomptarget to set and get said opaque handle.
- Update the task execution loop to re-enqueue a task when the opaque handle is set (see the sketch after this list).
  - Note: the re-enqueueing process can be done in numerous ways (e.g., re-enqueue to the current thread's queue, push to another thread's queue, use an exclusive queue for deferred tasks), but one thing must be ensured: the task-stealing process must be able to steal tasks from other threads while deferred tasks are waiting for synchronization. If that is not guaranteed, load balancing cannot be assured since, currently, a thread only steals tasks when its own queue is empty, which would not happen if it contains even a single deferred task.
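A minimal sketch of the runtime-side change, assuming a simplified task structure and hypothetical helper names (the real change would live in libomp's task invocation/completion path):

// Simplified model of the task completion path: if the task's routine
// returned while its async handle is still set, the task is re-enqueued
// instead of being finished.
struct TaskTy {
  void (*Routine)(int ThreadId, TaskTy *Task);
  void *AsyncHandle = nullptr; // opaque handle holding the AsyncInfo
};

// Hypothetical entry points for libomptarget to set/get the handle of the
// task currently running on this thread (names invented for illustration).
extern "C" void __kmpc_set_task_async_handle(void *Handle);
extern "C" void *__kmpc_get_task_async_handle();

void pushTask(int ThreadId, TaskTy *Task);   // re-enqueue (strategy left open)
void finishTask(int ThreadId, TaskTy *Task); // release dependent tasks

void invokeTask(int ThreadId, TaskTy *Task) {
  Task->Routine(ThreadId, Task); // runs either the dispatch or the sync stage
  if (Task->AsyncHandle != nullptr) {
    // Device operations still pending: defer the task so this thread (or a
    // stealing thread) can pick up other work in the meantime.
    pushTask(ThreadId, Task);
    return;
  }
  finishTask(ThreadId, Task); // normal completion, dependencies resolved
}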
Libomptarget agnostic layer
- Extend the AsyncInfo structure:
  - New post-processing function vector container.
  - New non-blocking synchronization function.
- Split most, if not all, libomptarget paths to be executed in two stages. This will affect any interface path that may be called with a nowait clause. The synchronous functions may be implemented based on the asynchronous ones, reducing the pain of maintaining two different code paths (see the sketch after this list).
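For the last point, one possible shape for that reuse is sketched below; the names and types are simplified stand-ins rather than the current libomptarget entry points:

struct DeviceTy;
struct AsyncInfoTy; // extended as described above

// Dispatch stage only: queues the device operations and the post-processing.
int targetAsync(DeviceTy &Device, AsyncInfoTy &AsyncInfo);

// Blocking wait that also runs the stored post-processing actions.
int synchronizeAndPostProcess(AsyncInfoTy &AsyncInfo);

// The blocking (non-nowait) path reuses the split implementation by
// dispatching and then synchronizing in place.
int targetBlocking(DeviceTy &Device, AsyncInfoTy &AsyncInfo) {
  if (int RC = targetAsync(Device, AsyncInfo))
    return RC; // dispatch failed
  return synchronizeAndPostProcess(AsyncInfo);
}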
Device plugins
- Extend the plugin API with a new non-blocking synchronization function. Such a function would advance and query the execution state of the operations stored in AsyncInfo in a non-blocking manner, returning whether all of them are completed through the IsCompleted pointer. Error codes can be returned normally as the function result. As previously mentioned, for the CUDA plugin, this function can be implemented around the cudaStreamQuery function.
int32_t __tgt_rtl_synchronize_async(int32_t ID, __tgt_async_info *AsyncInfo, int8_t *IsCompleted);
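For reference, a possible CUDA-plugin implementation could look roughly like the following sketch. It assumes the CUDA stream is stored in the AsyncInfo's Queue member, uses a simplified stand-in for the __tgt_async_info declaration and the return codes, and omits the plugin's usual error reporting:

#include <cstdint>
#include <cuda_runtime.h>

// Stand-in for the real __tgt_async_info declaration from omptarget.h.
struct __tgt_async_info { void *Queue = nullptr; };

int32_t __tgt_rtl_synchronize_async(int32_t ID, __tgt_async_info *AsyncInfo,
                                    int8_t *IsCompleted) {
  cudaStream_t Stream = reinterpret_cast<cudaStream_t>(AsyncInfo->Queue);
  cudaError_t Err = cudaStreamQuery(Stream);
  if (Err == cudaSuccess) {
    *IsCompleted = 1; // every queued operation has finished
    return 0;         // OFFLOAD_SUCCESS
  }
  if (Err == cudaErrorNotReady) {
    *IsCompleted = 0; // operations still in flight; not an error
    return 0;         // OFFLOAD_SUCCESS
  }
  *IsCompleted = 0;
  return -1;          // OFFLOAD_FAIL: a real error was reported on the stream
}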
Questions
- Besides re-enqueueing the tasks, is there an already implemented way of re-executing a task in OpenMP? How about splitting the execution of a task into two stages? This cannot be done using any stack-based approach (i.e., the current implementation of taskyield), since that could make the tasks lower in the stack starve for too long. A mechanism akin to cooperative tasking would be desired.
- CUDA already has a library function to query for stream completion in a non-blocking manner (cudaStreamQuery). Do the other plugins have such functionality as well?
Thanks for your time!
OpenMP Cluster Project