Asynchronous Side Effects From Offloaded Functions

Introduction

I’d like some feedback on ideas I had while trying to reduce unnecessary gpu.barrier operations in the IREE pass pipeline; while working on it, I came to think a broader perspective might be needed. To illustrate the barrier problem, consider the following code:

1: %token = nvgpu.device_async_copy %A[], %B[], 1 : memref<f32> to memref<f32, #gpu.address_space<workgroup>>
2: gpu.barrier
3: nvgpu.device_async_wait %token
4: linalg.abs ins(%B: memref<f32, #gpu.address_space<workgroup>>) outs(%A: memref<f32>)

There is a dependency between lines 1 and 4 through %B, so the effects of line 1 must be fully realized before line 4 starts. However, a gpu.barrier cannot synchronize the effects of nvgpu.device_async_copy; only nvgpu.device_async_wait can do that, so the gpu.barrier on line 2 is redundant and can be eliminated. The same problem exists for some other NVIDIA ops, and it will exist for any architecture with a DMA system, or with dedicated compute accelerators that run in parallel with the general-purpose processor executing the IR, such as NVDLA or BISMO.

The underlying problem also extends beyond operations offloaded to an accelerator. In an SPMD context, normally synchronous operations like memref.copy may behave like an asynchronous launch, in that the op may exit in some threads before its full effects are available. In IREE, bufferization conservatively guards all such operations with gpu.barrier, but this creates many unnecessary ops which ought to be removed later, which is exactly the barrier problem described above.
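
To make the distinction concrete, here is a small Python toy model (not MLIR or real hardware semantics) of a DMA engine whose copies are only enqueued at issue time: a thread barrier does not drain the queue, while the dedicated wait does.

```python
# Toy model of why gpu.barrier cannot stand in for nvgpu.device_async_wait:
# the barrier orders *threads*, but the DMA engine keeps its own queue of
# in-flight copies that only an explicit wait drains.

class DMAEngine:
    def __init__(self):
        self.in_flight = []   # copies issued but not yet completed
        self.memory = {}

    def async_copy(self, dst, value):
        # The copy is only enqueued; 'memory[dst]' is not updated yet.
        self.in_flight.append((dst, value))

    def barrier(self):
        # A thread barrier: orders the issuing threads, but does nothing
        # to the DMA queue, so in-flight copies stay in flight.
        pass

    def wait(self):
        # The dedicated wait drains the queue; only now are effects visible.
        for dst, value in self.in_flight:
            self.memory[dst] = value
        self.in_flight.clear()

dma = DMAEngine()
dma.async_copy("B", 42)
dma.barrier()
visible_after_barrier = "B" in dma.memory   # the barrier did not help
dma.wait()
visible_after_wait = "B" in dma.memory
```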

Initial proposal

I submitted a PR that tries to deal with the extra barriers by marking some side effects as asynchronous, meaning the effect may become visible after the op exits. This signals to the barrier elimination pass that a gpu.barrier cannot guarantee the effect is done. However, this approach leaves some issues unaddressed, such as how to actually synchronize the effect, along with other details described below.
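
As a sketch of that idea (all names here are illustrative, not the actual MLIR/IREE API), the PR's approach amounts to tagging each effect with an asynchronous bit and teaching barrier elimination that a gpu.barrier only covers the synchronous ones:

```python
# Hypothetical sketch of the PR's idea: tag each memory effect with an
# 'is_async' bit, and have barrier elimination treat a gpu.barrier as
# synchronizing only the synchronous effects.

from dataclasses import dataclass

@dataclass
class EffectInstance:
    resource: str
    is_async: bool = False   # effect may become visible after the op exits

def barrier_synchronizes(effect: EffectInstance) -> bool:
    # gpu.barrier can only order effects that are complete when the op
    # exits; async effects need their dedicated wait op instead.
    return not effect.is_async

copy_effect = EffectInstance("workgroup_buffer", is_async=True)   # async copy write
store_effect = EffectInstance("workgroup_buffer")                 # ordinary store
```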

New Proposal

My idea is to create a generalized way, using interfaces, to ensure that the effects of offloaded operations are actually finished when they might be used. IREE and other users could then write their copies unguarded, and a separate pass would ensure that effects are available before later uses, relying on an interface that constructs the necessary IR.

A consideration that does not arise with the nvgpu dialect is that on some architectures the synchronization has to be done through a limited resource, which must be configured before it is used. So a codegen enforcing synchronization may also need to concern itself with managing these resources. Additionally, if the processor waits on the resource multiple times in a row, this may cause a freeze or hard fault. So the codegen also needs to make absolutely sure that no control flow path can wait twice without freeing the resource.
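
A toy model of such a limited, stateful synchronization resource, with the double-wait hazard made explicit (the class and its states are invented for illustration):

```python
# Toy model of a limited, stateful synchronization resource: it must be
# armed (configured) before use, and a second wait without re-arming is
# an error, mirroring the freeze/hard-fault hazard described above.

class SyncResource:
    def __init__(self):
        self.state = "unconfigured"

    def init(self):
        self.state = "armed"

    def wait(self):
        if self.state != "armed":
            # On real hardware this would be a freeze or hard fault;
            # here we just make the bug loud.
            raise RuntimeError("wait on unarmed sync resource")
        self.state = "consumed"

r = SyncResource()
r.init()
r.wait()          # fine: armed -> consumed
try:
    r.wait()      # double wait without re-arming
    double_wait_ok = True
except RuntimeError:
    double_wait_ok = False
```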

I am unsure how best to minimize the use of the synchronization resources. I’ve thought of two possible approaches: either synchronization resources are assigned to memory resources according to whether their accesses overlap, possibly using polyhedral analysis; or the codegen could overprovision and then try to merge async effects that it can prove cannot be outstanding at the same time.
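
The second approach could look roughly like a linear-scan register allocator over effect lifetimes. This Python sketch (all names invented) packs async effects whose live ranges cannot overlap onto the same synchronization resource; a real pass would derive the ranges from control flow, or from a polyhedral overlap analysis for the first approach:

```python
# Sketch of the overprovision-then-merge approach: model each async
# effect's lifetime as a [start, end) interval over program points, and
# greedily reuse a synchronization resource whenever the previous effect
# assigned to it has already completed (linear-scan style).

def assign_sync_resources(live_ranges):
    """live_ranges: list of (start, end) per async effect.
    Returns a resource index per effect."""
    order = sorted(range(len(live_ranges)), key=lambda i: live_ranges[i][0])
    free_at = []          # free_at[r] = program point when resource r frees up
    assignment = [None] * len(live_ranges)
    for i in order:
        start, end = live_ranges[i]
        # Reuse a resource whose previous effect has already completed.
        for r, t in enumerate(free_at):
            if t <= start:
                assignment[i] = r
                free_at[r] = end
                break
        else:
            assignment[i] = len(free_at)   # need a fresh resource
            free_at.append(end)
    return assignment

# Three effects: the first two overlap, the third starts after the first
# ends, so effects 0 and 2 can share a resource.
print(assign_sync_resources([(0, 4), (2, 6), (4, 8)]))
```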

The analysis needed here is not trivial. The pass would have to ensure that the effects of an op are waited on exactly once on every control flow path, and that the synchronization resource is initialized correctly in all preceding control flow. In order to be practical for pipelining, the merging of allocated resources has to work across loop iterations.
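
For acyclic control flow, the exactly-once property can at least be stated precisely. This sketch checks it by path enumeration, which is exponential; a real implementation would use a dataflow analysis instead, extended with loop-carried facts to handle pipelined loops:

```python
# Sketch of the exactly-once check: walk all paths of an acyclic CFG and
# verify that every path from entry to exit waits on the resource exactly
# once (twice risks a freeze, zero times leaves the effect unsynchronized).

def waits_exactly_once(cfg, entry, exit_block, waits):
    """cfg: dict block -> successor list; waits: set of blocks that wait."""
    def paths_ok(block, count):
        count += block in waits
        if count > 1:
            return False            # some path waits twice: potential freeze
        if block == exit_block:
            return count == 1       # every path must wait exactly once
        return all(paths_ok(s, count) for s in cfg[block])
    return paths_ok(entry, 0)

diamond = {"entry": ["then", "else"], "then": ["exit"],
           "else": ["exit"], "exit": []}
print(waits_exactly_once(diamond, "entry", "exit", {"then", "else"}))  # both arms wait
print(waits_exactly_once(diamond, "entry", "exit", {"then"}))          # else-path never waits
```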

Interface

This is a very rough draft for an interface. It allows an op to have multiple async effects that need to be synchronized separately, each of which can be associated with multiple side effects.

  • AsyncEffectGroup *getAsyncGroup(EffectInstance &effect) : Get the group associated with effect
  • void getAsyncGroups(SmallVector<AsyncEffectGroup> &groups) : Get all async effect groups. Side
    effects not included in an AsyncEffectGroup are assumed to be synchronous.
  • Operation * createInit(AsyncEffectGroup &effect, PatternRewriter &rewriter) : If the operation’s
    given async effect group effect requires initialization, this function should create one or more
    new ops at the insertion point of rewriter which will prepare the synchronization mechanism for
    use with this operation. Typically, this should be done at a location that dominates this
    operation. It returns one of the constructed operations, nullptr if no initialization was
    required, or this if the op is its own initialization. As a side effect, it may make a result of
    the constructed IR an operand of this so that the op uses the resource this method allocated.
  • bool doesInitialize(AsyncEffectGroup &effect, Operation * other) : Tests whether other is part
    of the IR that will initialize the synchronization for this op. It should always return true for
    any value returned by createInit for the given effect. Operations that don’t require any
    initialization might return true for nullptr.
  • Operation * createWait(AsyncEffectGroup &effect, PatternRewriter &rewriter) : Emits IR at the
    insertion point that blocks until the given effect is certain to be complete. Typically, this
    should be done at a location that post-dominates this operation, or as a loop-carried
    dependency, to ensure the effects are finished by function exit. It returns an operation that
    was constructed.
  • bool doesSync(AsyncEffectGroup &effect, Operation * other) : Tests whether the given operation
    other is guaranteed to synchronize the given effect group effect. It should always return true
    for any value that was returned by createWait for the given effect.
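
To pin down the intended contract, here is a rough Python mirror of the interface (the real thing would be an MLIR C++ op interface; CpAsyncLikeOp is a made-up token-style op, loosely modeled on an nvgpu.device_async_copy-like launch that needs no separate initialization):

```python
# Rough Python mirror of the interface draft above, for illustration only.

from abc import ABC, abstractmethod

class AsyncEffectOpInterface(ABC):
    @abstractmethod
    def get_async_groups(self):
        """All async effect groups; effects outside any group are synchronous."""

    @abstractmethod
    def create_init(self, group, rewriter):
        """Emit init IR at the insertion point; None if no init is needed."""

    @abstractmethod
    def does_initialize(self, group, other):
        """True iff 'other' initializes synchronization for this op."""

    @abstractmethod
    def create_wait(self, group, rewriter):
        """Emit IR that blocks until 'group' is complete."""

    @abstractmethod
    def does_sync(self, group, other):
        """True iff 'other' is guaranteed to synchronize 'group'."""

class CpAsyncLikeOp(AsyncEffectOpInterface):
    def get_async_groups(self):
        return ["copy_group"]
    def create_init(self, group, rewriter):
        return None                    # token-based: no init needed
    def does_initialize(self, group, other):
        return other is None           # per the draft: None means "no init required"
    def create_wait(self, group, rewriter):
        wait = ("wait", group)         # stand-in for a constructed wait op
        rewriter.append(wait)
        return wait
    def does_sync(self, group, other):
        return other == ("wait", group)

ir = []                                # stand-in for a rewriter/insertion point
op = CpAsyncLikeOp()
wait = op.create_wait("copy_group", ir)
```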

Synchronization pass

The pass would have an option for fallback function(s) that provide init/wait for normally synchronous ops. The codegen is imagined to work something like this:

  1. Enumerate disjoint and asynchronous memory side effects
  2. Map each effect to a synchronization resource
  3. Reduce resource utilization by merging effects that can’t occur at the same time.
  4. Create init code before each async launch. Create waits immediately after.
  5. Migrate init/wait to the furthest possible location(s)
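
A toy driver for steps 1 and 4 on a straight-line program, with the resource mapping kept maximally conservative (one fresh resource per launch, so steps 2 and 3 degenerate) and step 5's migration omitted; all names are illustrative:

```python
# Toy sketch of the synchronization pass on a straight-line program:
# find async launches, give each its own resource, and bracket each
# launch with init before and wait immediately after.

def insert_sync(program):
    """program: list of ('async_launch', id) or ('other', id) entries."""
    out, next_resource = [], 0
    for instr in program:
        if instr[0] == "async_launch":
            r = next_resource          # conservative: fresh resource per launch
            next_resource += 1
            out.append(("init", r))    # step 4: init before the launch
            out.append(instr)
            out.append(("wait", r))    # step 4: wait immediately after
        else:
            out.append(instr)
    return out

prog = [("other", "a"), ("async_launch", "copy0"), ("other", "b")]
print(insert_sync(prog))
```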

Loose ends?

It might be better to use high-level dialect ops like offload.init instead of an interface. The async dialect doesn’t seem like a good fit, but maybe it can be adapted. My concern with the dialect approach is that there could be multiple different offload systems with incompatible synchronization methods, and a generic dialect could incorrectly try to mix them together. Also, there might need to be additional ops/methods to explicitly free a synchronization resource, or to manually fill it. The latter may be needed in some software pipelining cases so that a wait that was moved to the beginning of a loop will pass through on the first iteration.
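
The "manually fill the resource" case can be illustrated with a counting sync object (all names invented): when the wait is rotated to the top of the loop, the resource must be pre-filled before the loop so the first iteration's wait falls through instead of blocking forever:

```python
# Toy model of pre-filling a synchronization resource for a software-
# pipelined loop: the wait at the top of iteration i covers iteration
# i-1's copy, so iteration 0 needs a pre-filled (manually filled) signal.

class CountingSync:
    def __init__(self, prefill=0):
        self.count = prefill          # completed (or pre-filled) signals

    def signal(self):                 # called when an async copy lands
        self.count += 1

    def wait(self):
        if self.count == 0:
            raise RuntimeError("deadlock: wait with nothing in flight")
        self.count -= 1

def pipelined_loop(iters, prefill):
    sync = CountingSync(prefill)
    done = []
    for i in range(iters):
        sync.wait()                   # wait for the *previous* iteration's copy
        sync.signal()                 # this iteration's copy lands (toy: instantly)
        done.append(i)
    return done

print(pipelined_loop(3, prefill=1))   # pre-filled: the first wait passes
```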

See Also

There was some discussion here about making a generic async DMA abstraction for the NVIDIA feature.