Proposal to add stream/queue as an optional argument to a few GPU dialect ops

Hello,

We are working on Intel® Extension for MLIR (IMEX), a collection of MLIR dialects and passes from Intel for improving upstream MLIR. To enable an MLIR pipeline on Intel GPUs, the runtime (SYCL/DPC++) requires creating a stream/queue (to launch kernels on) with explicit context and device information.

Upstream GPU dialect ops currently don’t allow users to pass their own custom stream to launch kernels on. Hence, to make this work on Intel GPUs, we maintain our own internal dialect (GPUX, where X is eXtension), which extends the upstream GPU dialect ops with an added stream/queue argument.

To give users the flexibility to launch kernels on their custom stream, we propose adding stream/queue as an optional argument to the following upstream GPU dialect ops:

  1. gpu.launch_func

  2. gpu.alloc

  3. gpu.dealloc

  4. gpu.wait

  5. gpu.memcpy

  6. gpu.memset

Adding a stream argument to these ops will allow users to create their own stream (with their own create/destroy stream methods) and pass it to these ops for device memory allocation and kernel execution.
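
To illustrate, here is a minimal sketch of what the IR could look like with the optional stream operand (the syntax and the !gpu.stream type below are assumptions for illustration, not a final design):

// Hypothetical syntax: the (%stream) operand and !gpu.stream type are illustrative.
%stream = gpu.create_stream : !gpu.stream
%c1 = arith.constant 1 : index
%memref = gpu.alloc (%stream) : memref<8xf32>
gpu.launch_func (%stream) @kernels::@foo
    blocks in (%c1, %c1, %c1) threads in (%c1, %c1, %c1)
    args(%memref : memref<8xf32>)
gpu.dealloc (%stream) %memref : memref<8xf32>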

I’m not a GPU expert, but I can comment on the MLIR side.

The stream option you want to add is of a Stream type, so that type should also be added to the GPU dialect. But a create_stream op needs a Device, which needs create_device, and so on, none of which are in the upstream dialect.

What are your plans for those?

Technically speaking, I don’t see a problem in adding those ops and types to the existing GPU dialect. If the other GPUs don’t need that, they don’t need to use it, and if the stream argument is optional, then there’s no change in codegen for them.

But I’d wait to hear from others more involved in GPU code generation to bring their stronger opinions.

Yes, good point; eventually we would like to add new ops like create_stream, create_device and create_context as well. My plan was to keep them in our internal dialect for now and upstream them eventually once these changes were in. But if there is no strong objection, we can add those new ops as part of this proposal too.

The problem with create_device is that different GPU runtimes can have very different ways to select a device, e.g. it can be a numerical index, a device string, or something more complicated. Specifically, for our Python compiler we will need to create a stream either from a device filter string like level_zero:gpu:0 or from a memref (for more context: we are (ab)using the memref descriptor’s allocated field to store a pointer to a control block, which allows us to have reference-counted memrefs and also to attach arbitrary data to a memref, like a SYCL queue).

So at this stage let’s just add create_stream without arguments, which always selects the ‘default’ device (where the meaning of ‘default’ is completely up to the underlying GPU runtime). We can discuss and extend it separately later.
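
As a sketch (assumed syntax), the argument-less form would simply be:

// Hypothetical syntax: the runtime picks the 'default' device.
%stream = gpu.create_stream : !gpu.stream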

Couldn’t create_device accept just a string as a parameter, leaving the parsing of the string up to the driver?

This isn’t a great way to do it, but a create_stream without arguments will crash if there is no initialization of the “default target”, or if there is more than one. I know those are probably bugs, but it’s always good to catch that kind of bug in IR validation rather than at runtime, especially on GPUs.

IIRC, generic database drivers (ODBC) use a similar interface (URL) precisely for the same reason.

A string will work for us, and with our current implementation (Intel Level Zero/SYCL) we don’t even need a separate device concept, i.e. create_stream("level_zero:gpu:0") will work for us. But I’d like to hear CUDA users’ opinions.


Any feedback/opinions from others? CUDA users?

It is pretty common to separate the programming model from how to actually enumerate and instantiate devices. Having a stream modeled and passed around is a programming-model concern. Actually creating such a thing is a runtime concern. It seems unlikely to me that the GPU dialect will ever be anything but a toy for the latter. If we add such a device creation op, can we explicitly call it out as intended for testing and basic integrations only?

Alternatively, we could leave out any device creation ops and require that devices only be passed in. Test runners could be extended with flags to control device creation and then have a convention for always passing the device first (or something).


@herhut any thoughts on this?

Can I go ahead and create a PR if there are no objections?

The implementation and plumbing of this is a runtime concern, but there is a question of what the right high-level abstraction for it is. If you punt this to the runtime, it seems to me that it amounts to “not building” the abstraction, and any toolchain then has to figure out not only how to set up a device for every possible runtime but also how to thread it through the compiler somehow.

That said, it is perfectly reasonable to start with the types to model the concept (Stream) without having all the operations to manage it, in particular in a multi-device kind of setting. Operations can always be added later when there is more clarity on what the appropriate abstractions are.

Actually, while “stream” exists in CUDA and this proposal is likely a good match there, I’m more curious about how this would all map to the Vulkan model?

I wasn’t necessarily suggesting to punt completely, but to isolate. Those aspects are far from standard across all of the programming models, and in MLIR we usually model such differences with dialects rather than a lowest common denominator. There are also substantial priors for these items just coming from a related runtime library with some form of handle interop.

Not very well, I suspect, but these are already at two different levels of abstraction. We would need to check with the Vulkan folks regarding how much of the GPU dialect really matches that model (vs. just happening to have some correspondence that can be used in simple cases).

Looks like there is a consensus to add the gpu.create_stream op as well, along with the ops listed above?

This is how the generated MLIR IR would look with the create_stream op:

%stream = gpu.create_stream(%device_type) : (!gpu.DeviceType) -> !gpu.StreamType
...
gpu.launch_func(%stream, ...)

Here, create_stream takes an opaque pointer to the device type, which at runtime can be cast to the appropriate device.
This allows users to create their own stream with the passed-in device type and launch kernels on their custom stream and device.

@stellaraccident @Hardcode84 @Jianhui-Li

Created a PR here
https://reviews.llvm.org/D146556

The GPU dialect models dependencies with async tokens, and those are mapped to streams when lowering to the CUDA and ROCm runtimes.

This currently doesn’t support multiple GPUs, but it could by specifying the GPU on the gpu.wait async op (which maps to creating a stream in the CUDA/ROCm runtime).
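
For reference, the token-based form that this lowering handles today looks roughly like this (standard upstream gpu dialect syntax; %c1 and %buf are assumed to be defined earlier):

%t0 = gpu.wait async
%t1 = gpu.launch_func async [%t0] @kernels::@foo
          blocks in (%c1, %c1, %c1) threads in (%c1, %c1, %c1)
          args(%buf : memref<8xf32>)
gpu.wait [%t1]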

I’m a bit concerned that adding device and stream types to the GPU dialect will introduce a new, orthogonal dependency mechanism (e.g., the lowering described above would need to gain support for gpu.launch_func specifying a stream instead of using async tokens), which doesn’t map well to CUDA graphs or Vulkan command queues.

On the other hand, Mehdi’s point seems valid that we don’t want each lowering to a stream-type target to have to reinvent the mapping from async tokens to streams.

Do you happen to have a good pointer that explains the use of streams in Level Zero? Immediate-mode GPU programming models have gone out of fashion, and I’m wondering if we should rather invest in finding a common abstraction for deferred APIs using command queues. I seem to remember that Level Zero also has (non-immediate) command queue support?


We are still going to use async tokens and gpu.wait for dependency tracking; the stream itself doesn’t introduce any dependency info. All GPU runtimes need some sort of ‘context’ to which you submit your kernels, and the goal of this proposal is to make it explicit. Without it you will need to either use global variables or pass the stream as an async token (which is a hack and also doesn’t allow diamond dependencies). It also allows more flexibility, enabling things like outlining and caching stream creation or interleaving execution on multiple devices.
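
For example, a diamond-shaped dependency on a single explicit stream could look like this (assumed syntax that combines the proposed stream operand with the existing async tokens; @m::@a etc. are placeholder kernels):

// Hypothetical syntax: one stream, diamond dependencies a -> {b, c} -> join.
%stream = gpu.create_stream : !gpu.stream
%t0 = gpu.launch_func async (%stream) @m::@a ...
%t1 = gpu.launch_func async [%t0] (%stream) @m::@b ...
%t2 = gpu.launch_func async [%t0] (%stream) @m::@c ...
gpu.wait [%t1, %t2]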

On the Level Zero side we are going to implement streams via zeCommandQueueCreate/zeCommandListCreateImmediate and async tokens as events. Queues can be asynchronous in L0 (i.e., they may execute submitted kernels in a different order), so we need to synchronize them via events in any case.
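
Very roughly, the lowered form on our side would then call runtime wrappers along these lines (the wrapper names below are purely illustrative, not an existing API):

// Hypothetical wrappers: internally they would use zeCommandListCreateImmediate,
// zeEventCreate and zeCommandListAppendLaunchKernel with signal/wait events.
%stream = llvm.call @gpuxStreamCreate() : () -> !llvm.ptr<i8>
%event = llvm.call @gpuxEventCreate() : () -> !llvm.ptr<i8>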


Yes, the async tokens are still going to be used. Also, along with the Level Zero runtime pointers that Ivan provided above, we also use the SYCL runtime, which handles streams and events in a somewhat similar way.
Here are the links to the SYCL specification:
https://registry.khronos.org/SYCL/specs/sycl-2020/html/sycl-2020.html#sec:interface.queue.class
https://registry.khronos.org/SYCL/specs/sycl-2020/html/sycl-2020.html#sec:interface.event

Sorry, I should have asked earlier. It isn’t great to have two different ways of creating streams; at the very least the naming is already problematic here. %t = gpu.wait async already provides the abstraction that is lowered to the creation of streams, so a new gpu.create_stream overlaps with it, the way I see it. Looking at the OP, if a user wants to provide an existing stream (e.g., passed as an argument to a FuncOp), this could be done by modelling the thing being passed as a !gpu.async.token while at the GPU dialect level, which would lower, for example, on the LLVM lowering path, to an !llvm.ptr<i8>. Then, passing a cudaStream_t in the case of NVIDIA GPUs, for example, makes the generated/compiled MLIR work perfectly fine (all the runtime calls in it would work with the custom stream passed in). I can’t see why such a lowering can’t be built for the Intel GPU. If diamond DAGs can’t be modelled and are desired, then that’s a separate problem to address with the current approach, and we should replace/improve the latter.
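
Concretely, the sketch I have in mind, using only existing upstream ops (the function and kernel names are placeholders):

// The caller's stream enters as the initial async token; on the LLVM
// lowering path %t0 becomes an !llvm.ptr<i8>, e.g. a cudaStream_t.
func.func @run(%t0 : !gpu.async.token, %buf : memref<8xf32>) {
  %c1 = arith.constant 1 : index
  %t1 = gpu.launch_func async [%t0] @kernels::@foo
            blocks in (%c1, %c1, %c1) threads in (%c1, %c1, %c1)
            args(%buf : memref<8xf32>)
  gpu.wait [%t1]
  return
}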

A stream/queue is a separate concept from synchronization in L0/SYCL: L0 has asynchronous queues which can execute kernels in arbitrary order, and the user needs to explicitly synchronize kernels via events. That is why we are proposing to split this into two entities. The stream just provides some ‘context’ where you can enqueue kernels but doesn’t guarantee any execution order (except in the case when launch_func is expected to execute immediately, i.e., doesn’t return an async token), while async tokens are used for explicit synchronization. The current implementation, where a CUDA stream is passed as an async token, doesn’t look like a general solution to me; it looks more like a hack, and I definitely don’t want to pull such a hack into our L0/SYCL implementation.