Proposal to add stream/queue as an optional argument to a few GPU dialect ops


We are working on Intel® Extension for MLIR (IMEX), a collection of MLIR dialects and passes from Intel for improving upstream MLIR. To enable an MLIR pipeline on Intel GPUs, the runtime (SYCL/DPC++) requires creating a stream/queue (to launch kernels on) with explicit context and device information.

Upstream GPU dialect ops currently don’t allow users to pass a custom stream to launch kernels on. Hence, to make them work on Intel GPUs, we maintain our own internal dialect (GPUX, where X is eXtension), which extends the upstream GPU dialect ops with an added stream/queue argument.

To give users the flexibility to launch kernels on their own custom stream, we propose adding stream/queue as an optional argument to the following upstream GPU dialect ops:

  1. gpu.launch_func

  2. gpu.alloc

  3. gpu.dealloc

  4. gpu.wait

  5. gpu.memcpy

  6. gpu.memset

Adding a stream to these ops will allow users to create their own stream (via their own create/destroy stream methods) and pass it to these ops for device memory allocation and execution.
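As a rough illustration, the extended ops might look like the sketch below. The `!gpux.stream` type, the `gpux.create_stream` op, and the bracketed stream-operand syntax are hypothetical assumptions for this proposal, not existing upstream GPU dialect syntax:

```mlir
// Hypothetical syntax: the stream operand in square brackets is an
// illustrative assumption, not existing upstream GPU dialect IR.
%stream = "gpux.create_stream"() : () -> !gpux.stream
%mem = gpu.alloc [%stream] (%size) : memref<?xf32>
gpu.launch_func [%stream] @kernels::@my_kernel
    blocks in (%gx, %gy, %gz)
    threads in (%bx, %by, %bz)
    args(%mem : memref<?xf32>)
gpu.dealloc [%stream] %mem : memref<?xf32>
```

Since the stream operand would be optional, existing IR that omits the brackets would keep its current semantics.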

I’m not a GPU expert, but I can comment on the MLIR side.

The stream option you want to add is of a Stream type, so that type should also be added to the GPU dialect. But a create_stream op needs a Device, which needs create_device, and so on; none of these are in the upstream dialect.

What is your plan for those?

Technically speaking, I don’t see a problem in adding those ops and types to the existing GPU dialect. If the other GPU targets don’t need them, they don’t have to use them, and if the stream argument is optional, then there’s no change in codegen for them.

But I’d wait to hear from others more involved in GPU code generation to bring their stronger opinions.

Yes, good point; eventually we would like to add new ops like create_stream, create_device, and create_context as well. My plan was to keep them in our internal dialect for now and upstream them eventually once these changes landed. But if there is no strong objection, we can add those new ops as part of this proposal.

The problem with create_device is that different GPU runtimes can have very different ways to select a device, e.g. a numerical index, a device string, or something more complicated. Specifically, for our Python compiler we will need to create a stream either from a device filter string like level_zero:gpu:0 or from a memref (for more context: we are (ab)using the memref descriptor’s allocated field to store a pointer to a control block, which allows us to have reference-counted memrefs and also to attach arbitrary data to a memref, like a SYCL queue).

So at this stage let’s just add create_stream without arguments, which always selects the ‘default’ device (where the meaning of ‘default’ is completely up to the underlying GPU runtime). We can discuss and extend it separately later.
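Under that minimal proposal, the surface would be something like the sketch below; the op names and the `!gpu.stream` type are assumptions for illustration only:

```mlir
// Hypothetical: an argument-less create_stream that binds to the
// runtime-defined "default" device. destroy_stream is the matching
// (also hypothetical) cleanup op.
%stream = gpu.create_stream : !gpu.stream
gpu.launch_func [%stream] @kernels::@my_kernel
    blocks in (%gx, %gy, %gz)
    threads in (%bx, %by, %bz)
gpu.destroy_stream %stream : !gpu.stream
```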

Couldn’t create_device accept just a string as a parameter, with parsing of the string left up to the driver?

That isn’t a great way to do it, but a create_stream without arguments will crash if there is no initialization of the “default target”, or if there is more than one. I know those are probably bugs, but it’s always better to catch that kind of bug in IR validation rather than at runtime, especially on GPUs.

IIRC, generic database drivers (ODBC) use a similar interface (a URL) precisely for this reason.

A string will work for us, and with our current implementation (Intel Level Zero/SYCL) we don’t even need a separate device concept, i.e. create_stream("level_zero:gpu:0") will work for us. But I’d like to hear CUDA users’ opinions.
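Concretely, that string-based variant might look like this sketch, where the attribute name and the `!gpu.stream` type are illustrative assumptions and string parsing is left entirely to the underlying runtime:

```mlir
// Hypothetical: create_stream taking a device filter string; the
// runtime (here Level Zero/SYCL) interprets the string itself.
%stream = gpu.create_stream {device = "level_zero:gpu:0"} : !gpu.stream
```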


Any feedback/opinions from others? CUDA users?

It is pretty common to separate the programming model from how devices are actually enumerated and instantiated. Having a stream modeled and passed around is a programming-model concern; actually creating such a thing is a runtime concern. It seems unlikely to me that the GPU dialect will ever be anything but a toy for the latter. If we add such a device-creation op, can we explicitly call it out as intended for testing and basic integrations only?

Alternatively, we could leave out any device-creation ops and require that streams/devices only be passed in. Test runners could be extended with flags to control device creation, and we could then adopt a convention of always passing the device first (or something similar).
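In that alternative, the IR would only ever consume a stream handed in from outside, e.g. as a function argument; a hypothetical sketch (the `!gpu.stream` type and bracketed stream-operand syntax are assumptions):

```mlir
// Hypothetical: the stream is a function argument supplied by the
// host runtime or test runner; no creation op exists in the dialect.
func.func @run(%stream: !gpu.stream, %size: index) {
  %mem = gpu.alloc [%stream] (%size) : memref<?xf32>
  gpu.dealloc [%stream] %mem : memref<?xf32>
  return
}
```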


@herhut any thoughts on this?