Upstreaming from our MLIR python compiler project

Hello,

We are working on an MLIR-based Python compiler project (GitHub - intel/mlir-extensions: A staging ground for MLIR dialects and tools) and have accumulated a few things that could potentially be upstreamed. Please let us know if the MLIR community is interested in having any of them upstream, and we can work on it.

General:

  • Uplifting branch-based CFG to structured control flow (scf.if / scf.while) - described in more detail in the replies below
  • Performing memref address computation explicitly at the memref/arith level, decoupled from the memref-to-LLVM lowering (currently used only for GPU kernel code)

GPU:

GPU dialect - we have our own custom GPU dialect, mostly copy-pasted from the upstream GPU dialect with small changes:

  • gpu.alloc shared flag - our InsertGPUAllocs pass uses shared GPU memory (accessible both on host and on device) to transfer data, so we add a custom attribute to gpu.alloc to mark an allocation as shared rather than device-only (see the sketch after this list)
  • Explicit context argument in all GPU ops, plus corresponding context creation/destruction ops - we need to pass additional information to all GPU ops (a Level Zero context and command queue in our case), so we decided to add an extra SSA argument to every op. This simplifies lowering and opens up more optimizations (e.g. outlining context creation, which can be quite an expensive operation). This includes a pass to convert GPU ops from the contextless form.
  • Explicit kernel loading operation - a special version of gpu.launch_func that takes the kernel as an SSA value instead of an attribute, together with a corresponding kernel creation op. SPIR-V loading and compilation by the driver is a very expensive operation, so we need this special op so that kernel loading can be outlined by later passes.
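
For illustration, a rough sketch of the shared allocation in IR; the keyword spelling (host_shared) is illustrative only, the actual naming is still being discussed:

// Device-only allocation (existing upstream form).
%d = gpu.alloc () : memref<256xf32>

// Proposed: allocation accessible from both host and device.
// The host_shared keyword is illustrative, not final syntax.
%s = gpu.alloc host_shared () : memref<256xf32>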

TBD (this list will be updated)

Feel free to ask me if you need more specific details on any of those items.


I found it hard to understand the way this is described. Can you provide more detail or rephrase?

In addition to
gpu.launch_func @kernels::@kernel ...

form, add

%0 = gpu.load_module @kernels
%1 = gpu.get_kernel %0, "kernel"
gpu.launch_func %1 ...

Do you have tests for this so we can see roughly what the effects on the IR and the limitations are?

In Polygeist, we have something similar using LLVM’s LoopInfo. Do you have a rough idea how your approach compares to that?

This sounds useful, but likely needs a design discussion on how to implement it in an MLIR-generic way.

This may let us decouple memref address computation from the memref-to-LLVM lowering. I am supportive of this in general and would like to understand the tradeoffs. We have seen performance improvements attributable to the address computation code emitted by MLIR in its current form compared to Clang, and I wouldn’t want to lose those.

cc @wsmoses

Do you have tests for this so we can see roughly what the effects on the IR and the limitations are?

Not much, unfortunately. Most of these are tested by our end-to-end pipeline, but we are planning to add more IR tests regardless of whether this gets upstreamed.

In Polygeist, we have something similar using LLVM’s LoopInfo.

We don’t use LoopInfo; we have a couple of patterns that search for if-like CFG shapes:

  BB1             BB1
 /  \             |  \
BB2 BB3           |   BB2
 \  /             |  /
  BB4             BB3
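
For illustration, here is a minimal hand-written sketch of the kind of rewrite these patterns perform (not taken from our tests; the exact IR our patterns handle and produce may differ):

// Before: if-like CFG in the cf dialect (the left shape above).
  cf.cond_br %cond, ^bb2, ^bb3
^bb2:
  %a = arith.addi %x, %y : index
  cf.br ^bb4(%a : index)
^bb3:
  %b = arith.muli %x, %y : index
  cf.br ^bb4(%b : index)
^bb4(%r: index):
  // ...

// After: the same logic uplifted to scf.if.
%r = scf.if %cond -> (index) {
  %a = arith.addi %x, %y : index
  scf.yield %a : index
} else {
  %b = arith.muli %x, %y : index
  scf.yield %b : index
}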

And for loops we have a pattern searching for:

  BB1
   |
   v
  BB2 <--+
  / \    |
 v   v   |
BB4  BB3-+

and converting it to scf.while, plus patterns that transform break-like CFG.
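
Again for illustration only, a hand-written sketch of the loop rewrite (not the exact output of our patterns):

// Before: loop expressed as a CFG, matching the shape above.
  cf.br ^bb2(%init : index)
^bb2(%iv: index):   // BB2: loop header
  %cond = arith.cmpi slt, %iv, %n : index
  cf.cond_br %cond, ^bb3, ^bb4
^bb3:               // BB3: loop body, back edge to BB2
  %next = arith.addi %iv, %step : index
  cf.br ^bb2(%next : index)
^bb4:               // BB4: loop exit
  // ...

// After: uplifted to scf.while.
%res = scf.while (%iv = %init) : (index) -> index {
  %cond = arith.cmpi slt, %iv, %n : index
  scf.condition(%cond) %iv : index
} do {
^bb0(%iv: index):
  %next = arith.addi %iv, %step : index
  scf.yield %next : index
}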

I don’t know how general these patterns are; we have only tested them with our Numba frontend (and, thankfully, Python doesn’t have goto).

We also have later passes that uplift scf.while to scf.for and scf.parallel, but they are tied to our Python dialect and are not general enough.

This may let us decouple memref address computation from the memref-to-LLVM lowering.

We don’t have plans to use this approach everywhere. Currently our patterns are limited to GPU kernel code; for host code we use the upstream memref-to-LLVM lowering.
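
To illustrate the general idea (a hand-written sketch, not our actual pattern output; expressing the flattened access through memref.reinterpret_cast is just one possible choice):

// Today the linearized offset for this load is materialized only during
// the memref-to-LLVM lowering.
%v0 = memref.load %m[%i, %j] : memref<4x8xf32>

// The same access with the address computation made explicit at the
// memref/arith level, before any LLVM lowering.
%c8 = arith.constant 8 : index
%row = arith.muli %i, %c8 : index
%lin = arith.addi %row, %j : index
%flat = memref.reinterpret_cast %m to offset: [0], sizes: [32], strides: [1]
    : memref<4x8xf32> to memref<32xf32>
%v1 = memref.load %flat[%lin] : memref<32xf32>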

This seems generally useful and we should just add it.

We use the GPU dialect variant with dependency tokens for this (the AsyncOpInterface) and then lower the dependency tokens to streams. The context is then accessible from the stream object if needed. So an alternative design could be to make these explicit one level later.
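
For reference, the token-based form looks roughly like this with today's upstream ops (dependencies are threaded through !gpu.async.token values, which the lowering maps onto a stream):

%t0 = gpu.wait async
%m, %t1 = gpu.alloc async [%t0] () : memref<256xf32>
%t2 = gpu.launch_func async [%t1] @kernels::@kernel
    blocks in (%gx, %gy, %gz) threads in (%bx, %by, %bz)
    args(%m : memref<256xf32>)
gpu.wait [%t2]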

Do you have an idea how this would compose with a more generic GPU dialect?

Many practical implementations do this in one form or another. We resorted to hiding this in the runtime, where we implemented a module loading cache. Modeling this in IR makes it easier to pre-load modules for predictable runtime without warmup, though. I would be happy to see this added, especially if it is an optional extra op.

This seems generally useful and we should just add it.

Shared flag support in gpu.alloc is the easiest of these; I can prepare a patch soon.

We use the GPU dialect variant with dependency tokens for this (the AsyncOpInterface) and then lower the dependency tokens to streams. The context is then accessible from the stream object if needed. So an alternative design could be to make these explicit one level later.
Do you have an idea how this would compose with a more generic GPU dialect?

My idea was to add this context argument to all existing GPU ops (along with corresponding create/destroy context ops) but make it optional, so all existing workflows that are not aware of it will still work. We can then add a pass that translates ops from one form to the other, so the workflow author can decide whether they want to deal with an explicit context or not.
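
A rough sketch of the two forms I have in mind (the create/destroy ops, the !gpu.context type, and the extra operand are all hypothetical, not existing syntax):

// Contextless form: today's upstream ops, unchanged.
%m0 = gpu.alloc () : memref<256xf32>

// Hypothetical explicit-context form: the same ops with an optional
// context operand, plus create/destroy ops. None of this is upstream syntax.
%ctx = gpu.create_context : !gpu.context
%m1 = gpu.alloc (%ctx) () : memref<256xf32>
gpu.launch_func (%ctx) @kernels::@kernel
    blocks in (%gx, %gy, %gz) threads in (%bx, %by, %bz)
    args(%m1 : memref<256xf32>)
gpu.destroy_context %ctx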

What do you think about this approach?

Patch for the gpu.alloc shared flag: ⚙ D133533 [mlir][gpu] Introduce `shared` flag to `gpu.alloc`

I’ve cleaned up the cfg-to-scf code somewhat; here are IR tests for it: mlir-extensions/cfg-to-scf.mlir at 00cd380beaa0e82f6bf2376c9b986bced52d3b5e · intel/mlir-extensions · GitHub

@herhut I have a review open for shared memory, but there are inconsistencies between NVIDIA and Intel GPU terminology; do you have any ideas for better naming?

Also, about

My idea was to add this context argument to all existing GPU ops (along with corresponding create/destroy context ops) but make it optional, so all existing workflows that are not aware of it will still work. We can then add a pass that translates ops from one form to the other, so the workflow author can decide whether they want to deal with an explicit context or not.

Is this idea OK with you? If so, I can start working on it.

Thanks.