PSA: E2E (compile+run) GPU examples

Recently I’ve been learning CUDA. What better way to understand how the sausage is made than to skip CUDA itself and emit PTX directly, and what better way to do that than using our very own MLIR infra :slight_smile:. To that end, I’ve ported the article How to Optimize a CUDA Matmul Kernel for cuBLAS-like Performance[1] to MLIR. Naturally, the port is via the Python bindings. I was pleasantly surprised that almost nothing was missing from upstream (just one gpu.object attribute).

There are two instances of this port:

  1. Runnable Colab Notebook (don’t forget to switch to GPU runtime)
  2. mlir-python-extras/examples/cuda_matmul_opt.py

A few features/aspects to point out:

  1. I don’t claim or vouch for the sensibility of the progressive optimizations implemented; my goal was only to produce high-fidelity transliterations (the author has a repo with the implementations).
  2. MLIR is used for compiling/lowering gpu.func to PTX, but CuPy is used for the runtime side (cudaMalloc, cuLaunchKernel, etc.). The way this trick is pulled off is by passing the PTX directly to CuPy and letting it compile it to device code (see the sketch after this list). This worked out really nicely because CuPy is (of course) a very polished interface to the CUDA runtime APIs.
  3. In order to “port” the ubiquitous C++ template parameters, I implemented (in mlir-python-extras) “reified generics”, i.e., generics that magically turn into values. In Colab, which is on Python 3.10, this looks a little clunky, but in Python 3.12 it looks very nice:
    @gpu.func
    def sgemm_naive[M, K, N, dtype](
        A: T.memref(M, K, dtype), B: T.memref(K, N, dtype), C: T.memref(M, N, dtype)
    ):
        one = arith.constant(1.0, type=dtype)
        tmp = arith.constant(0, type=dtype)
    
        r = block_dim.x * block_idx.x + thread_idx.x
        c = block_dim.y * block_idx.y + thread_idx.y
    
        for k, tmp in range_(K, iter_args=[tmp]):
            tmp += A[r, k] * B[k, c]
            tmp = yield tmp
        C[r, c] = tmp + one
    
    Note M, K, N, dtype are indeed “reified” in both the function signature and the body.
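
To give a flavor of point 2, here is a minimal, hedged sketch of the CuPy side (the notebook does the real thing). The file name sgemm_naive.ptx and the kernel name are placeholders for whatever your MLIR pipeline emits, and it assumes memref arguments were lowered with the bare-pointer calling convention so each one is a single device pointer:

    import cupy as cp

    # Hypothetical: the MLIR lowering pipeline has already written the PTX for a
    # gpu.func named "sgemm_naive" to sgemm_naive.ptx (names illustrative).
    mod = cp.RawModule(path="sgemm_naive.ptx")  # the driver JIT-compiles the PTX
    sgemm = mod.get_function("sgemm_naive")

    M = K = N = 1024
    A = cp.random.rand(M, K, dtype=cp.float32)
    B = cp.random.rand(K, N, dtype=cp.float32)
    C = cp.zeros((M, N), dtype=cp.float32)

    # One thread per output element: launch as (grid, block, kernel args).
    block = (32, 32, 1)
    grid = ((M + block[0] - 1) // block[0], (N + block[1] - 1) // block[1], 1)
    sgemm(grid, block, (A, B, C))
    cp.cuda.Device().synchronize()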

Anyway, I didn’t really cut any corners to put this together, so I plan to upstream most of the bindings extensions I added to mlir-python-extras in the coming weeks (nothing major, mostly convenience wrappers around ops). The “generics” stuff, though, is probably not a candidate for upstreaming (unless there has recently been a tectonic shift in appetite for how “powerful” the bindings should be :person_shrugging:). I’m also going to keep turning the crank and plan to cross-pollinate with @grypp’s work on tensor cores.

P.S. In principle I think all of this should work for AMD as well (CuPy has experimental support for ROCm), but since I don’t own any ROCm-supported devices I can’t test/experiment.


  1. No affiliation. ↩︎


This looks great!
I appreciate you demonstrating the feasibility of GPU runtime integration via Python.
Previously I tried a very simple GPU example using OpenCL, which worked fine.
As for ROCm, there’s a HIP Python binding; I haven’t tried it, but I can’t think of any reason it wouldn’t work.

Probably not relevant to your specific example, but it would also be interesting to use the transform dialect for tiling and GPU mapping when working through the MLIR Python bindings.

Probably not relevant to your specific example, but it would also be interesting to use the transform dialect for tiling and GPU mapping when working through the MLIR Python bindings.

That’s plenty doable - there are lots of examples in-tree (see mlir/test/python/dialects/transform_*) and in mlir-python-extras.
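
For reference, a rough (untested) sketch of what building such a transform script from Python can look like; op names follow recent upstream and have changed across versions (e.g., TileOp vs. TileUsingForOp), so treat this as the shape of the API rather than gospel:

    from mlir.ir import Context, Location, Module, InsertionPoint
    from mlir.dialects import transform
    from mlir.dialects.transform import structured

    with Context(), Location.unknown():
        module = Module.create()
        with InsertionPoint(module.body):
            # A transform script that matches linalg.matmul and tiles it; the
            # resulting loops could then be mapped to blocks/threads with the
            # transform.gpu ops.
            sequence = transform.SequenceOp(
                transform.FailurePropagationMode.Propagate,
                [],
                transform.OperationType.get("linalg.matmul"),
            )
            with InsertionPoint(sequence.body):
                structured.TileUsingForOp(sequence.bodyTarget, sizes=[32, 32, 8])
                transform.YieldOp()
        print(module)  # applying it is typically done via the transform interpreter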