PSA: E2E (compile+run) GPU examples

Recently I’ve been learning CUDA. What better way to understand how the sausage is made than to skip CUDA itself and emit PTX directly, and what better way to do that than using our very own MLIR infra :slight_smile:. To that end, I’ve ported the article How to Optimize a CUDA Matmul Kernel for cuBLAS-like Performance[1] to MLIR. Naturally, the port is via the Python bindings. I was pleasantly surprised that almost nothing was missing from upstream (just one gpu.object attribute).

There are two instances of this port:

  1. Runnable Colab Notebook (don’t forget to switch to GPU runtime)
  2. mlir-python-extras/examples/cuda_matmul_opt.py

A few features/aspects to point out:

  1. I don’t claim or vouch for the sensibility of the progressive optimizations implemented; my goal was only to produce high-fidelity transliterations (the author has a repo with the implementations).
  2. MLIR is used for compiling/lowering gpu.func to PTX, but CuPy is used for the runtime side (cudaMalloc, cuLaunchKernel, etc.). The way this trick is pulled off is by passing the PTX directly to CuPy and letting it compile it to device code (see the sketch after this list). This worked out really nicely because CuPy is (of course) a very polished interface to the CUDA runtime APIs.
  3. In order to “port” the ubiquitous C++ template parameters, I implemented (in mlir-python-extras) “reified generics”, i.e., generics that magically turn into values. In Colab, which is on Python 3.10, this looks a little clunky, but in Python 3.12 it looks very nice:
    @gpu.func
    def sgemm_naive[M, K, N, dtype](
        A: T.memref(M, K, dtype), B: T.memref(K, N, dtype), C: T.memref(M, N, dtype)
    ):
        one = arith.constant(1.0, type=dtype)
        tmp = arith.constant(0, type=dtype)
    
        r = block_dim.x * block_idx.x + thread_idx.x
        c = block_dim.y * block_idx.y + thread_idx.y
    
        for k, tmp in range_(K, iter_args=[tmp]):
            tmp += A[r, k] * B[k, c]
            tmp = yield tmp
        C[r, c] = tmp + one
    
    Note M, K, N, dtype are indeed “reified” in both the function signature and the body.
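
To give a flavor of point 2, here is a minimal, hedged sketch of the CuPy side (the notebook does the real thing). The file name sgemm_naive.ptx and the kernel name are placeholders for whatever your MLIR pipeline emits, and it assumes memref arguments were lowered with the bare-pointer calling convention so each one is a single device pointer:

    import cupy as cp

    # Hypothetical: the MLIR lowering pipeline has already written the PTX for a
    # gpu.func named "sgemm_naive" to sgemm_naive.ptx (names illustrative).
    mod = cp.RawModule(path="sgemm_naive.ptx")  # the driver JIT-compiles the PTX
    sgemm = mod.get_function("sgemm_naive")

    M = K = N = 1024
    A = cp.random.rand(M, K, dtype=cp.float32)
    B = cp.random.rand(K, N, dtype=cp.float32)
    C = cp.zeros((M, N), dtype=cp.float32)

    # One thread per output element: launch as (grid, block, kernel args).
    block = (32, 32, 1)
    grid = ((M + block[0] - 1) // block[0], (N + block[1] - 1) // block[1], 1)
    sgemm(grid, block, (A, B, C))
    cp.cuda.Device().synchronize()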

Anyway, I didn’t really cut any corners to put this together, so I plan to upstream most of the bindings extensions I added to mlir-python-extras in the coming weeks (nothing major, mostly convenience wrappers around ops). The “generics” stuff, though, is probably not a candidate for upstreaming (unless there has recently been a tectonic shift in appetite for how “powerful” the bindings should be :person_shrugging:). I’m also going to keep turning the crank and plan to cross-pollinate with @grypp’s work on tensor cores.

P.S. In principle I think all of this should work for AMD as well (CuPy has experimental support for ROCm), but since I don’t own any ROCm-supported devices I can’t test/experiment.


  1. No affiliation. ↩︎


This looks great!
I appreciate you demonstrating the feasibility of GPU runtime integration via Python.
Previously I tried a very simple GPU example using OpenCL, which worked fine.
As for ROCm, there’s a HIP Python binding; I haven’t tried it, but I can’t think of any reason it wouldn’t work.

Probably not relevant to your specific example, but it would also be interesting to use the transform dialect for tiling and GPU mapping when working through the MLIR Python bindings.

Probably not relevant to your specific example, but it would also be interesting to use the transform dialect for tiling and GPU mapping when working through the MLIR Python bindings.

That’s plenty doable - there are lots of examples in-tree (see mlir/test/python/dialects/transform_*) and in mlir-python-extras.
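
For reference, a rough (untested) sketch of what building such a transform script from Python can look like; op names follow recent upstream and have changed across versions (e.g., TileOp vs. TileUsingForOp), so treat this as the shape of the API rather than gospel:

    from mlir.ir import Context, Location, Module, InsertionPoint
    from mlir.dialects import transform
    from mlir.dialects.transform import structured

    with Context(), Location.unknown():
        module = Module.create()
        with InsertionPoint(module.body):
            # A transform script that matches linalg.matmul and tiles it; the
            # resulting loops could then be mapped to blocks/threads with the
            # transform.gpu ops.
            sequence = transform.SequenceOp(
                transform.FailurePropagationMode.Propagate,
                [],
                transform.OperationType.get("linalg.matmul"),
            )
            with InsertionPoint(sequence.body):
                structured.TileUsingForOp(sequence.bodyTarget, sizes=[32, 32, 8])
                transform.YieldOp()
        print(module)  # applying it is typically done via the transform interpreter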