Recently I’ve been learning CUDA. What better way to understand how the sausage is made than to skip CUDA itself and emit PTX directly, and what better way to do that than using our very own MLIR infra? To that end, I’ve ported this article, How to Optimize a CUDA Matmul Kernel for cuBLAS-like Performance[1], to MLIR. Naturally, the port is via the Python bindings. I was pleasantly surprised by how little was missing from upstream (just one `gpu.object` attribute).
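For readers who want the gist of the MLIR side: the kernels are written as `gpu.func`s inside a `gpu.module`, lowered to NVVM, and then serialized to PTX. Below is a very rough sketch of what that pipeline can look like through the Python bindings; the pass names and options are from memory and shift between MLIR versions, so treat it as illustrative rather than the exact pipeline used in the port.

```python
from mlir.ir import Context, Module
from mlir.passmanager import PassManager

# Illustrative only: `src` is assumed to hold a builtin.module containing a
# gpu.module with the kernels already emitted.
with Context():
    module = Module.parse(src)
    pm = PassManager.parse(
        "builtin.module("
        "nvvm-attach-target,"               # attach an NVVM target (chip/features) to the gpu.module
        "gpu.module(convert-gpu-to-nvvm),"  # lower the gpu dialect bodies to NVVM
        "gpu-module-to-binary"              # serialize; a format option lets you stop at ISA (PTX)
        ")"
    )
    pm.run(module.operation)
    # The PTX ends up inside a #gpu.object attribute on the resulting gpu.binary op
    # (reading that attribute from Python is the one binding that was missing upstream).
```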
There are two instances of this port:
- Runnable Colab Notebook (don’t forget to switch to GPU runtime)
- `mlir-python-extras/examples/cuda_matmul_opt.py`
A few features/aspects to point out:
- I make no claims about the sensibility of the progressive optimizations themselves; my goal was only to produce high-fidelity transliterations (the author has a repo with the implementations).
- MLIR is used for compiling/lowering `gpu.func` to PTX, but CuPy is used for the runtime stuff (`cudaMalloc`, `cuLaunchKernel`, etc.). The way this trick is pulled off is by passing the PTX directly to CuPy and letting it compile it to device code (a rough sketch of that handoff follows the example below). This worked out really nicely because CuPy is (of course) a very polished interface to the CUDA runtime APIs.
- In order to “port” the ubiquitous C++ template parameters, I implemented (in `mlir-python-extras`) “reified generics”, i.e., generics that magically turn into values. In Colab, which is on `py310`, this looks a little clunky, but in `py312` it looks very nice:
```python
@gpu.func
def sgemm_naive[M, K, N, dtype](
    A: T.memref(M, K, dtype),
    B: T.memref(K, N, dtype),
    C: T.memref(M, N, dtype)
):
    one = arith.constant(1.0, type=dtype)
    tmp = arith.constant(0, type=dtype)
    r = block_dim.x * block_idx.x + thread_idx.x
    c = block_dim.y * block_idx.y + thread_idx.y
    for k, tmp in range_(K, iter_args=[tmp]):
        tmp += A[r, k] * B[k, c]
        tmp = yield tmp
    C[r, c] = tmp + one
```

Note `M`, `K`, `N`, `dtype` are indeed “reified” in both the function signature and the body.
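To make the CuPy half of the story concrete, here is a rough sketch of how the PTX handoff and kernel launch can look. This is not the exact code from the notebook: the kernel name, file name, launch configuration, and the assumption that memrefs were lowered to bare device pointers are all illustrative.

```python
import cupy as cp

# Illustrative: `ptx` holds the PTX string produced by the MLIR lowering, and
# the kernel is assumed to have been emitted under the name "sgemm_naive" with
# a bare-pointer calling convention (each memref argument is just a device pointer).
with open("sgemm_naive.ptx", "w") as f:
    f.write(ptx)

mod = cp.RawModule(path="sgemm_naive.ptx")  # CuPy loads the PTX as a device module
kernel = mod.get_function("sgemm_naive")

M = K = N = 1024
A = cp.random.rand(M, K, dtype=cp.float32)  # "cudaMalloc" + init, courtesy of CuPy
B = cp.random.rand(K, N, dtype=cp.float32)
C = cp.zeros((M, N), dtype=cp.float32)

block = (32, 32, 1)                          # illustrative launch configuration
grid = (M // block[0], N // block[1], 1)
kernel(grid, block, (A, B, C))               # "cuLaunchKernel", courtesy of CuPy

# Sanity check against cuBLAS (the kernel adds `one`, hence the + 1).
assert cp.allclose(C, A @ B + 1, rtol=1e-3)
```

The pleasant part is that CuPy owns all of the driver/runtime interaction (allocation, module loading, kernel launch), so the MLIR side only has to produce PTX.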
Anyway, I didn’t really cut any corners to put this together, so I plan to upstream most of the bindings extensions I added to `mlir-python-extras` in the coming weeks (nothing major, mostly convenience wrappers around ops). The “generics” stuff is probably not a candidate for upstreaming, though (unless there has recently been a tectonic shift in appetites for how “powerful” the bindings should be). I’m also going to keep turning the crank and plan to cross-pollinate with @grypp’s work on tensor cores.
P.S. In principle I think all of this should work for AMD as well (CuPy has experimental support for ROCm), but seeing as I don’t own any ROCm-supported devices, I can’t test/experiment.
[1] No affiliation. ↩︎