How to run mlir-cpu-runner with cluster_dim?

For now, I can use mlir-cpu-runner to run host and device code on the GPU. I read the documentation for gpu.launch, but I don’t see any option to set cluster_dim. Is there a way to do this?

Are you asking about NVIDIA Hopper’s CTA clusters? If so, we don’t yet have a feature to launch a kernel with CTA clusters, but I have an internal pull request that I’ll be putting up soon.

That’s exactly what I want. Thanks! 😀

Is there a rough estimate of when this will be available? Thanks.

I’m planning to put up the PR enabling cluster kernel launches next week. The PR will piggyback on the gpu.launch/gpu.launch_func ops, and the cluster dimensions will be optional.
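To make "optional cluster dimensions" concrete, here is a minimal Python sketch of a launch description in which the cluster shape may be omitted. It is a hypothetical model, not the upstream gpu.launch/gpu.launch_func interface: the names LaunchConfig and num_clusters are invented for illustration, and the rule that every grid dimension must be a multiple of the matching cluster dimension is an assumption taken from CUDA's thread block cluster behavior.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

Dim3 = Tuple[int, int, int]


@dataclass(frozen=True)
class LaunchConfig:
    """Hypothetical launch description: grid size in CTAs, block size in
    threads, and an optional cluster shape in CTAs (None means no clusters)."""
    grid: Dim3
    block: Dim3
    cluster: Optional[Dim3] = None

    def __post_init__(self) -> None:
        if self.cluster is None:
            return
        for g, c in zip(self.grid, self.cluster):
            # Assumption: clusters must tile the grid exactly in every dimension.
            if c <= 0 or g % c != 0:
                raise ValueError(
                    f"grid {self.grid} is not divisible by cluster {self.cluster}"
                )

    def num_clusters(self) -> Dim3:
        """Clusters per grid in each dimension. Without a cluster shape,
        each CTA is treated as its own 1x1x1 cluster."""
        cluster = self.cluster or (1, 1, 1)
        return tuple(g // c for g, c in zip(self.grid, cluster))
```

For example, `LaunchConfig(grid=(8, 4, 1), block=(128, 1, 1), cluster=(2, 1, 1)).num_clusters()` gives `(4, 4, 1)`, while omitting `cluster` leaves an ordinary launch where every CTA stands alone.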

By the way, we already have multicast support for TMA loads (cp.async.bulk.tensor), and we’ve introduced special registers such as cluster dim/id in the NVVM dialect.
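For readers unfamiliar with those special registers, the following Python sketch models how the per-CTA cluster values relate to the ordinary CTA index. It assumes clusters tile the grid contiguously and that the linear rank inside a cluster varies fastest in x; this is a reading of the PTX %clusterid, %cluster_ctaid and %cluster_ctarank registers, not something stated in this thread. The function cluster_coords is hypothetical and only illustrates the arithmetic; it does not query any hardware.

```python
from typing import Tuple

Dim3 = Tuple[int, int, int]


def cluster_coords(ctaid: Dim3, cluster_nctaid: Dim3) -> Tuple[Dim3, Dim3, int]:
    """Return (clusterid, cluster_ctaid, cluster_ctarank) for one CTA.

    Assumes clusters tile the grid contiguously, so each grid coordinate
    splits into a cluster index and a position inside that cluster.
    """
    clusterid = tuple(i // n for i, n in zip(ctaid, cluster_nctaid))
    cluster_ctaid = tuple(i % n for i, n in zip(ctaid, cluster_nctaid))
    cx, cy, _ = cluster_nctaid
    x, y, z = cluster_ctaid
    # Linear rank inside the cluster, with x varying fastest (assumed ordering).
    rank = x + y * cx + z * cx * cy
    return clusterid, cluster_ctaid, rank
```

With 2x2x1 clusters, the CTA at grid position `(5, 2, 0)` falls in cluster `(2, 1, 0)` at local position `(1, 0, 0)`, i.e. rank 1.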

I’m curious: what specific use case do you have in mind for clusters?