For now, I can use mlir-cpu-runner to run host and device code on the GPU. I read the documentation for gpu.launch, but I don’t see any option to set cluster_dim. Is there a way to set it?
Are you asking about NVIDIA Hopper’s CTA clusters? If so, we don’t yet have a feature to launch a kernel with a CTA cluster, but I have an internal pull request that I’ll be putting up soon.
That’s exactly what I want. Thanks
Is there a rough estimate of when this will be available? Thanks.
I’m planning to put up the PR enabling cluster kernels next week. The PR will piggyback on gpu.launch_func ops. Cluster dimensions will be optional.
By the way, we have multicast support for TMA load (cp.async.bulk.tensor) and have introduced special registers such as cluster dim/id within the NVVM dialect.
I’m curious, what specific use-case do you have in mind for utilizing clusters?