I am new to MLIR and GPU programming, so I want to study the GPU dialect, and I ran mlir/test/Integration/GPU/CUDA/all-reduce-op.mlir. I did not understand this example. The example is shown below:

I think %idx = addi %tx, %t2 : index just computes the thread id of each thread, and in %sum = "gpu.all_reduce"(%val) ({}) { op = "add" } : (f32) -> (f32), %val is that thread id.
So what is the meaning of gpu.all_reduce? Does it use every thread to compute the sum of all elements in the matrix?

On GPU, multiple threads execute the same IR. Allreduce is a pattern where values available in each thread are aggregated (here, added up) and the result is made available in each thread. In a “simple” reduce, the result would be available in one thread only. See also Collective operation - Wikipedia.
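The distinction between the two patterns can be sketched in plain Python, treating each list element as the value held by one thread (this is an illustrative model of the semantics, not MLIR or CUDA API):

```python
def simple_reduce(values):
    # "Simple" reduce: the aggregated result lands in one thread only
    # (thread 0 here); the other threads get nothing back.
    total = sum(values)
    return [total] + [None] * (len(values) - 1)

def all_reduce(values):
    # All-reduce: the same aggregated result is made available
    # in every thread.
    total = sum(values)
    return [total] * len(values)

vals = [0.0, 1.0, 2.0, 3.0]
print(simple_reduce(vals))  # [6.0, None, None, None]
print(all_reduce(vals))     # [6.0, 6.0, 6.0, 6.0]
```

With op = "add", gpu.all_reduce behaves like the second function: after the op, every thread's %sum holds the sum of all threads' %val.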

Thank you very much. The result of this example is a matrix of size [2, 4, 13], and all the elements are 5356. Why is the result 5356?

If you look closely, each thread computes its linearized index, i.e. (threadIdx.z * blockDim.y + threadIdx.y) * blockDim.x + threadIdx.x, and these indices are later reduced. The linearized indices here are the integers from 0 to 2*4*13 - 1 = 103, which sum to exactly 103 * 104 / 2 = 5356.
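The arithmetic above can be checked with a short Python loop that enumerates the linearized indices for a block of dimensions (blockDim.x, blockDim.y, blockDim.z) = (13, 4, 2) and sums them:

```python
# Block dimensions, matching the 2 x 4 x 13 shape from the example.
bx, by, bz = 13, 4, 2

total = 0
for tz in range(bz):
    for ty in range(by):
        for tx in range(bx):
            # Linearized thread index, as computed by each GPU thread.
            idx = (tz * by + ty) * bx + tx
            total += idx

print(total)  # 5356, i.e. sum of the integers 0..103
```

Since every one of the 104 threads contributes its unique linearized index, the all-reduce produces 0 + 1 + ... + 103 = 5356 in every thread, which is why every element of the output is 5356.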

I also found this example of all-reduce. I think it is a good optimization for reduction computations in deep learning models.
But when I try to lower the high-level MHLO operation (mhlo.reduce) to the GPU dialect, I cannot find the right way to do this, e.g. mhlo.reduce → linalg → gpu.all_reduce.
As in the example below, I can convert mhlo.reduce to a linalg.generic reduction, but I cannot find the right conversion path from the linalg.generic reduction to gpu.all_reduce for this optimization.