What's the meaning of gpu.all_reduce

I am new to MLIR and GPU programming, so I want to study the GPU dialect. I ran mlir/test/Integration/GPU/CUDA/all-reduce-op.mlir, but I didn't understand the example. It is shown below:

func @main() {
  %arg = memref.alloc() : memref<2x4x13xf32>
  %dst = memref.cast %arg : memref<2x4x13xf32> to memref<?x?x?xf32>
  %c0 = constant 0 : index
  %c1 = constant 1 : index
  %c2 = constant 2 : index
  %sx = memref.dim %dst, %c2 : memref<?x?x?xf32>
  %sy = memref.dim %dst, %c1 : memref<?x?x?xf32>
  %sz = memref.dim %dst, %c0 : memref<?x?x?xf32>
  %cast_dst = memref.cast %dst : memref<?x?x?xf32> to memref<*xf32>
  gpu.host_register %cast_dst : memref<*xf32>
  gpu.launch blocks(%bx, %by, %bz) in (%grid_x = %c1, %grid_y = %c1, %grid_z = %c1)
             threads(%tx, %ty, %tz) in (%block_x = %sx, %block_y = %sy, %block_z = %sz) {
    %t0 = muli %tz, %block_y : index
    %t1 = addi %ty, %t0 : index
    %t2 = muli %t1, %block_x : index
    %idx = addi %tx, %t2 : index
    %t3 = index_cast %idx : index to i32
    %val = sitofp %t3 : i32 to f32
    %sum = "gpu.all_reduce"(%val) ({}) { op = "add" } : (f32) -> (f32)
    memref.store %sum, %dst[%tz, %ty, %tx] : memref<?x?x?xf32>
    gpu.terminator
  }
  call @print_memref_f32(%cast_dst) : (memref<*xf32>) -> ()
  return
}

I think %idx = addi %tx, %t2 : index just computes the thread id of every thread, and
in %sum = "gpu.all_reduce"(%val) ({}) { op = "add" } : (f32) -> (f32), %val is that thread id.
So what is the meaning of gpu.all_reduce? Does every thread compute the sum of all elements in the matrix?

On a GPU, multiple threads execute the same IR. Allreduce is a pattern where the values available in each thread are aggregated (here, added up) and the result is made available in every thread. In a "simple" reduce, the result would be available in one thread only. See also Collective operation - Wikipedia.
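For example, instead of the op = "add" attribute used in the test, gpu.all_reduce also accepts a reduction region that describes how two partial values are combined. A minimal sketch in the same style as the test (the exact spelling may differ between MLIR versions; newer versions write arith.addf instead of addf):

%sum = "gpu.all_reduce"(%val) ({
^bb0(%lhs : f32, %rhs : f32):
  %partial = addf %lhs, %rhs : f32
  "gpu.yield"(%partial) : (f32) -> ()
}) : (f32) -> (f32)

With the empty region and op = "add", as in the test, the same addition is implied; either way, every thread ends up with the same %sum.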

Thank you very much. The result of this example is a matrix of size [2, 4, 13], and all the elements are 5356. Why is the result 5356? Thank you very much.

If you look closely, each thread computes its linearized index, i.e. (threadIdx.z * blockDim.y + threadIdx.y) * blockDim.x + threadIdx.x, and these indices are later reduced. The linearized indices here are the integers from 0 to 2*4*13 - 1, which sum up exactly to 5356.
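Spelled out, the reduction computes 0 + 1 + 2 + ... + 103 = 103 * 104 / 2 = 5356, which is why every element of the output holds that value.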

Thank you very much, I got it. The following is the kernel function that is launched on the GPU (CUDA):

gpu.launch blocks(%bx, %by, %bz) in (%grid_x = %c1, %grid_y = %c1, %grid_z = %c1)
           threads(%tx, %ty, %tz) in (%block_x = %sx, %block_y = %sy, %block_z = %sz) {
  %t0 = muli %tz, %block_y : index
  %t1 = addi %ty, %t0 : index
  %t2 = muli %t1, %block_x : index
  %idx = addi %tx, %t2 : index
  %t3 = index_cast %idx : index to i32
  %val = sitofp %t3 : i32 to f32
  %sum = "gpu.all_reduce"(%val) ({}) { op = "add" } : (f32) -> (f32)
  memref.store %sum, %dst[%tz, %ty, %tx] : memref<?x?x?xf32>
  gpu.terminator
}

In this kernel function, each thread computes its linearized index (from 0 to 2*4*13 - 1), and gpu.all_reduce sums them up. Thank you very much.

I also found this all-reduce example. I think it is a good optimization for reduction computations in deep learning models.
But when I try to lower a high-level MHLO operation (mhlo.reduce) to the GPU dialect, I cannot find the right way to do it, i.e. mhlo.reduce → linalg → gpu.all_reduce.
As in the example below, I can convert mhlo.reduce to a linalg.generic reduction. But I cannot find the right conversion path from the linalg.generic reduction to gpu.all_reduce.

 %0 = "mhlo.reduce"(%arg0, %arg1) ({
  ^bb0(%arg3: tensor<i32>, %arg4 : tensor<i32>):
    %1 = mhlo.maximum %arg3, %arg4 : tensor<i32>
    "mhlo.return"(%1) : (tensor<i32>) -> ()
  }) {dimensions = dense<1> : tensor<1xi64>} : (tensor<5x4xi32>, tensor<i32>) -> tensor<5xi32>
  return %0 : tensor<5xi32>
%2 = linalg.generic {indexing_maps = [#map0, #map1], iterator_types = ["parallel", "reduction"]} ins(%arg0 : tensor<1x1000xf32>) outs(%1 : tensor<1xf32>) {
    ^bb0(%arg1: f32, %arg2: f32):  // no predecessors
      %3 = arith.addf %arg1, %arg2 : f32
      linalg.yield %3 : f32
    } -> tensor<1xf32>
    return %2 : tensor<1xf32>
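What I imagine ending up with is something like the all-reduce kernel above, but with each thread reading its value from the input memref instead of using the thread index. This is just a hand-written sketch of the target I have in mind (the names %input, %output, %c1000 etc. are made up, and it is not IR produced by any existing pass):

gpu.launch blocks(%bx, %by, %bz) in (%grid_x = %c1, %grid_y = %c1, %grid_z = %c1)
           threads(%tx, %ty, %tz) in (%block_x = %c1000, %block_y = %c1, %block_z = %c1) {
  // Each thread loads one element of the row being reduced.
  %elem = memref.load %input[%c0, %tx] : memref<1x1000xf32>
  // All threads in the block add their elements; every thread gets the same sum.
  %sum = "gpu.all_reduce"(%elem) ({}) { op = "add" } : (f32) -> (f32)
  // Every thread stores the same value, so the result is written redundantly.
  memref.store %sum, %output[%c0] : memref<1xf32>
  gpu.terminator
}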

Can you tell me what I should do if I want to use this GPU optimization (gpu.all_reduce) for a large mhlo.reduce computation?