Problems on lowering scf.parallel with dynamic boundary to GPU

Hi folks,

I wrote a simple MLIR snippet, shown below, to compute a segment sum.

#map = affine_map<(d0, d1)[s0] -> (d0 * 5 + s0 + d1)>
func.func @segment_sum(%arg0 : memref<10xi64>, %arg1 : memref<50x5xf32>, %arg2 : memref<9x5xf32>) {
    %c0 = arith.constant 0 : index
    %c1 = arith.constant 1 : index
    %c9 = arith.constant 9 : index
    scf.parallel (%i) = (%c0) to (%c9) step (%c1) {
      %n = arith.addi %i, %c1 : index
      %lb = memref.load %arg0[%i] : memref<10xi64>
      %hb = memref.load %arg0[%n] : memref<10xi64>
      %size_i64 = arith.subi %hb, %lb : i64

      %size = arith.index_cast %size_i64 : i64 to index
      %offset = arith.index_cast %lb : i64 to index
      %view = memref.subview %arg1[%offset, 0][%size, 5][1, 1] : memref<50x5xf32> to memref<?x5xf32, #map>

      %tensor = bufferization.to_tensor %view : memref<?x5xf32, #map>
      %res_tensor = "tosa.reduce_sum"(%tensor) {axis = 0} : (tensor<?x5xf32>) -> (tensor<5xf32>)
      %res_memref = bufferization.to_memref %res_tensor : memref<5xf32>

      %res_vector = vector.load %res_memref[%c0] : memref<5xf32>, vector<5xf32>
      vector.store %res_vector, %arg2[%i, %c0] : memref<9x5xf32>, vector<5xf32>
    }
    return
  }
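For reference, here is the computation the snippet performs, sketched in plain Python (the helper below is my own illustration, not part of the MLIR): segment `i` of the output is the column-wise sum of input rows `offsets[i]` through `offsets[i+1]-1`, so an offsets array of length 10 over a 50x5 input yields a 9x5 output, matching the memref shapes above.

```python
def segment_sum(offsets, data):
    """Column-wise sum of the rows of `data` within each segment.

    offsets: list of length n+1; segment i covers rows
             offsets[i] .. offsets[i+1]-1 of `data`.
    Returns a list of n per-segment row sums.
    """
    n = len(offsets) - 1
    cols = len(data[0])
    result = []
    for i in range(n):
        acc = [0.0] * cols
        # Accumulate every row belonging to segment i.
        for row in data[offsets[i]:offsets[i + 1]]:
            for j in range(cols):
                acc[j] += row[j]
        result.append(acc)
    return result

# Example: 2 segments over 3 rows of width 2.
print(segment_sum([0, 2, 3], [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]))
# → [[4.0, 6.0], [5.0, 6.0]]
```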

It can be successfully lowered to GPU with

mlir-opt ops.mlir \
  --pass-pipeline="func.func(tosa-to-linalg)" \
  --one-shot-bufferize \
  --convert-linalg-to-loops \
  --scf-parallel-loop-fusion \
  --pass-pipeline="func.func(gpu-map-parallel-loops, convert-parallel-loops-to-gpu)" \
  --convert-scf-to-cf \
  --convert-vector-to-llvm \
  --convert-arith-to-llvm \
  --convert-memref-to-llvm \
  --canonicalize \
  --gpu-kernel-outlining \
  --lower-affine \
  --pass-pipeline='gpu.module(strip-debuginfo,convert-vector-to-gpu,convert-gpu-to-nvvm,reconcile-unrealized-casts,gpu-to-cubin)' \
  --gpu-to-llvm \
  --reconcile-unrealized-casts | mlir-translate --mlir-to-llvmir

But when I changed the input argument to a memref with a dynamic dimension (and, accordingly, a dynamic upper bound for scf.parallel), as shown below,

#map = affine_map<(d0, d1)[s0] -> (d0 * 5 + s0 + d1)>
func.func @segment_sum(%arg0 : memref<10xi64>, %arg1 : memref<50x5xf32>, %arg2 : memref<?x5xf32>) {
    %c0 = arith.constant 0 : index
    %c1 = arith.constant 1 : index
    // %c9 = arith.constant 9 : index
    %dim = memref.dim %arg2, %c0 : memref<?x5xf32>
    scf.parallel (%i) = (%c0) to (%dim) step(%c1) {
      %n = arith.addi %i, %c1 : index
      %lb = memref.load %arg0[%i] : memref<10xi64>
      %hb = memref.load %arg0[%n] : memref<10xi64>
      %size_i64 = arith.subi %hb, %lb : i64

      %size = arith.index_cast %size_i64 : i64 to index
      %offset = arith.index_cast %lb : i64 to index
      %view = memref.subview %arg1[%offset, 0][%size, 5][1, 1] : memref<50x5xf32> to memref<?x5xf32, #map>

      %tensor = bufferization.to_tensor %view : memref<?x5xf32, #map>
      %res_tensor = "tosa.reduce_sum"(%tensor) {axis = 0} : (tensor<?x5xf32>) -> (tensor<5xf32>)
      %res_memref = bufferization.to_memref %res_tensor : memref<5xf32>

      %res_vector = vector.load %res_memref[%c0] : memref<5xf32>, vector<5xf32>
      vector.store %res_vector, %arg2[%i, %c0] : memref<?x5xf32>, vector<5xf32>
    }
    return
  }

I got the following errors with the same commands.

ops.mlir:42:5: error: semi-affine expressions (division by non-const) are not supported
    scf.parallel (%i) = (%c0) to (%dim) step(%c1) {
    ^
ops.mlir:38:11: error: failed to legalize operation 'builtin.unrealized_conversion_cast' that was explicitly marked illegal
    %c0 = arith.constant 0 : index
          ^
ops.mlir:38:11: note: see current operation: %35 = "builtin.unrealized_conversion_cast"(%28) : (i64) -> index

Am I missing something, or is lowering scf.parallel with a dynamic bound to GPU indeed not supported?

Thank you!

Your example did not lower for me (it did not make it through bufferization), but the simplified example

module {
  func.func @segment_sum(%arg0: memref<10xi64>, %arg1: memref<50x5xf32>, %arg2: memref<?x5xf32>) {
    %c0 = arith.constant 0 : index
    %c1 = arith.constant 1 : index
    %cst = arith.constant 2.000000e+00 : f32
    %0 = memref.dim %arg2, %c0 : memref<?x5xf32>
    scf.parallel (%arg3) = (%c0) to (%0) step (%c1) {
      memref.store %cst, %arg2[%arg3, %c0] : memref<?x5xf32>
      scf.yield
    }
    return
  }
}

can be lowered via

  --pass-pipeline="func.func(tosa-to-linalg)" \
  --one-shot-bufferize \
  --convert-linalg-to-loops \
  --scf-parallel-loop-fusion \
  --pass-pipeline="func.func(gpu-map-parallel-loops, convert-parallel-loops-to-gpu)" \

to the gpu dialect equivalent

#map0 = affine_map<(d0)[s0, s1] -> ((d0 - s0) ceildiv s1)>
#map1 = affine_map<(d0)[s0, s1] -> (d0 * s0 + s1)>
module {
  func.func @segment_sum(%arg0: memref<10xi64>, %arg1: memref<50x5xf32>, %arg2: memref<?x5xf32>) {
    %c0 = arith.constant 0 : index
    %c1 = arith.constant 1 : index
    %cst = arith.constant 2.000000e+00 : f32
    %0 = memref.dim %arg2, %c0 : memref<?x5xf32>
    %c1_0 = arith.constant 1 : index
    %1 = affine.apply #map0(%0)[%c0, %c1]
    gpu.launch blocks(%arg3, %arg4, %arg5) in (%arg9 = %1, %arg10 = %c1_0, %arg11 = %c1_0) threads(%arg6, %arg7, %arg8) in (%arg12 = %c1_0, %arg13 = %c1_0, %arg14 = %c1_0) {
      %2 = affine.apply #map1(%arg3)[%c1, %c0]
      memref.store %cst, %arg2[%2, %c0] : memref<?x5xf32>
      gpu.terminator
    } {SCFToGPU_visited}
    return
  }
}

The scf.parallel-to-gpu.launch code has changed a bit since I worked on it, but as far as I remember it requires the lower bound and step to be constant while allowing dynamic upper bounds.

Actually, after writing this I realized the issue might be #map0. So I ran --lower-affine on the above output and indeed got the error

error: semi-affine expressions (division by non-const) are not supported
    %1 = affine.apply #map0(%0)[%c0, %c1]

#map = affine_map<(d0)[s0, s1] -> ((d0 - s0) ceildiv s1)>

which seems wrong to me; at least I cannot see a non-constant division. In fact, the entire affine.apply can be canonicalized to just %0, and indeed a run of --canonicalize produces

module {
  func.func @segment_sum(%arg0: memref<10xi64>, %arg1: memref<50x5xf32>, %arg2: memref<?x5xf32>) {
    %c0 = arith.constant 0 : index
    %c1 = arith.constant 1 : index
    %cst = arith.constant 2.000000e+00 : f32
    %0 = memref.dim %arg2, %c0 : memref<?x5xf32>
    gpu.launch blocks(%arg3, %arg4, %arg5) in (%arg9 = %0, %arg10 = %c1, %arg11 = %c1) threads(%arg6, %arg7, %arg8) in (%arg12 = %c1, %arg13 = %c1, %arg14 = %c1) {
      memref.store %cst, %arg2[%arg3, %c0] : memref<?x5xf32>
      gpu.terminator
    } {SCFToGPU_visited}
    return
  }
}

which should lower just fine.

PS: I should add that this is not a bug, even if the message is misleading. From the affine map's perspective, s1 is not a constant, so the map itself does contain a non-constant division. But the affine.apply binds a constant to s1, so after canonicalization all is fine.


Thank you so much for the helpful explanation! It works!

I have another question about the lowered gpu dialect. I found that the IV of scf.parallel is mapped to gpu.block_id x, which seems to make little sense. In general, shouldn't we map IVs to threads first, since threads in the same block share memory? Do I misunderstand something, or is there an mlir-opt option to control this behavior? Thanks again!

The pass you are using, gpu-map-parallel-loops, is really more of a test pass. It applies a static mapping: the outermost loop becomes blocks, the second-outermost becomes threads, and the rest are mapped to sequential loops.

So for your use case, you either have to surround your loop with another parallel loop or change how the mapping attributes are attached.
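For the second option, one can attach the mapping attribute by hand instead of running gpu-map-parallel-loops, directing the single loop dimension to thread_x rather than block_x. A rough sketch follows; the exact spelling of the #gpu.loop_dim_map attribute varies across MLIR versions, so treat this as an illustration rather than copy-paste-ready IR:

```
// Map the single loop dimension to thread_x instead of block_x.
scf.parallel (%i) = (%c0) to (%dim) step (%c1) {
  // ... loop body ...
  scf.yield
} {mapping = [#gpu.loop_dim_map<processor = thread_x,
                                map = (d0) -> (d0),
                                bound = (d0) -> (d0)>]}
```

convert-parallel-loops-to-gpu then reads this attribute when constructing the gpu.launch.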
