Constructing a pipeline to lower an affine parallel loop to an NVIDIA GPU

Consider the following snippet of MLIR:

func.func @kernel(%arg0: memref<?xf64>, %arg1: memref<?xf64>, %arg3: memref<1xindex>) attributes {llvm.emit_c_interface} {
  %c0 = arith.constant 0 : index
  %dim = memref.load %arg3[%c0] : memref<1xindex>
  %alloc = memref.alloc(%dim) : memref<?xf64>
  affine.for %arg2 = %c0 to %dim {
    %0 = affine.load %arg0[%arg2] : memref<?xf64>
    %1 = arith.addf %0, %0 : f64
    affine.store %1, %alloc[%arg2] : memref<?xf64>
  }
  return
}

I would like to lower this loop onto a GPU, which should be possible with the utilities currently available in MLIR. The following pipeline gets me there, but not in the way I’d expect:

./bin/mlir-opt testing.mlir --pass-pipeline="builtin.module(func.func(affine-loop-fusion{fusion-maximal=true}, affine-scalrep, affine-parallelize, lower-affine, gpu-map-parallel-loops, convert-parallel-loops-to-gpu, lower-affine), gpu-kernel-outlining, gpu.module(convert-gpu-to-nvvm, convert-nvgpu-to-nvvm, gpu-to-cubin), gpu-to-llvm)"
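
For reference, the tail end of the host code after this pipeline looks roughly like the following (paraphrased from memory; the kernel module name and argument list are approximate):

```mlir
// Paraphrased output: the grid is the full trip count and each block
// contains a single thread.
gpu.launch_func @kernel_kernel::@kernel_kernel
    blocks in (%dim, %c1, %c1)   // gridDim.x = number of array elements
    threads in (%c1, %c1, %c1)   // blockDim.x = 1
    args(%arg0 : memref<?xf64>, %alloc : memref<?xf64>)
```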

In particular, the generated code launches the kernel with a grid size equal to the number of elements in the array and a block size of 1. This is obviously terrible for performance. I’m not sure which passes should be applied to get the behavior I want. Things have already gone wrong by at least this point in the pipeline:

./bin/mlir-opt testing.mlir --pass-pipeline="builtin.module(func.func(affine-parallelize, lower-affine, gpu-map-parallel-loops))"

where the mapping looks like:

module {
  func.func @kernel(%arg0: memref<?xf64>, %arg1: memref<?xf64>, %arg2: memref<1xindex>) attributes {llvm.emit_c_interface} {
    %c0 = arith.constant 0 : index
    %0 = memref.load %arg2[%c0] : memref<1xindex>
    %alloc = memref.alloc(%0) : memref<?xf64>
    %c0_0 = arith.constant 0 : index
    %c1 = arith.constant 1 : index
    scf.parallel (%arg3) = (%c0_0) to (%0) step (%c1) {
      %1 = memref.load %arg0[%arg3] : memref<?xf64>
      %2 = arith.addf %1, %1 : f64
      memref.store %2, %alloc[%arg3] : memref<?xf64>
      scf.yield
    } {mapping = [#gpu.loop_dim_map<processor = block_x, map = (d0) -> (d0), bound = (d0) -> (d0)>]}
    return
  }
}

where I’d maybe expect the processor mapping to involve some combination of block_x, thread_x, and the block dimensions (i.e. the usual blockIdx.x * blockDim.x + threadIdx.x indexing), rather than block_x alone.
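
For comparison, this is the shape of IR I’d hope to see before conversion (hand-written by me, not actual pass output): the loop tiled into a nested scf.parallel pair, with the outer loop mapped to blocks and the inner one to threads. The tile size of 256 and the variable names are arbitrary choices of mine:

```mlir
%c256 = arith.constant 256 : index
// Outer loop: one iteration per block of 256 elements.
scf.parallel (%block) = (%c0) to (%0) step (%c256) {
  // Inner loop: one iteration per thread within the block.
  scf.parallel (%thread) = (%c0) to (%c256) step (%c1) {
    %i = arith.addi %block, %thread : index
    // ... loads/stores indexed by %i, guarded against running past %0 ...
    scf.yield
  } {mapping = [#gpu.loop_dim_map<processor = thread_x, map = (d0) -> (d0), bound = (d0) -> (d0)>]}
  scf.yield
} {mapping = [#gpu.loop_dim_map<processor = block_x, map = (d0) -> (d0), bound = (d0) -> (d0)>]}
```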

Any guidance is appreciated here! I was playing around with affine-loop-tile, but that didn’t seem to push me in the right direction.
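
For the record, the closest I’ve come to producing the nested structure is inserting scf-parallel-loop-tiling between lower-affine and the mapping pass, roughly like below (the tile size of 256 is an arbitrary guess on my part, and I’m not sure this is the intended recipe):

```shell
./bin/mlir-opt testing.mlir --pass-pipeline="builtin.module(func.func(affine-parallelize, lower-affine, scf-parallel-loop-tiling{parallel-loop-tile-sizes=256}, gpu-map-parallel-loops, convert-parallel-loops-to-gpu))"
```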