Following existing examples, I can handle allocations and elementwise ops on GPUs, but I'm running into issues with reduction ops.
Here's a reduction example using a linalg named op:
%reduce = linalg.reduce { arith.addf } ins(%input : tensor<16xf32>)
          outs(%output : tensor<f32>) dimensions = [0]
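For completeness, a standalone reproducer wrapping this in a function might look like the following (function and value names are mine):

```mlir
// Hypothetical wrapper around the reduction above; names are assumptions.
func.func @reduce_sum(%input: tensor<16xf32>, %init: tensor<f32>) -> tensor<f32> {
  // Sum-reduce the single dimension of a 16-element tensor into a rank-0 tensor.
  %reduce = linalg.reduce { arith.addf }
      ins(%input : tensor<16xf32>)
      outs(%init : tensor<f32>)
      dimensions = [0]
  return %reduce : tensor<f32>
}
```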
After the -linalg-generalize-named-ops pass, this becomes:
#map = affine_map<(d0) -> (d0)>
#map1 = affine_map<(d0) -> ()>
%0 = linalg.generic {indexing_maps = [#map, #map1], iterator_types = ["reduction"]}
    ins(%arg0 : tensor<16xf32>) outs(%arg1 : tensor<f32>) {
^bb0(%in: f32, %out: f32):
  %1 = arith.addf %out, %in : f32
  linalg.yield %1 : f32
} -> tensor<f32>
The -convert-linalg-to-parallel-loops pass then converts it (after bufferization, hence the memrefs below) to a sequential scf.for, since the reduction iterator is not lowered to scf.parallel:
scf.for %arg0 = %c0 to %c16 step %c1 {
  %0 = memref.load %memref[%arg0] : memref<16xf32>
  %1 = memref.load %memref_1[] : memref<f32>
  %2 = arith.addf %1, %0 : f32
  memref.store %2, %memref_1[] : memref<f32>
}
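For comparison, my understanding is that the GPU mapping passes expect the parallel-loop form with an scf.reduce region, roughly like the following hand-written sketch (this is not actual pass output, and names such as %cst0 are mine):

```mlir
// Hand-written sketch of the scf.parallel + scf.reduce form that the GPU
// mapping passes operate on; not produced by the pipeline above.
%sum = scf.parallel (%i) = (%c0) to (%c16) step (%c1) init (%cst0) -> f32 {
  %elem = memref.load %memref[%i] : memref<16xf32>
  // Combine partial results with addition.
  scf.reduce(%elem : f32) {
  ^bb0(%lhs: f32, %rhs: f32):
    %add = arith.addf %lhs, %rhs : f32
    scf.reduce.return %add : f32
  }
}
```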
As far as I understand, the existing GPU passes such as -gpu-map-parallel-loops and -convert-parallel-loops-to-gpu only apply to scf.parallel loops; they do not modify scf.for.
Are there any existing passes that could be used to offload the above kernel to the GPU? We can assume targeting the NVVM pipeline, for example.
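For reference, the pipeline I have so far is roughly the following (a sketch using the passes mentioned above; the input filename and bufferization options are placeholders):

```
# Sketch of the mlir-opt invocation; bufferization options elided.
mlir-opt reduce.mlir \
  --linalg-generalize-named-ops \
  --one-shot-bufferize \
  --convert-linalg-to-parallel-loops
```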