Hi! We’re trying to figure out how to use workgroup memory when lowering (as an example) a linalg.matmul
for execution on an NVIDIA GPU.
We boiled it down to this example, which is an extended version of a linalg memory-promotion test (which suggests to us that there should be a valid lowering path):
func.func @main() {
  %A = memref.alloc() : memref<1024x1024xf32>
  %B = memref.alloc() : memref<1024x1024xf32>
  %C = memref.alloc() : memref<1024x1024xf32>
  %cf0 = arith.constant 0.00000e+00 : f32
  %cf1 = arith.constant 1.00000e+00 : f32
  linalg.fill ins(%cf1 : f32) outs(%A : memref<1024x1024xf32>)
  linalg.fill ins(%cf1 : f32) outs(%B : memref<1024x1024xf32>)
  linalg.fill ins(%cf0 : f32) outs(%C : memref<1024x1024xf32>)
  linalg.matmul ins(%A, %B : memref<1024x1024xf32>, memref<1024x1024xf32>)
               outs(%C : memref<1024x1024xf32>)
  return
}
module attributes {transform.with_named_sequence} {
  transform.named_sequence @__transform_main(%arg1: !transform.any_op {transform.readonly}) {
    %matmul = transform.structured.match ops{["linalg.matmul"]} in %arg1 : (!transform.any_op) -> !transform.any_op
    %func = transform.get_parent_op %matmul : (!transform.any_op) -> !transform.any_op
    %block_tiled_matmul, %for_blocks = transform.structured.tile_using_forall %matmul tile_sizes [4, 4, 0] (mapping = [#gpu.block<x>, #gpu.block<y>]) : (!transform.any_op) -> (!transform.any_op, !transform.any_op)
    %gpu_launch = transform.gpu.map_forall_to_blocks %func { generate_gpu_launch } : (!transform.any_op) -> !transform.any_op
    %matmul2 = transform.structured.match ops{["linalg.matmul"]} in %gpu_launch : (!transform.any_op) -> !transform.any_op
    %0 = transform.structured.promote %matmul2 { operands_to_promote = [0, 1], mapping = [#gpu.memory_space<workgroup>] } : (!transform.any_op) -> !transform.any_op
    transform.structured.convert_to_loops %0 : !transform.any_op
    transform.yield
  }
}
Assuming transform.structured.convert_to_loops does delete its target (currently only the case on the main branch of LLVM), we can apply the schedule and obtain two memref.copy ops that copy into workgroup memory:
Lowered Code
#map = affine_map<(d0) -> (d0 * 4)>
module {
  func.func @main() {
    %c0 = arith.constant 0 : index
    %c1024 = arith.constant 1024 : index
    %c4 = arith.constant 4 : index
    %c256 = arith.constant 256 : index
    %c1 = arith.constant 1 : index
    %cst = arith.constant 1.000000e+00 : f32
    %cst_0 = arith.constant 0.000000e+00 : f32
    %alloc = memref.alloc() : memref<1024x4xf32, #gpu.address_space<workgroup>>
    %alloc_1 = memref.alloc() : memref<4x1024xf32, #gpu.address_space<workgroup>>
    %alloc_2 = memref.alloc() : memref<1024x1024xf32>
    %alloc_3 = memref.alloc() : memref<1024x1024xf32>
    %alloc_4 = memref.alloc() : memref<1024x1024xf32>
    linalg.fill ins(%cst : f32) outs(%alloc_2 : memref<1024x1024xf32>)
    linalg.fill ins(%cst : f32) outs(%alloc_3 : memref<1024x1024xf32>)
    linalg.fill ins(%cst_0 : f32) outs(%alloc_4 : memref<1024x1024xf32>)
    gpu.launch blocks(%arg0, %arg1, %arg2) in (%arg6 = %c256, %arg7 = %c256, %arg8 = %c1) threads(%arg3, %arg4, %arg5) in (%arg9 = %c1, %arg10 = %c1, %arg11 = %c1) {
      %0 = gpu.block_id x
      %1 = gpu.block_id y
      %2 = affine.apply #map(%0)
      %3 = affine.apply #map(%1)
      %subview = memref.subview %alloc_2[%2, 0] [4, 1024] [1, 1] : memref<1024x1024xf32> to memref<4x1024xf32, strided<[1024, 1], offset: ?>>
      %subview_5 = memref.subview %alloc_3[0, %3] [1024, 4] [1, 1] : memref<1024x1024xf32> to memref<1024x4xf32, strided<[1024, 1], offset: ?>>
      %subview_6 = memref.subview %alloc_4[%2, %3] [4, 4] [1, 1] : memref<1024x1024xf32> to memref<4x4xf32, strided<[1024, 1], offset: ?>>
      gpu.barrier
      memref.copy %subview, %alloc_1 : memref<4x1024xf32, strided<[1024, 1], offset: ?>> to memref<4x1024xf32, #gpu.address_space<workgroup>>
      gpu.barrier
      memref.copy %subview_5, %alloc : memref<1024x4xf32, strided<[1024, 1], offset: ?>> to memref<1024x4xf32, #gpu.address_space<workgroup>>
      gpu.barrier
      scf.for %arg12 = %c0 to %c4 step %c1 {
        scf.for %arg13 = %c0 to %c4 step %c1 {
          scf.for %arg14 = %c0 to %c1024 step %c1 {
            %4 = memref.load %alloc_1[%arg12, %arg14] : memref<4x1024xf32, #gpu.address_space<workgroup>>
            %5 = memref.load %alloc[%arg14, %arg13] : memref<1024x4xf32, #gpu.address_space<workgroup>>
            %6 = memref.load %subview_6[%arg12, %arg13] : memref<4x4xf32, strided<[1024, 1], offset: ?>>
            %7 = arith.mulf %4, %5 : f32
            %8 = arith.addf %6, %7 : f32
            memref.store %8, %subview_6[%arg12, %arg13] : memref<4x4xf32, strided<[1024, 1], offset: ?>>
          }
        }
      }
      gpu.terminator
    }
    return
  }
}
Now a further lowering to LLVM eludes us. We tried
mlir-opt test_promote_lowered.mlir --gpu-kernel-outlining --convert-linalg-to-loops --convert-scf-to-cf -expand-strided-metadata -lower-affine -convert-arith-to-llvm --finalize-memref-to-llvm -convert-func-to-llvm --canonicalize --gpu-lower-to-nvvm-pipeline="cubin-chip=sm_75 cubin-features=+ptx75 opt-level=3"
which leaves us with the following messages:
error: conversion of memref memory space #gpu.address_space<workgroup> to integer address space failed. Consider adding memory space conversions.
error: conversion of memref memory space #gpu.address_space<workgroup> to integer address space failed. Consider adding memory space conversions.
error: 'llvm.call' op 'memrefCopy' does not reference a symbol in the current scope
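For context, here is a hand-written sketch of what we expected the memory space conversion to produce, based on our (possibly wrong) understanding that #gpu.address_space<workgroup> should be mapped to the integer address space 3, i.e. NVVM shared memory; the names mirror the lowered code above:

```mlir
// Hand-written sketch, not compiler output: the workgroup attribute
// replaced by an integer memory space, here 3 for NVVM shared memory.
%alloc_1 = memref.alloc() : memref<4x1024xf32, 3>
memref.copy %subview, %alloc_1
    : memref<4x1024xf32, strided<[1024, 1], offset: ?>> to memref<4x1024xf32, 3>
```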
Adding -convert-gpu-to-nvvm (which should already be part of gpu-lower-to-nvvm-pipeline) anywhere in the pipeline to get the mentioned memory space conversions, while leaving gpu-lower-to-nvvm-pipeline out, leads to this error:
error: 'llvm.call' op 'memrefCopy' does not reference a symbol in the current scope
Trying to use transform.structured.vectorize_children_and_apply_patterns on those two memref.copy ops (to go via a vector lowering path) yields:
error: Unsupported Op, cannot vectorize
Trying to use transform.structured.vectorize_children_and_apply_patterns on the entire module instead just leaves the two memref.copy ops untouched:
Appended to schedule
// Convert memref.copy to vector.transfer_read/write:
%1 = transform.structured.match ops{["memref.copy"]} in %arg1 : (!transform.any_op) -> !transform.any_op
%copy1, %copy2 = transform.split_handle %1 : (!transform.any_op) -> (!transform.any_op, !transform.any_op)
%2 = transform.get_parent_op %copy1 {isolated_from_above} : (!transform.any_op) -> !transform.any_op
%3 = transform.structured.vectorize_children_and_apply_patterns %2 : (!transform.any_op) -> !transform.any_op
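For reference, the shape we were hoping the copies would take after vectorization is roughly the following (hand-written sketch, not actual output; %subview, %alloc_1, and %c0 as in the lowered code above):

```mlir
// Hand-written sketch: a memref.copy expressed as a vector transfer pair.
%pad = arith.constant 0.0 : f32
%v = vector.transfer_read %subview[%c0, %c0], %pad
    : memref<4x1024xf32, strided<[1024, 1], offset: ?>>, vector<4x1024xf32>
vector.transfer_write %v, %alloc_1[%c0, %c0]
    : vector<4x1024xf32>, memref<4x1024xf32, #gpu.address_space<workgroup>>
```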
It may be that we are simply missing a single pass that would resolve our problem. If anybody is able to give pointers, that would be much appreciated.