I have the following code example that I would like to distribute across threads using the vector-warp-distribute pass:
func.func @warp_scf_for(%arg0: index) {
  %c128 = arith.constant 128 : index
  %c1 = arith.constant 1 : index
  %c0 = arith.constant 0 : index
  %0 = gpu.warp_execute_on_lane_0(%arg0)[32] -> (vector<4xf32>) {
    %ini = "some_def"() : () -> (vector<128xf32>)
    %k = arith.constant 0 : index
    %3:2 = scf.for %arg3 = %c0 to %c128 step %c1 iter_args(%arg4 = %ini, %arg5 = %k) -> (vector<128xf32>, index) {
      %add = arith.addi %arg3, %c1 : index
      %t = arith.addi %k, %c1 : index
      %acc = "some_def"(%add, %arg4, %t) : (index, vector<128xf32>, index) -> (vector<128xf32>)
      scf.yield %acc, %t : vector<128xf32>, index
    }
    gpu.yield %3 : vector<128xf32>
  }
  "some_use"(%0) : (vector<4xf32>) -> ()
  return
}
Currently, this example fails with the following trace:
llvm-project/mlir/lib/Dialect/GPU/IR/GPUDialect.cpp:2301: static void mlir::gpu::WarpExecuteOnLane0Op::build(mlir::OpBuilder&, mlir::OperationState&, mlir::TypeRange, mlir::Value, int64_t, mlir::ValueRange, mlir::TypeRange): Assertion `args.size() == blockArgTypes.size()' failed.
PLEASE submit a bug report to https://github.com/llvm/llvm-project/issues/ and include the crash backtrace.
Stack dump:
It looks like WarpOpScfForOp implicitly requires that every yielded result of the scf.for is used by some operation. Here, %t is yielded by the loop but is only used inside the for body, not outside it. Is this a bug in the implementation, or is there a way to avoid this issue?