Hello,
I wonder whether the MemAlloc effect is supposed to allow or disallow reordering by definition. For example, Flang uses MemoryEffects on a special DebuggingResource to preserve the nesting structure of fir.dummy_scope operations (see flang/include/flang/Optimizer/Dialect/FIROps.td at commit e53acac in llvm/llvm-project on GitHub).
We currently use the MemWrite effect, but it will certainly block some analyses/optimizations. May I use the MemAlloc effect (which is generally more optimizable) instead and still be sure that two fir.dummy_scope operations won’t be reordered by some optimization?
Alternatively, does it make sense to add a core MLIR “debugging” resource that some analyses could handle in a special way (e.g. mlir::affine::isLoopMemoryParallel may assume that MemWrite effects on such a resource do not block parallelization)?
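For reference, the FIR setup mentioned above looks roughly like this in TableGen (paraphrased from FIROps.td; the exact trait list and op body are elided, so details may differ):

```tablegen
// A synthetic resource used only to express ordering constraints for
// debugging-related operations; no real memory is attached to it.
def DebuggingResource : Resource<"::fir::DebuggingResource">;

// fir.dummy_scope writes to DebuggingResource so that two scopes are
// never reordered or merged, e.g. across MLIR inlining.
def fir_DummyScopeOp : fir_Op<"dummy_scope",
    [MemoryEffects<[MemWrite<DebuggingResource>]>]> {
  // ...
}
```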
No, you can’t: two allocs can be reordered.
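To illustrate (a toy model I sketched, not MLIR’s actual effect machinery): an Alloc effect produces fresh, distinct state each time, so two allocs on the same resource commute, while two writes to the same resource do not. That is exactly why MemAlloc cannot pin the relative order of two ops:

```cpp
#include <cassert>

// Toy model of effect-based reordering legality. In MLIR terms:
// Alloc = MemAlloc, Read = MemRead, Write = MemWrite.
enum class EffectKind { Alloc, Read, Write };

// May two adjacent ops, each carrying the given single effect on the
// SAME resource, be swapped without changing observable behavior?
bool mayReorder(EffectKind a, EffectKind b) {
  // Two allocations never conflict: each yields a distinct object.
  if (a == EffectKind::Alloc && b == EffectKind::Alloc)
    return true;
  // Two reads of the same resource commute as well.
  if (a == EffectKind::Read && b == EffectKind::Read)
    return true;
  // Conservatively keep the order for any other pairing, in
  // particular for two writes to the same resource.
  return false;
}
```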
That does seem quite ad-hoc to me actually. Can you elaborate with some IR snippet showing what you’re trying to achieve?
Thanks!
I am investigating the feasibility of using the Affine dialect and transformations in Flang. One aspect is the ability to generate debug and TBAA information for Fortran programs, which is currently done quite late in the Flang pass pipeline. To preserve the source-level information, Flang uses certain FIR operations like fir.declare and fir.dummy_scope. If I want to apply the Affine transformations “in the middle” of the Flang pass pipeline, I may end up with MLIR like this:
// RUN: fir-opt %s -allow-unregistered-dialect -affine-parallelize
func.func @_QPtest1(%arg0 : memref<10xf32>) {
  %cst = arith.constant 1.000000e+00 : f32
  affine.for %arg2 = 0 to 10 {
    %16 = affine.apply affine_map<(d0) -> (d0 + 1)>(%arg2)
    %alloca_0 = memref.alloca() : memref<f32>
    %17 = fir.convert %alloca_0 : (memref<f32>) -> !fir.ref<f32>
    %18 = fir.dummy_scope : !fir.dscope
    %20 = fir.declare %17 dummy_scope %18 arg 1 {uniq_name = "_QFtestFinnerEy"} : (!fir.ref<f32>, !fir.dscope) -> !fir.ref<f32>
    %21 = fir.convert %20 : (!fir.ref<f32>) -> memref<f32>
    affine.store %cst, %21[] : memref<f32>
    %22 = affine.load %21[] : memref<f32>
    affine.store %22, %arg0[%16 - 1] : memref<10xf32>
  }
  return
}
func.func @_QPtest2(%arg0 : memref<10xf32>) {
  %cst = arith.constant 1.000000e+00 : f32
  affine.for %arg2 = 0 to 10 {
    %16 = affine.apply affine_map<(d0) -> (d0 + 1)>(%arg2)
    %alloca_0 = memref.alloca() : memref<f32>
    %17 = fir.convert %alloca_0 : (memref<f32>) -> !fir.ref<f32>
    %20 = fir.declare %17 {uniq_name = "_QFtestFinnerEy"} : (!fir.ref<f32>) -> !fir.ref<f32>
    %21 = fir.convert %20 : (!fir.ref<f32>) -> memref<f32>
    affine.store %cst, %21[] : memref<f32>
    %22 = affine.load %21[] : memref<f32>
    affine.store %22, %arg0[%16 - 1] : memref<10xf32>
  }
  return
}
In test1 I show potential MLIR mixing FIR and Affine dialect operations; note the fir.dummy_scope in this example. Such code may appear due to MLIR inlining, due to early materialization of OpenACC private variables in the Flang frontend, or for other reasons.
In test2 I manually removed fir.dummy_scope (i.e. I lost some source-level information).
I tested them using my modified fir-opt tool (with registered Affine passes, and ViewLikeOpInterface attached to the fir.declare operation): -affine-parallelize can parallelize the loop in test2 but not in test1, because fir.dummy_scope has a MemWrite effect on FIR’s DebuggingResource. As I said before, DebuggingResource is used to guarantee fir.dummy_scope nesting (in the case of MLIR inlining), but it is just an artificial “metadata” resource and should not restrict parallelization in any way.
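Conceptually, the special handling I am asking about would let a parallelism check skip effects on such a metadata-only resource. A hypothetical sketch (not the real mlir::affine::isLoopMemoryParallel logic; the resource-name string is purely illustrative):

```cpp
#include <cassert>
#include <string>
#include <vector>

// A simplified stand-in for an op's (effect kind, resource) pair.
struct Effect {
  bool isWrite;
  std::string resource;
};

// Does any effect in the loop body block parallelization? Writes to the
// artificial debugging resource carry no real loop-carried dependence,
// so they are ignored here.
bool blocksParallelization(const std::vector<Effect> &effects) {
  for (const Effect &e : effects)
    if (e.isWrite && e.resource != "DebuggingResource")
      return true;
  return false;
}
```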
Output MLIR with my modified fir-opt:
#map = affine_map<(d0) -> (d0 + 1)>
module {
  func.func @_QPtest1(%arg0: memref<10xf32>) {
    %cst = arith.constant 1.000000e+00 : f32
    affine.for %arg1 = 0 to 10 {
      %0 = affine.apply #map(%arg1)
      %alloca = memref.alloca() : memref<f32>
      %1 = fir.convert %alloca : (memref<f32>) -> !fir.ref<f32>
      %2 = fir.dummy_scope : !fir.dscope
      %3 = fir.declare %1 dummy_scope %2 arg 1 {uniq_name = "_QFtestFinnerEy"} : (!fir.ref<f32>, !fir.dscope) -> !fir.ref<f32>
      %4 = fir.convert %3 : (!fir.ref<f32>) -> memref<f32>
      affine.store %cst, %4[] : memref<f32>
      %5 = affine.load %4[] : memref<f32>
      affine.store %5, %arg0[%0 - 1] : memref<10xf32>
    }
    return
  }
  func.func @_QPtest2(%arg0: memref<10xf32>) {
    %cst = arith.constant 1.000000e+00 : f32
    affine.parallel (%arg1) = (0) to (10) {
      %0 = affine.apply #map(%arg1)
      %alloca = memref.alloca() : memref<f32>
      %1 = fir.convert %alloca : (memref<f32>) -> !fir.ref<f32>
      %2 = fir.declare %1 {uniq_name = "_QFtestFinnerEy"} : (!fir.ref<f32>) -> !fir.ref<f32>
      %3 = fir.convert %2 : (!fir.ref<f32>) -> memref<f32>
      affine.store %cst, %3[] : memref<f32>
      %4 = affine.load %3[] : memref<f32>
      affine.store %4, %arg0[%0 - 1] : memref<10xf32>
    }
    return
  }
}
To add to that, I think other MLIR dialects may also use a special “debugging” resource to allow some optimizations, e.g. the llvm.dbg.declare intrinsic should probably not block all optimizations due to its conservative side effects.
I agree with you that it may look quite ad-hoc. What might be the other options?
I made an attempt to resolve some issues in MLIR optimizations and alias analysis for operations that access such synthetic resources.
Please feel free to leave comments on the pull request: “[RFC][mlir] Introduced unit SideEffects::Resource.” by vzakhari (llvm/llvm-project PR #178291 on GitHub).