[RFC][MLIR] Introduce hoist-pure-ops pass

Abstract

I hope to add a hoist-pure-ops pass to upstream MLIR. The hoist-pure-ops pass hoists Pure ops in order to create more optimization opportunities.

Motivation

The following is the initial idea.

func.func @hoist_cast_pos_alloc(%arg: i1) -> (memref<?xf32>) {
  %alloc = memref.alloc() : memref<10xf32>
  cf.cond_br %arg, ^bb1, ^bb2
^bb1:
  %cast = memref.cast %alloc : memref<10xf32> to memref<?xf32>
  return %cast : memref<?xf32>
^bb2:
  %cast1 = memref.cast %alloc : memref<10xf32> to memref<?xf32>
  return %cast1 : memref<?xf32> 
}

I hope it transforms into the following IR. However, the CSE pass cannot solve this issue, because %cast does not dominate %cast1. memref.cast is a pure op, so perhaps we can reorder it based on SSA dominance. We have had some discussions on this issue before: will ops without side effects be reordered when running the pass?

func.func @hoist_cast_pos_alloc(%arg: i1) -> (memref<?xf32>) {
  %alloc = memref.alloc() : memref<10xf32>
  %cast = memref.cast %alloc : memref<10xf32> to memref<?xf32>
  cf.cond_br %arg, ^bb1, ^bb2
^bb1:
  return %cast : memref<?xf32>
^bb2:
  return %cast : memref<?xf32> 
}

Based on the above discussion, we can write a pattern for memref.cast and adjust the position of the cast. Now the CSE pass can take over.

func.func @hoist_cast_pos_alloc(%arg: i1) -> (memref<?xf32>) {
  %alloc = memref.alloc() : memref<10xf32>
  %cast = memref.cast %alloc : memref<10xf32> to memref<?xf32>
  %cast1 = memref.cast %alloc : memref<10xf32> to memref<?xf32>
  cf.cond_br %arg, ^bb1, ^bb2
^bb1:
  return %cast : memref<?xf32>
^bb2:
  return %cast1 : memref<?xf32> 
}

Thanks @mehdi_amini for asking me this question: “What makes cast “special” with this property? Why not other ops?” I think that if an op is a Pure op, we have the opportunity to hoist it based on SSA dominance, so perhaps we can write a more generic pass to hoist Pure ops.
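As a toy illustration of the idea (plain Python, not MLIR code — the block/op encoding and the function name are invented for this sketch), a pure op whose operands are all defined in a dominating block can be moved up next to its producer:

```python
# Toy model of the proposed hoisting, not the actual MLIR implementation.
# A function is a dict: block name -> list of ops; each op is a tuple
# (result, opname, operands). In this hypothetical CFG, "entry" dominates
# every other block, standing in for real SSA dominance info.

def hoist_pure_ops(blocks, dominator, pure_opnames):
    """Move pure ops whose operands are all defined in the dominating
    block up into that block."""
    defined_in_dom = {res for (res, _, _) in blocks[dominator]}
    for name, ops in blocks.items():
        if name == dominator:
            continue
        hoisted = [op for op in ops
                   if op[1] in pure_opnames and set(op[2]) <= defined_in_dom]
        blocks[dominator].extend(hoisted)
        blocks[name] = [op for op in ops if op not in hoisted]
    return blocks

blocks = {
    "entry": [("%alloc", "memref.alloc", [])],
    "bb1": [("%cast", "memref.cast", ["%alloc"])],
    "bb2": [("%cast1", "memref.cast", ["%alloc"])],
}
hoist_pure_ops(blocks, "entry", {"memref.cast"})
# Both casts now live in the entry block, where ordinary CSE can merge
# them because one cast dominates every former use of the other.
```

This mirrors the RFC example: after hoisting, the two casts become CSE-able duplicates in a single block.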

Code implementation

I have made a simple implementation of the idea above: [mlir] add hoist-pure-ops to mlir by linuxlonelyeagle · Pull Request #168715 · llvm/llvm-project · GitHub. If you have any questions, do let me know. Thanks. :face_blowing_a_kiss:


This looks similar to loop-invariant code motion but for conditionals. Can the two be generalized to something that works across control flow operations?


I looked into this earlier. The LICM pass is a bit simpler because it directly matches loop-like ops. In particular, it does not need DominanceInfo. So implementation-wise I’m not sure… But maybe it makes sense to put both implementations in the same file and to expose both hoistings via a single pass (with pass options).

Perhaps we could use hoist-pure-ops instead of loop-invariant code motion. :thinking:

They’re not doing the same thing: LICM is intended to reduce the (dynamic) repetition of these operations by moving them out of loops.
What you’re proposing (moving them close to their definition) possibly makes them execute when they wouldn’t have before (maybe even moving them into a loop when they were outside it before).

maybe even potentially moving them into a loop when they were outside before

I’m not entirely sure whether such a situation actually exists.

Here is a trivial example (just a twist on your example) that demonstrates it:

func.func @hoist_cast_pos_alloc(%arg: i1) -> (memref<?xf32>) {
  cf.br ^entry
^entry:
  %get_memref = "test.get_memref"() : () -> memref<10xf32>
  %cond = "test.loop_cond"() : () -> i1
  cf.cond_br %cond, ^entry, ^exit
^exit:
  %cast = memref.cast %get_memref : memref<10xf32> to memref<?xf32>
  return %cast : memref<?xf32>
}

Hmm, I read the proposal as only hoisting if both edges of the conditional have the operation…


The example in the RFC was misleading: the implementation just checks whether an operand is in a different block than its producer and relocates the operation next to the producer.


I think this idea makes sense. Introduce a new constraint: when a block has multiple successor blocks, and two or more of these successors contain the same pure operation, we hoist it into the common predecessor. :thinking:

If the same pure op appears on two or more edges, we hoist it to a position that dominates both. What do you think of doing this? @ftynse @mehdi_amini @matthias-springer :thinking:
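That constraint can be sketched in the same toy encoding (a hypothetical Python model, not the actual pass — ops here are (name, operand-tuple) pairs with result names omitted so that “the same pure operation” is easy to compare):

```python
from collections import Counter

def hoist_common_pure_ops(pred, succs, pure_opnames):
    """Hoist a pure op into `pred` only when an identical copy
    (same op name and operands) appears in two or more successors."""
    counts = Counter()
    for ops in succs:
        # Count each distinct pure op at most once per successor block.
        counts.update({op for op in ops if op[0] in pure_opnames})
    common = {op for op, n in counts.items() if n >= 2}
    pred.extend(sorted(common))
    for i, ops in enumerate(succs):
        succs[i] = [op for op in ops if op not in common]
    return pred, succs

pred = [("memref.alloc", ())]
succs = [[("memref.cast", ("%alloc",))],
         [("memref.cast", ("%alloc",))]]
hoist_common_pure_ops(pred, succs, {"memref.cast"})
# The single hoisted cast now dominates both of its former positions,
# and an op present on only one edge is left untouched.
```

Unlike the unconditional hoisting above, this never makes an op execute on a path where it did not execute before, which addresses the LICM-style concern raised earlier in the thread.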

I would think CSE should be the transformation that does that; it would be interesting to see why it does not.


I have studied the CSE code.

  • The implementation of CSE uses llvm::ScopedHashTable.
    If the CFG is compared to a tree, a leaf node can access the SSA values on the path from the root, but it should not be able to access the SSA values of other leaf nodes.
    This is because CSE performs a pre-order traversal, creating a new llvm::ScopedHashTableScope when moving from one CFG edge to another.
    At a leaf node, it searches for an op identical to one already seen on the path from the root to that leaf, and replaces it with the existing op. :thinking:

  • The reason for doing this can be seen from the IR below:
    %cast cannot be seen from %cast1’s block, and
    %cast1 cannot be seen from %cast’s block.

func.func @hoist_cast_pos_alloc(%arg: i1) -> (memref<?xf32>) {
  %alloc = memref.alloc() : memref<10xf32>
  cf.cond_br %arg, ^bb1, ^bb2
^bb1:
  %cast = memref.cast %alloc : memref<10xf32> to memref<?xf32>
  return %cast : memref<?xf32>
^bb2:
  %cast1 = memref.cast %alloc : memref<10xf32> to memref<?xf32>
  return %cast1 : memref<?xf32> 
}
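To see the scoping effect concretely, here is a toy Python model of that traversal (it mimics the idea of llvm::ScopedHashTable, not the real data structure): each sibling block pushes and pops its own scope, so an op recorded in one sibling is invisible in the other:

```python
# Toy scoped-map CSE, mirroring the scoping idea in llvm::ScopedHashTable.
class ScopedMap:
    def __init__(self):
        self.scopes = [{}]
    def push(self): self.scopes.append({})
    def pop(self): self.scopes.pop()
    def lookup(self, key):
        for scope in reversed(self.scopes):
            if key in scope:
                return scope[key]
        return None
    def insert(self, key, value):
        self.scopes[-1][key] = value

def cse_block(ops, table):
    """Replace an op with an earlier identical one if it is visible."""
    out = []
    for res, opname, operands in ops:
        key = (opname, tuple(operands))
        existing = table.lookup(key)
        if existing is not None:
            out.append((res, "replaced_by", [existing]))
        else:
            table.insert(key, res)
            out.append((res, opname, operands))
    return out

table = ScopedMap()
entry = cse_block([("%alloc", "memref.alloc", [])], table)

# bb1 and bb2 are siblings: each gets its own scope.
table.push()
bb1 = cse_block([("%cast", "memref.cast", ["%alloc"])], table)
table.pop()          # %cast's entry is gone before bb2 is visited
table.push()
bb2 = cse_block([("%cast1", "memref.cast", ["%alloc"])], table)
table.pop()
# bb2's cast is NOT replaced: %cast was only visible inside bb1's scope.

# By contrast, an op recorded in a dominating scope is found:
table.push()
again = cse_block([("%alloc2", "memref.alloc", [])], table)
table.pop()
# %alloc2 is replaced by %alloc, which lives in the outer (entry) scope.
```

This is exactly why the RFC example survives CSE: the duplicate casts sit in sibling scopes, not on a common dominating path.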

I would like to hear more of your thoughts. :smiling_face_with_three_hearts:

Does LLVM CSE have the same limitation? Do you think it is intrinsic to CSE, or just something to improve in the MLIR CSE implementation?


I will do some research on it, which may take some time (as I do not have experience with LLVM IR). But it’s quite interesting; this question is turning from a small one into an interesting one. :thinking:

I have run the experiment. As you can see, LLVM’s behavior is consistent with MLIR’s. :thinking:

define i32 @test_cse(i1 %cond, i32 %arg0, i32 %arg1) {
entry:
  br i1 %cond, label %if.true, label %if.false

if.true:                                          ; preds = %entry
  %a = add i32 %arg0, %arg1
  %b = mul i32 %a, %arg0
  ret i32 %b

if.false:                                         ; preds = %entry
  %c = add i32 %arg0, %arg1
  %d = mul i32 %c, %arg0
  ret i32 %d
}

; opt -S test.ll -passes=early-cse
define i32 @test_cse(i1 %cond, i32 %arg0, i32 %arg1) {
entry:
  br i1 %cond, label %if.true, label %if.false

if.true:                                          ; preds = %entry
  %a = add i32 %arg0, %arg1
  %b = mul i32 %a, %arg0
  ret i32 %b

if.false:                                         ; preds = %entry
  %c = add i32 %arg0, %arg1
  %d = mul i32 %c, %arg0
  ret i32 %d
}

If we rewrite the input like this:

define i32 @test_cse(i1 %cond, i32 %arg0, i32 %arg1) {
entry:
  %a = add i32 %arg0, %arg1
  %b = mul i32 %a, %arg0
  br i1 %cond, label %if.true, label %if.false
if.true:                                          ; preds = %entry
  ret i32 %b

if.false:                                         ; preds = %entry
  %c = add i32 %arg0, %arg1
  %d = mul i32 %c, %arg0
  ret i32 %d
}

; opt -S test.ll -passes=early-cse
define i32 @test_cse(i1 %cond, i32 %arg0, i32 %arg1) {
entry:
  %a = add i32 %arg0, %arg1
  %b = mul i32 %a, %arg0
  br i1 %cond, label %if.true, label %if.false

if.true:                                          ; preds = %entry
  ret i32 %b

if.false:                                         ; preds = %entry
  ret i32 %b
}