Speculative LICM?

Hi, it looks like the current MLIR LICM pass doesn’t support speculatively hoisting an operation out of a loop. For example, given a program like the one below, hoisting the pointer dereference out of the loop does not seem possible currently.

int f(int *ptr, int n) {
  int s = 0;
  for (int i = 0; i < n; ++i) {
    s += *ptr;
  }
  return s;
}

My gut feeling is that we don’t do loop rotation as a canonical pass, so there is no guarantee that the loop body executes at least once by the time LICM runs. Am I understanding it correctly?

Also, it seems that LICM is quite strict about not moving memory ops out of the loop, i.e., isMemoryEffectFree is required, whereas LLVM is more permissive. Is it also related to speculation?

Thanks,
Hongtao

My gut feeling is that we don’t do loop rotation as a canonical pass, so there is no guarantee that the loop body executes at least once by the time LICM runs. Am I understanding it correctly?

This depends on the specific loop operation, but it is true for something like scf.for and for the “before” region of an scf.while.
In some scenarios you could deduce from the bounds of an scf.for whether it is executed at least once, in which case speculatability would not be required to hoist an operation, only that it has no side effects.
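As a sketch of that deduction (plain C with illustrative names, not MLIR API): for a normalized scf.for-style loop with known operands, the body executes at least once exactly when the trip count ceildiv(ub - lb, step) is positive.

```c
/* Hypothetical helper mirroring what LICM could deduce for a loop
 * with constant bounds: trip count of `for (i = lb; i < ub; i += step)`.
 * Assumes a positive step, as scf.for does. */
long trip_count(long lb, long ub, long step) {
    if (step <= 0 || ub <= lb)
        return 0;                       /* loop never runs */
    return (ub - lb + step - 1) / step; /* ceildiv for positive step */
}

/* The condition under which hoisting needs no speculatability. */
int runs_at_least_once(long lb, long ub, long step) {
    return trip_count(lb, ub, step) > 0;
}
```

When the bounds are SSA values rather than constants, this check cannot be folded away and becomes the runtime guard discussed later in the thread.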

Also, it seems that LICM is quite strict about not moving memory ops out of the loop, i.e., isMemoryEffectFree is required, whereas LLVM is more permissive.

Correct. Moving side-effecting operations out of loops is sadly non-trivial, as it requires analysis to ensure that the loop does not contain other side-effecting operations that affect the operation to be hoisted. LLVM can analyze this with MemorySSA, but that in turn requires more infrastructure, such as alias analysis and more.

Is it also related to speculation?

Not quite. Speculatability is mostly an indicator of whether an operation has wholly defined behaviour, which is somewhat orthogonal to side effects, although in practice the two often appear together. Both need to be accounted for when determining whether code motion is legal. See the original discussion that led to the introduction of speculatability here: [RFC] Mark tensor.dim and memref.dim as side effecting
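For intuition, a classic illustration (my own, not from the RFC) of an operation that is memory-effect free yet not speculatable is integer division: it touches no memory, but executing it with a zero divisor is undefined behaviour, so it may only be hoisted if the loop body is known to run.

```c
/* `x / d` is loop-invariant and has no memory effects, yet it is not
 * speculatable: hoisting it above the trip-count check would execute
 * the division even for a caller passing n == 0 together with d == 0,
 * introducing UB the original program never had. Function name is
 * illustrative. */
int sum_quot(int x, int d, int n) {
    int s = 0;
    for (int i = 0; i < n; ++i)
        s += x / d;   /* only executed, and only safe, when the loop runs */
    return s;
}
```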


Thanks for your response!

Regarding the memory side-effect analysis, is there any plan to bring it up in the near future? MemorySSA might be a bit fancy at the moment. A much simpler analysis, e.g., detecting that the loop contains only loads and no stores, should work well for us as a start.

That’s not quite generic enough to assume. For example:

  • n = 0: the loop doesn’t run and the pointer is never accessed. Hoisting makes it accessed at least once.
  • ptr points to a memory-mapped register or a similar side-effecting read. Reading it zero, one, or more times may have different interpretations in hardware.

Unless you have an annotation that the read through ptr is side-effect free, you cannot assume that. And unless n is known at compile time to be non-zero (and non-negative, because it is an int), you cannot assume the loop will run at all.
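Concretely, for the C example at the top of the thread, a legal speculative hoist has to re-introduce exactly that condition as a guard. A hand-written sketch of the transformed function:

```c
/* Guarded hoist of `s += *ptr`: ptr is still dereferenced zero times
 * when the loop would not run (n <= 0), so a null or otherwise
 * invalid ptr paired with n == 0 stays well-defined, as in the
 * original loop. */
int f_hoisted(int *ptr, int n) {
    int s = 0;
    if (n > 0) {            /* trip-count guard */
        int v = *ptr;       /* hoisted loop-invariant load */
        for (int i = 0; i < n; ++i)
            s += v;
    }
    return s;
}
```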

I believe we’re saying the same thing, but the way you said it sounds (to me) like just the loop in the example could be interpreted as side-effect free. Maybe I got it wrong.

It sounds like the term side effect might be a bit too general here. Is there a way to separate the side effect in the following case from other, general memory accesses?

  • ptr points to a memory-mapped register or a similar side-effecting read. Reading it zero, one, or more times may have different interpretations in hardware.

Yes, this is precisely the scenario I was describing.

Volatile reads, like atomic reads, must not be described as having only a read side effect, and would therefore be excluded from the LICM logic I described.
See e.g. llvm-project/mlir/lib/Dialect/LLVMIR/IR/LLVMDialect.cpp at 079746d2c0804ddf616766eb525270d9c57ab542 · llvm/llvm-project · GitHub

Sounds very doable to me, though note that you’d have to check not just for the absence of stores but for the absence of “any kind of write effect OR unknown side effects”.


Side effect is the precise term here, because any type of side effect would stop LICM; you’re just interested in some effects and not others.

It’s not just volatile. Register-mapped (or I/O-mapped) reads can cause attached hardware to take action. For example, a read from a co-processor can flush the value and push another value into the register, while a write to the same location may make the co-processor send the data to other devices.

Luckily, such wild cases are rare nowadays outside of deeply embedded targets, so they can perhaps be ignored for most MLIR use cases. Still, we should not assume total safety, just “accept” partial safety and be clear about it.

@htyu, this may be completely irrelevant to your case, and the LLVM dialect may not even have semantics for that, so take what I said with a grain of salt. @zero9178’s answer is spot-on and should help you directly.


Sounds good. I’ll give it more thought and come back with what we can contribute.


I have some changes that I will be cleaning up and submitting for review soon, which address the ideas discussed here. When an operation is side-effect free, it will be hoisted as usual. However, if it has only read side effects, it will be hoisted behind a trip-count guard. This guard wraps the for loop, similar to WrapInZeroTripCheck.cpp. I hope the following example is helpful.

Before

func.func @test_always_speculatable_op_with_read_side_effect_success(%lb: index, %ub: index, %step: index) -> i32 {
  %cst_0 = arith.constant 0 : i32
  %cst_42 = arith.constant dense<42> : tensor<64xi32>
  %ind_42 = arith.constant 42 : index
  %sum_result = scf.for %i = %lb to %ub step %step iter_args(%acc = %cst_0) -> i32 {
    %always = "test.always_speculatable_op"() : () -> i32
    %always_read = "test.always_speculatable_op_with_memread"(%cst_42, %ind_42) : (tensor<64xi32>, index) -> i32
    %i_cast = arith.index_cast %i: index to i32
    %i_sum = arith.addi %acc, %i_cast : i32
    %test_sum = arith.addi %i_sum, %always_read : i32
    scf.yield %test_sum : i32
  }
  return %sum_result : i32
}

After

func.func @test_always_speculatable_op_with_read_side_effect_success(%arg0: index, %arg1: index, %arg2: index) -> i32 {
  %c0_i32 = arith.constant 0 : i32
  %cst = arith.constant dense<42> : tensor<64xi32>
  %c42 = arith.constant 42 : index
  %0 = arith.subi %arg1, %arg0 : index
  %1 = arith.ceildivsi %0, %arg2 : index
  %c0 = arith.constant 0 : index
  %2 = arith.cmpi sgt, %1, %c0 : index
  %3 = "test.always_speculatable_op"() : () -> i32
  %4 = scf.if %2 -> (i32) {
    %5 = "test.always_speculatable_op_with_memread"(%cst, %c42) : (tensor<64xi32>, index) -> i32
    %6 = scf.for %arg3 = %arg0 to %arg1 step %arg2 iter_args(%arg4 = %c0_i32) -> (i32) {
      %7 = arith.index_cast %arg3 : index to i32
      %8 = arith.addi %arg4, %7 : i32
      %9 = arith.addi %8, %5 : i32
      scf.yield %9 : i32
    }
    scf.yield %6 : i32
  } else {
    scf.yield %c0_i32 : i32
  }
  return %4 : i32
}

where test.always_speculatable_op_with_memread is defined as:

def AlwaysSpeculatableOpWithReadSideEffect : TEST_Op<"always_speculatable_op_with_memread",
    [ConditionallySpeculatable, MemoryEffects<[MemRead]>]> {
  let description = [{
    Op used to test LICM conditional speculation. This op can always be
    speculatively executed and has only a memory read effect.
  }];

  let arguments = (ins Variadic<AnyType>:$inputs);
  let results = (outs Variadic<AnyType>:$outputs);

  let extraClassDeclaration = [{
    ::mlir::Speculation::Speculatability getSpeculatability();
  }];

  let extraClassDefinition = [{
    ::mlir::Speculation::Speculatability
    AlwaysSpeculatableOpWithReadSideEffect::getSpeculatability() {
      return ::mlir::Speculation::Speculatable;
    }
  }];
}

Please share your feedback and ideas both on here and on the code review.


cc: @qed

PR is here


Sorry for the late response. I think one of the downsides of adding extra control flow around the loop is that it is likely to block other optimizations, as the form of the loop is no longer canonical. I wonder if this is a good fit for the existing moveLoopInvariantCode function.

Maybe others have thoughts about that.

An alternative would be to generate something like:

  %cmp = arith.cmpi slt, %lb, %ub : index
  %pre = scf.if %cmp -> (i32) {
    %5 = "test.always_speculatable_op_with_memread"(%cst, %c42) : (tensor<64xi32>, index) -> i32
    scf.yield %5 : i32
  } else {
    %p = ub.poison : i32
    scf.yield %p : i32
  }
  scf.for %i = %lb to %ub step %step
...

The decision of how to hoist becomes less obvious though.

Thanks for the response. We thought about that option too, i.e., instead of placing the loop under the speculative if check, only the hoisted code is placed there. I’m not sure how the optimizations would perform differently, though. If we are looking at optimizing just the loop, perhaps an unconditional out-of-loop value would make it easier? But for optimizations tracking the loop result, wrapping the loop in an if makes the result conditional.

Another thought is how the if could be removed if the hoisted load can be executed in a predicated way (not supported in MLIR). Would it make sense to clean up the if in a downstream pass (e.g., a Triton pass)?


The parallel if-and-loop structure Thomas outlined does look more standard and closer to a canonical loop optimizer, where loop unswitching is usually done first.

cc: @cxy

Thanks for the context. PR comment updated.


@cxy Thanks for the comments on the PR. Do you have thoughts about Speculative LICM? - #12 by ThomasRaoux and Speculative LICM? - #14 by htyu? Once we have consensus on which route to take, I can start implementing and/or addressing the comments on the PR.

This implementation is more reasonable, and its impact on optimizations outside of LICM can be minimized. In our downstream LLVM, we have also implemented LICM for operations with read/write effects, and this is exactly how we do it.

But it’s still worth mentioning that, in some cases, it can affect the accuracy of analyses. Consider performing reaching-definitions analysis on the following program:

    A = alloc
    def A            // op0
    for {
      def A.subview0 // op1
      def A.subview1 // op2
    }
    use A

We can infer that the definition of A at the use originates either from op0 alone, or from the co-definition of op0/1/2.

If LICM is performed on op1, we get this result:

    A = alloc
    def A            // op0
    if xx {
      def A.subview0 // op1
    }
    for {
      def A.subview1 // op2
    }
    use A

Running reaching-definitions analysis again on this program, the information that op1/2 must appear together is lost. The analysis result now includes the possibility of A being defined only by op0/1, which does not exist in the original program.

Of course, this can be addressed by making the dataflow analysis more precise.
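To make the precision loss concrete, here is a toy C model (names and bit encoding are mine) of the def-sets reaching `use A`: before LICM the only possibilities are {op0} and {op0, op1, op2}, while a path-insensitive view of the post-LICM program also admits the spurious {op0, op1}.

```c
/* Each definition is one bit in a def-set. */
enum { OP0 = 1 << 0, OP1 = 1 << 1, OP2 = 1 << 2 };

/* Before LICM: op1 and op2 sit in the same loop body, so every path
 * either skips both (zero trips) or executes both. */
int before_licm(int loop_runs) {
    return loop_runs ? (OP0 | OP1 | OP2) : OP0;
}

/* After LICM: op1 is guarded by its own `if`, op2 by the loop. A
 * path-insensitive analysis can no longer see that the if-condition
 * and the trip count are the same predicate, so it treats them as
 * independent, admitting if_taken && !loop_runs. */
int after_licm(int if_taken, int loop_runs) {
    int defs = OP0;
    if (if_taken)
        defs |= OP1;
    if (loop_runs)
        defs |= OP2;
    return defs;
}
```

The spurious fact is exactly `after_licm(1, 0)`, i.e., {op0, op1}: a combination `before_licm` can never produce.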
