Should `scf.for` have `AutomaticAllocationScope` trait?

Background:
We recently found a stack overflow issues in MLIR Sparsifier due to the use of memref.alloca inside scf.for ([mlir][sparse] fix stack overflow due to alloca instruction inside loops by PeimingLiu · Pull Request #69765 · llvm/llvm-project · GitHub).

Our first attempt to fix the issue is using memref.alloca_scope. However, it does NOT solve the issue as scf.for has the AutomaticAllocationScope trait. As the result, AllocaScopeHoister hoists memref::alloca from the alloca_scope into scf.for, leading to incorrect code.

E.g.,

scf.for {
  memref.alloca_scope {
     memref.alloca
     xxx
  }
}

is canonicalized into

scf.for {
  memref.alloca
  memref.alloca_scope {
    xxx
  }
}

Note that this is a correct transformation provided that scf.for indeed adjusts stack pointer between iterations (though it doesn’t).

Question: Should we remove the AutomaticAllocationScope trait from scf.for or insert stack.save/stack.restore when lowering the operation?

1 Like

Seems to me that the correct canonicalization for

scf.for {
  memref.alloca_scope {
     memref.alloca
     xxx
  }
}

(or any op like scf.for having AutomaticAllocationScope)
should be:

scf.for {
  memref.alloca
  xxx
}

That is the semantics of memref.alloca_scope should already be implied here.
Of course there is a lowering problem, where every op that has the AutomaticAllocationScope should lower the exact same way as memref.alloca_scope

You are actually right! There seems to be another AllocaScopeInliner (MLIR: AllocaScopeInliner Struct Reference). Though, it seems that the Inliner is not applied after Hoister in a single canonicalization pass when I tested it.

I do think that the loop should have the trait, it was added exactly to prevent the kind of situation you observed. The lowering is likely incorrect in this case. I vaguely remember a discussion about llvm.intr.stacksave/restore being unsupported for some targets, but we can always try and see what happens.

Alternatively, we could also just lower to LLVM by placing the alloca in the function entry block, and maybe use the lifetime intrinsics instead?

See use of alloca and lifetime intrinsics in the LLVM
“Performance Tips for Frontend Authors” guide.