Memref-to-llvm: llvm.alloca operations emitted for memref.copy cause segfault if embedded in loop

Hi,

the MemRef to LLVM conversion pass emits llvm.alloca operations when
lowering memref.copy operations. The original stack position is
never restored after the allocations, which creates an issue when the operation
is embedded into a loop with a high trip count, ultimately resulting
in a segmentation fault due to the stack growing too large.

The problem is exacerbated when the copy is performed on a memref with
a mapping resulting in non-contiguous memory, since the associated
lowering path involving the invocation of memrefCopy emits even more
llvm.alloca operations.

Below is a minimal example illustrating the issue:

#map = affine_map<(d0, d1) -> (d0 * 64 + d1 + 1056)>

module {
  func.func @main() {
    %arg0 = memref.alloc() : memref<32x64xi64>
    %arg1 = memref.alloc() : memref<16x32xi64>
    %lb = arith.constant 0 : index
    %ub = arith.constant 100000 : index
    %step = arith.constant 1 : index
    %slice = memref.subview %arg0[16,32][16,32][1,1] : memref<32x64xi64> to memref<16x32xi64, #map>

    scf.for %i = %lb to %ub step %step {
       memref.copy %slice, %arg1 : memref<16x32xi64, #map> to memref<16x32xi64>
    }

    return
  }
}

When running the code above, e.g., with mlir-cpu-runner, the
execution crashes with a segmentation fault:

$ mlir-opt --convert-memref-to-llvm --convert-scf-to-cf --convert-func-to-llvm --convert-cf-to-llvm -reconcile-unrealized-casts <file> | mlir-cpu-runner -e main -entry-point-result=void \
--shared-libs=$PWD/build/lib/libmlir_c_runner_utils.so
PLEASE submit a bug report to https://github.com/llvm/llvm-project/issues/ and include the crash backtrace.
Stack dump:
0.      Program arguments: ./build-Release/bin/mlir-cpu-runner -e main -entry-point-result=void --shared-libs=/usr/src/homomorphizer-master/compiler/build-Release/lib/libmlir_c_runner_utils.so
 #0 0x0000558358e962cf PrintStackTraceSignalHandler(void*) (./build-Release/bin/mlir-cpu-runner+0x29c2cf)
 #1 0x0000558358e93cec SignalHandler(int) (./build-Release/bin/mlir-cpu-runner+0x299cec)
 #2 0x00007f072f269730 __restore_rt (/lib/x86_64-linux-gnu/libpthread.so.0+0x12730)
 #3 0x00007f072d40fa69 memrefCopy (/usr/src/homomorphizer-master/compiler/build-Release/lib/libmlir_c_runner_utils.so+0x18a69)
 #4 0x00007f072f2930ec
 #5 0x00007f072f29311d
 #6 0x000055835930185f compileAndExecute((anonymous namespace)::Options&, mlir::ModuleOp, llvm::StringRef, (anonymous namespace)::CompileAndExecuteConfig, void**) (./build-Release/bin/mlir-cpu-runner+0x70785f)
 #7 0x0000558359301c9c compileAndExecuteVoidFunction((anonymous namespace)::Options&, mlir::ModuleOp, llvm::StringRef, (anonymous namespace)::CompileAndExecuteConfig) (./build-Release/bin/mlir-cpu-runner+0x707c9c)
 #8 0x0000558359305902 mlir::JitRunnerMain(int, char**, mlir::DialectRegistry const&, mlir::JitRunnerConfig) (./build-Release/bin/mlir-cpu-runner+0x70b902)
 #9 0x0000558358e19a46 main (./build-Release/bin/mlir-cpu-runner+0x21fa46)
#10 0x00007f072ed3d09b __libc_start_main (/lib/x86_64-linux-gnu/libc.so.6+0x2409b)
#11 0x0000558358e7dbca _start (./build-Release/bin/mlir-cpu-runner+0x283bca)
Segmentation fault

The execution only succeeds if the trip count of the scf.for loop is
sufficiently low, e.g., by setting %ub = arith.constant 100 : index.

Is the allocation issue supposed to be fixed by applying a subsequent
optimization pass? If yes, what pass should be run?

I played around a bit with llvm.lifetime.start and
llvm.lifetime.end annotations for the stack allocations, but
couldn’t find a pass exploiting this information and optimizing the
allocations.

Thanks,
Andi

Try wrapping the loop body in an additional memref.alloca_scope operation. It lowers to a pair of llvm.stacksave/llvm.stackrestore intrinsics and should mitigate the problem.
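
Applied to the loop from the example above, the suggestion could look roughly like the sketch below. It reuses the names from the original example (%slice, %arg1, %lb, %ub, %step and #map) and assumes that the stack allocations emitted for memref.copy end up inside the scope region, so that the stack is restored at the end of every iteration:

```mlir
scf.for %i = %lb to %ub step %step {
  // Sketch only: the scope delimits the lifetime of any stack
  // allocations created while lowering the operations inside it.
  memref.alloca_scope {
    memref.copy %slice, %arg1 : memref<16x32xi64, #map> to memref<16x32xi64>
  }
}
```

Since the scope returns no values here, no explicit terminator is written in the region.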

Thanks @ftynse, that worked! Is there anything that speaks against inserting that op (or rather the intrinsics) automatically when lowering memref.copy?

If not, I’d be happy to submit a patch.

I don’t remember why the lowering allocates, so make sure the allocation isn’t necessary after the op.

The patch is here: D135756.