Why can't LLVM emit a memcpy when loading and storing integer arrays and simple structs?

I am struggling to understand why LLVM is not allowed to optimize either of these two functions into a memcpy:

target datalayout = "e-S128-i32:32-i8:8-i16:16-p271:32:32:32:32-p272:64:64:64:64-f16:16-f128:128-p270:32:32:32:32-f64:64-i128:128-i64:64-p0:64:64:64:64-i1:8-f80:128"
target triple = "x86_64-unknown-linux-gnu"

%struct.asd = type { [28 x [29 x i32]] }

; Function Attrs: noalias
define void @example(ptr noalias nocapture %0, ptr noalias nocapture %1) {
  %3 = load [29 x [28 x i64]], ptr %1, align 8
  store [29 x [28 x i64]] %3, ptr %0, align 8
  ret void
}

; Function Attrs: noalias
define void @example2(ptr noalias nocapture %0, ptr noalias nocapture %1) {
  %3 = load %struct.asd , ptr %1, align 8
  store %struct.asd %3, ptr %0, align 8
  ret void
}

Running opt -O2 on them results in the loads and stores being fully unrolled. I understand that this somehow relates to the assumption that the frontend will handle these situations, but I don’t get why marking the arguments noalias is not enough to ensure the load/store pairs can be replaced.

I don’t think it’s true to say that it isn’t allowed to, rather that it doesn’t. Using aggregate types in SSA registers has been discouraged for at least ten years and so I doubt any of the front ends that the optimisers are tested with will generate IR like this. Clang, for example, would lower this kind of construct to a memcpy intrinsic in the front end.

Is there a reason that your front end cannot generate a memcpy? Generally, the only reason that you should emit a load is if you want to compute on the value, and there are no instructions in LLVM IR that compute over arrays.
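For illustration, here is a minimal sketch of what the frontend could emit instead of the aggregate load/store pairs. The function names @example_memcpy and @example2_memcpy are made up for this sketch, and the byte counts are my own arithmetic from the types in the question (29 * 28 * 8 = 6496 bytes for the i64 array, 28 * 29 * 4 = 3248 bytes for %struct.asd), so treat them as assumptions to check against your own types.

```llvm
; Sketch only: the copies from the question expressed via the memcpy intrinsic.
; The *_memcpy function names are hypothetical.
declare void @llvm.memcpy.p0.p0.i64(ptr, ptr, i64, i1)

; Copies a [29 x [28 x i64]] value: 29 * 28 * 8 = 6496 bytes.
define void @example_memcpy(ptr noalias nocapture %dst, ptr noalias nocapture %src) {
  call void @llvm.memcpy.p0.p0.i64(ptr align 8 %dst, ptr align 8 %src, i64 6496, i1 false)
  ret void
}

; Copies a %struct.asd ({ [28 x [29 x i32]] }): 28 * 29 * 4 = 3248 bytes.
define void @example2_memcpy(ptr noalias nocapture %dst, ptr noalias nocapture %src) {
  call void @llvm.memcpy.p0.p0.i64(ptr align 8 %dst, ptr align 8 %src, i64 3248, i1 false)
  ret void
}
```

The last argument is the volatile flag, which is false here because the original loads and stores are not volatile.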

Is there a reason that your front end cannot generate a memcpy?

Not really. Indeed, the fact that emitting a memcpy in this situation is so easy is why I assumed LLVM would do it for me; I just found it really surprising that it would not.

Using aggregate types in SSA registers has been discouraged for at least ten years and so I doubt any of the front ends that the optimisers are tested with will generate IR like this.

I see, I will follow that rule then. Thank you.

As a general resource, you may find the "Performance Tips for Frontend Authors" page of the LLVM documentation (18.0.0git) helpful. I wrote this a few years ago when working on a new LLVM frontend with the goal of summarizing good practice. It’s a little bit out of date by now, but still mostly accurate.

You can get LLVM to convert this to memcpy by running the memcpyopt pass.

The reason this doesn’t work as expected when running the O2 pipeline is that InstCombine converts the load and store to a rather large (3k+ instructions) sequence of smaller loads/stores (Compiler Explorer).

In cases where unpacking/packing the load/stores results in a huge amount of code and a memcpy could be used, it would probably make sense to prefer that.
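The memcpyopt suggestion above can be tried roughly as follows. This is a sketch, assuming a recent opt that uses the new pass manager syntax (-passes=...). The RUN line follows the lit test convention, where %s stands for the path of this file; when invoking opt by hand, substitute the file name yourself.

```llvm
; RUN: opt -passes=memcpyopt -S %s
;
; Running only memcpyopt (rather than the full O2 pipeline) means InstCombine
; never gets the chance to split the aggregate load/store into scalar accesses.
define void @example(ptr noalias nocapture %0, ptr noalias nocapture %1) {
  %3 = load [29 x [28 x i64]], ptr %1, align 8
  store [29 x [28 x i64]] %3, ptr %0, align 8
  ret void
}
```

Per the post above, the load/store pair should come out as a call to the llvm.memcpy intrinsic; I have not reproduced the exact output here.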