Why can't LLVM emit a memcpy when loading and storing integer arrays and simple structs?

blallo · January 16, 2024, 11:38pm

I am struggling to understand why LLVM is not allowed to optimize either of these two functions into a memcpy:

target datalayout = "e-S128-i32:32-i8:8-i16:16-p271:32:32:32:32-p272:64:64:64:64-f16:16-f128:128-p270:32:32:32:32-f64:64-i128:128-i64:64-p0:64:64:64:64-i1:8-f80:128"
target triple = "x86_64-unknown-linux-gnu"

%struct.asd = type { [28 x [29 x i32]] }

; Function Attrs: noalias
define void @example(ptr noalias nocapture %0, ptr noalias nocapture %1) {
  %3 = load [29 x [28 x i64]], ptr %1, align 8
  store [29 x [28 x i64]] %3, ptr %0, align 8
  ret void
}

; Function Attrs: noalias
define void @example2(ptr noalias nocapture %0, ptr noalias nocapture %1) {
  %3 = load %struct.asd , ptr %1, align 8
  store %struct.asd %3, ptr %0, align 8
  ret void
}

Running opt -O2 on them results in a full unroll of the loads and stores. I understand that this somehow relates to the assumption that the frontend will handle these situations, but I don’t get why marking the pointers noalias is not enough to ensure the load/store pairs can be replaced.

davidchisnall · January 17, 2024, 9:58am

I don’t think it’s true to say that it isn’t allowed to, rather that it doesn’t. Using aggregate types in SSA registers has been discouraged for at least ten years and so I doubt any of the front ends that the optimisers are tested with will generate IR like this. Clang, for example, would lower this kind of construct to a memcpy intrinsic in the front end.

Is there a reason that your front end cannot generate a memcpy? Generally, the only reason that you should emit a load is if you want to compute on the value, and there are no instructions in LLVM IR that compute over arrays.
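As a rough sketch of what emitting the copy directly could look like for the %struct.asd case from the first post: the function name @example2_memcpy and the value names %dst and %src are made up for illustration, and the byte count assumes the struct has no padding, so its size is 28 * 29 * 4 = 3248 bytes.

```llvm
%struct.asd = type { [28 x [29 x i32]] }

; The memcpy intrinsic, overloaded on address space 0 pointers and an i64 length.
declare void @llvm.memcpy.p0.p0.i64(ptr, ptr, i64, i1)

; Hypothetical front-end output: copy the whole struct without ever
; materializing it as an SSA value.
define void @example2_memcpy(ptr noalias nocapture %dst, ptr noalias nocapture %src) {
  ; 28 * 29 * 4 = 3248 bytes; the final i1 false marks the copy as non-volatile.
  call void @llvm.memcpy.p0.p0.i64(ptr align 8 %dst, ptr align 8 %src, i64 3248, i1 false)
  ret void
}
```

A front end would normally take the byte count from the target data layout rather than hard-coding it as done here.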

blallo · January 17, 2024, 11:24am

Is there a reason that your front end cannot generate a memcpy?

Not really. Indeed, the fact that emitting a memcpy in this situation is so easy is why I assumed that LLVM would do it for me; I just found it really surprising that it would not.

Using aggregate types in SSA registers has been discouraged for at least ten years and so I doubt any of the front ends that the optimisers are tested with will generate IR like this.

I see, I will follow that rule then. Thank you.

preames · January 17, 2024, 3:23pm

As a general resource, you may find "Performance Tips for Frontend Authors" (LLVM 18.0.0git documentation) helpful. I wrote this a few years ago when working on a new LLVM frontend with the goal of summarizing good practice. It’s a little bit out of date by now, but still mostly accurate.

fhahn · January 17, 2024, 4:10pm

You can get LLVM to convert this to memcpy by running the memcpyopt pass.
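A sketch of how this might be tried, assuming the IR from the first post is saved as example.ll (a hypothetical filename) and that opt accepts the new pass manager syntax. The function body below is the approximate shape one would expect for @example after the pass, not captured output; the length 6496 is 29 * 28 * 8, the size in bytes of [29 x [28 x i64]].

```llvm
; Assumed invocation (example.ll is a hypothetical filename):
;   opt -passes=memcpyopt -S example.ll
;
; Expected shape of the result, written by hand: the aggregate load/store
; pair collapses into a single intrinsic call. Because both arguments are
; noalias, a memcpy rather than a memmove should be possible.
declare void @llvm.memcpy.p0.p0.i64(ptr, ptr, i64, i1)

define void @example(ptr noalias nocapture %0, ptr noalias nocapture %1) {
  ; 29 * 28 * 8 = 6496 bytes, copied from %1 to %0.
  call void @llvm.memcpy.p0.p0.i64(ptr align 8 %0, ptr align 8 %1, i64 6496, i1 false)
  ret void
}
```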

The reason this doesn’t work as expected when running the O2 pipeline is that InstCombine converts the load and store to a rather large (3k+ instructions) sequence of smaller loads/stores: Compiler Explorer
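To give a feel for what that unpacking looks like, here is a hand-written sketch for a much smaller [2 x i32] copy. It is not captured InstCombine output, and the function and value names are made up; it only illustrates the per-element shape.

```llvm
; Illustrative only: a [2 x i32] copy expressed as scalar loads and stores.
define void @copy_pair_scalarized(ptr noalias nocapture %dst, ptr noalias nocapture %src) {
  ; Addresses of element 1 in the source and destination arrays.
  %src.elt1 = getelementptr inbounds [2 x i32], ptr %src, i64 0, i64 1
  %dst.elt1 = getelementptr inbounds [2 x i32], ptr %dst, i64 0, i64 1
  ; One scalar load per element (element 0 lives at the base pointer)...
  %v0 = load i32, ptr %src, align 4
  %v1 = load i32, ptr %src.elt1, align 4
  ; ...and one scalar store per element.
  store i32 %v0, ptr %dst, align 4
  store i32 %v1, ptr %dst.elt1, align 4
  ret void
}
```

The arrays in the original example have 29 * 28 = 812 scalar elements each, so one load and one store per element plus the address computations plausibly accounts for an instruction count in the thousands.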

In cases where unpacking/packing the loads/stores results in a huge amount of code and a memcpy could be used instead, it would probably make sense to prefer the memcpy.
