[ARM] Should Use Load and Store with Register Offset

Hello LLVM Community (specifically anyone working with ARM Cortex-M),

While compiling the Newlib C library, I found that Clang 10 was generating slightly larger binaries than the libc shipped with the prebuilt gcc-arm-none-eabi toolchain. I looked at a few specific functions (memcpy, strcpy, etc.) and noticed that LLVM tends not to generate load/store instructions with a register offset (i.e. the ldr Rd, [Rn, Rm] form), preferring the immediate-offset form instead.
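
For illustration, the difference between the two addressing forms (a hand-written sketch, not taken from the Godbolt output below):

        ldrb r4, [r1]       @ immediate offset: the base r1 must be advanced separately
        ldrb r4, [r1, r3]   @ register offset: a shared index r3 advances instead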

When copying a contiguous sequence of bytes, this results in additional instructions to modify the base address. https://godbolt.org/z/T1xhae

#include <stddef.h>

void* memcpy_alt1(void* dst, const void* src, size_t len) {
    char* save = (char*)dst;
    for (size_t i = 0; i < len; ++i)
        *((char*)dst + i) = *((const char*)src + i);
    return save;
}

clang --target=armv6m-none-eabi -Os -fomit-frame-pointer

memcpy_alt1:
        push {r4, lr}
        cmp r2, #0
        beq .LBB0_3
        mov r3, r0
.LBB0_2:
        ldrb r4, [r1]
        strb r4, [r3]
        adds r1, r1, #1
        adds r3, r3, #1
        subs r2, r2, #1
        bne .LBB0_2
.LBB0_3:
        pop {r4, pc}

arm-none-eabi-gcc -march=armv6-m -Os

memcpy_alt1:
        movs r3, #0
        push {r4, lr}
.L2:
        cmp r3, r2
        bne .L3
        pop {r4, pc}
.L3:
        ldrb r4, [r1, r3]
        strb r4, [r0, r3]
        adds r3, r3, #1
        b .L2

Because this code appears in a loop that could be copying hundreds of bytes, I want to add an optimization that will prioritize load/store instructions with register offsets when the offset is used multiple times. I have not worked on LLVM before, so I’d like advice about where to start.

  • The generated code is correct, just sub-optimal. Is it appropriate to submit a bug report?

  • Is anyone already tackling this change, or is someone with more experience interested in collaborating?

  • Is this optimization better performed early, during instruction selection, or late, in C++ (e.g. ARMLoadStoreOptimizer.cpp)?

  • What is the potential to cause harm to other parts of the codegen, specifically for other ARM targets? I'm working with Armv6-M, but Armv7-M offers base-register updating in a single instruction, and I don't want to break other useful optimizations.

So far, I am reading through the LLVM documentation to see where a change could be applied. I have also:

  • Compiled with -S -emit-llvm (see Godbolt link)
    There is an identifiable pattern where a getelementptr instruction is followed by a load or store. When multiple getelementptr instructions appear with the same virtual-register offset, perhaps this should match a tLDRr or tSTRr (see the IR sketch after this list).

  • Ran llc with --print-machineinstrs
    It appears that tLDRBi and tSTRBi are selected very early and never replaced by the equivalent t<LDRB|STRB>r instructions.
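
For reference, the IR pattern mentioned above looks roughly like this (a hand-reduced sketch of the loop body using Clang 10-era typed pointers; the value names are illustrative):

    %src.addr = getelementptr inbounds i8, i8* %src, i32 %i
    %byte = load i8, i8* %src.addr, align 1
    %dst.addr = getelementptr inbounds i8, i8* %dst, i32 %i
    store i8 %byte, i8* %dst.addr, align 1

Both getelementptrs share the offset %i, which is exactly what the tLDRr/tSTRr register-offset forms could fold.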

Thank you,
Daniel Way

Hello Daniel,

LLVM and GCC’s optimisation levels are not really equivalent. In Clang, -Os makes a trade-off between performance and code size; in GCC, -Os minimises code size, which is equivalent to Clang's -Oz. I haven't looked into the details yet, but doesn't changing -Os to -Oz in the Godbolt link give the codegen you're looking for?

Cheers,
Sjoerd.

Hello Sjoerd,

Thank you for your response! I was not aware that -Oz is the closer equivalent of GCC's -Os. I tried -Oz when compiling with Clang and confirmed that Clang's generated assembly is equivalent to GCC's for the code snippet I posted above.

clang --target=armv6m-none-eabi -Oz -fomit-frame-pointer

memcpy_alt1:
        push {r4, lr}
        movs r3, #0
.LBB0_1:
        cmp r2, r3
        beq .LBB0_3
        ldrb r4, [r1, r3]
        strb r4, [r0, r3]
        adds r3, r3, #1
        b .LBB0_1
.LBB0_3:
        pop {r4, pc}

Daniel Way

Hi Daniel,

Your observations seem valid to me. Some high-level comments from my side.

As you said, the loops are quite similar. We have also observed that, in general, we generate more code around loops and in the function prologue and epilogue, where data and arguments get moved and reshuffled. While this is very obvious in these micro-benchmarks, it hasn't bothered us enough yet for larger apps, where this is less important (or where other things are more important). The outlier does indeed look to be Clang -Oz for memcpy_alt2; that is perhaps a "code-size bug". As I haven't looked into it, it's too early for me to blame this on just the addressing modes, as there could be several things going on.

Since this is a micro-benchmark, and lowering memcpy is a bit of an art ;-) for which a specialised implementation is probably available, you might want to look at some other code that is important to you too.
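
(A side note on studying such loops: LLVM's loop-idiom recognition can turn a byte-copy loop back into a call to the library memcpy, which matters when building libc itself. Something like the following keeps the loop as written; the file name is hypothetical.)

    # prevent the optimiser from replacing the loop with a memcpy() call
    clang --target=armv6m-none-eabi -Oz -fno-builtin -S memcpy_alt1.c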

Your remarks about execution times might be right too and, as you said, are probably best confirmed with benchmark numbers. In our group, we have not really looked into performance for the Cortex-M0, probably because it's the only v6-M core (although the Cortex-M23 and Armv8-M Baseline are very similar) and code size is more important to us, but there might be something to be gained here.

Cheers,
Sjoerd.

Thank you, Sjoerd.

Your high-level comments are very helpful and much appreciated. I ended up rebuilding the Newlib-nano source with -Oz instead of -Os and found an overall improvement in code size (a sketch of the rebuild is below). The final size is still larger than that of the gcc-arm-none-eabi toolchain. Of course, there are a few caveats to this:

  • Newlib is designed around GCC;
  • I’m not sure I perfectly reproduced the build settings for the pre-built toolchain (macros, etc.);
  • and this comparison considers all libc functions, many of which may not end up in the final image.
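
Roughly, the rebuild looked like the following, a minimal sketch assuming a standard Newlib source tree (the configure options and paths are illustrative, since I could not reproduce the prebuilt toolchain's exact settings):

    ../newlib/configure --target=arm-none-eabi \
        CC_FOR_TARGET=clang \
        CFLAGS_FOR_TARGET="--target=armv6m-none-eabi -Oz"
    make
    arm-none-eabi-size arm-none-eabi/newlib/libc.a   # compare member sizes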

For now, I've submitted bug 46801 for the case where -Oz produces more instructions than -Os. I don't know whether it needs to be a priority, but I thought it should be recorded.

I may try benchmarking the memcpy implementations as well as a few other libc functions, but I haven't done this before; a rough sketch of what I have in mind is below. Of course, I'll share my results if I do end up testing.
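
The idea would be to time each copy with SysTick, which Armv6-M does provide (a minimal sketch; the register addresses are from the Armv6-M architecture manual, and memcpy_alt1 is the function from above):

    #include <stddef.h>
    #include <stdint.h>

    /* SysTick registers (Armv6-M system control space). */
    #define SYST_CSR (*(volatile uint32_t *)0xE000E010)
    #define SYST_RVR (*(volatile uint32_t *)0xE000E014)
    #define SYST_CVR (*(volatile uint32_t *)0xE000E018)

    void *memcpy_alt1(void *dst, const void *src, size_t len);

    /* The counter is 24 bits and counts down, so len must be small
       enough for the copy to finish within one reload period. */
    uint32_t time_copy(void *dst, const void *src, size_t len) {
        SYST_RVR = 0x00FFFFFF;  /* maximum 24-bit reload value */
        SYST_CVR = 0;           /* any write clears the counter */
        SYST_CSR = 0x5;         /* enable, use the processor clock */
        uint32_t start = SYST_CVR;
        memcpy_alt1(dst, src, len);
        uint32_t end = SYST_CVR;
        SYST_CSR = 0;           /* stop counting */
        return (start - end) & 0x00FFFFFF; /* elapsed, modulo 24-bit wrap */
    }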

Thank you for the help.

Hi Daniel,

Thanks for the feedback, that's interesting, and thanks for raising the bug. I will put a first finger on the pulse and see if I can address that issue on the side, but I'm not promising anything. :)

Cheers,
Sjoerd.