[RFC] [ARM] v6m: Suggestions for a slightly different set of default optimizer settings.

Hello to all.

When studying forums and mailing lists it seems to me that llvm usage for very small arm v6m targets is not so common.

Over the last few months, I have spent some time analyzing the performance of LLVM/clang for very small targets. My main objective was to get the best possible performance from portable (non-assembly) crypto numerics on Cortex-M0(+) targets.
As a result (the crypto paper is under review and not yet published), LLVM performed best and outperformed gcc and several versions of armcc by a *very* significant factor.
In this mail I would like to summarize some of my results. Based on my analysis, I am convinced that LLVM can be an excellent solution for small bare-metal targets as well, with only a few changes to the default settings.

Before suggesting new configurations for v6m, I'd like to first suggest a definition of the "typical ARM v6m system" and its optimization priorities, from my perspective.

In my opinion the most important v6m targets will be Cortex-M0 systems. If somebody is using an M0 and not an M3/M4, the system will be either very *low cost* or very *low power*; otherwise Thumb-2 would be available. Low cost and low power mean, first of all, "little RAM", because of silicon area and leakage currents.
I assume that the vast majority of these systems will be flash-based or ROM-based microcontrollers. I have worked on quite a lot of these systems. In all the projects I have been involved in, the main bottleneck was RAM and not program memory.
This is why my first patch to LLVM deals with tail call optimization for Thumb-1 and its possible benefit with respect to stack usage.

After optimizing for RAM usage, code size and speed are, in my opinion, equally important second goals for such small embedded targets. Since speed is often much the same thing as power, I think we should not focus aggressively on code size alone.
In practice, however, my observation with crypto code was that optimizing for size with -Os not only reduced code size but also gave significantly better speed than -O2 or -O3 on the Cortex-M0. I did not figure out exactly why, but I assume that some optimizations meant to improve ILP scheduling at -O3 in fact pessimize code for the Cortex-M0.

In my experience with crypto code and signal processing, the main bottlenecks of v6m are the slow memory interface and the high register pressure caused by having effectively only 8 usable registers. Register spills are thus *very* expensive. Using a frame pointer effectively reduces the register set to 7 registers, requiring even more spills. This is extremely costly for Thumb-1 targets.

So, as a first suggestion, I would like to propose enabling frame pointer elimination by default in clang for v6m as soon as any optimization level is chosen. With trunk clang, frame pointer elimination currently seems to be deactivated, even with -O3 or -Os. Was there a specific reason not to make this the default?

Given the extreme register pressure, my analysis of the crypto code performance showed that instruction scheduling is extremely important. I observed that LLVM did a significantly better job when using the -pre-RA-sched=source option. This was the main factor that allowed me to make LLVM outperform gcc by almost a factor of 2. Unfortunately, this option is not available at the clang interface level, so I had to explicitly generate intermediate bitcode files. I think it would help a lot to expose this feature at the clang level or to make it part of the default optimization settings for -Os, -O2 and -O3.
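The two-stage workflow described above might look roughly like the following. This is only a sketch under stated assumptions: the helper name `two_stage_commands`, the file names, and the target triple and CPU spelling are illustrative and not taken from the mail; only `-pre-RA-sched=source` and `-Os` come from it. The function merely builds the argument lists and does not run any tool.

```python
from pathlib import Path


def two_stage_commands(source, opt_level="-Os"):
    """Build (but do not run) clang and llc command lines for one C file.

    Assumed setup: clang and llc are on PATH and the target is a
    bare-metal Cortex-M0 (triple thumbv6m-none-eabi).
    """
    src = Path(source)
    bitcode = src.with_suffix(".bc")
    obj = src.with_suffix(".o")
    target = "thumbv6m-none-eabi"  # assumed triple, not from the mail

    # Stage 1: let clang optimize and emit LLVM bitcode instead of an object.
    clang_cmd = ["clang", f"--target={target}", "-mcpu=cortex-m0", opt_level,
                 "-emit-llvm", "-c", str(src), "-o", str(bitcode)]

    # Stage 2: run the backend by hand so the scheduler option can be passed.
    llc_cmd = ["llc", f"-mtriple={target}", "-mcpu=cortex-m0",
               "-pre-RA-sched=source", "-filetype=obj",
               str(bitcode), "-o", str(obj)]
    return clang_cmd, llc_cmd
```

Each returned list could be handed to `subprocess.run(cmd, check=True)`; keeping the two stages separate is what makes room for the backend-only option.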

I observed that, with today's head version, -pre-RA-sched appears to be hidden from the end user. At first glance, -misched=ilpmin -enable-misched -misched-regpressure gave similar results. I would like to suggest using them (or a similar configuration) as the default in clang for higher optimization levels, provided of course that these passes can be considered stable.
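One possible way to experiment with these scheduler options from the clang driver is its `-mllvm` switch, which forwards a single option to the LLVM backend. The sketch below only assembles such an argument list; the helper name `forward_to_backend` is hypothetical, and whether a given LLVM version accepts each option is an assumption that would need checking.

```python
# Scheduler options named in the mail.
MISCHED_OPTIONS = ["-enable-misched", "-misched=ilpmin", "-misched-regpressure"]


def forward_to_backend(backend_options):
    """Prefix every backend option with -mllvm so clang passes it through."""
    args = []
    for option in backend_options:
        args += ["-mllvm", option]
    return args


def clang_args(source, opt_level="-Os"):
    """Argument list for a single clang invocation (built, not executed)."""
    return ["clang", opt_level, *forward_to_backend(MISCHED_OPTIONS),
            "-c", source]
```

Target selection flags are left out here to keep the example focused on the forwarding mechanism.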

As a last point, it would be helpful to enable clang to link the code itself. I did not manage to do this at first, so I have used gcc for the final link so far. For the target's include paths, command line switches are available, but I have not yet found out (and have not yet spent much time on) how to get linking to work. Would it perhaps be best to start directly with binutils-gold instead of binutils-ld?

To summarize, I am convinced that, with the above issues resolved, LLVM will be an excellent choice for even the very smallest targets. Thanks to the LLVM community for doing an excellent job.

Yours,

Björn

P.S.: Some aspects related to compile time:

Bearing in mind that typical armv6m targets will have around 32k of program memory, I expect that embedded software guys will be willing to tolerate much longer compilation times. Maybe there are expensive options that I am not aware of and that are currently not activated by default for compile-time performance reasons.

P.P.S.: Ideas for further code improvements of LLVM for small targets:

When comparing our hand-coded assembly version with the best compiler-generated version, we observed a speed gain of almost a factor of 2. It might be interesting to find out where the biggest weaknesses of the compiler-generated code were, in order to find points for improvement.
For the most important v6m system, Cortex-M0 / M0+, the main speed bottlenecks were register pressure and the slow (2-cycle) memory accesses. Besides special tricks, the asm version gained speed by changing internal calling conventions (no callee-saved regs, all regs saved by the caller), by replacing individual LDR/STR instructions with LDM/STM sequences operating on more registers, and by using the upper register half as a spill bank.

Looking at those points, I suppose that the last aspect might be implemented in LLVM without too many problems. Basically, the idea is to use R8, R10, R11, R12 and R13 as temporary spill slots that can be accessed in only 1 cycle instead of the 2 cycles required for memory accesses. For our crypto code, we tried hard, but in vain, to use the upper registers for anything useful besides spill bank usage.
If LLVM identifies large functions with lots of stack slots, it might be a good idea to consider adding the upper regs to the spill list and replacing stack slot accesses with register accesses where possible.
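To get a feeling for the potential gain, the cycle counts given above (2 cycles per memory access, 1 cycle per move to or from an upper register) can be put into a toy cost model. This is a rough sketch, not a description of any LLVM heuristic: the function name and the greedy "hottest values first" assignment are assumptions, and the one-off cost of saving and restoring the upper registers in the prologue and epilogue is ignored.

```python
MEM_ACCESS_CYCLES = 2   # LDR/STR cost as stated in the mail
HI_REG_MOVE_CYCLES = 1  # move between a low and an upper register, per the mail


def spill_traffic_cycles(accesses_per_value, hi_reg_slots):
    """Estimate the cycles spent on spill traffic in one function.

    accesses_per_value: number of stores plus reloads for each spilled value.
    hi_reg_slots: how many upper registers are available as spill slots.
    The most frequently accessed values are placed in upper registers;
    everything else stays in stack slots.
    """
    ranked = sorted(accesses_per_value, reverse=True)
    in_registers = ranked[:hi_reg_slots]
    on_stack = ranked[hi_reg_slots:]
    return (HI_REG_MOVE_CYCLES * sum(in_registers)
            + MEM_ACCESS_CYCLES * sum(on_stack))
```

For a hypothetical function with four spilled values accessed 10, 6, 4 and 2 times, the model gives 44 cycles with stack slots only and 28 cycles when two upper registers hold the two hottest values.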

> Hello to all.
>
> When studying forums and mailing lists it seems to me that llvm usage
> for very small arm v6m targets is not so common.
>
> ...snip...
>
> For the most important v6m system, cortex M0 / M0+, the main speed
> bottlenecks were register pressure and the slow (2-cycle) overhead for
> memory accesses. Besides special tricks, the asm optimizations did
> improve by changing internal calling conventions (no callee-saved-regs,
> all regs saved by caller), by replacing individual LDR/STR by LDM/STM
> sequences operating on more registers and by using the upper register
> half as spill bank.
>
> When looking at those points, I suppose that the last aspect might be
> implemented in LLVM without too much of problems. Basically, the idea is
> to use R8,R10,R11,R12 and R13 as temporary spill slots that may be
> accessed with only 1 cycle instead of the 2 cycles required for memory
> accesses. For our crypto, we have tried hard but in vain using the upper
> registers for anything useful beside spill bank usage.
> If llvm identifies large functions with lots of stack slots, it might be
> a good idea considering adding the upper regs to the spill list and
> replacing stack slot accesses to register accesses instead, if possible.

This sounds like a really interesting idea. One concern about this would be the cost of spilling from one of these hi-reg spill slots (since push & pop only operate on lo regs). Because of that, you'd need to avoid using them to spill live ranges that cross calls.

Cheers,

Jon

>> This sounds like a really interesting idea. One concern about this
>> would be the cost of spilling from one of these hi-reg spill slots
>> (since push & pop only operate on lo regs). Because of that, you'd
>> need to avoid using them to spill live ranges that cross calls.
>
> I did not quite get the point. All of those regs (except R13 of course)
> are required to be callee saved by the ABI. So you could safely use
> them to hold spilled data across calls. Of course, initial pushing and
Sorry, you're right. I was incorrectly thinking they were caller-saved.

From the AAPCS:

"A subroutine must preserve the contents of the registers r4-r8, r10, r11 and SP (and r9 in PCS variants that
designate r9 as v6).
In all variants of the procedure call standard, registers r12-r15 have special roles. In these roles they are labeled
IP, SP, LR and PC."

Jon
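
The register list in the AAPCS quote can be checked mechanically against the spill-bank candidates named in the original mail. The following is a small sketch; the function and variable names are made up for illustration, and r9 is left out because the quote makes it variant-dependent. Read literally, the quote does not list r12 (IP) among the registers a subroutine must preserve, so a value parked there would not be guaranteed to survive a call.

```python
# Registers a subroutine must preserve according to the quoted AAPCS text.
# r9 is omitted because it is only preserved in some PCS variants.
PRESERVED_BY_CALLEE = {"r4", "r5", "r6", "r7", "r8", "r10", "r11", "sp"}

# Special-role names for r12-r15, as given in the quote.
ROLE_NAMES = {"r12": "ip", "r13": "sp", "r14": "lr", "r15": "pc"}

# Spill-bank candidates listed in the original mail.
SPILL_BANK_CANDIDATES = ["r8", "r10", "r11", "r12", "r13"]


def survives_call(register):
    """True if the quoted text requires a callee to preserve the register."""
    name = register.lower()
    return ROLE_NAMES.get(name, name) in PRESERVED_BY_CALLEE


def not_preserved(candidates):
    """Candidates whose contents may be changed by a called subroutine."""
    return [reg for reg in candidates if not survives_call(reg)]
```

Calling `not_preserved(SPILL_BANK_CANDIDATES)` yields only `r12`; r13 counts as preserved here simply because it is SP.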