>>
>> All these patterns have one important downside. They are suboptimal if
>> more than one store happens in a row. E.g. the 0 store is better
>> expressed as an xor followed by two register stores, if a register is
>> available... This is most noticeable when memset() gets inlined
>
> Note that LLVM's -Os option does not quite mean the same as GCC's flag.
> It disables optimizations that increase code size without a clear performance gain.
> It does not try to minimize code size at any cost.
Jakob is right, but there is a clear market for "smallest at any cost".
The FreeBSD folks would really like to build their bootloader with
clang for example :).
Yes, I have the same problem for NetBSD. All but two of the boot loaders
are working. One is currently over the size limit by less than 300 bytes,
the other by 800.
It should be reasonably easy to add a new "optsize2" function attribute
to LLVM IR, and have that be set with -Oz (the "optimize for size at
any cost") flag, which could then enable stuff like this.
There are lots of other cases where this would be useful, such as
forced use of "rep; stosb" on x86, which is much smaller than a call
to memset, but also much slower :).
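For illustration, an inlined zero-memset via rep; stosb might look like
this (AT&T syntax, 32bit mode; register assignment and byte counts are
my own sketch, not compiler output):

```
	# memset(buf, 0, n): buffer in %edi, count in %ecx
	xorl	%eax, %eax	# byte value to store (2 bytes)
	rep; stosb		# store %al, %ecx times (2 bytes)
```

Four bytes of code versus pushing arguments and emitting a call, but the
microcoded rep loop is slow for small counts on many CPUs.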
Agreed. From studying GCC's peephole optimisation list and the
assembler code, I see the following candidates for space saving:
Replacing setcc followed by movzbl with an xor followed by setcc. This is
#8785 and a general optimisation. I have seen enough code that would
profit from this.
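A sketch of that transformation (assuming %eax is the result register
and the flags come from a preceding cmp; byte counts are for 32bit mode):

```
	# before: 6 bytes after the compare
	cmpl	%ecx, %edx
	sete	%al		# 3 bytes
	movzbl	%al, %eax	# 3 bytes
	# after: 5 bytes; the xor must be placed before the
	# compare, because xor clobbers the flags
	xorl	%eax, %eax	# 2 bytes
	cmpl	%ecx, %edx
	sete	%al		# 3 bytes
```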
The optimised memory set from this thread. Assigning a scratch register
when multiple instructions want to use the same 32bit immediate would be
useful in other cases too.
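For example, two adjacent zero stores (register and offsets chosen
arbitrarily for illustration):

```
	# before: the $0 immediate is encoded twice, 4 bytes each
	movl	$0, (%edi)	# 6 bytes
	movl	$0, 4(%edi)	# 7 bytes
	# after: materialise the constant once in a dead register
	xorl	%eax, %eax	# 2 bytes
	movl	%eax, (%edi)	# 2 bytes
	movl	%eax, 4(%edi)	# 3 bytes
```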
Function prologue and epilogue can often be optimised by adjusting %esp
or %rsp using push/pop. E.g. in 32bit mode, "addl $4, %esp" and "addl
$8, %esp" are more compactly expressed as one or two pops into a scratch
register. This is also a hot path on most CPUs. The same holds for
subtraction and for 64bit mode. Generally, using push/pop for stack
manipulation would be much nicer for code size, but would require
extensive changes to the code generator. I think this accounts for the
majority of why GCC creates smaller binaries.
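A sketch of the epilogue case in 32bit mode (assuming %ecx and %edx are
dead at this point):

```
	# before: freeing 8 bytes of stack
	addl	$8, %esp	# 3 bytes
	# after: two pops into dead registers
	popl	%ecx		# 1 byte
	popl	%edx		# 1 byte
```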
Using cmp/test against a constant before a conditional branch can often
be optimised if the register is dead afterwards. For cmp, a check against
-1 or 1 can be replaced with inc/dec, inverting the condition where
necessary. This saves 2 bytes in 32bit mode and 1 byte in 64bit mode, and
applies at all optimiser levels. Compares against 8bit signed immediates
for 32bit / 64bit registers can be expressed as add or sub, saving 2
bytes in all cases.
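The inc/dec case, sketched for a compare against 1 (only valid if %eax
is dead after the branch; byte counts are for 32bit mode):

```
	# before
	cmpl	$1, %eax	# 3 bytes
	je	1f
	# after: decl sets ZF exactly when %eax was 1
	decl	%eax		# 1 byte in 32bit mode
	je	1f
```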
Joerg