TableGen syntax for matching a constant load

Hi all,
I'm trying to add an X86 pattern to turn
  movl $-1, %eax
into
  orl $-1, %eax
I can't find a way to express this in TableGen syntax though.

def : Pat<(set GR32:$src, (i32 -1)), (OR32ri8 GR32:$src, -1)>;

results in an assertion about 'Unknown Node'.

Joerg

Hi Joerg,

This is problematic to define in the obvious way, because "orl" takes two inputs and produces an output, but this specific form of it ignores the inputs and just sets the register (and flags).

We can handle this by defining a new pseudo-op that just defines the register+flags. Take a look at how "xor reg,reg" -> "mov 0, reg" is done. MOV32r0 is the place to start: MCInstLowering then turns it into the actual operation.
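
As a rough sketch (SET32rNeg1 is an invented name, and the exact class
and format flags would need to be checked against X86InstrCompiler.td),
the pseudo itself could look something like:

// Code-gen-only pseudo: materialize -1 into a GR32 register, clobbering
// EFLAGS. MCInst lowering would later rewrite it into "orl $-1, %reg",
// just like MOV32r0 becomes "xorl %reg, %reg".
let Defs = [EFLAGS] in
def SET32rNeg1 : Pseudo<(outs GR32:$dst), (ins), "",
                        [(set GR32:$dst, -1)]>;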

-Chris

Please make sure to measure the performance impact of doing this. You are creating a false dependency on the last instruction to write %eax, and the CPU won't be able to execute the following instructions in parallel.

You may want to consider using xorl+decl instead. It is also three bytes, and there are no false dependencies. The xor idiom is recognized by processors as old as Pentium 4 as having no dependencies.
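
For reference, the encodings involved (32-bit mode):

  movl $-1, %eax        # b8 ff ff ff ff   5 bytes, no input dependency
  orl  $-1, %eax        # 83 c8 ff         3 bytes, but reads %eax
  xorl %eax, %eax       # 31 c0
  decl %eax             # 48               3 bytes total, no input dependency

(In 64-bit mode decl needs two bytes, so the xorl+decl pair grows to four.)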

/jakob

> I'm trying to add an X86 pattern to turn
> movl $-1, %eax
> into
> orl $-1, %eax

> Please make sure to measure the performance impact of doing this. You
> are creating a false dependency on the last instruction to write %eax,
> and the CPU won't be able to execute the following instructions in
> parallel.

I am primarily interested in size here, not speed.

> You may want to consider using xorl+decl instead. It is also three
> bytes, and there are no false dependencies. The xor idiom is recognized
> by processors as old as Pentium 4 as having no dependencies.

Any examples of how to create more than one instruction for a given
pattern? There are some other cases I could use this for.

Joerg

def : Pat<(i32 -1), (DEC32r (MOV32r0))>;

/jakob

Hm. Right. This gives me the first set of size peephole optimisations, as
attached. I didn't add the above rule for 64bit builds, since it is
larger than the still-to-be-figured-out OR32rmi8 / OR64rmi8.

Joerg

X86InstrCompiler.td.diff (876 Bytes)

All these patterns have one important downside: they are suboptimal if
more than one store happens in a row. E.g. storing 0 twice is better
expressed as an xor followed by two register-to-memory moves, if a
register is available... This is most noticeable when memset() gets
inlined.
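
As a concrete sketch of that case, using %edi as the destination pointer
and %eax as the free scratch register (byte counts for 32bit mode):

  movl $0, (%edi)       # c7 07 00 00 00 00     6 bytes
  movl $0, 4(%edi)      # c7 47 04 00 00 00 00  7 bytes

versus

  xorl %eax, %eax       # 31 c0                 2 bytes
  movl %eax, (%edi)     # 89 07                 2 bytes
  movl %eax, 4(%edi)    # 89 47 04              3 bytes

i.e. 13 bytes against 7.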

Joerg

Note that LLVM's -Os option does not quite mean the same as GCC's flag.
It disables optimizations that increase code size without a clear performance gain.
It does not try to minimize code size at any cost.

When you said you weren't concerned about performance, I assumed you wouldn't be submitting patches. Sorry about the confusion.

Implementing constant stores as load+bitop+store is almost certainly not worth the small size win.

As for materializing (i32 -1) in 3 bytes instead of 5, but with 2 µ-ops instead of 1, I would like to see some performance numbers first. It might be cheap enough that it is worth it.

The MOV32ri instruction can be rematerialized and is considered to be as cheap as a move. That is not true for xorl+decl, and unfortunately the register allocator currently doesn't know how to rematerialize multiple instructions, which means that the register containing -1 could get spilled. We really don't want that to happen.

Until the register allocator learns how to rematerialize multiple instructions, you would need to use a pseudo-instruction representing the xorl+decl pair. That instruction could be marked as rematerializable and even as cheap as a move.
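
Very roughly, and with an invented name, such a pseudo could be declared as:

// Stands in for the xorl+decl pair; marked rematerializable and as cheap
// as a move so the allocator can recompute -1 instead of spilling it.
// The expansion into the real instruction pair still has to be written
// somewhere after register allocation.
let Defs = [EFLAGS], isReMaterializable = 1, isAsCheapAsAMove = 1 in
def MOV32rNeg1 : Pseudo<(outs GR32:$dst), (ins), "",
                        [(set GR32:$dst, -1)]>;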

Then you can start measuring the performance impact ;)

Thanks,
/jakob

Jakob is right, but there is a clear market for "smallest at any cost". The FreeBSD folks would really like to build their bootloader with clang for example :).

It should be reasonably easy to add a new "optsize2" function attribute to LLVM IR, and have that be set with -Oz (the "optimize for size at any cost") flag, which could then enable stuff like this.

There are lots of other cases where this would be useful, such as forced use of "rep; movsb" / "rep; stosb" on x86, which is much smaller than a call to memcpy or memset, but also much slower :).

-Chris

Such a mode would also:

- Lower the inlining threshold to 0 so functions are only inlined when it would decrease the overall code size.

- Disable the SSEExecutionDomain pass, which selects equivalent but larger vector instructions.

- Disable tail duplication and be more aggressive about tail merging.

- Perhaps select x86 instructions with 16-bit immediates?

You could also tweak the heuristics used by register coalescing and spill code placement, but that involves more work.

/jakob

>>
>> All these patterns have one important downside. They are suboptimal if
>> more than one store happens in a row. E.g. the 0 store is better
>> expressed as xor followed by two register moves, if a register is
>> available... This is most noticable when memset() gets inlined
>
> Note that LLVM's -Os option does not quite mean the same as GCC's flag.
> It disables optimizations that increase code size without a clear performance gain.
> It does not try to minimize code size at any cost.

> Jakob is right, but there is a clear market for "smallest at any cost".
> The FreeBSD folks would really like to build their bootloader with
> clang for example :).

Yes, I have the same problem for NetBSD. All but two of the boot loaders
are working. One is currently off by less than 300 bytes, the other by 800.

> It should be reasonably easy to add a new "optsize2" function attribute
> to LLVM IR, and have that be set with -Oz (the "optimize for size at
> any cost") flag, which could then enable stuff like this.

> There are lots of other cases where this would be useful, such as
> forced use of "rep; movsb" / "rep; stosb" on x86, which is much smaller
> than a call to memcpy or memset, but also much slower :).

Agreed. From studying GCC's peephole optimisation list and the
assembler code, I see the following candidates for space savings:

Turning setcc followed by movzbl into an xor plus setcc. This is #8785 and
a general optimisation; I have seen enough code that would profit from it.
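
As a sketch of the transformation, using sete (the xor has to be placed
before the flag-setting instruction, since it clobbers EFLAGS):

  sete   %al            # 0f 94 c0   3 bytes
  movzbl %al, %eax      # 0f b6 c0   3 bytes

becomes

  xorl   %eax, %eax     # 31 c0      2 bytes
  sete   %al            # 0f 94 c0   3 bytes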

The optimised memory set from this thread. Assigning a scratch register
when multiple instructions want to use the same 32bit immediate would be
useful in other cases too.

Function prologue and epilogue can often be optimised by adjusting %esp
or %rsp using push/pop. E.g. in 32bit mode, "addl $4, %esp" and "addl
$8, %esp" are more compactly expressed as one or two pops to a scratch
register. This is also a hot path for most CPUs. The same holds for
subtracting and for 64bit mode. Generally, using push/pop for stack
manipulation would be much nicer for code size, but it would require
extensive changes to the code generator. I think this is the main reason
GCC creates smaller binaries.
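
For illustration, in 32bit mode ("popl" needs a register whose contents
do not matter, here %ecx; "pushl" writes a garbage value into the newly
allocated slot):

  addl $4, %esp         # 83 c4 04   3 bytes

versus

  popl %ecx             # 59         1 byte (clobbers %ecx)

and likewise "subl $4, %esp" (83 ec 04, 3 bytes) versus "pushl %eax"
(50, 1 byte).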

Using cmp/test against a constant before a conditional branch can often
be optimised if the register is dead afterwards. For cmp, a check against
-1 or 1 can be replaced with inc/dec, inverting the condition where
necessary. This saves 2 bytes in 32bit mode and 1 byte in 64bit mode, and
it applies at all optimiser levels. Compares against 8bit signed
immediates for 32bit / 64bit registers can be expressed as add or sub,
saving 2 bytes in all cases.
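
To illustrate the cmp case with an equality branch (no condition
inversion is needed there), assuming %eax is dead after the branch and
with .Ltaken as a placeholder label:

  cmpl $1, %eax         # 83 f8 01   3 bytes
  je   .Ltaken

versus

  decl %eax             # 48         1 byte in 32bit mode, 2 in 64bit mode
  je   .Ltaken          # ZF is set exactly when %eax was 1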

Joerg