Question about target instruction optimization

This is a question about optimizing the code generation in a (new) Z80 backend:

The CPU has several 8 bit physical registers, e.g. H, L, D and E, which are overlaid as 16 bit register pairs named HL and DE.

It also has a native instruction to load a 16 bit immediate value into a 16 bit register pair (HL or DE), e.g.:

     LD HL,<imm16>

Now, given a sequence that loads two 16 bit register pairs with the *same* immediate value, the simple approach is:

     LD HL,<imm16>
     LD DE,<imm16>

However, the second line can be shortened (in opcode bytes and cycles) by copying the overlaid 8 bit registers of HL (H and L) into the overlaid 8 bit registers of DE (D and E), so the desired result is:

     ; optimized version: saves 1 byte and 2 cycles
     LD D,H (sets the high 8 bits of DE from the high 8 bits of HL)
     LD E,L (same for lower 8 bits)

Another example: if reg pair DE needs to be loaded with imm16 = 0, and another physical(!) register is known to be 0 (from a previous immediate load, directly or indirectly) - assuming that L = 0 (H might be something else) - then the following code:

     LD DE,0x0000

should become:

     LD D,L
     LD E,L

I would expect that this needs to be done in a peephole optimizer pass, as during the lowering process, the physical registers are not yet assigned.

Now my question:
1. Is that correct (peephole instead of lowering)? Should the lowering always emit the generic, not always optimal "LD DE,<imm16>"? Or should the lowering process split the 16 bit immediate load into two 8 bit immediate loads (via two new virtual 8 bit registers), which would be eliminated later automatically?
2. And if a peephole is the better choice, which of these is recommended: the SSA-based Machine Code Optimizations, or the Late Machine Code Optimizations? Both places in the LLVM code generator docs say "To be written", so I don't really know which one to choose... or should I even write a custom pass?

...and, more importantly, how would I check whether any physical register contains a specific fixed value at a certain point (in which case the optimization can be done) - or not?


This is so far down the list of problems you'll have (and the difference is so trivial to program size and speed) that I think you should ignore it until you have a working compiler.

As for two registers getting the same value, that should be picked up by common subexpression elimination in the optimiser anyway.

You might want to consider having a pseudo-instruction for LD {BC,DE,HL,IX,IY},{BC,DE,HL,IX,IY} (all combinations are valid except those containing two of HL,IX,IY). You could expand this very late in the assembler, or during legalisation.
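As a sketch of that idea, such a pseudo (here a hypothetical LD DE,HL - not a real Z80 instruction) would simply expand to the existing 8 bit moves, either very late in the assembler or during legalisation:

     ; pseudo-instruction (no such Z80 opcode exists):
     LD DE,HL
     ; expands to:
     LD D,H
     LD E,L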

Yes, such optimizations are something for the "last 20%" of the project, nice-to-haves.

As of now, I have yet to get a feeling for what LLVM can do on its own, depending on what it derives from the instruction tables, and where - and how much - it needs help in the other processing stages.
As this affects the way the instruction info table will be set up, I appreciate your suggestions very much!

Now that you mentioned using a pseudo-instruction for the possible 16 bit LD command combinations:

Regarding the heavily overlapped register structure and the asymmetric instruction set of the Z80: would you recommend trying to map more instructions as generic pseudos that expand to multiple instructions during legalisation (leading to more custom lowering code), or trying to map as many instructions and variations as possible 1:1, within the allowed / limited operands, leading to simpler lowering code (not sure if I am using the right words here)?


I don't know. It's going to be tough.

The register structure and especially the overlaying is very close to the
8086/386, so you can probably get some ideas there. AF is related to AX
(AH/AL). The 8086 expands A to a full 16 bit accumulator and puts F
elsewhere, but there are instructions to copy AH to F and to copy F to AH!
BC, DE, and HL are a lot like BX, CX, and DX. IX and IY are I guess like SI
and DI except they're overlaid on top of HL. BP is extra on 8086.

The available instructions are a LOT more limited on Z80 though, especially
for memory addressing and 16 bit operations. That's going to be tough. SP
in particular is so crippled that I think you'll have to keep a copy of it
in IX or IY basically all the time so you can access variables in stack
slots using the (IX+n) addressing mode. You can copy HL, IX or IY to SP but
not vice-versa! But you can add SP to HL, IX or IY. If you tie up, say, IX
with a copy of the stack pointer all the time then you're going to be very
very short on other places to keep pointers.
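(For reference, the standard idiom to work around the missing SP-to-register copy goes through that ADD:

     ; read SP into HL - there is no direct LD HL,SP
     LD HL,0x0000
     ADD HL,SP        ; HL = SP

and the reverse direction is the plain LD SP,HL.)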

The more I look at it, the harder I think z80 will be. I'd almost rather do
6502! At least there you just admit the registers are useless and use zero
page as registers -- and there are as many as you could ever want! The code
size is awful though.

Yes, "crippled" is the right word to describe some areas of the instruction set.
You are right about the register comparison with the x86 regs.

Indeed, the idea is to save IX and assign the stack pointer to it in a function's prologue, and, in the epilogue, to restore the SP from IX and then IX's original value - *if* the function needs a frame (i.e. has parameters or variables on the stack).
Plus, even adjusting the stack pointer past the local storage area requires sacrificing the HL register pair for larger amounts (or a series of "inc sp"s or dummy pushes for smaller ones). So the emitted code for the frame setup will be quite dynamic, depending on the circumstances, but can be done with ~3-4 cases.
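For the common case, the frame setup could look roughly like this (a sketch only - the exact sequence depends on the frame size and which registers are live; <locals> stands for the size of the local storage area):

     ; prologue (function needs a frame)
     PUSH IX          ; save caller's IX
     LD IX,0x0000
     ADD IX,SP        ; IX = frame base
     LD HL,-<locals>
     ADD HL,SP
     LD SP,HL         ; reserve local storage (sacrifices HL)
     ...
     ; epilogue
     LD SP,IX         ; discard locals
     POP IX           ; restore caller's IX
     RET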

Stack access to locals / spilled vars within the boundaries of the offset range (-128 to +127) will be done via IX, and any access outside that range will require address calculation from the current stack base via HL (loading the offset into HL and adding SP).
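A sketch of such an out-of-range access (assuming the value is wanted in DE; the offset is illustrative):

     LD HL,<offset16>  ; frame offset beyond the IX displacement range
     ADD HL,SP         ; HL = address of the stack slot
     LD E,(HL)         ; low byte
     INC HL
     LD D,(HL)         ; high byte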

And even then, saving / spilling a physical register (16 bit pair) into a stack position within the range of IX is a costly sequence (LD (IX+n),low8 / LD (IX+(n+1)),high8 to store a value, and the reverse to restore it) - 12 bytes in total, at 3 bytes per IX-relative instruction - and IX/IY instructions are always slower than operations with the other regs. So "push" and "pop" would be preferred for temporary short-term (single-)register spilling - those need only 1 byte each and save / restore the whole 16 bit value in one go.
Not sure how to tell LLVM to do so, though.

However, for functions small enough to do all computation in the available registers, or where spilling can be limited to a few push and pop operations, the whole call frame setup can be skipped entirely. If a few params can be passed through registers, the resulting code can be as efficient as hand-written assembly (or even more so, given the capabilities of SSA).

Knowing that, a developer will have control over it to at least *some* degree. Making a local var "static" would allow the compiler to use the efficient instruction to store and restore a 16 bit variable directly at a memory address (at the cost of losing recursion and of the optimizer being able to keep it in a register).
But then, using existing C compilers for the Z80 (from *way* back then) was always a game of compiling, checking the emitted code, rearranging, building, and checking again if the function was time-critical - or writing it in assembly. In most cases, the code just needs to be "good enough", and good compilers achieved 50-90% of the performance of hand-written assembly.
And of course, I expect LLVM to beat that >;->

I started off with jacobly0's (E)Z80 backend heritage, made it build with a recent version of LLVM again, and am trying to understand the shortcomings and areas for improvement.
It is targeted at the EZ80's more powerful instruction set, introduced a lot of custom code into the LLVM base to support the EZ80's 24 bit native pointers and custom binary file output (which caused it to no longer work with the current LLVM codebase), and, last but not least, is incomplete.
I created a project area on GitHub where I started working with a friend from the MSX community on that, with - apart from making a decent Z80 backend - some long-term goals such as banking support for a "far" memory model, to be able to compile "bigger" applications such as a ZIP archiver. Might end up horribly slow, though :smiley:
As of now, we are in the phase of getting a feeling for how to do things in the LLVM backend and of defining the rules for the instruction lowering, calling conventions and frame setup (shamelessly studying the code that old Z80 C compilers emit :wink:)