Schedules, latency and register liveness for complex instructions

The CPU that I am targeting is VLIW with no hardware interlocking (the next instruction does not wait for the previous one to complete). This leads to fairly complex scheduling, but it can generally be accommodated well in LLVM.

However, I have a small number of useful instructions with quite complex scheduling interactions between latency and register liveness, and which have more than one input register and more than one output register.

LLVM assumes that input registers are read as an instruction commences and become dead at that time, while output registers are committed when the instruction latency is complete and become live at that time.

But for some instructions this is not the case. I have one particular example where one input register is read and becomes dead as the instruction starts, but the other input register is read and becomes dead at the start of the following cycle. The instruction also writes back one output register at the end of the second cycle, which is when that register becomes truly live, and writes back the other output register 4 cycles later, which is when it becomes live.
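
To make the timing concrete, here is a rough sketch of the per-operand arithmetic in plain Python (not LLVM code). The operand names `src0`, `src1`, `dst0` and `dst1` are made up, and the cycle numbers are one reading of the description above: issue is cycle 0, the first output is readable from cycle 2, and the second from cycle 6.

```python
# Hypothetical per-operand timing for the example instruction, in cycles
# relative to its issue cycle (cycle 0). Names and numbers are assumptions.
READ_CYCLE = {"src0": 0, "src1": 1}    # cycle in which each input is read
READY_CYCLE = {"dst0": 2, "dst1": 6}   # first cycle each output can be read


def min_issue_distance(ready_cycle: int, read_cycle: int) -> int:
    """Fewest cycles between issuing a producer and a dependent consumer.

    ready_cycle: first cycle, relative to producer issue, in which the
                 produced register holds its new value.
    read_cycle:  cycle, relative to consumer issue, in which the consumer
                 actually reads that register.
    """
    # The consumer's read must not happen before the value is ready.
    return max(0, ready_cycle - read_cycle)


def uniform_issue_distance(ready_cycles: dict) -> int:
    """Distance under a single-latency model.

    Every output is treated as ready only at the longest latency, and
    every input is treated as read at consumer issue.
    """
    return max(ready_cycles.values())
```

With these numbers, a consumer that reads `dst0` at issue could follow after 2 cycles, and a consumer that reads `dst1` in its own second cycle after 5, whereas the single-latency model would force 6 cycles in both cases. With no interlocks the real distances must never be underestimated, so the simple model is safe but wasteful.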

The TableGen descriptions do not seem to have any means of binding a register-liveness schedule to specific operands.

So far I have omitted support for these particular instructions as I can’t figure out how they can be modelled within LLVM. Does anybody know how I might approach describing this kind of semantics to LLVM so that I can safely schedule them?

Variations of this involve instructions designed for pipelined execution, where in pipelined mode the schedule for some operands is different from that for single-issue execution.

Thanks,

MartinO

I remember in particular that Fujitsu FR has (or had) something like this, but they implemented the solution at the actual instruction level when dealing with assembly. Performing a division required running div0, then a series of about 20 div1, then a div2s or div2u or similar to complete the output. It was sort of what I imagine VLIW instructions do in microcode, but it had to be written out. What about treating this similarly at the MachineInstr level, as a pseudo-instruction or bundle type format or something like that? This would allow other instructions to be inserted where they can be efficiently pipelined without infringing on the register use/liveness delay, while emitting it as a single instruction in the original position, with the other insts not emitted. In any case it sounds like a difficult proposition.

Maybe take a look at whatever is being done in the ARM backend for the LDMIA/STMIA style instructions, which have different timings depending on how many registers they load or store. In that case I don’t know whether the loaded registers are all materialized at once after an appropriate delay according to processor timing, or whether the additional cycles taken by versions with more operands make the registers live as they go along. I’m guessing they modeled it as all outputs of the load being live after the full instruction delay, because ARM’s documents are unclear on whether or not any become live before the “normal” delay of a single load, and the timings are weird. That’s why I suggest it, though: it’s the closest thing I can think of to what you’re talking about, and it has very weird timings depending on how many registers are loaded at once, or did on ARMv7 anyway.

This sounds like ReadAdvance (grep the codebase); we use it to model a few instructions that read their operands earlier than would otherwise be necessary (or later!).
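
As far as I understand it, the effect of ReadAdvance on a dependence is simple arithmetic: the advance in cycles is subtracted from the producer's write latency, and the result is clamped at zero. A minimal sketch of that arithmetic in plain Python follows; the function name `operand_latency` is made up for illustration and this is a model of the idea, not LLVM's actual implementation.

```python
def operand_latency(write_latency: int, read_advance: int) -> int:
    """Effective producer-to-consumer latency for one operand.

    write_latency: cycles after producer issue at which the result is ready.
    read_advance:  positive if the consumer reads this operand that many
                   cycles after it issues (the dependence gets shorter);
                   negative if it needs the operand earlier (it gets longer).
    """
    # A late read can hide the whole write latency, but never more than that.
    return max(0, write_latency - read_advance)
```

For example, a result with latency 6 feeding an operand that is read one cycle after issue gives an effective latency of 5, and a negative advance of -1 on a latency of 2 gives 3. Since the advance is attached to an individual read operand, this would cover the late-read input in the question; the two outputs with different write-back times would still need separate write latencies.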

—escha