Help making 'narrow instruct microcode' Backend

I've been experimenting with llvm/clang as a user for a while now, but now
I'm interested in writing my own backend. I'm also developing the target
architecture (maybe to go in an fpga eventually) and I'm intentionally
making it extremely simple. I think of it as a narrow microcode, because
(for example) performing an add requires a sequence of instructions like:
set aluin1 = r1
set aluin2 = r2
aluop add
set r3 = aluout

I've started implementing the backend in clang, and I got this basic
example working by modifying my backend's implementation of
SelectionDAGISel::Select to handle ISD::ADD and transform it into the
above 4-instruction sequence. However, I'd like to expand the architecture
to have more than 1 alu, with one set of alu registers for each alu (ex
aluAin1, aluAin2, aluAout, and aluBin1, aluBin2, aluBout). I'd also like
to have llvm's register allocation and instruction scheduling select which
alu to use at what time, but I don't think I can acomplish this with my
current approach.

One thought I had was making the aluin1/2 and aluout subregisters of a
larger register, and then making multiple instances of that register type
and a class for the larger registers, but I'm not sure how to tell llvm
that it should use the multi-instruction sequence for ADD. Is there a way
to do it in the Select function like I did for 1, but with virtual
registers so the allocator/scheduler can pick which alu to use?

Any information on how I might implement the super/subregister solution or
any alternative approaches to implementing a backend for this architecture
are greatly appreciated.

This sounds very similar to some sorts of existing VLIW architectures, which may look something like this:

// START BUNDLE
{
// input phase
s0 = <some input>
s1 = <some input>
[…]
sN = <some input>
// phase 0
t0 = <some operation using s registers>
t1 = <some operation using s registers>
[…]
tN = <some operation using s registers>
// phase 1
// stuff using t registers
[…]
// phase N
// stuff using t registers
// output phase
rX = t0
rY = t1
rZ = t2
}
// END BUNDLE

Of course, the bundles can be as complex or simple as you like; they could have just one phase and one operation, for example. Either way, this sort of thing where it defines inputs, then defines an instruction, then defines outputs, sounds very much like a problem for VLIW bundles. VLIW tends to be very simple/“close to the metal” insomuch as the bits in the instruction encoding directly map to muxes between the phases, i.e. selecting which register is passed from where to where.

—escha

I've been experimenting with llvm/clang as a user for a while now, but
now
I'm interested in writing my own backend. I'm also developing the target
architecture (maybe to go in an fpga eventually) and I'm intentionally
making it extremely simple. I think of it as a narrow microcode, because
(for example) performing an add requires a sequence of instructions
like:
set aluin1 = r1
set aluin2 = r2
aluop add
set r3 = aluout

This sounds very similar to some sorts of existing VLIW architectures,
which may look something like this:

// START BUNDLE
{
// input phase
s0 = <some input>
s1 = <some input>
[…]
sN = <some input>
// phase 0
t0 = <some operation using s registers>
t1 = <some operation using s registers>
[…]
tN = <some operation using s registers>
// phase 1
// stuff using t registers
[…]
// phase N
// stuff using t registers
// output phase
rX = t0
rY = t1
rZ = t2
}
// END BUNDLE

Of course, the bundles can be as complex or simple as you like; they could
have just one phase and one operation, for example. Either way, this sort
of thing where it defines inputs, then defines an instruction, then
defines outputs, sounds very much like a problem for VLIW bundles. VLIW
tends to be very simple/“close to the metal” insomuch as the bits in
the instruction encoding directly map to muxes between the phases, i.e.
selecting which register is passed from where to where.

—escha

My target has a lot in common with VLIW architectures, except for the 'VL'
part. I'd like to essentually have similar low-level semantics without
bundling instructions in order to take advantage of the llvm scheduling
and register allocation capabilities (and also make the target simpler,
not having to perform any but the simplest operations).

Currently, I have it partially working by defining an ALU as a v4i32
register with 4 subregisters (in1, in2, out, flags), and then in my
SelectionDAGISel::Select, I replace a node like 'ISD::ADD' like so:
r3 = add r1, r2
into
AluReg = temporary from ALURegClass
Copy To AluReg, (IMPLICIT_DEF)
AluIn1 = ALUIN1 (Copy From AluReg), r1
AluIn2 = ALUIN2 AluIn1, r2
AluResult = ADD AluIn2
res = temporary from GPRRegClass
Copy To res (ALUOUT AluResult)
r3 = res

Where every value/register named Alu* is the same v4i32 alu register in
various stages of update. I found the copy of implicit def through an alu
seems to help llvm reason about the subreg liveness (using just implicit
def, llvm seems to think the regs live forever, and using just a temp
register, llvm complains it is used before it is defined), and the copy of
the result through a gpr helps with register pressure (presumably because
GPRs are more plentiful than alu registers).

However, I still have an issue with register allocation. If, in my
subclass of TargetLowering, I set the scheduling preference to
Sched::RegPressure, things work ok most of the time, but I get extremely
poor scheduling because every alu operation uses aluA and computes
everything in order exactly as described above. If I use any other
scheduling preference, I get much better scheduled code, but llvm moves
the ALUIN1r/ALUIN2r instruction pair far from the ADD/ALUOUTr instruction
pair and thus creates an excessive number of long live ranges. It then
tries to spill the ALU registers, which isn't possible, and everything
fails.

Any advice on getting any 'real' scheduling to work without introducing
unneeded spilling (unneeded because it IS possible to keep the 4
instructions grouped and thus never have an alu register live outside an
actual alu operation, and up to N groups can be interleaved when I have N
alu registers) would be greatly appreciated. I know I can bundle the
instructions, but that (afaik) eliminates the possibility of the
instruction scheduler interleaving multiple alu instructions, or mixing
alu instructions with non-alu instructions.

Hi James,

My target has a lot in common with VLIW architectures, except for the 'VL'
part. I'd like to essentually have similar low-level semantics without
bundling instructions in order to take advantage of the llvm scheduling
and register allocation capabilities (and also make the target simpler,
not having to perform any but the simplest operations).

I wonder if you already know about Transport Triggered Architectures?
Their programming model seems very close to your microcode approach.

You might be interested to take a look at TTA or exposed/explicit
datapath architectures in general, and our related TCE toolset work.

I wrote a blog about TCE and LLVM some 5y ago:

http://blog.llvm.org/2010/06/tce-project-co-design-of-application.html

There's an open source release of TCE which has been used widely in
academia and also in some commercial designs.

We utilize LLVM for big part of the compiler, but have our own final
code gen / isched steps due to the unconventional programming model.

HTH,