Creating an LLVM backend for a very small stack machine

Hi folks,

I am interested in creating an LLVM backend for a very small stack machine.

Before I start in on the actual implementation, I'd like to make sure that
I'm on the right track. Any comments, suggestions, warnings, tips, etc
would be greatly appreciated.

Background

* Has anyone else out there targeted (or tried to target) a stack machine
before? Was it successful? What problems did you have?

Haven't done that, and I don't think there are any existing backends
like this. It should be feasible, though; the backend code is pretty
flexible.

* What parts of the LLVM backend code generator infrastructure would be
usable for targeting a stack machine? e.g. Is it even possible to use
TableGen to target a stack machine?

You should be able to use existing LLVM backend code and TableGen at
least through instruction selection; I'm not sure whether you'll want
register allocation or not, but it should be easy to choose either
way. The whole thing is quite flexible; see
LLVMTargetMachine::addCommonCodeGenPasses in
lib/CodeGen/LLVMTargetMachine.cpp for a high-level overview of how
CodeGen works. It might also be useful to look at how LLVM handles x87
floating-point; the relevant code is in
lib/Target/X86/X86FloatingPoint.cpp.

* When/where/how do things like big integers (iXXXXX), phi nodes, and llvm.*
intrinsics get lowered; e.g. does my target have to do that, or is it done
generically?

Arbitrary-width integers, vectors, llvm.*, etc. are lowered
generically by the Legalize infrastructure; the backend just has to
say what it can and can't support. See
lib/Target/X86/X86ISelLowering.cpp for an example. I don't know the
details of PHI nodes, but that's also taken care of by instruction
selection.
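Eli's point about Legalize can be illustrated outside LLVM. Below is a minimal Python sketch (all names invented; this is not LLVM API) of the "expand" idea: an add on an illegal wide integer type is rewritten as native-width adds with carry propagation, which is roughly what the generic legalizer does on the backend's behalf once the backend declares the type illegal.

```python
# Toy model of legalizing a wide integer add by "expanding" it into
# native-word adds with carry. All names here are hypothetical.

WORD_BITS = 32
WORD_MASK = (1 << WORD_BITS) - 1

def split_words(value, num_words):
    """Split an arbitrary-width integer into little-endian native words."""
    return [(value >> (WORD_BITS * i)) & WORD_MASK for i in range(num_words)]

def join_words(words):
    """Reassemble native words into one integer."""
    result = 0
    for i, w in enumerate(words):
        result |= w << (WORD_BITS * i)
    return result

def expand_add(a_words, b_words):
    """Lower a wide add into word-sized adds with carry, as the
    'expand' legalization action conceptually does."""
    out, carry = [], 0
    for a, b in zip(a_words, b_words):
        total = a + b + carry
        out.append(total & WORD_MASK)
        carry = total >> WORD_BITS
    return out

# e.g. an i96 add expands into three i32 adds with carry
a, b = 2**80 + 12345, 2**70 + 67890
res = expand_add(split_words(a, 3), split_words(b, 3))
assert join_words(res) == (a + b) % (1 << 96)
```

The key design point is that the backend never sees the i96 add; by the time instruction selection runs, only legal-width operations remain.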

-Eli

> * Has anyone else out there targeted (or tried to target) a stack
> machine before? Was it successful? What problems did you have?

Haven't done that, and I don't think there are any existing backends
like this. It should be feasible, though; the backend code is pretty
flexible.

At the very least, there isn't anything in the LLVM instruction set that I
think I would have any trouble lowering to the architecture ... but so far
that's just on paper. :wink:

I would love to see a Kaleidoscope-like tutorial that goes step-by-step through
making a backend. At the very least, I'll be documenting my adventure, so
maybe once I know what I'm doing I can turn it into a tutorial.

You should be able to use existing LLVM backend code and TableGen at
least through instruction selection; I'm not sure whether you'll want
register allocation or not, but it should be easy to choose either
way. The whole thing is quite flexible; see
LLVMTargetMachine::addCommonCodeGenPasses in
lib/CodeGen/LLVMTargetMachine.cpp for a high-level overview of how
CodeGen works. It might also be useful to look at how LLVM handles x87
floating-point; the relevant code is in
lib/Target/X86/X86FloatingPoint.cpp.

Thank you for the references, I'll have a look at those.

I've read quite a few papers on stack-machine-specific "register allocation"
algorithms, but at least for the first pass, I want to make this as
straightforward as possible.

One thing I was considering was pretending I had multiple registers, and
then when they actually get emitted I would thunk in code to do stack
manipulations, hoping that my architecture specific peephole optimizer
would be able to clean it up (or not, for the first cut).
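That "pretend registers, then thunk" scheme can be sketched concretely. The following Python toy (instruction format and all names invented for illustration) expands three-address register code into naive stack code, then a one-pattern peephole drops a `store X; load X` pair when `X` is dead afterwards, so the value simply stays on the stack:

```python
# Hypothetical sketch: expand virtual-register code to naive stack code,
# then peephole away redundant store/load pairs.

def thunk(insns):
    """Expand three-address register code (dst, op, src1, src2)
    into naive stack code."""
    out = []
    for dst, op, a, b in insns:
        out += [("load", a), ("load", b), (op,), ("store", dst)]
    return out

def peephole(code):
    """Remove a 'store X' immediately followed by 'load X' when X is
    not referenced afterwards -- the value can stay on the stack."""
    out = []
    i = 0
    while i < len(code):
        if (i + 1 < len(code)
                and code[i][0] == "store" and code[i + 1][0] == "load"
                and code[i][1] == code[i + 1][1]
                and not any(c[1:] and c[1] == code[i][1]
                            for c in code[i + 2:])):
            i += 2  # drop the pair; value remains on the stack top
        else:
            out.append(code[i])
            i += 1
    return out

# t = a + b; r = t * c  -->  the store/load of t disappears
insns = [("t", "add", "a", "b"), ("r", "mul", "t", "c")]
stack_code = peephole(thunk(insns))
```

Even this single pattern recovers the common case where a value is produced and immediately consumed, which is exactly where naive thunking is most wasteful.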

Arbitrary-width integers, vectors, llvm.*, etc. are lowered
generically by the Legalize infrastructure; the backend just has to
say what it can and can't support. See
lib/Target/X86/X86ISelLowering.cpp for an example. I don't know the
details of PHI nodes, but that's also taken care of by instruction
selection.

Okay, I obviously need to learn more about the infrastructure here, but this
at least sounds promising. I was worried that without a register
architecture I'd have to reinvent the wheel in more places than it
sounds like I will have to.

I'm sure I will be back with more questions once I seriously try starting a
target.

Have you seen:
http://llvm.org/docs/WritingAnLLVMBackend.html

If you're targeting a stack machine, I'd strongly recommend not using the LLVM register allocators; just run your own custom stackifier pass instead.

-Chris

Hi Wesley,

I've done quite a lot of work on register allocation for stack machines.
You might want to look at my papers:
http://www.dcs.gla.ac.uk/~marks/euroforth.pdf
http://www.dcs.gla.ac.uk/~marks/thesis.pdf

Good luck,
Mark.

Have you seen:
http://llvm.org/docs/WritingAnLLVMBackend.html

I have, and it's certainly helpful. The Kaleidoscope tutorial is so
amazingly easy to work through that it makes me jealous for a similar
tutorial on the backend. But I'm definitely not complaining. =)

If you're targeting a stack machine, I'd strongly recommend not using
the LLVM register allocators; just run your own custom stackifier
pass instead.

After reading
<http://www.llvm.org/docs/CodeGenerator.html#regAlloc_ssaDecon>, is it
correct to say that if I don't use an existing LLVM register allocation
pass, I would need to do my stackification directly on the SSA form?

Could/should I still reuse parts like the PHIElimination pass? It sounded
like that was tightly coupled with the register allocators.

I was initially thinking that for a first (unoptimized) cut I might
be able to just use the "simple" built-in allocator that spills every
value, then replace that afterwards with a stack-based allocator. Does that
sound practical, or would I be painting myself into a corner in the first
step?

Anyway, I'm just trying to make sure I'm taking the right approach before I
run head first into any dead ends. Ultimately, I'd rather do things "right"
than "fast", since in my experience, "right" takes less time in the long
run. :wink:

Hi Mark,

I've read your papers, and in fact they were part of the data that convinced
me that I really could go with a stack machine for the work I'm doing. I'm
planning on implementing register allocation based partially on your paper.

Actually, I was wondering if any of your lcc implementation work was
publicly available.

Have you seen:
http://llvm.org/docs/WritingAnLLVMBackend.html

I have, and it's certainly helpful. The Kaleidoscope tutorial is so
amazingly easy to work through that it makes me jealous for a similar
tutorial on the backend. But I'm definitely not complaining. =)

If you're targeting a stack machine, I'd strongly recommend not using
the LLVM register allocators; just run your own custom stackifier
pass instead.

After reading
<http://www.llvm.org/docs/CodeGenerator.html#regAlloc_ssaDecon>, is it
correct to say that if I don't use an existing LLVM register allocation
pass, I would need to do my stackification directly on the SSA form?

Could/should I still reuse parts like the PHIElimination pass? It sounded
like that was tightly coupled with the register allocators.

Sure, reusing just phi elim and 2-addr elim should be possible.

I was initially thinking that for a first (unoptimized) cut I might
be able to just use the "simple" built-in allocator that spills every
value, then replace that afterwards with a stack-based allocator. Does that
sound practical, or would I be painting myself into a corner in the first
step?

Sure, that could work. The x86 floating-point stack is a stack with a
maximum depth of 8 entries. The x86 backend implements it by register
allocating for a 7-entry "register file" and then doing a custom
transformation to change the code to use pushes, pops, exchanges, etc.
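That x87 trick can be modeled in a few lines. This Python toy (names invented; it mirrors the idea, not the actual X86FloatingPoint.cpp code) keeps register-allocated values on a stack and performs an FXCH-like exchange to bring the needed "register" to the top before operating on it:

```python
# Toy model of stackifying register-allocated code, x87-style.
# Representation and names are hypothetical.

class VirtualFPStack:
    def __init__(self):
        self.slots = []  # top of stack is slots[-1]

    def push(self, reg, value):
        self.slots.append([reg, value])

    def _bring_to_top(self, reg):
        """Exchange (like x87 FXCH) so 'reg' sits on top of the stack."""
        idx = next(i for i, s in enumerate(self.slots) if s[0] == reg)
        self.slots[idx], self.slots[-1] = self.slots[-1], self.slots[idx]

    def add_into(self, dst, src):
        """dst = dst + src: bring dst to the top, then add src into it."""
        self._bring_to_top(dst)
        src_val = next(v for r, v in self.slots if r == src)
        self.slots[-1][1] += src_val

    def value_of(self, reg):
        return next(v for r, v in self.slots if r == reg)

fp = VirtualFPStack()
fp.push("f0", 1.5)
fp.push("f1", 2.0)
fp.push("f2", 4.0)
fp.add_into("f0", "f2")   # exchange brings f0 to the top, then f0 += f2
assert fp.value_of("f0") == 5.5
```

The point of the design is that the register allocator never has to know about the stack: a later rewrite inserts the exchanges wherever an operand isn't already where the hardware needs it.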

-Chris

Wesley,

Regarding access to the source code;
I would send you the code, but I might be stepping on a few toes.
The person to speak to is Chris Bailey at the University of York (in the UK).

However it is written for the lcc tree-based IR, rather than the SSA-based IR of LLVM, so I don't think it will be that much use.

A lot of the analysis it does is to find information that is explicit in the SSA form anyway, such as generating a flow graph.

The SSA form already provides a head start for choosing stack allocation candidates. For example, PHI-nodes are obvious candidates for edges in the flow-graph.
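A toy illustration of that observation (representation entirely invented): if a phi's incoming values are live only along their edges, each predecessor can simply leave its value on top of the stack, and the phi itself becomes free at the join.

```python
# Hypothetical sketch: "allocating" a phi to the stack by having each
# predecessor block push its incoming value before branching to the join.

def stackify_phi(preds, phi_sources):
    """Append a push of each predecessor's incoming phi value to that
    block; at the join the merged value is already on the stack top."""
    for pred, value in phi_sources.items():
        preds[pred].append(("push", value))
    return preds

preds = {"then": [("const", 1)], "else": [("const", 2)]}
blocks = stackify_phi(preds, {"then": "x", "else": "y"})
```

This only works when both predecessors can arrange the same stack height at the edge; that constraint is where the interesting allocation decisions live.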

I would be interested to see how good SSA-form is for stack allocation,
as that was my intended direction if I had stayed longer at York.

Cheers,
Mark.

Wesley J. Landaker wrote:

Regarding access to the source code;
I would send you the code, but I might be stepping on a few toes.
The person to speak to is Chris Bailey at the University of York (in the
UK).

However it is written for the lcc tree-based IR, rather than the
SSA-based IR of LLVM, so I don't think it will be that much use.

A lot of the analysis it does is to find information that is explicit in
the SSA form anyway, such as generating a flow graph.

That's okay, I wasn't sure it would be much use either; mostly I was just
curious and thought maybe I could learn something from it if it was readily
available.

The SSA form already provides a head start for choosing stack allocation
candidates. For example, PHI-nodes are obvious candidates for edges in
the flow-graph.

I'm a fan of SSA already (used it a bit in a hardware synthesis context),
but this is my first attempt to use it in an LLVM context, or to target a
stack machine. I guess I'll see how it goes.

I would be interested to see how good SSA-form is for stack allocation,
as that was my intended direction if I had stayed longer at York.

I hope to eventually get to this on my project. For my first cut I'm going
to do a rather dumb allocation just to get things working. In my niche,
most programs for this machine will be under 256 (!) instructions, and
speed doesn't matter much. So on one hand, if it fits, optimizations don't
matter much. But on the other hand, it has to fit! =)