Register spilling fix for experimental 6502 backend

So I've been designing an experimental 6502 backend for LLVM. Link:

The 6502 only has a few registers, one accumulator and two indexes, none of
which are large enough to hold an absolute pointer. Because of this, the
backend really tests the power of LLVM's register allocator (RA).

I've made a change to the RA that might be of interest to devs of other
backends, read on.

Other devs suggested reserving the zero page of memory (addresses $00-$FF) and
treating it as a giant set of registers. And I don't know, because doesn't that
sort of defeat the purpose of register allocation? It might be inevitable that
we have to lie to LLVM about the 6502's registers, but, for now, I want to see
how far I can get using the allocator as intended.

The registers are A, X, and Y. They belong to a hierarchy of Register Classes:
  Acc class: A
  Index class: X, Y
  General class: Acc, Index

There's a fictional instruction "SUBidx Acc, Index". It subtracts an Index from
an Acc. Bear with me.

What if we want to subtract A from X?
We can't say "SUBidx X, A". The operands must be Acc, Index.

The DAG looks something like this:

  %vreg2 = COPY %X ; %vreg2 is Index
  %vreg0 = COPY %A ; %vreg0 is Acc
  SUBidx %vreg2, %vreg0

To make this work, we first have to swap A and X.
For technical reasons, we can't use Y as temporary storage. Instead, we must
spill to the stack.

When spilling a reg to the stack, the RA needs two virtual regs, one to store
and one to load. Currently, the loaded reg always has the same class as the
stored reg. Because of this limitation, I ran into a problem: the RA would try
to assign two Accs to physical regs, but only one is available, and the RA
would fail.

Line 1290 of InlineSpiller.cpp, inside the function spillAroundUses, says:
"// FIXME: Infer regclass from instruction alone"
(As opposed to the current behavior of always using the same class as the reg
being spilled.)
The line refers to this very limitation. It was written in 2010, so we've known
about the limitation since the beginning.

My RA patch can be found in this commit:

This patch allows spills to use a more general reg class, so a spilled Acc can
be reloaded as a General.

Previously, the RA would fail because it would spill X, but it couldn't reload
X into an Acc.

Now, with the patch, the DAG compiles into this:

  # store X to stack
  STX_stack [0]
  # transfer A to X
  # load A from stack
  LDA_stack [0]
  # now that A and X are swapped, we can subtract!
  SUBidx A, X

It works, but I'm not sure if I used the best approach. If you're familiar with
the RA, feel free to look at the patch and leave feedback.

- Nolan

I don’t think that it contradicts the purpose of register allocation. Register allocation is really shorthand for operand location selection (though registers do have the nice property that, on the compiler’s side of the instruction decoder, they never alias). On a RISC-like register-register architecture, these are precisely the same. On modern x86 processors (at least when running in 32-bit mode), if you don’t move the stack pointer then the top few stack slots are stored in rename registers so register-memory instructions that operate on them can use the same rename logic as register-register instructions, so it would be entirely valid to treat a few esp-relative addresses that the compiler guarantees are not address taken as registers.

For the 6502, you might want to look at the papers describing compilers for the Xerox Alto as a good reference. As was common in that era, they defined a virtual instruction set implemented in microcode as the compiler target. The Alto’s instruction set was specifically designed to be used in this way, with compilers generating bytecode and shipping a bytecode interpreter. If you think that this is slow, remember that Smalltalk on the Alto ran an entire multitasking GUI on a processor not much faster than a 6502, though with a bit more RAM. Without long pipelines, a jump-threaded bytecode interpreter is very fast. See also P-code from the UCSD Pascal compiler.

SWEET16 is an example of the same concept for the 6502, providing effectively a 16-bit virtual ISA for compilers and interpreters to target. When targeting a processor like the 6502 from a vaguely high-level language (which, given the amount of structure that it implies, includes LLVM IR in this context), you are going to find so many repeated instruction sequences that it will be very difficult to fit within the memory constraints if you expand them at every use.

Even Clang and GCC targeting modern architectures end up doing this to some degree, for example providing software floating point routines in compiler-rt / libgcc. If you tried to inline them at every use, then you’d end up with code that would suffer from instruction cache footprint on larger processors and completely fail to fit on M-profile processors.

If I were writing a 6502 back end, then I would define an SWEET16-derived virtual ISA, include the 6502 assembly for that code in compiler-rt, and target that ISA from the LLVM back end. I’d end up with much denser code at the end and not have to fight the LLVM infrastructure so much,


Have you tried implementing getLargestLegalSuperClass(RC) in M6502RegisterInfo?

The default implementation returns RC but if you make it return the General class then live range splitting will use General registers. That might avoid the need for a patch to the register allocator.


Hi Nolan,

One thing you could try is to define a register class that is the union of X and Acc.
That register class would be used for spill and reload instructions.
In other words, both store and load would use the “same" register class.

By doing so, you shouldn’t need to patch any of the RA pipeline.


Thank you for pointing me to getLargestLegalSuperClass, implementing
this is adequate to fix the reg allocation problem without patching
LLVM's register allocator.

It is understandable that I should avoid touching the Reg Allocator,
as doing so might potentially break other backends.

Nolan Check