Instructions on a target with no general purpose registers

I've mentioned my sneaky plans to target the MOS6502 here before.

The big issue I think is that a lot of instructions don't really have a choice for output register. It all just goes into the accumulator, X index, or Y index based on the specific instruction.

So, my question is, when I'm defining my ins, outs and registers for these instructions, is it going to be a problem that different instructions outputs are defined by the function that is called? And I guess, how do I tell LLVM it can't pick any old register for a given instruction?

Thanks,

Edwin

Hi Edwin

I’ve refrained from raining on your other thread, but LLVM’s instruction selection model, and in fact a large portion of its code generation model, is not well-fitted to modeling the 6502. I don’t think you’re going to get significant benefit from trying to bend it to work for the 6502, and you’ll spend a lot of time working around baked in assumptions that aren’t true for your target.

—Owen

Hi Edwin,

I'm not sure how actively he follows the lists, but you should talk to Steve Montgomery. He has some experience in porting LLVM to 8-bit microcontrollers (and gave a talk about it at the Cambridge LLVM Day in 2012).

David

Although I’m new to LLVM, I don’t think that can really be true. Sure, LLVM has the assumption of a machine with “lots” of registers, but it works with the x86, which doesn’t have many registers, and has instructions that work only with fixed input or output registers.

Actually, LLVM is also poorly suited to machines with lots of registers. A lot of the register allocation infrastructure is built around small-ish register files with relatively sparse aliasing relationships. From experience, modeling a machine with 256 GPRs and extensive use of register sequences results in a lot of pain in various parts of the code generator.

To a first approximation, you can treat the 6502 as a machine with 256 8 bit registers, and register to register (and register to memory) pseudo-instructions that are expanded at a late stage to short sequences using A, X, Y as temps. That will certainly work. Not optimally for 8 bit values, but it’s pretty much the best you can do anyway if you’re dealing with 16 or 32 bit values.

This will definitely work, but you’re not really compiling for 6502 at that point. You’re compiling for a bytecode VM that has 256 GPRs, where each bytecode opcode is implemented as a sequence of 6502 instructions. If your goal is just to get something to run on a 6502, this will definitely work, but it’s not actually solving the problem of making LLVM work as a proper 6502 compiler.

Fundamentally, you have to ask what you’re looking to get out of this. If it’s just for fun, or if you just want to get (say) C++11 to run on your 6502 without caring about the low-level performance, that’s definitely doable. But if you’re trying to actually build a production-grade 6502 compiler, you’ll find that you’re unlikely to benefit from most of the facilities that LLVM provides in the SelectionDAG and Machine layers. They’re simply built for machines that work very, very differently. I’m not claiming it’s impossible to bend them to your will, but I am questioning the value in doing so. I think you’d get a better result with less work if you translated out of LLVM IR into a code generation pipeline more specialized for the needs of a 6502.

—Owen

[...]

So, my question is, when I'm defining my ins, outs and registers for these instructions, is it going to be a problem that different instructions outputs are defined by the function that is called? And I guess, how do I tell LLVM it can't pick any old register for a given instruction?

You can define a register class which matches a single register only and
then define an instruction which accepts parameters of only that class.
Unfortunately I actually tried this for my Videocore code generator and
the most likely outcome is that the register allocator will turn up its
toes and die. It really doesn't like this.

However, if you squint hard enough the 6502 becomes a processor with 256
byte-sized registers --- zero page. You may have better luck with a
processor model which uses a subset of these as your registers, and
possibly not telling LLVM about A, X and Y at all. Plus that gets you
useful free stuff like indirect indexed, which you can use to implement
*(addr+offset) for 8-bit offsets, and sneak a 16-bit add for free out of
the processor.

...

In another life I help look ater the ACK compiler suite, which is
ancient and crufty and produces terrible code, but dates from about the
era when the 6502 was still a thing and I've played with its 6502 code
generator (and the 8080 code generator); these processors will basically
not run modern languages acceptably. It's because modern languages are
all stack-frame based, and these processors don't have any support for
stack relative accesses.

On the Z80 you can just about fake it by using ix or iy as a user stack
pointer, but it's painfully slow, painfully bloaty (8 bytes of code to
read or write a 16-bit value from the stack frame!) and basically just
painful. I believe later Z80s eventually gained a stack-relative load
instruction, but I don't know if there was a corresponding store.

The 6502 isn't quite as bad. If your user stack pointer is in zero page,
you can use indirect indexed to give access to a 256-byte stack frame,
but it's not pretty:

  ldy #offset
  lda (sp), y # load high(?) byte
  iny
  ldx (sp), y # load low byte

...seven bytes, I think.

At least it's not as bad as the 8080.

Being able to forbid recursion and therefore map every stack frame
statically into RAM would improve things no end, as now every stack slot
has a compile-time known address; I remember asking here about this ages
back and was told that there wasn't a built-in pass to do this, but it
would be fairly easy to do. *shrug*

Or you could just start using FORTRAN.

So far, this is mostly a for-fun project as opposed to a production-grade compiler. If it somehow magically gets there, hurray. If I lose interest in it, no worries.

For an end goal, if I can write in C and produce native code, I’ll be happy (ideally no interpreter). Even if I need to inline literally everything to make my programs work. Current goal is to produce an assembler so I can hand-code stuff using the instructions in my handy C64 Programmer’s Reference Guide.

I have to admit I’m somewhat worried about function calls, the stack, and performance. Despite that, I’ll just keep plowing ahead until I bang into a wall somewhere. I enjoy learning something new, and since the whole Target backend is way different than anything I’ve worked with, I’m already pretty happy with where I’ve gotten and where this is going.

Now to specifics (inline):

...
However, if you squint hard enough the 6502 becomes a processor with 256
byte-sized registers --- zero page. You may have better luck with a
processor model which uses a subset of these as your registers, and
possibly not telling LLVM about A, X and Y at all. Plus that gets you
useful free stuff like indirect indexed, which you can use to implement
*(addr+offset) for 8-bit offsets, and sneak a 16-bit add for free out of
the processor.

I had planed on implementing a register bank in zero page already (register definitions are already in place!), but hadn’t thought about outright hiding A, X, and Y. That’s a great plan. It shouldn’t be much work to see what I can try to optimize after that fact.

...

In another life I help look ater the ACK compiler suite, which is
ancient and crufty and produces terrible code, but dates from about the
era when the 6502 was still a thing and I've played with its 6502 code
generator (and the 8080 code generator); these processors will basically
not run modern languages acceptably. It's because modern languages are
all stack-frame based, and these processors don't have any support for
stack relative accesses.

Disappointing.

On the Z80 you can just about fake it by using ix or iy as a user stack
pointer, but it's painfully slow, painfully bloaty (8 bytes of code to
read or write a 16-bit value from the stack frame!) and basically just
painful. I believe later Z80s eventually gained a stack-relative load
instruction, but I don't know if there was a corresponding store.

The 6502 isn't quite as bad. If your user stack pointer is in zero page,
you can use indirect indexed to give access to a 256-byte stack frame,
but it's not pretty:

ldy #offset
lda (sp), y # load high(?) byte
iny
ldx (sp), y # load low byte

...seven bytes, I think.

At least it's not as bad as the 8080.

That’s not a bad plan. I’m assuming though that this would only be needed when a 16 bit value is passed in by reference? Sorry I don’t have any experience with compilers, so I may be completely off here.

As for efficiency, the 6502 is pretty much garbage at handling anything given its age, so I can’t really see a way even with pure assembly to get it to do anything efficiently. Doing a 32bit add requires just a tonne of work, no matter what you’re doing. 40 years of progress can iron out a heck of a lot efficiency problems.

I probably need to go and read up on Sweet-16 to see how it handles things like large integers and the like.

Being able to forbid recursion and therefore map every stack frame
statically into RAM would improve things no end, as now every stack slot
has a compile-time known address; I remember asking here about this ages
back and was told that there wasn't a built-in pass to do this, but it
would be fairly easy to do. *shrug*

Getting ahead of me here, but it could be good times.

Or you could just start using FORTRAN.

Cobol’s more my jam. :stuck_out_tongue:

Hi Edwin,

as David says, I've implemented a back-end for the Freescale 8 and 16-bit microcontrollers, HC08, HC12, HCS12, S12, S12X, ... It's far from perfect and I've not finished but, as indicated in my off-list e-mail to you, I'd be happy to share tips and tricks.

I didn't have a particular problem with the small number of registers. I wrote instruction patterns in which the operand register classes often contained a single register. I found the default greedy register allocator handled this pretty well provided you remember to allow register classes to be inflated (implement MOS6502RegisterInfo::getLargestLegalSuperClass). If you don't them the register allocator has to reload into the same class that a vreg was spilled from which means it can't break circular dependencies sometimes. If a dependency can't be broken then the allocator will report running out of registers.

If you go this route, you will also need to implement MOS6502InstrInfo::foldMemoryOperandImpl so you can fold the reloads into other instructions.

LLVM assumes that each binary operation has a corresponding instruction that can operate on register operands. This isn't the case for the microcontrollers I've been working on and isn't the case for the 6502 either. I think the best solution is to implement pseudo instructions and let the register allocator's spill/reload deal with them. Much of the time, an operand will be reloaded, so can be folded. Any pseudos that remain after register allocation can be handled by spilling to memory, e.g. a dedicated fixed offset in the stack frame.

If you decide to use zero-page as registers, and not tell LLVM about A, X and Y, then you've got more registers and they can all be the same register class. I imagine it might be best in this situation to lower the IR mainly to library calls.

Hope that helps. Let me know if you want any further info about how I implemented the CPU12 backend.

Regards,

Steve