[M68k] Code Beads discussion


I wanted to discuss the machine code generation technique that the M68k target currently uses and possible alternatives. To summarise my understanding of the current architecture, each instruction has a set of ‘code beads’ which are essentially a domain-specific bytecode, with instructions like ‘emit constant bits’, ‘emit register number as bits’, etc.

This is possibly a little bit early but it seems like the M68k target might land soon and I figure that there shouldn’t be too much risk talking about this ahead of time. Apologies to those that read my comment on the related bug (#48792) as this will be very repetitious.

I’m pretty new to the backend infrastructure of LLVM, so please correct me as necessary!

I’ve had a look at the instruction encoding over the past few days and I’m finding the number of instruction definitions to be overwhelming. There is currently one instruction defined per operand per addressing mode.

I’ve been playing about with an LLD implementation and I’m currently finding it hard to implement relaxation to 32-bit operands due to the number of instruction permutations.

Additionally, my proof-of-concept disassembler is not very efficient since it needs to consider the entire instruction length (which can be up to 22 bytes IIRC). This could be implemented as a radix-tree-style code generator, which would be efficient, but quite complicated.

So I spent some time looking through the processor manual for the M68k and I’ve observed a couple of things:

  • Almost every instruction can be identified from just the first word.
  • Whilst lots of memory operand modes are illegal for particular instructions, there are very few different encodings for operands. (Effective addressing and a few register / immediate encodings).

So, instead of code beads, could we:

  • Use a MCTargetExpr subclass to represent the M68k effective addressing operand
  • Use just the first word for disassembly (possibly with code generation)
  • Have a few special cases in the emitter & disassembler code for outlier instructions
  • Use TSFlags to encode the first word & mask, along with the encoding modes of each operand

The prior art for custom MCTargetExprs seems to be for external symbols so I’m not sure if there are any caveats there.

The code emitter would be slightly less flexible and possibly require a few special cases (but would be easier to follow IMO).

Relaxation would be much easier to implement as there would be fewer instructions and relaxation per-operand could be implemented once. The only complication is knowing which EA modes are valid for a given instruction, to know when to expand into a (load + op).

Disassembly by checking the first word (even checking the highest 4 bits in a switch maybe) would be fast and straight-forward to follow. There would need to be a few special cases here too though.

I’ve not thought much about ensuring only valid instruction permutations get parsed and emitted.

Ricky Taylor,