[RFC] A new CodeEmitterGen infrastructure for variable-length instructions

(This is a long proposal. If you prefer, here is the web version: https://gist.github.com/mshockwave/66e98d099256deefc062633909bb7b5b)

Background

CodeEmitterGen is a TableGen backend that generates instruction encoder functions for MCCodeEmitter from a concise TableGen syntax. It is, however, almost exclusively designed for targets that use fixed-length instructions. It’s nearly impossible to use this infrastructure to describe instruction encoding scheme for ISAs with variable-length instructions, like X86 and M68k.

To have a better understanding on this problem, let’s look at an example. For a fixed-length instruction ISA, developers write the following TableGen syntax to describe an instruction encoding:

class MyInst<(outs GR64:$dst), (ins GR64, i16imm:$imm)> : Instruction {
bits<32> Inst;

bits<4> dst;
bits<16> imm;
let Inst{31-28} = 0b1011;
...
let Inst{19-16} = dst;
let Inst{15-0} = imm;
}

The Inst field tells us the length of an instruction – 32 bits in this case. Each bit in this field describes the encoded value, which is either a concrete value or symbolic one like dst and imm in the example above. The dst and imm variables correspond to the output operand ($dst) and the second input operand ($imm), respectively. Meaning, the encoder function (generated by CodeEmitterGen) will eventually insert the encoding for these two operands into the right bit ranges (bit 19~16 for dst and 15~0 for imm).

Though this TableGen syntax fits well for fixed-length instructions, it imposes some difficulties to instructions with variable length and memory poerands with complex addressing modes:

  1. The bit width of the Inst field is fixed. Though we can declare the field with maximum instruction size in the ISA, it requires extra code to adjust the final instruction size.
  2. Operand encoding can only be placed at fixed bit positions. However, the size of an operand in a variable-length instruction might vary.
  3. In the situation where a single logical operand is consisting of multiple MachineOperand-s / MCOperand-s, the current syntax cannot reference a sub-operand. Which means we can only reference the entire logical operand at places where we actually should put sub-operands. Making the TG code less readable and bring more burden to the operand encoding functions (because they don’t know which sub-operand to encode).

In short, we need a more flexible CodeEmitterGen infrastructure for variable-length instructions: describe the instruction encoding in a “position independent” fashion and be able to reference sub-operands with ease.

Proposal

We propose a new infrastructure, called VarLenCodeEmitterGen, to solve the aforementioned shortcomings. It is consisting of new TableGen syntax and some modifications to the existing CodeEmitterGen TableGen backend.

Suppose we are dealing with an instruction MyVarInst:

class MyMemOperand<dag sub_ops> : Operand<iPTR> {
let MIOperandInfo = sub_ops;
}

class MyVarInst<MyMemOperand memory_op> : Instruction {
let OutOperandList = (outs GR64:$dst);
let InOperandList = (ins memory_operand:$src);
}

It has the following encoding format:


15 8 0

From an M68k point-of-view, I like this. I’m a little uneasy about the operand node though. Is it possible that there will ever be an operand type which needs multiple encodings?

I also wonder whether $dec mode and slicing inputs might be better handled by separate nodes. Would something like (seq (flip 0b0101010) (slice my_operand.Base, 4, 8)), be possible?

As you say, I suspect that implementing an interpreter-style disassembler generator like the fixed length one would be fairly straight-forward (and much better than the M68k disassembler implementation I provided).

Ricky,

Thanks for the feedbacks!

From an M68k point-of-view, I like this. I’m a little uneasy about the operand node though. Is it possible that there will ever be an operand type which needs multiple encodings?

Do you mean an operand or sub-operand might have different encodings depend on its context / bit positions? I don’t know about other (future) architectures, M68k doesn’t have this issue: Each memory operand is consisting of (multiple) primitive construction(s) like registers and immediate, that is, sub-operands. While a sub-operand might be placed in arbitrary positions, each sub-operand has a fixed encoding.

I can think of a solution for operands / sub-operands whose encodings depend on its context: passing custom encoder function name into operand. So if you write:

(operand “$foo”, 4, “encodeFoo”)

The foo operand at this particular bit position will be encoded by encodeFoo.
(The current CodeEmitterGen has a similar stuff — via the EncoderMethod field in each operand. But that one is bit-position-invariant. So an operand can only use a single custom encoder regardless of its context / bit-position)
BTW, this again shows the advantage of using DAG for carrying encoding info: It’s easy to augment with new parameters / features.

I also wonder whether $dec mode and slicing inputs might be better handled by separate nodes. Would something like (seq (flip 0b0101010) (slice my_operand.Base, 4, 8)), be possible?

IIUC, you’re proposing to replace(seq (seq:$dec 0b0101010), (operand “…”, 4, 8)) with the new syntax.

I think it’s a good idea because then we’re not overloading a single DAG operator. It’s trivial to change, too.

Nevertheless, I haven’t implemented seq:$dec for fixed bits value so actually you can’t write (seq:$dec 0b0101010) (This is not hard to implement though).

As you say, I suspect that implementing an interpreter-style disassembler generator like the fixed length one would be fairly straight-forward (and much better than the M68k disassembler implementation I provided).

Good to hear!

Cheers,
-Min

Thanks for the feedbacks!

From an M68k point-of-view, I like this. I’m a little uneasy about the operand node though. Is it possible that there will ever be an operand type which needs multiple encodings?

Do you mean an operand or sub-operand might have different encodings depend on its context / bit positions? I don’t know about other (future) architectures, M68k doesn’t have this issue: Each memory operand is consisting of (multiple) primitive construction(s) like registers and immediate, that is, sub-operands. While a sub-operand might be placed in arbitrary positions, each sub-operand has a fixed encoding.

Yes, I was thinking about the solution in a general sense. For M68k we’re good either way. :slight_smile:

I can think of a solution for operands / sub-operands whose encodings depend on its context: passing custom encoder function name into operand. So if you write:

(operand “$foo”, 4, “encodeFoo”)

The foo operand at this particular bit position will be encoded by encodeFoo.
(The current CodeEmitterGen has a similar stuff — via the EncoderMethod field in each operand. But that one is bit-position-invariant. So an operand can only use a single custom encoder regardless of its context / bit-position)
BTW, this again shows the advantage of using DAG for carrying encoding info: It’s easy to augment with new parameters / features.

This sounds good to me. (And I very much agree, using a DAG for this stuff is very convenient.)

I also wonder whether $dec mode and slicing inputs might be better handled by separate nodes. Would something like (seq (flip 0b0101010) (slice my_operand.Base, 4, 8)), be possible?

IIUC, you’re proposing to replace(seq (seq:$dec 0b0101010), (operand “…”, 4, 8)) with the new syntax.

I think it’s a good idea because then we’re not overloading a single DAG operator. It’s trivial to change, too.

I guess you could allow custom encoding with something like (encode "SpecialMode", "$foo") (which could call encodeSpecialMode, or similar).

Either way, I’m pretty satisfied. :slight_smile:

FYI: As a preview for this new CodeEmitterGen component, I’ve refactored some of the M68k instructions using VarLenCodeEmitterGen: https://reviews.llvm.org/D115234

-Min