What does MCOperand model?

A question for LLVM code generator developers:

After having read through "The LLVM Target-Independent Code Generator"
[1] I'm unclear about what precisely the objects MCInst and MCOperand
represent. They sit in the space between assembly syntax and binary
encodings, but which are they modeling? For example, a Thumb 2 branch
instruction 'b' takes an immediate. That syntax "b #1234" can map to
a couple of different encodings. If the immediate is an even number
between -2048 and 2046, it can be encoded with a 16-bit instruction;
otherwise it needs a 32-bit instruction. If the MC objects are to model
the syntax, then
one would expect both encodings to have identical values in the
MCOperand, a 32-bit signed integer. On the other hand, if MC objects
are to model the encoding, one would expect the MCOperand for the
16-bit encoding to contain a number between -1024 and 1023. Which one
is it?
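To make the two candidate models concrete, here is a minimal standalone sketch. It does not use any LLVM classes; the helper names (fitsNarrowBranch, syntaxOperand, encodingOperand) are hypothetical, and it uses only the ranges quoted above, ignoring any PC-relative adjustment a real assembler would apply.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper: true if a byte offset fits the 16-bit form described
// above (an even number between -2048 and 2046).
constexpr bool fitsNarrowBranch(int32_t byteOffset) {
  return byteOffset % 2 == 0 && byteOffset >= -2048 && byteOffset <= 2046;
}

// "Syntax" model: the operand stores the number as written in the assembly.
constexpr int64_t syntaxOperand(int32_t byteOffset) { return byteOffset; }

// "Encoding" model: the operand stores the value that goes into the
// instruction field, i.e. the offset in halfwords (-1024 .. 1023).
inline int64_t encodingOperand(int32_t byteOffset) {
  assert(fitsNarrowBranch(byteOffset) && "offset not encodable in 16-bit form");
  return byteOffset / 2;  // exact, because the offset is even
}
```

Under these assumptions, "b #1234" would carry 1234 in the syntax model and 617 in the encoding model.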

My intuition says the MCOperand should model the assembly syntax and
contain the 32-bit signed integer, and that the EncoderMethod and
DecoderMethod are responsible for mapping that high-level number to
the low-level binary representation. If, however, the MCOperand
models the encoding, then EncoderMethod and DecoderMethod glue need
not exist, and that bit-twiddling logic would be pushed to whoever
creates the MCOperand.
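As an illustration of the glue I have in mind for the syntax model, here is a rough standalone sketch of an encoder/decoder pair for the 11-bit field. The function names are hypothetical and the signatures are not those of LLVM's real EncoderMethod/DecoderMethod hooks; the point is only to show where the bit-twiddling would live.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical encoder glue: syntax-level byte offset -> raw 11-bit field.
inline uint32_t encodeBranchTarget(int32_t byteOffset) {
  assert(byteOffset % 2 == 0 && byteOffset >= -2048 && byteOffset <= 2046);
  // Drop the always-zero low bit, then keep the low 11 bits (two's complement).
  return static_cast<uint32_t>(byteOffset / 2) & 0x7FFu;
}

// Hypothetical decoder glue: raw 11-bit field -> syntax-level byte offset.
inline int32_t decodeBranchTarget(uint32_t field) {
  assert(field <= 0x7FFu);
  int32_t halfwords = static_cast<int32_t>(field);
  if (halfwords & 0x400)   // bit 10 is the sign bit of the 11-bit field
    halfwords -= 0x800;    // sign-extend
  return halfwords * 2;    // restore the implicit low zero bit
}
```

If the MCOperand instead held the field value, both functions would collapse to a mask and a sign-extension, and the divide/multiply by two would move to whoever builds or prints the operand.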

Looking at the Thumb backend, I believe it has been written assuming
the MC objects model the syntax, not the encoding, which matches my
intuition. There has been some discussion on the llvm-commits list
encouraging us to store the encoded value in the MCOperand. The
justification, as I understand it, is that the MCOperand should not
contain values that cannot be encoded. This effectively means that
the MCOperands would be modeling the binary encoding, not the syntax.
Are folks making this transition in other backends as well?

[1] http://llvm.org/docs/CodeGenerator.html

Thanks,
Greg

Owen is correct in his descriptions. The MCOperand values are intended to model the instruction encoding. Where that doesn't match the assembly syntax, the asm parser (and codegen) and the instruction printer are responsible for encoding/decoding the values.

For targets that predate the MC layer, this isn't always the case, leading to things being a bit confusing when just reading the code. Any new targets should absolutely consider the instruction encoding to be the canonical representation and map assembly syntax onto that, not the other way around.

Regards,
-Jim

> the MCOperand should not contain values that cannot be encoded

In the case of pre-encoding a shifted immediate, do we acknowledge that
we've only moved the invalid values from ones that set the bottom
bit to ones that set the top bits?
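A small standalone sketch of what I mean, using the Thumb branch ranges from my first message. The predicate names are made up for illustration, and a plain int64_t stands in for the immediate stored in the operand.

```cpp
#include <cassert>
#include <cstdint>

// Syntax model: valid values are even byte offsets in [-2048, 2046].
// The unencodable values include everything with the bottom bit set.
constexpr bool validAsByteOffset(int64_t v) {
  return v % 2 == 0 && v >= -2048 && v <= 2046;
}

// Encoding model: valid values are halfword offsets in [-1024, 1023].
// The unencodable values are now those that need bits above the field.
constexpr bool validAsField(int64_t v) {
  return v >= -1024 && v <= 1023;
}

// Count how many integers in [lo, hi] a given predicate accepts.
inline int countValid(bool (*isValid)(int64_t), int64_t lo, int64_t hi) {
  assert(lo <= hi);
  int n = 0;
  for (int64_t v = lo; v <= hi; ++v)
    if (isValid(v))
      ++n;
  return n;
}
```

Over any range that covers both models, each predicate accepts the same number of values, so the container can still hold an unencodable number either way; only which numbers are unencodable changes.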

Is there a backend that is implemented in this style I can use as a reference?

> The MCOperand values are intended to model the instruction
> encoding. Where that doesn't match the assembly syntax,
> the asm parser (and codegen) and the instruction printer are
> responsible for encoding/decoding the values.

As my colleague and I try to implement instructions in the recommended
style, we are finding it to be harder with the constraint of MCOperand
needing to be pre-encoded. I've attached a diagram of my
understanding of the code flow if using the recommended style versus
using the MCOperand to model the syntax. How far off am I?

[See attached]

In the diagram with pre-encoding, a shared function EncodeImm() has to
be referenced from 3 locations. As a newcomer to LLVM going after a
simple encoding bug, I wasn't expecting to have to grok every client
of MCOperand just to fix how it is encoded. To pre-encode, it seems
the .td file needs to use a custom operand that inherits from a
generic one for the sole purpose of routing to the shared encoding
function. Is there a better alternative for getting from the LLVM
target-independent IR to the pre-encoded MCOperand?
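For readers without the attachment, here is a rough standalone sketch of the fan-out I am describing. The diagram is not reproduced here, so the three call sites below (assembly parser, codegen lowering, instruction printer) are my guess at a typical layout. Operand is a stand-in struct rather than the real MCOperand, and every function name is hypothetical.

```cpp
#include <cassert>
#include <cstdint>
#include <string>

// Stand-in for an operand that stores the pre-encoded immediate.
struct Operand {
  int64_t imm;
};

// Shared helpers that every producer and consumer has to agree on.
inline int64_t encodeImm(int64_t byteOffset) {
  assert(byteOffset % 2 == 0 && "branch offsets must be even");
  return byteOffset / 2;
}
inline int64_t decodeImm(int64_t field) { return field * 2; }

// Call site 1 (assumed): the assembly parser turns "#1234" into an operand.
inline Operand parseBranchOperand(const std::string &text) {
  assert(!text.empty() && text[0] == '#');
  return Operand{encodeImm(std::stoll(text.substr(1)))};
}

// Call site 2 (assumed): codegen lowering builds the operand from an offset.
inline Operand lowerBranchOperand(int64_t byteOffset) {
  return Operand{encodeImm(byteOffset)};
}

// Call site 3 (assumed): the printer has to undo the encoding for display.
inline std::string printBranchOperand(const Operand &op) {
  return "#" + std::to_string(decodeImm(op.imm));
}
```

With the syntax model, by contrast, the divide by two would appear once, in the encoder, and the parser, lowering, and printer would pass the number through unchanged.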

Thanks,
Greg

MCOperandAsSyntax.html (10.5 KB)