Hi Daniel,
Thanks for the feedback.
Sorry for the delay in responding; other tasks got in the way.
I was able to experiment a bit with your suggestion. The WIP is here:
https://reviews.llvm.org/D55644 Emitting MCInstruction from Protocol Buffer Message.
I modified the example fuzzer and the “unconstrained fuzzer” to dump both the MCInst and assembly instruction based on
the operand types defined in the Protocol Buffer messages.
Just to recap, the “unconstrained fuzzer” means valid instruction operand types are defined in the Protocol Buffer messages –
they can be register, immediate, immediate + register pair etc.
Valid instruction opcodes are also defined in the Protocol Buffer messages.
The fuzzer then generates an assembly statement as a combination of any opcode and operand types.
This is the version of the fuzzer that proved to be useful in finding bugs.
It can generate both valid and invalid instructions like “add”, “add x0”, “add x1, x2, x3”, “add f1, x2, 10”, etc.
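A rough Python sketch of that "any opcode with any operands" idea follows. It is illustrative only: the dataclasses stand in for the Protocol Buffer messages, the names (Register, Immediate, AsmStatement, random_statement, to_asm) are made up and are not the definitions in D55644, and a plain random draw stands in for whatever engine actually mutates the messages.

```python
import random
from dataclasses import dataclass
from typing import List, Union

# Hypothetical stand-ins for the Protocol Buffer messages.
@dataclass
class Register:
    name: str  # e.g. "x1", "f1"

@dataclass
class Immediate:
    value: int

Operand = Union[Register, Immediate]

@dataclass
class AsmStatement:
    opcode: str
    operands: List[Operand]

OPCODES = ["add", "addi", "fadd.s"]         # illustrative subset
REGISTERS = ["x0", "x1", "x2", "x3", "f1"]  # illustrative subset

def random_statement(rng: random.Random) -> AsmStatement:
    """Pick any opcode and 0..3 operands of any kind, with no validity check."""
    operands: List[Operand] = []
    for _ in range(rng.randint(0, 3)):
        if rng.random() < 0.7:
            operands.append(Register(rng.choice(REGISTERS)))
        else:
            operands.append(Immediate(rng.randint(-2048, 2047)))
    return AsmStatement(rng.choice(OPCODES), operands)

def to_asm(stmt: AsmStatement) -> str:
    """Render the statement as text; operands are comma-separated."""
    parts = [op.name if isinstance(op, Register) else str(op.value)
             for op in stmt.operands]
    return stmt.opcode if not parts else f"{stmt.opcode} {', '.join(parts)}"
```

Because nothing ties the operand list to the opcode, the same code path produces "add", "add x0" and "add f1, x2, 10" as readily as a well-formed "add x1, x2, x3".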
Here are some observations from modifying the assembler fuzzer to emit both assembly and MCInst:
1) Emitting MC instructions requires access to a backend’s MC-layer internal values.
For example, we need the symbolic names and enum values for instruction opcodes and registers.
So we depend on the tables that TableGen generates from the backend's *.td files (e.g., RISCVGenInstrInfo, RISCVGenRegisterInfo) to print the MCInst opcode and operand values correctly.
Example:
add x8
# <MCInst #168 ADD
# <MCOperand Reg:9>>
For the assembler and disassembler fuzzers we created, we only relied on the ISA manual to create the Protocol Buffer messages.
But if we now additionally emit MCInst, we will need to add that dependence.
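To show where that dependence enters, here is a small Python sketch of the dump step. The two lookup tables are hypothetical stand-ins for the generated backend tables; the only entries are the values visible in the example above (ADD is 168, x8 prints as Reg:9), and those numbers may differ between LLVM versions.

```python
from typing import Dict, List, Tuple

# Hypothetical stand-ins for the backend's generated tables. The values are
# copied from the example dump in this mail and are not authoritative.
OPCODE_ENUM: Dict[str, Tuple[str, int]] = {"add": ("ADD", 168)}
REG_ENUM: Dict[str, int] = {"x8": 9}

def dump_mcinst(opcode: str, regs: List[str]) -> str:
    """Format an expected-MCInst comment in the style of the example above."""
    symbol, number = OPCODE_ENUM[opcode]
    lines = [f"# <MCInst #{number} {symbol}"]
    for reg in regs:
        lines.append(f"# <MCOperand Reg:{REG_ENUM[reg]}>")
    lines[-1] += ">"  # close the MCInst bracket on the last line
    return "\n".join(lines)

print("add x8")
print(dump_mcinst("add", ["x8"]))
```

Everything the ISA manual provides is on the left-hand side of those dictionaries; the right-hand side exists only inside the backend.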
2) It seems possible to emit MCInst operands based on the Protocol Buffer messages we defined for valid operand types.
However, at the time we emit an assembly instruction from the protocol buffer messages, we don’t know yet if the instruction is
valid and will be parsed, but we still print the expected MCInst operands.
For example, see the invalid add with only one operand, “add x8”, above. We still print the MCInst even though it is not a
valid instruction.
The post-processing tool that compares the MCInst generated by the fuzzer with what the assembler outputs
will have to handle or discard this situation.
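One way such a post-processing step could behave is sketched below in Python. No such tool exists in the patch; the function name and the convention that a rejected statement is represented by None are assumptions made for illustration.

```python
from typing import Optional

def _normalize(dump: str) -> str:
    """Collapse whitespace so formatting differences do not cause mismatches."""
    return " ".join(dump.split())

def compare(expected_dump: str, assembler_dump: Optional[str]) -> str:
    """Classify one fuzzer case.

    assembler_dump is None when the assembler rejected the statement
    (assumed convention). In that case the expected MCInst is meaningless
    and the case is discarded instead of being reported as a mismatch.
    """
    if assembler_dump is None:
        return "discarded"
    if _normalize(expected_dump) == _normalize(assembler_dump):
        return "match"
    return "mismatch"
```

With this shape, the “add x8” case above ends up as "discarded", and only statements the assembler actually accepted can be reported as mismatches.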
3) We need to add more Protocol Buffer messages to encode the extra info needed by the fuzzer that emits MCInst.
For example, it is common for FP register names to be the same (F1, F2, etc.) in SP and DP instructions (e.g., fadd.s, fadd.d),
but they have different MC symbolic names and values (F1_32, F1_64, F2_32, F2_64, etc.).
The proto-to-asm converter does not know whether it is generating a 32-bit, 64-bit, SP or DP assembly instruction,
and it does not need to care about that.
But the proto-to-MCInst converter needs that kind of info. However, it does not have any context: since it prints operand by operand,
it does not know which opcode was processed before, or which target triple or ISA extension was intended.
Issues like this can probably be addressed by adding to or changing the Protocol Buffer messages
(e.g., allow the register operand type to be one of FP or integer, and allow an FP register to be one of SP or DP.
The proto-to-asm converter only needs the first piece of info; the proto-to-MCInst converter needs both).
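A minimal Python sketch of that message change follows, with a dataclass standing in for the Protocol Buffer message. The type and field names are hypothetical, and the integer MC name "X<n>" is an assumption; only the F1_32 / F1_64 naming comes from the discussion above.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional

class RegClass(Enum):
    INTEGER = "x"
    FP = "f"

class FPWidth(Enum):
    SP = 32  # single precision
    DP = 64  # double precision

@dataclass
class RegisterOperand:
    reg_class: RegClass
    index: int
    fp_width: Optional[FPWidth] = None  # only meaningful for FP registers

def to_asm_name(reg: RegisterOperand) -> str:
    """proto-to-asm: needs only the register class and index."""
    return f"{reg.reg_class.value}{reg.index}"

def to_mc_name(reg: RegisterOperand) -> str:
    """proto-to-MCInst: needs the FP width as well."""
    if reg.reg_class is RegClass.INTEGER:
        return f"X{reg.index}"  # assumed naming for integer registers
    if reg.fp_width is None:
        raise ValueError("FP register needs an SP/DP width to pick an MC name")
    return f"F{reg.index}_{reg.fp_width.value}"
```

Both an SP and a DP operand with index 1 render as "f1" in assembly, while the MC names come out as F1_32 and F1_64, so the width travels with the operand and the converter no longer needs to know which opcode came before.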
There are more scenarios like (3) that I came across when trying to emit MCInsts from the assembler fuzzer. It seems they have solutions, though.
So, in principle, you can add proto-to-asm, proto-to-MCInst (and proto-to-encoding) converters based on the Protocol Buffer messages.
The only downside I see is that the Protocol Buffer message definitions will depend on a backend's MC-layer details (to print MCInsts correctly),
while before we defined the Protocol Buffer messages based on the ISA manual only.
I have not tried to do the same with the disassembler fuzzer, because in that one we just generate random 32-bit numbers, varying some of the non-fixed fields in the number that represent operands and opcodes.
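For comparison, the disassembler-side idea can be sketched in a few lines of Python. This is not the fuzzer's code: it assumes the RISC-V R-type layout from the ISA manual (funct7, rs2, rs1, funct3, rd, opcode) and varies only the three register fields, with the function names made up for illustration.

```python
import random

ADD_OPCODE = 0b0110011  # RISC-V OP major opcode; ADD has funct3 = 0, funct7 = 0

def encode_r_type(rd: int, rs1: int, rs2: int,
                  opcode: int = ADD_OPCODE, funct3: int = 0, funct7: int = 0) -> int:
    """Pack the R-type fields into one 32-bit instruction word."""
    return ((funct7 << 25) | (rs2 << 20) | (rs1 << 15)
            | (funct3 << 12) | (rd << 7) | opcode)

def random_r_type(rng: random.Random) -> int:
    """Keep the opcode/funct fields fixed and randomize the register fields."""
    rd, rs1, rs2 = (rng.randrange(32) for _ in range(3))
    return encode_r_type(rd, rs1, rs2)

print(hex(encode_r_type(1, 2, 3)))  # encoding of "add x1, x2, x3"
```

There is no per-operand message here to hang an MC symbolic name on, which is why emitting MCInsts from this fuzzer would need a different approach.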
Let me know what you think.
Thanks,
Ana.