[RFC] Implementing LLVM MC Protobuf Fuzzer for Assembly and Encoding for RISC-V target

Hello,

We have implemented LLVM Machine Code Protobuf fuzzers for the RISC-V target as part of a Summer internship project with our intern Jocelyn Wei.

The fuzzers for the assembler and disassembler proved to be useful. We uncovered bugs and detected compatibility issues with other tools, e.g., by running a driver program that implements a round trip with a golden (i.e., more tested) tool such as GNU AS.

We built different fuzzer versions to experiment with the level of fuzzing for the instruction operands.
The versions are labeled sample, semi-constrained, unconstrained. We fix opcodes, and depending on the fuzzer version, allow number of operands, operand value ranges, and operand types to vary.

The code is available for review:
⚙ D51710 Implemented Protobuf fuzzer for LLVM RISC-V MC Disassembler Implemented Protobuf fuzzer for LLVM RISC-V MC Disassembler
⚙ D51144 Implemented Protobuf fuzzer for LLVM RISC-V MC Assembler Implemented Protobuf fuzzer for LLVM RISC-V MC Assembler

We would like to assess people's interest in adding this type of tool to the LLVM code base.

It can be further improved for RISC-V target and also expanded to other targets.

We have a Poster about the fuzzers at the LLVM Dev Conf this week.

Please visit our poster and come by with your comments and suggestions. We appreciate your feebdack.

Thank you,
Ana.

Hi Ana,

I think this looks interesting although unfortunately I'm not sure I'm going to be able to make use of it for my current target as I don't have a golden reference tool available. One of the key weaknesses of llvm-mc-disassembler-fuzzer for most targets is that it only finds a corpus of tests that improve coverage but doesn't provide any assessment on what the correct behaviour is. A human is required to make proper test cases out of the corpus and feed it back in so the fuzzer can drop the corresponding generated tests. Having a fuzzer that can verify the behaviour as well would be very useful for targets with access to a golden reference tool.

One thing that occurred to me while skimming through D51144 was that something similar to proto_to_asm_main.cpp could be used to generate MCInst objects directly from the same protobuf. This would allow you to attribute bugs to the parser, instruction printer, or object emitter since you'd be able to tell, for example, that the parser emitted the an MCInst that matched the one expected by the protobuf.

Hi Ana,

I think this looks interesting although unfortunately I'm not sure I'm going to be able to make use of it for my current target as I don't have a golden reference tool available.

Thinking about it a bit more, the lack of a golden reference only really affects my ability to use the driver script. With a different/modified driver I should be able to use the underlying fuzzer without a reference tool available.

Hi Daniel,

Thanks for the feedback.

That is correct, you can invoke the fuzzers without a golden reference implementation. The driver program to compare behaviors is just a convenient tool for those who have a reference implementation.

I am not sure I understood your suggestion about de-serializing Protobuf messages as MCInst objects. Can you clarify?

Thanks,
Ana.

Hi Ana,

Hi Daniel,

Thanks for the feedback.

That is correct, you can invoke the fuzzers without a golden reference implementation. The driver program to compare behaviors is just a convenient tool for those who have a reference implementation.

I am not sure I understood your suggestion about de-serializing Protobuf messages as MCInst objects. Can you clarify?

I’m thinking that a single protobuf message could be used to generate the Assembly, MCInst, and Encoding for a given instruction. From that you can:

  • Parse the generated assembly and check the intermediate MCInst matches the generated MCInst
  • Disassemble the generated encoding and check the intermediate MCInst matches the generated MCInst
  • Assemble the generated MCInst and check the encoding matches the generated encoding
  • Disassemble the generated MCInst and check the assembly matches the generated assembly

in addition to testing the Assembly->Encoding and Encoding->Assembly. With these additional tests, the fuzzer can identify whether it’s the parsing, assembling, disassembling, or instruction printing that is incorrect.

For example, given ‘addiu $1, $2, 3’, ‘ADD reg(1), reg(2), imm(3)’, and 0x0123 the fuzzer would be able to report something like:
0x0123 didn’t match golden reference (0x0123 != 0x0126)
Parsing ‘addiu $1, $2, 3’ produces ‘ADD reg(1), reg(2), imm(3)’: Parser is ok
Assembling ‘ADD reg(1), reg(2), imm(3)’ produces ‘0x0126’: Assembler is wrong
Disassembling ‘0x0123’ produces ‘ADD reg(1), reg(2), imm(3)’ produces ‘0x0124’: Disassembler is ok
Printing ‘ADD reg(1), reg(2), imm(3)’ produces ‘addiu $1, $2, 3’: Instruction printer is ok

I’m also thinking that fuzzing the MCInst->Assembly, and MCInst->Encoding paths in isolation would allow us to be sure that the MCInst’s that are emitted by CodeGen behave correctly when given to the MC layer. At the moment, we assume that Assembly->MCInst->Encoding and Encoding->MCInst->Assembly being correct implies that CodeGen->MCInst->Assembly and CodeGen->MCInst->Encoding are correct too. However, this doesn’t quite hold as there are multiple ways of expressing the same operand in an MCInst. For example:

  • addExpr(MCConstantExpr::create(2, Ctx))
  • addImm(2)
  • addExpr(/* some REL relocation evaluating to Symbol+2 */)

and Assembly->MCInst->Encoding/Encoding->MCInst->Assembly will only exercise one of them.

Hi Daniel,

Thanks for the feedback.

Sorry for the delay in responding, other tasks got in the way.

I was able to experiment a bit with your suggestion. The WIP is here:

https://reviews.llvm.org/D55644 Emitting MCInstruction from Protocol Buffer Message.

I modified the example fuzzer and the “unconstrained fuzzer” to dump both the MCInst and assembly instruction based on
the operand types defined in the Protocol Buffer messages.

Just to recap, the “unconstrained fuzzer” means valid instruction operand types are defined in the Protocol Buffer messages –
they can be register, immediate, immediate + register pair etc.
Valid instruction opcodes are also defined in the Protocol Buffer messages.
The fuzzer then generates an assembly statement as a combo of any opcode and operands types.
This is the version of the fuzzer that proved to be useful in finding bugs.
It can generate both valid and invalid instructions like “add”, “add x0”, add “x1, x2, x3”, add “f1, x2, 10”, etc.

Here are some observations from modifying the assembler fuzzer to emit both assembly and MCInst:

1) Emitting MC instructions requires access to a backend’s MC-layer internal values.
For example, we need the symbolic names and enum values for instruction opcodes and registers.
So we depend on *.td files (e.g., RISCVGenInstrInfo, RISCVGenRegisterInfo) to print the MCInst opcode and operand values correctly.
Example:
   add x8
# <MCInst #168 ADD
# <MCOperand Reg:9>>

For the assembly and disassembler fuzzers we created, we only relied on the ISA manual to create the Protocol Buffer messages.
But if we now additionally emit MCInst, we will need to add that dependence.

2) It seems it is possible to emit MCInst operands based on the protocol buffer messages we defined for valid operand types.
However, at the time we emit an assembly instruction from the protocol buffer messages, we don’t know yet if the instruction is
valid and will be parsed, but we still print the expected MCInst operands.
For example: see the invalid add with only one operand “add x8” above. We still print the MCInst even though it is not a
valid instruction.
The post-processing tool that will compare MCInst generated by the fuzzer with what the assembler outputs,
will have to handle/discard this situation.

3) Need to add more protocol messages to encode more info needed by the fuzzer that emits MCInst.
For example, it is common that FP register name are the same (F1, F2, etc) in SP and DP instructions (e.g, fadd.s, fadd.d),
but they have different MC symbolic names and values (F1_32, F1_64, F2_32, F2_64, etc).
The proto-to-asm converter does not know if it is generating a 32-bit, 64-bit, SP or DP assembly instruction,
it does not need to care about that.
But the proto-to-MCInstr converter needs that kind info. However it does not have any context since it prints operand by operand,
it does not know which opcode was processed before, or which target triple or ISA extension was intended.
Issues like this probably can be addressed by adding/changing the protocol messages
(e.g., Allow register operand type to be one of FP or Integer type, allow FP register be one of SP or DP type.
The proto-to-asm converter only needs the first info, the proto-to-MCInst converter needs both info).

There a more scenarios like (3) that I came across when trying to emit MCInsts from the assembler fuzzer. It seems they have solutions tough.

So, in principle, you can add proto-to-asm, proto-to-MCInst (and proto-to-encoding) converters based on the Protocol Messages.

The only downside I see is that the Protocol Messages definitions will depend on a backend's MC layer details (to print MCInsts correctly),
while before we defined the Protocol Buffer Messages based on the ISA manual only.

I have not tried to do the same with the disassembler fuzzer, because in that one we just generate random 32 bit numbers, varying some non-fixed fields in the number that represent operands and opcodes.

Let me know what you think.

Thanks,
Ana.