X86 Disassembler

Dear mailing list:

the attached diff implements a table-driven disassembler for the X86 architecture (16-, 32-, and 64-bit incarnations), integrated into the MC framework. The disassembler is table-driven, using a custom TableGen backend to generate hierarchical tables optimized for fast decode. The disassembler consumes MemoryObjects and produces arrays of MCInsts, adhering to the abstract base class MCDisassembler (llvm/MC/MCDisassembler.h).

The disassembler is documented in detail in

  • lib/Target/X86/X86Disassembler.h (disassembler runtime)
  • utils/TableGen/X86DisassemblerEmitter.h (table emitter)

    as well as in the individual files, functions, and classes.

I implemented a use case in tools/llvm-mc/HexDisassembler.cpp, which implements an interactive disassembler for sequences of bytes entered in hex on standard input. Example output:

localhost:llvm spyffe$ ./Debug/bin/llvm-mc -disassemble
74 22 ← user input

1 instructions:
74 22 je 34


The disassembler is currently fully implemented, and testing is ongoing. I still need to handle LEA correctly, and of course the instruction tables (lib/Target/X86/X86InstrInfo.td and friends) always need expanding. However, because this is a non-invasive patch implementing largely independent functionality but its size makes maintaining it outside LLVM is a heavy burden, I am submitting it for consideration now.

The patch is tested against SVN revision 79306. Please let me know if there are any problems, and tell me what I can fix to make this worthy of inclusion. Thanks for your time.

Sincerely,
Sean Callanan

llvm.diff (217 KB)

Hi Sean,

the attached diff implements a table-driven disassembler for the X86 architecture (16-, 32-, and 64-bit incarnations), integrated into the MC framework. The disassembler is table-driven, using a custom TableGen backend to generate hierarchical tables optimized for fast decode. The disassembler consumes MemoryObjects and produces arrays of MCInsts, adhering to the abstract base class MCDisassembler (llvm/MC/MCDisassembler.h).

The disassembler is documented in detail in

- lib/Target/X86/X86Disassembler.h (disassembler runtime)
- utils/TableGen/X86DisassemblerEmitter.h (table emitter)

as well as in the individual files, functions, and classes.

I implemented a use case in tools/llvm-mc/HexDisassembler.cpp, which implements an interactive disassembler for sequences of bytes entered in hex on standard input. Example output:

localhost:llvm spyffe$ ./Debug/bin/llvm-mc -disassemble
74 22 <-- user input

1 instructions:
74 22 je 34

The disassembler is currently fully implemented, and testing is ongoing. I still need to handle LEA correctly, and of course the instruction tables (lib/Target/X86/X86InstrInfo.td and friends) always need expanding. However, because this is a non-invasive patch implementing largely independent functionality but its size makes maintaining it outside LLVM is a heavy burden, I am submitting it for consideration now.

The patch is tested against SVN revision 79306. Please let me know if there are any problems, and tell me what I can fix to make this worthy of inclusion. Thanks for your time.

Great work! It looks really cool.

Here are some initial comments. Most of them are generic/stylistic comments, but some are more important. Also, I'm not trying to be picky. :slight_smile:

0. Watch out for tabs!
1. Includes like this "#include <llvm/MC/MCInst.h>" should be in quotes, not angle brackets.
2. In the "class Indenter", the "push" and "pop" methods check for over/under runs before proceeding. Should these be "asserts" instead?
3. In include/llvm/Support/MemoryObject.h, you changed all "uintptr_t" to "uint64_t". Is there a good reason for this? If not, please change them back.
4. If you have a class that's completely contained within a .cpp file, go ahead and define it in an anonymous namespace and mark it "VISIBILITY_HIDDEN". Like this:

namespace {
class VISIBILITY_HIDDEN StringMemoryObject : public MemoryObject {
...

5. In "readBytes" in StringMemoryObject, you're returning a "-1", but the method's return type is "uint64_t". Do you *really* want to return 0xfffffffffffU, or is "-1" a sentinel value of some sort? If the latter, perhaps changing the return type to "int64_t" instead. If the former, then changing the return to "-1U" to make it explicit.

6. The use of "#include <iostream>" is strictly forbidden. Please remove it wherever you find it (HexDisassembler.h for one).

7. Look for opportunities to use the LLVM small container classes. For instance, the "fOperands" ivar in RecognizableInsn.

8. While we're on that, please use "Instruction" or "Instr" instead of "Insn" for consistency's sake.

9. You have types called "operandType_t", etc. We normally don't use the "_t" suffix for types. Perhaps renaming them.

10. In the "static inline bool outranks(instructionContext_t upper, instructionContext_t lower)", it would be good to assert that upper and lower are strictly less than IC_max.

11. In several places, you put 'default: assert(0 && "Badness happens here!");" at the end of a switch statement. When assertions are turned off, I believe that this could emit warnings. It would be better to put it as the first case of the "switch". And maybe use the new "llvm_unreachable()" macro as well.

12. You use "std::endl" a lot. This is discouraged by the coding standard:

  http://llvm.org/docs/CodingStandards.html#ll_avoidendl

13. You use "std::ostream" a lot. Would it be appropriate to use "raw_ostream" instead?

14. In DisassemblerTables::setTableFields, you emit things to "std::cerr". Use "errs()" instead. Or, better yet, "llvm_report_error". :slight_smile:

15. Prefer pre-increment/decrement for variables:

  http://llvm.org/docs/CodingStandards.html#micro_preincrement

16. In X86RecognizableInsn.cpp, you use the large macro: HANDLE_OPERAND. Could you make it an inlined function instead? It could take a pointer to a member function for the method it wants to call. Debugging large macros is impossible. :slight_smile:

17. I'm confused by the C files in lib/Target/X86. Why aren't they C++ files?

Okay. That's all I found for now. :slight_smile: I haven't looked at the algorithm too much, though it seems fairly straight-forward. I'm sure that since you're continually testing it, that it's of high quality. When you do get around to submission, we will want all of the tests you've been running to be added to the test/ directory.

Thanks!
-bw

Bill,

thanks for your comments. I’ll respond to them individually. I’ve attached a new revision of the patch that addresses them. Patch built and tested against SVN 79487, with the additional attached fix that fixes an Intel table bug.

Sean

  1. Watch out for tabs!

Fixed. Thanks.

  1. Includes like this “#include <llvm/MC/MCInst.h>” should be in quotes, not angle brackets.

Fixed.

  1. In the “class Indenter”, the “push” and “pop” methods check for over/under runs before proceeding. Should these be “asserts” instead?

Yeah, I made them asserts as you suggested. I was thinking about maybe making them saturate, but honestly if we run out of space we should simply break and allow the developer to make the Indenter bigger. Maybe we should have another argument to the constructor specifying how large the buffer should be…?

  1. In include/llvm/Support/MemoryObject.h, you changed all “uintptr_t” to “uint64_t”. Is there a good reason for this? If not, please change them back.’

Yes, there is. If you’re disassembling in a memory space that’s larger than yours (for example, on a computer where LLVM is built 32-bit but the source of the bytes is a 64-bit address space), then I don’t want to be limited by the local process’s notion of uintptr_t.

  1. If you have a class that’s completely contained within a .cpp file, go ahead and define it in an anonymous namespace and mark it “VISIBILITY_HIDDEN”. Like this:

namespace {
class VISIBILITY_HIDDEN StringMemoryObject : public MemoryObject {

I used VISIBILITY_HIDDEN for both StringMemoryObject and X86DisassemblerEmitter::X86DEBackend.
I wrapped StringMemoryObject in an anonymous namespace.
I didn’t wrap X86DEBackend because it’s already nested.

  1. In “readBytes” in StringMemoryObject, you’re returning a “-1”, but the method’s return type is “uint64_t”. Do you really want to return 0xfffffffffffU, or is “-1” a sentinel value of some sort? If the latter, perhaps changing the return type to “int64_t” instead. If the former, then changing the return to “-1U” to make it explicit.

-1 is a sentinel value. I changed readBytes to return an int and accept a uint64_t* that it fills in if it’s non-NULL.

  1. The use of “#include ” is strictly forbidden. Please remove it wherever you find it (HexDisassembler.h for one).

Yeah, I guess those static initializers aren’t so great. Removed.

  1. Look for opportunities to use the LLVM small container classes. For instance, the “fOperands” ivar in RecognizableInsn.

I looked at my uses of std::vector but they wouldn’t really benefit from the good performance of SmallVector for small values of N.

  • fInstructions in X86Disassembler.h is not a good candidate for this, because the number of instructions could potentially be quite large.
  • numberedInstructions in X86DisassemblerEmitter.cpp is not a good candidate because there are upwards of a thousand instructions in the x86 table.
  • fInstructionSpecifiers in X86DisassemblerTables.h is not a good candidate either… the number is (again) very large.
  • fOperands is not a good candidate because is never actually created. It is always set to point to CodeGenInstruction::OperandList.
  1. While we’re on that, please use “Instruction” or “Instr” instead of “Insn” for consistency’s sake.

Fixed. Thanks for the heads-up.

  1. You have types called “operandType_t”, etc. We normally don’t use the “_t” suffix for types. Perhaps renaming them.

Fixed. Now for example operandType_t is OperandType.

  1. In the “static inline bool outranks(instructionContext_t upper, instructionContext_t lower)”, it would be good to assert that upper and lower are strictly less than IC_max.

Asserted.

  1. In several places, you put 'default: assert(0 && “Badness happens here!”);" at the end of a switch statement. When assertions are turned off, I believe that this could emit warnings. It would be better to put it as the first case of the “switch”. And maybe use the new “llvm_unreachable()” macro as well.

All right, I moved them and used llvm_unreachable(). In the C code, where I can’t safely #include “llvm/Support/ErrorHandling.h,” I declared a macro that does the same thing.

  1. You use “std::endl” a lot. This is discouraged by the coding standard:

http://llvm.org/docs/CodingStandards.html#ll_avoidendl

Purged.

  1. You use “std::ostream” a lot. Would it be appropriate to use “raw_ostream” instead?

To make sure that the table definition file looks pretty, I use , and none of those functions work with raw_ostreams. I actually only create three std::ostringstreams – during table generation – so for now I’m going to leave that the way it is.

  1. In DisassemblerTables::setTableFields, you emit things to “std::cerr”. Use “errs()” instead. Or, better yet, “llvm_report_error”. :slight_smile:

Fixed to use errs().

  1. Prefer pre-increment/decrement for variables:

http://llvm.org/docs/CodingStandards.html#micro_preincrement

Learned a new one there. Thanks!

  1. In X86RecognizableInsn.cpp, you use the large macro: HANDLE_OPERAND. Could you make it an inlined function instead? It could take a pointer to a member function for the method it wants to call. Debugging large macros is impossible. :slight_smile:

I factored the code out into a function. There are still macros to keep things pretty, but they just wrap the function.

  1. I’m confused by the C files in lib/Target/X86. Why aren’t they C++ files?

The reason is that I am aware of clients that need disassembly but can’t link in C++ code, so for those clients I’ve separated out a C core. They won’t get the benefits of the MCInst, but that’s their problem :slight_smile:

Okay. That’s all I found for now. :slight_smile: I haven’t looked at the algorithm too much, though it seems fairly straight-forward. I’m sure that since you’re continually testing it, that it’s of high quality. When you do get around to submission, we will want all of the tests you’ve been running to be added to the test/ directory.

Thanks!
-bw

Sean

llvm.diff (227 KB)

pcmpestrm.diff (779 Bytes)

thanks for your comments. I'll respond to them individually. I've attached a new revision of the patch that addresses them. Patch built and tested against SVN 79487, with the additional attached fix that fixes an Intel table bug.

Thanks Sean, comments below. Are you sure you attached the updated patch? I still see a lot of std::endl's etc.

3. In include/llvm/Support/MemoryObject.h, you changed all "uintptr_t" to "uint64_t". Is there a good reason for this? If not, please change them back.'

Yes, there is. If you're disassembling in a memory space that's larger than yours (for example, on a computer where LLVM is built 32-bit but the source of the bytes is a 64-bit address space), then I don't want to be limited by the local process's notion of uintptr_t.

Makes perfect sense, please add a comment in the MemoryObject doxygen comment that explains this. You can split this change out of your megapatch and commit it separately.

13. You use "std::ostream" a lot. Would it be appropriate to use "raw_ostream" instead?

To make sure that the table definition file looks pretty, I use <iomanip>, and none of those functions work with raw_ostreams. I actually only create three std::ostringstreams – during table generation – so for now I'm going to leave that the way it is.

I actually find iomanip to be pretty ugly for simple things. llvm/Support/Format.h lets you do things like:

OS << "mynumber: " << format("%4.5f", 1234.412) << '\n';

with raw_ostreams.

Using "printf style" format strings is a lot more pretty than:

+#define BYTE_FLAGS std::hex << std::setw(2) << std::setfill('0') << std::right

:slight_smile:

Not a critical issue, but please put a space after control flow keywords like if/while.
      while(current - address < size && current < limit) {
        if(readByte(current, &buf[(current - address)]))

Please stick to plain ascii letters in source files:
+ * The raison d'être for this header file is to provide those definitions that

One meta comment: it is generally preferable to split out patches into small pieces. Please propose the APIs without the X86 implementation for the next round, it should be much easier to get that piece in than doing the whole patch at once.

Some micro level coding stuff:

+#include <llvm/MC/MCInst.h>
+#include <llvm/Support/MemoryObject.h>
+#include <llvm/Support/raw_ostream.h>

Please use "" instead of <>. You can also just forward declare all these classes instead of #including their headers.

+#include "assert.h"

Please use the c++ version of c headers when possible: <cassert>.

+ MCDisassembler(MemoryObject& region,
+ raw_ostream& vStream) {

Please use "foo_t &f" instead of "foo_t& f".

+ virtual ~MCDisassembler() {
+ }

Make sure each class with a virtual method has at least one method defined out of line:
http://llvm.org/docs/CodingStandards.html#ll_virtual_anch

Some bigger stuff:

include/llvm/MC/MCDisassembler.h seems to have four copies of itself in the same file.

In terms of API design:

+ /// Constructor - Performs initial setup for the disassembler. This may
+ /// do the work of disassembly, or may (especially on
+ /// fixed-instruction-length CPUs) set things up for lazy
+ /// disassembly.
+ ///
+ /// @param region - The memory object to use as a source for machine code.
+ /// @param vStream - The stream to print warnings and diagnostic messages on.
+ MCDisassembler(MemoryObject& region,
+ raw_ostream& vStream) {
+ }

Forcing diagnostics to get rendered to a stream is somewhat unfortunate. Different clients may want different information. Would it be possible to use error codes, or make each action (like getInstruction) fill in an error code struct with possible failure info?

The MCDisassembler API is also somewhat strange to me. Right now it is designed to analyze a memory region when constructed and contain them. It seems that there is a more useful low level API which would look something like this:

class MCDisassembler {
public:

   virtual ~MCDisassembler();

   /// DisassembleInstruction - Disassemble the first instruction in the specified region, printing the disassembled instruction to the specified raw_ostream, and returning the size of the instruction in bytes. On error, this returns zero and fills in ErrorInfo with a human readable description of the error.
   virtual unsigned DisassembleInstruction(MemoryObject &region, raw_ostream &OS, std::string &ErrorInfo) = 0;

}

Having this sort of stateless API means that higher level clients (which are stateful) can be built on top of them without adding overhead to clients who don't care about the state.

+++ include/llvm/Support/Indenter.h (revision 0)
@@ -0,0 +1,72 @@
+//===- Indenter.h - Output indenter -----------------------------*- C++ -*-===//

I'm not a big fan of manipulators like this (as David Greene can tell you ;-). I just added a new raw_ostream::indent method, so now instead of stuff like this:

+void DisassemblerTables::emitContextDecision(
+ std::ostream& o2,
+ Indenter& i2,
...

+ o2 << i2.indent() << "struct ContextDecision " << name << " = {" << std::endl;
+ i2.push();
+ o2 << i2.indent() << "{ /* opcodeDecisions */" << std::endl;
+ i2.push();

You can use:

+void DisassemblerTables::emitContextDecision(
+ raw_ostream &o2,
+ unsigned &Indent2,

+ o2.indent(Indent2++) << "struct ContextDecision " << name << " = {\n";
+ o2.indent(Indent2++) << "{ /* opcodeDecisions */\n";

Which reads better, is more efficient, and scales to more than 256 levels of indentation.

+#include <fcntl.h> // for fcntl, open
+#include <stdio.h> // for perror
+#include <sys/types.h> // for read
+#include <sys/uio.h> // for read
+#include <unistd.h> // for read

unistd etc won't work on windows. Please use the llvm/Support/MemoryBuffer.h API for reading files. The LineReader is easier to implement with MemoryBuffer because you just scan for \n and \r in the memory range.

+std::string stringFromMCInst_x86(const MCInst& insn,
+ const Target* target,
+ std::string& triple) {

Please mark this static so that you can shrink the anon namespace.

+ X86ATTAsmPrinter* asmPrinter =
+ dynamic_cast<X86ATTAsmPrinter*>(functionPass);

Please don't use dynamic_cast: we're trying to eliminate RTTI a C cast should be fine.

+int HexDisassembler::disassemble(const Target* target,
+ LineReader& reader) {
...

Bill,

thanks for your comments. I'll respond to them individually. I've attached a new revision of the patch that addresses them. Patch built and tested against SVN 79487, with the additional attached fix that fixes an Intel table bug.

Hi Sean,

Thanks for the fixes!

1. Includes like this "#include <llvm/MC/MCInst.h>" should be in quotes, not angle brackets.

Fixed.

I still saw a few more sneaking in there. Also, don't put system headers (like assert.h, cstdio, etc.) in quotes. I just meant the LLVM headers. :slight_smile:

2. In the "class Indenter", the "push" and "pop" methods check for over/under runs before proceeding. Should these be "asserts" instead?

Yeah, I made them asserts as you suggested. I was thinking about maybe making them saturate, but honestly if we run out of space we should simply break and allow the developer to make the Indenter bigger. Maybe we should have another argument to the constructor specifying how large the buffer should be...?

Either that or make it a SmallVector with initial size "256". I *think* that all of the elements in that vector are guaranteed to be contiguous. (Might want to check.) But that way you won't have to worry about the array growing too big and needing more space.

3. In include/llvm/Support/MemoryObject.h, you changed all "uintptr_t" to "uint64_t". Is there a good reason for this? If not, please change them back.'

Yes, there is. If you're disassembling in a memory space that's larger than yours (for example, on a computer where LLVM is built 32-bit but the source of the bytes is a 64-bit address space), then I don't want to be limited by the local process's notion of uintptr_t.

I see. As Chris said, that makes sense. :slight_smile: I was just worried that bits would fall on the floor if you convert a pointer value to a fixed bit-width size. Of course, when we go to 128-bit processors, we'll have to change this. :smiley:

4. If you have a class that's completely contained within a .cpp file, go ahead and define it in an anonymous namespace and mark it "VISIBILITY_HIDDEN". Like this:

namespace {
class VISIBILITY_HIDDEN StringMemoryObject : public MemoryObject {
...

I used VISIBILITY_HIDDEN for both StringMemoryObject and X86DisassemblerEmitter::X86DEBackend.
I wrapped StringMemoryObject in an anonymous namespace.
I didn't wrap X86DEBackend because it's already nested.

Thanks.

5. In "readBytes" in StringMemoryObject, you're returning a "-1", but the method's return type is "uint64_t". Do you *really* want to return 0xfffffffffffU, or is "-1" a sentinel value of some sort? If the latter, perhaps changing the return type to "int64_t" instead. If the former, then changing the return to "-1U" to make it explicit.

-1 is a sentinel value. I changed readBytes to return an int and accept a uint64_t* that it fills in if it's non-NULL.

Thanks. I'm not a huge fan of sentinel values if they can be avoided.

7. Look for opportunities to use the LLVM small container classes. For instance, the "fOperands" ivar in RecognizableInsn.

I looked at my uses of std::vector but they wouldn't really benefit from the good performance of SmallVector for small values of N.

- fInstructions in X86Disassembler.h is not a good candidate for this, because the number of instructions could potentially be quite large.
- numberedInstructions in X86DisassemblerEmitter.cpp is not a good candidate because there are upwards of a thousand instructions in the x86 table.
- fInstructionSpecifiers in X86DisassemblerTables.h is not a good candidate either... the number is (again) very large.
- fOperands is not a good candidate because is never actually created. It is always set to point to CodeGenInstruction::OperandList.

Thanks for checking!

11. In several places, you put 'default: assert(0 && "Badness happens here!");" at the end of a switch statement. When assertions are turned off, I believe that this could emit warnings. It would be better to put it as the first case of the "switch". And maybe use the new "llvm_unreachable()" macro as well.

All right, I moved them and used llvm_unreachable(). In the C code, where I can't safely #include "llvm/Support/ErrorHandling.h," I declared a macro that does the same thing.

Great!

15. Prefer pre-increment/decrement for variables:

  http://llvm.org/docs/CodingStandards.html#micro_preincrement

Learned a new one there. Thanks!

No prob. :slight_smile:

16. In X86RecognizableInsn.cpp, you use the large macro: HANDLE_OPERAND. Could you make it an inlined function instead? It could take a pointer to a member function for the method it wants to call. Debugging large macros is impossible. :slight_smile:

I factored the code out into a function. There are still macros to keep things pretty, but they just wrap the function.

Thanks!

17. I'm confused by the C files in lib/Target/X86. Why aren't they C++ files?

The reason is that I am aware of clients that need disassembly but can't link in C++ code, so for those clients I've separated out a C core. They won't get the benefits of the MCInst, but that's their problem :slight_smile:

Ah! Okay...I wonder if it would be good to place them in the llvm-c bindings somehow? I honestly don't know much about those bindings, so this might not be feasible. But it would be good to keep icky C code out of pristine C++ code (tongue *firmly* planted in cheek).

-bw

I was away doing other things for a while, but I have an API patch separated out, which (in addition to being much smaller than past megapatches) corrects two issues Chris identified in his most recent set of patches:

  • First, it makes the API a good deal simpler. Now, you can instantiate a single MCDisassembler and, each time you want an instruction disassembled, you can simply pass it a MemoryRegion, an offset, and an MCInst to populate.

  • Second, it adds MCDisassembler to the list of things you can get from a TargetMachine, so you don’t have to #include something from lib/Target/X86 just to use the X86 disassembler.

Please let me know what you think of this API. Thanks for your time.

Sean

llvm-disassembler-api.diff (10.4 KB)

Hi Sean,

Sorry for the delay, this got buried in my inbox.

This API looks great, please commit it with a couple of minor tweaks:

+++ include/llvm/MC/MCDisassembler.h (revision 0)

+#include “llvm/MC/MCInst.h”
+#include “llvm/Support/MemoryObject.h”
+#include “llvm/Support/raw_ostream.h”

Please forward declare these classes instead of #including their headers.

  • virtual bool getInstruction(MCInst& instr,
  • uint64_t& size,
  • MemoryObject &region,
  • uint64_t address,
  • raw_ostream &vStream) const = 0;

“MCInst &instr”, likewise with size. Can “region” be passed in by const reference?

+++ lib/MC/MCDisassembler.cpp (revision 0)
@@ -0,0 +1,20 @@
+//===-- lib/MC/MCDisassembler.h - Disassembler interface --------- C++ --===//

Please fix the comment line (.h → .cpp)

+namespace llvm {