Disassembling arbitrary machine-code byte arrays

Hi,

My apologies if this appears to be a very trivial question -- I have
tried to solve this on my own and I am stuck. Any assistance that
could be provided would be immensely appreciated.

What is the absolute bare minimum that I need to do to disassemble an
array of, say, ARM machine code bytes? Or an array of Thumb machine
code bytes? For example, I might have an array of unsigned chars --
how could I go about decoding these into MCInst objects? Does such a
decoding process take place in one fell swoop or do I parse the stream
one instruction at a time? Can I ask it to "decode the next 10 bytes"?
What follows is my (feeble) attempt at getting started. It probably
doesn't help that I am only familiar with C and Objective-C and find
C++ syntax absolutely bewildering.

Kind regards,
Aidan Steele

int main (int argc, const char *argv[])
{
    LLVMInitializeARMTargetInfo();
    LLVMInitializeARMTargetMC();
    LLVMInitializeARMAsmParser();
    LLVMInitializeARMDisassembler();

    const llvm::Target Target;

    llvm::OwningPtr<const llvm::MCSubtargetInfo>
        STI(Target.createMCSubtargetInfo("", "", ""));
    llvm::OwningPtr<const llvm::MCDisassembler>
        disassembler(Target.createMCDisassembler(*STI));

    llvm::OwningPtr<llvm::MemoryBuffer> Buffer;
    llvm::MemoryBuffer::getFile(llvm::StringRef("/path/to/file.bin"), Buffer);
    llvm::MCInst Inst;
    uint64_t Size = 0;

    disassembler->getInstruction(Inst, Size, *Buffer.take(), 0,
                                 llvm::nulls(), llvm::nulls());

    // llvm::StringRef TheArchString("arm-apple-darwin");
    // std::string normalized = llvm::Triple::normalize(TheArchString);
    //
    // llvm::Triple TheTriple;
    // TheTriple.setArch(llvm::Triple::arm);
    // TheTriple.setOS(llvm::Triple::Darwin);
    // TheTriple.setVendor(llvm::Triple::Apple);
    // llvm::Target *TheTarget = NULL;

    return 0;
}

Hi Aidan,

The easiest thing I can do is to point you to the source of the "llvm-mc" tool, which does exactly what you ask in its "-disassemble" mode. The code is rather small, so it should be easy to work out.

tools/llvm-mc

Cheers,

James

Hi Aidan,

The C-based interface you could use is in llvm/include/llvm-c/Disassembler.h, which declares:

/**
* Disassemble a single instruction using the disassembler context specified in
* the parameter DC. The bytes of the instruction are specified in the
* parameter Bytes, and contains at least BytesSize number of bytes. The
* instruction is at the address specified by the PC parameter. If a valid
* instruction can be disassembled, its string is returned indirectly in
* OutString whose size is specified in the parameter OutStringSize. This
* function returns the number of bytes in the instruction or zero if there was
* no valid instruction.
*/
size_t LLVMDisasmInstruction(LLVMDisasmContextRef DC, uint8_t *Bytes,
                             uint64_t BytesSize, uint64_t PC,
                             char *OutString, size_t OutStringSize);

This is used in Darwin's otool(1), which is an objdump(1)-like tool. The implementation ends up in the libLTO shared library.

Kev

Hi Kev and James,

Thanks to both of you for responding. I had looked at the otool
release published for 10.7.2 (cctools-800), but it seems that this
only snuck in after that, by the cctools-809 release!

In any case, both that and llvm-mc should be more than adequate! A
follow-up question: is the C interface to LLVM a second-class citizen
or should I reasonably be able to expect to do everything with it that
I could do as a consumer of the C++ API?

Regards,
Aidan

It is a second-class citizen in some ways: you can't do everything with the C API that you can do with the C++ API. On the other hand, the C API is stable (we don't change the API), whereas the C++ API changes all the time.

-Chris

The other thing to note about the C API is that things are added to it
mostly on an as-needed basis. So if something is missing, you can ask to
get it added, which I guess should happen, unless it might be difficult
to support long term (i.e. it exposes something unstable).

  Tom