MC disassembler for ARM

Hi,

I'm considering using the MC disassembler for the ARM target in a binary translation project. However, after trying some ARM binaries, I find that there are a lot of instructions that the disassembler fails to decode.

Could anyone give me some information about the maturity of the ARM disassembler?

Thanks!
David

It's production quality. We're not aware of any instructions that it fails to decode. Please provide examples / file bugs.

Evan

Hi Evan,

Thanks for the information!

I've tried using llvm-objdump to disassemble some ARM binaries, such as busybox on Android.

./llvm-objdump -arch=arm -d busybox

There are many instructions that it cannot decode:

:./llvm-objdump: warning: invalid instruction encoding

Am I using llvm-objdump correctly?

I think one possible reason is that llvm-objdump encounters PC-relative data. I'll check whether this is the cause.

Thanks,
David

Hi David,

It's probably assuming the wrong architecture revision. I don't have
an android busybox handy, but I see similar on binaries compiled for
ARMv7. The trick is to use:

llvm-objdump -triple=armv7 -d whatever

(ARMv7 covers virtually anything Android will be running on these days).

There are a couple of other things to be wary of at the moment though:
1. PC-relative data, as you said: ARM code often includes literal data
inline with code, and this could well *not* have a valid disassembly. In
relocatable object files, these regions should be marked[*], but I
believe LLVM has problems with that currently. In executable files
(like "busybox") the regions won't necessarily even be marked.

2. ARM object files may contain mixed ARM and Thumb code: two
different instruction sets. Obviously, disassembling ARM as Thumb or
the reverse won't give you anything sensible. Again, relocatable files
mark these regions[*] but executables don't. If you know that what you
want is Thumb code, you can use the triple "thumbv7" instead for
llvm-objdump.

So a combination of those probably explains why you're getting
problems and may improve matters, but it probably won't make things
perfect (and arguably can't in the case of the ARM/Thumb distinction
without reconstructing all possible control-flow graphs).
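
To make point 1 concrete, here is a minimal Python sketch of how a tool could guess where inline literal data lives when no markers are present: it recognises the ARM-state `ldr Rt, [pc, #imm]` encoding and computes the address being loaded. This is only an illustration, not something llvm-objdump does; the function name is made up, and it relies on the architectural rule that PC reads as the instruction address plus 8 in ARM state.

```python
from typing import Optional


def arm_ldr_literal_target(insn_addr: int, word: int) -> Optional[int]:
    """Return the address read by an ARM-state `ldr Rt, [pc, #+/-imm12]`.

    Returns None if `word` is not that instruction form.
    (Hypothetical helper for illustration only.)
    """
    # cond == 0b1111 is the unconditional instruction space, not LDR.
    if (word >> 28) == 0xF:
        return None
    # Bits 27..24 = 0101 (immediate offset, pre-indexed), B = 0 (word),
    # W = 0 (no writeback), L = 1 (load), Rn = 0b1111 (pc).
    # Bit 23 (U, add/subtract) is left out of the mask on purpose.
    if (word & 0x0F7F0000) != 0x051F0000:
        return None
    imm12 = word & 0xFFF
    add = bool(word & (1 << 23))
    pc = insn_addr + 8  # ARM state: PC reads as current instruction + 8
    return pc + imm12 if add else pc - imm12
```

For example, an `ldr r2, [pc, #-4]` (0xE51F2004) at address 0x4 reads the word at 0x8, so a disassembler that collected such targets could treat those words as data rather than trying to decode them.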

Tim.

[*] The marking is via symbols $a, $t and $d which reference the
beginning of each stretch of ARM code, Thumb code and Data.

./llvm-objdump -arch=arm -d busybox

It might be possible that this defaults to armv4.

Hi Tim,

Thanks a lot for the reply.

I tested libc.so, which is a shared library. llvm-objdump also reports some disassembly errors.

Could you please tell me more about the $a, $t and $d symbols? How are these symbols used to define different regions? Where can I find these symbols in an ELF object file?

Thanks,
David

I'm now trying to find a decoder of ARM instructions in order

Hi David,

At the start of each range of ARM code, an assembler or compiler
should produce a "$a" symbol with that address, and put it (naturally
enough) in the ELF symbol-table. Similarly each stretch of Thumb code
gets a "$t" and each stretch of data a "$d".

For example if I assemble:

    .arm
    mov r0, r3
    ldr r2, Lit
Lit:
    .word 42
    add r0, r0, r0
    .thumb
    mov r5, r2

then the symbol table contains these entries:
     4: 00000000 0 NOTYPE LOCAL DEFAULT 1 $a
     [...]
     6: 00000008 0 NOTYPE LOCAL DEFAULT 1 $d
     7: 0000000c 0 NOTYPE LOCAL DEFAULT 1 $a
     8: 00000010 0 NOTYPE LOCAL DEFAULT 1 $t

which shows that an ARM region begins at offset 0x0, a data one at
offset 0x8, we switch back to ARM at 0xc and finally Thumb takes over
at 0x10.
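
As a rough illustration of how a tool could consume these symbols, the Python sketch below turns a list of (address, name) pairs, as you might read them from readelf output, into typed regions. The function and variable names are made up for the example, and the section end address has to be supplied by the caller.

```python
from typing import List, Tuple

KINDS = {"$a": "arm", "$t": "thumb", "$d": "data"}


def is_mapping_symbol(name: str) -> bool:
    # The ABI also allows a suffix after a period, e.g. "$a.1".
    return name[:2] in KINDS and (len(name) == 2 or name[2] == ".")


def mapping_regions(symbols: List[Tuple[int, str]],
                    section_end: int) -> List[Tuple[int, int, str]]:
    """Return (start, end, kind) regions; each region runs to the next marker."""
    marks = sorted((addr, KINDS[name[:2]])
                   for addr, name in symbols if is_mapping_symbol(name))
    regions = []
    for i, (start, kind) in enumerate(marks):
        end = marks[i + 1][0] if i + 1 < len(marks) else section_end
        if end > start:  # skip empty regions
            regions.append((start, end, kind))
    return regions


# The four entries from the symbol table above; 0x12 is the assumed end of
# the section (one 2-byte Thumb instruction after offset 0x10).
table = [(0x0, "$a"), (0x8, "$d"), (0xC, "$a"), (0x10, "$t")]
for start, end, kind in mapping_regions(table, 0x12):
    print(f"{start:#06x}-{end:#06x} {kind}")
```

A disassembler could then pick the ARM or Thumb decoder per region and skip the data regions entirely.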

GNU objdump hides the symbols by default when printing the
symbol-table (you can give it the --special-syms option to show them),
but readelf shows them always.

If you want the really deep details, they're fully documented in the
ARM ELF ABI here (section 4.6.5):

Which is all nice to know, but I'm afraid it probably doesn't offer an
immediate solution to the undefined instructions:
+ libc.so isn't a relocatable object file (well, it is dynamically,
but that doesn't count).
+ llvm-objdump ignores them anyway at the moment, as far as I can tell.

Tim.

Hi Tim,

Thanks a lot for your help! I’m very grateful.

libc.so is a prelinked library; I'll build a non-prelinked one and try again.

I'm now at the start of a binary translation project. I want to convert ARM binary code [*] to LLVM IR, which is then translated to binary for our MIPS-like architecture. That's why I'm looking for a decoder for ARM binaries.

The ARMMCDisassembler is production quality, as Evan said, which is why I'm so interested in it. However, I realized today that it might not be a good choice. Although the disassembled MCInsts have a clean and simple interface, their opcodes are auto-generated from the instruction description files. There are a large number of them, and they do not have a one-to-one correspondence to ARM instructions. I think it is not a good idea for our translator to rely on the implementation of the LLVM ARM back-end, so I have to find another decoder or implement one ourselves.

Thanks,
David

[*] In most cases, the targets are the shared libraries in Android APKs built with the NDK, like libangraybird.so. I think most of them are prelinked, which is bad for us: because there are no $a, $t and $d symbols, we cannot statically figure out which regions are ARM code and which are Thumb code.

Every MCInst created by the MCDisassembler will have a one-to-one mapping to an actual ARM instruction.

Hi Jim,

Thanks for the reply. I'm sorry I didn't make myself clear enough.

The MCInst created by the MCDisassembler depends on the instructions defined in the td files. These instructions do not have a one-to-one mapping to ARM instructions: there are usually one or more instructions defined in the td files that correspond to one actual ARM instruction.

Thanks,
David

That depends on how you define “one ARM instruction.” It’s not a clear cut thing. For example, is “add r1, r2, r3” the same ARM instruction as “add r1, r2, #4”? What is a distinct instruction and what’s a variant encoding of the same instruction is often entirely a matter of convenience.
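
To see why this is a matter of convention, here is a small Python sketch (the function name and the returned field names are made up for illustration) that pulls apart the two encodings in question. Both words carry the same 4-bit opcode field for ADD; only bit 25 says whether the last operand is a register or an immediate, so a decoder is free to report them as one instruction or as two.

```python
from typing import Optional


def describe_data_processing(word: int) -> Optional[dict]:
    """Split an ARM data-processing word into a few of its fields.

    Only checks bits 27..26 == 00; it ignores the multiply and
    miscellaneous encodings that share this space.
    """
    if (word >> 26) & 0x3 != 0:
        return None
    return {
        "opcode": (word >> 21) & 0xF,  # 0b0100 is ADD
        "form": "immediate" if word & (1 << 25) else "register",
        "rn": (word >> 16) & 0xF,
        "rd": (word >> 12) & 0xF,
    }


reg_form = describe_data_processing(0xE0821003)  # add r1, r2, r3
imm_form = describe_data_processing(0xE2821004)  # add r1, r2, #4
print(reg_form)
print(imm_form)
```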

-Jim

Yes, I got it. Thanks for the reply!

I'm considering making the Instr → LLVM IR transformation a part of the instruction definition in the td file, and then using tablegen to generate the code automatically, just as it does for the disassembler, thus bypassing MCInst.

All suggestions are welcome!
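
A very rough Python sketch of the shape this could take, assuming each instruction definition supplies a bit pattern plus a small "lift" rule. Everything here is hypothetical: the table stands in for what tablegen would generate, only two unconditional, non-flag-setting ADD forms are covered, uses of pc as an operand are not handled, and the output is IR-like text rather than valid SSA.

```python
from typing import Callable, List, NamedTuple, Optional


class Pattern(NamedTuple):
    name: str                   # hypothetical label, not an LLVM opcode
    mask: int                   # bits that must match
    value: int                  # required value of those bits
    lift: Callable[[int], str]  # produces IR-like text for a matching word


def ror32(x: int, n: int) -> int:
    n %= 32
    return ((x >> n) | (x << (32 - n))) & 0xFFFFFFFF


def lift_add_reg(w: int) -> str:
    rd, rn, rm = (w >> 12) & 0xF, (w >> 16) & 0xF, w & 0xF
    return f"%r{rd}.new = add i32 %r{rn}, %r{rm}"


def lift_add_imm(w: int) -> str:
    rd, rn = (w >> 12) & 0xF, (w >> 16) & 0xF
    # ARM modified immediate: 8-bit value rotated right by 2 * rot.
    imm = ror32(w & 0xFF, 2 * ((w >> 8) & 0xF))
    return f"%r{rd}.new = add i32 %r{rn}, {imm}"


TABLE: List[Pattern] = [
    # cond = AL, S = 0, plain register operand (no shift)
    Pattern("add_reg", 0xFFF00FF0, 0xE0800000, lift_add_reg),
    # cond = AL, S = 0, immediate operand
    Pattern("add_imm", 0xFFF00000, 0xE2800000, lift_add_imm),
]


def lift(word: int) -> Optional[str]:
    """Return IR-like text for `word`, or None if no pattern matches."""
    for p in TABLE:
        if (word & p.mask) == p.value:
            return p.lift(word)
    return None


print(lift(0xE0821003))  # add r1, r2, r3
print(lift(0xE2821004))  # add r1, r2, #4
```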

Thanks,
Dawei