llvm-objdump

I would like to improve llvm-objdump. However, many unit tests depend
precisely on the current output, making the picture a little tricky.
My experience is limited to ELF format objects, so experts in other
formats please sanity check.

Suggested changes:
1) Symbolize conditional branch targets. Currently, llvm-objdump
prints branch targets numerically regardless of -symbolize.

2) Make -symbolize the default behavior for human friendliness.

3) Add new -bare option to suppress symbolizing. Many unit tests will
use -bare to preserve expected output in today's format.

4) When multiple symbols exist for a given address, print all of them.
Today, llvm-objdump only prints the last symbol found, but symbolizes
references with the first symbol found. So, it's a bit of a mess.

5) When symbolizing code references, prefer matching symbols with type
FUNC, but fall back to matches with type NOTYPE. This matches GNU
objdump behavior and many hand written assembly files don't specify
.type directives anyway.
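A minimal sketch of the selection rule in item 5, written in Python for brevity. The `Symbol` record and the `pick_code_symbol` helper are hypothetical names used only for illustration; this is not llvm-objdump code, and the symbol types are assumed to be available as plain strings.

```python
from typing import List, NamedTuple, Optional


class Symbol(NamedTuple):
    name: str
    address: int
    type: str  # ELF symbol type as a string, e.g. "FUNC", "NOTYPE", "OBJECT"


def pick_code_symbol(symbols: List[Symbol], address: int) -> Optional[Symbol]:
    """Pick a symbol for a code reference: FUNC first, then NOTYPE."""
    candidates = [s for s in symbols if s.address == address]
    for wanted in ("FUNC", "NOTYPE"):
        for sym in candidates:
            if sym.type == wanted:
                return sym
    return None  # caller falls back to printing the numeric address


table = [
    Symbol("local_label", 0x10, "NOTYPE"),
    Symbol("my_func", 0x10, "FUNC"),
    Symbol("asm_entry", 0x20, "NOTYPE"),  # hand-written assembly, no .type
    Symbol("some_data", 0x30, "OBJECT"),
]
print(pick_code_symbol(table, 0x10).name)  # my_func
print(pick_code_symbol(table, 0x20).name)  # asm_entry
print(pick_code_symbol(table, 0x30))       # None
```

Symbols of other types (OBJECT here) are never chosen for a code reference, so the caller would print the raw address in that case.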

How does this sound?

Regards,
-steve

Hi Steve,

I too have been working on improving llvm-objdump, in my case for Mach-O files, a format I suppose I could be called an expert in. My long-term goal is to match llvm-objdump’s functionality with that of Darwin’s otool(1) and then improve beyond it.

For branch targets my preference is to print the target’s address (not the displacement of the branch), and preferably in hex. There should be a way to toggle between non-symbolic and symbolic output, since non-symbolic output is needed for debugging. Symbolic output should go all out: use the symbol table, relocation entries, past instructions, indirect tables, literal tables, Objective-C metadata, C++ demanglers, and even debug info to print the best operand and comment along with the instruction. For symbolic output we go to all these lengths (short of debug info) in Darwin’s otool(1) using llvm’s disassembler hooks.

I do think it makes sense for the default to be symbolic, with non-symbolic output behind an option. I would love to extend the non-symbolic option to things like printing the private headers, relocation entries, etc. as raw values. Again, this is very useful for debugging and for dealing with broken object files, when you need to see the values to work out what is going on. The name -bare seems fine to me for this option.

I don’t think having multiple addresses for a target is a real problem with the exception of the address 0 (which is often an unrelocated no addend value). So the trick is to not print the symbol name in the object with the address of zero in those cases. Generally in Mach-O we don’t see multiple symbols at the same address.

In Mach-O we don’t have typed symbols in the symbol table without looking at debugging info. But what you say about using type FUNC symbols for ELF seems to make sense to me.

My thoughts,
Kev

> I would like to improve llvm-objdump. However, many unit tests depend
> precisely on the current output, making the picture a little tricky.
> My experience is limited to ELF format objects, so experts in other
> formats please sanity check.
>
> Suggested changes:
> 1) Symbolize conditional branch targets. Currently, llvm-objdump
> prints branch targets numerically regardless of -symbolize.
>
> 2) Make -symbolize the default behavior for human friendliness.

Last I checked (which admittedly was about a year ago), -symbolize had
significant performance problems on large object files. If those are still
present, I think you should focus on fixing them before changing the
default.

> 3) Add new -bare option to suppress symbolizing. Many unit tests will
> use -bare to preserve expected output in today's format.
>
> 4) When multiple symbols exist for a given address, print all of them.
> Today, llvm-objdump only prints the last symbol found, but symbolizes
> references with the first symbol found. So, it's a bit of a mess.
>
> 5) When symbolizing code references, prefer matching symbols with type
> FUNC, but fall back to matches with type NOTYPE. This matches GNU
> objdump behavior and many hand written assembly files don't specify
> .type directives anyway.
>
> How does this sound?

You seem to be focusing a lot on the user-visible behavior. However, I
would say that a lot of the work that needs to be done is actually internal
to the code; that will make adding new functionality easier.

Here are some suggested changes:

1. Clean up the code, improving the usability of LLVM's C++ APIs as
necessary. In fact, this will benefit all LLVM users of this functionality.
The main thing is to clarify the core logic and reduce boilerplate.
2. Rip out the YAMLCFG stuff (including the corresponding library code)
since it seems totally borked.

btw, for tools that we consider as internal testing tools, there has
historically been pushback for adding user-visible features if they do not
serve an immediate need within the LLVM codebase (CC'ing Rafael: does
llvm-objdump fall under this category?).

-- Sean Silva

Hi Kev,
I'm glad to hear llvm-objdump is getting attention. I'm unclear on
how much output specialization one could (or should) do for ELF vs.
Mach-O. If you're game, let's compare an example:

$ cat labeltest.s
.text
foo:
    nop
bar:
bum:
    nop
    jmp bar
    jmp bum
    jmp baz
    nop
baz:
    nop

Assembling for x86 and llvm-objdump'ing, I get:

$ llvm-mc -arch=x86 -filetype=obj labeltest.s -o x86_labeltest.o
$ llvm-objdump -d x86_labeltest.o

x86_labeltest.o: file format ELF32-i386

Disassembly of section .text:
foo:
       0: 90 nop

bum:
       1: 90 nop
       2: eb fd jmp -3
       4: eb fb jmp -5
       6: eb 01 jmp 1
       8: 90 nop

baz:
       9: 90 nop

I get the dump above with or without -symbolize.

My personal golden reference, GNU objdump, does this:

$ objdump -dw x86_labeltest.o

x86_labeltest.o: file format elf32-i386

Disassembly of section .text:

00000000 <foo>:
   0: 90 nop

00000001 <bar>:
   1: 90 nop
   2: eb fd jmp 1 <bar>
   4: eb fb jmp 1 <bar>
   6: eb 01 jmp 9 <baz>
   8: 90 nop

00000009 <baz>:
   9: 90 nop

What does otool produce?

> For branch targets my preference is to print the target’s address (not the displacement of the branch), and preferably in hex.

I like this too.

> I don’t think having multiple addresses for a target is a real problem with the exception of the address 0 (which is often an unrelocated no addend value). So the trick is to not print the symbol name in the object with the address of zero in those cases

Right, relocations are a special case.

Hi Steve,

For the labeltest.s I get:

% llvm-mc -triple x86_64-apple-darwin10 -filetype=obj -o x86_labeltest.o labeltest.s

First with just -v that produces disassembly (without verbose operands):
% otool -tv x86_labeltest.o
x86_labeltest.o:
(__TEXT,__text) section
foo:
0000000000000000 nop
bar:
0000000000000001 nop
0000000000000002 jmp 0x7
0000000000000007 jmp 0x1
0000000000000009 jmp 0xe
000000000000000e nop
baz:
000000000000000f nop

And second with -V that produces “verbose operands”:

% otool -tV x86_labeltest.o
x86_labeltest.o:
(__TEXT,__text) section
foo:
0000000000000000 nop
bar:
0000000000000001 nop
0000000000000002 jmp bar
0000000000000007 jmp bar
0000000000000009 jmp baz
000000000000000e nop
baz:
000000000000000f nop

And third adding -j that prints the opcode bytes:
% otool -tVj x86_labeltest.o
x86_labeltest.o:
(__TEXT,__text) section
foo:
0000000000000000 90 nop
bar:
0000000000000001 90 nop
0000000000000002 e900000000 jmp bar
0000000000000007 ebf8 jmp bar
0000000000000009 e900000000 jmp baz
000000000000000e 90 nop
baz:
000000000000000f 90 nop

For me, operands of -3, -5 and 1 are of little use. If I think the target is assembled wrong, I want to see where it thinks it is going (the hex address in the object file) and the opcode bytes, so I can hand-decode what is going on (more important on targets like ARM that don’t have simple displacements).
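As a rough illustration of that hand decoding, here is a small Python sketch that turns the displacement of an x86 short jump (opcode 0xEB followed by a signed 8-bit displacement) into the absolute target address. The helper name `short_jmp_target` is made up for this example; the addresses and bytes are the ones from the ELF dump earlier in the thread.

```python
def short_jmp_target(addr: int, insn: bytes) -> int:
    """Return the absolute target of an x86 short jmp (0xEB rel8)."""
    if len(insn) != 2 or insn[0] != 0xEB:
        raise ValueError("not a 2-byte short jmp")
    # rel8 is a signed displacement relative to the *next* instruction.
    disp = int.from_bytes(insn[1:2], "little", signed=True)
    return addr + len(insn) + disp


for addr, raw in [(0x2, "eb fd"), (0x4, "eb fb"), (0x6, "eb 01")]:
    target = short_jmp_target(addr, bytes.fromhex(raw))
    print(f"{addr:x}: {raw} jmp {target:#x}")
```

The three jumps resolve to 0x1, 0x1 and 0x9, which are the addresses GNU objdump prints for the same object.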

Also if I’m printing symbolic operands like “bar” I don’t want to see the address of bar or the displacement in that case. Basically I want to see as close to real assembly code as possible.

Also note that for Mach-O, we work hard to not have multiple symbols at the same address and to use only assembler temporary names for labels that are not meant to be real symbols. We use things like 1f, 2b or L21 because we break sections into “atoms” at the symbol addresses by default (when the assembly has the directive .subsections_via_symbols, which sets the SUBSECTIONS_VIA_SYMBOLS flag in the header).

Kev

P.S. We also display raw text bytes with just -t and no -v or -V, which is useful when debugging very broken objects:

% otool -t x86_labeltest.o
x86_labeltest.o:
(__TEXT,__text) section
0000000000000000 90 90 e9 00 00 00 00 eb f8 e9 00 00 00 00 90 90

> For me, operands of -3, -5 and 1 are of little use.

Agreed. This annoyance started me down this whole path.

> Also if I’m printing symbolic operands like “bar” I don’t want to see the
> address of bar or the displacement in that case. Basically I want to see as
> close to real assembly code as possible.

Sounds right to me. In ‘-bare’ you get the address, otherwise you get the symbol.

> We use things like 1f, 2b or L21 because we break sections into “atoms” at the symbol

I added ‘f’ and ‘b’ style branches to the test case, but GNU objdump shows these as an address plus nearest label + offset.

$ cat labeltest.s
.text
foo:
nop
bar:
bum:
nop
jmp bar
jmp bum
jmp 1f
nop
1:
nop
jmp 1b

$ llvm-objdump -d -symbolize x86_labeltest.o

x86_labeltest.o: file format ELF32-i386

Disassembly of section .text:
foo:
0: 90 nop

bum:
1: 90 nop
2: eb fd jmp -3
4: eb fb jmp -5
6: eb 01 jmp 1
8: 90 nop
9: 90 nop

$ objdump -dw x86_labeltest.o

x86_labeltest.o: file format elf32-i386

Disassembly of section .text:

00000000 <foo>:
0: 90 nop

00000001 <bar>:
1: 90 nop
2: eb fd jmp 1 <bar>
4: eb fb jmp 1 <bar>
6: eb 01 jmp 9 <bar+0x8>
8: 90 nop
9: 90 nop
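To make the "address plus nearest label + offset" form concrete, here is a small Python sketch that produces operands in the style seen in the GNU dump above. It only mimics the printed format; `make_symbolizer` is a hypothetical helper, and nothing here is claimed to match how GNU objdump actually selects its symbol.

```python
import bisect


def make_symbolizer(symbols):
    """symbols: list of (name, address). Returns a function addr -> operand text."""
    ordered = sorted(symbols, key=lambda s: s[1])  # stable: keeps input order per address
    addrs = [addr for _, addr in ordered]

    def symbolize(target: int) -> str:
        i = bisect.bisect_right(addrs, target) - 1
        if i < 0:
            return f"{target:x}"  # no symbol at or before the target
        base = addrs[i]
        # Several symbols may share an address; use the first one listed.
        name = ordered[bisect.bisect_left(addrs, base)][0]
        offset = target - base
        if offset == 0:
            return f"{target:x} <{name}>"
        return f"{target:x} <{name}+{offset:#x}>"

    return symbolize


symbolize = make_symbolizer([("foo", 0x0), ("bar", 0x1), ("bum", 0x1)])
print("jmp", symbolize(0x1))  # jmp 1 <bar>
print("jmp", symbolize(0x9))  # jmp 9 <bar+0x8>
```

Because the temporary label `1:` never reaches the symbol table, address 9 can only be described relative to the nearest preceding symbol.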

I haven't tested performance, but you're probably right. The
symbolizing code repeats a linear search for each symbol. There is a
FIXME comment in the loop suggesting to use a map instead.
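Roughly what that FIXME is asking for, sketched in Python rather than in the tool's own C++: build an address-to-names map once, so each lookup no longer scans the whole symbol list. `build_address_map` and the sample symbol list are assumptions made for illustration, not existing llvm-objdump code.

```python
from collections import defaultdict


def build_address_map(symbols):
    """symbols: iterable of (name, address). Built once per object file."""
    by_addr = defaultdict(list)
    for name, addr in symbols:
        by_addr[addr].append(name)
    return dict(by_addr)


symbols = [("foo", 0x0), ("bar", 0x1), ("bum", 0x1), ("baz", 0x9)]
addr_map = build_address_map(symbols)

# Each lookup is now a single dictionary access instead of a linear scan.
print(addr_map.get(0x1))  # ['bar', 'bum']
print(addr_map.get(0x5))  # None
```

Keeping a list per address also preserves every symbol at that address, which is what printing all of them would need.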

> btw, for tools that we consider as internal testing tools, there has
> historically been pushback for adding user-visible features if they do not
> serve an immediate need within the LLVM codebase (CC'ing Rafael: does
> llvm-objdump fall under this category?).

I don't think so. If I remember correctly the reason for having
llvm-readobj and llvm-objdump is that readobj can be whatever we want
for testing and llvm-objdump can match as closely as practical the
system (gnu?) objdump.

Cheers,
Rafael

> 2. Rip out the YAMLCFG stuff (including the corresponding library code)
> since it seems totally borked.

Is the attached patch OK? :-)

Cheers,
Rafael

t.patch (24.7 KB)

> 2. Rip out the YAMLCFG stuff (including the corresponding library code)
> since it seems totally borked.

> Is the attached patch OK? :-)

It appears that the YAML is just the skin of a lot of the MCAnalysis stuff
which looks dead. E.g. r182628, which adds the apparently unused -cfg
option for llvm-objdump. I think your patch leaves this MCAnalysis stuff
even more dead :-)

There's a whole string of commits for the MC CFG stuff that seemed to have
gotten dumped into trunk with no tests, and the post-review "tests?"
feedback on every one of the commits appears to have been mostly ignored.
It is probably worth ripping it all out at once.

-- Sean Silva

I like this idea of tailoring default output to keep users of
different platforms in their comfort zone: otool style for Mach-O,
objdump style for Linux, etc. A -bare option would produce platform
independent output for diagnostics or other cases where consistency
matters more than friendliness.