GSoC19: Improve LLVM binary utilities

Hi all,

My name is Seiya Nuta. I'm studying for my master's degree at the University
of Tsukuba and am interested in the project named "Improve LLVM binary
utilities". I've skimmed through llvm-objcopy/llvm-objdump, commit logs,
and Bugzilla to figure out what I should do.

I have some questions about the project:

- What should I prioritize? I suppose that improving llvm-objcopy is the
   most crucial work this summer.
- How can I avoid proposing functionalities that others are already
   working on? It seems that the tools are still being actively
   developed.
- Are there good first issues related to the project? This is the first
   time for me to dig into the LLVM source code so currently I cannot
   show convincing evidence that I'm able to work on the project.

Best regards,
Seiya

Hi Seiya,

What should I prioritize? I suppose that improving llvm-objcopy is the most crucial work this summer.

This is an opinion that will vary a lot from person to person. At the top of my list is improvements to llvm-objdump and working on MachO backends for LLD and llvm-objcopy. The critical thing to avoid IMO is implementing features without a direct use case in mind. I’ve let myself fall victim to this mistake many times before. I would ask the community for improvements they want to see and especially rely on your host to guide the direction you take. If you and your host feel that llvm-objcopy is the most critical then I certainly know some people and use cases that would be interested and will respond to an email on llvm-dev asking what you could work on. Several people have been adding bugs for llvm-objcopy recently and you should be able to find things to do there.

How can I avoid proposing functionalities that others are already working on? It seems that the tools are still being actively developed.

The bug tracker is one way to look at this; people will say if they’re working on any open bugs there. In practice I found that if I have a real use case and the feature I need hasn’t been implemented, no one is likely to be currently working on it. For bigger features you should email llvm-dev. Many people are likely to have thought about how bigger features should be implemented and there’s a better chance that someone is already actively working on things.

Are there good first issues related to the project? This is the first time for me to dig into the LLVM source code so currently I cannot show convincing evidence that I’m able to work on the project.

Well I have biased opinions. I’d like alignment to be better handled in llvm-objdump, I’d like for symbol references to be resolved in an easier to parse fashion, and for module and function offsets to be output in a way that makes them easy to jump between.

(Adding just a bit to Jake’s response)

Hi Seiya,

What should I prioritize? I suppose that improving llvm-objcopy is the most crucial work this summer.

This is an opinion that will vary a lot from person to person.

+1! And don’t forget that one of those people is you. I don’t think it would be useful to start a GSoC project on something you don’t enjoy just because others think it’s important. I would agree about objcopy :) but I’m also happy to help you figure out what project you’d like for any other tool.

At the top of my list is improvements to llvm-objdump and working on MachO backends for LLD and llvm-objcopy. The critical thing to avoid IMO is implementing features without a direct use case in mind. I’ve let myself fall victim to this mistake many times before. I would ask the community for improvements they want to see and especially rely on your host to guide the direction you take. If you and your host feel that llvm-objcopy is the most critical then I certainly know some people and use cases that would be interested and will respond to an email on llvm-dev asking what you could work on. Several people have been adding bugs for llvm-objcopy recently and you should be able to find things to do there.

I think objcopy has the most left to be done, but there’s plenty of work in other binutils. I’m not sure if any particular bit would be called “crucial” however.
A couple ideas that have been kicked around for llvm-objcopy are:

How can I avoid proposing functionalities that others are already working on? It seems that the tools are still being actively developed.

The bug tracker is one way to look at this; people will say if they’re working on any open bugs there. In practice I found that if I have a real use case and the feature I need hasn’t been implemented, no one is likely to be currently working on it. For bigger features you should email llvm-dev. Many people are likely to have thought about how bigger features should be implemented and there’s a better chance that someone is already actively working on things.

Are there good first issues related to the project? This is the first time for me to dig into the LLVM source code so currently I cannot show convincing evidence that I’m able to work on the project.

Well I have biased opinions. I’d like alignment to be better handled in llvm-objdump, I’d like for symbol references to be resolved in an easier to parse fashion, and for module and function offsets to be output in a way that makes them easy to jump between.

Many bugs (though not enough) are tagged with the “beginner” keyword: https://bugs.llvm.org/buglist.cgi?quicksearch=keyword%3Abeginner&list_id=157827. That’s usually a good start.

Hi Seiya,

If you want a project that is not trivial but is doable in a summer, will be a great learning opportunity, and will be very useful to developers, then I would suggest improving the disassembly of object files on x86_64. I can’t count the number of times this has caused confusion.

Consider the following assembly:

    nop
    nop
    .globl sym1
sym1:
    ret

.section .text2,"ax",@progbits
    jmp .text
    jmp .text+1
    jmp .text+6
    jmp sym1
    .globl sym2
sym2:
    jmp .text2
    jmp .text2+1
    jmp .text2+20
    jmp sym2
    jmp sym2@plt

When assembled and then disassembled you will see output something like:

Disassembly of section .text:
0x00000000: 90                      nop
0x00000001: 90                      nop

sym1:
0x00000002: C3                      ret

Disassembly of section .text2:
0x00000000: E9 00 00 00 00          jmp      .text+0xFFFFFFFFFFFFFFFC (0000000000000005h)
0x00000005: E9 00 00 00 00          jmp      .text+0xFFFFFFFFFFFFFFFD (000000000000000Ah)
0x0000000A: E9 00 00 00 00          jmp      sym1 (000000000000000Fh)
0x0000000F: E9 00 00 00 00          jmp      sym2 (0000000000000014h)

sym2:
0x00000014: EB EA                   jmp      0000000000000000h
0x00000016: EB E9                   jmp      0000000000000001h
0x00000018: EB FA                   jmp      sym2 (0000000000000014h)
0x0000001A: EB F8                   jmp      sym2 (0000000000000014h)
0x0000001C: E9 00 00 00 00          jmp      sym2 (0000000000000021h)

This is pretty confusing. What is wanted is output more like this:

Disassembly of section .text[0]:
0x00000000: 90                      nop
0x00000001: 90                      nop

sym1:
0x00000002: C3                      ret

Disassembly of section .text2[1]:
0x00000000: E9 ?? ?? ?? ??          jmp      .text[0] + 0x0
0x00000005: E9 ?? ?? ?? ??          jmp      .text[0] + 0x1
0x0000000A: E9 ?? ?? ?? ??          jmp      .text[0] + 0x6 (sym1 + 0x4)
0x0000000F: E9 ?? ?? ?? ??          jmp      sym1 + 0x0

sym2:
0x00000014: EB EA                   jmp      .text2[1] + 0x0
0x00000016: EB E9                   jmp      .text2[1] + 0x1
0x00000018: EB FA                   jmp      .text2[1] + 0x14 (sym2 + 0x0)
0x0000001A: EB F8                   jmp      .text2[1] + 0x14 (sym2 + 0x0)
0x0000001C: E9 ?? ?? ?? ??          jmp      sym2 (via GOT)

Please forgive me for using the output of our internal tools to illustrate the point (I prepared this internally and don’t have much time to write this email so I just copied and pasted). If you try this with LLVM’s binary tools or GNU’s you will see similar results.
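As a rough illustration of where the addresses in the first listing come from, here is a minimal Python sketch. It is not taken from any LLVM or GNU tool, and the function name `jump_target` is made up for this example. It assumes only the two x86-64 encodings that appear above: `EB` (jmp rel8) and `E9` (jmp rel32), whose displacement is relative to the end of the instruction.

```python
def jump_target(address: int, encoding: bytes) -> int:
    """Return the target of an x86-64 `jmp rel8` (EB) or `jmp rel32` (E9).

    `address` is the offset of the instruction within its section.
    """
    opcode = encoding[0]
    if opcode == 0xEB:
        width = 1          # 8-bit signed displacement
    elif opcode == 0xE9:
        width = 4          # 32-bit signed displacement
    else:
        raise ValueError(f"unsupported opcode {opcode:#x}")
    disp = int.from_bytes(encoding[1:1 + width], "little", signed=True)
    # The displacement is added to the address of the *next* instruction.
    return address + 1 + width + disp


# Short jump at 0x14 in .text2: already resolved by the assembler.
print(hex(jump_target(0x14, bytes.fromhex("EBEA"))))        # 0x0
# Near jump at 0x0 with an unpatched (all-zero) relocation field:
# the "target" is just the next instruction, which is misleading.
print(hex(jump_target(0x00, bytes.fromhex("E900000000"))))  # 0x5
```

The second call shows why values such as `(0000000000000005h)` appear in the first listing: the displacement bytes are still zero because a relocation has not been applied yet.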

Concrete suggestions for improvements:

  • section-relative targets augmented with symbol information
  • ?? to indicate relocation patches
  • targets of PC-relative jumps computed correctly
  • section names augmented with their indices (section names are ambiguous)
  • branches via PLT indicated with added comments
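To make the formatting suggestions a little more concrete, here is a small Python sketch of two of them: printing a section-relative target together with the nearest preceding symbol, and masking relocated bytes with `??`. The helper names (`describe_target`, `mask_relocated`) and the data shapes are assumptions made for this example only; they do not correspond to anything in llvm-objdump.

```python
from bisect import bisect_right


def describe_target(section, index, offset, symbols):
    """Format `offset` in `section` as 'name[index] + 0xN (sym + 0xM)'.

    `symbols` is a list of (offset, name) pairs for this section,
    sorted by offset. The symbol part is omitted if no symbol
    starts at or before `offset`.
    """
    text = f"{section}[{index}] + {offset:#x}"
    pos = bisect_right([off for off, _ in symbols], offset)
    if pos:
        sym_off, name = symbols[pos - 1]
        text += f" ({name} + {offset - sym_off:#x})"
    return text


def mask_relocated(encoding, reloc_ranges):
    """Render instruction bytes as hex, using '??' for relocated bytes.

    `reloc_ranges` is a list of (start, size) pairs relative to the
    start of the instruction.
    """
    masked = set()
    for start, size in reloc_ranges:
        masked.update(range(start, start + size))
    return " ".join(
        "??" if i in masked else f"{b:02X}" for i, b in enumerate(encoding)
    )


print(describe_target(".text", 0, 0x6, [(0x2, "sym1")]))
# .text[0] + 0x6 (sym1 + 0x4)
print(mask_relocated(bytes.fromhex("E900000000"), [(1, 4)]))
# E9 ?? ?? ?? ??
```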

This is not trivial to accomplish. Specifically, computing the target of branches will either require more integration between the binary tools and the disassembler, or possibly the binary tools could create a fake layout and then patch up the instructions so that they disassemble “correctly”.
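The "fake layout" idea could look roughly like the Python sketch below. Everything in it is an assumption for illustration: the base addresses are arbitrary, the helper names are invented, and the relocation is assumed to be a 32-bit PC-relative one computed as S + A - P (symbol address plus addend minus the address of the patched field), as for R_X86_64_PC32. The addend -3 corresponds to the `.text+0xFFFFFFFFFFFFFFFD` shown for `jmp .text+1` in the first listing.

```python
# Hypothetical fake layout: section name -> (base address, size in bytes).
FAKE_LAYOUT = {".text": (0x1000, 0x3), ".text2": (0x2000, 0x21)}


def apply_pc32(code, section_base, field_offset, symbol_address, addend):
    """Patch a 4-byte field in place with S + A - P (assumed PC32 rule)."""
    place = section_base + field_offset
    value = (symbol_address + addend - place) & 0xFFFFFFFF
    code[field_offset:field_offset + 4] = value.to_bytes(4, "little")


def jmp_rel32_target(code, section_base, insn_offset):
    """Decode the target of an E9 (jmp rel32) at `insn_offset`."""
    assert code[insn_offset] == 0xE9
    disp = int.from_bytes(
        code[insn_offset + 1:insn_offset + 5], "little", signed=True
    )
    return section_base + insn_offset + 5 + disp


def section_relative(address, layout):
    """Map a fake address back to (section name, offset), or None."""
    for name, (base, size) in layout.items():
        if base <= address < base + size:
            return name, address - base
    return None


# `jmp .text+1` sits at offset 0x5 of .text2; its rel32 field is at 0x6.
text2 = bytearray(0x21)
text2[5] = 0xE9
text_base = FAKE_LAYOUT[".text"][0]
text2_base = FAKE_LAYOUT[".text2"][0]
apply_pc32(text2, text2_base, 6, text_base, -3)
target = jmp_rel32_target(text2, text2_base, 5)
print(section_relative(target, FAKE_LAYOUT))  # ('.text', 1)
```

One caveat with this approach: a target past the end of its section, such as `jmp .text+6` above, falls outside every fake range (or, with tightly packed sections, inside the wrong one), so a real implementation would need to leave gaps between sections or keep the relocation's symbol around when printing.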

If you manage to get that done, then I would suggest going further and trying to enhance the disassembly by adding color coding/outlining/ASCII art to the output to show things like loops, if statements, and basic blocks. As inspiration see “rich disassembly” in this presentation by Apple: http://devimages.apple.com/llvm/videos/LLVMMCinPractice.m4v.

This is what I meant by llvm-objdump improvements.

Hi,

Thank you for your suggestion. It won't be easy but it's
really attractive to me!

> * section names augmented with their indices (section names are ambiguous)
Could you explain a little further what "ambiguous" means here?

You mean similar section names (e.g., .text1 and .textl)?

Seiya

This augmented output should not be the default; it should only be enabled with an option.