Object file modification/writing

I’ve been trawling through MC, LLD, and libObject/ObjCopy looking for the LLVM way to modify and write object files from memory buffers. I see that each project has more or less its own implementation, which seems likely to cause repeated work and feature drift.

I’d like to move the modification/writing APIs into libObject so that we can start to converge users on one way to read/modify/write object files. libObject provides a pretty good API for reading object files, but modification/writing support is private and lives in ObjCopy. I’d like to basically just take the object writing bits and move them over, while adding support for doing things like:

auto obj1 = createObjectFile(memoryBuffer);
auto obj2 = createObjectFile(memoryBuffer);

ObjectFile obj3 = copyObjectFrom(obj1);
obj3.mergeSectionsFrom(obj2);
obj3.writeTo(writeableMemoryBuffer);

I’m not planning on parsing into assembly (though ideally it would be easy to do if desired); the idea here is simply to provide a way to modify/combine object files manually. Please note - I’m also not going for a boil-the-ocean approach here: my goal is explicitly not to provide a full linker! There’s admittedly a LOT of code in the LLVM codebase, so if I’m missing something that already does what I’m looking for, please let me know.

I’m going through the git blame and tagging folks who have touched files of interest recently just to get advice/pitfalls. If there’s someone who should be tagged that isn’t, please feel free to tag them!
@BertalanD @ellishg @int3 @MaskRay @topperc

If you just want an abstraction for writing an ELF header, that’s probably feasible, but that’s also not very much code. Beyond that, I’m not sure there’s any abstraction that would actually work for all those contexts. MC, LLD, and objcopy all represent the objects they’re constructing using very different data structures, and I’m not sure you can unify them without significantly hurting performance.
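For a sense of scale, here is a rough sketch of what “writing an ELF header” amounts to, using only the standard library. The field order and sizes follow the ELF64 layout; the helper names (`putLE`, `writeElf64RelHeader`) are made up for illustration and are not LLVM APIs, and the sketch assumes a little-endian x86-64 relocatable object with no program headers.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Append an integer to the buffer in little-endian byte order.
template <typename T>
void putLE(std::vector<uint8_t> &out, T value) {
  for (size_t i = 0; i < sizeof(T); ++i)
    out.push_back(static_cast<uint8_t>(
        (static_cast<uint64_t>(value) >> (8 * i)) & 0xff));
}

// Hypothetical helper (not an LLVM API): emit the 64-byte ELF64 header of a
// little-endian x86-64 relocatable object.
std::vector<uint8_t> writeElf64RelHeader(uint64_t shoff, uint16_t shnum,
                                         uint16_t shstrndx) {
  std::vector<uint8_t> out;
  // e_ident: magic, ELFCLASS64 (2), ELFDATA2LSB (1), EV_CURRENT (1), padding.
  const uint8_t ident[16] = {0x7f, 'E', 'L', 'F', 2, 1, 1, 0};
  out.insert(out.end(), ident, ident + 16);
  putLE<uint16_t>(out, 1);        // e_type = ET_REL
  putLE<uint16_t>(out, 62);       // e_machine = EM_X86_64
  putLE<uint32_t>(out, 1);        // e_version = EV_CURRENT
  putLE<uint64_t>(out, 0);        // e_entry (none for a relocatable object)
  putLE<uint64_t>(out, 0);        // e_phoff (no program headers)
  putLE<uint64_t>(out, shoff);    // e_shoff: file offset of section headers
  putLE<uint32_t>(out, 0);        // e_flags
  putLE<uint16_t>(out, 64);       // e_ehsize
  putLE<uint16_t>(out, 0);        // e_phentsize
  putLE<uint16_t>(out, 0);        // e_phnum
  putLE<uint16_t>(out, 64);       // e_shentsize = sizeof(Elf64_Shdr)
  putLE<uint16_t>(out, shnum);    // e_shnum
  putLE<uint16_t>(out, shstrndx); // e_shstrndx
  assert(out.size() == 64);
  return out;
}
```

The header itself is indeed small; the hard part is computing `shoff`, `shnum`, and everything the section headers point at, which is where the tools differ.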

The idea would be to get some way to modify an object file and write the modified file. To be concrete, let’s say I wanted to add a symbol that I already have binary code for - I want a C++ API that allows me to add it to the text section and update the symbol table without directly munging bytes.
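To make the request concrete, the kind of API being asked for might look roughly like the toy model below. Everything here (`Section`, `Symbol`, `Object`, `addFunction`) is a hypothetical illustration, not an existing LLVM type or function, and it ignores relocations, symbol ordering, and format-specific details entirely.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Toy in-memory model; none of these types exist in LLVM.
struct Section {
  std::string name;
  std::vector<uint8_t> data;
};

struct Symbol {
  std::string name;
  size_t sectionIndex; // index into Object::sections
  uint64_t offset;     // offset of the symbol within that section
  uint64_t size;       // size of the symbol's contents in bytes
};

struct Object {
  std::vector<Section> sections;
  std::vector<Symbol> symbols;

  // Append `code` to the given section, padded to `align`, and record a
  // symbol for it so the caller never touches raw offsets.
  Symbol addFunction(size_t sectionIndex, const std::string &name,
                     const std::vector<uint8_t> &code, uint64_t align = 16) {
    assert(sectionIndex < sections.size());
    assert(align != 0);
    std::vector<uint8_t> &bytes = sections[sectionIndex].data;
    // Pad so the new function starts at an aligned offset. 0xcc (int3) is
    // an x86 padding convention and is an assumption of this sketch.
    while (bytes.size() % align != 0)
      bytes.push_back(0xcc);
    const uint64_t offset = bytes.size();
    bytes.insert(bytes.end(), code.begin(), code.end());
    symbols.push_back(
        {name, sectionIndex, offset, static_cast<uint64_t>(code.size())});
    return symbols.back();
  }
};
```

A caller would then write something like `obj.addFunction(textIndex, "my_func", bytes)` and leave serialization to a format-specific writer.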

MC, LLD, and objcopy all represent the objects they’re constructing using very different data structures, and I’m not sure you can unify them without significantly hurting performance.

Can you expand on that? What I saw was a bunch of representations of Symbol/SymbolTable/Section/Segment that looked very similar, at least to my untrained eye.

If you want to move the objcopy code to an LLVM library, that seems fine, but that’s very different from the sort of API you could share with the MC layer or LLD.

The issue is more what is computed, and when it’s computed. We don’t want to force other code (particularly lld) into using suboptimal data structures for the sake of conforming to some abstract representation of, for example, a symbol table.

Maybe my statement was a little too strong; there’s probably some code that could be unified. But we have to be very careful about the performance implications of the data structures we’re using.

Not all of it, but just the bits that modify/write objects. I’m not expecting to unify MC and LLD, just hoping to take some duplicated code and de-duplicate it a little bit maybe.

100% - I have no intention of forcing projects to use these data structures or try and build something that all the various projects can use perfectly - more that I want to unify duplicates and provide an actual library API for dealing with object files that doesn’t involve a command-line or something basically adjacent to a command line :slight_smile: There’ll always be some custom stuff on top, but having a base layer of “this is how to take some bytes and jam them into a MachO/ELF/COFF/WASM structure” seems generally useful.

Yeah, I wouldn’t mind a lower-level ELF-writing API that I could reimplement llvm-dwp with. Currently it’s built on top of libObject and MC, which involve quite a bit of memory overhead. It’s unfortunate that MC necessarily buffers the whole output sections. I’d love to be able to say “I know how many bytes I want to write, let me know when you’re ready and I’ll write them” so I don’t have to buffer, or, worse, “I don’t know how many bytes, let me know and I’ll write them and then tell you how many I wrote”. (Currently MCStreamer writes the section header at the end of the file, so it doesn’t need to know the full layout ahead of time, I think; then it seeks back to the start to write the offset to the section header.)

Maybe such an abstraction could be low enough that lld could use it too, but wouldn’t hold my breath unfortunately. At least low level enough that MCStreamer, llvm-dwp, and objcopy, maybe?
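As a rough illustration of the “let me know when you’re ready and I’ll write them” style, here is a sketch of a callback-driven writer for a made-up container format. The class name `StreamingWriter` and the format itself are assumptions for illustration only; this is not how MCStreamer or any LLVM writer is implemented. It does mirror the pattern described above: write a placeholder up front, stream the payloads without buffering them, put the table at the end, then seek back and patch the offset.

```cpp
#include <cassert>
#include <cstdint>
#include <functional>
#include <ostream>
#include <sstream>
#include <string>
#include <vector>

// Write a 64-bit integer in little-endian byte order.
inline void writeU64(std::ostream &os, uint64_t v) {
  char b[8];
  for (int i = 0; i < 8; ++i)
    b[i] = static_cast<char>((v >> (8 * i)) & 0xff);
  os.write(b, 8);
}

// Hypothetical streaming writer for a toy format:
//   [8-byte offset of trailer][payloads...][trailer: 8-byte size per payload]
// Requires a seekable output stream.
class StreamingWriter {
  std::ostream &os;
  std::vector<uint64_t> sizes;

public:
  explicit StreamingWriter(std::ostream &out) : os(out) {
    writeU64(os, 0); // placeholder, patched in finish()
  }

  // The callback writes directly to the stream, so the payload is never
  // buffered here; its size is measured after the fact.
  void addPayload(const std::function<void(std::ostream &)> &emit) {
    const std::streampos begin = os.tellp();
    emit(os);
    sizes.push_back(static_cast<uint64_t>(os.tellp() - begin));
  }

  // Write the trailer, then seek back and patch its offset into the header.
  void finish() {
    const uint64_t trailerOffset = static_cast<uint64_t>(os.tellp());
    for (uint64_t s : sizes)
      writeU64(os, s);
    const std::streampos end = os.tellp();
    os.seekp(0);
    writeU64(os, trailerOffset);
    os.seekp(end); // restore the write position to the end of the output
  }
};
```

The trade-off is that the output must be seekable, which rules out plain pipes unless the layout is known up front.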

Yeah makes sense to me. TBH my goal out of this work is to be able to do basically ld -r on simple binaries and that’s pretty much it, so I think a very simple set of APIs would do it for me. It sounds like you need roughly the same abstraction level for llvm-dwp?

I’m not sure I’m following the need for merging Objcopy and Object into one? The Object library is there primarily for reading files (there are some minor exceptions to that, but let’s ignore those for this discussion, as they aren’t universal). The Objcopy library uses the Object library to read an object into its internal representation and then to perform manipulations on that object. This separation of concerns means that clients that don’t care about the modification/writing aspects (e.g. llvm-readobj) aren’t encumbered with them.

The majority of the Objcopy code was recently moved into a library out of the llvm-objcopy executable, so that other clients can make use of it to perform manipulations on objects. I can’t remember (and am too lazy to look) whether reading/writing to/from a memory buffer as opposed to a file has been added yet, but it should be fairly trivial to extend the Objcopy interface to do that. The internal process would be something like memoryBuffer → ObjectFile (i.e. libObject) → Object (i.e. libObjcopy) → memoryBuffer. You’d then be able to do something like the following pseudo-code:

Object obj1 = objcopy::read(memoryBuffer); // Reads memory buffer into ObjectFile internally.
Object obj2 = objcopy::read(memoryBuffer);
for (Section s : obj2)
  obj1.addSection(s);
obj1.write(writeableMemoryBuffer);

I’m not sure I’m following the need for merging Objcopy and Object into one? The Object library is there primarily for reading files (there are some minor exceptions to that, but let’s ignore those for this discussion, as they aren’t universal). The Objcopy library uses the Object library to read an object into its internal representation and then to perform manipulations on that object.

I didn’t realize that was the stated purpose of Object. All the docs I saw said that it was for ‘dealing with’ objects. I also found that the ObjCopy APIs were very restrictive and the ones I would have needed were basically private, so I took that to mean that ObjCopy was meant to only serve llvm-objcopy. I should also note, we wouldn’t want to merge them together, just migrate some of the code from ObjCopy into Object (mainly the modification/writing bits).

@jh7370 are you suggesting that we should leave Object alone and modify ObjCopy to have the public APIs we want? The concern you have here:

This separation of concerns means that clients that don’t care about the modification/writing aspects (e.g. llvm-readobj) aren’t encumbered with them.

feels like something that could be ameliorated by good documentation/directory structure, perhaps? I’m not opposed, per se, to doing the modification in ObjCopy, but it does feel strange to have non-llvm-objcopy users depending on the ObjCopy library, at least to me.

I agree with @jh7370 that libObject is currently for reading object files. We have tried to keep any logic that modifies object files out of libObject, but I am not against lifting some of the logic in objcopy to form a new library.

I can’t speak for all the object formats, but modifying an object file is usually very hard. To modify macho, you need to have essentially the same knowledge as a linker to perform anything that can impact relocations.

To modify macho, you need to have essentially the same knowledge as a linker to perform anything that can impact relocations.

Can you expand a little bit on this? My understanding is that if you have relocatable symbols you can more or less “just update the symbol table” and moosh the text sections together? As you can tell, I’m still learning here :slight_smile:

First of all, editing the symbol table doesn’t affect relocations in the object file if you work within certain limitations. For example, if you rename symbols to something shorter than the original name, that will work. On the other hand, “just update the symbol table” is very hard in the general sense because you don’t know where in the object file an offset into the symbol table is encoded, so you end up rebuilding the entire object file.

The macho edit tools, for example install_name_tool or segedit, all have very limited functionality, and they simply don’t work on all install_names or sections. Operations like strip will fall back to the linker because they need to update the symbol table.
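The “shorter name works” case can be sketched with a toy string table of NUL-terminated names addressed by byte offset (the general shape used by ELF and Mach-O string tables). The function `renameInPlace` is a hypothetical helper, not an LLVM API. It only succeeds when nothing after the name has to move, and it assumes no other name shares storage with the one being overwritten.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstring>
#include <string>
#include <vector>

// Hypothetical helper: overwrite the name stored at `offset` in place.
// Returns false when the new name is longer than the old one, because that
// would shift every later name and invalidate offsets stored elsewhere.
bool renameInPlace(std::vector<char> &strtab, size_t offset,
                   const std::string &newName) {
  assert(!strtab.empty() && strtab.back() == '\0');
  assert(offset < strtab.size());
  char *name = strtab.data() + offset;
  const size_t oldLen = std::strlen(name);
  if (newName.size() > oldLen)
    return false; // would require rebuilding the table and all references
  std::copy(newName.begin(), newName.end(), name);
  // Zero the leftover tail so no stale characters remain after the new name.
  std::fill(name + newName.size(), name + oldLen, '\0');
  return true;
}
```

The table size and every other offset stay unchanged, which is exactly why this restricted edit is safe and the general one is not.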

Ah I see - and if I add symbols without renaming anything I would have to edit the objects that I’m merging in with the updated symbol offsets? I guess every call in the incoming object is potentially affected by this operation.
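A toy model of that bookkeeping, with made-up types (`Sym`, `Reloc`, `Obj`, `mergeInto`) that are not LLVM APIs: when one object’s text and symbols are appended to another’s, every symbol offset and every relocation from the incoming object has to be rebased. Real formats add much more on top (alignment, undefined-symbol resolution, duplicate names, ordering constraints on the symbol table), all of which this sketch ignores.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

struct Sym {
  std::string name;
  uint64_t offset; // offset within the text section
};

struct Reloc {
  uint64_t offset; // location in the text section to be patched
  size_t symIndex; // index into the symbol table
};

struct Obj {
  std::vector<uint8_t> text;
  std::vector<Sym> syms;
  std::vector<Reloc> relocs;
};

// Append src to dst, rebasing everything that refers to src's layout.
void mergeInto(Obj &dst, const Obj &src) {
  const uint64_t textBase = dst.text.size(); // where src's text will start
  const size_t symBase = dst.syms.size();    // where src's symbols will start
  dst.text.insert(dst.text.end(), src.text.begin(), src.text.end());
  for (const Sym &s : src.syms)
    dst.syms.push_back({s.name, s.offset + textBase});
  // Each relocation moves with the text and must point at the new index of
  // the symbol it referenced.
  for (const Reloc &r : src.relocs)
    dst.relocs.push_back({r.offset + textBase, r.symIndex + symBase});
}
```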

I think that would be the appropriate approach, although it’s hard to be confident when talking in the abstract without seeing practical code changes. It’s worth noting that since the ObjCopy library code was originally written specifically for use by llvm-objcopy, there may be bits that aren’t going to work without following the full objcopy pipeline (at least not without additional work). @avl-llvm may have some additional insights as they were the one to actually do the librarifying work, with the intent of using it in a different tool.

One other key difference between the Object and ObjCopy libraries is that the former’s API is primarily intended to be platform agnostic, whereas working with objects when it comes to modifying them is very much platform specific. As such there is only one ObjectFile class, with concrete subclasses per file format, in the Object library, but in the ObjCopy library, there’s one Object class per file format.

Not really. My point wasn’t about where code lives, but rather about how tools are built. At the moment, most tools that rely on the Object library don’t need to add the ObjCopy library too. This improves build times and potentially executable sizes, since they don’t have to pull in a load of code they have no need for. That being said, the “where code lives” point is also a point for not combining them: just as we prefer self-descriptive code over oodles of comments in the code, it is generally better to keep functionality that isn’t intrinsically linked separate.

[quote=“bzcheeseman, post:9, topic:65954”]
I’m not opposed, per se, to doing the modification in ObjCopy, but it does feel strange to have non-llvm-objcopy users depending on the ObjCopy library, at least to me.
[/quote]

If the library was called, e.g. ObjectWriting, would that resolve your concern? The name is more reflective of its historical usage than its intended purpose. As noted above, the code was moved into a separate library specifically for the purpose of being reused elsewhere. If you went back far enough, you might find that the Object library code originally was pulled out of a tool. That doesn’t mean that tools other than that original tool shouldn’t use it - it means precisely the opposite.

Super useful thoughts, thank you!

I hear your concerns re: building tools and linking only what’s necessary, as well as the point about being self-descriptive.

The name is more reflective of its historical usage than its intended purpose. As noted above, the code was moved into a separate library specifically for the purpose of being reused elsewhere. That doesn’t mean that tools other than that original tool shouldn’t use it - it means precisely the opposite.

Agreed! That’s my goal here - to make it more discoverable when inevitably someone else wants to do something similar and to increase re-use. IMO renaming is one thing we can do to help with this.

To make things concrete, how about something like this: Rename Object -> ObjRead, add a new library, ObjWrite, and modify ObjCopy and whoever else is interested to depend on both? That way the domain of each library is super clear. Alternatively, we could do something like

Object/
  Read/
    CMakeLists.txt -> libObjectRead.a
  Write/
    CMakeLists.txt -> libObjectWrite.a

First, ELF is a simple format. It’s not difficult to parse or manipulate by hand. Therefore, the benefit of code sharing has a surprisingly low upper bound.

lib/Object contains some code which is shared among some dumpers (e.g. llvm-nm, llvm-objdump, llvm-readobj) and the reader part of binary manipulation tools (llvm-objcopy).
I think the current llvm-project structure actually strikes a great balance among code sharing, ease of use, flexibility, and performance.
I am not against sharing code where feasible, but I don’t think there are big areas where code sharing can improve without significantly hurting other properties.

Examples where code sharing helps: llvm-nm/llvm-readobj/llvm-objdump/etc share a lot of dumper code.
I’ve made changes like making llvm-objdump reuse the llvm/Object code for symbol table/symbol versioning parsing, etc.

lld uses lib/Object for parsing but not heavily. Library code tends to duplicate validity checking, which negatively affects performance,
or does parsing in a certain order which a high performance linker doesn’t want to stick to.
Honestly I don’t think there is any major code sharing opportunity for at least the COFF/ELF/Mach-O ports.
There is a little code sharing opportunity for some newer Mach-O features.

ELFObjectWriter, llvm-objcopy, llvm-readobj/llvm-objdump/llvm-nm/etc, and lld/ELF all use ELF very differently.
Code sharing may easily build up complex abstractions which hurt some tools which expect different properties.
E.g. ELFObjectWriter prefers a streaming style writer. It does not need an AST-like interface.
A linker doesn’t like an AST-like interface, either. It computes information in different passes and wants fine control over the write order.

For binary manipulation tools, there are a lot of details to consider.
Say you want to add a symbol. There are questions about which symbol table to add it to: .symtab or .dynsym or both.
What are its type/binding/st_other/size/value?
Is the symbol absolute, undefined, or relative to an existing section? If relative to an existing section, how do you locate that section?
Where in the symbol table does it go? How do you put its name in a string table?
The flexibility is great, and many manipulation tools restrict it in a reasonable way.
I think only with very clear requirements can we find grounds to improve the reusability of the existing code (as others have mentioned, mostly lib/ObjCopy now) to help make new use cases easy.
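To show how many of those questions surface as fields, here is a sketch that fills in one ELF64 symbol table entry and appends its name to a string table. The struct mirrors the ELF64 `Elf64_Sym` layout and the constants use the standard ELF values, but the names `Elf64Sym`, `addString`, and `makeSymbol` are hypothetical helpers written for this example, not LLVM APIs. Where the entry goes in the table is left to the caller; in ELF, local symbols must precede global ones, which this sketch does not handle.

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <vector>

// Mirrors the ELF64 symbol table entry layout (24 bytes).
struct Elf64Sym {
  uint32_t st_name;  // offset of the name in the string table
  uint8_t st_info;   // binding in the high 4 bits, type in the low 4 bits
  uint8_t st_other;  // visibility (0 = STV_DEFAULT)
  uint16_t st_shndx; // section index, or a special value such as SHN_UNDEF
  uint64_t st_value; // for relocatable objects: offset within the section
  uint64_t st_size;
};
static_assert(sizeof(Elf64Sym) == 24, "unexpected Elf64Sym layout");

constexpr uint8_t kStbLocal = 0, kStbGlobal = 1, kStbWeak = 2;
constexpr uint8_t kSttNoType = 0, kSttObject = 1, kSttFunc = 2;
constexpr uint16_t kShnUndef = 0, kShnAbs = 0xfff1;

// Append a NUL-terminated name and return its offset. Offset 0 is reserved
// for the empty name, so an empty table gets a leading NUL first.
uint32_t addString(std::vector<char> &strtab, const std::string &name) {
  if (strtab.empty())
    strtab.push_back('\0');
  const uint32_t offset = static_cast<uint32_t>(strtab.size());
  strtab.insert(strtab.end(), name.begin(), name.end());
  strtab.push_back('\0');
  return offset;
}

// Build one symbol entry; every parameter is a decision the caller must make.
Elf64Sym makeSymbol(std::vector<char> &strtab, const std::string &name,
                    uint8_t binding, uint8_t type, uint16_t shndx,
                    uint64_t value, uint64_t size) {
  Elf64Sym sym{};
  sym.st_name = addString(strtab, name);
  sym.st_info = static_cast<uint8_t>((binding << 4) | (type & 0xf));
  sym.st_other = 0; // STV_DEFAULT visibility
  sym.st_shndx = shndx;
  sym.st_value = value;
  sym.st_size = size;
  return sym;
}
```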

What code would go in each of these three different libraries (in particular ObjWrite versus ObjCopy, for example)?

@MaskRay thanks for the info dump - this is super helpful! I think you’re spot on, basically we want to try and improve the reuse of the ObjCopy code.

ObjCopy would pretty much have the conversion from one format to another (i.e. elf to binary, binary to macho, etc.) while delegating the actual “hey please write these bytes and do the bookkeeping in the header/load commands” to ObjWrite.

(Just in case it isn’t clear, I’m certainly not opposed to work to make ObjCopy more usable elsewhere, as long as it doesn’t significantly impact the maintainability of that code).

I’m not sure I understand the need for a separate library then? If the issue is simply that ObjCopy is a bad name for the library, I’d not oppose a renaming necessarily.

My concern is that once the relevant code is made public in ObjCopy it’ll no longer be a good name for the library.

How about this: when I have time ™ I can put up a patch that moves some headers into the ObjCopy public API, and then based on that we can discuss a rename?