Exploring alternative object file formats

Hello!

As part of my PhD research I would like to explore some alternative object file formats.
I believe LLVM is a great codebase to look into.

If I think about the stages of a "compiler":
compiler → assembler → linker → ELF

I am keen to see where I can modify the assembler & linker to understand a new intermediate object file format.

I did see there are individual tools such as llvm-mc, but if I want to test compiling real programs I will need to start with a frontend, since it drives linking with the necessary startup files like crti.o.

Thank you!

Obvious question: why?

LLVM has linkers for ELF, COFF, and MachO. IBM is bringing XCOFF and, more slowly, GOFF to LLVM. It costs them a lot of effort, and there are no in-tree linkers for those formats.

Re: Why?

PhD research to investigate alternative formats is itself its own why.
I have some ideas on laying out the object file format that can help with linking.

Is your recommendation to look at where ELFObjectFile.h is used and plug in my own?

https://reviews.llvm.org/D152834

Here is the next code layout algorithm.

Is CodeLayout.h where it starts picking an object file format?
I’ll take a read.

(Note that LLVM has an “integrated assembler” for some targets - so in some cases no assembly code is generated (in the textual sense, at least) - it goes straight to object code. You can look in MC*Streamer (MCAsmStreamer is for textual assembly, MCObjectStreamer is for the integrated straight-to-object-code assembler) and how it handles the current COFF/MachO/ELF formats, and you could add your own format there.)

If you’re interested in ideas: I’d certainly love to see a format that does a better job with symbol names (symbol names for large template-heavy code can make up 10-20% of object size). Keeping the symbols compressed and using, say, uniform-sized symbol hashes for linker symbol resolution could save time in the linker and reduce object size (then only decompress the mangled names for linker error messages, or for linker-exported symbol names).
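To make the hashing idea concrete, here is a minimal sketch in Python (chosen only for brevity). Everything in it is hypothetical: `SymbolTable`, `symbol_key`, the 8-byte BLAKE2b key, and the zlib-compressed name blob are assumptions for illustration, not part of any existing object format or LLVM API.

```python
import hashlib
import zlib

KEY_SIZE = 8  # assumed fixed key width in bytes; a real format would tune this


def symbol_key(mangled: str) -> bytes:
    """Return a fixed-size key for a mangled symbol name."""
    return hashlib.blake2b(mangled.encode(), digest_size=KEY_SIZE).digest()


class SymbolTable:
    """Resolves symbols by fixed-size hash; names stay compressed until needed."""

    def __init__(self, names):
        # Names are assumed unique and free of NUL bytes.
        self._index = {}
        for i, name in enumerate(names):
            key = symbol_key(name)
            if key in self._index:
                raise ValueError(f"hash collision or duplicate name: {name}")
            self._index[key] = i
        # All names live in one compressed blob, separated by NUL.
        self._blob = zlib.compress("\0".join(names).encode())

    def resolve(self, key: bytes):
        """Look up a symbol index by hash alone; no string comparison."""
        return self._index.get(key)

    def name_of(self, index: int) -> str:
        """Decompress names only on demand, e.g. for an error message."""
        return zlib.decompress(self._blob).decode().split("\0")[index]
```

Resolution touches only the fixed-size keys, and the name blob is decompressed only when a diagnostic needs the mangled name. A real design would also need a policy for hash collisions, which this sketch simply rejects.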

You could also build the format with something more like MachO’s “subsections via symbols” to save space compared to ELF’s per-function section overhead (if you’re using -ffunction-sections/--gc-sections), though it would probably also need “alt entry” for cases where you do want a symbol not to break up a contiguous region.
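As a rough illustration of splitting one section at symbol boundaries, the sketch below cuts a section into atoms at every symbol except those marked as alt entries. The `Symbol` record and `split_subsections` function are hypothetical names invented here; this is not how any particular linker implements the feature.

```python
from dataclasses import dataclass


@dataclass
class Symbol:
    name: str
    offset: int              # byte offset within the section
    alt_entry: bool = False  # True: do not start a new subsection here


def split_subsections(section_size, symbols):
    """Return (start, end, names) atoms, split at each non-alt-entry symbol.

    Bytes before the first symbol are ignored in this sketch.
    """
    atoms = []
    for sym in sorted(symbols, key=lambda s: s.offset):
        if sym.alt_entry and atoms:
            # An alt entry is an extra label inside the current atom.
            atoms[-1][2].append(sym.name)
        else:
            atoms.append([sym.offset, None, [sym.name]])
    # Each atom ends where the next one starts, or at the section end.
    for i, atom in enumerate(atoms):
        atom[1] = atoms[i + 1][0] if i + 1 < len(atoms) else section_size
    return [(start, end, names) for start, end, names in atoms]
```

For example, a 32-byte section with symbols `f` at 0, an alt entry `f_alt` at 8, and `g` at 16 yields two atoms, `[0, 16)` holding `f` and `f_alt`, and `[16, 32)` holding `g`. Each atom can then be kept or discarded independently without a section header per function.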

I’d suggest that there could be a stronger distinction between relocatable and executable formats. What’s good for the former (the kinds of things @dblaikie mentioned) isn’t necessarily of any use at all to the latter. An executable format would really want to be optimized for fast loading into a process, IMO. Back in the day I worked on a system where process pages could be mapped directly from the executable, and therefore did not require a separate backing store (they took up no space in the page/swap file). I can’t say I’ve studied this aspect of ELF closely, but my impression is that while you could set it up that way, it’s not at all required, and so loaders don’t take advantage. (Someone who knows more about ELF in its executable persona might well correct me on this.)
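For readers unfamiliar with file-backed mappings, here is a small sketch of the general mechanism using Python's standard `mmap` module. It only shows mapping a file read-only into the current process; `map_read_only` is a hypothetical helper, and nothing here reflects how any real loader treats ELF segments.

```python
import mmap


def map_read_only(path):
    """Map a whole file read-only; pages are backed by the file itself.

    The file must be non-empty, since an empty file cannot be mapped.
    """
    with open(path, "rb") as f:
        # Length 0 means "map the entire file". The mapping stays valid
        # after the file object is closed.
        return mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
```

Slicing the returned object (for example `mapping[:4]` to read a magic number) reads straight from the mapped pages, which the operating system can drop and re-read from the file rather than writing them to swap.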

It depends on whether your target group is mobile applications. There you pay for each kilobyte of disk space, and startup time has to be optimized to the fullest.

Or your target group is server applications, where, as seen above, space consumption and code layout matter more.

Thanks everyone for the answers so far.

I am interested in something that is actually not space efficient but chooses to optimize access, UX and exploration.

I am most interested in relocatable objects because there is more metadata to expose and work with, such as the relocations themselves. As @pogo59 mentioned, by the time it gets to the executable, the final product is closer to being directly mmap’d.

If there is a particular file that would help me explore trying new object file formats please let me know.

My plan of attack so far was:

  1. look at llvm-mc
  2. add a new “filetype” to its support and work there

I can then use clang or whatever to generate the assembly with “-S” and pipe it to llvm-mc.
(Bonus points if I can figure out how to directly wire Clang through to the changes I introduce to llvm-mc itself and then can save myself figuring out multiple individual commands)
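The two-step flow in that plan can be sketched as a small driver script. The helper names `build_commands` and `run_pipeline` are hypothetical; the flags used (`clang -S --target=`, `llvm-mc -triple= -filetype=`) are existing ones, and as far as I know `-filetype` in current llvm-mc only selects between `asm`, `obj` and `null`, with the container format following from the triple, so any new value passed here assumes a locally patched llvm-mc.

```python
import os
import shutil
import subprocess


def build_commands(source, triple, filetype="obj"):
    """Return the clang and llvm-mc command lines for one source file."""
    stem = os.path.splitext(source)[0]
    asm, obj = stem + ".s", stem + ".o"
    return [
        # Step 1: the frontend emits textual assembly.
        ["clang", "-S", f"--target={triple}", source, "-o", asm],
        # Step 2: llvm-mc assembles it; a patched build could accept
        # a new (hypothetical) -filetype value here.
        ["llvm-mc", f"-triple={triple}", f"-filetype={filetype}", asm, "-o", obj],
    ]


def run_pipeline(source, triple, filetype="obj"):
    """Run both steps, failing early if a tool is not on PATH."""
    for cmd in build_commands(source, triple, filetype):
        if shutil.which(cmd[0]) is None:
            raise FileNotFoundError(f"{cmd[0]} not found on PATH")
        subprocess.run(cmd, check=True)
```

Calling `run_pipeline("hello.c", "x86_64-unknown-linux-gnu")` would produce `hello.s` and then `hello.o`. Linking against the startup files is still left to a separate step.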

I started a personal branch pursuing this – if you are interested in following along DM me.

@fzakaria, in case it is useful to you. I’ve been adding support for the DXContainer format that is used by the DirectX backend for shader kernels to be loaded by DirectX drivers.

It’s a pretty primitive format and not one I would recommend as a format, but it is simple, and it is one of the more recently added formats so it could provide a bit of a recent roadmap for you.

I also strongly recommend adding MC, Object and ObjectYAML support in tandem. Adding the three together gives you the ability to also build out testing infrastructure alongside the generation code.

Is it in main?

I haven’t come across it yet when browsing the source code.

That sounds perfect because my goals are fine with a simple format as well :smiley_cat:

Yes it is in main, but it is definitely unsuitable for use with targets other than DirectX. It is just simple enough that it might give you a roadmap, but DirectX doesn’t encode instructions, so it isn’t full featured enough to be useful for CPU architectures.

It has support in BinaryFormat, Object, MC and ObjectYAML.
