Proposal: support object file-based on-disk module format

Hi,

Over in [1] we've been discussing adding support in LTO for an object
file-based on-disk module format. Rafael suggested that I send a proposal
to this list; this is that proposal.

As motivation, consider a compiler that needs to store metadata in the
LTO object file that may need to be read by future compilation steps,
such as the "export data" used by some Go compilers [2]. Such metadata
might also need to be read by external tools which do not know about LLVM,
so a good choice of file format would be something relatively stable and
well understood, readable without depending on LLVM and compatible with the
non-LTO scenario. This makes the platform's native object file format a
good candidate for the outermost file format, with the metadata and IR
stored in separate sections.

The basic proposal is that as an alternative on-disk representation for IR, we
also support native object files (i.e. ELF/COFF/Mach-O) with a section named
'.llvmbc' containing the bitcode in the same format that we are using now,
and no other (allocatable) sections. The actual support needed in LLVM would
be limited to consumers, i.e. LTO infrastructure: linker plugins, llvm-ar,
llvm-nm etc. We would not necessarily need to teach other bitcode consumers
(e.g. llvm-dis) about this format or add any producers to the tree, but
it may be useful as a matter of convenience to do so.
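
To make the proposed layout concrete, here is a rough sketch in Python of
both sides for the ELF case only. It is an illustration, not code from any
patch: build_ir_object, read_sections, find_bitcode and is_ir_only_object
are hypothetical names, the producer assumes a 64-bit little-endian
x86-64 target, and "no other (allocatable) sections" is read as "no section
other than .llvmbc has SHF_ALLOC set". The result has not been validated
against real linkers.

```python
import struct

BITCODE_MAGIC = b"BC\xc0\xde"   # first four bytes of a raw LLVM bitcode stream
SHF_ALLOC = 0x2                 # ELF flag: section occupies memory at run time
SHT_NOBITS = 8                  # ELF type: section has no bytes in the file
SECTION_HEADER = "<IIQQQQIIQQ"  # ELF64 little-endian section header (64 bytes)


def build_ir_object(bitcode):
    """Hypothetical producer: wrap bitcode in a minimal ELF64 relocatable file."""
    names = b"\0.llvmbc\0.shstrtab\0"
    bc_off = 64                              # payload follows the ELF header
    names_off = bc_off + len(bitcode)
    padding = -(names_off + len(names)) % 8  # align the section header table
    shoff = names_off + len(names) + padding
    ident = b"\x7fELF" + bytes([2, 1, 1, 0]) + bytes(8)  # 64-bit, little-endian
    header = ident + struct.pack(
        "<HHIQQQIHHHHHH",
        1, 62, 1,        # ET_REL, EM_X86_64 (arbitrary choice), EV_CURRENT
        0, 0, shoff, 0,  # entry point, program header offset, e_shoff, flags
        64, 0, 0,        # ELF header size, program header size and count
        64, 3, 2,        # section header size, count, index of .shstrtab
    )
    sections = [
        struct.pack(SECTION_HEADER, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0),  # null section
        struct.pack(SECTION_HEADER, 1, 1, 0, 0, bc_off, len(bitcode), 0, 0, 1, 0),
        struct.pack(SECTION_HEADER, 9, 3, 0, 0, names_off, len(names), 0, 0, 1, 0),
    ]
    return header + bitcode + names + bytes(padding) + b"".join(sections)


def read_sections(data):
    """Yield (name, flags, contents) for each section of an ELF64 LE image."""
    if data[:4] != b"\x7fELF" or data[4] != 2 or data[5] != 1:
        raise ValueError("not a 64-bit little-endian ELF file")
    (shoff,) = struct.unpack_from("<Q", data, 40)
    shentsize, shnum, shstrndx = struct.unpack_from("<HHH", data, 58)
    headers = [
        struct.unpack_from(SECTION_HEADER, data, shoff + i * shentsize)
        for i in range(shnum)
    ]
    names_off, names_size = headers[shstrndx][4], headers[shstrndx][5]
    names = data[names_off:names_off + names_size]
    for name_off, sh_type, flags, _addr, offset, size, *_rest in headers:
        name = names[name_off:names.index(b"\0", name_off)].decode()
        body = b"" if sh_type == SHT_NOBITS else data[offset:offset + size]
        yield name, flags, body


def find_bitcode(data):
    """Hypothetical consumer: return the contents of .llvmbc, or None."""
    for name, _flags, body in read_sections(data):
        if name == ".llvmbc":
            return body
    return None


def is_ir_only_object(data):
    """True if .llvmbc holds bitcode and no other section is allocatable."""
    sections = list(read_sections(data))
    has_bitcode = any(
        name == ".llvmbc" and body.startswith(BITCODE_MAGIC)
        for name, _flags, body in sections
    )
    other_alloc = any(
        flags & SHF_ALLOC for name, flags, _body in sections if name != ".llvmbc"
    )
    return has_bitcode and not other_alloc
```

A consumer in this style would call find_bitcode on the file contents and
hand the returned bytes to the existing bitcode reader unchanged, since the
payload is in the same format as today. COFF and Mach-O would need their own
container parsing.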

We can also consider extending this format by generating code into the
object file, such as for functions which we believe at compile time to
be cold, or for all functions if we want the decisions to be made at link
time. This may be beneficial for C/C++ compilation as it may allow us to
parallelize/deduplicate the code generation work for at least some functions.

Thanks,

FWIW I always thought it was a little silly that Clang produces .o files for LTO that aren’t the native object file format.

Hi Peter,

This sounds sensible to me.

There is one thing that does concern me though. IIRC when you create
object files with additional sections, the GNU ld linker (and possibly
others too) will concatenate sections it doesn't recognise into the
final executable.
There is a hacky tool called whole-program-llvm [1] which
uses this to get a list of paths to the LLVM bitcode files that
make up the final executable.

If I've understood your proposal correctly then when compiling and
using the GNU ld linker you would end up with all the bitcode files
embedded in the final executable. Is this intentional?

[1] https://github.com/travitch/whole-program-llvm

Thanks,
Dan Liew.

If the linker never sees the intermediate object files, this will not
happen. This is the case under the current proposal. However, if we codegen
into the object files, we might want to make those object files visible to
the linker. In which case, the compiler can use an object-format-specific
exclude flag [1] to exclude those sections from the executable or DSO.
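
For ELF, the flag in question is, to the best of my understanding,
SHF_EXCLUDE (0x80000000), which GNU as sets when the .section directive is
given the "e" flag. The Python sketch below only shows how a tool could test
for it in one ELF64 little-endian section header; make_section_header and
is_excluded_from_output are hypothetical helper names, and COFF and Mach-O
use different mechanisms that are not covered here.

```python
import struct

SHF_ALLOC = 0x2            # section occupies memory at run time
SHF_EXCLUDE = 0x80000000   # section is dropped from executables and DSOs
SECTION_HEADER = "<IIQQQQIIQQ"  # ELF64 little-endian section header (64 bytes)


def make_section_header(flags, offset=64, size=16):
    """Hypothetical helper: pack a SHT_PROGBITS section header with given flags."""
    # Fields: sh_name, sh_type, sh_flags, sh_addr, sh_offset, sh_size,
    #         sh_link, sh_info, sh_addralign, sh_entsize
    return struct.pack(SECTION_HEADER, 1, 1, flags, 0, offset, size, 0, 0, 1, 0)


def is_excluded_from_output(section_header):
    """True if sh_flags of this section header has SHF_EXCLUDE set."""
    _name, _type, flags = struct.unpack_from("<IIQ", section_header, 0)
    return bool(flags & SHF_EXCLUDE)
```

Under this scheme a .llvmbc section emitted next to generated code would
carry SHF_EXCLUDE so that the bitcode does not end up in the final
executable or DSO.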

Thanks,

We (current hat: FreeBSD) would like to be able to leave the LLVM IR in programs and shared libraries so that packaging tools can run microarchitecture-specific optimisations on the result, either offline or at install time (and a few other things, including applying software diversity techniques and so on).

David

> If the linker never sees the intermediate object files, this will not
> happen. This is the case under the current proposal.

Ah, I see. I completely misread your original post; for some reason I
thought you were proposing object files that had LLVM IR and the
corresponding codegen together. Sorry for the noise.

> However, if we codegen
> into the object files, we might want to make those object files visible to
> the linker. In which case, the compiler can use an object-format-specific
> exclude flag [1] to exclude those sections from the executable or DSO.
>
> [1] https://sourceware.org/binutils/docs/as/Section.html

Cool, I did not know about those flags.

Thanks,
Dan.

That's an interesting use case, but I think it is to some extent orthogonal
to changes to the intermediate object format. It should be possible to
teach the LTO plugin to emit the combined bitcode as a (non-excluded)
.llvmbc section in the object file that the linker sees.

Thanks,

Sorry for the delay in replying.

I agree that this looks like an interesting feature to have. Having
worked with GCC's LTO in the past, the one thing I would like to avoid
is a failure mode where a non-LTO build is done instead, which is
possible with a complete fat native object. I understand that is not the
case in your current proposal since the object files only have
auxiliary metadata, not text sections. This is just something to keep
in mind as things evolve.

If I remember correctly one of the issues with the original patches
was the handling of MemoryBuffer ownership. Trunk has switched to
object::Binary holding just a reference to the memory, so this should
be easier to implement now.

So, it looks like everyone is OK at least with the idea of supporting
IR-in-Object, so would you mind rebasing your patches on top of
current trunk?

Thanks,
Rafael