[RFC] Embedding Bitcode in Object Files

Apple has an internal implementation for embedding bitcode in object files
that we would like to upstream. It involves a few changes to the clang frontend,
including new clang options, clang driver changes, and utilities to embed bitcode
inside the object file. We believe upstreaming this implementation will benefit
people who would like to develop software for Apple platforms using open source
LLVM. It also helps driver compatibility and aligns with some ongoing
efforts like Thin-LTO, which also has an object wrapper for bitcode.

Embedded Bitcode Design:
Embedded bitcode is designed to enable bitcode distribution without disturbing
the normal development flow. When a program is compiled with bitcode, clang
embeds the optimized bitcode in a special section in the object file, together
with the options that were used during the compilation. The object file still
has the normal TEXT and DATA sections for normal linking. During linking, the
linker checks that all the input object files have embedded bitcode and collects
the bitcode into an archive which is embedded in the output. The archive also
contains all the information that is needed to rebuild the linked binary. All
compilation and linking stages can be replayed to generate the final binary.
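
The link-time check described above can be modeled in a few lines. This is only a sketch under stated assumptions: each object file is represented as a hypothetical in-memory dict from (segment, section) to payload bytes, and the names `collect_bitcode` and `BITCODE_SECTION` are illustrative, not part of any real linker.

```python
# Hypothetical model: each object file is a dict mapping
# (segment, section) -> raw section bytes. This is not a real linker API.
BITCODE_SECTION = ("__LLVM", "__bitcode")


def collect_bitcode(objects):
    """Return (bundle, missing) for a dict of object name -> sections.

    bundle maps object name -> embedded bitcode bytes; missing lists the
    objects that have no bitcode section at all.
    """
    bundle = {}
    missing = []
    for name, sections in objects.items():
        payload = sections.get(BITCODE_SECTION)
        if payload is None:
            missing.append(name)
        else:
            bundle[name] = payload
    return bundle, missing


objects = {
    "a.o": {("__TEXT", "__text"): b"\x90", BITCODE_SECTION: b"BC\xc0\xde\x00"},
    "b.o": {("__TEXT", "__text"): b"\xc3"},
}
bundle, missing = collect_bitcode(objects)
print(sorted(bundle), missing)
```

A linker following this model would report `b.o` as missing bitcode instead of silently producing an output whose bundle cannot rebuild the whole binary.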

There are mainly two parts we would like to upstream first:
1. Clang Driver:
Add a -fembed-bitcode option. When this new option is used, the driver splits
the compilation into 2 stages. The first stage runs the frontend and all the
optimization passes, and the second stage embeds the bitcode from the first
stage and then runs the CodeGen passes. There is also a -fembed-bitcode-marker
option that doesn't split the compilation into 2 stages and only puts a 1-byte
marker into the object file. This is used to speed up debug builds
because bitcode serialization and verification make -fembed-bitcode slower,
especially with -O0 -g. The linker can still check for the presence of the
section to provide feedback if any of the object files participating in the
link is missing bitcode in a full bitcode build.
2. Bitcode Embedding:
Several special sections are used to mark the presence of bitcode
in the MachO file.
"__LLVM, __bitcode" is used to store the optimized bitcode in the object file.
It can instead hold a 1-byte marker to provide diagnostics in debug builds.
"__LLVM, __cmdline" is used to store the clang command-line options. There are
a few options that are not reflected in the bitcode that we would like to replay
in the rebuild. For example, the '-O0' option makes us run FastISel during the
rebuild.
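
To illustrate how a tool might tell the 1-byte marker apart from full embedded bitcode, here is a minimal sketch. It assumes the "__LLVM, __bitcode" payload, when present in full, is a raw bitcode stream starting with the standard magic bytes 'B' 'C' 0xC0 0xDE; the helper name `classify_bitcode_section` is hypothetical.

```python
# Raw LLVM bitcode starts with the magic bytes 'B' 'C' 0xC0 0xDE.
BITCODE_MAGIC = b"BC\xc0\xde"


def classify_bitcode_section(payload):
    """Classify the contents of a "__LLVM, __bitcode" section.

    Hypothetical helper: payload is the raw section bytes, or None when
    the object file has no such section.
    """
    if payload is None:
        return "missing"
    if len(payload) == 1:
        return "marker"   # -fembed-bitcode-marker: 1-byte placeholder
    if payload.startswith(BITCODE_MAGIC):
        return "bitcode"  # -fembed-bitcode: full optimized bitcode
    return "unknown"


print(classify_bitcode_section(b"\x00"))
print(classify_bitcode_section(BITCODE_MAGIC + b"\x00" * 8))
```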

Thanks

Steven

Steven,

This looks like a very interesting feature. It opens new possibilities beyond those you have outlined. I am looking forward to seeing your implementation.

Sergei

Hi Steven,

Can you please explain how this relates to the existing .llvmbc section
feature?

Peter

Hi Peter

It is not currently related because we started the implementation before Thin-LTO
was proposed in the community, but our "__LLVM, __bitcode" section is pretty much
the same as the ".llvmbc" section. Note that ".llvmbc" doesn't really follow the
section naming convention for MachO objects. I am hoping to unify them while
upstreaming the implementation.

Thanks

Steven

That would be my main request. Seems like a nice feature, but we
should have one implementation of it :)

BTW, can you explain a bit why you need things like "-O0" recorded? In
case you want to go from bitcode back to object file one file at a
time (no LTO)? Is that orthogonal? That is, should the command line be
included in .bc files too? What is the command line option that is
included, the -cc1 or the driver one?

There was some discussion in the past about which options get run in
clang if given -flto. For example, it seems likely that a more
conservative inlining pass would be a good thing, so as not to remove
opportunities for link-time inlining. What would happen with
"-flto -fembed-bitcode"? Would the bitcode be the same as with just
-flto and the object file less optimized?

Cheers,
Rafael

Steven,

   I would like to echo Rafael's comments.

My general understanding is that given an object file with embedded IR I should be able to reproduce the same object.
Everything else should be "supporting" that objective... which might include relevant flags and transformations leading _to_ this IR and _from_ this IR to the given object code.

Does my understanding match your overall goal?

Thanks.

Sergei

Hi Sergei and Rafael

Thanks for the comment!

In terms of the bitcode section, my plan is to make the "__LLVM, __bitcode" section the MachO version of the ".llvmbc" section. In the latest Darwin OS, the "__LLVM" segment is not loaded by dyld when you execute a binary with embedded bitcode, which is a plus for this feature.

And for the command line, Sergei has the correct idea about the motivation behind this. We want to have enough information to recreate the same binary from the embedded bitcode (at least when compiled with the same compiler). Here is an example:
$ clang -fembed-bitcode -O0 test.c -c -###
"clang" "-cc1" (...lots of options...) "-o" "test.bc" "-x" "c" "test.c" <--- First stage
"clang" "-cc1" "-triple" "x86_64-apple-macosx10.11.0" "-emit-obj" "-fembed-bitcode" "-O0" "-disable-llvm-optzns" "-o" "test.o" "-x" "ir" "test.bc" <--- Second stage
If we record all the options from the second stage, we can recreate the same object file using the exact same command. So, yes, they are cc1 flags. I understand they are not stable, but the second stage can only have a handful of options that (1) affect codegen and (2) are not embedded in the bitcode and so should be recorded. This list should shrink towards zero eventually (not sure about -O0 and other optimization options). If we have to rename them before removing them from the embedding option list, we can provide an upgrade path for them.
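
As a rough illustration of reducing the second-stage command to the options worth recording, the sketch below drops the output path and the input file from the argument list shown above. The function name `recorded_options` is hypothetical, and the filtering rule ("-o" is followed by the output path, the last argument is the input file) is an assumption for this example, not a description of the actual implementation.

```python
def recorded_options(stage2_args):
    """Drop the output path and input file from a second-stage cc1 argument list.

    Assumptions for this sketch: "-o" is followed by the output path and
    the last argument is the input file. Everything else is kept as-is.
    """
    args = list(stage2_args[:-1])  # drop the trailing input file
    kept = []
    skip_next = False
    for arg in args:
        if skip_next:              # this is the path that followed "-o"
            skip_next = False
            continue
        if arg == "-o":
            skip_next = True
            continue
        kept.append(arg)
    return kept


stage2 = ["-cc1", "-triple", "x86_64-apple-macosx10.11.0", "-emit-obj",
          "-fembed-bitcode", "-O0", "-disable-llvm-optzns",
          "-o", "test.o", "-x", "ir", "test.bc"]
print(recorded_options(stage2))
```

With the input and output paths removed, the remaining list contains no file system paths for this example command.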

This feature is orthogonal to LTO. In my current implementation, "-flto -fembed-bitcode" is the same as "-flto". The linker needs logic to handle an LLVM bitcode file (treated as LTO) and a MachO file with embedded bitcode (treated as a normal link) differently.
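
One way a linker could make that distinction is by looking at the first four bytes of each input. The sketch below is only an illustration: it uses the well-known raw bitcode magic ('B' 'C' 0xC0 0xDE), the bitcode wrapper magic (0x0B17C0DE, little-endian on disk), and the thin MachO magics (0xFEEDFACE, 0xFEEDFACF, in both byte orders). The function name `link_mode` and its return values are hypothetical, and fat/universal files are deliberately not handled.

```python
# Magic numbers as they appear in the first four bytes of a file.
BITCODE_MAGICS = {
    b"BC\xc0\xde",        # raw LLVM bitcode
    b"\xde\xc0\x17\x0b",  # bitcode wrapper header, 0x0B17C0DE little-endian
}
MACHO_MAGICS = {
    b"\xfe\xed\xfa\xce", b"\xce\xfa\xed\xfe",  # 32-bit MachO, either byte order
    b"\xfe\xed\xfa\xcf", b"\xcf\xfa\xed\xfe",  # 64-bit MachO, either byte order
}


def link_mode(header):
    """Decide how an input should be linked, based on its leading bytes."""
    magic = bytes(header[:4])
    if magic in BITCODE_MAGICS:
        return "lto"      # plain bitcode file: goes through LTO
    if magic in MACHO_MAGICS:
        return "normal"   # MachO object, with or without embedded bitcode
    return "unknown"


print(link_mode(b"BC\xc0\xde\x00\x00"))
print(link_mode(b"\xcf\xfa\xed\xfe\x07\x00\x00\x01"))
```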

Thanks

Steven

Without knowing more details of your implementation, I'd be concerned about
how this might impact deterministic/reproducible builds.

Source paths are recorded in a number of places, but you can typically fix
that by using -fdebug-prefix-map. But if the entire command-line including
the -fdebug-prefix-map argument gets stored in the output too, then you
still have a problem.

I don’t think we need any path in the command line section. We only record the command-line options that will affect CodeGen. See my example in one of the previous replies:

$ clang -fembed-bitcode -O0 test.c -c -###
"clang" "-cc1" (...lots of options...) "-o" "test.bc" "-x" "c" "test.c" <--- First stage
"clang" "-cc1" "-triple" "x86_64-apple-macosx10.11.0" "-emit-obj" "-fembed-bitcode" "-O0" "-disable-llvm-optzns" "-o" "test.o" "-x" "ir" "test.bc" <--- Second stage

I can’t think of any source path that can affect CodeGen. The only paths in the second stage should be the bitcode input path and the binary output path, and they are excluded from the command line section as well. -fdebug-prefix-map is consumed by the front-end, and the prefixed paths are part of the debug info in the metadata. You don’t need to encode -fdebug-prefix-map in the bitcode section to reproduce the object file with the same debug info. Does that answer your concern?

Thanks

Steven

Great -- it wasn't clear from the first message if you were just embedding
the whole command-line as is. If the plan is instead to embed only a few
relevant options, I agree there should be no issue as far as paths go.

I don’t know whether this is an issue in the current implementation, but I wanted to bring up a potential privacy issue.

In embedding the information, care should be taken to avoid embedding any information that contains personally identifiable information. This can certainly occur if paths need to be embedded, as user names, or other private/confidential information may be present in the naming of directories and paths. Generally, I suspect that it would be desirable to have an opt-in strategy for designating in the compiler which pieces of information/options need to be saved, and for all options marked as needed, determine whether there is the possibility/likelihood that they may contain personally identifiable information.

Kevin B. Smith

Hal,

No, it is not more of a problem than with DWARF info. DWARF info definitely contains personally identifiable information. However, people usually realize that is the case, and will turn off or strip debug info if they are worried about such issues, or make a specific plan to cleanse that information.

You really just want to attempt to eliminate such information to the greatest extent possible. The desirability of using embedded bitcode in libraries (which is a very natural use model that I’m pretty sure this is intended to support) will be improved by taking this aspect of the implementation into consideration.

Kevin Smith

Hi,

There is not only DWARF but any use of the macro __FILE__ (so any assertions for instance).
I wouldn't expect the bitcode to contain any more (or less) information than the binary.
The options for the optimizer/codegen shouldn't need any "sensitive" information.

Hi Kevin

That is a very good concern, and we have ways to address the issue in our bitcode implementation to achieve something similar to ‘strip’ (hiding unnecessary symbols and debug info). It wasn’t in the proposal because we would like to get the basics in before diving into something more detailed and controversial.
Here is a short description of how we deal with the issue. Our implementation requires linker support, which runs a ‘Linkage-Unit’ pass that consistently renames all the symbols and metadata that are not exported. This has to be done after resolving all the symbols. We would be happy to upstream our implementation if it is beneficial.
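
To make the idea of consistent renaming concrete, here is a toy sketch. It is not the actual pass: modules are modeled as plain lists of symbol names, the set of exported symbols is assumed to be known already (that is, symbol resolution has finished), and the `__hidden_N` naming scheme and the function name `obfuscate_symbols` are invented for illustration.

```python
def obfuscate_symbols(modules, exported):
    """Rename every non-exported symbol consistently across all modules.

    modules:  dict of module name -> list of symbol names (toy model)
    exported: set of symbol names that must keep their original names
    Returns (renamed_modules, mapping from original to hidden name).
    """
    mapping = {}
    renamed = {}
    for mod_name in sorted(modules):       # fixed order keeps output stable
        out = []
        for sym in modules[mod_name]:
            if sym in exported:
                out.append(sym)            # exported names stay visible
                continue
            if sym not in mapping:         # first sighting: pick a new name
                mapping[sym] = f"__hidden_{len(mapping)}"
            out.append(mapping[sym])       # same symbol -> same new name
        renamed[mod_name] = out
    return renamed, mapping


modules = {"a": ["_main", "_helper"], "b": ["_helper", "_util"]}
renamed, mapping = obfuscate_symbols(modules, exported={"_main"})
print(renamed)
```

The point of the shared mapping is that a symbol referenced from several modules receives the same replacement name everywhere, so references between modules still resolve after renaming.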

Thanks

Steven

__FILE__ is a frontend issue, I still have to add some equivalent to my
remap patches for that into clang...

Joerg

I don’t know what is involved in upstreaming that, but yes, it seems very useful/necessary to me.

My 2c…

Benefits of the feature clearly outweigh any potential privacy concerns from my point of view… and yes, there are multiple ways to deal with privacy even if it is an issue.

Sergei

Thanks everyone for the feedback. I will send out patches for the feature very shortly. Upstreaming bitcode obfuscation to handle the privacy issue is next on my list, but it will be a separate discussion.

Steven