End-to-end -fembed-bitcode .llvmbc and .llvmcmd

mtrofin · August 25, 2020, 2:10am

Hello,

I’m trying to understand how .llvmbc and .llvmcmd fit into an end-to-end story. From the RFC, and reading through the implementation, I’m piecing together that the goal was to enable capturing IR right after clang and before passing it to LLVM’s optimization passes, as well as the command line options needed for later compiling that IR to the same native object it was compiled to originally (with the same compiler).

Here’s what I don’t understand: say you have a.o and b.o compiled with -fembed-bitcode=all. They are linked into a binary called my_binary. How do you re-create the corresponding IR for modules a and b (let’s call them a.bc and b.bc), and their corresponding command lines? From what I can tell, the linker just concatenates the IR for a and b in my_binary’s .llvmbc, and the same for the command line in .llvmcmd. Is there a separator maybe I missed? For .llvmcmd, I could see how maybe -cc1 could be that separator, what about the .llvmbc part? The magic number?
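For illustration only, the -cc1 idea could be sketched as below. It assumes that the arguments in .llvmcmd are NUL-separated and that every recorded command starts with -cc1; neither is confirmed in this thread. splitCommands is a hypothetical helper, not an LLVM API.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Splits the raw bytes of a concatenated .llvmcmd section into commands.
// Assumption: arguments are NUL-separated and each command begins with -cc1.
std::vector<std::vector<std::string>> splitCommands(const std::string &Section) {
  std::vector<std::vector<std::string>> Commands;
  std::size_t Pos = 0;
  while (Pos < Section.size()) {
    std::size_t End = Section.find('\0', Pos);
    if (End == std::string::npos)
      End = Section.size();
    std::string Arg = Section.substr(Pos, End - Pos);
    Pos = End + 1;
    if (Arg.empty())
      continue; // consecutive NULs, e.g. padding between input sections
    // Start a new command at each -cc1 (or at the very first argument).
    if (Arg == "-cc1" || Commands.empty())
      Commands.emplace_back();
    assert(!Commands.empty());
    Commands.back().push_back(Arg);
  }
  return Commands;
}
```

If some option takes "-cc1" as its value, or the flag is not recorded at all, this split gives the wrong answer, which is why a separator "by design" would be preferable.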

Thanks!

cachemeifyoucan · August 27, 2020, 6:17pm

Hi Mircea

The RFC you mentioned describes a Darwin-specific implementation, which was later extended to support other targets. The main use case for the embed-bitcode option is to let the compiler pass intermediate IR and command-line flags along in the object file it produces, for later use. On Darwin, it is used for bitcode recompilation; others might use it to achieve other goals.

In order to use this information properly, you need to have tools that understand the layout and sections for embedded bitcode. You can’t just use an ordinary linker, because, like you said, an ELF linker will just append the bitcode. Depending on what you are trying to achieve, you need to implement the downstream tools (linker, binary analysis tools, etc.) so that they understand this concept.

Steven

yotann · August 28, 2020, 1:57am

Hi Mircea,

If you use an ordinary linker that concatenates .llvmbc sections, you can use this code to get the size of each bitcode module. As far as I know, there’s no clean way to separate the .llvmcmd sections without making assumptions about what options were used.

// Given a bitcode file followed by garbage, get the size of the actual
// bitcode. This only works correctly with some kinds of garbage (in
// particular, it will work if the bitcode file is followed by zeros, or if
// it's followed by another bitcode file).
size_t GetBitcodeSize(MemoryBufferRef Buffer) {
  const unsigned char *BufPtr =
      reinterpret_cast<const unsigned char *>(Buffer.getBufferStart());
  const unsigned char *EndBufPtr =
      reinterpret_cast<const unsigned char *>(Buffer.getBufferEnd());
  if (isBitcodeWrapper(BufPtr, EndBufPtr)) {
    const unsigned char *FixedBufPtr = BufPtr;
    // On success, EndBufPtr is moved to the end of the wrapped bitcode.
    if (SkipBitcodeWrapperHeader(FixedBufPtr, EndBufPtr, true))
      report_fatal_error("Invalid bitcode wrapper");
    return EndBufPtr - BufPtr;
  }
  if (!isRawBitcode(BufPtr, EndBufPtr))
    report_fatal_error("Invalid magic bytes; not a bitcode file?");
  BitstreamCursor Reader(Buffer);
  Reader.Read(32); // skip signature
  while (true) {
    size_t EntryStart = Reader.getCurrentByteNo();
    BitstreamEntry Entry =
        Reader.advance(BitstreamCursor::AF_DontAutoprocessAbbrevs);
    if (Entry.Kind == BitstreamEntry::SubBlock) {
      if (Reader.SkipBlock())
        report_fatal_error("Invalid bitcode file");
    } else {
      // We must have reached the end of the module.
      return EntryStart;
    }
  }
}
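To split a whole concatenated section, a size routine like the one above would have to be called repeatedly. Here is a minimal standalone sketch of that loop. SizeFn and splitConcatenated are hypothetical names, the callback merely stands in for GetBitcodeSize, and skipping zero bytes between modules is an assumption about how the linker pads input sections.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <string>
#include <vector>

// Stand-in for a routine like GetBitcodeSize: given the bytes starting at a
// module, return that module's size in bytes (0 if it cannot be determined).
using SizeFn = std::function<std::size_t(const std::string &)>;

std::vector<std::string> splitConcatenated(const std::string &Section,
                                           const SizeFn &SizeOfFirst) {
  assert(SizeOfFirst && "size callback must be set");
  std::vector<std::string> Modules;
  std::size_t Pos = 0;
  while (Pos < Section.size()) {
    // Assumption: any zero byte between modules is alignment padding.
    if (Section[Pos] == '\0') {
      ++Pos;
      continue;
    }
    std::size_t Size = SizeOfFirst(Section.substr(Pos));
    if (Size == 0 || Size > Section.size() - Pos)
      break; // unrecognized data: stop rather than guess
    Modules.push_back(Section.substr(Pos, Size));
    Pos += Size;
  }
  return Modules;
}
```

The callback keeps the loop independent of how a module's size is found, so the same loop would work with a length-prefixed format.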

Sean

mtrofin · August 28, 2020, 5:25am

Thanks, Sean, Steven,

to explore this a bit further, are there currently users for non-Darwin cases? I wonder if it would be an issue if we inserted markers in the section (maybe as an opt-in, if there were users), such that, when concatenated, the resulting section would be self-describing, for a specialized reader, of course - basically, achieve what Sean described, but “by design”.

For instance, each .o file could have a size, followed by the payload (maybe include in the payload the name of the module, too; maybe compress it, too). Same for the .llvmcmd case.
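A rough sketch of what "size, followed by the payload" could look like, purely to make the idea concrete. The 8-byte little-endian size field and the names appendRecord and splitRecords are assumptions made for this example; nothing like this exists in LLVM today.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical framing: [u64 little-endian payload size][payload bytes].
void appendRecord(std::string &Section, const std::string &Payload) {
  std::uint64_t Size = Payload.size();
  for (int I = 0; I < 8; ++I)
    Section.push_back(static_cast<char>((Size >> (8 * I)) & 0xff));
  Section += Payload;
}

// Splits a concatenation of framed records back into payloads.
// Returns false if a size field or payload is truncated.
bool splitRecords(const std::string &Section,
                  std::vector<std::string> &Payloads) {
  std::size_t Pos = 0;
  while (Pos < Section.size()) {
    if (Section.size() - Pos < 8)
      return false; // not enough bytes left for a size field
    std::uint64_t Size = 0;
    for (int I = 0; I < 8; ++I)
      Size |= static_cast<std::uint64_t>(
                  static_cast<unsigned char>(Section[Pos + I]))
              << (8 * I);
    Pos += 8;
    if (Size > Section.size() - Pos)
      return false; // payload would run past the end of the section
    Payloads.push_back(Section.substr(Pos, static_cast<std::size_t>(Size)));
    Pos += static_cast<std::size_t>(Size);
  }
  assert(Pos == Section.size());
  return true;
}
```

This only survives linking if the linker concatenates the input sections without inserting padding, so the section alignment would have to be chosen accordingly.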

dblaikie · August 28, 2020, 6:22pm

You should probably pull in some folks who implemented/maintain the feature for Darwin.

I guess they aren’t linking this info, but only communicating in the object file between tools - maybe they flag these sections (either in the object, or by the linker) as ignored/dropped during linking. That semantic could be implemented in ELF too by marking the sections SHF_IGNORED or something (same-file split DWARF uses this technique).

So maybe the goal/desire is to have a different semantic, rather than the equivalent semantic being different on ELF compared to MachO.

So if it’s a different semantic - yeah, I’d guess a flag that prefixes the module metadata with a length would make sense, then it can be linked naturally on any platform. (if the “don’t link these sections” support on Darwin is done by the linker hardcoding the section name - then maybe this flag would also put the data in a different section that isn’t linker stripped on Darwin, so users interested in getting everything linked together can do so on any platform)

But if this data is linked, then it’d be hard to know which command line goes with which module, yes? So maybe it’d make sense then to have the command line as a header before the module, in the same section. So they’re kept together.

mtrofin · August 28, 2020, 6:57pm

This last point was my follow-up

MaskRay · August 28, 2020, 9:16pm

You should probably pull in some folks who implemented/maintain the
feature for Darwin.

I guess they aren't linking this info, but only communicating in the
object file between tools - maybe they flag these sections (either in the
object, or by the linker) as ignored/dropped during linking. That semantic
could be implemented in ELF too by marking the sections SHF_IGNORED or
something (same-file split DWARF uses this technique).

The .llvmbc / .llvmcmd section does not have the SHF_EXCLUDE flag. It
will be retained in the linked image.

So maybe the goal/desire is to have a different semantic, rather than the
equivalent semantic being different on ELF compared to MachO.

So if it's a different semantic - yeah, I'd guess a flag that prefixes the
module metadata with a length would make sense, then it can be linked
naturally on any platform. (if the "don't link these sections" support on
Darwin is done by the linker hardcoding the section name - then maybe this
flag would also put the data in a different section that isn't linker
stripped on Darwin, so users interested in getting everything linked
together can do so on any platform)

But if this data is linked, then it'd be hard to know which command line
goes with which module, yes? So maybe it'd make sense then to have the
command line as a header before the module, in the same section. So they're
kept together.

This last point was my follow-up

A module has a source_filename field.

clang -fembed-bitcode=all -c d/a.c
llvm-objcopy --dump-section=.llvmbc=a.bc a.o /dev/null
llvm-dis < a.bc => source_filename = "d/a.c"

The missing piece is a mechanism to extract a module from concatenated
bitcode (llvm-dis supports multi-module bitcode but not concatenated
bitcode; see https://reviews.llvm.org/D70153, "Add support for multi-module
bitcode files to llvm-dis"). I'll be happy to look into it :)

dblaikie · August 28, 2020, 9:27pm

You should probably pull in some folks who implemented/maintain the
feature for Darwin.

I guess they aren’t linking this info, but only communicating in the
object file between tools - maybe they flag these sections (either in the
object, or by the linker) as ignored/dropped during linking. That semantic
could be implemented in ELF too by marking the sections SHF_IGNORED or
something (same-file split DWARF uses this technique).

The .llvmbc / .llvmcmd section does not have the SHF_EXCLUDE flag. It
will be retained in the linked image.

Ah, yes, I understand that’s the current situation - I meant “if dropping during linking is the semantic that’s already implemented for MachO/ld64 (either with some MachO attribute, or hardcoded behavior in ld64) where the feature is fully-fledged/working-as-intended, we could match that semantic on ELF by using SHF_EXCLUDE”.

& then designing a separate but related feature for “I want bitcode and build commands that end up in the final linked binary” - at which point it might be a different format (using a length prefix) and at that point maybe consider putting the build command in the header rather than in a separate section.

mtrofin · August 28, 2020, 9:31pm

.llvmcmd may need the source file to be more useful.

Right - I think, for the non-Darwin concatenated case, all three of us (David, you, and I) are thinking along the lines of keeping together the module name, the bitcode, and the command line - effectively not using .llvmcmd, and being able to correctly extract, by design, the rest of the information.

yotann · August 30, 2020, 2:22am

Here's the format I would suggest:

1. Put command-line flags in the module metadata instead of .llvmcmd.
2. Put each module in the bitcode wrapper supported by SkipBitcodeWrapperHeader, which includes a length field. I think LLVM only generates the wrapper for Darwin, but it can read the wrapper correctly on all platforms.
3. Change the .llvmbc section alignment so that no extra zeros are added between modules.
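To make point 2 concrete, here is a standalone sketch of reading the length out of the wrapper without any LLVM headers. As I understand the LLVM bitcode format documentation, the wrapper header is five 32-bit little-endian fields (magic 0x0B17C0DE, version, offset, size, CPU type); treat that layout as an assumption to double-check. wrapModule and wrappedModuleSize are hypothetical helpers for this example only.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>

const std::size_t WrapperHeaderSize = 20; // five 32-bit fields
const std::uint32_t WrapperMagic = 0x0B17C0DEu;

// Appends a 32-bit value in little-endian byte order.
static void appendLE32(std::string &Out, std::uint32_t V) {
  for (int I = 0; I < 4; ++I)
    Out.push_back(static_cast<char>((V >> (8 * I)) & 0xff));
}

// Reads a 32-bit little-endian value at byte offset Off.
static std::uint32_t readLE32(const std::string &Buf, std::size_t Off) {
  std::uint32_t V = 0;
  for (int I = 0; I < 4; ++I)
    V |= static_cast<std::uint32_t>(static_cast<unsigned char>(Buf[Off + I]))
         << (8 * I);
  return V;
}

// Hypothetical minimal writer: header immediately followed by the bitcode.
std::string wrapModule(const std::string &Bitcode) {
  assert(Bitcode.size() <= 0xFFFFFFFFu - WrapperHeaderSize);
  std::string Out;
  appendLE32(Out, WrapperMagic);
  appendLE32(Out, 0);                                             // version
  appendLE32(Out, static_cast<std::uint32_t>(WrapperHeaderSize)); // offset
  appendLE32(Out, static_cast<std::uint32_t>(Bitcode.size()));    // size
  appendLE32(Out, 0);                                             // CPU type
  return Out + Bitcode;
}

// Returns the number of bytes taken by the wrapped module at the start of
// Buf (header plus bitcode), or 0 if Buf does not start with a valid wrapper.
std::size_t wrappedModuleSize(const std::string &Buf) {
  if (Buf.size() < WrapperHeaderSize || readLE32(Buf, 0) != WrapperMagic)
    return 0;
  std::uint64_t Offset = readLE32(Buf, 8);
  std::uint64_t Size = readLE32(Buf, 12);
  if (Offset < WrapperHeaderSize || Offset + Size > Buf.size())
    return 0;
  return static_cast<std::size_t>(Offset + Size);
}
```

With the size in the header, the next module in a concatenated section starts at the returned offset, provided no padding was inserted (point 3).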

My use case: I'm using -fembed-bitcode on Linux as an alternative to the wllvm/whole-program-llvm tool. For my purposes, it'd be nice to also keep track of linker flags and other linker input files, but I can get most of what I need from the modules alone.

Sean

MaskRay · August 30, 2020, 4:48am

I investigated a bit about the bitcode file format today. The bitcode
is streaming style and I think an optional size field may be useful.
https://reviews.llvm.org/D86847 proposes to add a
BITCODE_SIZE_BLOCK_ID block. We actually don't need a container
because the MODULE_CODE_SOURCE_FILENAME record encodes the source
filename. We can do a lightweight parse and obtain the field.
This should be fast because there are typically very few
blocks/records preceding MODULE_CODE_SOURCE_FILENAME.

For .llvmcmd, I am on the fence about moving it into the bitcode. Downside:
retrieving the command line will be more difficult...
I'd like to mention that the functionality duplicates the existing
-frecord-command-line a bit...

% readelf -p .GCC.command.line a.o

String dump of section '.GCC.command.line':
[ 1] /tmp/clang-12 -c -frecord-command-line a.c

GCC's -frecord-gcc-switches uses a different format (some folks
consider it inferior to clang's format; and worse, the section is
SHF_MERGE|SHF_STRINGS):
% readelf -p .GCC.command.line a.o

String dump of section '.GCC.command.line':
  [ 0] -imultiarch x86_64-linux-gnu
  [ 1d] a.c
  [ 21] -mtune=generic
  [ 30] -march=x86-64
  [ 3e] -frecord-gcc-switches
  [ 54] -fasynchronous-unwind-tables

HaohaiWen · August 5, 2025, 1:00pm

I’m trying to mark those two sections as exclude sections: “[X86] Set .llvmbc and .llvmcmd to exclude sections” (llvm/llvm-project pull request #151910).

Reproducing the compilation from object files is clearly the better choice. Users can easily do this by using -reproduce to collect the input objects/libraries and then reproduce the compilation.

We can keep them as metadata sections if someone is already using them.

If there are no existing users, I think we can safely mark them as exclude.

mtrofin · August 5, 2025, 2:16pm

If you end up moving forward with that PR, could you add a flag to control the exclusion behavior? It can be set to enable the new behavior (“exclude”) by default.

Either way, object files are unchanged, this is just about excluding at link time. It shouldn’t affect thin link either, correct?

IIUC reproduce is for reproducing the link, not the compilation. Are you saying that combining reproduce with the -fembed-bitcode or -mllvm -lto-embed-bitcode during a build would allow one to reproduce both compilation and linking?

HaohaiWen · August 21, 2025, 12:55am

Sorry, I missed the notification mail.

llvm/include/llvm/MC/SectionKind.h:

/// Exclude - This section will be excluded from the final executable or
/// shared library. Only valid for ELF / COFF targets.
Exclude,

llvm/lib/CodeGen/TargetLoweringObjectFileImpl.cpp:

For ELF:

static unsigned getELFSectionFlags(SectionKind K, const Triple &T) {
  ...
  if (K.isExclude())
    Flags |= ELF::SHF_EXCLUDE;

SHF_EXCLUDE: This section is excluded from input to the link-edit of an executable or shared object. This flag is ignored if the SHF_ALLOC flag is also set, or if relocations exist against the section.
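The rule quoted above can be written down as a small predicate. This is only a model of that description, not the logic of any particular linker; the constant and function names are made up for the example (the values 0x2 and 0x80000000 are the standard ELF values for SHF_ALLOC and SHF_EXCLUDE).

```cpp
#include <cstdint>

// Hypothetical names, chosen to avoid clashing with the macros in <elf.h>.
constexpr std::uint64_t kShfAlloc = 0x2;          // ELF SHF_ALLOC
constexpr std::uint64_t kShfExclude = 0x80000000; // ELF SHF_EXCLUDE

// True if a section with these flags would be left out of the link:
// SHF_EXCLUDE must be set, SHF_ALLOC must not be set, and no relocations
// may exist against the section.
constexpr bool excludedFromLink(std::uint64_t Flags,
                                bool HasRelocationsAgainst) {
  return (Flags & kShfExclude) != 0 && (Flags & kShfAlloc) == 0 &&
         !HasRelocationsAgainst;
}
```

For example, excludedFromLink(kShfExclude, false) is true, while adding kShfAlloc to the flags makes it false.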

For COFF:

getCOFFSectionFlags(SectionKind K, const TargetMachine &TM) {
  ...
  else if (K.isExclude())
    Flags |=
      COFF::IMAGE_SCN_LNK_REMOVE | COFF::IMAGE_SCN_MEM_DISCARDABLE;

IMAGE_SCN_LNK_REMOVE: The section will not become part of the image. This is valid only for object files.

If we mark the sections as the exclude kind, they will still live in the object file and will be discarded when the linker reads them from the object file. Regarding “thin link”, is that ThinLTO? I don’t think this will affect any LTO behavior.

IIUC reproduce is for reproducing the link, not the compilation. Are you saying that combining reproduce with the -fembed-bitcode or -mllvm -lto-embed-bitcode during a build would allow one to reproduce both compilation and linking?

Yes. We have the command line and bitcode to recompile to an object file, and we know how those objects are linked into the executable. Thus reproducing part of the compilation and the whole link is possible. I say “part of the compilation” since we can only reproduce the bitcode → object file step.

mtrofin · August 21, 2025, 1:10am

Blockquote
If we set section to exclude kind, they will still live in object file and will be discarded when linker read them from object file

Perfect. And I assume ar won’t care - i.e. .o placed in a .a would still have these sections.

Blockquote
Regarding “thin link”, is that ThinLTO? I don’t think they will affect any LTO behavior.

Yes (ThinLTO). Agreed, hard to see how it’d affect it.

LGTM

HaohaiWen · August 21, 2025, 1:41am

I think so.
ar should not care about the contents.
e.g. .llvm_addrsig and .llvm.call-graph-profile are all “exclude” sections and they are used in lld.

HaohaiWen · August 21, 2025, 1:42am

@MaskRay, any more concerns regarding setting .llvmcmd and .llvmbc to exclude sections?
