[PATCH] Wrap clang modules inside Mach-O/ELF/COFF containers

Hi everyone,

As the first step in preparation for module debugging (see http://lists.cs.uiuc.edu/pipermail/cfe-dev/2014-November/040076.html), this patch wraps the *.pcm files that are used to store clang modules and precompiled headers in a platform-dependent Mach-O/ELF/COFF container, so that eventually we will be able to store debug information alongside the module in the same file.

This is implemented by using the standard LLVM code generation machinery. Instead of being written directly to the output file, the serialized AST blob is attached to an empty llvm::Module as a ModuleFlag. The module is passed to the backend, which emits the AST blob into a special “__clang_pch” section in TargetLoweringObjectFile*.
On the ASTReader side, any object file is transparently unwrapped and the BitstreamReader is pointed directly to the AST section.
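
As an illustration of how a reader could tell a wrapped module from a plain one, here is a minimal sketch that looks only at the leading bytes of the file. It is not what the patch does (the patch asks llvm::object::ObjectFile::createObjectFile to parse the buffer); the names FileKind and classifyModuleFile are hypothetical. COFF is left out because COFF object files do not start with a fixed magic number.

```cpp
#include <cstddef>
#include <cstring>
#include <string>

// Hypothetical classification of a module file by its leading bytes.
enum class FileKind { RawAST, ELFContainer, MachOContainer, Unknown };

// Returns true if Buf starts with the Len bytes stored at Magic.
static bool startsWith(const std::string &Buf, const char *Magic,
                       std::size_t Len) {
  return Buf.size() >= Len && std::memcmp(Buf.data(), Magic, Len) == 0;
}

FileKind classifyModuleFile(const std::string &Buf) {
  // Plain clang AST files begin with the signature "CPCH".
  if (startsWith(Buf, "CPCH", 4))
    return FileKind::RawAST;
  // ELF files begin with 0x7f 'E' 'L' 'F'.
  if (startsWith(Buf, "\x7f" "ELF", 4))
    return FileKind::ELFContainer;
  // Little-endian Mach-O: 0xfeedface (32-bit) or 0xfeedfacf (64-bit),
  // shown here in on-disk byte order.
  if (startsWith(Buf, "\xce\xfa\xed\xfe", 4) ||
      startsWith(Buf, "\xcf\xfa\xed\xfe", 4))
    return FileKind::MachOContainer;
  return FileKind::Unknown;
}
```

A reader built this way would hand RawAST buffers straight to the bitstream reader and look for the AST section only in the container cases.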

Other than the .pcm files having an extra header inside, this patch is not meant to have any user-visible effects.

Known bugs: I still need to figure out how to make c-index-test link against and register the available targets (check-all passes, but the modules created by c-index-test currently are plain old .pcm files).
Open questions: I made up the name of the new __clang_pch section and the various flags on the different platforms on the spot. I’m open to better suggestions.

Let me know what you think!

-- adrian

clang.diff (15.2 KB)

llvm.diff (1.67 KB)

Can we use _cfepch instead? MSVC linker will silently truncate section names longer than 8 characters.

… and I was worried about the Mach-O 15 character limit :-)
works for me.
– adrian
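
To make the length concern concrete, here is a small sketch that checks candidate section names against the limits quoted in this thread (8 characters for COFF, 15 for Mach-O). The constants and the helper fitsSectionNameLimit are illustrative assumptions taken from the discussion, not values read from any object-file specification.

```cpp
#include <cstddef>
#include <string>

// Limits as quoted in this thread; treat them as assumptions.
constexpr std::size_t COFFSectionNameLimit = 8;
constexpr std::size_t MachOSectionNameLimit = 15;

// True if Name can be stored without truncation under the given limit.
bool fitsSectionNameLimit(const std::string &Name, std::size_t Limit) {
  return Name.size() <= Limit;
}
```

Under these assumptions "__clang_pch" (11 characters) is too long for COFF but fine for Mach-O, while "_cfepch" fits both.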

Once you’ve got c-index-test working with containers, is the intent to remove the fallback to reading raw AST files, or do you intend to keep it around?

@@ -599,6 +600,25 @@ void PCHValidator::ReadCounter(const ModuleFile &M, unsigned Value) {
// AST reader implementation
//===----------------------------------------------------------------------===//

+void ASTReader::initStreamFileWithModule(llvm::MemoryBufferRef Buffer,
+                                         llvm::BitstreamReader &StreamFile) {
+  if (auto OF = llvm::object::ObjectFile::createObjectFile(Buffer))
+    // Unwrap the container if necessary.
+    for (auto &Section : OF->get()->sections()) {
+      StringRef Name;
+      Section.getName(Name);
+      if (Name == "__clang_pch") {
+        StringRef Buf;
+        Section.getContents(Buf);
+        return StreamFile.init((const unsigned char*)Buf.begin(),
+                               (const unsigned char*)Buf.end());
+      }
+    }
+
+  StreamFile.init((const unsigned char *)Buffer.getBufferStart(),
+                  (const unsigned char *)Buffer.getBufferEnd());
+}
+

Is this sufficiently paranoid? It looks like Section.getName() and Section.getContents() can both fail. I guess we already fall over horribly if an AST file is malformed, so maybe this is no worse...

Nitpick: Can we add braces for the outer if? I was confused about the nesting.
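
One way to address both comments (checked accessors and explicit braces) might look like the sketch below. It does not use the LLVM API: FakeSection is a hypothetical stand-in whose accessors report failure through std::error_code, which is the shape the calls in the hunk above appear to have. Lookup and findASTSection are likewise made-up names.

```cpp
#include <string>
#include <system_error>
#include <vector>

// Hypothetical stand-in for an object-file section.
struct FakeSection {
  std::string Name, Contents;
  bool NameOK, ContentsOK;

  FakeSection(const std::string &N, const std::string &C,
              bool NameOK = true, bool ContentsOK = true)
      : Name(N), Contents(C), NameOK(NameOK), ContentsOK(ContentsOK) {}

  std::error_code getName(std::string &Result) const {
    if (!NameOK)
      return std::make_error_code(std::errc::invalid_argument);
    Result = Name;
    return std::error_code();
  }

  std::error_code getContents(std::string &Result) const {
    if (!ContentsOK)
      return std::make_error_code(std::errc::invalid_argument);
    Result = Contents;
    return std::error_code();
  }
};

enum class Lookup { Found, NotPresent, Malformed };

// Copies the AST section into Out. Any accessor failure is reported as
// Malformed instead of being ignored.
Lookup findASTSection(const std::vector<FakeSection> &Sections,
                      std::string &Out) {
  for (const FakeSection &Section : Sections) {
    std::string Name;
    if (Section.getName(Name)) {
      return Lookup::Malformed;
    }
    if (Name == "__clang_pch") {
      if (Section.getContents(Out)) {
        return Lookup::Malformed;
      }
      return Lookup::Found;
    }
  }
  return Lookup::NotPresent;
}
```

With a three-way result the caller can fall back to treating the buffer as a raw AST file only in the NotPresent case, and diagnose Malformed separately.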

-  // Write the generated bitstream to "Out".
-  Out->write((char *)&Buffer.front(), Buffer.size());
+  std::string Error;
+  StringRef Triple = Ctx.getTargetInfo().getTriple().getTriple();
+  if (llvm::TargetRegistry::lookupTarget(Triple, Error)) {

<snip>

+  } else {
+    // FIXME: Fallback for c-index-test.
+    // Directly write the generated bitstream to "Out".
+    Out->write((char *)&Buffer.front(), Buffer.size());
+    Out->flush();
+  }

When c-index-test is updated, will this code change? Is it possible to end up here for other reasons, like if the triple is bogus? Or does that always get caught earlier?

Ben

Once you’ve got c-index-test working with containers, is the intent to remove the fallback to reading raw AST files, or do you intend to keep it around?

Although the fallback is just one extra line of code (so there wouldn’t be a large maintenance effort), I don’t really see a compelling use case for it. I’d vote in favor of removing it.

-- adrian

Can we use _cfepch instead? MSVC linker will silently truncate section
names longer than 8 characters.

I don't like "cfe"; we should indicate which compiler is involved here. I
don't like "pch"; this is used for all kinds of AST files, not just PCH
files.

Can we use "ClangAST", or do we need the leading underscore?

The double underscores and all lowercase are a naming convention on Mach-O (which doesn’t have the 8 character restriction, though).
[https://developer.apple.com/library/mac/documentation/DeveloperTools/Conceptual/MachORuntime/index.html]

Now, we could use different spellings for each platform: “ClangAST” on COFF, “__clang_ast” on Mach-O, and maybe “.clang_ast” on ELF? It would make the ASTReader code slightly more complicated, but not much.
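
A sketch of what the per-platform lookup could look like on the reader side, using the three spellings floated above. ObjectFormat, astSectionName and isASTSection are hypothetical names, and none of the spellings had been agreed on at this point in the thread.

```cpp
#include <string>

enum class ObjectFormat { COFF, ELF, MachO };

// Section-name spellings as proposed in the thread (not settled).
std::string astSectionName(ObjectFormat Format) {
  switch (Format) {
  case ObjectFormat::COFF:
    return "ClangAST";
  case ObjectFormat::ELF:
    return ".clang_ast";
  case ObjectFormat::MachO:
    return "__clang_ast";
  }
  return ""; // Not reached for valid enumerators.
}

// True if Name is the AST section for the given container format.
bool isASTSection(ObjectFormat Format, const std::string &Name) {
  return Name == astSectionName(Format);
}
```

The extra reader complexity is then limited to passing the container format along with the section name.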

– adrian

You can use the generic MC level support for COFF/ELF/MachO sections and just have it be the right name for each. The AST reader could probably use that or a similar abstraction.

-eric

The .pcm file is currently independent of debug info, meaning the compiler invocation will be able to use the same .pcm file regardless of whether the invocation had enabled debug info or not; with this change if an invocation had built a module file with debug info disabled, it would be inapplicable to the same invocation that had debug info enabled and would have to rebuild it; essentially we are tying module building with debug info. The module file as the “collection of semantic info” is conceptually independent from debug info.

Did you consider having the debug info container be another file (e.g. besides the .pcm) that references the .pcm file? This way, instead of having to update all users of module files, regardless of whether they care about debug info, you’d just make debug info another user of .pcm files, no more special than the others.

I imagine the side table of entity hashes to AST constructs could be
generated regardless of whether the module is built with debug info - it's
probably cheap/small enough?

If generating the actual debug info in the module is too expensive to be
always-on, we could make it conditional and have the compiler just build it
on the fly if it wasn't present in the module. (but, ideally, it is cheap
enough that we don't mind always putting it in the module and then ignoring
it if the compilation referencing the module doesn't have -g)

The .pcm file is currently independent of debug info, meaning the compiler
invocation will be able to use the same .pcm file regardless of whether the
invocation had enabled debug info or not;

We can't use the same .pcm file for -DNDEBUG vs -UNDEBUG builds. Do we ever
get to reuse a .pcm file like this in practice?

with this change if an invocation had built a module file with debug info

You can choose to add, or not to add, debug info to a release build.

To add to what David said, another advantage of having everything in one container is that further down the road, we may also embed bitcode for inlining purposes into the module.

-- adrian

Sure, I don't dispute that this .pcm reuse can happen in theory. But what
I'm wondering is: Does this actually happen in practice? How often? Is this
case worth optimizing for?

There are other things I'd like to bundle with a .pcm file (.o and .ir code
for inline functions, for instance) that would also benefit from using an
ELF wrapper format, and would also vary based on clang's CodeGen options.
One possible approach would be to have (at least) two files -- one
CodeGen-independent AST file, and one CodeGen-dependent file containing all
the other bits -- but that seems to introduce complexity that is
unnecessary in almost all cases. (Also note that even flags like -O or
-fsanitize=address cause us to build different .pcm files today, because
they affect preprocessor macros.)

I don’t see the reason to make the module file itself the container, particularly when whatever the container may contain doesn’t affect in any way the semantic info that the module file is supposed to provide; we would just proliferate module files and/or rebuild module files unnecessarily.
It’s true that the situation is not ideal currently: -O1 through -O3 reuse the .pcm but -Os does not. In the future we could try to address this rather than make the situation fundamentally worse and inescapable. I’d like modules not to turn into a “glorified PCH system” where there is practically zero reuse for them.
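
The -O versus -Os difference comes from the predefined macros mentioned earlier in the thread. The sketch below is a deliberately simplified model that tracks only __OPTIMIZE__ and __OPTIMIZE_SIZE__; the helper names are hypothetical, and clang's real module hash takes many more inputs into account.

```cpp
#include <set>
#include <string>

// Simplified model: the macros an optimization flag predefines.
std::set<std::string> predefinedMacrosFor(const std::string &OptFlag) {
  std::set<std::string> Macros;
  if (OptFlag == "-O1" || OptFlag == "-O2" || OptFlag == "-O3" ||
      OptFlag == "-Os")
    Macros.insert("__OPTIMIZE__");
  if (OptFlag == "-Os")
    Macros.insert("__OPTIMIZE_SIZE__");
  return Macros;
}

// Two invocations can share a module in this model only if they
// predefine the same set of macros.
bool canShareModule(const std::string &FlagA, const std::string &FlagB) {
  return predefinedMacrosFor(FlagA) == predefinedMacrosFor(FlagB);
}
```

In this model -O1 and -O3 share a module, while -Os and -O0 each need their own.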

Back to the debug info, why not have the container like this

Foundation.pcm.o
   \
  Foundation.pcm

where the container references the .pcm file, and you can put the debug info in it (or ir later on).

Debug info can reference Foundation.pcm.o and get extended to handle the serialized AST from .pcm.

At least for the debug info I was hoping that it would be cheap enough to always emit it together with the pcm, especially given that modules are being rebuilt comparatively infrequently. The module debug info is essentially just an alternative encoding of the types provided by the module, and exactly the same conditions that trigger a re-generation of the .pcm today would also necessitate re-generating the debug info.

If we do want to hold on to the ability to have modules without debug info, this approach does appear a little less practical.

Having two separate files complicates the module rebuild stage a bit and we will need to also store a hash of the module to make sure the two files are in sync, but it would certainly be doable. It would probably be easier to provide non-debuggable modules this way.
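
The sync check for the two-file layout could be as small as the following sketch. DebugInfoCompanion, makeCompanion and isInSync are hypothetical names, and FNV-1a is used only because it is short and gives the same value across runs; nothing in the thread specifies which hash would be stored.

```cpp
#include <cstdint>
#include <string>

// 64-bit FNV-1a over the raw bytes of a file.
std::uint64_t hashBytes(const std::string &Bytes) {
  std::uint64_t Hash = 14695981039346656037ULL; // FNV offset basis
  for (unsigned char C : Bytes) {
    Hash ^= C;
    Hash *= 1099511628211ULL; // FNV prime
  }
  return Hash;
}

// Hypothetical companion file (e.g. Foundation.pcm.o) that records
// which .pcm contents its debug info was generated from.
struct DebugInfoCompanion {
  std::uint64_t ModuleHash;
  std::string DebugInfo;
};

DebugInfoCompanion makeCompanion(const std::string &PCMBytes,
                                 const std::string &DebugInfo) {
  return DebugInfoCompanion{hashBytes(PCMBytes), DebugInfo};
}

// True if the companion was generated from exactly these .pcm bytes.
bool isInSync(const DebugInfoCompanion &Companion,
              const std::string &PCMBytes) {
  return Companion.ModuleHash == hashBytes(PCMBytes);
}
```

A rebuild of the .pcm that changes its bytes makes isInSync return false, which would be the signal to regenerate the companion.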

Are modules without debug info desirable? Translating types to DWARF is relatively cheap, and it is my understanding that modules are not rebuilt very often, since the module cache is shared across all projects.

-- adrian

The module cache is essentially never shared across projects because it depends on include paths and macros. I think there are sound ways to improve this, but no one’s looking into it right now.

Jordan

Are modules without debug info desirable? Translating types to DWARF is
relatively cheap, and it is my understanding that modules are not rebuilt
very often, since the module cache is shared across all projects.

That may be true for the way modules are deployed on the Mac OS platform,
but is not universally true. On other projects, some modules will typically
be rebuilt each time any header file changes; most source changes touch at
least one header file. And not all build strategies use a module cache;
some compile module files as explicit build artefacts.

‘Never’ is not quite accurate; user include paths don’t affect system modules, not all projects use project-specific macros everywhere, and there is also “-fmodules-ignore-macro=”.
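
To illustrate why ignoring a macro can restore sharing, here is a toy model of a cache key built from -D style definitions, with ignored macro names filtered out first. cacheKey and macroName are hypothetical helpers; the sketch only mirrors the idea behind -fmodules-ignore-macro= and is not how clang computes its module hash.

```cpp
#include <set>
#include <string>
#include <vector>

// Returns the name part of a "NAME" or "NAME=VALUE" definition.
std::string macroName(const std::string &Def) {
  return Def.substr(0, Def.find('='));
}

// Toy cache key: the sorted definitions that are not ignored,
// each followed by ';'.
std::string cacheKey(const std::vector<std::string> &Defs,
                     const std::set<std::string> &Ignored) {
  std::set<std::string> Kept;
  for (const std::string &Def : Defs)
    if (Ignored.count(macroName(Def)) == 0)
      Kept.insert(Def);
  std::string Key;
  for (const std::string &Def : Kept) {
    Key += Def;
    Key += ';';
  }
  return Key;
}
```

Two projects that differ only in an ignored project-specific macro end up with the same key in this model, so they could share the cached module.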

Ah, that’s what I get for not keeping up with things. Or not learning them properly in the first place. Okay, we do have a chance, then, of sharing system modules across projects.

Jordan