Macro history (de-)serialization implementation, need help figuring out some things

Hi,

I have a quick-and-dirty implementation of macro history (de-)serialization, but it is not fully working yet. In particular, I still have to teach the reading code to distinguish between identifiers that currently have a macro definition and those that had one before. I already taught ASTWriter to handle currently undefined macros, but I had to add a HadMacro flag to IdentifierInfo for this purpose (luckily there was one spare bit in the 32-bit word for this).
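To illustrate the distinction described above, here is a minimal sketch of a sticky "had a macro" bit next to a "has a macro now" bit. The struct IdentInfoSketch and its members are hypothetical stand-ins invented for this example; they are not Clang's actual IdentifierInfo layout or API.

```cpp
#include <cassert>

// Hypothetical stand-in for an identifier record (NOT Clang's real IdentifierInfo).
struct IdentInfoSketch {
  unsigned HasMacro : 1;  // a macro definition is active right now
  unsigned HadMacro : 1;  // a macro was defined at some point; never cleared
  unsigned Other    : 30; // placeholder for the remaining flag bits

  IdentInfoSketch() : HasMacro(0), HadMacro(0), Other(0) {}

  // #define: the identifier has a macro now and has had one from here on.
  void defineMacro() { HasMacro = 1; HadMacro = 1; }

  // #undef: only the "current" bit is cleared; the history bit stays set.
  void undefineMacro() { HasMacro = 0; }

  // A writer could use this to decide whether macro history must be stored,
  // even for identifiers that are currently undefined.
  bool needsMacroHistory() const { return HadMacro != 0; }
};
```

With this split, an identifier that was defined and later undefined still reports needsMacroHistory() as true, while one that never had a macro does not.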

To complete this I need a much deeper understanding of the things I’m changing, so I’d appreciate it if someone could explain several things to me (or point me to the documentation, if it exists): the structure of PCH files and what is stored there, the idea behind chained PCHs and how they work, and what public/non-public macros are (are they related to Objective-C only?).

I’m writing this mainly to Douglas and Richard, as they have the context of what I’m doing, but maybe someone else can help me with these things.

Thanks in advance!

–Regards,
Alex

public/non-public macros (sorry, don’t know about PCH):

I think it was introduced by Doug when he worked on the module system (hopefully something for C++1x). The idea was that most macros could be contained within a module (private macros) and only a specific few would be declared so as to be exported.

– Matthieu

Thank you for the information. Do you know where I could find any preliminary documentation on the module system? Or any other information Doug may have used while implementing this?

Can anyone explain the PCH part to me? Doug? Richard?

I do not know if any documentation was published about it (sorry).

There are tests, though: http://llvm.org/svn/llvm-project/cfe/trunk/test/Modules/ and tests are always a good starting point.

– Matthieu

Hey Alex!

Well, that's grave neglect on my part. I need to rectify this eventually.

For the short term, PCH chaining is pretty simple in idea, if not in implementation. Load an existing PCH file, parse some additional code, and save the diff between the AST loaded from the PCH and the AST after parsing as another PCH file that references the first. Now you have two chained PCH files. Load the second, and it will automatically load the first and then apply the diff.
This is easy for new AST nodes, but pretty hard for AST mutation. Luckily, this is very rare.
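As a rough illustration of the chaining idea described above, here is a toy model. It assumes a "PCH" is just a map from names to descriptions; PchSketch and makeChained are hypothetical names made up for this sketch, and real Clang PCH files are far more involved.

```cpp
#include <cassert>
#include <map>
#include <string>

// Hypothetical model of one file in a PCH chain (not Clang's real format).
struct PchSketch {
  const PchSketch *Base = nullptr;             // earlier file in the chain, if any
  std::map<std::string, std::string> Entries;  // only what this file adds or changes

  // Look in this file's own entries first, then fall back along the chain.
  // Returns nullptr when no file in the chain knows the name.
  const std::string *lookup(const std::string &Name) const {
    for (const PchSketch *P = this; P != nullptr; P = P->Base) {
      auto It = P->Entries.find(Name);
      if (It != P->Entries.end())
        return &It->second;
    }
    return nullptr;
  }
};

// Build the second file: store only entries that are new or that differ
// from what the base chain already provides. Base must outlive the result.
PchSketch makeChained(const PchSketch &Base,
                      const std::map<std::string, std::string> &AfterParsing) {
  PchSketch Diff;
  Diff.Base = &Base;
  for (const auto &KV : AfterParsing) {
    const std::string *Old = Base.lookup(KV.first);
    if (Old == nullptr || *Old != KV.second)
      Diff.Entries.insert(KV);
  }
  return Diff;
}
```

In this toy model a changed entry is simply overridden by the newer file. That hides exactly the part that is hard in practice: mutating an AST node that already lives in the first PCH.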

Sebastian

Thanks for the explanation. But what problem does this feature intend to solve? What are use-cases for this?

The main problem it was intended to solve was code completion speed in IDEs. Any given source file starts with a bunch of includes, typically (depending on one’s preference) first some library headers, then some internal project headers. Only then comes the actual code.

When you type an identifier in a Clang-powered IDE and request code completion, the IDE calls on Clang to parse the file up to the point where code completion is requested, and Clang returns a list of possible completions, which the IDE then displays. In order to be useful, this has to be fast. If you’ve developed with Visual Studio, I’m sure you’re familiar with how disruptive the delay in IntelliSense’s reaction can be. If the project gets big and complicated, IntelliSense often becomes unusable simply because it takes several seconds to pop up.

If you have to wait for Clang to parse your entire source file, including all the headers, you’re going to wait just as long, especially for a complicated C++ project. The main way to speed this up is precompiled headers. Compile the headers first, then just load the binary format when you need to reparse the file. However, PCHs typically have to be configured. You take a set of headers that is common to your project and rarely changes (because rebuilding PCHs is slow) and tell the compiler to use it. But that still leaves the project-specific headers to be reparsed.

So Clang has another feature, called the precompiled preamble (PCP). Basically, Clang will look at a source file, decide where the include directives for the file end (the preamble) and automatically build a PCH from that, which it will use when it needs to reparse the file. (I think the C API has clang_reparseTranslationUnit for this.) Once the preamble is built (which should happen once when the file is opened in the IDE), Clang only needs to reparse the actual source file, which can usually be done in less than a second, especially if the PCP is kept in memory.

The downside of this approach is that it takes a long time to do the initial compiling of the preamble when you open the file. It would be a lot faster if you had a PCH of all the third party headers and just combined it with the file-specific part of the preamble into a PCP. And Clang used to be able to do that. You could use a PCH and it would load it completely (PCHs are usually loaded lazily by Clang), parse the new parts, and create a new PCH consisting of the combination. This is faster than reparsing everything, but it is still not fast enough (fully loading a PCH isn’t very fast and needs a lot of memory), and each resulting PCP is rather big (tens of megabytes if not more), which is a problem if you want to keep all the PCPs for your open files (and I know enough programmers who keep dozens of files open) in memory for fast access.

Enter chained PCH. You take your big third-party library PCH as the primary. You create a diff for the rest of the preamble, which is usually nice and small (maybe a megabyte), and fast to create (because you only load the parts of the PCH that you need). You have one big block in memory and a multitude of small blocks that reference it, you have fast loading, fast parsing, and all-around goodness.

As a side effect, the work on chained PCH made the PCH system more flexible and was thus the first step towards the true module system being developed in Clang.

Sebastian

Thank you for the extensive explanation. This and a live chat with Doug helped me a lot!