[GSoC 2024] On Demand Parsing in Clang

Description of the project: Clang, like any C++ compiler, parses a sequence of characters as it appears, linearly. The linear character sequence is turned into tokens and an AST before lowering to machine code. In many cases the end-user code uses only a small portion of the C++ entities from the entire translation unit, yet the user still pays the price of compiling all of the redundant content.

This project proposes to process expensive-to-compile C++ entities when they are used rather than eagerly. This approach is already adopted in Clang’s CodeGen, where it allows Clang to produce code only for what is actually used. On-demand compilation is expected to significantly reduce peak memory during compilation and improve compile times for translation units which use their contents sparsely. In addition, it would have a significant impact on interactive C++, where header inclusion essentially becomes a no-op and entities are parsed only on demand.

The Cling interpreter implements a very naive but efficient cross-translation unit lazy compilation optimization which scales across hundreds of libraries in the field of high-energy physics.

// A.h
#include <string>
#include <vector>

template <class T, class U = int> struct AStruct {
void doIt() { /*...*/ }
const char* data;
// ...
};

template<class T, class U = AStruct<T>>
inline void freeFunction() { /* ... */ }
inline void doit(unsigned N = 1) { /* ... */ }

// Main.cpp
#include "A.h"

int main() {
doit();
return 0;
}

This pathological example expands to 37253 lines of code to process. Cling builds an index (which it calls an autoloading map) containing only forward declarations of these C++ entities, roughly 3000 lines of code in total. The index looks like:

// A.h.index
namespace std{inline namespace __1{template <class _Tp, class _Allocator> class __attribute__((annotate("$clingAutoload$vector"))) __attribute__((annotate("$clingAutoload$A.h"))) __vector_base;
}}
...
template <class T, class U = int> struct __attribute__((annotate("$clingAutoload$A.h"))) AStruct;

Upon requiring the complete type of an entity, Cling includes the relevant header file to get it. There are several trivial workarounds to deal with default arguments and default template arguments, since they now appear on both the forward declaration and the definition. You can read more in [1].

Although the implementation could not be called a reference implementation, it shows that the Parser and the Preprocessor of Clang are relatively stateless and can be used to process character sequences that are not linear in nature. In particular, namespace-scope definitions are relatively easy to handle, and it is not very difficult to return to namespace scope when we lazily parse something. For other contexts, such as local classes, we will have lost some essential information such as name lookup tables for local entities. However, these cases are probably not very interesting, as lazy parsing at this granularity is likely worthwhile only for top-level entities.

Such an implementation can also help with existing issues in the standard such as CWG2335, under which the delayed portions of classes get parsed immediately when they are first needed, if that first use precedes the end of the class. That should provide good motivation to upstream all the operations needed to return to an enclosing scope and parse something.

Implementation approach: Upon seeing a tag definition during parsing, we could create a forward declaration, record the token sequence, and mark it as a lazy definition. Later, upon a complete-type request, we could re-position the parser to parse the definition body. We already skip some template specializations in a similar way [2, 3].
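As a rough illustration of that record-then-replay idea (toy code, not Clang internals; names such as LazyDefinition, recordBody, and parseOnDemand are all hypothetical):

```cpp
#include <cstddef>
#include <string>
#include <vector>

using TokenStream = std::vector<std::string>;

struct LazyDefinition {
  std::size_t BodyBegin = 0, BodyEnd = 0; // half-open token range of the body
  bool Parsed = false;
};

// On first sight of a tag definition: skip the braced body, remembering
// where it lives instead of parsing it.
LazyDefinition recordBody(const TokenStream &Toks, std::size_t LBrace) {
  std::size_t Depth = 0, I = LBrace;
  do {
    if (Toks[I] == "{")
      ++Depth;
    else if (Toks[I] == "}")
      --Depth;
    ++I;
  } while (Depth != 0);
  return {LBrace, I, false};
}

// Later, on a complete-type request, re-position a parser over the recorded
// range. Here we just count the tokens as a stand-in for real parsing work.
std::size_t parseOnDemand(const TokenStream &Toks, LazyDefinition &D) {
  (void)Toks;
  D.Parsed = true;
  return D.BodyEnd - D.BodyBegin;
}
```

The real work in Clang would of course be re-entering the Parser with the saved tokens, but the shape of the bookkeeping is the same.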

Another approach is for every lazily parsed entity to record its token stream and to change the Toks stored on LateParsedDeclarations to optionally refer to a subsequence of an externally stored token sequence instead of storing its own copy (or perhaps change CachedTokens so it can do that transparently). One challenge is that we currently modify the cached token list to append an “eof” token, but it should be possible to handle that differently.
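One way to picture that change (a hypothetical sketch; TokenRef is not a Clang class) is a cached-tokens holder that can either own its tokens or view a subrange of an externally stored stream:

```cpp
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

using Token = std::string;

// Either a non-owning view into an external token buffer, or an owned copy.
class TokenRef {
  const std::vector<Token> *External = nullptr; // set when viewing
  std::size_t Begin = 0, End = 0;
  std::vector<Token> Owned; // used when we must own a private copy

public:
  static TokenRef view(const std::vector<Token> &Stream, std::size_t B,
                       std::size_t E) {
    TokenRef R;
    R.External = &Stream;
    R.Begin = B;
    R.End = E;
    return R;
  }
  static TokenRef copy(std::vector<Token> Toks) {
    TokenRef R;
    R.Owned = std::move(Toks);
    return R;
  }
  std::size_t size() const { return External ? End - Begin : Owned.size(); }
  const Token &operator[](std::size_t I) const {
    return External ? (*External)[Begin + I] : Owned[I];
  }
};
```

The “eof” problem mentioned above corresponds to the view case: a view cannot append a terminator to someone else’s buffer, so the consumer would have to synthesize it.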

In some cases, a class definition can affect its surrounding context in a few ways that require care here:

  1. struct X appearing inside the class can introduce the name X into the enclosing context.
  2. static inline declarations can introduce global variables with non-constant initializers that may have arbitrary side-effects.
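Both effects can be seen in a small self-contained example (plain C++17, only illustrating the language rules involved; the names are made up):

```cpp
int sideEffects = 0;
int noisyInit() { return ++sideEffects; } // not constexpr: runs at startup

struct Outer {
  // (1) Since no X was declared before, this elaborated-type-specifier
  //     declares X in the enclosing namespace scope, not inside Outer.
  struct X *ptr = nullptr;
  // (2) A static inline member's initializer runs even if Outer is unused.
  static inline int counter = noisyInit();
};

// ::X is usable here precisely because the member declaration introduced it.
X *global = nullptr;
```

Skipping Outer’s body as an opaque token range would lose both the declaration of ::X and the guaranteed side effect of noisyInit().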

For point (2), there’s a more general problem: parsing any expression can trigger the instantiation of a class template that has a static data member with an initializer that has side effects. Unlike the two cases above, I don’t think there’s any way we can correctly detect and handle such cases by simple analysis of the token stream; actual semantic analysis is required. But perhaps, if such instantiations happen only in code that is itself unused, it wouldn’t be terrible for Clang to have a language mode that doesn’t guarantee that such instantiations actually happen.
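A minimal illustration of that instantiation hazard (standard C++ only; the names are invented for the example):

```cpp
int instantiations = 0;

template <class T>
struct Holder {
  static int trace; // the definition below has a side-effecting initializer
};
// This definition is instantiated only when trace is odr-used somewhere,
// and its dynamic initializer then runs: an effect that is invisible in
// the token stream of the expression that triggers it.
template <class T>
int Holder<T>::trace = ++instantiations;

int force() {
  // Semantic analysis of this expression instantiates Holder<int>::trace
  // and thereby schedules its side-effecting initialization.
  return Holder<int>::trace;
}
```

No inspection of the tokens of force() reveals that it mutates a global; only Sema, after instantiation, can know that.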

An alternative, possibly more efficient, implementation would be to make the lookup tables range-based, but we do not have even a prototype proving that this is a feasible approach.

Expected result:

  • Design and implementation of on-demand compilation for non-templated functions
  • Support non-templated structs and classes
  • Run performance benchmarks on relevant codebases and prepare report
  • Prepare a community RFC document
  • [Stretch goal] Support templates

The successful candidate should commit to regular participation in weekly meetings, deliver presentations, and contribute blog posts as requested. Additionally, they should demonstrate the ability to navigate the community process with patience and understanding.

Further reading
[1] root/README/README.CXXMODULES.md at master · root-project/root · GitHub
[2] https://github.com/llvm/llvm-project/commit/b9fa99649bc99
[3] https://github.com/llvm/llvm-project/commit/0f192e89405ce
Project size: Large
Difficulty: Hard
Confirmed Mentors: Vassil Vassilev, Matheus Izvekov


@vvassilev,
I was just thinking about this: what if we transform headers into compact lexical units, each comprising a token stream for every definition within the TranslationUnit? With a corresponding lookup system, we could efficiently access these condensed units. It’s just a thought I’m sharing; I’m not entirely sure of its feasibility. If it’s deemed worthwhile, I can delve deeper into exploring it further.

Thank you for the suggestion! This seems likely to have quite a bit of impact on Clang internals, so one question I have is about code reviewer availability and whether there’s enough expertise and bandwidth to successfully review these changes in a timely manner. Do the mentors think they’ve got that covered?

Do you foresee issues with code like:

struct X {
  static_assert(sizeof(int) == 129);
};

where skipping over the parsing of the tag means you’d potentially not ever be alerted to the failed static assertion?

Also, do you intend for this to apply to all tags, or just structures and unions? e.g.,

enum E { // If you skip this
  One,
  Two
};

void foo(int Val = One); // How do you handle this?

Using a token stream alone is not generally possible for C++, as in many cases we need semantic analysis. C++ has a two-phase lookup process which requires more information. The idea of the proposal is similar to what you mention, but only in isolated cases where we can be certain that we are safe.

Spot on! That’s a general problem we suffer from, and this project is no different. Actually, it would be much worse, as it may require non-trivial changes in Clang, as you mention. My idea is that we will have a proof of concept in a branch, and if the results are what I expect, I hope to solicit more reviewers.

No, but the proof of concept in a separate branch will probably be a reasonable way out :frowning:

If X is “used”, then we will go back to the recorded token stream, parse it, and eventually trigger the assertion. I guess your question is broader than that, in the sense that we will essentially skip over faulty code just because it is not used. I was hoping that discussion would be part of an eventual RFC, but it is great that we can have it now.

I am thinking that we could have a mode in Clang that does not check everything. That would be especially useful for system header files, as we presume they compile correctly. The cutoff is rather difficult because it would need some heuristic.

We can probably not deal with lazy compilation of transparent lexical contexts such as enums at first. They are generally not huge, so they would not be a big performance gain…

Okay, that makes sense to me. It’s not ideal, but at least reviewer bandwidth won’t be a showstopper for you.

Yup, that’s really what it boils down to. Skipping parsing the contents of tag types may help save some time, but it may mean failing to catch ill-formed code. However, I also suspect that the only time it will save is in parsing system headers (which may still be a significant saving!), on the assumption that tags in the program are almost always used while many (most?) tags in system headers may be predominantly unused.

Consider


int f(int);

struct S {
   decltype(f(0.0)) foo(); 
};

double f(double);

S s;

Changing the parsing order here would produce a different (ultimately non-conforming) outcome.
In general, any change to parsing order is going to affect lookup results, so I don’t see how upstreamable that sort of behavior change would be.

Furthermore, if this is done for performance, wouldn’t improvements to modules get us a similar outcome?

That being said:

  • I’d love to see a generalized two-phase parse in C++ (I just don’t see the standard going there)
  • I’d love to see work on CWG2335 (I think it would be valuable regardless of whether the outcome is “we implemented CWG2335 and it’s great” or “we tried to implement CWG2335 and it proved impossible / it broke the world”).

Can you elaborate more? Do you mean that while parsing S we have not yet seen the overload that takes a double, hence the return type of foo would be int?

This is an interesting example, and I need to go check how that’s handled in our downstream work, but in practice I have not seen a problem of that kind in the last 10 years. Maybe we are silently taking the wrong signature, or that kind of code is not very common.

The idea of this project is to parse S in the context where it was found. That is, we will disregard any lookup candidates with source locations after the definition of S. That might need some tweaks in the lookup tables to keep some sort of notion of range, but that’s probably it…
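A crude sketch of such range-based filtering (toy code, not the clang::LookupResult API; Decl, lookupBefore, and the integer locations are made up for illustration):

```cpp
#include <string>
#include <vector>

struct Decl {
  std::string Name;
  unsigned Loc; // order in which the declaration appeared in the TU
};

// Keep only candidates declared before the cutoff, approximating
// "parse S in the context where it was found".
std::vector<Decl> lookupBefore(const std::vector<Decl> &TU,
                               const std::string &Name, unsigned CutoffLoc) {
  std::vector<Decl> Result;
  for (const Decl &D : TU)
    if (D.Name == Name && D.Loc < CutoffLoc)
      Result.push_back(D);
  return Result;
}
```

In the f(int)/f(double) example above, looking up “f” with the cutoff at S’s definition would yield only the int overload, whereas an unrestricted lookup performed after lazy parsing would also see the double one.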

That is what I understand the issue to be. And while this may seem contrived, I can imagine that highly-generic code may run into this problem.

I think we need to be careful though – for a proof of concept, “good enough” is good enough, but the bar is higher for a production quality because the tool loses value if it deviates from conforming behavior too often. That said, limiting this to code that lives in a system header might be a good tradeoff on the assumption that system header code is generally correct.

While I say “prototype”, our target is Boost and Boost-sized libraries. What I meant is that for the GSoC project we should demonstrate that this approach works and can scale. Depending on the contributor, we might be able to test the prototype on larger workloads, because in practice a lot of the infrastructure we need is already present in one form or another in the frontend. We probably just need to find the right angle to rewire a few things.

For example, the challenge that @cor3ntin mentioned can be addressed relatively easily by introducing source-location-based filtering in clang::LookupResult. However, as usual the devil is in the details, because if we want to be very efficient we might eventually re-organize the way we model source locations into some sort of binary tree… That rework would also be practically useful for the implementation of modules, because right now we pre-allocate huge chunks of source location entries upon module loading (not necessarily used by anything). These requirements will come only after we start working on the project.

That being said, I am a huge fan of a minimal implementation with minimal design changes to infrastructure that has been working for many years, rather than hammering all over the place. That requires a lot of staring at code while producing very little of it, but it has worked in several cases, including the work on clang-repl. That’s why I mentioned that “patience” is important for a successful candidate in the project description :wink:


@vvassilev,
I am writing to express my interest in participating in Google Summer of Code (GSoC) and to explore the possibility of contributing to your project, On Demand Parsing. While I have a solid understanding of LLVM and have already submitted some Pull Requests (PRs) to the LLVM project, I am currently familiarizing myself with Clang and its intricate details.

To deepen my understanding and prepare a well-informed proposal, I would greatly appreciate your guidance on specific elements of the codebase that are relevant to my potential contribution. Could you please suggest key classes and data structures within Clang that are essential for understanding the proposed “lookup approach” in your project?

Furthermore, I would be grateful if you could highlight any specific challenges I might face when delving into the proposed “lookup approach.” Are there any potential complexities or intricacies related to class and structure handling that I should be aware of?

@vvassilev,
I am currently exploring the Clang compiler and its parsing behavior. I have a question about lazy parsing in a specific scenario.

Consider the following files:

ab.h:

int add(int a, int b);

b.cpp:

#include "ab.h"
int add(int a, int b) { return a + b; }

a.cpp:

#include "ab.h" 
int main() { 
   add(1, 2); 
   return 0; 
}

In b.cpp, the add function is defined but not used anywhere within b.cpp. I am curious to know how lazy parsing works in this situation in Clang. Since add is not used in b.cpp, how does Clang handle the parsing of this function? Could you please provide guidance on how such scenarios are typically handled?

I am not very familiar with Clang’s dependency management, so any clarification or guidance you can provide would be greatly appreciated.

That’d be clang::Parser, clang::Sema and clang::LookupResult. In addition maybe clang::SourceManager.

I have mentioned the ones I can think of in the previous posts in this thread.

This project expects some previous experience with clang as it has several tricky parts. I do not mind helping you understand some of the clang basics but without prior experience in at least clang AST this can become a very challenging and frustrating project experience.

I have a good understanding of the interactions between the preprocessor, lexer, and parser in Clang, and I’m familiar with various aspects of the Abstract Syntax Tree (AST) such as Decl, Expr, DeclContext, AstContext, and more. I’ve been studying the Clang codebase extensively. However, I still have a question about how lazy parsing works on different C++ files, as I mentioned in the example above. This is because, in my understanding, Clang compiles individual files, creates object files, then links them to make an executable. Could you clarify this process for me?

I’m seeking a deeper understanding of the lookup system and scope in Clang. While I’m already familiar with concepts like IdentifierResolver, Scope, and relevant classes such as FunctionScope and NamespaceScope, I’m particularly interested in the intricate details of how Clang’s Sema resolves exact declarations within a given scope. I want to understand the underlying mechanisms and processes involved in this resolution process to enhance my overall comprehension of Clang’s behavior.

@vvassilev,

I have been closely examining the SemaLookup code, with a focus on obtaining the correct lazy-entity token range for functions, particularly when dealing with overloaded functions. To address this, I have proposed a solution involving a map from NamedDecl to LazyTokenInfo in the Parser. This approach would allow us to obtain the lazy-entity token information for a specific declaration via Sema’s LookupResult. Importantly, LazyTokenInfo itself would not hold any token stream; it would contain only a token range.

Despite these considerations, I am unsure about the feasibility of this approach. Could you please provide insights or suggestions to help me proceed?