Understanding Clang parsing

Hi,

I am trying to understand how the lexer and parser work inside Clang/LLVM. I am using a CMake-generated Visual Studio 2008 solution file. I tried stepping into the lexer to understand how Clang's lexer works.

  1. Can anyone tell me, in a few sentences, how the parser is designed? I would like to concentrate on how the AST fits into this parser design.

I think this will help me understand Clang better. It's hard for me to understand this design with all those interconnected classes.

Thanks

The Parser takes tokens read by the Lexer and assigns meaning to them.
It's responsible for determining what 'int' means, for example. The
parser uses this information to build the AST.

I understand your concern. Parts of Clang have me confused, too :). But
don't worry; there are plenty of people around here (including me) who
are willing to help you understand how Clang works.

Chip

This is something I would like to learn more about as well. For instance, the file lib/Parse/ParseObjc.cpp contains an entire list of tokens, e.g. 'kw_if', 'kw_new'. I am assuming the lexer reads these tokens and prepares them for the parser. Could someone point me to where these 'kw_*' enums are defined?

What would be really helpful is if someone could give a brief overview of how I would go about adding a new expression to C. So far what I've learned is this...

- I would have to add the relevant token so the lexer can recognize it.
- I would have to add parser code in lib/Parse/Parser.cpp?
- I would have to construct the relevant AST for this expr.

If I could just get the names of the files/directories where these changes would need to be made, that would be a great starting point. Thanks,

Salman

Thanks for the support, Charles.
Could you tell me a better way to understand how the parser works? Or is
"stepping into" the code the only solid way to understand it?

Clang's Parser class works like this:
1. Someone calls ParseTopLevelDecl() to parse a top-level declaration
from the source.
2. ParseTopLevelDecl() calls other methods inside the Parser class to
parse fragments of the source according to the grammar as defined in the
C and C++ standards. Most functions get called indirectly; for example,
when the Parser encounters a function declaration, the
ParseFunctionDeclaration() method gets called.
3. These parsing methods notify some object through the "Action"
interface every time something gets parsed. The default Action object
does nothing. The Semantic Analyzer (another really important part of
the compiler), or just "Sema" for short, provides an implementation
which not only builds the AST but also annotates it with types and
checks the semantics of the program specified by the source code.
4. Steps 1-3 are repeated for every top-level declaration in the
translation unit (a source file plus all its included headers).
5. When it hits the end of the unit, ParseTopLevelDecl() calls
Action::ActOnEndOfTranslationUnit(). This lets the object on the other
side know that the translation unit has ended, so it can do any
end-of-unit work it needs to. The Sema implementation invokes an
ASTConsumer object with this information.
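
The steps above can be sketched in a few lines of standalone C++. This is
only a toy model of the flow, not Clang's code: ToyAction, ToySema,
ToyParser, ActOnTopLevelDecl() and ParseTranslationUnit() are hypothetical
names, and each "token" is assumed to be one complete top-level
declaration.

```cpp
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for the Action interface: the parser reports what
// it parsed through virtual callbacks and never builds the AST itself.
struct ToyAction {
    virtual ~ToyAction() = default;
    // Default implementations do nothing, like the default Action object.
    virtual void ActOnTopLevelDecl(const std::string &name) { (void)name; }
    virtual void ActOnEndOfTranslationUnit() {}
};

// Hypothetical stand-in for Sema: records what the parser reports.
struct ToySema : ToyAction {
    std::vector<std::string> decls;
    bool finished = false;
    void ActOnTopLevelDecl(const std::string &name) override {
        decls.push_back(name);
    }
    void ActOnEndOfTranslationUnit() override { finished = true; }
};

// Toy parser: every token is treated as one top-level declaration.
class ToyParser {
public:
    ToyParser(std::vector<std::string> tokens, ToyAction &actions)
        : tokens_(std::move(tokens)), actions_(actions) {}

    // Returns true once the end of the translation unit has been reached.
    bool ParseTopLevelDecl() {
        if (pos_ == tokens_.size()) {
            actions_.ActOnEndOfTranslationUnit();
            return true;
        }
        actions_.ActOnTopLevelDecl(tokens_[pos_++]);
        return false;
    }

private:
    std::vector<std::string> tokens_;
    ToyAction &actions_;
    std::size_t pos_ = 0;
};

// Driver loop: steps 1-3 repeat until the parser reports end of unit.
inline void ParseTranslationUnit(ToyParser &parser) {
    while (!parser.ParseTopLevelDecl()) {
    }
}
```

Passing a plain ToyAction instead of a ToySema would parse the same input
and build nothing, which is the point of keeping the parser and the
semantic side separate.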

I would like to know which files in the solution are responsible for
building the AST.

The Sema library is your best bet. Study it; that's where I started with
Clang. Trust me: it's one of the easiest parts of the compiler to
understand. Like most of Clang, it was designed to be relatively easy to
understand, even for someone with little or no prior experience with
compiler technology. In particular, look at lib/Sema/SemaDecl.cpp and
lib/Sema/SemaExpr.cpp .

Also, take a look at include/clang/Parse/Action.h . This is the file
that defines the Action interface.

Thanks.

No problem.

Chip

Hi again,

Could you tell me in detail what a QualType is? How does it save space when representing different types?

Thanks

This is something I would like to learn more about as well. For
instance, the file lib/Parse/ParseObjc.cpp contains an entire list of
tokens, e.g. 'kw_if', 'kw_new'. I am assuming the lexer reads these
tokens and prepares them for the parser. Could someone point me to
where these 'kw_*' enums are defined?

Believe it or not, they're defined as part of the Basic library. See
include/clang/Basic/TokenKinds.def.
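
A .def file like this is typically consumed with the "X-macro" trick: the
list is written once and expanded several times with different macro
definitions. Here is a minimal standalone sketch of that idea. The keyword
list is inlined as a macro so the example fits in one file, and every name
in it (TOY_KEYWORDS, toytok, lookupKeyword, and so on) is made up for
illustration rather than taken from Clang.

```cpp
#include <cstring>

// Hypothetical inline stand-in for a .def file: one entry per keyword.
#define TOY_KEYWORDS(KEYWORD) \
    KEYWORD(if)               \
    KEYWORD(new)              \
    KEYWORD(return)

namespace toytok {

// First expansion: generate the kw_* enumerators.
enum TokenKind {
    unknown,
#define TOY_ENUM(X) kw_##X,
    TOY_KEYWORDS(TOY_ENUM)
#undef TOY_ENUM
    NUM_TOKENS
};

// Second expansion: generate a spelling lookup from the same list.
inline const char *getKeywordSpelling(TokenKind kind) {
    switch (kind) {
#define TOY_CASE(X) case kw_##X: return #X;
    TOY_KEYWORDS(TOY_CASE)
#undef TOY_CASE
    default:
        return nullptr;
    }
}

// Third expansion: map an identifier's spelling back to a token kind.
inline TokenKind lookupKeyword(const char *text) {
#define TOY_MATCH(X) if (std::strcmp(text, #X) == 0) return kw_##X;
    TOY_KEYWORDS(TOY_MATCH)
#undef TOY_MATCH
    return unknown;
}

} // namespace toytok
```

Adding a keyword then means adding one line to the list; the enum and both
lookup functions pick it up automatically.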

What would be really helpful is if someone could give a brief overview
of how I would go about adding a new expression to C. So far what I've
learned is this...

- I would have to add the relevant token so the lexer can recognize it.

Only if you have a new keyword or some such to add.

- I would have to add parser code in lib/Parse/Parser.cpp?

Not there, but to the relevant source file--probably
lib/Parse/ParseExpr.cpp.

- I would have to construct the relevant AST for this expr.

Look at the AST library--particularly lib/AST/Expr.cpp and friends.

If I could just get the names of the files/directories where these
changes would need to be made, that would be a great starting point.

You'll also have to add a new action to the Action interface
(include/clang/Parse/Action.h), and you'll also have to modify Sema to
understand the new expression (if you intend to use Sema; see
lib/Sema/SemaExpr.cpp). If you want to generate IR from it, you may also
have to modify CodeGen (lib/CodeGen/CGExpr.cpp) to understand the new
AST node. If you want to do static analysis, you may need to modify the
Analysis library, etc.
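
To make the division of labour concrete, here is a rough standalone
sketch of the pieces a new expression touches. Nothing in it is Clang
code: the 'twice' operator, ToyExpr, ToyTwiceExpr, ActOnTwiceExpr() and
emit() are all invented for the example, and emit() merely stands in for
whatever CodeGen would do with the node.

```cpp
#include <memory>
#include <string>
#include <utility>

// AST layer: the node classes (hypothetical names throughout).
struct ToyExpr {
    virtual ~ToyExpr() = default;
    // Stand-in for code generation: render the node as text.
    virtual std::string emit() const = 0;
};

struct ToyIntLiteral : ToyExpr {
    int value;
    explicit ToyIntLiteral(int v) : value(v) {}
    std::string emit() const override { return std::to_string(value); }
};

// The new expression: a made-up unary 'twice' operator.
struct ToyTwiceExpr : ToyExpr {
    std::unique_ptr<ToyExpr> operand;
    explicit ToyTwiceExpr(std::unique_ptr<ToyExpr> op)
        : operand(std::move(op)) {}
    std::string emit() const override {
        return "(" + operand->emit() + " * 2)";
    }
};

// Semantic layer: the action a parser would call after it has parsed
// 'twice <expr>'. It checks the operand and builds the AST node.
inline std::unique_ptr<ToyExpr>
ActOnTwiceExpr(std::unique_ptr<ToyExpr> operand) {
    if (!operand)
        return nullptr; // a real implementation would report an error here
    return std::make_unique<ToyTwiceExpr>(std::move(operand));
}
```

In this toy, the parser change would be one more case that recognizes the
new syntax and forwards the parsed operand to ActOnTwiceExpr().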

Chip

A QualType holds a pointer to a Type object as well as qualifiers such
as 'const', 'volatile', and 'restrict'. This way, we don't have to have
separate Type objects for 'int', 'const int', 'volatile int', 'const
volatile int', etc.

Some of the qualifiers are stored in the lower bits of the pointer
itself (on the assumption that it is always 8-byte aligned). Other
qualifiers have to be stored elsewhere. The ones that are stored in the
pointer are called 'fast' qualifiers, and the others are called 'slow'
qualifiers.
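
The pointer trick can be illustrated with a small standalone sketch. This
is not the real QualType: ToyType and ToyQualType are hypothetical, only
the 'fast' case is modelled, and the assignment of const, restrict and
volatile to the three low bits is an assumption made for the example.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical type object, forced to 8-byte alignment so that the three
// low bits of any pointer to it are always zero.
struct alignas(8) ToyType {
    const char *name;
};

class ToyQualType {
public:
    // Bit assignments are an assumption of this sketch.
    enum : std::uintptr_t {
        Const = 0x1,
        Restrict = 0x2,
        Volatile = 0x4,
        FastMask = 0x7
    };

    ToyQualType(const ToyType *type, std::uintptr_t quals) {
        std::uintptr_t bits = reinterpret_cast<std::uintptr_t>(type);
        assert((bits & FastMask) == 0 && "type must be 8-byte aligned");
        assert((quals & ~FastMask) == 0 && "only fast qualifiers fit");
        value_ = bits | quals; // pack the qualifiers into the spare bits
    }

    // Mask the qualifier bits off to recover the original pointer.
    const ToyType *getTypePtr() const {
        return reinterpret_cast<const ToyType *>(value_ & ~FastMask);
    }

    bool isConstQualified() const { return (value_ & Const) != 0; }
    bool isRestrictQualified() const { return (value_ & Restrict) != 0; }
    bool isVolatileQualified() const { return (value_ & Volatile) != 0; }

private:
    std::uintptr_t value_; // pointer and qualifiers in one word
};
```

With this layout, 'int' and 'const volatile int' are two ToyQualType
values that point at the same ToyType object, and each value is no larger
than a plain pointer on common platforms.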

You can get the underlying Type pointer by calling getTypePtr(), and you
can read all the qualifiers by getting a Qualifiers object from the
QualType with getQualifiers(). You can read only the ones that don't
come from typedefs with the getLocalQualifiers() method, and you can
query particular qualifiers with the isXxxQualified() and
isLocalXxxQualified() methods.

The definition of the QualType class is in include/clang/AST/Type.h.
Take a look.

Chip

This is a useful resource:
http://clang.llvm.org/docs/InternalsManual.html

-Chris