Using Clang AST classes to generate LLVM IR for a compiler front-end

Hi,

I'm an undergraduate student, and I'd like to implement a compiler as a
project for my university. The compiler would compile an experimental
language I made that attempts to make all operations memory-safe with
minimal runtime overhead (no GC). The exceptions are some critical
sections where unsafe code is allowed (typically, standard collection
and algorithm classes).
You can find a presentation here: http://bit.ly/UiVF8z (by the way, I would
appreciate it if some of you could tell me what you think of it and whether
I am fundamentally wrong somewhere).

First I thought of using LLVM and making my compiler directly generate LLVM
IR.
But I have the feeling that this would mean redoing a lot of things that
have already been done (and probably much better than I could ever do).
These things include:
- High-level exception handling with RAII (freeing resources transparently
when exceptions are thrown)
- Inheritance and multiple inheritance
- Virtual methods
- High-level optimizations (out of the grasp of LLVM)
- Concurrency (thread-local storage, etc.)
- Enhanced compatibility with C/C++
- ... ?

So I thought that it would probably be better to generate an AST that Clang
can understand, and let Clang compile it into LLVM IR. Indeed, under the
thin layer of abstraction, my language's structure is very similar to C++.
I know that Clang allows people to reuse its C/C++-parsing capabilities for
various needs, but what I'm looking for is precisely not this part of Clang,
but the other one: translating the AST into LLVM IR...

However, I'm wondering if the Clang API I'm looking for is stable enough, or
if it may change too quickly or too radically...
So my question is: Would it be worth the trouble to learn and use the Clang
AST library so that my front-end can use it, and would it be a good option
for the future?
Or (this is ugly), should I rather generate plain C++ code and compile it
with any C++ compiler (at least until I find something better to do)?

Thank you for your interest,
LP.

- High-level optimizations (out of the grasp of LLVM)

FYI clang does not really do many (any?) "high level optimizations".
Its AST representation is meant for representing the source code
exactly and is "immutable", so no transformations are done on it.

However, I'm wondering if the Clang API I'm looking for is stable enough, or
if it may change too quickly or too radically...

The basic API of codegen'ing has a small and tight interface (note the
sparsity of include/clang/CodeGen/). It should be stable enough,
although as a heads-up it is planned to be renamed to IRGen soon
(along with many (private) classes that fall under its purview). The
difficult part is feeding it an AST that it will understand.

So my question is: Would it be worth the trouble to learn and use the Clang
AST library so that my front-end can use it, and would it be a good option
for the future?

In short: it probably won't be worth your effort. Clang's AST is
extremely complicated (reflective of the complexity of C++).

Longer explanation: I have not heard of anybody directly creating
Clang ASTs for a foreign language, and I don't think that the AST API
is meant for doing that. If you want to directly build ASTs, it
should be relatively straightforward (at a high level) to just
instantiate the AST classes (in include/clang/AST) and link them
together in "appropriate ways". However, although I do not have superb
knowledge of the AST, my belief is that you will run into a lot of
"devil is in the details" problems due to unspoken invariants in the
AST, which make linking them together in "appropriate ways" very hard
to achieve.

The talk by Ronan Keryell at the latest dev meeting
<http://llvm.org/devmtg/2012-11/> may give you a better feel for the
current state of creating Clang's ASTs by any means other than Clang's
own Parse/Sema infrastructure.

Or (this is ugly), should I rather generate plain C++ code and compile it
with any C++ compiler (at least until I find something better to do)?

From briefly skimming your paper, this seems like by far the easiest
approach, at least for an initial prototype.
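
To make the "generate plain C++" route a bit more concrete, here is a minimal sketch of a front-end back half that walks a toy AST and prints C++ source. Everything in it is an assumption made for illustration: the node types (Expr, IntLit, Var, BinOp, Function) and the helper makeSample() are hypothetical, every value is a `long`, and none of it comes from Clang or from the language described in the presentation.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <string>
#include <utility>
#include <vector>

// Hypothetical toy AST; these names are illustrative only.
struct Expr {
    virtual ~Expr() = default;
    virtual std::string emit() const = 0;  // returns a C++ expression
};

struct IntLit : Expr {
    long value;
    explicit IntLit(long v) : value(v) {}
    std::string emit() const override { return std::to_string(value); }
};

struct Var : Expr {
    std::string name;
    explicit Var(std::string n) : name(std::move(n)) {}
    std::string emit() const override { return name; }
};

struct BinOp : Expr {
    char op;
    std::unique_ptr<Expr> lhs, rhs;
    BinOp(char o, std::unique_ptr<Expr> l, std::unique_ptr<Expr> r)
        : op(o), lhs(std::move(l)), rhs(std::move(r)) {
        assert(lhs && rhs && "operands must not be null");
    }
    // Parenthesize every operation so operator precedence never matters.
    std::string emit() const override {
        return "(" + lhs->emit() + " " + op + " " + rhs->emit() + ")";
    }
};

struct Function {
    std::string name;
    std::vector<std::string> params;  // assumption: every parameter is a long
    std::unique_ptr<Expr> body;

    // Emits a complete C++ function definition returning the body expression.
    std::string emit() const {
        std::string out = "long " + name + "(";
        for (std::size_t i = 0; i < params.size(); ++i) {
            if (i != 0) out += ", ";
            out += "long " + params[i];
        }
        out += ") {\n    return " + body->emit() + ";\n}\n";
        return out;
    }
};

// Builds the AST for a function computing x * 2 + 1.
Function makeSample() {
    Function f;
    f.name = "scale";
    f.params = {"x"};
    f.body = std::make_unique<BinOp>(
        '+',
        std::make_unique<BinOp>('*', std::make_unique<Var>("x"),
                                std::make_unique<IntLit>(2)),
        std::make_unique<IntLit>(1));
    return f;
}
```

Calling makeSample().emit() yields the text of a small C++ function that any C++ compiler can then build. Inheritance, exceptions and RAII would come from the target compiler rather than from the generator, which is the appeal of this route for a prototype.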

Targeting LLVM IR, which is designed (and documented) for exactly the
purpose that you want (generating code), is probably the right way to
go if you want to make this language production-quality. However, if
you then want interoperability with C/C++, you will have to do
struct/class layout like a C/C++ compiler, have calling conventions
like a C/C++ compiler, exceptions, vtables, etc. This is generally
really complicated, so it may make sense to reuse parts of clang to do
that; however, there are currently no interfaces in clang designed
specifically for doing any of those things (e.g. look at what it takes
just to print out a class's layout in the function
DumpCXXRecordLayout() in clang/lib/AST/ASTRecordLayoutBuilder.cpp).
Exposing nice APIs for this stuff seems generally useful in a variety
of circumstances, so patches to improve this situation for your use
case would very likely be accepted. You may also want to ping David
Abrahams since a while back he was looking at isolating C++ ABI stuff
into a separate library (although I don't think anything came of it).
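
For comparison, here is a rough sketch of what targeting LLVM IR directly looks like at its very simplest. It assumes the front-end writes the textual form of the IR into a string; a real front-end would more likely use LLVM's C++ IRBuilder API, which is left out here so the example stands alone. The function name emitBinaryFunction and the fixed two-parameter i32 signature are hypothetical.

```cpp
#include <cassert>
#include <string>

// Hypothetical helper: returns textual LLVM IR for a function that applies
// one integer instruction (for example "add", "sub" or "mul") to its two
// i32 parameters and returns the result.
std::string emitBinaryFunction(const std::string& name,
                               const std::string& opcode) {
    assert(!name.empty() && !opcode.empty());
    std::string ir;
    ir += "define i32 @" + name + "(i32 %a, i32 %b) {\n";
    ir += "entry:\n";
    ir += "  %result = " + opcode + " i32 %a, %b\n";
    ir += "  ret i32 %result\n";
    ir += "}\n";
    return ir;
}
```

The string returned by emitBinaryFunction("add2", "add") can be saved to a .ll file and handed to llc or clang. Note that nothing here touches the hard parts mentioned above (record layout, calling conventions, exceptions, vtables); those are exactly what the front-end would have to reproduce by hand on this route.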

-- Sean Silva

I agree. But it is a good idea to look briefly at Clang architecture
and copy it as much as makes sense for your project. You can even
take whole classes like SourceManager/SourceLocation or diagnostics
infrastructure (after decoupling from preprocessor since you don't
have one), or use LLVM's ADT library.

Dmitri

I was told by Eli Friedman that the diagnostics infrastructure and the
preprocessor were inseparable.

-- Sean Silva

Not to put words into Eli's mouth, but I think that what he meant is
that one cannot use the current diagnostic infrastructure as-is without
a preprocessor. But if one copies the code and deletes the irrelevant
parts, it can be made useful (or it might be easier for a newcomer to
rewrite the whole thing from scratch, because it becomes very simple
-- who knows).

Dmitri

I’ve done something like this. I create a (fake) AST from the metadata in .NET DLLs for an implementation of the C++/CLI standard. As you say, the devil is in the details, and I had to read and debug lots of problems caused by the rest of Clang expecting AST invariants that are not exactly obvious. Still, it’s a nice approach since you get the rest of the compiler basically for free. You might have to extend the AST / IR gen to support some details specific to your language, but overall I would say this approach works fine and saves you a lot of work.

Very cool!

Is this work open-source? I'd like to look at it if possible.

-- Sean Silva

Sure, you can check it out on GitHub: https://github.com/tritao/clang/blob/master/lib/Sema/SemaCLI.cpp

Maybe that will also give you some ideas, Lionel.

So my question is: Would it be worth the trouble to learn and use the Clang
AST library so that my front-end can use it, and would it be a good option
for the future?

In short: it probably won't be worth your effort. Clang's AST is
extremely complicated (reflective of the complexity of C++).

I agree. But it is a good idea to look briefly at Clang architecture
and copy it as much as makes sense for your project. You can even
take whole classes like SourceManager/SourceLocation or diagnostics
infrastructure (after decoupling from preprocessor since you don't
have one), or use LLVM's ADT library.

If you want a source manager and diagnostic engine, but don't need a C preprocessor, use LLVM's SourceMgr. It's far simpler.
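
As an illustration of how little a preprocessor-free source manager has to do, here is a minimal sketch of the core job: mapping a byte offset in a buffer to a line and column and formatting a diagnostic. This is not LLVM's SourceMgr API; the class TinySourceMap and its methods are hypothetical names used only for this example, and the "file:line:col: error: message" format is an assumed convention.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for a source manager: owns one buffer and records
// the byte offset at which every line starts.
class TinySourceMap {
public:
    explicit TinySourceMap(std::string text) : text_(std::move(text)) {
        lineStarts_.push_back(0);
        for (std::size_t i = 0; i < text_.size(); ++i)
            if (text_[i] == '\n') lineStarts_.push_back(i + 1);
    }

    // Returns the 1-based (line, column) of a byte offset in the buffer.
    std::pair<std::size_t, std::size_t> lineAndColumn(std::size_t offset) const {
        assert(offset <= text_.size() && "offset out of range");
        // The first line start greater than offset marks the following line.
        auto it = std::upper_bound(lineStarts_.begin(), lineStarts_.end(), offset);
        std::size_t line = static_cast<std::size_t>(it - lineStarts_.begin());
        return {line, offset - lineStarts_[line - 1] + 1};
    }

    // Formats an error message as "file:line:col: error: message".
    std::string diagnostic(const std::string& file, std::size_t offset,
                           const std::string& message) const {
        auto lc = lineAndColumn(offset);
        return file + ":" + std::to_string(lc.first) + ":" +
               std::to_string(lc.second) + ": error: " + message;
    }

private:
    std::string text_;
    std::vector<std::size_t> lineStarts_;
};
```

A lexer would keep only the byte offset in each token and ask the map for a line and column when an error actually has to be printed, which keeps tokens small.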

Thank you all for your answers.

@João Matos: That's interesting, I will have a look. But I'll probably
choose the easiest solution for the moment (generating C++ source code), at
least for the initial prototype.