Proposal for GSoC project for clang front end

Hi all,

I'd like to hear your opinions and ideas for a proposal to improve support for C++ parsing for LLVM's clang front end.

Goal:
Improve clang's C++ support. The scope of the project will be limited to C++ parsing, not code generation (I think the
timeframe of a GSoC project and the complexity of C++ doesn't allow full C++ support to be developed).

C++ parsing support includes (but is not limited to):
-Namespace declarations, using directives.
-Class declarations, class members, methods etc.
-Overload method/function matching.
-C++ name lookup rules, scope resolution.
-Class/function templates.

Is LLVM interested in accepting such a proposal ?
If yes, can you offer me hints on what is the best way to describe such a proposal (I mean, should I make a list about
each and every specific C++ feature that the parser should be able to handle ?)

Any thoughts about the subject will be greatly appreciated.

About me:
I'm an undergraduate student of electrical engineering in Democritus University of Greece (http://www.ee.duth.gr).
I've been a user, contributor, and project leader, of various open-source projects over the years.
I've gained some experience in C++ parsing when I developed an automatic wrapper for Ogre3D (http://www.ogre3d.org),
that produces bindings for the CLR (http://sourceforge.net/projects/mogre).

I think that LLVM+clang is the future of C++ development, and I'd be really happy to make a useful contribution to this
great project :).

-Argiris Kirtzidis

I'm also quite interested in improving the clang front-end: there are
too many projects in dire need of a good C/C++ parser (like
Code::Blocks or Eclipse's CDT for instance). However, I will have to
admit that I'm not very experienced with creating parsers for
programming languages (though I'm quite proficient with C++).

In any case, I could probably take on a less daunting task like
writing the documentation for clang (right now, the documentation is
very, very....lacking), helping another participant (not too sure how
to work it out though) or fixing some TODOs in the clang code (like
providing a better alternative for the hard-coded include paths).

Before I forget my introductions, I'm an undergrad of Computer Science
and Math from Ateneo de Manila University (http://www.admu.edu.ph/) in
the Philippines.

Yes, and C++ is one of the worst parsing nightmares that exist. Prepare
to maintain your code for years to come, or consider it a throwaway
project.

Does anybody know what the status of gcc-xml relative to clang is?
(gcc-xml is reworking the g++ parser to emit XML. It's not complete, but
it should fit the needs of Code::Blocks or CDT nicely once it's done.)

Regards,
Jo

I'd like to hear your opinions and ideas for a proposal to improve
support for C++ parsing for LLVM's clang front end.

Some meta feedback: C++ support in clang is a huge project, far and away more than any mortal can get done in a summer. While it would be possible to sketch out the parser itself in the summer (providing the equivalent of -parse-noop for C) this won't be able to handle a lot of interesting cases. C++ requires a significant amount of semantic analysis just to get parsing correct.

Goal:
Improve clang's C++ support. The scope of the project will be limited to
C++ parsing, not code generation (I think the
timeframe of a GSoC project and the complexity of C++ doesn't allow full
C++ support to be developed).

Ok, remember that parsing is only one piece of the puzzle. We also have semantic analysis/typechecking/ASTBuilding as well. I think that focusing on -fsyntax-only is a good place to be.

C++ parsing support includes (but is not limited to):
-Namespace declarations, using directives.
-Class declarations, class members, methods etc.
-Overload method/function matching.
-C++ name lookup rules, scope resolution.
-Class/function templates.

Ok, pick one or maybe two of these. I think it would be much better to have namespaces fully implemented than have everything sorta implemented.

If I were going to pick, I would suggest focusing on getting simple methods implemented, along with instance variables, etc through -fsyntax-only. This should be a reasonable amount of work for a summer. Something like this should work for example:

class foo {
   int X;
   typedef float Z;
   int test(Z a) { return a+X; }
   int test2(q r);
   typedef float q;
};

int foo::test2(q r) {
   return X+r;
}

No overloading, no templates, but handling the basic "class issues". Static methods would be a bonus :)

Is LLVM interested in accepting such a proposal ?

Yes!

-Chris

Thanks for your feedback Chris,

Chris Lattner wrote:

If I were going to pick, I would suggest focusing on getting simple methods implemented, along with instance variables, etc through -fsyntax-only. This should be a reasonable amount of work for a summer. Something like this should work for example:

class foo {
   int X;
   typedef float Z;
   int test(Z a) { return a+X; }
   int test2(q r);
   typedef float q;
};

int foo::test2(q r) {
   return X+r;
}

No overloading, no templates, but handling the basic "class issues". Static methods would be a bonus :)

Ok, adding basic class support sounds great. It will include:

1) declaring methods, nested classes, enumerations and typedefs. (nested types will be accessible
    only by class methods, unless class scope resolution is implemented; see below)
2) member access control ("public:" etc)
3) calling instance methods.

int foo::test2(q r) {
   return X+r;
}

This is actually quite tricky, because clang currently assumes that a declaration can be "found"
only by using an identifier (support for '::' in "foo::test2" needed), and for name lookup it
assumes that the declaration is accessible at the current scope or at an enclosing scope of the
current one (support for resolving X in "return X+r;" needed).

So either a kind of "hack" would be employed to get correct parsing for this situation only,
or "proper" C++ name lookup would be developed to also accommodate access to class nested types
and static members. Personally I'd prefer the latter, but I'd like to hear your opinion whether
[1) - 3)] plus "C++ name lookup" is a reasonable amount of work for the summer or something
should be dropped or simplified.
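
To make the lookup problem concrete, here is a minimal sketch of a scope chain in which the body of an out-of-line method is parented to the class scope instead of the file scope it is textually written in. This is a toy model, not clang code: ToyScope, resolveXInTest2 and the extra global X are hypothetical names introduced only for illustration.

```cpp
#include <cassert>
#include <map>
#include <string>

// Hypothetical toy model of nested scopes; none of these names come from clang.
struct ToyScope {
  const ToyScope *Parent;                     // enclosing scope, or nullptr
  std::map<std::string, std::string> Names;   // identifier -> description

  explicit ToyScope(const ToyScope *P = nullptr) : Parent(P) {}

  // Walk outwards through the enclosing scopes until the name is found.
  const std::string *lookup(const std::string &Name) const {
    for (const ToyScope *S = this; S; S = S->Parent) {
      auto It = S->Names.find(Name);
      if (It != S->Names.end())
        return &It->second;
    }
    return nullptr;
  }
};

// Resolve "X" as seen from the body of foo::test2. The body scope's parent
// is the class scope of foo, so foo::X hides a (hypothetical) global X.
std::string resolveXInTest2() {
  ToyScope File;
  File.Names["X"] = "global X";

  ToyScope ClassFoo(&File);
  ClassFoo.Names["X"] = "foo::X";
  ClassFoo.Names["q"] = "foo::q";

  ToyScope Body(&ClassFoo);
  Body.Names["r"] = "parameter r";

  const std::string *Found = Body.lookup("X");
  assert(Found && "X should be visible from the method body");
  return Found ? *Found : std::string("<not found>");
}
```

The only point of the sketch is the parent pointer of Body: seeing the "foo::" qualifier is what lets the class scope be chosen as the parent instead of the file scope.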

-Argiris

No overloading, no templates, but handling the basic "class issues".
Static methods would be a bonus :)

Ok, adding basic class support sounds great. It will include:

1) declaring methods, nested classes, enumerations and typedefs. (nested
types will be accessible
   only by class methods, unless class scope resolution is implemented;
see below)
2) member access control ("public:" etc)

Note that 'test' exercises a cute detail of C++: inline methods cannot be parsed until the whole class is processed. You need to analyze the q typedef before you can parse the body of test.

This should be straight-forward to handle in clang, please discuss on cfe-dev when it becomes time.
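
A minimal sketch of the rule in standard C++ (Counter, Step and bumpTwice are hypothetical names, not taken from the example above). The body of an inline method may use members declared later in the class, because bodies are treated as if they came after the closing brace. As far as I know this applies to bodies (and a few other places such as default arguments), while a parameter type written in the declaration itself must already be visible at that point.

```cpp
#include <cassert>

// Hypothetical class used only to illustrate delayed parsing of bodies.
class Counter {
public:
  Counter() : Total(0) {}

  // The body uses Step and Total, both declared further down. This is
  // legal because member function bodies are processed as if they
  // appeared right after the end of the class definition.
  int bump() { Total += Step(1); return Total; }

private:
  typedef int Step;   // declared after its use in the body of bump()
  int Total;          // likewise
};

// Calls bump() twice on a fresh Counter and returns the final value.
int bumpTwice() {
  Counter C;
  C.bump();
  int Result = C.bump();
  assert(Result == 2);
  return Result;
}
```

A parser therefore has to save the tokens of each inline body and come back to them once the whole class has been seen.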

3) calling instance methods.

int foo::test2(q r) {
  return X+r;
}

This is actually quite tricky, because clang currently assumes that a
declaration can be "found"
only by using an identifier (support for '::' in "foo::test2" needed),
and for name lookup it
assumes that the declaration is accessible at the current scope or at an
enclosing scope of the
current one (support for resolving X in "return X+r;" needed).

It's not that bad, we already handle it for instance variables in ObjC.

So either a kind of "hack" would be employed to get correct parsing for
this situation only,
or "proper" C++ name lookup would be developed to also accommodate
access to class nested types
and static members. Personally I'd prefer the latter, but I'd like to
hear your opinion whether
[1) - 3)] plus "C++ name lookup" is a reasonable amount of work for the
summer or something
should be dropped or simplified.

I'd stick with the minimal amount of name lookup to get classes themselves working. instance variables are included, but not overloading or anything tricky.

-Chris

Chris Lattner wrote:

int foo::test2(q r) {
  return X+r;
}

This is actually quite tricky, because clang currently assumes that a
declaration can be "found"
only by using an identifier (support for '::' in "foo::test2" needed),
and for name lookup it
assumes that the declaration is accessible at the current scope or
at an
enclosing scope of the
current one (support for resolving X in "return X+r;" needed).

It's not that bad, we already handle it for instance variables in ObjC.

Actually, name lookup in ObjC methods is not working correctly. For example:

------------------------------------------------------
@interface Test {
   int x;
}

-(void) setX: (int) d;
@end

char *x;

@implementation Test

-(void) setX: (int) n {
   x = n;
}

@end
------------------------------------------------------

If you compile it with "clang test.m -fsyntax-only -pedantic-errors"
you'll get:

main.m:13:7: error: incompatible integer to pointer conversion assigning
'int', expected 'char *'
   x = n;
     ^ ~

I'm not familiar with Objective C, but isn't it supposed to pick up the
'x' instance variable instead of the global one ?

Yes. This is a clang bug.

- Farborz
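
For comparison, a sketch of the same situation in C++ terms, assuming the analogy between an ObjC instance variable and a C++ data member holds for lookup purposes (setAndRead is a hypothetical helper). Inside a member function the class member is found before the global of the same name, so the assignment below is int-to-int and compiles cleanly.

```cpp
#include <cassert>

char *x = nullptr;   // global with the same name as the member below

struct Test {
  int x;
  Test() : x(0) {}

  // Unqualified 'x' binds to Test::x here, not to the global ::x,
  // because class scope is searched before the enclosing file scope.
  void setX(int n) { x = n; }
};

// Stores n through setX and reads the member back.
int setAndRead(int n) {
  Test t;
  t.setX(n);
  assert(::x == nullptr);   // the global is untouched
  return t.x;
}
```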

Chris Lattner wrote:

int foo::test2(q r) {
  return X+r;
}

This is actually quite tricky, because clang currently assumes that a
declaration can be "found"
only by using an identifier (support for '::' in "foo::test2" needed),
and for name lookup it
assumes that the declaration is accessible at the current scope or
at an
enclosing scope of the
current one (support for resolving X in "return X+r;" needed).

It's not that bad, we already handle it for instance variables in ObjC.

Actually, name lookup in ObjC methods is not working correctly. For example:

You're right. I fixed the logic of this in this patch:
http://lists.cs.uiuc.edu/pipermail/cfe-commits/Week-of-Mon-20080324/004865.html

Thanks a lot for noticing this!

I'm not familiar with Objective C, but isn't it supposed to pick up the
'x' instance variable instead of the global one ?

Yes, absolutely.

There are two remaining issues listed in the patch, the first is an internal objc implementation detail (making the AST more consistent with the code by eliminating implicit 'self->'s).

The second is more interesting for your namespace patch. I think we really do need an AST-side concept of "scope", which is similar to the parser one, but is also a bit different. This would allow the implementation of ScopedDecl::isDefinedOutsideFunctionOrMethod to correctly handle typedefs defined inside functions etc.

I'll follow up after catching up on your namespace patch.

-Chris

Chris Lattner wrote:

The second is more interesting for your namespace patch. I think we really do need an AST-side concept of "scope", which is similar to the parser one, but is also a bit different. This would allow the implementation of ScopedDecl::isDefinedOutsideFunctionOrMethod to correctly handle typedefs defined inside functions etc.

Just a suggestion, how about a 'DeclContext' AST class that FunctionDecl would inherit from (and NamespaceDecl, RecordDecl):

class FunctionDecl : public ValueDecl, public DeclContext {
....

ScopedDecl would have a DeclContext member and isDefinedOutsideFunctionOrMethod would check if it's a FunctionDecl.

Yes, this is exactly what I was thinking (But ContextDecl instead of DeclContext :).

With this change, we could then have a TranslationUnitDecl for the top-level. It, ObjCMethod, and NamespaceDecl would also inherit from DeclContext. What do you think?
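
A rough sketch of the shape being proposed, with heavily simplified toy classes. The class names follow the thread, but the ContextKind enum, the constructors and demoTypedefInFunction are assumptions made for illustration and are not clang's actual declarations. In particular, FunctionDecl here is only a context, whereas in the proposal it would also derive from ValueDecl.

```cpp
#include <cassert>

// Toy stand-ins for the AST classes named in the thread.
enum class ContextKind { TranslationUnit, Namespace, Function };

class DeclContext {
  ContextKind Kind;
  DeclContext *Parent;   // enclosing context, nullptr for the top level
public:
  DeclContext(ContextKind K, DeclContext *P) : Kind(K), Parent(P) {}
  DeclContext *getParent() const { return Parent; }
  bool isFunction() const { return Kind == ContextKind::Function; }
};

struct TranslationUnitDecl : DeclContext {
  TranslationUnitDecl() : DeclContext(ContextKind::TranslationUnit, nullptr) {}
};

struct NamespaceDecl : DeclContext {
  explicit NamespaceDecl(DeclContext *P)
      : DeclContext(ContextKind::Namespace, P) {}
};

struct FunctionDecl : DeclContext {
  explicit FunctionDecl(DeclContext *P)
      : DeclContext(ContextKind::Function, P) {}
};

// A declaration that remembers the context it was declared in.
class ScopedDecl {
  DeclContext *Ctx;
public:
  explicit ScopedDecl(DeclContext *C) : Ctx(C) {}
  bool isDefinedOutsideFunctionOrMethod() const { return !Ctx->isFunction(); }
};

// A typedef inside a function versus a variable at namespace scope.
bool demoTypedefInFunction() {
  TranslationUnitDecl TU;
  NamespaceDecl NS(&TU);
  FunctionDecl Fn(&NS);
  ScopedDecl LocalTypedef(&Fn);
  ScopedDecl NamespaceVar(&NS);
  assert(NS.getParent() == &TU);
  return !LocalTypedef.isDefinedOutsideFunctionOrMethod() &&
         NamespaceVar.isDefinedOutsideFunctionOrMethod();
}
```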

-Chris

This sounds perfect. I'd like to give it a try tomorrow, if I may.
Should the ContextDecl member of ScopedDecl be a read-only property, its value passed to the constructors of decls ? Or should there be a setContextDecl method ?
(The latter would be a bit more straightforward, if read-only is not a requirement).

ScopedDecl would have a DeclContext member and isDefinedOutsideFunctionOrMethod would check if it's a FunctionDecl.

Yes, this is exactly what I was thinking (But ContextDecl instead of DeclContext :).

With this change, we could then have a TranslationUnitDecl for the top-level. It, ObjCMethod, and NamespaceDecl would also inherit from DeclContext. What do you think?

-Chris

This sounds perfect. I'd like to give it a try tomorrow, if I may.

Wow, thanks!

Should the ContextDecl member of ScopedDecl be a read-only property, its value passed to the constructors of decls ? Or should there be a setContextDecl method ?
(The latter would be a bit more straightforward, if read-only is not a requirement).

I don't have a strong opinion. I think it does make sense to pass it into the ctor, but having a way to change it later could be useful. I'll leave it up to you. Thanks Argiris. Also, I apologize for not getting to your namespace patch sooner.

-Chris

You could surely save a lot of time reinventing the wheel by reusing an
existing C++ parser, like Elsa:

  http://www.cs.berkeley.edu/~smcpeak/elkhound/

There are even OCaml bindings:

  http://www.cs.ru.nl/~tews/olmar/

These libraries were discussed on the OCaml mailing list recently:

  http://groups.google.com/group/fa.caml/msg/dd7dad5533647220

Please see:
http://clang.llvm.org/comparison.html

-Chris

Well, that's a matter of opinion. Regardless, this thread is about GSoC ideas for LLVM/clang and doesn't have anything to do with Elsa.

-Tanya

Hi,

I've attached a patch that implements the ContextDecl concept. A short summary of the changes:

-Added ContextDecl and TranslationUnitDecl
-TranslationUnitDecl, FunctionDecl and ObjCMethodDecl inherit from ContextDecl
-Decl class has a ContextDecl member
-All Decl subclasses receive a ContextDecl at their constructors
-Changed 'isDefinedOutsideFunctionOrMethod' to

bool ScopedDecl::isDefinedOutsideFunctionOrMethod() const {
  return isa<TranslationUnitDecl>(getContextDecl());
}

-Sema sets the current ContextDecl, that subsequent new Decls should "be under", by calling Context::PushContextDecl and Context::PopContextDecl.

Is 'Context' the appropriate place to keep track of ContextDecl, or should I move this functionality to Sema ?

-Argiris

Chris Lattner wrote:

contextdecl.patch (41.5 KB)

Hi,

I've attached a patch that implements the ContextDecl concept.

Nice!

A short summary of the changes:

-Added ContextDecl and TranslationUnitDecl

Ok.

-TranslationUnitDecl, FunctionDecl and ObjCMethodDecl inherit from ContextDecl

Yep.

-Decl class has a ContextDecl member

Should this be in Decl, or in ScopedDecl? It doesn't make much sense for struct fields (for example) to have this context pointer.

-All Decl subclasses receive a ContextDecl at their constructors

Sounds good, but only ScopedDecl if possible.

-Changed 'isDefinedOutsideFunctionOrMethod' to

bool ScopedDecl::isDefinedOutsideFunctionOrMethod() const {
  return isa<TranslationUnitDecl>(getContextDecl());
}

Woo, very nice!

-Sema sets the current ContextDecl, that subsequent new Decls should "be under", by calling Context::PushContextDecl and Context::PopContextDecl.
Is 'Context' the appropriate place to keep track of ContextDecl, or should I move this functionality to Sema ?

I think this makes more sense to be in Sema than in ASTContext. The notion of the "current" context is something that changes after parsing (which makes sense for Sema to have) but ASTContext represents the code "after" parsing.
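
As a sketch of what keeping the "current" context in Sema could look like: the names PushContextDecl and PopContextDecl come from the patch summary, while ToySema, the vector-based stack and demoPushPop are assumptions made only for this illustration and say nothing about how the patch itself stores the context.

```cpp
#include <cassert>
#include <vector>

// Toy stand-in for a declaration context.
struct ContextDecl {
  const char *Name;
};

// Hypothetical Sema-like class that tracks the context new decls go under.
class ToySema {
  std::vector<ContextDecl *> Stack;   // innermost context is at the back
public:
  explicit ToySema(ContextDecl *TU) { Stack.push_back(TU); }

  ContextDecl *getCurContext() const { return Stack.back(); }

  // Called when entering a function or method body.
  void PushContextDecl(ContextDecl *C) { Stack.push_back(C); }

  // Called when leaving it; the translation unit is never popped.
  void PopContextDecl() {
    assert(Stack.size() > 1 && "cannot pop the translation unit");
    Stack.pop_back();
  }
};

// Enter a function context, check it is current, then leave it again.
bool demoPushPop() {
  ContextDecl TU = {"translation unit"};
  ContextDecl Fn = {"function f"};
  ToySema S(&TU);
  S.PushContextDecl(&Fn);
  bool InFunction = S.getCurContext() == &Fn;
  S.PopContextDecl();
  return InFunction && S.getCurContext() == &TU;
}
```

Because this state only matters while declarations are being created, it lives in the Sema-like object here, and nothing about it needs to survive in the finished AST.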

Some minor stuff:

Please move "CurContext" into Sema and the ContextDecl member to ScopedDecl if possible.

instead of Decl::getContextDecl(), how about just [Scope]Decl::getContext() ?

+/// ContextDecl - This is used only as base class of specific decl types that
+/// can act as declaration contexts. This decls are:

This looks great, and your implementation is very clever. Would it be possible to move ContextDecl to a ContextDecl.h file that Decl.h #includes?

Can the body of ScopedDecl::isDefinedOutsideFunctionOrMethod be moved inline?

This is great work Argiris - I'm very impressed,

-Chris

Chris Lattner wrote:

-Decl class has a ContextDecl member

Should this be in Decl, or in ScopedDecl? It doesn't make much sense for struct fields (for example) to have this context pointer.

-All Decl subclasses receive a ContextDecl at their constructors

Sounds good, but only ScopedDecl if possible.

Isn't this a bit restricting ? When NamespaceDecl is added, non-ScopedDecls could be under TranslationUnitDecl or a
specific NamespaceDecl. NamespaceDecl itself makes little sense to be a ScopedDecl, but it makes sense to give it
a context pointer to its enclosing namespace.
Most of the non-ScopedDecls are ObjC specific but for Objective C++ they will be able to use NamespaceDecl
context pointers to, for example, get their fully qualified name.

And if RecordDecl is made a ContextDecl (mostly useful for C++), a context pointer for FieldDecl would point at
the struct/class that defined it.

What do you think ?

Argiris,

I've been very impressed with the work you have been doing on all of this. It's quite impressive, and will make a great contribution to clang. Comments inline.

Chris Lattner wrote:

-Decl class has a ContextDecl member

Should this be in Decl, or in ScopedDecl? It doesn't make much sense
for struct fields (for example) to have this context pointer.

Chris: Why is that the case? For example, I can imagine cases in the static analyzer where knowing the RecordDecl for a given structure field could potentially be very useful, and clean some things up nicely. Is it the cost of the extra pointer that you are worried about?

-All Decl subclasses receive a ContextDecl at their constructors

Sounds good, but only ScopedDecl if possible.

Isn't this a bit restricting ? When NamespaceDecl is added,
non-ScopedDecls could be under TranslationUnitDecl or a
specific NamespaceDecl. NamespaceDecl itself makes little sense to be a
ScopedDecl, but it makes sense to give it
a context pointer to its enclosing namespace.
Most of the non-ScopedDecls are ObjC specific but for Objective C++ they
will be able to use NamespaceDecl
context pointers to, for example, get their fully qualified name.

All of this makes a lot of sense to me. I like the unifying concept of context here, from the TranslationUnit level and down, and I like having NamespaceDecls having an enclosing context, etc., so that one can easily traverse the namespace hierarchy. I'm curious, how would the following be handled:

   namespace foo {
     class F { public: void foo(); };
   }

   using namespace foo;

   void F::foo() {}

What would the "ContextDecl" be for the decl representing the definition of foo? Would it be a NamespaceDecl, or a TranslationUnitDecl? If it is a NamespaceDecl, is it the same NamespaceDecl as the one enclosing the class definition? I'm sorry if this has already been discussed; I'm trying to wrap my mind around what exactly a ContextDecl is, as it seems to represent both a syntactic construct and a language concept.

And if RecordDecl is made a ContextDecl (mostly useful for C++), a
context pointer for FieldDecl would point at
the struct/class that defined it.

This makes sense to me, and I think it would be very useful.