Cleaning up the representation of Decls in the AST

Over the last couple of weeks Daniel Dunbar and I have had a series of
conversations about various serious issues with the current design and
implementation of Decls in Clang. We've come up with some proposed
changes that we think will resolve most of the problems that we know
about, and I thought I would sketch out these ideas here so that we
can openly discuss these ideas and debate their value.

I'm going to start by outlining some of the problems we are trying to
solve and then discuss our solution from both a design and implementation
level.

We're hoping to start implementing these changes very soon (next week).

Any comments or feedback you have are appreciated!

PROBLEMS WE'RE TRYING TO ADDRESS

(1) AMBIGUOUS OWNERSHIP AND LOSS OF INFORMATION

There are many cases where it is not clear what is the owning
reference to a Decl. This poses problems for both AST serialization
(which is needed for PCH support), but is also needed for refactoring
clients that wish to modify the AST. Similarly, the ASTs don't retain
information about the source code that would be vital for accurately
representing key elements of the original source code (necessary for
refactoring), or even being able to pretty-print valid ASTs to code
that can compile.

For example, consider the following code:

[EXAMPLE 1]

void f() {
   typedef struct x { int y; } foo;
}
struct x { int y; };
void g() {
   typedef struct x foo;
}

clang -ast-print currently pretty-prints this as:

void f() {
   typedef struct x foo;
}
...
void g() {
   typedef struct x foo;
}

Currently both f() and g() appear the same during printing because
their is no reference from either the TypedefDecl or the DeclStmt to
the RecordDecl for 'struct x'. Only the RecordType for 'struct x'
refers to the RecordDecl, which means in this case the RecordType
actually *owns* the RecordDecl. This is horrible, not only because a
Type object should not own Decls, but that a RecordDecl can sometimes
be owned by a DeclStmt or be a top-level declaration:

[EXAMPLE 2]

struct x { int y; }; // GOOD: The TranslationUnit owns the RecordDecl.
struct z { int y; } z; // BAD: The RecordType owns the RecordDecl.

The lack of clear ownership semantics, along with the loss of
information that it implies, makes it very difficult to always
faithfully pretty-print code that actually compiles, or compiles with
the same semantics:

[EXAMPLE 3]

void f() {
   struct { int x; } a, b, *c;
}

clang -ast-print:

void f() {
   struct <anonymous> a;
   struct <anonymous> b;
   struct <anonymous> *c;
}

clang -ast-dump:

void f()
(CompoundStmt 0xc070f0 <<stdin>:3:12, line:5:3>
   (DeclStmt 0xc070d0 <line:4:5>
     0xc06ff0 "struct <anonymous> a"
     0xc07050 "struct <anonymous> b"
     0xc070a0 "struct <anonymous> *c")

As before, in this case the RecordType owns the RecordDecl. Not only
that, but -ast-print pretty-prints the variable declaration in a
completely broken way. While --ast-print can be enhanced to do a
better job in this case, fundamentally the Decl for the anonymous
struct needs to referenced from the AST nodes themselves.

(2) NO DISTINCTION BETWEEN A 'DECL' AND A 'GROUP OF DECLS'

The way we represent "groups of Decls" is both not uniform and poorly
abstracted.

For multi-variable declarations (the most common case of "groups of
Decls"), we represent them as a single ScopedDecl* that points to the
first ScopedDecl* in a chain. The problem is that many clients assume
that when they are handed a single Decl* they are operating on a
single Decl, not a group of Decls. This continues to bite us over and
over. Here are two examples.

The first is the AST pretty-printer, which only pretty-prints the
first Decl in a multi-variable declaration that appears at the
top-level:

$ clang -ast-print
int a, b, c;
^d
typedef char *__builtin_va_list;
Read top-level variable decl: 'a'

Clearly this can be fixed, but basically this client was written to
assume that when it was handed a Decl* it was just reasoning about a
single Decl. Clients have to do downcasts and isa<> interrogations of
Decls to determine if a Decl* is referring to a single Decl or a group
of Decls. In the particular case of the pretty-printer, I counted
about four or five locations in StmtPrinter.cpp where
DeclStmt::getDecl() is called (which returns a chain of ScopedDecls)
and only the first Decl is processed.

A second example of where this use to bite us is codegen support in
Clang. Until recently, when we saw "int a, b, c;" we only did the
codegen for 'a'. This was fixed, but obviously future clients will
continue to make the same mistake over and over, and their failure
modes might not be as dramatic to immediately notice they are getting
it wrong.

Clearly a disambiguation between Decls and 'groups of Decls' would be
useful, not only for clients, but also for cleanliness within the ASTs
themselves. This would help resolve numerous ownership issues as
well. For example, if we had DeclStmt::getDeclGroup() instead of
DeclStmt::getDecl() (or something like that) that returned a group of
Decls then this kind of programming error wouldn't be possible because
the type system with prohibit it. Clients would explicitly reason
about 'groups of Decls'. This would prohibit many errors, but also
would simplify code in many cases. Specifically, in the cases where a
single Decl* is passed to a client the client can assume they are
dealing with a single Decl.

(3) ScopedDecl CHAIN: CONFLATED IMPLEMENTATION AND LACK OF ABSTRACTION

The decl chain in ScopedDecl is currently used for different purposes
in different contexts, which muddles up the ASTs and makes it often
very unclear what is going on.

For example, an EnumDecl represents its enum constants by chaining
them together using a ScopedDecl chain. This same chain is used to
represent multi-variable declarations as well as chaining together
NamespaceDecls. In each of these cases the ownership semantics with
regards to the chain are slightly different, and conceptually they
represent very different things.

For C++ namespaces, we can still chain them together (as we do with
FunctionDecls), but not used the ScopedDecl chain because it is
semantically something very different.

Further, most clients that do successfully walk the ScopedDecl chain
walk the chain directly. This is bad because it fuses us to a
particular implementation instead of having clients reason about an
abstraction. Instead we should use an iterator interface to walk a
group of decls.

In generally, we believe that the ScopedDecl chain should just be
removed. It is used in logically inconsistent ways and almost every
client that has used it has gotten things wrong at least once.

PROPOSED SOLUTION

We propose the following solution to address these issues:

- EnumDecl and NamespaceDecl no longer used the getNextDeclarator()
  chain in ScopedDecl. Both can use a chain (as they do now), or some
  alternate representation. However, the implementation chosen will be
  private to each class and an iterator interface should suffice to
  allow clients to ignore the details of the implementation.

- Introduce the notion of "DeclGroup" that represents a group of
  (comma-separated) Decls.

Essentially, DeclGroup would have the following grammar:

DeclGroup -> TypeDecl Decl+
           -> Decl

DeclGroup would *own* the Decls it references. The "TypeDecl" represents an optional extra type declaration associated with the DeclGroup.

For example, to represent:

struct {} x, y;

The DeclGroup consists of:

(1) A RecordDecl (the "TypeDecl") for the anonymous struct.
(2) Two VarDecls for x and y.

This also works for typedefs:

typedef struct x {} foo, bar;

becomes a DeclGroup consisting of:

(1) A RecordDecl for 'struct x' (the "TypeDecl").
(2) Two TypedefDecls for 'foo' and 'bar'.

Note that this completely resolves the ownership issues with anonymous
structs and similar constructs. It also means that Types no longer
own Decls (a good thing).

For the common case where a DeclGroup contains a single Decl (and no
extra "TypeDecl" is needed), DeclGroup would have an optimized
implementation so that using DeclGroup has no additional space
overhead compared to our current implementation. Our idea is to
essentially implement DeclGroup as a smart pointer, and for this
common case it just contains the pointer to the solitary Decl (no
additional overhead).

Moreover, by using a smart-pointer DeclGroup instead of a chain of ScopedDecls, we actually will save the space used by the NextDeclarator field in ScopedDecl (since it's NULL most of the time).

- DeclStmt uses DeclGroup instead of a ScopedDecl*.

- *All* top-level Decls would be replaced by DeclGroups.

Conceptually this unifies how DeclStmt and the top-level represent
groups of Decls.

ParseAST would be modified to call
ASTConsumer::HandleTopLevelDeclGroup instead of HandleTopLevelDecl. We
can still have HandleTopLevelDecl, and have the default implementation
of HandleTopLevelDeclGroup just call HandleTopLevelDecl for each of
its constitute Decls.

- RecordDecls would represent their fields as a group of DeclGroups.
  This enables us to accurately represent:

struct x {
   struct y { int z; } a, b;
   int c;
};

A field_iterator interface can provide the necessary abstraction to
clients that just want to iterate over the FieldDecls of a struct.

- The arguments of a function would be represented as a list of
  DeclGroups. This is because the following is legal (note the
  anonymous struct):

int foo(int x, struct { int a; } y);

Similarly, an iterator interface provides clients with a clean way to
iterate over the ParmVarDecls in a function prototype.

- Similar changes everywhere else where we have groups of
  (comma-separated) Decls that can contain both type and variable
  declarations.

PROS AND IMPLEMENTATION

We believe these changes could be made very incrementally to the
parse/Sema/ASTs/ASTConsumers.

The main benefits we see from these changes are:

- Cleaner API: Clients always know when they are reasoning about Decls
  or groups of Decls.

- Source-Fidelity: We capture much more of the source information in
  the ASTs. This is great for refactoring clients.

- Object-Ownership: Most of the known object ownership issues in the
  ASTs go away, making it easier for clients to manipulate
  (e.g. refactoring) as well as unblocking full AST serialization
  (needed for PCH).

- Abstraction: Given the proper iterator interfaces, most clients will
  not care about DeclGroups. They can just use the interfaces to
  iterate over Decls. The default implementation of
  ASTConsumer::HandleTopLevelDeclGroup would just call
  ASTConsumer::HandleTopLevelDecl for each Decl* in a top-level
  DeclGroup. Clients that care about top-level DeclGroups can
  override HandleTopLevelDeclGroup.

- ASTs better reflect the grammar of C: We believe these changes would
  make it clearer in many cases what source constructs a particular
  AST node can represent. This is invaluable for clients.

Thoughts?

Hi Ted,

This looks like a great design!
Just a few questions:

1) Who will own struct references/definitions, will there be DeclGroup owners for them too ?
For example:

struct S1;
struct S1 { int x; };
struct S2;
struct S2 {int x; } foo;

who owns these and what will the pretty-print output be for the above example ?

2)

DeclGroup -> TypeDecl Decl+
           -> Decl
  

Given a RecordDecl how can I find out its owner, will TypeDecl point to its owning DeclGroup (or Type point to the DeclGroup owning the TypeDecl) ?

-Argiris

Ted Kremenek wrote:

Hi Ted,

This looks like a great design!
Just a few questions:

1) Who will own struct references/definitions, will there be DeclGroup owners for them too ?
For example:

struct S1;
struct S1 { int x; };
struct S2;
struct S2 {int x; } foo;

Fixing RecordDecls is another issue that we've thought about. Right now we combine multiple struct declarations into a single RecordDecl. This seems very broken.

The natural solution is to have a separate RecordDecl for struct declaration. We can then chain them together, just like we do within FunctionDecls. The chain would not imply ownership; just a way to walk between the RecordDecls. Once this is done, we can naturally use DeclGroups with RecordDecls. Having multiple RecordDecls also solves a whole bunch of other problems (such as the loss of source information).

Once we have multiple RecordDecls, the RecordDecl that actually defines the struct can have a bit in it indicating that this is the case. This allows us to resolve between forward declarations and actual struct implementations.

Once we have multiple RecordDecls, for your example each of these declarations would be wrapped in a DeclGroup. A DeclGroup is basically just a smart pointer, and would be optimized to have the same overhead as a pointer for all of the cases except the last.

For the last case ("struct S2 {int x; } foo;"), DeclGroup would point to a block of memory that contained the RecordDecl for 'struct S2 { int x; }' and a VarDecl for 'foo'.

who owns these and what will the pretty-print output be for the above example ?

The pretty-printed output (once the pretty-printer is fixed) should be the same as your example.

2)

DeclGroup -> TypeDecl Decl+
          -> Decl

Given a RecordDecl how can I find out its owner, will TypeDecl point to its owning DeclGroup (or Type point to the DeclGroup owning the TypeDecl) ?

We can find the owner by building a back-map, just like we do with Stmts (ala ParentMap).

Just to be clear, DeclGroups are a separate concept than DeclContexts.

Moreover, by using a smart-pointer DeclGroup instead of a chain of
ScopedDecls, we actually will save the space used by the
NextDeclarator field in ScopedDecl (since it's NULL most of the time).

Are you intending to pass around DeclGroups by value? Or will they
have a stable address? Passing them by value seems like it would be
confusing, but it could lead to better performance.

Yes! DeclGroup would just be a smart pointer, and should be treated as a pointer. We can change the name to reflect it. To do memory management, we would have something equivalent to llvm::OwningPtr<...> for DeclGroup, e.g. OwnedDeclGroup.

Here is how I would rewrite DeclStmt:

class DeclStmt : public Stmt {
  OwnedDeclGroup Decls;
  SourceLocation StartLoc, EndLoc;
public:
  DeclStmt(DeclGroup D, SourceLocation startLoc, SourceLocation endLoc)
    : Stmt(DeclStmtClass), Decls(D), StartLoc(startLoc), EndLoc(endLoc) {}

  virtual void Destroy(ASTContext& Ctx);

  const DeclGroup getDecls() const { return Decls.get()l; }
  DeclGroup getDecls() { return Decls.get(); }
...

In this example, the dtor for OwnedDeclGroup would actually release the memory pointed to by DeclGroup. The dtor for DeclGroup would not do that (just like a pointer does not automatically free the memory it refers to).

There have been multiple requests for a way to get the location of the
type of a variable declaration; do you have any ideas for how to
implement that? It seems like DeclGroups will make that easier, but
it still seems tricky.

It is tricky, and I'm not certain if it is even a well-formed question. Consider:

int (*x)(int y);

Where is the location of the type? Is it the first int?

We could certainly add this information to DeclGroup. It would eat an extra 4 bytes, which we probably need to do somewhere anyway if we want to preserve this information.

Ted Kremenek wrote:-

- RecordDecls would represent their fields as a group of DeclGroups.
  This enables us to accurately represent:

struct x {
   struct y { int z; } a, b;
   int c;
};

A field_iterator interface can provide the necessary abstraction to
clients that just want to iterate over the FieldDecls of a struct.

This example has a different meaning in C and C++. Would they
be represented the same? How would one walk file-scope struct/
union/enum declarations for C?

Neil.

Ted Kremenek wrote:-

- RecordDecls would represent their fields as a group of DeclGroups.
This enables us to accurately represent:

struct x {
  struct y { int z; } a, b;
  int c;
};

A field_iterator interface can provide the necessary abstraction to
clients that just want to iterate over the FieldDecls of a struct.

This example has a different meaning in C and C++. Would they
be represented the same?

Hi Neil,

I think I know what you are referring to, but could you please elaborate a little more? Are you referring to the visibility of fields? The main solution that DeclGroup would solve is faithfully representing multi-variable declarations that also include a type declaration. This seems complementary to capturing visibility. For example, the "group of DeclGroups" could also be interspersed with "visibility" markers (with SourceLocation information). An iterator mechanism could be provided for clients iterate over private/protected/public fields, etc.

How would one walk file-scope struct/
union/enum declarations for C?

We can currently iterate over all file-scope declarations using TranslationUnit (which contains references to all top-level declarations). This same interface could be preserved by (a) providing a new iterator mechanism to iterate over top-level DeclGroups and (b) layering a revised iterator mechanism for iterating over the top-level Decls over the iterating mechanism for top-level DeclGroups.

Does this address your question, or were you asking about something else?

Ted

Ted Kremenek wrote:-

Ted Kremenek wrote:-

- RecordDecls would represent their fields as a group of DeclGroups.
This enables us to accurately represent:

struct x {
  struct y { int z; } a, b;
  int c;
};

A field_iterator interface can provide the necessary abstraction to
clients that just want to iterate over the FieldDecls of a struct.

This example has a different meaning in C and C++. Would they
be represented the same?

Hi Neil,

I think I know what you are referring to, but could you please elaborate
a little more? Are you referring to the visibility of fields?

Sorry I should have been clearer. I'm not referring to visibility,
I'm referring to the fact that in C "struct y" is at file scope,
whereas in C++ it's a nested struct.

How would one walk file-scope struct/
union/enum declarations for C?

We can currently iterate over all file-scope declarations using
TranslationUnit (which contains references to all top-level
declarations). This same interface could be preserved by (a) providing a
new iterator mechanism to iterate over top-level DeclGroups and (b)
layering a revised iterator mechanism for iterating over the top-level
Decls over the iterating mechanism for top-level DeclGroups.

Hence my follow-up question - it seems your iterator would miss
"struct y" in a C source file?

Neil.

(3) ScopedDecl CHAIN: CONFLATED IMPLEMENTATION AND LACK OF ABSTRACTION

In generally, we believe that the ScopedDecl chain should just be
removed. It is used in logically inconsistent ways and almost every
client that has used it has gotten things wrong at least once.

This seems like a somewhat orthogonal problem from the rest.

PROPOSED SOLUTION

We propose the following solution to address these issues:

- EnumDecl and NamespaceDecl no longer used the getNextDeclarator()
chain in ScopedDecl. Both can use a chain (as they do now), or some
alternate representation. However, the implementation chosen will be
private to each class and an iterator interface should suffice to
allow clients to ignore the details of the implementation.

I agree completely, since this is relatively orthogonal from the rest, it can be done as a first step independent from DeclGroupization

- Introduce the notion of "DeclGroup" that represents a group of
(comma-separated) Decls.

Essentially, DeclGroup would have the following grammar:

DeclGroup -> TypeDecl Decl+
          -> Decl

According to the C grammar lexicon, "typedecl" is a "declaration specifier" and "decl" is a "declarator".

This also works for typedefs:

typedef struct x {} foo, bar;

becomes a DeclGroup consisting of:

(1) A RecordDecl for 'struct x' (the "TypeDecl").
(2) Two TypedefDecls for 'foo' and 'bar'.

Note that this completely resolves the ownership issues with anonymous
structs and similar constructs. It also means that Types no longer
own Decls (a good thing).

Presumably this means that "typedef struct x foo, bar" would be a DeclGroup with no "TypeDecl" but with two typedef decls?

For the common case where a DeclGroup contains a single Decl (and no
extra "TypeDecl" is needed), DeclGroup would have an optimized
implementation so that using DeclGroup has no additional space
overhead compared to our current implementation. Our idea is to
essentially implement DeclGroup as a smart pointer, and for this
common case it just contains the pointer to the solitary Decl (no
additional overhead).

Ok.

- RecordDecls would represent their fields as a group of DeclGroups.
This enables us to accurately represent:

struct x {
  struct y { int z; } a, b;
  int c;
};

Ok. This fixes the inconsistency with how we represent struct fields as well (right now we don't keep track that they are 'grouped').

A field_iterator interface can provide the necessary abstraction to
clients that just want to iterate over the FieldDecls of a struct.

That seems like it will be the common need for Sema at least, so that would be very nice.

PROS AND IMPLEMENTATION

I'm sold, totally awesome. Nice proposal guys.

Here's another related issue that isn't really addressed: who owns the expressions for types?

typedef int X[1+2+x]; // vla

who owns the 1+2+x? We have an issue in the canonical type system where "const X" will end up making another *type* which shares ownership of the expression. This is badness, but perhaps an unrelated issue to the rest of the declgroup discussion here.

-Chris

I think that this (tracking loc info for types) is easy to solve, just a significant amount of work :). I don't think this has anything to do with DeclGroup though.

The reason we don't keep loc info for types around today is that types are uniqued. Conceptually, if types were not uniqued, 'x's type in Ted's example would just be a PointerTypeWithLocInfo with a pointee of FunctionTypeWithLocInfo, with arg and return type of BuiltinTypeWithLocInfo, etc. You could even throw in a GroupingParenTypeInfo object if desired.

This would make the type system capture the full loc info for the declaration and I think it would be extensible to the other type-related loc and syntactic structure info that various clients needed. The big problem with this approach is that it would make type comparisons very slow, because you'd have to recursively compare types by their structural type.

Fortunately, there is an answer here and it's an easy one :). The basic approach I would recommend is to build off the canonical types system. You could define *today* all these classes, and the canonical/desugared version of the type would just be the normal type without location info. This is exactly how we represent "typeof(int)" in the type system separately from "int" even though both have the same semantic meaning.

This can also be used for a number of other things that we don't do today. For example, in ObjC, you can write a protocol qualified type with the protocols in any order, e.g. NSString<a,b,c> is the same as NSString<c,a,b>. Today, we just always discard the original ordering and sort them alphabetically. This means that if we emit a diagnostic with one of these types that we print them out in the wrong order (alphabetic instead of the order the user wrote them in). If we really cared, we could have a 'noncanonical' version of this that preserves the original ordering, and have the canonical version of the type store the qualifiers in sorted order.

I anticipate that the same approach will be used in the future to distinguish between "std::basic_string<char>" and "basic_string<char>" (when using namespace std) and "basic_string<char, type_traits<char> >" when the user explicitly spells out the default parameter, etc.

Tracking this information would be expensive (which is just the nature of the problem since types are so 'intricate'), but could be a conditional in SemaType. Clients that wanted it could parse with that condition enabled and get full loc info for their types.

-Chris

Hi Neil,

I'm still a little unclear about your point, but I believe you are referring to the name lookup/Sema rules. The way I would look at it is that our AST representation is somewhat orthogonal to handling the name lookup rules in Sema.

Ted

The "TypeDecl" would be 'struct x'.

The DeclGroup would contain two TypedefDecls, one for the typedef of foo and the other for the typedef of bar.

Presumably this means that “typedef struct x foo, bar” would be a

DeclGroup with no “TypeDecl” but with two typedef decls?

The “TypeDecl” would be ‘struct x’.

The DeclGroup would contain two TypedefDecls, one for the typedef of
foo and the other for the typedef of bar.

I think this requires some elaboration: The representation here depends on how we
are handling multiple RecordDecls. If the ‘struct x’ in the typedef generates a new
RecordDecl for a forward reference then it would own this Decl, and the DeclGroup
would point to it. If the ‘struct x’ in the typedef does not generate a new RecordDecl
then it would be NULL.

In either situation the contained list of Decls would include two TypedefDecls.

  • Daniel

Ok, makes sense.

Ted Kremenek wrote:

Note that this completely resolves the ownership issues with anonymous
structs and similar constructs. It also means that Types no longer
own Decls (a good thing).
  
There's an ownership issue that is not mentioned:

x = sizeof( struct { int z; } ); // who owns this RecordDecl ?

This may indicate that RecordDecls are fundamentally part of Type and should be owned by a Type.

In a somewhat related note, consider this:

struct A {};
x = sizeof ( struct A ); #1
x = sizeof ( struct B {} ); #2

C++ Sema needs to know whether the Type passed to sizeof is a type reference or a definition (the type definition would be illegal) and currently it seems that there isn't a way to do that.

If we modify RecordType to own RecordDecls and be able to provide the information of type reference or type definition, these examples would be handled:

[EXAMPLE 1]

void f() {
   typedef struct x { int y; } foo;
}
struct x { int y; };
void g() {
   typedef struct x foo;
}

clang -ast-print currently pretty-prints this as:

void f() {
   typedef struct x foo;
}
...
void g() {
   typedef struct x foo;
}

Currently both f() and g() appear the same during printing because
their is no reference from either the TypedefDecl or the DeclStmt to
the RecordDecl for 'struct x'. Only the RecordType for 'struct x'
refers to the RecordDecl, which means in this case the RecordType
actually *owns* the RecordDecl. This is horrible, not only because a
Type object should not own Decls, but that a RecordDecl can sometimes
be owned by a DeclStmt or be a top-level declaration:

The semantics would be that RecordDecls are always owned by RecordTypes.
clang -ast-print would be able to print the above correctly because:
-Type of f()::foo would indicate a struct definition.
-Type of g()::foo would indicate a struct reference.

[EXAMPLE 2]

struct x { int y; }; // GOOD: The TranslationUnit owns the RecordDecl.
struct z { int y; } z; // BAD: The RecordType owns the RecordDecl.

The lack of clear ownership semantics, along with the loss of
information that it implies, makes it very difficult to always
faithfully pretty-print code that actually compiles, or compiles with
the same semantics:
  
The ownership semantics would be that both RecordDecls are owned by RecordTypes.
The first one can be a DeclGroup with no declarators.

[EXAMPLE 3]

void f() {
   struct { int x; } a, b, *c;
}

clang -ast-print:

void f() {
   struct <anonymous> a;
   struct <anonymous> b;
   struct <anonymous> *c;
}

clang -ast-dump:

void f()
(CompoundStmt 0xc070f0 <<stdin>:3:12, line:5:3>
   (DeclStmt 0xc070d0 <line:4:5>
     0xc06ff0 "struct <anonymous> a"
     0xc07050 "struct <anonymous> b"
     0xc070a0 "struct <anonymous> *c")

As before, in this case the RecordType owns the RecordDecl. Not only
that, but -ast-print pretty-prints the variable declaration in a
completely broken way.

The above would be printed correctly since the RecordType would indicate a struct definition.

Any thoughts ?

-Argiris

Hi Argiris,

This is an excellent point. There are a couple reasons why I strongly don't believe that types should ever own elements of the AST.

From an aesthetic perspective, declarations represent program syntax, while types are part of the program semantics. I don't believe our data structures representing semantics should own the data structures that represent program syntax. They can refer to each other, but there isn't an ownership relationship here. Conceptually it just doesn't make much sense, and I believe that one of the goals of the ASTs is that they should be conceptually clean as possible. This allows clients of the ASTs to understand their invariants much more easily. This was actually one of the prime motivation of the DeclGroup idea that Daniel and I put forth. Beyond cleaning up some ownership issues, DeclGroups really do capture more elements of C's actually syntax.

Another reason that having types own Decls is that it really makes things much more difficult for clients that wish to modify the AST. When is a Decl owned by a type? With the idea you propose, a DeclStmt might own a TypeDecl if it isn't a RecordDecl, but in the case of a RecordType the type would own it. There is also the problem that we now have multiple RecordDecls for a given RecordType. Does the RecordType own all of those RecordDecls? This actually makes manipulation of the AST really tricky and error prone.

ASTs cannot adequately capture such elements of the syntax then they need to be augmented. In the case of sizeof(type), it seems to me that a reasonable solution would be to have SizeOfAlignOfTypeExpr refer to a variant instead of a QualType. In the common case the variant would just refer to a QualType, and in the other case it would refer to both a QualType and a TypeDecl (in which case it owns the TypeDecl). Using either smart pointers or subclassing of SizeOfAlignOfTypeExpr, it seems that the storage of such extra information could readily be incorporated into the AST. getArgumentType() could then be made virtual (if necessary) to query the appropriate field.

The nice thing about this solution is that it isolates the ugliness of C's grammar for this particular problem into a few bits of code in the AST, rather than complicating the representation of the C type system or introducing hard to reason about ownership issues between Types and Decls. Naturally the parser would need to be augmented to reason about variants as well; i.e., in the case of sizeof(type), the logic in the parser that returns the QualType for "type" actually may return a "type" or a "decl". This seems fine to me; we're already passing variants around in the parser. Another thing that's nice about this solution is that IMHO it is conceptually clean; the fact the sizeof can refer to a type or a decl is just a consequence of C's grammar that is represented faithfully in the AST itself; clients that care about whether or not sizeof(type) actually refers to a declaration will need to be able to distinguish between these cases, and other clients that just want the type that sizeof(type) refers to won't have to care at all.

There will likely be other corner cases like sizeof(type) that we will need to handle, but these all seem to me to be issues of syntax that require enhancements to the AST themselves. Having the RecordTypes own RecordDecls really just seems like a short-term solution that muddles the conceptual cleanliness of the ASTs and introduces artificial ownership relationships between types and decls that we would only think about introducing in order to solve gross corner cases like these. It just doesn't feel like the right solution to me.

Ted

Ted Kremenek wrote:

Ted Kremenek wrote:

Note that this completely resolves the ownership issues with anonymous
structs and similar constructs. It also means that Types no longer
own Decls (a good thing).

There's an ownership issue that is not mentioned:

x = sizeof( struct { int z; } ); // who owns this RecordDecl ?

This may indicate that RecordDecls are fundamentally part of Type and should be owned by a Type.

Hi Argiris,

This is an excellent point. There are a couple reasons why I strongly don't believe that types should ever own elements of the AST.

From an aesthetic perspective, declarations represent program syntax, while types are part of the program semantics. I don't believe our data structures representing semantics should own the data structures that represent program syntax. They can refer to each other, but there isn't an ownership relationship here. Conceptually it just doesn't make much sense, and I believe that one of the goals of the ASTs is that they should be conceptually clean as possible. This allows clients of the ASTs to understand their invariants much more easily. This was actually one of the prime motivation of the DeclGroup idea that Daniel and I put forth. Beyond cleaning up some ownership issues, DeclGroups really do capture more elements of C's actually syntax.

Another reason that having types own Decls is that it really makes things much more difficult for clients that wish to modify the AST. When is a Decl owned by a type? With the idea you propose, a DeclStmt might own a TypeDecl if it isn't a RecordDecl, but in the case of a RecordType the type would own it. There is also the problem that we now have multiple RecordDecls for a given RecordType. Does the RecordType own all of those RecordDecls? This actually makes manipulation of the AST really tricky and error prone.

To be more clear, the idea is that TypeDecls are owned by Types, and only one RecordDecl would be created.
Can clients modify the AST without involving Types ? Assuming one RecordDecl in the AST is replaced with another, what about the RecordType the refers to it ?
My point is that if the DeclGroup owns the RecordDecl, the Type will still need to be updated. If the Type owns the the RecordDecl you may only need to update the Type.

You also mention that:

- Source-Fidelity: We capture much more of the source information in
  the ASTs. This is great for refactoring clients.

The way I see it, DeclGroups mostly solve ownership issues, it's not a "real" solution for refactoring clients. These clients would want all type references too, not just forward declarations and definitions.
The only way to support refactoring clients seems to be the proposal by Chris here:
http://lists.cs.uiuc.edu/pipermail/cfe-dev/2008-August/002614.html
which is a kind of "Type AST" thing, so complicating the type system may be inevitable. TypeDecls may be part of that "Type AST".

-Argiris

Argiris Kirtzidis wrote:

Ted Kremenek wrote:

Another reason that having types own Decls is that it really makes things much more difficult for clients that wish to modify the AST. When is a Decl owned by a type? With the idea you propose, a DeclStmt might own a TypeDecl if it isn't a RecordDecl, but in the case of a RecordType the type would own it. There is also the problem that we now have multiple RecordDecls for a given RecordType. Does the RecordType own all of those RecordDecls? This actually makes manipulation of the AST really tricky and error prone.

To be more clear, the idea is that TypeDecls are owned by Types, and only one RecordDecl would be created.
Can clients modify the AST without involving Types ? Assuming one RecordDecl in the AST is replaced with another, what about the RecordType the refers to it ?
My point is that if the DeclGroup owns the RecordDecl, the Type will still need to be updated. If the Type owns the the RecordDecl you may only need to update the Type.

Also, about modifying the AST: assuming you want to replace a RecordDecl with another one, it's easier to replace the one RecordDecl in the Type, instead of tracking down and replacing all RecordDecls (forward declarations and definition).

-Argiris

I agree with Ted that we should separate syntax thing from semantics thing in the AST.

struct s;
struct s a;
struct s { int d; } x;

These ‘struct s’ have different meanings: type declaration, type specifier, type definition.
But syntactically they are all RecordDecl.

Having Type owns Decls introduces confusion.

-Zhongxing Xu

2008/9/12 Argiris Kirtzidis <akyrtzi@gmail.com>

Zhongxing Xu wrote:

I agree with Ted that we should separate syntax thing from semantics thing in the AST.

struct s;
struct s a;
struct s { int d; } x;

These 'struct s' have different meanings: type declaration, type specifier, type definition.
But syntactically they are all RecordDecl.

Strictly syntactically speaking, and by standard terminology, here's what the above constructs are:

struct s; -> type-specifier ';'
struct s a; -> type-specifier 'a' ';'
struct s { int d; } x; -> type-specifier 'x' ';'

Isn't it more faithful to the syntax to consider "struct s { int d; }" as part of the type-specifier for 'x' ?
Here's how a client may work when it receives "struct s { int d; } x;" and wants to pretty-print it:

-I've got a DeclGroup of one VarDecl named 'x'.
-Print its type. It's type (the type-specifier part of the syntax) is RecordTypeDef. Print the RecordDecl by getting it from the RecordTypeDef.
-Print "x;"

This seems more syntactically-oriented to me, why is it more confusing ?

-Argiris

To sum up my point here, it is more syntactically accurate to consider RecordDecl a part of type-specifier.
In the syntax it's the type-specifier of a syntactic construct that tells us whether it's "struct s" or "struct s { int d; }".
If we are not getting this kind of information, the type-specifier is not represented correctly in the AST.

-Argiris

Argiris Kirtzidis wrote: