Decls are not synonyms for the symbols they represent

A few weeks ago I had a conversation with Daniel about the fact that the ASTs (or other clang data structures) have no notion of the “entity” (for lack of a better word) that a declaration represents.

Here are a couple examples of what I mean:

(example 1)

extern double x;
extern double x;

Both of these are variable declarations that reference the same variable. There is no notion of the variable itself other than the declarations, which is conflated, particularly since we have multiple declarations in this case (i.e., there is no unique “entity” for the variable).

(incidentally, clang crashes on this input: http://llvm.org/bugs/show_bug.cgi?id=2760))

(example 2)

struct s;
struct s { int a; };
struct s;

Until a few weeks ago, these struct declarations were represented by a single RecordDecl with a unique RecordType. Now they are represented by three separate RecordDecls with a shared, unique RecordType.

With structures, the unique RecordType indeed can be treated as representing the “struct” itself, which seems fine since the given declarations are just type declarations. So in this case, we do have a unique “entity” in the ASTs to represent what the declarations refer to. There are still some issues with this representation, but I will delay mentioning them until after the next example.

(example 3)

int f();
int f();
int f() { return 0; }

int g();
int g() { return 1; }

For this example, we have separate FunctionDecls for each one of these declarations. In this example, all of the declarations both ‘f’ and ‘g’ share the same type (note that this is different from the case with structs). For the case of ‘f’, all of its FunctionDecls are chained together, and the same goes for ‘g’. There is, however, no notion of an entity or concept in the ASTs or other clang data structures that represent ‘f’ itself.

Here is an example of why not having an explicit concept for ‘f’, ‘g’, or any symbol is problematic.

Consider:

extern int h(int* x) attribute((nonnull));
extern int h(int *x);

extern int h(int* x) attribute((noreturn));

This code is completely valid. In the ASTs we create three FunctionDecls, the first having the attribute “nonnull” attached to it (and object of type NonNullAttr) and the third having the attribute “noreturn” attached to it (an object of type NoReturnAttr).

Suppose I had a client (e.g., code generation, static analysis) that wanted to know all the attributes attached to a given function. How would I go about doing this? Given one of these FunctionDecls, I would have to iterate the chain of FunctionDecls and query each one of its attributes. This seems a little cumbersome, and causes separate clients to have to implement their own logic for querying information about “symbols” in a translation unit. It also causes clients to think about internal representations such as the fact that FunctionDecls are chained, something we may wish to change at any moment in the future.

This email isn’t really a proposal of a solution; I’m just raising an issue to see if anyone has any comments. After the last few weeks I’ve been excited by our discussions of DeclGroups and TypeSpecifiers that will solve many of the remaining issues with faithfully representing syntax in the ASTs. At the same time, I think we need to pay a little more attention to the semantics, and providing infrastructure that would be useful for many clients.

Indeed, some of our changes to improve our capturing of syntax have actually weakened some of our clients reasoning about semantics. For example, by splitting separate struct declarations into multiple RecordDecls we actually (initally) broke CodeGen because the CodeGen library assumed that there was a direct 1-1 mapping between a RecordDecl and the concept it represented. That particular case was easily resolved by using the RecordType instead of the RecordDecl to represent the ‘struct’, but I’d be willing to wager that there are other issues that haven’t surfaced yet because RecordTypes are being used in this way (by all clients).

Thoughts?

Ted,

I couldn’t get clang to crash on the code you referenced below…

[steve-naroffs-imac-2:llvm/tools/clang] snaroff% cat xx.c
void f (void)
{
extern double *p;
extern double *p;
}

[steve-naroffs-imac-2:llvm/tools/clang] snaroff% …/…/Debug/bin/clang xx.c
xx.c:4:18: error: redefinition of ‘p’
extern double *p;
^
xx.c:3:18: error: previous definition is here
extern double *p;
^
2 diagnostics generated.

Do I need to run clang with any special switches?

snaroff

Hi Steve,

Sorry for the confusion. I think I meant "incorrect rejects" instead of "crashes." gcc eats that input just fine.

Ted

Hi Ted,

Ted Kremenek wrote:

(example 2)

  struct s;
  struct s { int a; };
  struct s;

Until a few weeks ago, these struct declarations were represented by a single RecordDecl with a unique RecordType. Now they are represented by three separate RecordDecls with a shared, unique RecordType.

What do you think about the idea of going back to single RecordDecls and using DeclGroups to represent the above example ?

struct s; -> typespecifier ';' -> DeclGroup with TypeSpecifier referencing 's' and empty declaration list
struct s { int a; }; -> typespecifier ';' -> DeclGroup with TypeSpecifier defining 's' and empty declaration list

And only one RecordDecl will be created for 's'.

Or maybe make it more explicit by introducing a TypeSpecDecl for these constructs.

(example 3)

int f();
int f() { return 0; }

int g();
int g() { return 1; }

For this example, we have separate FunctionDecls for each one of these declarations. In this example, all of the declarations both 'f' and 'g' share the same type (note that this is different from the case with structs). For the case of 'f', all of its FunctionDecls are chained together, and the same goes for 'g'. There is, however, no notion of an entity or concept in the ASTs or other clang data structures that represent 'f' itself.

Here is an example of why not having an explicit concept for 'f', 'g', or any symbol is problematic.

Consider:

  extern int h(int* x) __attribute__((nonnull)); extern int h(int *x);
  extern int h(int* x) __attribute__((noreturn));

This code is completely valid. In the ASTs we create three FunctionDecls, the first having the attribute "nonnull" attached to it (and object of type NonNullAttr) and the third having the attribute "noreturn" attached to it (an object of type NoReturnAttr).

Suppose I had a client (e.g., code generation, static analysis) that wanted to know all the attributes attached to a given function. How would I go about doing this?

If I'm not mistaken, the latest declaration is supposed to "merge" the information from all previous declarations through Sema::MergeFunctionDecl, and that includes attributes.
The client would only need to query information from the latest declaration.

FYI, there was a long-winded discussion on what is a good way to combine the information from all FunctionDecls and use the same FunctionDecl pointer throughout the AST, here:
http://lists.cs.uiuc.edu/pipermail/cfe-dev/2008-April/001479.html
http://lists.cs.uiuc.edu/pipermail/cfe-dev/2008-May/001608.html

-Argiris

Argiris Kirtzidis wrote:

What do you think about the idea of going back to single RecordDecls and using DeclGroups to represent the above example ?

struct s; -> typespecifier ';' -> DeclGroup with TypeSpecifier referencing 's' and empty declaration list
struct s { int a; }; -> typespecifier ';' -> DeclGroup with TypeSpecifier defining 's' and empty declaration list

And only one RecordDecl will be created for 's'.

Or maybe make it more explicit by introducing a TypeSpecDecl for these constructs.

Also, by having only TypeSpecifiers own RecordDecls, the ownership model for RecordDecls is more consistent that an ownership model where sometimes a TypeSpecifier owns a RecordDecl and other times it's the translation unit.

-Argiris

Hi Argiris,

My understanding was that with the combination of TypeSpecifiers and DeclGroups, TranslationUnit would no longer have any direct reference to a RecordDecl (and thus never own it). Isn't this true?

Ted

Hi Argiris,

I think what to do with RecordDecls (whether we merge them or not) will become more clear once we start implementing DeclGroups and TypeSpecifiers. My concern about collapsing RecordDecls as we did before is the loss of information; once more of the TypeSpecifier logic is in place, it should become more clear what we need to keep and what to throw away.

The issue I am also trying to raise is whether or not RecordDecl really *should* be used as the key identifier for the struct. This is regardless of whether or not it is plausible. To me the RecordDecl is a syntactic construct, where what I would like to represent is something else. Of course for efficiency we collapse concepts all the time, but I think it would be preferable to try to separate out some of these different concepts into different entities in the clang ASTs/data structures.

Because a RecordDecl is a type declaration, it seems to me that we already have something in Clang that represents the "entity" that a RecordDecl refers to; in this case a RecordType. My concern is for things like variables, functions, etc., which have multiple declarations but no one thing representing the given entity (I'm sorry to be fuzzy here).

As for reusing Decls for this purpose (i.e., by merging similar Decls together), as an analogy, in our recent discussion of TypeSpecifiers you originally proposed we used Type objects for that purpose, and we converged on the idea that TypeSpecifiers really were something wholly different (something we really do need). In this case we needed a new data structure to capture more the of program syntax in the ASTs. In this case, I'm ruminating whether or not we need something else to represent an aspect of the program semantics that we aren't directly representing.

Ted

Consider:

extern int h(int* x) attribute((nonnull)); extern int h(int *x);

extern int h(int* x) attribute((noreturn));

This code is completely valid. In the ASTs we create three FunctionDecls, the first having the attribute “nonnull” attached to it (and object of type NonNullAttr) and the third having the attribute “noreturn” attached to it (an object of type NoReturnAttr).

Suppose I had a client (e.g., code generation, static analysis) that wanted to know all the attributes attached to a given function. How would I go about doing this?

If I’m not mistaken, the latest declaration is supposed to “merge” the information from all previous declarations through Sema::MergeFunctionDecl, and that includes attributes.

To my chagrin I was not aware of this. Thanks for pointing it out!

The problem I have with this approach is that it relies on the ASTs being constructed in a very special way. Sema of course has to do the right thing, but what about other entities that create or modify ASTs? Put another way, it is not clear to me, from looking at the ASTs, that the invariant you mentioned is in place. It’s just something that is silently there because of the way Sema does things.

The client would only need to query information from the latest declaration.

FYI, there was a long-winded discussion on what is a good way to combine the information from all FunctionDecls and use the same FunctionDecl pointer throughout the AST, here:
http://lists.cs.uiuc.edu/pipermail/cfe-dev/2008-April/001479.html
http://lists.cs.uiuc.edu/pipermail/cfe-dev/2008-May/001608.html

My own feeling, and from reviewing the threads you mentioned and from looking over the recent discussions on this list, that reusing Decls (either by merging them, changing SourceLocations, etc.) just doesn’t seem right. Gradually mutating the Decl objects as we construct the rest of the AST for a translation unit just seems messy, and makes modifying or creating AST nodes error prone. Reusing Decls to me makes it inherently harder to capture the way a user wrote the code. I’m willing to put this part of the argument, however, until we have actually gone and implemented some of TypeSpecifier and DeclGroup to see what we are left with.

To me declarations represent syntax, and should reference the thing they are declaring. RecordDecls already do this with RecordTypes (since there is a many-to-one mapping here), but we don’t have a similar concept for variables and functions.

Ted Kremenek wrote:

My understanding was that with the combination of TypeSpecifiers and DeclGroups, TranslationUnit would no longer have any direct reference to a RecordDecl (and thus never own it). Isn't this true?

Ah, never mind, I was under the impression that it was going to be something like this:

int x; -> DeclGroup
struct y; -> RecordDecl

but it will actually be like this ?:

int x; -> DeclGroup
struct y; -> DeclGroup

is this correct?

I think so. In this case, the common-case optimization would be to have a DeclGroup just refer to a single Decl, but the idea is at the top-level (the translation unit) you have a set of DeclGroups, each which have a collection of Decls. This way clients are always forced to think about either groups of decls or individual decls in the proper contexts.

I think this is another useful concept separation as in TypeSpecifier and Type.

A specific client that can benefit from this separation is the static analysis tool. We can have an entity that represents the variable instead of using one of the (perhaps redundant) declarations.

2008/9/17 Ted Kremenek <kremenek@apple.com>

Ted,

I couldn't get clang to crash on the code you referenced below...

[steve-naroffs-imac-2:llvm/tools/clang] snaroff% cat xx.c
void f (void)
{
extern double *p;
}

[steve-naroffs-imac-2:llvm/tools/clang] snaroff% ../../Debug/bin/clang xx.c
xx.c:4:18: error: redefinition of 'p'
extern double *p;
                ^
xx.c:3:18: error: previous definition is here
extern double *p;
                ^
2 diagnostics generated.

Do I need to run clang with any special switches?

snaroff

Hi Steve,

Sorry for the confusion. I think I meant "incorrect rejects" instead of "crashes." gcc eats that input just fine.

Ted,

I just fixed 2760 – Clang rejects valid redeclaration. This was a trivial bug in MergeVarDecl().

"entity" for the variable (which was the context of your original email).

snaroff

Comments below…

A few weeks ago I had a conversation with Daniel about the fact that the ASTs (or other clang data structures) have no notion of the “entity” (for lack of a better word) that a declaration represents.

Here are a couple examples of what I mean:

(example 1)

extern double x;
extern double x;

Both of these are variable declarations that reference the same variable. There is no notion of the variable itself other than the declarations, which is conflated, particularly since we have multiple declarations in this case (i.e., there is no unique “entity” for the variable).

(incidentally, clang crashes on this input: http://llvm.org/bugs/show_bug.cgi?id=2760))

Fixed.

(example 2)

struct s;
struct s { int a; };
struct s;

Until a few weeks ago, these struct declarations were represented by a single RecordDecl with a unique RecordType. Now they are represented by three separate RecordDecls with a shared, unique RecordType.

With structures, the unique RecordType indeed can be treated as representing the “struct” itself, which seems fine since the given declarations are just type declarations. So in this case, we do have a unique “entity” in the ASTs to represent what the declarations refer to. There are still some issues with this representation, but I will delay mentioning them until after the next example.

I think your change was a nice improvement:-)

(example 3)

int f();
int f();
int f() { return 0; }

int g();
int g() { return 1; }

For this example, we have separate FunctionDecls for each one of these declarations. In this example, all of the declarations both ‘f’ and ‘g’ share the same type (note that this is different from the case with structs). For the case of ‘f’, all of its FunctionDecls are chained together, and the same goes for ‘g’. There is, however, no notion of an entity or concept in the ASTs or other clang data structures that represent ‘f’ itself.

I don’t really understand what you mean by “f itself”. In the example above, we have two identical function declarations and one function (for “f”). This accurately reflects the source code (which is our goal). I could imagine higher level convenience functions that might be useful for some clients, however I think the AST is fundamentally correct in this instance.

Here is an example of why not having an explicit concept for ‘f’, ‘g’, or any symbol is problematic.

Consider:

extern int h(int* x) attribute((nonnull));
extern int h(int *x);

extern int h(int* x) attribute((noreturn));

This code is completely valid. In the ASTs we create three FunctionDecls, the first having the attribute “nonnull” attached to it (and object of type NonNullAttr) and the third having the attribute “noreturn” attached to it (an object of type NoReturnAttr).

Suppose I had a client (e.g., code generation, static analysis) that wanted to know all the attributes attached to a given function. How would I go about doing this? Given one of these FunctionDecls, I would have to iterate the chain of FunctionDecls and query each one of its attributes. This seems a little cumbersome, and causes separate clients to have to implement their own logic for querying information about “symbols” in a translation unit. It also causes clients to think about internal representations such as the fact that FunctionDecls are chained, something we may wish to change at any moment in the future.

As far as the AST’s go, I really don’t see the hardship here. The fact that the FunctionDecls are chained accurately reflects the source code…doesn’t it? For me, the problem with the chain is memory efficiency (more than convenience). In C, it is fairly uncommon to have more than one function decl for the same name (yet every FunctionDecl has a chain!). Nevertheless, we already have some bloat in FunctionDecl…every prototype has a Body slot (ouch). Clearly room for improvement here.

A related issue which I consider more problematic is the lack of any “chain” for VarDecls. Consider the following code:

int i4;
int i4;
extern int i4;

const int a [1] = {1};
extern const int a;

extern const int b;
const int b [1] = {1};

At the moment, there is no way to get to the previous declaration! Since I’ve already whined about the memory inefficiency for FunctionDecl’s, I certainly wouldn’t recommend adding a chain for all VarDecls!

I remember writing Sema::CheckForFileScopedRefefinitions() where I had to deal with this. Fortunately, Sema’s “IdResolver” came to the rescue (thanks Argiris:-). That said, my gut says it might be worth using an IdResolver-like mechanism to solve this “navigation problem” for both VarDecls and FunctionDecls. Architecturally, it would make sense for this new API to be part of ASTContext.

Thoughts?

snaroff

steve naroff wrote:

(example 3)

int f();
int f() { return 0; }

int g();
int g() { return 1; }

For this example, we have separate FunctionDecls for each one of these declarations. In this example, all of the declarations both 'f' and 'g' share the same type (note that this is different from the case with structs). For the case of 'f', all of its FunctionDecls are chained together, and the same goes for 'g'. There is, however, no notion of an entity or concept in the ASTs or other clang data structures that represent 'f' itself.

I don't really understand what you mean by "f itself". In the example above, we have two identical function declarations and one function (for "f"). This accurately reflects the source code (which is our goal). I could imagine higher level convenience functions that might be useful for some clients, however I think the AST is fundamentally correct in this instance.

This is what I think too, it's better if the AST is a simple as possible, more higher level semantic information could be build on top of the AST, instead of getting embedded in.
For example, there could be a framework like the analysis framework, that gets "fed" ASTs from multiple translation units and can reason about declarations as a whole.
A client could query it to determine that the "f" function is declared 'here' in this file and defined 'there' in that file.

-Argiris

steve naroff wrote:

(example 3)

int f();
int f() { return 0; }

int g();
int g() { return 1; }

For this example, we have separate FunctionDecls for each one of these declarations. In this example, all of the declarations both 'f' and 'g' share the same type (note that this is different from the case with structs). For the case of 'f', all of its FunctionDecls are chained together, and the same goes for 'g'. There is, however, no notion of an entity or concept in the ASTs or other clang data structures that represent 'f' itself.

I don't really understand what you mean by "f itself". In the example above, we have two identical function declarations and one function (for "f"). This accurately reflects the source code (which is our goal). I could imagine higher level convenience functions that might be useful for some clients, however I think the AST is fundamentally correct in this instance.

This is what I think too, it's better if the AST is a simple as possible, more higher level semantic information could be build on top of the AST, instead of getting embedded in.
For example, there could be a framework like the analysis framework, that gets "fed" ASTs from multiple translation units and can reason about declarations as a whole.
A client could query it to determine that the "f" function is declared 'here' in this file and defined 'there' in that file.

Another simple/relevant example is a browser index/database (for an IDE, say)...where the notion of "project scope" comes into play. We need to resist the natural tendency to fold all associations/data into the AST. I'm not suggesting that's what Ted was implying...just re-making a point.

snaroff

As far as the AST’s go, I really don’t see the hardship here. The fact that the FunctionDecls are chained accurately reflects the source code…doesn’t it? For me, the problem with the chain is memory efficiency (more than convenience). In C, it is fairly uncommon to have more than one function decl for the same name (yet every FunctionDecl has a chain!). Nevertheless, we already have some bloat in FunctionDecl…every prototype has a Body slot (ouch). Clearly room for improvement here.

I’m actually not arguing about adding anything to the AST. I agree with you and Argiris that we should keep the ASTs to faithfully representing the syntax. My issue has more to resolving which declarations refer to the same entity. Your IdResolver idea (below) is more or less what I was thinking of.

A related issue which I consider more problematic is the lack of any “chain” for VarDecls. Consider the following code:

int i4;
int i4;
extern int i4;

const int a [1] = {1};
extern const int a;

extern const int b;
const int b [1] = {1};

At the moment, there is no way to get to the previous declaration! Since I’ve already whined about the memory inefficiency for FunctionDecl’s, I certainly wouldn’t recommend adding a chain for all VarDecls!

Agreed.

I remember writing Sema::CheckForFileScopedRefefinitions() where I had to deal with this. Fortunately, Sema’s “IdResolver” came to the rescue (thanks Argiris:-). That said, my gut says it might be worth using an IdResolver-like mechanism to solve this “navigation problem” for both VarDecls and FunctionDecls. Architecturally, it would make sense for this new API to be part of ASTContext.

Sounds great to me. What I would like to see is a nice way for clients of the ASTs to go from syntactic constructs (e.g. declarations) to the higher-level concepts. I don’t want to add bloat to the AST; the problem is that some of this information just simply cannot be recovered once we exit Sema. I think having the notion of an IdResolver in ASTContext seems like a very reasonable idea; not certain about the performance characteristics.

This is what I think too, it's better if the AST is a simple as possible, more higher level semantic information could be build on top of the AST, instead of getting embedded in.

Agreed. I was thinking of something more of the lines of what Steve mentioned in his last email. Just a clean way to go from Decls -> unique identifier.

For example, there could be a framework like the analysis framework, that gets "fed" ASTs from multiple translation units and can reason about declarations as a whole.

I've been thinking about doing this for a while, and it would be something needed for inter-procedural static analysis. However, to me, this is a different concept that would be layered on top of whatever mechanism is in place to resolve identifiers/declarations *within* a translation unit.

The problem that I'm concerned with right now is that there are cases where two DeclRefExprs can refer to different VarDecls that correspond to the same variable, but a client of the AST has no way of knowing that that is the case. Consider:

Ted Kremenek wrote:

The problem that I'm concerned with right now is that there are cases where two DeclRefExprs can refer to different VarDecls that correspond to the same variable, but a client of the AST has no way of knowing that that is the case. Consider:

----

extern int x;

void foo() {
  x++;
}

extern int x;

void bar() {
  x++;
}

void baz() {
   extern int x;
   x++;
}

-----

If you do an AST dump of this, you will see that the DeclRefExprs for 'x' in foo(), bar(), and baz() respectively refer to separate VarDecls. Right now a client of the ASTs has no way of knowing that these VarDecls refer to the same variable. Certainly a higher-level API that resolves identifiers across translation units could do this, but it might have to do a lot of work to resolve situations like the one above. By having a mechanism to resolve identifiers within a translation unit, this higher-level API would much easier to implement.

An "IdResolver" used beyond Sema would be very useful, but it may come with a memory cost.
Could we get away with having all DeclRefExprs for 'x' refer to the same VarDecl ?

I guess we could, however I'm still not sure it's the right thing to do (since it goes against our tenet of keeping the AST simple and reflective of the source). From the compiler's perspective, the 3 DeclRefExprs *should* bind to 3 different VarDecls.

For this example, I'd like to know why/when we care if all these VarDecls refer to the same variable? Since each VarDecl is "extern", it's the linkers job to bind the actual variable definition (which isn't in this particular translation unit).

The "meta issue" here is how we preserve aspects of Sema's knowledge without bloating/contorting the AST's. Like everything else we've been doing with clang, it needs to be driven by actual clients. As clang clients evolve, we need to consider API's that enable our AST's to be used conveniently. Asking clients to re-implement complex Sema merging/checking/etc. isn't practical. Note that ImplicitCastExpr was added to make the code generators life more convenient - without it, the code generator would have to be in the type conversion business (which Chris/I wanted to avoid, given C's arcane rules). So, changes to the AST's are certainly not out of the question...we just need to be clear on the cost/benefit.

snaroff

I guess we could, however I'm still not sure it's the right thing to do (since it goes against our tenet of keeping the AST simple and reflective of the source). From the compiler's perspective, the 3 DeclRefExprs *should* bind to 3 different VarDecls.

I agree with Steve here. The three separate declarations really are separate declarations that should be represented explicitly in the AST.

For this example, I'd like to know why/when we care if all these VarDecls refer to the same variable? Since each VarDecl is "extern", it's the linkers job to bind the actual
variable definition (which isn't in this particular translation unit).

This is somewhat of a bad example; it was mainly to illustrate a point. However, there are cases for this example where clients do care that they refer to the same variable. For example, the static analyzer, when analyzing these functions, may wish to know that these three declarations represent the same global variable. Where that variable is actually instantiated is irrelevant; the linker just takes care of making sure their is storage slotted for that variable in some translation unit. In other words,
we know that there is a global variable 'x' whose storage lies in some other translation unit; if the program were to link and run, it will be an actual variable in memory, so it doesn't really matter where it was defined.

The "meta issue" here is how we preserve aspects of Sema's knowledge without bloating/contorting the AST's. Like everything else we've been doing with clang, it needs to be driven by actual clients. As clang clients evolve, we need to consider API's that enable our AST's to be used conveniently. Asking clients to re-implement complex Sema merging/checking/etc. isn't practical. Note that ImplicitCastExpr was added to make the code generators life more convenient - without it, the code generator would have to be in the type conversion business (which Chris/I wanted to avoid, given C's arcane rules). So, changes to the AST's are certainly not out of the question...we just need to be clear on the cost/benefit.

My thoughts exactly. It would be great if some of this information could be stored in external data structures; clients who wish to keep that information could do so, and otherwise it could get thrown away once Sema is done with it.