annotation token documentation

After my little fun hacking on annotation tokens, I decided to write up some documentation so that I'd remember how they work the next time around:
http://clang.llvm.org/docs/InternalsManual.html#AnnotationToken

Argiris, I've come around to being a true believer in the approach, I think it is very nifty. I have a couple of concerns about the current implementation though: first it looks like the template-id handling code is not really complete. I added some FIXME's to the documentation above, I assume that the implementation is just in progress. I'd appreciate it if someone clueful could review the doc.

Second, I think it would be interesting to consider handling the "negative" case (when an identifier is not a type) by turning the identifier into an annotation token as well, representing the variable it resolves to (which could be null if it refers to nothing). Right now, if we see an identifier in the token stream, we don't know if that means it is unresolved or whether it means that it is a variable and backtracking already analyzed but did not change it.

Argiris, do you think it would make sense to make a new annot_variable that contains a (potentially null) pointer to a variable decl? Even without backtracking, this would reduce name lookups in C: right now we try to decide if something is a type (which requires a lookup) then decide its not, so we handle it as a variable (requiring another lookup). Specifically, things like:

int x = sizeof(x);

first call Sema::isTypeName then call Sema::LookupDecl. If the parser just called a "lookup the decl for this identifier" method, it could annotate it once and avoid the second lookup.

-Chris

After my little fun hacking on annotation tokens, I decided to write
up some documentation so that I'd remember how they work the next time
around:
http://clang.llvm.org/docs/InternalsManual.html#AnnotationToken

Argiris, I've come around to being a true believer in the approach, I
think it is very nifty. I have a couple of concerns about the current
implementation though: first it looks like the template-id handling
code is not really complete.

The completely-broken template-id handling is my fault. There's a whole bunch of code that needs to get written, but it's relatively low priority (for now).

Second, I think it would be interesting to consider handling the
"negative" case (when an identifier is not a type) by turning the
identifier into an annotation token as well, representing the variable
it resolves to (which could be null if it refers to nothing). Right
now, if we see an identifier in the token stream, we don't know if
that means it is unresolved or whether it means that it is a variable
and backtracking already analyzed but did not change it.

This would be great. Right now, we have isTypeName, isCurrentClassName, and isTemplateName, all of which call LookupDecl and then return a single yes or no answer. Some code paths call these routines multiple times. We'd be better off with a "what does this identifier?" action that tells us what we're looking at... and then we can store that information in an annotation token.

One thing we have to be careful of is that a given identifier can be looked up in different ways in different contexts, so we have to know when to throw away the results of a previous lookup. For example, and identifier 'x' prior to a '::' is looked up differently that 'x' if it is followed by something else.

  - Doug

Argiris, I've come around to being a true believer in the approach, I
think it is very nifty. I have a couple of concerns about the current
implementation though: first it looks like the template-id handling
code is not really complete.

The completely-broken template-id handling is my fault. There's a whole bunch of code that needs to get written, but it's relatively low priority (for now).

Ah ok. I'm completely fine with it being broken for now, can you please review the comments I added to the internals doc and update if they are wrong?

Second, I think it would be interesting to consider handling the
"negative" case (when an identifier is not a type) by turning the
identifier into an annotation token as well, representing the variable
it resolves to (which could be null if it refers to nothing). Right
now, if we see an identifier in the token stream, we don't know if
that means it is unresolved or whether it means that it is a variable
and backtracking already analyzed but did not change it.

This would be great. Right now, we have isTypeName, isCurrentClassName, and isTemplateName, all of which call LookupDecl and then return a single yes or no answer. Some code paths call these routines multiple times. We'd be better off with a "what does this identifier?" action that tells us what we're looking at... and then we can store that information in an annotation token.

Cool, I completely agree.

One thing we have to be careful of is that a given identifier can be looked up in different ways in different contexts, so we have to know when to throw away the results of a previous lookup. For example, and identifier 'x' prior to a '::' is looked up differently that 'x' if it is followed by something else.

Is this a purely lexical thing? Can backtracking change the decision about what form a token is? The current code does a pretty good job (afaik) handling a different than a::b using lookahead for example, is this sufficient or are there other crazy cases beyond what we're already handling?

-Chris

Hi Chris & Doug,

Chris Lattner wrote:

After my little fun hacking on annotation tokens, I decided to write up some documentation so that I'd remember how they work the next time around:
http://clang.llvm.org/docs/InternalsManual.html#AnnotationToken

Thanks a lot for documenting it!

Second, I think it would be interesting to consider handling the "negative" case (when an identifier is not a type) by turning the identifier into an annotation token as well, representing the variable it resolves to (which could be null if it refers to nothing). Right now, if we see an identifier in the token stream, we don't know if that means it is unresolved or whether it means that it is a variable and backtracking already analyzed but did not change it.

Argiris, do you think it would make sense to make a new annot_variable that contains a (potentially null) pointer to a variable decl? Even without backtracking, this would reduce name lookups in C: right now we try to decide if something is a type (which requires a lookup) then decide its not, so we handle it as a variable (requiring another lookup). Specifically, things like:

int x = sizeof(x);

first call Sema::isTypeName then call Sema::LookupDecl. If the parser just called a "lookup the decl for this identifier" method, it could annotate it once and avoid the second lookup.

This would work really well in combination with the suggestion by Doug about unifying the "identifier inquiring" methods in the Actions API. Something like "LookupIdentifier" which returns an enum describing it and optionally returning a void* through an out parameter ?

Douglas Gregor wrote:

Hi Argiris,

Hi Chris & Doug,

Chris Lattner wrote:

After my little fun hacking on annotation tokens, I decided to write up some documentation so that I'd remember how they work the next time around:
http://clang.llvm.org/docs/InternalsManual.html#AnnotationToken

Thanks a lot for documenting it!

Second, I think it would be interesting to consider handling the "negative" case (when an identifier is not a type) by turning the identifier into an annotation token as well, representing the variable it resolves to (which could be null if it refers to nothing). Right now, if we see an identifier in the token stream, we don't know if that means it is unresolved or whether it means that it is a variable and backtracking already analyzed but did not change it.

Argiris, do you think it would make sense to make a new annot_variable that contains a (potentially null) pointer to a variable decl? Even without backtracking, this would reduce name lookups in C: right now we try to decide if something is a type (which requires a lookup) then decide its not, so we handle it as a variable (requiring another lookup). Specifically, things like:

int x = sizeof(x);

first call Sema::isTypeName then call Sema::LookupDecl. If the parser just called a "lookup the decl for this identifier" method, it could annotate it once and avoid the second lookup.

This would work really well in combination with the suggestion by Doug about unifying the "identifier inquiring" methods in the Actions API. Something like "LookupIdentifier" which returns an enum describing it and optionally returning a void* through an out parameter ?

Yep, that's what I had in mind. The name could be (at least) a type, an object/function, a type template, a function template, the current class name (which is also a type), or the current class template name (which is also both a type and a type template).

Douglas Gregor wrote:

One thing we have to be careful of is that a given identifier can be looked up in different ways in different contexts, so we have to know when to throw away the results of a previous lookup. For example, and identifier 'x' prior to a '::' is looked up differently that 'x' if it is followed by something else.

AFAIK an identifier followed by a '::' would always be looked up in a different context (using ActOnCXXNestedNameSpecifier) so there wouldn't be a need to change a previous annotation.

Okay.

  - Doug