Should we build semantically invalid nodes?

Doug Gregor wrote:

Hmm... in the case of an error when type-checking the casts, we should
not actually build a CXXCastExpr node. Rather, Sema::ActOnCXXCasts
should return "true" so that the parser knows that there was an error.

I suggest that both CheckConstCast and CheckReinterpretCast return a
bool that is true when there is an error and false otherwise (this is
the scheme used elsewhere in Clang). Then, ActOnCXXCasts will never
build an ill-formed node in the AST (which is a good thing).

Is there any harm in building semantically invalid nodes ?

Yes, there is. It means that consumers of these nodes have to be able
to cope with potentially bad inputs. Then, each semantic-checking
routine needs to cope with (1) all correct inputs, (2) all incorrect
inputs that could come from a program that was well-formed up until
this point, and (3) all broken inputs that come from previously
ill-formed code. The standard always tells us what (1) and (2) can be
(sometimes directly, sometimes through exclusion), but (3) is an
almost unbounded set of bogus input depending on the twisted
imagination of our users and our ability to mangle bad inputs into
other bad inputs.

It will not affect clients that care about semantics since they will not
analyse the AST if there are errors (there could be invalid decls).
The clients that only care about the textual representation of the program,
and not about the semantics, will get seriously impacted by not building the
nodes, e.g. a refactoring client will not be able to pickup the use of a
variable because the expression that references it is not added to the AST.

If a client doesn't care about semantics, it can either (1) let Sema
do its work and then ignore the resulting types, or (2) provide a
different Action that doesn't do the type-checking.

To mirror the decls marked as "invalid" we could have exprs marked as
"invalid" too.

GCC does this, where essentially any node in the AST can be
"error_mark_node" if there was an error. A *lot* of code in GCC is
dedicated to checking for and propagating error_mark_node; it probably
ends up costing them performance, but the real cost is that they end
up fixing a lot of tiny "ice-on-invalid" bugs where someone forgot to
check for some outlandish ill-formed code that results in a weird AST
node or an error_mark_node where it isn't expected. It's always seemed
like a losing battle to me, and the code is littered with
error_mark_node checks.

  - Doug

Doug Gregor wrote:

Yes, there is. It means that consumers of these nodes have to be able
to cope with potentially bad inputs. Then, each semantic-checking
routine needs to cope with (1) all correct inputs, (2) all incorrect
inputs that could come from a program that was well-formed up until
this point, and (3) all broken inputs that come from previously
ill-formed code. The standard always tells us what (1) and (2) can be
(sometimes directly, sometimes through exclusion), but (3) is an
almost unbounded set of bogus input depending on the twisted
imagination of our users and our ability to mangle bad inputs into
other bad inputs.
  
A semantically invalid expression will produce a diagnostic. At this point the program is ill-formed, the worst that can happen is that semantic checks against this expression will produce more diagnostics.
This is exactly similar to the effect of invalid decls.

It will not affect clients that care about semantics since they will not
analyse the AST if there are errors (there could be invalid decls).
The clients that only care about the textual representation of the program,
and not about the semantics, will get seriously impacted by not building the
nodes, e.g. a refactoring client will not be able to pickup the use of a
variable because the expression that references it is not added to the AST.
    
If a client doesn't care about semantics, it can either (1) let Sema
do its work and then ignore the resulting types, or (2) provide a
different Action that doesn't do the type-checking.
  
Sema is *huge* and the alternative Action option is not realistic (another Action that deals with templates ? ;), this is what almost all of clients will use.
There will be clients that care only about the syntax tree, Sema is fully capable of servicing them too.

To mirror the decls marked as "invalid" we could have exprs marked as
"invalid" too.
    
GCC does this, where essentially any node in the AST can be
"error_mark_node" if there was an error. A *lot* of code in GCC is
dedicated to checking for and propagating error_mark_node; it probably
ends up costing them performance, but the real cost is that they end
up fixing a lot of tiny "ice-on-invalid" bugs where someone forgot to
check for some outlandish ill-formed code that results in a weird AST
node or an error_mark_node where it isn't expected. It's always seemed
like a losing battle to me, and the code is littered with
error_mark_node checks.
  
I think the correct Clang analogy here is DeclResult/ExprResult which gets propagated around and must be checked before use.

I don't see what is so bad about separating syntax from semantics.
-An expression node is produced for a syntactic construct.
-This expression node already conveys useful information about the program structure.
-Semantic checks are done to it and diagnostics are emitted.
-Now why should we discard the syntactic information ? This is a concrete expression with an actual type (even if it got it's type illegally according to the language rules), so it won't lead to crashes, just to possible more diagnostics.

AFAIK, Clang produces Decls even for semantically illegal declarations (they are even added to scope) and there doesn't seem to be issues attributed to them.

-Argiris

From a purely practical standpoint, and possibly a minor issue, disallowing the construction of AST elements for invalid code means that for EXTENSION-diagnostics I also have to check for -pedantic-errors myself and prevent AST construction in this case.

I don't know how widespread use of those is. There's one in CheckReinterpretCast.

Sebastian

Sebastian Redl wrote:

From a purely practical standpoint, and possibly a minor issue, disallowing the construction of AST elements for invalid code means that for EXTENSION-diagnostics I also have to check for -pedantic-errors myself and prevent AST construction in this case.

I don't know how widespread use of those is. There's one in CheckReinterpretCast.

This is an excellent example for that the syntax tree and the semantic checks are somewhat orthogonal. The fact that a syntactic construct is semantically valid or invalid based on a compiler flag, does not affect the ability of Sema to handle it later on.

-Argiris

Doug Gregor wrote:

Yes, there is. It means that consumers of these nodes have to be able
to cope with potentially bad inputs. Then, each semantic-checking
routine needs to cope with (1) all correct inputs, (2) all incorrect
inputs that could come from a program that was well-formed up until
this point, and (3) all broken inputs that come from previously
ill-formed code. The standard always tells us what (1) and (2) can be
(sometimes directly, sometimes through exclusion), but (3) is an
almost unbounded set of bogus input depending on the twisted
imagination of our users and our ability to mangle bad inputs into
other bad inputs.

A semantically invalid expression will produce a diagnostic. At this point
the program is ill-formed, the worst that can happen is that semantic checks
against this expression will produce more diagnostics.
This is exactly similar to the effect of invalid decls.

No, the worst that can happen is that the compiler will crash, because
the semantic analysis has been given something that's completely
broken that it never should have to deal with.

Note how we handle invalid declarations: we mark them as invalid, and
then if we try to reference them in an expression, ActOnIdentifierExpr
refuses to build a DeclRefExpr. That keeps those invalid declarations
for causing malformed expression nodes from being constructed.

If a client doesn't care about semantics, it can either (1) let Sema
do its work and then ignore the resulting types, or (2) provide a
different Action that doesn't do the type-checking.

Sema is *huge* and the alternative Action option is not realistic (another
Action that deals with templates ? ;), this is what almost all of clients
will use.
There will be clients that care only about the syntax tree, Sema is fully
capable of servicing them too.

Okay, that's my option (1), then. We don't really want to cater to
clients trying to work on ill-formed code, do we?

GCC does this, where essentially any node in the AST can be
"error_mark_node" if there was an error. A *lot* of code in GCC is
dedicated to checking for and propagating error_mark_node; it probably
ends up costing them performance, but the real cost is that they end
up fixing a lot of tiny "ice-on-invalid" bugs where someone forgot to
check for some outlandish ill-formed code that results in a weird AST
node or an error_mark_node where it isn't expected. It's always seemed
like a losing battle to me, and the code is littered with
error_mark_node checks.

I think the correct Clang analogy here is DeclResult/ExprResult which gets
propagated around and must be checked before use.

That's not the right analogy. DeclResult/ExprResult are only used to
communicate between the semantic analysis and the parser, and they
never get into the AST. If we have the notion of "invalid" expressions
in our ASTs, such that a Sema function can receive an invalid input,
then we'll be in the same boat that GCC developers are with
error_mark_node. I've been there, and it's a very bad place to be.

I don't see what is so bad about separating syntax from semantics.
-An expression node is produced for a syntactic construct.
-This expression node already conveys useful information about the program
structure.
-Semantic checks are done to it and diagnostics are emitted.
-Now why should we discard the syntactic information ? This is a concrete
expression with an actual type (even if it got it's type illegally according
to the language rules), so it won't lead to crashes, just to possible more
diagnostics.

Oh, it will lead to crashes, because it opens up the semantic analysis
to inputs that can never make sense. An expression of function type
that isn't a reference to a function declaration? Nonsense, but it
could happen if we allow reinterpret_cast<int(void)>(blah) to get its
own AST node (even if it is marked as invalid). We can pretend that
we're good enough to build a compiler that's robust against all of
these bogus inputs, but it's just not going to happen. It's far better
to only build semantically well-formed AST nodes.

AFAIK, Clang produces Decls even for semantically illegal declarations (they
are even added to scope) and there doesn't seem to be issues attributed to
them.

This is necessary for some level of error recovery, but it's not
something we should propagate through the compiler.

  - Doug
  - Doug

If it's an extension, then naturally all of the other syntactic and
semantic routines in the compiler have to deal with this kind of tree.
More importantly, if it's an extension that we support then the
resulting AST node is, by definition, semantically correct. We might
complain about the use of the extension, but it's not something that's
outside what Clang can handle.

  - Doug

Doug Gregor wrote:

Sema is *huge* and the alternative Action option is not realistic (another
Action that deals with templates ? ;), this is what almost all of clients
will use.
There will be clients that care only about the syntax tree, Sema is fully
capable of servicing them too.
    
Okay, that's my option (1), then. We don't really want to cater to
clients trying to work on ill-formed code, do we?
  
I think you underestimate the importance of being able to get a syntax tree as complete as possible even on ill-formed code. It allows Clang to be used effectively for a variety of purposes that we may not even currently imagine.
e.g. IDEs have to assist the programmer as they write their code, so in the vast majority of the time they have to work on ill-formed code.

I don't see what is so bad about separating syntax from semantics.
-An expression node is produced for a syntactic construct.
-This expression node already conveys useful information about the program
structure.
-Semantic checks are done to it and diagnostics are emitted.
-Now why should we discard the syntactic information ? This is a concrete
expression with an actual type (even if it got it's type illegally according
to the language rules), so it won't lead to crashes, just to possible more
diagnostics.
    
Oh, it will lead to crashes, because it opens up the semantic analysis
to inputs that can never make sense. An expression of function type
that isn't a reference to a function declaration? Nonsense, but it
could happen if we allow reinterpret_cast<int(void)>(blah) to get its
own AST node (even if it is marked as invalid). We can pretend that
we're good enough to build a compiler that's robust against all of
these bogus inputs, but it's just not going to happen. It's far better
to only build semantically well-formed AST nodes.
  
Hmm, we don't have any concrete examples, but I'm not sure that it's so bad. The vast majority of semantic checks are using the types of the expressions.
And to clarify, I definitely am not advocating that Sema starts to check an "invalid expr flag"; if the expression is so bad that it cannot be handled normally, it should indeed be disallowed from the start.
(I mentioned the "invalid expr flag" from the start as a nice to have flag for interested consumers, not for Sema use).

There is certainly some middle ground to tread here, not all semantic checks will wreak havoc if they allow the expression; I think it's a worthwhile goal to strive to have an AST as complete as possible, it will be better in the long run for interesting uses of Clang beyond compiling.

-Argiris

Doug Gregor wrote:

Sema is *huge* and the alternative Action option is not realistic
(another
Action that deals with templates ? ;), this is what almost all of clients
will use.
There will be clients that care only about the syntax tree, Sema is fully
capable of servicing them too.

Okay, that's my option (1), then. We don't really want to cater to
clients trying to work on ill-formed code, do we?

I think you underestimate the importance of being able to get a syntax tree
as complete as possible even on ill-formed code.

Yep, that's likely. I think you underestimate the amount of pain that
having invalid AST nodes will cause.

There is certainly some middle ground to tread here, not all semantic checks
will wreak havoc if they allow the expression; I think it's a worthwhile
goal to strive to have an AST as complete as possible, it will be better in
the long run for interesting uses of Clang beyond compiling.

If the AST node that is built is something that is valid for further
semantic processing (even if it resulted in diagnostics), that's fine.
So, if we're looking at Sebastian's casting patch, for example, I'm
fine with generating an AST node for a reinterpret_cast that casts
away constness, because the resulting expression is still usable for
type-checking. Anything that requires a "this is an invalid
expression" flag is, in my opinion, asking for trouble.

  - Doug

Doug Gregor wrote:

I think you underestimate the importance of being able to get a syntax tree
as complete as possible even on ill-formed code.
    
Yep, that's likely. I think you underestimate the amount of pain that
having invalid AST nodes will cause.
  
It's a sacrifice for the Greater Good(tm) :wink:

There is certainly some middle ground to tread here, not all semantic checks
will wreak havoc if they allow the expression; I think it's a worthwhile
goal to strive to have an AST as complete as possible, it will be better in
the long run for interesting uses of Clang beyond compiling.
    
If the AST node that is built is something that is valid for further
semantic processing (even if it resulted in diagnostics), that's fine.
So, if we're looking at Sebastian's casting patch, for example, I'm
fine with generating an AST node for a reinterpret_cast that casts
away constness, because the resulting expression is still usable for
type-checking. Anything that requires a "this is an invalid
expression" flag is, in my opinion, asking for trouble.
  
Ok, sounds fine!

-Argiris

There is another option as well: Sema, on detecting an error, can force the expression to be something that is not erroneous and return that instead. For example, if the user wrote a cast from an int to a function type, it would be fine for sema to report the error and "recover" by returning a cast to function pointer.

Note that this should be done only in situations where it is really "obvious" what the user intended. The risk of doing this sort of thing is that if it is *not* what the user intended, that you could get a cascade of errors complaining about something the user didn't write. This (IMO) is far more damaging to the user experience than missing a second error that could have been reported in a subexpression.

-Chris

To get more specific about Sebastian's patch, I tried it out and it works quite well, Clang is pretty robust:

typedef int fn(int,int);
int f() {
  int a;
  reinterpret_cast<fn>(a)(1,2,3);
  int (*fx)(int,int) = reinterpret_cast<fn>(a);
}

-Argiris

Argiris Kirtzidis wrote:

To get more specific about Sebastian's patch, I tried it out and it works quite well, Clang is pretty robust:

typedef int fn(int,int);
int f() {
int a;
reinterpret_cast<fn>(a)(1,2,3);
int (*fx)(int,int) = reinterpret_cast<fn>(a);
}

I mean, Clang builds a nice AST tree of everything for the above snippet.

Here's an alternative idea:
How about we pass a flag at Sema to indicate that Sema should just build the AST without semantic checks ?

The meaning of this flag is not that it guarantees that *absolutely no* semantic checks are done, just that the focus is less on semantics and more on getting a complete AST.
The Action methods could be modified gradually to check this flag, eg. on Sebastian's patch, if this flag is on, Sema::ActOnCXXCasts will skip the CheckXXXCast call and go to creating the CXXCastExpr directly.

This addresses Doug's concerns about allowing invalid nodes while also enabling clients that only care about a full AST.

Any thoughts ?

-Argiris

Hey Argiris,

Since clang supports multiple "Action" modules, why not have an ASTBuilder class?

My intuition says simply passing in a flag to Sema won't be rich enough. Over time, you would probably need several flags to get Sema to do what you actually want. It just doesn't feel right to me.

On the other hand, AST's produced by an ASTBuilder class could be segregated and managed differently. It gives you much more flexibility to meet the needs of a more "fuzzy" environment (like an IDE), including replacing some of the AST nodes (if the one's were using in Sema aren't ideal). For example, the ASTBuilder class might choose to have "real" (i.e. non-unique types) types (something we've discussed before). Having the ability to get the actual source location for type specifiers seems like a "must have" feature for non-compiler clients.

I believe my suggestion is fairly easy to implement for Stmts/Exprs. It is more difficult to implement for Decls/Types (since Sema::ActOnDeclarator() does some "heavy lifting" to transform a Declarator into a Decl/Type). If this were the only obstacle, we could refactor the code into a Declarator library that both Action modules could share. Since Declarators in C/C++ are very complex, adding a layer might actually improve the maintenance of the code.

I think your desire for more fuzzy AST's is spot on. My only concern is identifying clients for them and choosing an implementation strategy that doesn't disrupt our most central/complex Action module (i.e. Sema). We knew this day would come, which is why the Action interface doesn't force you to use a particular data structure (or any at all).

Thoughts?

snaroff

Hey Steve, breaking up Sema is a ridiculous awesome idea!

Here's a thought I'd like to throw around..
Could there be something like "composable Actions" ? The ASTBuilder would build the AST while Sema would do semantic checks and reject invalid nodes.
This will cleanly separate the semantic checks from the AST building and, as you said, will make the code more maintainable.

-Argiris

steve naroff wrote:

Argiris Kirtzidis wrote:

Hey Steve, breaking up Sema is a ridiculous awesome idea!

Here's a thought I'd like to throw around..
Could there be something like "composable Actions" ? The ASTBuilder would build the AST while Sema would do semantic checks and reject invalid nodes.
This will cleanly separate the semantic checks from the AST building and, as you said, will make the code more maintainable.
  

Unless the complexity of creating and maintaining that separation exceeds that of having the merged code.

I'm all for the separation, but we need a really good plan before we can actually do it.

Sebastian

Hey Steve, breaking up Sema is a ridiculous awesome idea!

Sounds like I struck a chord:-)

Here's a thought I'd like to throw around..
Could there be something like "composable Actions" ? The ASTBuilder would build the AST while Sema would do semantic checks and reject invalid nodes.
This will cleanly separate the semantic checks from the AST building and, as you said, will make the code more maintainable.

Composable Actions is an idea I've always wanted to experiment with (but haven't had a compelling reason to do so). If what you are thinking of sounds compelling, then it might be a good place to experiment.

At this point, it's unclear if the semantics checks can always be separated from the AST.

Are you certain the current AST's are appropriate for what you envision doing?

snaroff

Argiris Kirtzidis wrote:

Hey Steve, breaking up Sema is a ridiculous awesome idea!

Here's a thought I'd like to throw around..
Could there be something like "composable Actions" ? The ASTBuilder would build the AST while Sema would do semantic checks and reject invalid nodes.
This will cleanly separate the semantic checks from the AST building and, as you said, will make the code more maintainable.

Unless the complexity of creating and maintaining that separation exceeds that of having the merged code.

I'm all for the separation, but we need a really good plan before we can actually do it.

I totally agree. I think this would be a fairly disruptive change. We would need a fairly compelling reason to tackle it.

At this point, it's just "food for thought". It is true that Sema has grown considerably and it would be nice to benefit from some of its functionality with having to take it all.

snaroff

Argiris Kirtzidis wrote:

Hey Steve, breaking up Sema is a ridiculous awesome idea!

Here's a thought I'd like to throw around..
Could there be something like "composable Actions" ? The ASTBuilder
would build the AST while Sema would do semantic checks and reject
invalid nodes.
This will cleanly separate the semantic checks from the AST
building and, as you said, will make the code more maintainable.

Unless the complexity of creating and maintaining that separation
exceeds that of having the merged code.

I totally agree. I think this would be a fairly disruptive change. We
would need a fairly compelling reason to tackle it.

I agree.

At this point, it's just "food for thought". It is true that Sema has
grown considerably and it would be nice to benefit from some of its
functionality with having to take it all.

I think this something of a dangerous path. Semantic analysis is complex and intertwined enough (even in C, but particularly in C++) that adding abstractions should only be done really carefully.

It seems to me that it comes down to the clients that are the ultimate consumers of this information. Since Sema is perfectly fine for correct code, lets ignore all clients that require well-formed code (e.g. codegen, refactoring, etc) and those that aren't harmed by requiring it (static analysis). These clients are incidentally the ones that are doing "deep analysis" of the AST and really benefit from having a lot of invariants in the AST that absolutely must be true for sanity. Lack of these invariants would require sprinkling their (incredibly non-trivial) code with lots of special cases and hacks that I'd really like to avoid.

Another set of clients are things like "indexers" that want to find all the function definitions and global variables so you can "click on a function and jump to its definition". For this sort of use, a simple actions module plugging into the parser is just fine.

What sort of clients would benefit substantially from a broken and partially formed AST? If we really wanted this sort of thing, it seems like it would be cleanest to do what Steve said: define a new actions module that just builds an AST (which can even use the same or an extended set of nodes as Sema) but doesn't do any real checks, doesn't assign types, etc. At this point, you have more parse tree than an AST. I could imagine that something like this would be useful, but can't think of any specific clients.

To be clear, I want to separate out two notions from this. First, we don't need anything like this to get loc info for types. That is a straight-forward extension over what we have, and making sema (optionally) do it would be easy and non-invasive. Second, this whole discussion started with a discussion of error recovery. While related to the above, I still really really think that *Sema* shouldn't return invalid nodes and should only "correct" them in really obvious cases. Sema is extremely complex, and having it not be able to depend on the invariants we have on AST nodes would be very bad, as Doug pointed out.

-Chris

Great points. As you say, it comes down to the ultimate consumers of the information.

gain some experience with alternate representations. Once we have such experience, we can decide if it makes sense to do fold into Sema in some fashion.

I agree that getting loc info is a straight-forward extension to what we have. I didn't mean to suggest otherwise.

snaroff