RFC: add RecoveryExpr to represent invalid constructs in AST

In the presence of broken code, the AST clang generates is often very limited.
Often Sema diagnoses errors and doesn’t create AST nodes. For example, if the arguments to a function call are wrong, we get no CallExpr, and all subexpressions are discarded.

IDE tools like clangd see a lot of broken code, and rely on the resulting AST, so I’d like to improve this situation.

If the user writes the following erroneous code:
a(b, c).d() // error: a only takes one parameter
then no AST is emitted for this expression, and:

  1. clangd’s “go to definition” doesn’t work on a, even though there’s no overloading
  2. “go to definition” also doesn’t work on b and c or their subexpressions, even though these expressions are valid
  3. code completion doesn’t work properly for d(), even though we know what type a() returns. (go to definition doesn’t either).
    There are lots of similar cases: (foo->bar when bar doesn’t exist, etc).

My proposal is to add a fairly generic RecoveryExpr that the AST can emit in these situations. It would capture:

  • the approximate StmtClass we were trying to create (e.g. CallExpr)
  • the subexpressions - (e.g. an overloadexpr for the callee, and the arguments)
  • the type, if we can make a reasonable guess (e.g. all overloads have the same type)

I also considered other approaches, and think they don’t work so well:

  • just generate the e.g. CallExpr etc anyway, and maybe add an “invalid” flag. This means weakening many invariants in many different node types. It’s subtle and error-prone, and adds conceptual and implementation complexity to many nodes that now have to be able to represent various error cases. Even within clang it’s hard to find all the analyses that need to be updated. (http://lists.llvm.org/pipermail/cfe-dev/2019-April/062151.html)
  • extract the extra information we need from Sema, rather than the AST. e.g. Sema should invoke a callback whenever an identifier is resolved to a Decl. This solves go-to-definition pretty well, but isn’t a very flexible approach. The AST is rich, and provides context, can be dumped etc, a Sema callback would be an ad-hoc solution that is less convenient to use and understand.
  • add a hierarchy of new nodes to represent errors (RecoveryCallExpr etc). It seems like a lot of work to model all the different ways code can be invalid, and I don’t think the extra structure would be useful enough to justify it.

I have a very rough prototype of this idea: https://reviews.llvm.org/D61722 (emits RecoveryExpr for failed overload resolution).
It does create a form of recovery that affects diagnostics. But the changes exposed by tests are few and seem mostly good (especially compared to emitting CallExpr).
Some open questions:

  • what’s the type of the expression when we can’t guess? (IntTy in the patch)
  • will there be surprises if RecoveryExpr is emitted in more places and they start to interact?
  • should RecursiveASTVisitor, ASTMatchers etc skip these nodes by default, for compatibility with external code?
  • what are the important use cases that I haven’t thought about?

This would be helpful. Within Coverity, we hacked together our own error recovery mechanism (mostly involving demoting errors to warnings and discarding function bodies and variable initializers with errors), but our mechanism sometimes leads to other down stream errors (I investigated a spurious "enumerator ‘x’ does not exist in instantiation of ‘y’ just this week). The ability to explicitly indicate errors in the AST would be helpful.

Ideally, I’d like to have ErrorDecl, ErrorStmt, ErrorExpr, and ErrorType nodes. But there is no reason they would all have to be introduced at the same time.

Tom.

In the presence of broken code, the AST clang generates is often very limited.
Often Sema diagnoses errors and doesn't create AST nodes. For example, if the arguments to a function call are wrong, we get no CallExpr, and all subexpressions are discarded.
IDE tools like clangd see a lot of broken code, and rely on the resulting AST, so I'd like to improve this situation.

If the user writes the following erroneous code:
  a(b, c).d() // error: a only takes one parameter
then no AST is emitted for this expression, and:
1) clangd's "go to definition" doesn't work on a, even though there's no overloading
2) "go to definition" also doesn't work on b and c or their subexpressions, even though these expressions are valid
3) code completion doesn't work properly for d(), even though we know what type a() returns. (go to definition doesn't either).
There are lots of similar cases: (foo->bar when bar doesn't exist, etc).

My proposal is to add a fairly generic RecoveryExpr that the AST can emit in these situations. It would capture:
- the approximate StmtClass we were trying to create (e.g. CallExpr)
- the subexpressions - (e.g. an overloadexpr for the callee, and the arguments)
- the type, if we can make a reasonable guess (e.g. all overloads have the same type)

I also considered other approaches, and think they don't work so well:
- just generate the e.g. CallExpr etc anyway, and maybe add an "invalid" flag. This means weakening many invariants in many different node types. It's subtle and error-prone, and adds conceptual and implementation complexity to many nodes that now have to be able to represent various error cases. Even within clang it's hard to find all the analyses that need to be updated. ([cfe-dev] Error recovery when Sema finds only one (non-viable) candidate?)
- extract the extra information we need from Sema, rather than the AST. e.g. Sema should invoke a callback whenever an identifier is resolved to a Decl. This solves go-to-definition pretty well, but isn't a very flexible approach. The AST is rich, and provides context, can be dumped etc, a Sema callback would be an ad-hoc solution that is less convenient to use and understand.
- add a hierarchy of new nodes to represent errors (RecoveryCallExpr etc). It seems like a lot of work to model all the different ways code can be invalid, and I don't think the extra structure would be useful enough to justify it.

I have a very rough prototype of this idea: ⚙ D61722 [AST] Add RecoveryExpr; produce it on overload resolution failure and missing member. (emits RecoveryExpr for failed overload resolution).
It does create a form of recovery that affects diagnostics. But the changes exposed by tests are few and seem mostly good (especially compared to emitting CallExpr).
Some open questions:
- what's the type of the expression when we can't guess? (IntTy in the patch)
- will there be surprises if RecoveryExpr is emitted in more places and they start to interact?
- should RecursiveASTVisitor, ASTMatchers etc skip these nodes by default, for compatibility with external code?
- what are the important use cases that I haven't thought about?

I think this will be really useful functionality -- I've wished for
something similar on many occasions.

As for AST matchers, I think there should be an optional way to match
through these recovery nodes or not. I can envision different checks
wanting different behavior -- e.g., a correctness check may want to
ignore RecoveryExpr nodes whereas a refactoring check may actually
want to look through the node to attempt to correct it.

~Aaron

This would be great for tooling, thank you!

  • what’s the type of the expression when we can’t guess? (IntTy in the patch)

I think some kind of synthetic “unknown type”, not convertible to anything, would be helpful here - so it’s guaranteed foo(bar()) won’t pick an unwanted overload if bar() is a RecoveryExpr, and foo() has several overloads. It might also be somewhat useful in other places - i.e. as a placeholder for not yet deduced init-lists (it uses VoidTy now IIRC).

It would also be great if the overload set could be captured in the RecoveryExpr (or does it already? I haven’t look too carefully, sorry) - so the tools can offset navigation with several targets on a failed overloaded function call without resorting to dealing with compiler diagnostic messages - and it might be possible to do features like “parameter info” on the AST, without reparsing? Ex: foo(/* hint possible parameters here from all available overloads*/); - now it requires reparsing in code-completion mode, if I’m not mistaken.

Thanks all for the responses. I’m glad there’s interest in using such a thing. I’ll do some more work on this, and meanwhile seek more feedback on whether the design idea is likely to work. (Maybe from Richard? :-))

Tom wrote:

Ideally, I’d like to have ErrorDecl, ErrorStmt, ErrorExpr, and ErrorType nodes. But there is no reason they would all have to be introduced at the same time.
Absolutely. I think Expr is the most pressing and so a good candidate to start with, due to the high level of nesting and lack of recovery today.But there’s likely to be value in other node types too.

Aaron wrote:

As for AST matchers, I think there should be an optional way to match through these recovery nodes or not.

Agreed. As well as compatibility, the examples you mention probably want different behavior forever.
AFAIK there’s no “match options” type thing in ASTMatchers to hang this on, so that’s an interesting API question.

Dmitry wrote:

  • what’s the type of the expression when we can’t guess? (IntTy in the patch)

I think some kind of synthetic “unknown type”, not convertible to anything, would be helpful here - so it’s guaranteed foo(bar()) won’t pick an unwanted overload if bar() is a RecoveryExpr, and foo() has several overloads. It might also be somewhat useful in other places - i.e. as a placeholder for not yet deduced init-lists (it uses VoidTy now IIRC).

Yes, a new builtin type would give most control over recovery. I wonder if convertible to any type is better - that way we wouldn’t get diagnostics saying it has the wrong type, it would just be accepted (multiple candidate overloads would be flagged as ambiguous though, unless this type had special handling).

It would also be great if the overload set could be captured in the RecoveryExpr (or does it already? I haven’t look too carefully, sorry)

Yes. In the patch the the RecoveryExpr’s children are the callee and the args, and in this case the callee is an OverloadExpr. The candidate decls can be accessed through this (though not all OverloadCandidateSet info, like which candidates are viable and why).

This could be used to get parameter info/signature help without reparse, though it’s going to mean “parsing” the unstructured RecoveryExpr a bit. A more specialized RecoveryCallExpr would be even better for this case, but the simpler mechanism is probably good enough in practice.

Hi Sam,

This seems really useful indeed.

I am just wondering about how this would play with speculative actions.

I am more familiar with speculative stuff in Parser (peculativeEvaluationRAII, TentativeParsingAction, Parser::TryParseLambdaIntroducer) but it seems that sometimes we also use the same approach in Sema:

  • SFINAETrap (Which IIUC isn’t used just for SFINAE - for example in RebuildForRangeWithDereference.)
  • ContextRAII
  • TypoExpr

Any thoughts on that?

Cheers,

Jan

Thanks, Jan. I’m not familiar with all of the relevant parts of Parser/Sema, so these are useful pointers.

Hi Sam,

This seems really useful indeed.

I am just wondering about how this would play with speculative actions.

I am more familiar with speculative stuff in Parser (peculativeEvaluationRAII, TentativeParsingAction, Parser::TryParseLambdaIntroducer

I’m not sure if we’d want to use this to represent errors at the parser level, I don’t know enough yet about the parser to really have an opinion.

but it seems that sometimes we also use the same approach in Sema:

  • SFINAETrap (Which IIUC isn’t used just for SFINAE - for example in RebuildForRangeWithDereference.)

SFINAETrap is diagnostic-based, so we need to ensure when emitting RecoveryExpr:

  • we must also emit a diagnostic that causes substitution failure
  • we must not cause a diagnostic that is emitted even on substitution failure
    Other than that, I expect this to interact OK: when substitution fails, the subtrees with RecoveryExprs will be discarded.
    It may hurt the performance of substitution failure though, as we create more nodes rather than just cascading the failure immediately.
    (RebuildForRangeWithDereference is indeed not SFINAE but should work the same way)
  • ContextRAII

I’m not familiar with this class or how it’s used for speculative actions - can you elaborate?

  • TypoExpr

These are somewhat complementary in that TypoExpr is used when an identifier can’t be resolved, and RecoveryExpr is useful to preserve subexpressions that can be resolved.

That said there are cases where these intersect like invalid_function(valid_parameter) - it would be nice to generate something sensible here: a RecoveryExpr(CallExpr) where invalid_function is dropped/represented by a TypoExpr/represented by a RecoveryExpr(DeclRefExpr)…
I’m not sure exactly how this should work mechanically, and also don’t think it needs to work immediately.

Hi Sam,

Thanks for working on this! Clang’s tooling will definitely benefit from this work.

Thanks, Jan. I’m not familiar with all of the relevant parts of Parser/Sema, so these are useful pointers.

Hi Sam,

This seems really useful indeed.

I am just wondering about how this would play with speculative actions.

I am more familiar with speculative stuff in Parser (peculativeEvaluationRAII, TentativeParsingAction, Parser::TryParseLambdaIntroducer

I’m not sure if we’d want to use this to represent errors at the parser level, I don’t know enough yet about the parser to really have an opinion.

but it seems that sometimes we also use the same approach in Sema:

  • SFINAETrap (Which IIUC isn’t used just for SFINAE - for example in RebuildForRangeWithDereference.)

SFINAETrap is diagnostic-based, so we need to ensure when emitting RecoveryExpr:

  • we must also emit a diagnostic that causes substitution failure
  • we must not cause a diagnostic that is emitted even on substitution failure
    Other than that, I expect this to interact OK: when substitution fails, the subtrees with RecoveryExprs will be discarded.
    It may hurt the performance of substitution failure though, as we create more nodes rather than just cascading the failure immediately.
    (RebuildForRangeWithDereference is indeed not SFINAE but should work the same way)
  • ContextRAII

I’m not familiar with this class or how it’s used for speculative actions - can you elaborate?

  • TypoExpr

These are somewhat complementary in that TypoExpr is used when an identifier can’t be resolved, and RecoveryExpr is useful to preserve subexpressions that can be resolved.

That said there are cases where these intersect like invalid_function(valid_parameter) - it would be nice to generate something sensible here: a RecoveryExpr(CallExpr) where invalid_function is dropped/represented by a TypoExpr/represented by a RecoveryExpr(DeclRefExpr)…
I’m not sure exactly how this should work mechanically, and also don’t think it needs to work immediately.

Any thoughts on that?

Cheers,

Jan

This would be helpful. Within Coverity, we hacked together our own error recovery mechanism (mostly involving demoting errors to warnings and discarding function bodies and variable initializers with errors), but our mechanism sometimes leads to other down stream errors (I investigated a spurious "enumerator ‘x’ does not exist in instantiation of ‘y’ just this week). The ability to explicitly indicate errors in the AST would be helpful.

Ideally, I’d like to have ErrorDecl, ErrorStmt, ErrorExpr, and ErrorType nodes. But there is no reason they would all have to be introduced at the same time.

Tom.

In the presence of broken code, the AST clang generates is often very limited.
Often Sema diagnoses errors and doesn’t create AST nodes. For example, if the arguments to a function call are wrong, we get no CallExpr, and all subexpressions are discarded.

IDE tools like clangd see a lot of broken code, and rely on the resulting AST, so I’d like to improve this situation.

If the user writes the following erroneous code:
a(b, c).d() // error: a only takes one parameter
then no AST is emitted for this expression, and:

  1. clangd’s “go to definition” doesn’t work on a, even though there’s no overloading
  2. “go to definition” also doesn’t work on b and c or their subexpressions, even though these expressions are valid
  3. code completion doesn’t work properly for d(), even though we know what type a() returns. (go to definition doesn’t either).
    There are lots of similar cases: (foo->bar when bar doesn’t exist, etc).

My proposal is to add a fairly generic RecoveryExpr that the AST can emit in these situations. It would capture:

  • the approximate StmtClass we were trying to create (e.g. CallExpr)
  • the subexpressions - (e.g. an overloadexpr for the callee, and the arguments)
  • the type, if we can make a reasonable guess (e.g. all overloads have the same type)

I also considered other approaches, and think they don’t work so well:

  • just generate the e.g. CallExpr etc anyway, and maybe add an “invalid” flag. This means weakening many invariants in many different node types. It’s subtle and error-prone, and adds conceptual and implementation complexity to many nodes that now have to be able to represent various error cases. Even within clang it’s hard to find all the analyses that need to be updated. (http://lists.llvm.org/pipermail/cfe-dev/2019-April/062151.html)
  • extract the extra information we need from Sema, rather than the AST. e.g. Sema should invoke a callback whenever an identifier is resolved to a Decl. This solves go-to-definition pretty well, but isn’t a very flexible approach. The AST is rich, and provides context, can be dumped etc, a Sema callback would be an ad-hoc solution that is less convenient to use and understand.
  • add a hierarchy of new nodes to represent errors (RecoveryCallExpr etc). It seems like a lot of work to model all the different ways code can be invalid, and I don’t think the extra structure would be useful enough to justify it.

I have a very rough prototype of this idea: https://reviews.llvm.org/D61722 (emits RecoveryExpr for failed overload resolution).
It does create a form of recovery that affects diagnostics. But the changes exposed by tests are few and seem mostly good (especially compared to emitting CallExpr).
Some open questions:

  • what’s the type of the expression when we can’t guess? (IntTy in the patch)
  • will there be surprises if RecoveryExpr is emitted in more places and they start to interact?
  • should RecursiveASTVisitor, ASTMatchers etc skip these nodes by default, for compatibility with external code?

I think being conservative when it comes to the RecursiveASTVisitor is probably correct, but at the same time it would be nice if it was traversed by default. That way the existing tools can visit the children of the expression that the compiler thinks are correct so they will have access to a more complete AST. The typo expressions have a bit of a prior art there I suppose, since they actually don’t correspond to the real source, but to the source that the compiler thinks should’ve been written.

Note that I've already written several patches towards making that possible, such as

https://reviews.llvm.org/D61837

waiting on review :slight_smile:

Stephen.

I've already been working on that in order to ignore 'invisible nodes'

  https://reviews.llvm.org/D61837

See my talk:

  2019 EuroLLVM Developers’ Meeting: S. Kelly “The Future of AST Matcher-based Refactoring” - YouTube

I haven't had success with getting the simple patches reviewed yet, but I've just added you as a reviewer.

Thanks,

Stephen.