Clang parser diagnostics

Hello everyone.

First, let me say I'm very impressed with both the clang and LLVM
projects. The quality of the clang source code is so good that I've
learned a lot about how compilers work just from reading it.

In fact, I am writing a compiler for a different language using clang
as a sort of "design guide". So far I've gotten to the parser, and
I've noticed something interesting about the way clang generates parse
error diagnostics.

If you type something like:

"x = 3 * * 4"

it will point to the second "*" and say "expected expression", because
a node for the operator precedence parser was expected. However, this
is not really in keeping with the design philosophy of very expressive
error messages.

If I understand it right, the problem is this: you want to be sure you
don't generate multiple errors that are really all the same error.
Thus when ParseCastExpression (or some other very basic rule) fails,
you want to generate the one and only error, and return an invalid
OwningExprResult which will unwind every single expression parsing
production. As they unwind, each production remains silent about the
fact that it has failed to parse, because obviously it has: something
it directly depends on is invalid.
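
To make the unwinding concrete, here is a minimal sketch of the pattern as I understand it. It is not clang's code: ExprResult, TinyParser, and the parse* functions are hypothetical names for a toy grammar with only integers, '*' and '+'. The leaf rule is the only one that reports anything, and every caller just passes the invalid result upward.

```cpp
#include <cassert>
#include <cctype>
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for an owning expression result:
// a computed value plus an "invalid" flag.
struct ExprResult {
    bool invalid;
    long value;
};

struct TinyParser {
    std::string src;
    std::size_t pos = 0;
    std::vector<std::string> diags;  // every diagnostic reported so far

    explicit TinyParser(std::string s) : src(std::move(s)) {}

    void skipSpaces() {
        while (pos < src.size() && src[pos] == ' ') ++pos;
    }

    bool atDigit() const {
        return pos < src.size() &&
               std::isdigit(static_cast<unsigned char>(src[pos]));
    }

    // Leaf rule: the only place that reports "expected expression".
    ExprResult parsePrimary() {
        skipSpaces();
        if (!atDigit()) {
            diags.push_back("col " + std::to_string(pos + 1) +
                            ": expected expression");
            return {true, 0};
        }
        long v = 0;
        while (atDigit()) v = v * 10 + (src[pos++] - '0');
        return {false, v};
    }

    // term := primary ('*' primary)*
    ExprResult parseTerm() {
        ExprResult lhs = parsePrimary();
        while (!lhs.invalid) {
            skipSpaces();
            if (pos >= src.size() || src[pos] != '*') break;
            ++pos;
            ExprResult rhs = parsePrimary();
            if (rhs.invalid) {
                // Stay silent: an invalid operand means the leaf already reported.
                assert(!diags.empty());
                return rhs;
            }
            lhs.value *= rhs.value;
        }
        return lhs;
    }

    // sum := term ('+' term)*
    ExprResult parseSum() {
        ExprResult lhs = parseTerm();
        while (!lhs.invalid) {
            skipSpaces();
            if (pos >= src.size() || src[pos] != '+') break;
            ++pos;
            ExprResult rhs = parseTerm();
            if (rhs.invalid) return rhs;  // unwind without a second error
            lhs.value += rhs.value;
        }
        return lhs;
    }
};
```

Calling parseSum() on "3 * * 4" with this sketch returns an invalid result and leaves exactly one entry in diags, pointing at the second '*'.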

However, the caller of ParseCastExpression is aware of more useful
information about why the expression was expected, so it could give a
more useful message like "a binary operator should not follow another
binary operator". For this though, every user of ParseCastExpression
would need its own diagnostic reporting code which I'd imagine would get
sloppy. From what I understand, this is why you introduced "notes",
which are a very nice idea even though they don't seem to be present
in this case. However, that made me think of this situation:

x = 3 * (a really complicated and malformed parenthetical expression)

as you exit from somewhere in there, you will get a wall of "note
spam" and probably most of it is not that necessary. The user kind of
gets the point about what went wrong after the first one or two notes,
and then the rest would actually be more confusing to show (like those
"instantiated from here..." template errors). I can think of a whole
bunch of ways to handle this, but I don't really like any of them. So
I thought I'd ask what (if anything) you guys plan to do in the
future, since I usually like the ideas in clang. Perhaps tons of
notes should always be generated, and it is the responsibility of the
user of the diagnostic client to wade through them? I don't want to
pick something, let my parser get much bigger with diagnostics
everywhere, then change the design later if someone has already got a
better idea.

Also, I am using C++ because of LLVM, so I am looking forward to using
clang as a compiler. But the impressive IDEs that clang technology
will enable are probably what I'd wait for before leaving the comfort
of Microsoft's Visual Studio because I really dislike the unix-style
build system. I assume that Xcode will be the first to really use
clang like this because of the Apple relationship. The status of C++
support is easy to follow on the commits list but I was wondering:
what's the status of the IDE? Is there any plan at this point or is it
still too far in the future?

Thanks,
Ken Camann

Ken Camann wrote:

If you type something like:

"x = 3 * * 4"

it will point to the second "*" and say "expected expression", because
a node for the operator precedence parser was expected. However, this
is not really in keeping with the design philosophy of very expressive
error messages.
  

I personally think the error message is actually pretty good. It could
be better, but what you describe below doesn't feel like an improvement
to me.

If I understand it right, the problem is this: you want to be sure you
don't generate multiple errors that are really all the same error.
  

Yes, this is very important. You're a Visual Studio user - you know what
happens when you type
itn variable;
You usually get three errors from that simple typo.
The underlying principle of this rule is: don't clutter the diagnostics
output. Aside from "generate a single error message per actual error"
there are some more rules that can be derived from this:
- Keep error messages as short as possible without losing expressiveness.
- Only generate information that is genuinely useful to the user.

However, the caller of ParseCastExpression is aware of more useful
information about why the expression was expected, so it could give a
more useful message like "a binary operator should not follow another
binary operator".

I believe we try to generate positive messages whenever possible, i.e.
"expression expected" instead of "don't write an operator here". They
are, in general, more informative to the user. (That's why I'm quite
happy with the message here already.) An improved message would be,
"right operand expression expected". But since we have caret
diagnostics, that feels almost superfluous.

From what I understand, this is why you introduced "notes",
  

Not really. Notes are for pointing out related information that is
somewhere else in the source code - the declaration of the function you
just called with incorrect arguments, for example.

which are a very nice idea even though they don't seem to be present
in this case. However, that made me think of this situation:

x = 3 * (a really complicated and malformed parenthetical expression)

as you exit from somewhere in there, you will get a wall of "note
spam" and probably most of it is not that necessary. The user kind of
gets the point about what went wrong after the first one or two notes,
and then the rest would actually be more confusing to show (like those
"instantiated from here..." template errors). I can think of a whole
bunch of ways to handle this, but I don't really like any of them. So
I thought I'd ask what (if anything) you guys plan to do in the
future, since I usually like the ideas in clang. Perhaps tons of
notes should always be generated, and it is the responsibility of the
user of the diagnostic client to wade through them?

No. Although I'm not at all the one making decisions in Clang, I can
state with high certainty that there is no intention to introduce tons
of notes anywhere.

I don't want to
pick something, let my parser get much bigger with diagnostics
everywhere, then change the design later if someone has already got a
better idea.

Something we do in some situations where very generic parsers should
emit a diagnostic is to have the caller pass the diagnostic ID to the
generic code. This means that all diagnostics in these situations need
the same set of parameters, but that's usually fine. Follow the code
path for parsing variable initializers for an example.
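
A minimal sketch of that idea, for illustration only: the DiagID enum, diagText, and OperandParser below are hypothetical names, not Clang's real diagnostic IDs or parser API. The generic rule knows how to parse an operand but not why one was wanted, so the caller supplies the ID to report on failure. Every ID here takes the same single parameter, a column.

```cpp
#include <cassert>
#include <cctype>
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Hypothetical diagnostic IDs for this sketch.
enum class DiagID { ExpectedExpression, ExpectedRightOperand };

const char *diagText(DiagID id) {
    switch (id) {
    case DiagID::ExpectedExpression:   return "expected expression";
    case DiagID::ExpectedRightOperand: return "expected right operand";
    }
    assert(false && "unhandled DiagID");
    return "unknown diagnostic";
}

// All diagnostics share the same parameter set: an ID and a column.
struct Diagnostic {
    DiagID id;
    std::size_t column;
};

struct OperandParser {
    std::string src;
    std::size_t pos = 0;
    std::vector<Diagnostic> diags;

    explicit OperandParser(std::string s) : src(std::move(s)) {}

    void skipSpaces() {
        while (pos < src.size() && src[pos] == ' ') ++pos;
    }

    // Generic rule: reports whatever ID the caller asked for.
    bool parseNumber(DiagID onFailure) {
        skipSpaces();
        std::size_t start = pos;
        while (pos < src.size() &&
               std::isdigit(static_cast<unsigned char>(src[pos])))
            ++pos;
        if (pos > start) return true;
        diags.push_back({onFailure, pos + 1});
        return false;
    }

    // product := number ('*' number)*
    // The caller knows which operand it is asking for, so it picks the ID.
    bool parseProduct() {
        if (!parseNumber(DiagID::ExpectedExpression)) return false;
        for (;;) {
            skipSpaces();
            if (pos >= src.size() || src[pos] != '*') return true;
            ++pos;
            if (!parseNumber(DiagID::ExpectedRightOperand)) return false;
        }
    }
};
```

With this sketch, "3 * * 4" produces one ExpectedRightOperand diagnostic at the second '*', while "* 4" produces one ExpectedExpression diagnostic at the start. The generic code still emits the one and only error; it just takes the wording from its caller.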

Sebastian

How do you get this message?

OK, '*' is a special case, I got it with '/' or another operator.

Hi Sebastian.

Thank you for the response. You've helped me realize that the nature
of the problem is different than I thought it was. I thought it was
sort of a technical problem, but it's not. I agree with you that the
current clang error is good. I think the difference is that clang is
a compiler for the C family.

You can expect that your users have a very good idea what things like
"expression", "statement", "declaration", and "definition" mean.
Since I am writing something that is more for scientists and engineers
(sort of like MATLAB, if you've ever used that) the language can be
used by people who understand math but are not totally comfortable
with programming. Ignoring the fact that "expression" is also a
mathematical term, I think different kinds of messages may be
appropriate for different audiences. As far as clang is concerned,
it's fair to say that C-family programmers have a good working
knowledge of how the grammar works and what the different productions
are called, even if it's just an intuitive sense and they've never
learned formally.

Also, thanks for the clarification on notes. I'll also have a look at
the variable initializer parsing.

Ken

Sebastian gave an excellent response to the rest of the email.