Diagnostic Improvements

I've been pondering on diagnostics a bit lately, and reviewing one of Doug's patches got me to thinking about this more intently. Here are three problems I've noticed with Clang's current diagnostics:

1) We create a ton of std::strings when emitting Diagnostics. We do this any time a substitution happens (e.g. %1) like when printing integers (utostr), types, identifiers, and soon DeclarationNames. This generally doesn't matter, as the actual I/O and other work to print out the diagnostic dwarfs the cost of doing the string manipulations. However, there are plenty of diagnostics for warnings and extensions that end up *not* being printed, and we end up doing a ton of string manipulation (yay heap traffic) to create strings that are ultimately never printed. I believe, but haven't proven, that this also makes the code for "Diag" call sites large (in terms of code size), as they have to make lots of temporary strings and manage their lifetime.

2) There are cases when we want to format some things more richly than we're already doing. One simple example of this is with types. For example, I've seen diagnostics complaining that "there is some problem with type 'foo'". The problem here is that 'foo' is a typedef, and I have no idea what it is a typedef of - this is the amusing inverse of the problem most compilers have where they don't track typedefs at all. In an IDE, we'd like to be able to have more rich information about the 'foo' in the diagnostic, potentially allowing the user to click through it to dive into the structure of the type (e.g. click once and 'foo' turns into 'bar*', click again and it turns into "void (*)()"). In a CLI, there might be an -funravel-types=2 option or something. There is obviously no way to do this if the type is turned into a string before the DiagnosticClient gets it.

You could imagine doing this for things like template instantiation issues etc as well of course.

3) Poor Doug :slight_smile: is having to write code like this:

       if (NumParams == 1)
         DK = diag::err_operator_overload_must_be_binary;
       else
         DK = diag::err_operator_overload_must_be_binary_plural;

The difference is a single 's':

DIAG(err_operator_overload_must_be_binary, ERROR,
      "overloaded operator '%0' must be a binary operator (has %1 parameter)")
DIAG(err_operator_overload_must_be_binary_plural, ERROR,
      "overloaded operator '%0' must be a binary operator (has %1 parameters)")

and:

       diag::kind DK;
       if (Op == OO_PlusPlus)
         DK = diag::err_operator_overload_post_inc_must_be_int;
       else
         DK = diag::err_operator_overload_post_dec_must_be_int;

The difference is increment vs decrement:
DIAG(err_operator_overload_post_inc_must_be_int_member, ERROR,
      "parameter of overloaded post-increment operator must have type 'int' (not '%0')")
DIAG(err_operator_overload_post_dec_must_be_int_member, ERROR,
      "parameter of overloaded post-decrement operator must have type 'int' (not '%0')")

Having tons of variants of diagnostics like this is currently necessary for localization purposes, but it is ugly and painful to write.

I think it is time to stop the madness. Currently, diagnostics are passed down into DiagnosticClient with an array of strings that are referenced by substitutions, and turned into the final string by DiagnosticClient::FormatDiagnostic. I think it would be straight-forward to change this to be an array of void*, where (at the beginning) all of these are pointers to std::string. Once that was done, we can start introducing some more rich formatting characters like "%T0" instead of "%0" and making the corresponding pointers be to richer thing than strings. For example, "%T0" format the 0th element of the array as a QualType. We could also introduce things like %u0 (unsigned int) and %i0 (signed int).

Doing this would directly solve #1 above, because 1) diagnostics that are squelched never make it to FormatDiagnostic and 2) those that are could be printed into one big SmallString. This would also solve #2, because FormatDiagnostic would be given the QualType, so it could do any crazy AST-level groveling that it wants to format it based on what the DiagClient implementation wants. This also is a big step to solving #3, because it would mean that we could write things like:

DIAG(err_operator_overload_must_be_binary, ERROR,
      "overloaded '%0' must be a binary operator (has %u1 parameter%up1)")

Where %u1 prints the unsigned, and %up1 (p=plural, or pick a better letter) prints a "s" if the unsigned is != 1, or nothing if it is.

Does this seem like a reasonable thing to do? Does anyone volunteer to start tackling it?

-Chris

Hi,

Where %u1 prints the unsigned, and %up1 (p=plural, or pick a better
letter) prints a "s" if the unsigned is != 1, or nothing if it is.

This is not very localizable. First, some languages have more than one plural form (e.g. russian – they have more than "one" and "many"). Second, in many languages, the plural form is different for different words. E.g. in German, the plural of "Fehler" ("error") is "Fehler" (no change), the plural of "Ausnahme" ("exception") is "Ausnahmen" ("n" added), and the plural of "Baum" ("tree") is "Bäume" ("e" added, and "a" changed to its umlaut). I have no idea how relevant this is :wink:

Nico

Indeed. See GNU gettext utilities: Plural forms
for more detail on this issue and gettext's solution.

- Daniel

I don't understand how this is different than what I'm saying :). I'm not saying that "p" always turns into "s" for all languages. When doing localization, my assumption is that the entire english string would be replaced, and it could be replaced by anything. If the german string required crazy other things, it could definitely key off the number and form whatever phrases are needed. Is this not sufficient?

-Chris

In fact, for exactly the issues raised in this section (great stuff, thanks for the link!), the current foo and foo_plural approach is broken for translating into Polish.

-Chris

I've been pondering on diagnostics a bit lately, and reviewing one of
Doug's patches got me to thinking about this more intently. Here are
three problems I've noticed with Clang's current diagnostics:

[snip]

I think it is time to stop the madness. Currently, diagnostics are
passed down into DiagnosticClient with an array of strings that are
referenced by substitutions, and turned into the final string by
DiagnosticClient::FormatDiagnostic. I think it would be straight-
forward to change this to be an array of void*, where (at the
beginning) all of these are pointers to std::string. Once that was
done, we can start introducing some more rich formatting characters
like "%T0" instead of "%0" and making the corresponding pointers be to
richer thing than strings. For example, "%T0" format the 0th element
of the array as a QualType. We could also introduce things like %u0
(unsigned int) and %i0 (signed int).

[snip]

Does this seem like a reasonable thing to do?

GCC does something very similar to this, and it works quite well. They
have printf-like specifiers such as '%T' (print a type node), '%E'
(print an expression node), with qualifiers like 'q' (qualified name,
e.g., '%qT' prints 'foo::bar::MyClass' instead of just 'MyClass') to
tweak the output. We definitely need something like this for Clang,
and as long as the result is type-safe (e.g., asserts if you write
'%T0' and give it a string), I'm fine with whatever markup we choose.

This would also be a great time to make Clang's error output word wrap
when written to a terminal...

  - Doug

I think it is time to stop the madness. Currently, diagnostics are
passed down into DiagnosticClient with an array of strings that are
referenced by substitutions, and turned into the final string by
DiagnosticClient::FormatDiagnostic. I think it would be straight-
forward to change this to be an array of void*, where (at the
beginning) all of these are pointers to std::string. Once that was
done, we can start introducing some more rich formatting characters
like "%T0" instead of "%0" and making the corresponding pointers be to
richer thing than strings. For example, "%T0" format the 0th element
of the array as a QualType. We could also introduce things like %u0
(unsigned int) and %i0 (signed int).

I like this idea. I think that our diagnostics should be more "intentional" in nature and have the localization component handle more of the actual text that is rendered. Another way of phrasing it is the component that issues warning should be indifferent on how they are presented or rendered (which includes localization).

Another benefit of passing arbitrary objects instead of strings is that you have the opportunity (depending on the DiagnosticClient) to render key things like function names (e.g., by passing a FunctionDecl*) in arbitrary ways. For example, one can imagine that the HTML diagnostics engine (used by the static analyzer, but also usable for regular diagnostics) would add hyperlinks from function names to their definition. In a similar way IDEs could leverage such information to build truly rich experiences. By "lowering" to strings too early when we generate diagnostics, we take away some of the responsibility of "presentation" from the DiagnosticClient.

Doing this would directly solve #1 above, because 1) diagnostics that
are squelched never make it to FormatDiagnostic and 2) those that are
could be printed into one big SmallString. This would also solve #2,
because FormatDiagnostic would be given the QualType, so it could do
any crazy AST-level groveling that it wants to format it based on what
the DiagClient implementation wants. This also is a big step to
solving #3, because it would mean that we could write things like:

DIAG(err_operator_overload_must_be_binary, ERROR,
     "overloaded '%0' must be a binary operator (has %u1 parameter
%up1)")

I think that in the end we should probably go farther than this and not have any English text in the character string passed to DIAG since we have to think about localization from the get-go instead of retrofitting it in. This of course is problematic for custom diagnostics.

by passing a FunctionDecl*) in arbitrary ways. For example, one can imagine that the HTML diagnostics engine (used by the static analyzer, but also usable for regular diagnostics) would add hyperlinks from function names to their definition. In a similar way IDEs could leverage

That would be cool.

because FormatDiagnostic would be given the QualType, so it could do
any crazy AST-level groveling that it wants to format it based on what
the DiagClient implementation wants. This also is a big step to
solving #3, because it would mean that we could write things like:

DIAG(err_operator_overload_must_be_binary, ERROR,
    "overloaded '%0' must be a binary operator (has %u1 parameter
%up1)")

I think that in the end we should probably go farther than this and not have any English text in the character string passed to DIAG

Yes, I agree. We already aim to avoid this.

since we have to think about localization from the get-go instead of retrofitting it in. This of course is problematic for custom diagnostics.

At this point, I'm not too worried about custom diags. They are basically an escape hatch. Lets solve the simple case first :slight_smile:

-Chris

DIAG(err_operator_overload_must_be_binary, ERROR,
     "overloaded '%0' must be a binary operator (has %u1 parameter
%up1)")

Where %u1 prints the unsigned, and %up1 (p=plural, or pick a better
letter) prints a "s" if the unsigned is != 1, or nothing if it is.

Does this seem like a reasonable thing to do? Does anyone volunteer
to start tackling it?

Well, if you really want to have good localization, you simply can't do it. Plural is very tricky is other languages than English. Even in english you've many exceptions (e.g. one mouse -> two mice; one child -> two children; etc..). In German it's very complex: there are 7 different rules (AFAIR) and tons of exceptions, so..
Not really important, but I though I could share this info before you make any decision.

Nuno

Sure you can, it just involves giving the format specifier more control. e.g., create a localization DSL so that during localization, "parameter(s)" can be expressed correctly as "if (n = 1) then 'parameter' else 'parameters'" [*].

— Gordon

* Not a proposed syntax.