Bug report warning message duplicated as a note by most checkers

Bug report warning message duplicated as a note by most checkers

The class BugReport introduces two separate data members called Description and ShortDescription and nicely documents their purpose:

  • Description is “A verbose warning message that is appropriate for displaying next to the source code that introduces the problem.”
  • ShortDescription is “A short general warning message that is appropriate for displaying in the list of all reported bugs. It should describe what kind of bug is found but does not need to try to go into details of that specific bug.”

However, in practice these two data members almost always contain the same string in classes like PathSensitiveBugReport that derive from BugReport [1]. The only exception is BitwiseShiftChecker, where I introduced separate full and short descriptions a few months ago (because I trusted the comments in BugReport and didn’t realize that this distinction doesn’t appear anywhere else).

In the analyzer output the ShortDescription appears as the warning message, while Description is printed as a note at the same location, so currently practically every checker produces a message that’s printed twice (once as a note, once as a warning).

The repetition is not misleading, but I think that this a high priority issue, because it is very visible and completely pointless and has an obvious solution, so it creates an impression that the Static Analyzer is an unfinished, “dirty” project.

I think there are three potential decisions at this point:

  • Just eliminate all the redundant note messages and remove the distinction between the two Description fields. This is very straightforward, although it would change lots of lines in the test files.
  • Gradually separate short and full messages in each checker. This is a very long project with 100+ small commits, but could eventually lead to better messages.
  • Hybrid solution: Don’t print the note message when it just repeats the warning, but let the checkers provide a separate longer Description if they want (and gradually introduce these full descriptions when we’re improving checkers).

I would be happy to implement either the simple elimination, or the hybrid solution (or more precisely its first phase that eliminates the repetition). I don’t think that the gradual separation is an adequate solution, as separately editing all the checkers would be a very slow process with limited benefits (and this issue would persist until its completion).

I understand that eliminating all those useless note messages would be a breaking change in the output of the Clang Static Analyzer (for example, bugreport hashes that include the note messages would change and I’d guess that there could be frontends that don’t display the warning because they except that its content was already displayed as a note).

I think that despite these potential difficulties this issue must be fixed, but I would be very careful when I release these changes. (E.g. if needed, I could introduce a compatibility flag that can reactivate the legacy behavior for users that depend on it.)

@NoQ @Xazax-hun @steakhal @Szelethus @gamesh411 @ anybody else
What do you think about this issue? What kind of solution would you prefer? What do you know about the potential impact of this change, what should I do to ensure that it’s acceptable for the users?

By the way, this question is relevant for me because I’m trying to “clean up” alpha checkers (currently ArrayBoundV2) and when I work on improving the diagnostics, I need to decide whether I want to separate the short and full descriptions. As a concrete example, a few days ago I uploaded PR #70056 for review and that change would separate the short and full description in ArrayBoundV2 (among other improvements). Resolving this issue would help me move forward with that process.

I couldn’t find any earlier discussion about this topic; I’m sorry if I’m duplicating some old ticket.

[1] I checked all calls to make_unique<PathSensitiveBugReport>() and make_unique<BasicBugReport>() with the help of clang-query and grep.

1 Like

I don’t know why would this imply that CSA is a “dirty” project. I can’t connect the dots.
Anyways, I agree with you that the logic separation of the Description and ShortDescription has value, and we should use and enforce them.

In new checkers, we could try to enforce this distinction on reviews.
IDK how to automate this.

My take is that the short-description should be generic, thus expressed by a single cstring (without any substitutions). This would make people find them easier, by simply grepping the whole message.
The notes are by nature, closely bound to the bug-paths, thus they offer a better place for specific messages; which allows us to use substitutions there, contextualizing the message.

IMO such separation makes sense from the users perspective, thus I’d avoid “elimintating” them. I’m okay with the other two options though.
However, we should not expose a partial transition in a clang release if we choose option “gradual separation”. FYI the next clang release (release/18.x) is expected to branch off in early January.

Probably, the most practical approach would be the “hybrid solution”.
Along with that transition, we should make ShortDescription mandatory, and Description optional.

Yes, it needs to be mention in the release notes, and vendors would need to adapt. I think that’s fine.

TBH I strongly believe that tool-vendors can uplift their product (tools). I’d be weakly against more compatibility flags, especially without a clear deprecation timeline. We already have a couple “compatibility” flags, without much documentation, reasoning or timeline - and that also leads to poor user-experience.

  1. Agreed and relevant.
  2. Hybrid approach.
  3. They should adapt, and I’ll look into what Sonar would need to adapt.

I can also recommend weggli.

I think the hybrid approach is fine in the short term, but I am not 100% sure whether the distinction between description and short description has any added value.

The question is, what would be the purpose of the longer description? In case we want a longer text explaining the current instance of the problem, we might be better off adding additional notes to the diagnostic that appear at the right source location. In case we want to give a better explanation of the error category, I wonder if a link to some documentation with examples is a better user experience. That being said, I do not have strong feelings about eliminating the long version, so I am also fine keeping it if other people see the value in that. What do you think?

@steakhal

My take is that the short-description should be generic, thus expressed by a single cstring (without any substitutions).

I feel that if we strictly ban all substitutions in the short-description, then it will be too close to the checker name and won’t provide enough information. I’d say that the short-description:

  • Should have a simple “subject verb object” structure like “the foo breaks bar” or “right operand is negative in left shift”. (A well-known name like “zero division error” is even better, if there’s one.)
  • Should not name variable names, line numbers, class names or other data that’s meaningless without seeing the code context (but I’d say that “Left shift by ‘4294967275’ overflows the capacity of ‘uint64_t’” is acceptable, because numbers and number types usually don’t need context).
  • Should preferably have a searchable part (that’s a good point), but it’s OK if it substitutes something e.g. at the end of the message.
  • Should contain as much relevant information as possible (within these bounds). If the user reviews a long list of reports, they’ll only see this short description and it’s better if the list doesn’t contain twenty identical entries when we could report some “natural” difference between them (that’s more relevant and memorable than the location/reportID/other generic identifiers).

Probably, the most practical approach would be the “hybrid solution”.
Along with that transition, we should make ShortDescription mandatory, and Description optional.

I agree with these.

TBH I strongly believe that tool-vendors can uplift their product (tools).

That’s good news :slight_smile: (I had practically no knowledge about the potential reactions of the tool-vendors, that’s why I mentioned the compatibility flag defensively.)

I can also recommend weggli.

Thanks for the suggestion, it seems to be a very useful tool!

@Xazax-hun

I think the hybrid approach is fine in the short term, but I am not 100% sure whether the distinction between description and short description has any added value.
The question is, what would be the purpose of the longer description?

IMO the purpose of the long description is to collect and summarize all the assumptions and facts which together show that the analyzer reached a failure state.

There are some bug types where the full answer to the “What is the problem?” question is very concise (e.g. “division by zero”, “dereference of null pointer” – there are no relevant additional details), but there are also some situations where there are many relevant factors and only their combination signifies that this is a bug. Some concrete examples:

  • in array bounds checking “Access of ‘v.elems’ at index 64, while it holds only 64 ‘double’ elements” (from my proposed PR #70056)
  • in pedantic mode of bitshift checking “The shift ‘1 << 31’ is undefined because ‘int’ can hold only 31 bits (excluding the sign bit), so 1 bit overflows” (from core.BitwiseShift)

I feel that there are use cases where these messages are simply too long and unwieldy, and it’s better to display shorter messages like:

  • Out of bound access to memory after the end of ‘v.elems’
  • The shift ‘1 << 31’ overflows the capacity of ‘int’

In case we want a longer text explaining the current instance of the problem, we might be better off adding additional notes to the diagnostic that appear at the right source location.

There are three advantages of the “long description note at end of path” style that I’m advocating for:

  • The bug path is often full of almost-irrelevant assumptions corresponding to branches taken in conditional statements. (The path-sensitive analysis doesn’t know what’s on the other branch, so it must note that it’s making that assumption, but most of them are just noise.) In theory this could be improved in the future, but until then, it’s much easier to find the relevant information if it’s at the end (instead of being hidden among the garbage).
  • The detailed note may state information (e.g. bit size of integer types) that doesn’t correspond to relevant source locations.
  • The long description can show the combined assumptions of the analyzer, as opposed to the (also relevant, but different) individual assumptions that are displayed along the bug path. For example if we consider the following code:
int A[6];
if (x > 5) {
  if (x < 7) {
    return A[x];
  }
}

then the array bounds checker can emit a long description which shows that the analyzer (correctly) assumed that x == 6 at that point (while the notes along the bug path only show the two separate inequalities).

Showing the final form of the assumptions is especially relevant when the analyzer assumes a counter-intuitive overflow or underflow: in those cases it’s very helpful to see that according to the analyzer’s assumptions a certain variable is contains INT_MIN / INT_MAX (or something similar).

However, I also realized that I badly misunderstood the situation and the “users see ugly duplicated messages” issue seems to be a figment of my imagination. It’s still relevant to discuss the potential separation of long and short descriptions, but there is no need for any breaking changes in the output format because the status quo is adequate and can naturally handle the long/short message separation.

Conceptually the output of CSA consists of two separate parts:

  • The warning message is a bug summary, which is the short description (or the single description).
  • The note messages form the bug path, which contains other notes and ends in the long description (or the single description).

I didn’t realize this earlier, but it’s obvious that the bug summary should not be appended to the end of the bug path, because it never contains information that’s not present in the bug path (when there’s a single description, it clearly shouldn’t be duplicated, but even in the other case, we don’t want to display the short description directly after the long description).

I’m certain that other people (especially those who develop front-ends for CSA) already realized this and they don’t produce ugly output that looks like our tests, (where the note and warning appear directly after each other).

[I ended up a bit like the guy who hears in the radio that “Be careful on highway XX, there is a madman who’s driving in the wrong direction!” then exclaims “Just one?! All of them do!”.]

This means that we don’t need to do any sweeping systemic changes; and it’s presumably feasible to introduce separate long and short messages in any checker where they seem to be useful.

(I’m still asking for feedback to decide whether is it a good idea to use this possibility and extend some checkers with messaging that looks like PR #70056 – but this is a smaller and less important question.)

As you mentioned, we already have a relatively short description of the problem in the BugType. So, I think the question is, whether we need 3 descriptions of different verbosity.

That being said, example warning message you provided looks wonderful to me. I think including info like that in the message makes sense.

That being said, example warning message you provided looks wonderful to me.

Thanks :smile:

Then I will continue to introduce separate long/short messages if I’m working on a checker where it’s valuable, but I won’t start a refactoring process that would systemically introduce this feature in many checkers. (Not that I would’ve had time for such a long and boring procedure…)

The duplicate note at the end of a path-sensitive report isn’t a problem, it’s completely intended, I don’t see this as an issue. The original warning tells the user what the issue is. Then, path notes tell the user how we believe the issue happened. Then the final note acts as a Q.E.D., the conclusion to the story: “…and this is how all of this leads to a null pointer dereference.”

The final note is displayed differently in different UIs with different degrees of clumsiness. For example, in scan-build it doesn’t result in a separate bubble; the warning simply becomes the top text of the html page, and the final note becomes a bubble. Which I think makes perfect sense.

I think the only place where it looks ugly is the text UI, especially when the path is short, so the user notices that there’s duplication. But even then, for longer reports, the user is arguably grateful that we didn’t force them to scroll all the way back up in order to recall what’s wrong. This also matches our recommended way to read the report: from bottom to top! So a different, arguably better solution would be to simply write notes in reverse order?!

For example:

int foo(bool flag) {
  int x = 0;      // 'x' initialized to 0
  if (flag)       // assuming 'flag' is false, taking false branch
    x = 1;
  return 1 / x;   // division by zero
}

Before (Compiler Explorer):

<source>:5:12: warning: Division by zero [core.DivideZero]
    5 |   return 1 / x;
      |          ~~^~~
<source>:2:3: note: 'x' initialized to 0
    2 |   int x = 0;
      |   ^~~~~
<source>:3:7: note: Assuming 'flag' is false
    3 |   if (flag)
      |       ^~~~
<source>:3:3: note: Taking false branch
    3 |   if (flag)
      |   ^
<source>:5:12: note: Division by zero
    5 |   return 1 / x;
      |          ~~^~~

After:

<source>:5:12: warning: Division by zero [core.DivideZero]
    5 |   return 1 / x;
      |          ~~^~~
<source>:3:3: note: Taking false branch
    3 |   if (flag)
      |   ^
<source>:3:7: note: Assuming 'flag' is false
    3 |   if (flag)
      |       ^~~~
<source>:2:3: note: 'x' initialized to 0
    2 |   int x = 0;
      |   ^~~~~

We probably can’t change this now that users got used to reading it from bottom to top, but I think it would have made a lot more sense if implemented from the start. This is also how normal compiler warnings usually organize their notes: starting from warning location, going all the way up to external context.


Now, separately from that, there’s this whole situation where each warning message is actually 4 warning messages in a trenchcoat:

  1. Category: "Memory Error"
  2. Bug type name: "Leak of returned object"
  3. Description: "Object returned to caller as an owning reference (single retain count transferred to caller)"
  4. The actual message: "Object leaked: object allocated and stored into 'foo' is returned from a method that is annotated as CF_RETURNS_NOT_RETAINED and has reference count of +1"

On top of that, we have checker names, which aren’t messages, but they can be used for many purposes for which messages are also used, eg. for grouping bug reports by “type of issue”.

Scan-build displays all four messages. The first two are displayed in the index table, then the other two are displayed inside the actual report.

Xcode ignores the bug name thing (2.) entirely. It displays the category as a collapsible list of all warnings in that category, then it displays each warning as Description, that you can further un-collapse to see all the steps ending with the actual warning.

Text output mode ignores (1.) and (2.) entirely.

So it might be that we can nuke (2.).

Then, the difference between the description and the actual message is only used by a couple checkers. Maybe there should have never been a difference. But that would be a bit harder to undo now, as we need to confirm that valuable information and usability/UX isn’t lost in any of these cases.

And, yeah, same as checker names, the problem with categories is that they’re yet another form of warning categorization that should have been implemented as hashtags instead, because right now they are overlapping and it makes zero sense to assign every checker into only one category.

Yes, I realized that since writing my initial message.

We probably can’t change this now that users got used to reading it from bottom to top, but I think it would have made a lot more sense if implemented from the start.

The current “earlier steps first” order also has some advantages. However, you’re right that this question is mostly irrelevant, as we shouldn’t flip the “usual” order.

So it might be that we can nuke (2.). [ Bug type name: "Leak of returned object" ]

I’d support that change – you’re right that it doesn’t contribute useful information.

Then, the difference between the description and the actual message is only used by a couple checkers.

I feel that “the actual message” is basically just another note on the bug path, so it doesn’t cause any problems if it’s sometimes different from the stand-alone description. (E.g. at the end of the bug path we want to to specify variable/function names, while in the description there is no context and they’re much less useful. Also, long descriptions would be cumbersome, while a long note is OK if it’s full of useful information.)

And, yeah, same as checker names, the problem with categories is that they’re yet another form of warning categorization that should have been implemented as hashtags instead, because right now they are overlapping and it makes zero sense to assign every checker into only one category.

Would it be possible/useful to use the same set of hashtags for both the checkers and the individual reports that are emitted by the checkers?