Reporting UBSan issues at compile-time

vedantk · March 23, 2017, 1:52am

Hi,

I've performed some experiments with reporting UBSan diagnostics at
compile-time and think that this is a useful thing to do. I'd like to discuss
the motivation, the approach I took, and some results.

=== Motivation ===

We're interested in fixing UB in our projects and use UBSan to do this.
However, we have lots of software that is easy to build but hard to run, or
hard to test with adequate code coverage (e.g firmware). This limits the amount
of bugs we can catch with UBSan.

It would be nice if we could report UB at compile-time without false positives.
We wouldn't be able to report everything a runtime tool could, but we would be
able to report a large number of real bugs very quickly, just by rebuilding all
our software with a flag enabled.

=== Approach ===

I wrote a simple analysis which detects UB statically by piggybacking off UBSan.
It's actually able to issue decent diagnostics. It only issues a diagnostic if
it finds a call to a UBSan diagnostic handler which post-dominates the function
entry block.

The idea is: if a function unconditionally exhibits UB when called, it's worth
reporting the UB at compile-time.

Here is a full example. This C program has UB because it returns a null pointer
when it shouldn't:

  __attribute__((returns_nonnull)) int *returns_nonnull(int *p) {
    return p; // Bug: null pointer returned here.
  }
  
  int main() {
    returns_nonnull((int *)0LL);
    return 0;
  }

With UBSan enabled, here's the IR we get:

  define nonnull i32* @returns_nonnull(i32* %p) #0 {
  entry:
    ...
    %1 = icmp ne i32* %p, null, !nosanitize !2
    br i1 %1, label %cont, label %handler.nonnull_return
  
  handler.nonnull_return:
    call void @__ubsan_handle_nonnull_return(...), !nosanitize !2
    br label %cont, !nosanitize !2
  
  cont:
    ret i32* %p
  }

  define i32 @main() #0 {
  entry:
    ...
    %call = call nonnull i32* @returns_nonnull(i32* null)
    ret i32 0
  }

At -O2, LLVM inlines @returns_nonnull and throws away the null check:

  define i32 @main() local_unnamed_addr #0 {
  entry:
    tail call void @__ubsan_handle_nonnull_return(...), !nosanitize !2
    ret i32 0
  }

The call to UBSan's diagnostic handler post-dominates the function entry block,
so we report it right away:

$ clang -fsanitize=undefined -O2 -Xclang -enable-llvm-linter buggy.c
Undefined behavior: invalid null return value (buggy.c:3:1)

=== Results ===

I packaged up my analysis into LLVM's Lint pass and added a clang option to
enable linting. The initial patch is up for review:

https://reviews.llvm.org/D30949 - Add an option to enable LLVM IR linting

I built a few internal projects with UBSan, optimizations, and linting enabled.
This exposed real bugs. The only problem was that I got reports about UB in
dead code. Maybe this can be addressed by setting up sanitizer blacklists?

=== Alternatives? ===

We could try implementing something like the STACK UB checker:

https://people.csail.mit.edu/nickolai/papers/wang-stack-tocs.pdf

I haven't compared my approach vs. STACK in terms of bug-finding efficacy. The
latter does seem harder to implement.

I'm interested in hearing what others think.

thanks,
vedant

bogner · March 23, 2017, 5:21am

Vedant Kumar via cfe-dev <cfe-dev@lists.llvm.org> writes:

Hi,

I've performed some experiments with reporting UBSan diagnostics at
compile-time and think that this is a useful thing to do. I'd like to
discuss the motivation, the approach I took, and some results.

=== Motivation ===

We're interested in fixing UB in our projects and use UBSan to do
this. However, we have lots of software that is easy to build but
hard to run, or hard to test with adequate code coverage (e.g
firmware). This limits the amount of bugs we can catch with UBSan.

It would be nice if we could report UB at compile-time without false
positives. We wouldn't be able to report everything a runtime tool
could, but we would be able to report a large number of real bugs very
quickly, just by rebuilding all our software with a flag enabled.

=== Approach ===

I wrote a simple analysis which detects UB statically by piggybacking
off UBSan. It's actually able to issue decent diagnostics. It only
issues a diagnostic if it finds a call to a UBSan diagnostic handler
which post-dominates the function entry block.

The idea is: if a function unconditionally exhibits UB when called,
it's worth reporting the UB at compile-time.

So does this only emit the warning if -fsanitize=undefined is set? How
much overhead would be implied by doing an analysis like this without
actually emitting the ubsan diagnostic handler?

Even if the answer to that is "a lot" or "enough to be contentious",
what about enabling this warning by default when building with
-fsanitize=undefined? It sounds like the compile time overhead when
already building with sanitizers would be fairly insignificant.

Here is a full example. This C program has UB because it returns a
null pointer when it shouldn't:
  __attribute__((returns_nonnull)) int *returns_nonnull(int *p) {
    return p; // Bug: null pointer returned here.
  }
 
  int main() {
    returns_nonnull((int *)0LL);
    return 0;
  }
With UBSan enabled, here's the IR we get:
  define nonnull i32* @returns_nonnull(i32* %p) #0 {
  entry:
    ...
    %1 = icmp ne i32* %p, null, !nosanitize !2
    br i1 %1, label %cont, label %handler.nonnull_return
 
  handler.nonnull_return:
    call void @__ubsan_handle_nonnull_return(...), !nosanitize !2
    br label %cont, !nosanitize !2
 
  cont:
    ret i32* %p
  }

  define i32 @main() #0 {
  entry:
    ...
    %call = call nonnull i32* @returns_nonnull(i32* null)
    ret i32 0
  }
At -O2, LLVM inlines @returns_nonnull and throws away the null check:
  define i32 @main() local_unnamed_addr #0 {
  entry:
    tail call void @__ubsan_handle_nonnull_return(...), !nosanitize !2
    ret i32 0
  }
The call to UBSan's diagnostic handler post-dominates the function
entry block, so we report it right away:

  $ clang -fsanitize=undefined -O2 -Xclang -enable-llvm-linter buggy.c
  Undefined behavior: invalid null return value (buggy.c:3:1)

=== Results ===

I packaged up my analysis into LLVM's Lint pass and added a clang
option to enable linting. The initial patch is up for review:

  ⚙ D30949 [CodeGen] Add an option to enable LLVM IR linting - Add an option to enable LLVM IR linting

Assuming turning this on by default with ubsan is too much, it seems to
me that this overlaps quite a bit with the static analyzer and/or
clang-tidy - integrating it into one of those might make more sense than
adding yet another option to -cc1.

dblaikie · March 23, 2017, 3:18pm

FWIW - Clang is fairly allergic to emitting diagnostics based on optimization because they tend to present usability problems. They can appear/disappear due to seemingly unrelated changes in the code (that trigger or hinder optimizations that cause the diagnostic path to be hit).

Usually the idea is to implement these sort of bug finding techniques in Clang’s static analyzer. So perhaps there would be a way to feed UBSan’s facts/checks into the static analyzer in a more consistent way (I’m sure some of the same checks are implemented there already - but generalizing/unifying UBSan’s checks to feed into the static analyzer could be handy).

Dave

AnnaZaks · March 23, 2017, 5:39pm

FWIW - Clang is fairly allergic to emitting diagnostics based on optimization because they tend to present usability problems. They can appear/disappear due to seemingly unrelated changes in the code (that trigger or hinder optimizations that cause the diagnostic path to be hit).

Usually the idea is to implement these sort of bug finding techniques in Clang’s static analyzer. So perhaps there would be a way to feed UBSan’s facts/checks into the static analyzer in a more consistent way (I’m sure some of the same checks are implemented there already - but generalizing/unifying UBSan’s checks to feed into the static analyzer could be handy).

1

We already have several tools that detect errors, mainly UB, statically. Depending on the type of analysis you choose to write and false positive tolerance I’d suggest extending one of these:

compiler warnings
clang-tidy
clang static analyzer

I believe it will be trivial to teach the clang static analyzer to catch the bug in the example you provide below.

Cheers,
Anna

rnk · March 23, 2017, 5:45pm

FWIW - Clang is fairly allergic to emitting diagnostics based on
optimization because they tend to present usability problems. They can
appear/disappear due to seemingly unrelated changes in the code (that
trigger or hinder optimizations that cause the diagnostic path to be hit).

I was going to say this, but users keep asking for UB warnings from the
middle-end. At some point, we might want to throw them a bone and give them
something. =) We wouldn't want to enable such warnings by default, though.

Usually the idea is to implement these sort of bug finding techniques in

Clang's static analyzer. So perhaps there would be a way to feed UBSan's
facts/checks into the static analyzer in a more consistent way (I'm sure
some of the same checks are implemented there already - but
generalizing/unifying UBSan's checks to feed into the static analyzer could
be handy).

I don't think it will be feasible to generalize UBSan's knowledge to the
static analyzer. The static analyzer CFG is also at best an approximation
of the real CFG, especially for C++.

=== Approach ===

I wrote a simple analysis which detects UB statically by piggybacking off
UBSan.
It's actually able to issue decent diagnostics. It only issues a
diagnostic if
it finds a call to a UBSan diagnostic handler which post-dominates the
function
entry block.

The idea is: if a function unconditionally exhibits UB when called, it's
worth
reporting the UB at compile-time.

Checking for ubsan handlers that post-dominate a function entry block seems
like a weak heuristic. If you put that code inside an if, or if main is
inlined into a block that doesn't post-dominate the entry of the caller, we
won't warn on it.

What if we had a way of tagging certain branches as "warn if this branch is
optimized to true", and then we had hooks from branch simplification
utilities to emit warnings? That seems like it might find a lot more bugs,
but it could have false positives in dead code. Do you think that could
work?

dblaikie · March 23, 2017, 5:55pm

Why not? The rough idea I meant would be to express the constraints UBSan is checking into the static analyzer - I realize the current layering (UBSan being in Clang’s IRGen) doesn’t make that trivial/obvious, but it seems to me that the constraints could be shared in some way - with some work.

Sure enough - and I believe some of the people working/caring about it would like to fix that. I think Manuel & Chandler have expressed the notion that the best way to do that would be to move to a world where the CFG is used for CodeGen, so it’s a single/consistent source of truth.

rnk · March 23, 2017, 6:48pm

I don't think it will be feasible to generalize UBSan's knowledge to the
static analyzer.

Why not? The rough idea I meant would be to express the constraints UBSan
is checking into the static analyzer - I realize the current layering
(UBSan being in Clang's IRGen) doesn't make that trivial/obvious, but it
seems to me that the constraints could be shared in some way - with some
work.

Maybe I am not imaginative enough, but I cannot envision a clean way to
express the conditions that trigger UB that is useful for both static
analysis and dynamic instrumentation. The best I can come up with is
AST-level instrumentation: synthesizing AST nodes that can be translated to
IR or used for analysis. That doesn't seem reasonable, so I think getting
ubsan into the static analyzer would end up duplicating the knowledge of
what conditions trigger UB.

The static analyzer CFG is also at best an approximation of the real CFG,

especially for C++.

Sure enough - and I believe some of the people working/caring about it
would like to fix that. I think Manuel & Chandler have expressed the notion
that the best way to do that would be to move to a world where the CFG is
used for CodeGen, so it's a single/consistent source of truth.

Yes, we could do all that work, but LLVM's CFG today is already precise for
C++. If we allow ourselves to emit diagnostics from the middle-end, we can
save all that work.

Going down the high-effort path of extending the CFG and abstracting or
duplicating UBSan's checks as static analyses on that CFG would definitely
provide a better diagnostic experience, but it's worth re-examining
conventional wisdom and exploring alternatives first.

vedantk · March 23, 2017, 8:02pm

Responses inline --

Justin:

So does this only emit the warning if -fsanitize=undefined is set? How
much overhead would be implied by doing an analysis like this without
actually emitting the ubsan diagnostic handler?

Yes, you'd need to have -fsanitize=something set. IMO this is nice because it generalizes to other/future sanitizers, and it's totally opt-in. You wouldn't generate diagnostics for issues you aren't interested in.

Even if the answer to that is "a lot" or "enough to be contentious",
what about enabling this warning by default when building with
-fsanitize=undefined? It sounds like the compile time overhead when
already building with sanitizers would be fairly insignificant.

I think it'd at least be contentious to the analysis without -fsanitize=undefined set. We'd need to change a lot of code in llvm to do the reporting, and it could have compile-time costs.

I'd be in favor of enabling the warning when building with -fsanitize=undefined.

Assuming turning this on by default with ubsan is too much, it seems to
me that this overlaps quite a bit with the static analyzer and/or
clang-tidy - integrating it into one of those might make more sense than
adding yet another option to -cc1.

Do you mean that the checks should operate on the clang CFG, or that the driver option to enable diagnostics from the middle-end should be a static analyzer driver option?

dblaikie:

FWIW - Clang is fairly allergic to emitting diagnostics based on optimization because they tend to present usability problems. They can appear/disappear due to seemingly unrelated changes in the code (that trigger or hinder optimizations that cause the diagnostic path to be hit).

What if we advertise this as a new clang warning: "-Woptimizer". We'd be up front about it only reporting issues the optimizer actually runs in to, and that it would be dependent on the optzn level.

Usually the idea is to implement these sort of bug finding techniques in Clang's static analyzer. So perhaps there would be a way to feed UBSan's facts/checks into the static analyzer in a more consistent way (I'm sure some of the same checks are implemented there already - but generalizing/unifying UBSan's checks to feed into the static analyzer could be handy).

One roadblock with doing the checks on the clang CFG is that we'd miss issues that occur after LTO. This is an important use-case.

I think an approach where we have one simple LLVM analysis passing diagnostics up to clang is actually a more unified solution. We wouldn't have to reimplement the bug checking on the clang CFG, we'd get feature parity between the static and runtime checks without writing extra code, and the diagnostics would work in all scenarios the optimizer is used in.

Anna:

We already have several tools that detect errors, mainly UB, statically. Depending on the type of analysis you choose to write and false positive tolerance I’d suggest extending one of these:
- compiler warnings
- clang-tidy
- clang static analyzer

I believe it will be trivial to teach the clang static analyzer to catch the bug in the example you provide below.

I have the same concerns about this. I'm not sure how we'd get the clang static analyzer to "see" issues exposed by LTO. I'm also concerned about this being a much larger undertaking, with potentially higher false-positive rates, and without bug-checking parity with UBSan.

Reid:

Checking for ubsan handlers that post-dominate a function entry block seems like a weak heuristic. If you put that code inside an if, or if main is inlined into a block that doesn't post-dominate the entry of the caller, we won't warn on it.

What if we had a way of tagging certain branches as "warn if this branch is optimized to true", and then we had hooks from branch simplification utilities to emit warnings? That seems like it might find a lot more bugs, but it could have false positives in dead code. Do you think that could work?

You're right, this is a weak heuristic. UBSan-inserted branches are marked "!nosanitize", we could probably create diagnostics when we throw such branches away.

I don't have a good solution yet for the dead code issue except to: maintain blacklists, or special-case patterns which lead to dead code (in my case, NSRequestConcreteImplementation).

Thank you all for your comments.

vedant

AnnaZaks · March 23, 2017, 8:42pm

First, the CFG is not only used by the analyzer, but it is also used by Sema (ex: unreachable code warnings and uninitialized variables).
Second, while there are corners of C++ that are not supported, it has high fidelity otherwise.

The idea of analysis based on top of LLVM IR is not new and have been discussed before. My personal belief is that having access to the AST (or just code as was written by the user) is very important. It ensures we can provide precise diagnostics. It also allows us to see when users want to suppress an issue report by changing the way the source code is uttered. For example, allows to tell the developer that they can suppress with a cast. We can also differentiate between “NULL” and “0”, which allows us to determine if the programmer intended to use a pointer constant or a zero numeric value.

Ensuring that clang CFG completely supports C++ is a bit challenging but not insurmountable task. In addition, fixing-up the unsupported corners in the CFG would benefit not only the clang static analyzer, but all the other “users” such as clang warnings, clang-tidy, possibly even refactoring in the future.

Here is a thread where this has been discussed before:
http://clang-developers.42468.n3.nabble.com/LLVM-Dev-meeting-Slides-amp-Minutes-from-the-Static-Analyzer-BoF-td4048418.html

vedantk · March 23, 2017, 11:23pm

I don't think it will be feasible to generalize UBSan's knowledge to the static analyzer.

Why not? The rough idea I meant would be to express the constraints UBSan is checking into the static analyzer - I realize the current layering (UBSan being in Clang's IRGen) doesn't make that trivial/obvious, but it seems to me that the constraints could be shared in some way - with some work.

Maybe I am not imaginative enough, but I cannot envision a clean way to express the conditions that trigger UB that is useful for both static analysis and dynamic instrumentation. The best I can come up with is AST-level instrumentation: synthesizing AST nodes that can be translated to IR or used for analysis. That doesn't seem reasonable, so I think getting ubsan into the static analyzer would end up duplicating the knowledge of what conditions trigger UB.

The static analyzer CFG is also at best an approximation of the real CFG, especially for C++.

First, the CFG is not only used by the analyzer, but it is also used by Sema (ex: unreachable code warnings and uninitialized variables).
Second, while there are corners of C++ that are not supported, it has high fidelity otherwise.

Sure enough - and I believe some of the people working/caring about it would like to fix that. I think Manuel & Chandler have expressed the notion that the best way to do that would be to move to a world where the CFG is used for CodeGen, so it's a single/consistent source of truth.

Yes, we could do all that work, but LLVM's CFG today is already precise for C++. If we allow ourselves to emit diagnostics from the middle-end, we can save all that work.

Going down the high-effort path of extending the CFG and abstracting or duplicating UBSan’s checks as static analyses on that CFG would definitely provide a better diagnostic experience, but it's worth re-examining conventional wisdom and exploring alternatives first.

The idea of analysis based on top of LLVM IR is not new and have been discussed before. My personal belief is that having access to the AST (or just code as was written by the user) is very important.

UBSan diagnostics provide a filename, line #, and column #. There is enough information here for a clang-based tool to retrieve an AST node, and generate an appropriate diagnostic. In the non-LTO case, we'd have an ASTContext ready to go at the point where we get the diagnostics back from LLVM. In the LTO case, we could serialize the diagnostics and pass them to a clang tool for processing.

It ensures we can provide precise diagnostics. It also allows us to see when users want to suppress an issue report by changing the way the source code is uttered. For example, allows to tell the developer that they can suppress with a cast.

When users of a hypothetical "-Woptimizer" need to suppress diagnostics, they could use the no_sanitize function attribute, or sanitizer blacklists. These are big hammers: if it proves to be too clumsy, we could find a way to do more fine-grained sanitizer issue suppression.

Ensuring that clang CFG completely supports C++ is a bit challenging but not insurmountable task. In addition, fixing-up the unsupported corners in the CFG would benefit not only the clang static analyzer, but all the other “users” such as clang warnings, clang-tidy, possibly even refactoring in the future.

Here is a thread where this has been discussed before:
http://clang-developers.42468.n3.nabble.com/LLVM-Dev-meeting-Slides-amp-Minutes-from-the-Static-Analyzer-BoF-td4048418.html

I've tried to summarize this thread:

--- Slides & Minutes from the Static Analyzer BoF ---

* Cross Translation Unit Static Analysis: The proposed approach is to generate summaries for each TU, and to teach the analyses to emit diagnostics based on the summaries. I don't know the current status of this, or what using summaries does to false-positive rates.

* A Clang IR: Modeling things like temporaries is hard, so is dealing with language changes. One proposal is to create an alternate IR for C, then lower it to LLVM IR. The argument is that analyses would be easier to perform on the "C/C++" IR, because CodeGen/StaticAnalyzer would operate on the same structures. There are concerns raised about the complexity of this undertaking

* A Rich Type System in LLVM IR: An alternate proposal is to add a rich, language-agnostic type system to LLVM IR, with pointers back to frontend AST's. Under this proposal, I'm assuming LLVM would be responsible for generating diagnostics. There were concerns raised about having optimizations preserve the type information, about AST node lifetime issues, and about layering violations (i.e LLVM shouldn't be dealing with frontend stuff).

dblaikie · March 23, 2017, 11:32pm

I don’t think it will be feasible to generalize UBSan’s knowledge to the static analyzer.

Why not? The rough idea I meant would be to express the constraints UBSan is checking into the static analyzer - I realize the current layering (UBSan being in Clang’s IRGen) doesn’t make that trivial/obvious, but it seems to me that the constraints could be shared in some way - with some work.

Maybe I am not imaginative enough, but I cannot envision a clean way to express the conditions that trigger UB that is useful for both static analysis and dynamic instrumentation. The best I can come up with is AST-level instrumentation: synthesizing AST nodes that can be translated to IR or used for analysis. That doesn’t seem reasonable, so I think getting ubsan into the static analyzer would end up duplicating the knowledge of what conditions trigger UB.

The static analyzer CFG is also at best an approximation of the real CFG, especially for C++.

First, the CFG is not only used by the analyzer, but it is also used by Sema (ex: unreachable code warnings and uninitialized variables).
Second, while there are corners of C++ that are not supported, it has high fidelity otherwise.

Sure enough - and I believe some of the people working/caring about it would like to fix that. I think Manuel & Chandler have expressed the notion that the best way to do that would be to move to a world where the CFG is used for CodeGen, so it’s a single/consistent source of truth.

Yes, we could do all that work, but LLVM’s CFG today is already precise for C++. If we allow ourselves to emit diagnostics from the middle-end, we can save all that work.

Going down the high-effort path of extending the CFG and abstracting or duplicating UBSan’s checks as static analyses on that CFG would definitely provide a better diagnostic experience, but it’s worth re-examining conventional wisdom and exploring alternatives first.

The idea of analysis based on top of LLVM IR is not new and have been discussed before. My personal belief is that having access to the AST (or just code as was written by the user) is very important.

UBSan diagnostics provide a filename, line #, and column #. There is enough information here for a clang-based tool to retrieve an AST node, and generate an appropriate diagnostic. In the non-LTO case, we’d have an ASTContext ready to go at the point where we get the diagnostics back from LLVM. In the LTO case, we could serialize the diagnostics and pass them to a clang tool for processing.

It ensures we can provide precise diagnostics. It also allows us to see when users want to suppress an issue report by changing the way the source code is uttered. For example, allows to tell the developer that they can suppress with a cast.

When users of a hypothetical “-Woptimizer” need to suppress diagnostics, they could use the no_sanitize function attribute, or sanitizer blacklists. These are big hammers: if it proves to be too clumsy, we could find a way to do more fine-grained sanitizer issue suppression.

Ensuring that clang CFG completely supports C++ is a bit challenging but not insurmountable task. In addition, fixing-up the unsupported corners in the CFG would benefit not only the clang static analyzer, but all the other “users” such as clang warnings, clang-tidy, possibly even refactoring in the future.

Here is a thread where this has been discussed before:
http://clang-developers.42468.n3.nabble.com/LLVM-Dev-meeting-Slides-amp-Minutes-from-the-Static-Analyzer-BoF-td4048418.html

I’ve tried to summarize this thread:

— Slides & Minutes from the Static Analyzer BoF —

Cross Translation Unit Static Analysis: The proposed approach is to generate summaries for each TU, and to teach the analyses to emit diagnostics based on the summaries. I don’t know the current status of this, or what using summaries does to false-positive rates.

A Clang IR: Modeling things like temporaries is hard, so is dealing with language changes. One proposal is to create an alternate IR for C, then lower it to LLVM IR. The argument is that analyses would be easier to perform on the “C/C++” IR, because CodeGen/StaticAnalyzer would operate on the same structures. There are concerns raised about the complexity of this undertaking

A Rich Type System in LLVM IR: An alternate proposal is to add a rich, language-agnostic type system to LLVM IR, with pointers back to frontend AST’s. Under this proposal, I’m assuming LLVM would be responsible for generating diagnostics. There were concerns raised about having optimizations preserve the type information, about AST node lifetime issues, and about layering violations (i.e LLVM shouldn’t be dealing with frontend stuff).

ISTM that work on improving C++ support in the clang CFG can proceed in parallel to work on reporting sanitizer diagnostics from the middle-end. I’d like to know what others think about this.

Just to recap, the approach I have in mind is different from the options discussed in the thread Anna linked to. It’s advantages are that;

There is a clear path towards reporting issues that arise when LTO is enabled.

We can design it s.t there aren’t false positives, with the caveat that we will annoy users about bugs in dead code.

It’s extensible, in case we ever want to explore statically reporting diagnostics from other sanitizers.

The classes of UB bugs we’d be able to diagnose statically would always be the same as the ones we diagnose at runtime, without writing much (or any) extra code.

It’s relatively simple to implement.

The main disadvantage are that;

It constitutes a layering violation, because we’d need to teach LLVM about the names of some ubsan diagnostic handlers. IMO this isn’t too onerous, we already do this with builtins.

It’s awkward to have llvm issue diagnostics, because this is usually done by Sema/clang-tidy/the static analyzer. I think this is something we could get used to.

I’m not sure this last bulletpoint quite captures the difficulties of having diagnostics powered by optimizers. The concern is generally that this makes diagnostics unreliable and surprising to users because what would appear to be unrelated changes (especially to the degree of changing the optimization level/tweaks, or maybe changing architecture/target) cause diagnostics to appear/disappear. This is the sort of behavior that’s been seen with some of GCC’s diagnostics that are optimizer-powered, and something Clang’s done a fair bit to avoid.

Dave

vedantk · March 23, 2017, 11:47pm

>
>
>>
>> I don't think it will be feasible to generalize UBSan's knowledge to the static analyzer.
>>
>> Why not? The rough idea I meant would be to express the constraints UBSan is checking into the static analyzer - I realize the current layering (UBSan being in Clang's IRGen) doesn't make that trivial/obvious, but it seems to me that the constraints could be shared in some way - with some work.
>>
>> Maybe I am not imaginative enough, but I cannot envision a clean way to express the conditions that trigger UB that is useful for both static analysis and dynamic instrumentation. The best I can come up with is AST-level instrumentation: synthesizing AST nodes that can be translated to IR or used for analysis. That doesn't seem reasonable, so I think getting ubsan into the static analyzer would end up duplicating the knowledge of what conditions trigger UB.
>>
>> The static analyzer CFG is also at best an approximation of the real CFG, especially for C++.
>>
>
> First, the CFG is not only used by the analyzer, but it is also used by Sema (ex: unreachable code warnings and uninitialized variables).
> Second, while there are corners of C++ that are not supported, it has high fidelity otherwise.
>
>> Sure enough - and I believe some of the people working/caring about it would like to fix that. I think Manuel & Chandler have expressed the notion that the best way to do that would be to move to a world where the CFG is used for CodeGen, so it's a single/consistent source of truth.
>>
>> Yes, we could do all that work, but LLVM's CFG today is already precise for C++. If we allow ourselves to emit diagnostics from the middle-end, we can save all that work.
>>
>> Going down the high-effort path of extending the CFG and abstracting or duplicating UBSan’s checks as static analyses on that CFG would definitely provide a better diagnostic experience, but it's worth re-examining conventional wisdom and exploring alternatives first.
>
> The idea of analysis based on top of LLVM IR is not new and have been discussed before. My personal belief is that having access to the AST (or just code as was written by the user) is very important.

UBSan diagnostics provide a filename, line #, and column #. There is enough information here for a clang-based tool to retrieve an AST node, and generate an appropriate diagnostic. In the non-LTO case, we'd have an ASTContext ready to go at the point where we get the diagnostics back from LLVM. In the LTO case, we could serialize the diagnostics and pass them to a clang tool for processing.

> It ensures we can provide precise diagnostics. It also allows us to see when users want to suppress an issue report by changing the way the source code is uttered. For example, allows to tell the developer that they can suppress with a cast.

When users of a hypothetical "-Woptimizer" need to suppress diagnostics, they could use the no_sanitize function attribute, or sanitizer blacklists. These are big hammers: if it proves to be too clumsy, we could find a way to do more fine-grained sanitizer issue suppression.

> Ensuring that clang CFG completely supports C++ is a bit challenging but not insurmountable task. In addition, fixing-up the unsupported corners in the CFG would benefit not only the clang static analyzer, but all the other “users” such as clang warnings, clang-tidy, possibly even refactoring in the future.
>
> Here is a thread where this has been discussed before:
> http://clang-developers.42468.n3.nabble.com/LLVM-Dev-meeting-Slides-amp-Minutes-from-the-Static-Analyzer-BoF-td4048418.html

I've tried to summarize this thread:

--- Slides & Minutes from the Static Analyzer BoF ---

* Cross Translation Unit Static Analysis: The proposed approach is to generate summaries for each TU, and to teach the analyses to emit diagnostics based on the summaries. I don't know the current status of this, or what using summaries does to false-positive rates.

* A Clang IR: Modeling things like temporaries is hard, so is dealing with language changes. One proposal is to create an alternate IR for C, then lower it to LLVM IR. The argument is that analyses would be easier to perform on the "C/C++" IR, because CodeGen/StaticAnalyzer would operate on the same structures. There are concerns raised about the complexity of this undertaking

* A Rich Type System in LLVM IR: An alternate proposal is to add a rich, language-agnostic type system to LLVM IR, with pointers back to frontend AST's. Under this proposal, I'm assuming LLVM would be responsible for generating diagnostics. There were concerns raised about having optimizations preserve the type information, about AST node lifetime issues, and about layering violations (i.e LLVM shouldn't be dealing with frontend stuff).

---

ISTM that work on improving C++ support in the clang CFG can proceed in parallel to work on reporting sanitizer diagnostics from the middle-end. I'd like to know what others think about this.

Just to recap, the approach I have in mind is different from the options discussed in the thread Anna linked to. It's advantages are that;

* There is a clear path towards reporting issues that arise when LTO is enabled.

* We can design it s.t there aren't false positives, with the caveat that we will annoy users about bugs in dead code.

* It's extensible, in case we ever want to explore statically reporting diagnostics from other sanitizers.

* The classes of UB bugs we'd be able to diagnose statically would always be the same as the ones we diagnose at runtime, without writing much (or any) extra code.

* It's relatively simple to implement.

The main disadvantage are that;

* It constitutes a layering violation, because we'd need to teach LLVM about the names of some ubsan diagnostic handlers. IMO this isn't too onerous, we already do this with builtins.

* It's awkward to have llvm issue diagnostics, because this is usually done by Sema/clang-tidy/the static analyzer. I think this is something we could get used to.

I'm not sure this last bulletpoint quite captures the difficulties of having diagnostics powered by optimizers. The concern is generally that this makes diagnostics unreliable and surprising to users because what would appear to be unrelated changes (especially to the degree of changing the optimization level/tweaks, or maybe changing architecture/target) cause diagnostics to appear/disappear. This is the sort of behavior that's been seen with some of GCC's diagnostics that are optimizer-powered, and something Clang's done a fair bit to avoid.

That's fair, I should have examined that last point in more detail.

I think we could minimize the confusion by making it clear that the warnings are dependent on optimizations. In part, this would mean writing good documentation. We'd also need to find the right language for the warnings. Something like: "warning: foo.cpp:N:M: At the -OX optimization level, undefined behavior due to <...> was detected [-Woptimizer]". We could teach clang to emit Fixits in some cases, or even notes which explain that the code could be removed by the optimizer.

vedant

dblaikie · March 23, 2017, 11:59pm

I don’t think it will be feasible to generalize UBSan’s knowledge to the static analyzer.

Why not? The rough idea I meant would be to express the constraints UBSan is checking into the static analyzer - I realize the current layering (UBSan being in Clang’s IRGen) doesn’t make that trivial/obvious, but it seems to me that the constraints could be shared in some way - with some work.

Maybe I am not imaginative enough, but I cannot envision a clean way to express the conditions that trigger UB that is useful for both static analysis and dynamic instrumentation. The best I can come up with is AST-level instrumentation: synthesizing AST nodes that can be translated to IR or used for analysis. That doesn’t seem reasonable, so I think getting ubsan into the static analyzer would end up duplicating the knowledge of what conditions trigger UB.

The static analyzer CFG is also at best an approximation of the real CFG, especially for C++.

First, the CFG is not only used by the analyzer, but it is also used by Sema (ex: unreachable code warnings and uninitialized variables).
Second, while there are corners of C++ that are not supported, it has high fidelity otherwise.

Sure enough - and I believe some of the people working/caring about it would like to fix that. I think Manuel & Chandler have expressed the notion that the best way to do that would be to move to a world where the CFG is used for CodeGen, so it’s a single/consistent source of truth.

Yes, we could do all that work, but LLVM’s CFG today is already precise for C++. If we allow ourselves to emit diagnostics from the middle-end, we can save all that work.

Going down the high-effort path of extending the CFG and abstracting or duplicating UBSan’s checks as static analyses on that CFG would definitely provide a better diagnostic experience, but it’s worth re-examining conventional wisdom and exploring alternatives first.

The idea of analysis based on top of LLVM IR is not new and have been discussed before. My personal belief is that having access to the AST (or just code as was written by the user) is very important.

UBSan diagnostics provide a filename, line #, and column #. There is enough information here for a clang-based tool to retrieve an AST node, and generate an appropriate diagnostic. In the non-LTO case, we’d have an ASTContext ready to go at the point where we get the diagnostics back from LLVM. In the LTO case, we could serialize the diagnostics and pass them to a clang tool for processing.

It ensures we can provide precise diagnostics. It also allows us to see when users want to suppress an issue report by changing the way the source code is uttered. For example, allows to tell the developer that they can suppress with a cast.

When users of a hypothetical “-Woptimizer” need to suppress diagnostics, they could use the no_sanitize function attribute, or sanitizer blacklists. These are big hammers: if it proves to be too clumsy, we could find a way to do more fine-grained sanitizer issue suppression.

Ensuring that clang CFG completely supports C++ is a bit challenging but not insurmountable task. In addition, fixing-up the unsupported corners in the CFG would benefit not only the clang static analyzer, but all the other “users” such as clang warnings, clang-tidy, possibly even refactoring in the future.

Here is a thread where this has been discussed before:
http://clang-developers.42468.n3.nabble.com/LLVM-Dev-meeting-Slides-amp-Minutes-from-the-Static-Analyzer-BoF-td4048418.html

I’ve tried to summarize this thread:

— Slides & Minutes from the Static Analyzer BoF —

Cross Translation Unit Static Analysis: The proposed approach is to generate summaries for each TU, and to teach the analyses to emit diagnostics based on the summaries. I don’t know the current status of this, or what using summaries does to false-positive rates.

A Clang IR: Modeling things like temporaries is hard, so is dealing with language changes. One proposal is to create an alternate IR for C, then lower it to LLVM IR. The argument is that analyses would be easier to perform on the “C/C++” IR, because CodeGen/StaticAnalyzer would operate on the same structures. There are concerns raised about the complexity of this undertaking

A Rich Type System in LLVM IR: An alternate proposal is to add a rich, language-agnostic type system to LLVM IR, with pointers back to frontend AST’s. Under this proposal, I’m assuming LLVM would be responsible for generating diagnostics. There were concerns raised about having optimizations preserve the type information, about AST node lifetime issues, and about layering violations (i.e LLVM shouldn’t be dealing with frontend stuff).

ISTM that work on improving C++ support in the clang CFG can proceed in parallel to work on reporting sanitizer diagnostics from the middle-end. I’d like to know what others think about this.

Just to recap, the approach I have in mind is different from the options discussed in the thread Anna linked to. It’s advantages are that;

There is a clear path towards reporting issues that arise when LTO is enabled.

We can design it s.t there aren’t false positives, with the caveat that we will annoy users about bugs in dead code.

It’s extensible, in case we ever want to explore statically reporting diagnostics from other sanitizers.

The classes of UB bugs we’d be able to diagnose statically would always be the same as the ones we diagnose at runtime, without writing much (or any) extra code.

It’s relatively simple to implement.

The main disadvantage are that;

It constitutes a layering violation, because we’d need to teach LLVM about the names of some ubsan diagnostic handlers. IMO this isn’t too onerous, we already do this with builtins.

It’s awkward to have llvm issue diagnostics, because this is usually done by Sema/clang-tidy/the static analyzer. I think this is something we could get used to.

I’m not sure this last bulletpoint quite captures the difficulties of having diagnostics powered by optimizers. The concern is generally that this makes diagnostics unreliable and surprising to users because what would appear to be unrelated changes (especially to the degree of changing the optimization level/tweaks, or maybe changing architecture/target) cause diagnostics to appear/disappear. This is the sort of behavior that’s been seen with some of GCC’s diagnostics that are optimizer-powered, and something Clang’s done a fair bit to avoid.

That’s fair, I should have examined that last point in more detail.

I think we could minimize the confusion by making it clear that the warnings are dependent on optimizations. In part, this would mean writing good documentation. We’d also need to find the right language for the warnings. Something like: “warning: foo.cpp:N:M: At the -OX optimization level, undefined behavior due to <…> was detected [-Woptimizer]”. We could teach clang to emit Fixits in some cases, or even notes which explain that the code could be removed by the optimizer.

Fixits, notes, and other high precision source information (even in the case of a basic diagnostic - highlighting the expression range, etc) would be pretty challenging for a feature like this, btw. The source locations used by LLVM to report backend diagnostics (LLVM has some - for things like instruction selection/inline asm issues, etc - as well as the optimization report diagnostics) are from debug info, only line/col. Stitching that back to any part of the AST would be pretty ambiguous in most/many cases I’d imagine.

AnnaZaks · March 24, 2017, 12:36am

I don’t think it will be feasible to generalize UBSan’s knowledge to the static analyzer.

Why not? The rough idea I meant would be to express the constraints UBSan is checking into the static analyzer - I realize the current layering (UBSan being in Clang’s IRGen) doesn’t make that trivial/obvious, but it seems to me that the constraints could be shared in some way - with some work.

Maybe I am not imaginative enough, but I cannot envision a clean way to express the conditions that trigger UB that is useful for both static analysis and dynamic instrumentation. The best I can come up with is AST-level instrumentation: synthesizing AST nodes that can be translated to IR or used for analysis. That doesn’t seem reasonable, so I think getting ubsan into the static analyzer would end up duplicating the knowledge of what conditions trigger UB.

The static analyzer CFG is also at best an approximation of the real CFG, especially for C++.

First, the CFG is not only used by the analyzer, but it is also used by Sema (ex: unreachable code warnings and uninitialized variables).
Second, while there are corners of C++ that are not supported, it has high fidelity otherwise.

Sure enough - and I believe some of the people working/caring about it would like to fix that. I think Manuel & Chandler have expressed the notion that the best way to do that would be to move to a world where the CFG is used for CodeGen, so it’s a single/consistent source of truth.

Yes, we could do all that work, but LLVM’s CFG today is already precise for C++. If we allow ourselves to emit diagnostics from the middle-end, we can save all that work.

Going down the high-effort path of extending the CFG and abstracting or duplicating UBSan’s checks as static analyses on that CFG would definitely provide a better diagnostic experience, but it’s worth re-examining conventional wisdom and exploring alternatives first.

The idea of analysis based on top of LLVM IR is not new and have been discussed before. My personal belief is that having access to the AST (or just code as was written by the user) is very important.

UBSan diagnostics provide a filename, line #, and column #. There is enough information here for a clang-based tool to retrieve an AST node, and generate an appropriate diagnostic. In the non-LTO case, we’d have an ASTContext ready to go at the point where we get the diagnostics back from LLVM. In the LTO case, we could serialize the diagnostics and pass them to a clang tool for processing.

It ensures we can provide precise diagnostics. It also allows us to see when users want to suppress an issue report by changing the way the source code is uttered. For example, allows to tell the developer that they can suppress with a cast.

When users of a hypothetical “-Woptimizer” need to suppress diagnostics, they could use the no_sanitize function attribute, or sanitizer blacklists. These are big hammers: if it proves to be too clumsy, we could find a way to do more fine-grained sanitizer issue suppression.

Ensuring that clang CFG completely supports C++ is a bit challenging but not insurmountable task. In addition, fixing-up the unsupported corners in the CFG would benefit not only the clang static analyzer, but all the other “users” such as clang warnings, clang-tidy, possibly even refactoring in the future.

Here is a thread where this has been discussed before:
http://clang-developers.42468.n3.nabble.com/LLVM-Dev-meeting-Slides-amp-Minutes-from-the-Static-Analyzer-BoF-td4048418.html

I’ve tried to summarize this thread:

— Slides & Minutes from the Static Analyzer BoF —

Cross Translation Unit Static Analysis: The proposed approach is to generate summaries for each TU, and to teach the analyses to emit diagnostics based on the summaries. I don’t know the current status of this, or what using summaries does to false-positive rates.

A Clang IR: Modeling things like temporaries is hard, so is dealing with language changes. One proposal is to create an alternate IR for C, then lower it to LLVM IR. The argument is that analyses would be easier to perform on the “C/C++” IR, because CodeGen/StaticAnalyzer would operate on the same structures. There are concerns raised about the complexity of this undertaking

A Rich Type System in LLVM IR: An alternate proposal is to add a rich, language-agnostic type system to LLVM IR, with pointers back to frontend AST’s. Under this proposal, I’m assuming LLVM would be responsible for generating diagnostics. There were concerns raised about having optimizations preserve the type information, about AST node lifetime issues, and about layering violations (i.e LLVM shouldn’t be dealing with frontend stuff).

ISTM that work on improving C++ support in the clang CFG can proceed in parallel to work on reporting sanitizer diagnostics from the middle-end. I’d like to know what others think about this.

Just to recap, the approach I have in mind is different from the options discussed in the thread Anna linked to. It’s advantages are that;

There is a clear path towards reporting issues that arise when LTO is enabled.

We can design it s.t there aren’t false positives, with the caveat that we will annoy users about bugs in dead code.

It’s extensible, in case we ever want to explore statically reporting diagnostics from other sanitizers.

The classes of UB bugs we’d be able to diagnose statically would always be the same as the ones we diagnose at runtime, without writing much (or any) extra code.

It’s relatively simple to implement.

The main disadvantage are that;

It constitutes a layering violation, because we’d need to teach LLVM about the names of some ubsan diagnostic handlers. IMO this isn’t too onerous, we already do this with builtins.

It’s awkward to have llvm issue diagnostics, because this is usually done by Sema/clang-tidy/the static analyzer. I think this is something we could get used to.

I’m not sure this last bulletpoint quite captures the difficulties of having diagnostics powered by optimizers. The concern is generally that this makes diagnostics unreliable and surprising to users because what would appear to be unrelated changes (especially to the degree of changing the optimization level/tweaks, or maybe changing architecture/target) cause diagnostics to appear/disappear. This is the sort of behavior that’s been seen with some of GCC’s diagnostics that are optimizer-powered, and something Clang’s done a fair bit to avoid.

That’s fair, I should have examined that last point in more detail.

I think we could minimize the confusion by making it clear that the warnings are dependent on optimizations. In part, this would mean writing good documentation. We’d also need to find the right language for the warnings. Something like: “warning: foo.cpp:N:M: At the -OX optimization level, undefined behavior due to <…> was detected [-Woptimizer]”. We could teach clang to emit Fixits in some cases, or even notes which explain that the code could be removed by the optimizer.

Fixits, notes, and other high precision source information (even in the case of a basic diagnostic - highlighting the expression range, etc) would be pretty challenging for a feature like this, btw. The source locations used by LLVM to report backend diagnostics (LLVM has some - for things like instruction selection/inline asm issues, etc - as well as the optimization report diagnostics) are from debug info, only line/col. Stitching that back to any part of the AST would be pretty ambiguous in most/many cases I’d imagine.

I completely agree with all the points provided here by David. The user experience of using these will be quite bad and the behavior will be surprising to the user (ex: diagnostics starting to appear only at higher optimization levels). The proper solution here is to implement these in the existing diagnostic tools. You’d probably catch more interesting bugs that way as well since the analysis power will not be limited to the optimizer.

Eli_Friedman · March 24, 2017, 12:55am

The source locations clang uses for inline asm diagnostics are not from debug info; we serialize a clang::SourceLocation into the metadata. See . We could (and probably should) extend this to other diagnostics. Granted, that isn’t really the problem here. The static analyzer is very carefully designed to generate diagnostics which are useful; it shows the assumptions it makes, and a dynamic path through the function. With diagnostics coming from the optimizer, we have no way to explain why a function has undefined behavior: we throw away the CFG in the process of optimizing the code. -Eli

dblaikie · March 24, 2017, 2:32am

I don’t think it will be feasible to generalize UBSan’s knowledge to the static analyzer.

Why not? The rough idea I meant would be to express the constraints UBSan is checking into the static analyzer - I realize the current layering (UBSan being in Clang’s IRGen) doesn’t make that trivial/obvious, but it seems to me that the constraints could be shared in some way - with some work.

Maybe I am not imaginative enough, but I cannot envision a clean way to express the conditions that trigger UB that is useful for both static analysis and dynamic instrumentation. The best I can come up with is AST-level instrumentation: synthesizing AST nodes that can be translated to IR or used for analysis. That doesn’t seem reasonable, so I think getting ubsan into the static analyzer would end up duplicating the knowledge of what conditions trigger UB.

The static analyzer CFG is also at best an approximation of the real CFG, especially for C++.

First, the CFG is not only used by the analyzer, but it is also used by Sema (ex: unreachable code warnings and uninitialized variables).
Second, while there are corners of C++ that are not supported, it has high fidelity otherwise.

Sure enough - and I believe some of the people working/caring about it would like to fix that. I think Manuel & Chandler have expressed the notion that the best way to do that would be to move to a world where the CFG is used for CodeGen, so it’s a single/consistent source of truth.

Yes, we could do all that work, but LLVM’s CFG today is already precise for C++. If we allow ourselves to emit diagnostics from the middle-end, we can save all that work.

Going down the high-effort path of extending the CFG and abstracting or duplicating UBSan’s checks as static analyses on that CFG would definitely provide a better diagnostic experience, but it’s worth re-examining conventional wisdom and exploring alternatives first.

The idea of analysis based on top of LLVM IR is not new and have been discussed before. My personal belief is that having access to the AST (or just code as was written by the user) is very important.

UBSan diagnostics provide a filename, line #, and column #. There is enough information here for a clang-based tool to retrieve an AST node, and generate an appropriate diagnostic. In the non-LTO case, we’d have an ASTContext ready to go at the point where we get the diagnostics back from LLVM. In the LTO case, we could serialize the diagnostics and pass them to a clang tool for processing.

It ensures we can provide precise diagnostics. It also allows us to see when users want to suppress an issue report by changing the way the source code is uttered. For example, allows to tell the developer that they can suppress with a cast.

When users of a hypothetical “-Woptimizer” need to suppress diagnostics, they could use the no_sanitize function attribute, or sanitizer blacklists. These are big hammers: if it proves to be too clumsy, we could find a way to do more fine-grained sanitizer issue suppression.

Ensuring that clang CFG completely supports C++ is a bit challenging but not insurmountable task. In addition, fixing-up the unsupported corners in the CFG would benefit not only the clang static analyzer, but all the other “users” such as clang warnings, clang-tidy, possibly even refactoring in the future.

Here is a thread where this has been discussed before:
http://clang-developers.42468.n3.nabble.com/LLVM-Dev-meeting-Slides-amp-Minutes-from-the-Static-Analyzer-BoF-td4048418.html

I’ve tried to summarize this thread:

— Slides & Minutes from the Static Analyzer BoF —

Cross Translation Unit Static Analysis: The proposed approach is to generate summaries for each TU, and to teach the analyses to emit diagnostics based on the summaries. I don’t know the current status of this, or what using summaries does to false-positive rates.

A Clang IR: Modeling things like temporaries is hard, so is dealing with language changes. One proposal is to create an alternate IR for C, then lower it to LLVM IR. The argument is that analyses would be easier to perform on the “C/C++” IR, because CodeGen/StaticAnalyzer would operate on the same structures. There are concerns raised about the complexity of this undertaking

A Rich Type System in LLVM IR: An alternate proposal is to add a rich, language-agnostic type system to LLVM IR, with pointers back to frontend AST’s. Under this proposal, I’m assuming LLVM would be responsible for generating diagnostics. There were concerns raised about having optimizations preserve the type information, about AST node lifetime issues, and about layering violations (i.e LLVM shouldn’t be dealing with frontend stuff).

ISTM that work on improving C++ support in the clang CFG can proceed in parallel to work on reporting sanitizer diagnostics from the middle-end. I’d like to know what others think about this.

Just to recap, the approach I have in mind is different from the options discussed in the thread Anna linked to. It’s advantages are that;

There is a clear path towards reporting issues that arise when LTO is enabled.

We can design it s.t there aren’t false positives, with the caveat that we will annoy users about bugs in dead code.

It’s extensible, in case we ever want to explore statically reporting diagnostics from other sanitizers.

The classes of UB bugs we’d be able to diagnose statically would always be the same as the ones we diagnose at runtime, without writing much (or any) extra code.

It’s relatively simple to implement.

The main disadvantage are that;

It constitutes a layering violation, because we’d need to teach LLVM about the names of some ubsan diagnostic handlers. IMO this isn’t too onerous, we already do this with builtins.

It’s awkward to have llvm issue diagnostics, because this is usually done by Sema/clang-tidy/the static analyzer. I think this is something we could get used to.

I’m not sure this last bulletpoint quite captures the difficulties of having diagnostics powered by optimizers. The concern is generally that this makes diagnostics unreliable and surprising to users because what would appear to be unrelated changes (especially to the degree of changing the optimization level/tweaks, or maybe changing architecture/target) cause diagnostics to appear/disappear. This is the sort of behavior that’s been seen with some of GCC’s diagnostics that are optimizer-powered, and something Clang’s done a fair bit to avoid.

That’s fair, I should have examined that last point in more detail.

I think we could minimize the confusion by making it clear that the warnings are dependent on optimizations. In part, this would mean writing good documentation. We’d also need to find the right language for the warnings. Something like: “warning: foo.cpp:N:M: At the -OX optimization level, undefined behavior due to <…> was detected [-Woptimizer]”. We could teach clang to emit Fixits in some cases, or even notes which explain that the code could be removed by the optimizer.

Fixits, notes, and other high precision source information (even in the case of a basic diagnostic - highlighting the expression range, etc) would be pretty challenging for a feature like this, btw. The source locations used by LLVM to report backend diagnostics (LLVM has some - for things like instruction selection/inline asm issues, etc - as well as the optimization report diagnostics) are from debug info, only line/col. Stitching that back to any part of the AST would be pretty ambiguous in most/many cases I’d imagine.

The source locations clang uses for inline asm diagnostics are not from debug info; we serialize a clang::SourceLocation into the metadata. See http://llvm.org/docs/LangRef.html#inline-asm-metadata . We could (and probably should) extend this to other diagnostics.

Oh, right! neat.

Granted, that isn’t really the problem here. The static analyzer is very carefully designed to generate diagnostics which are useful; it shows the assumptions it makes, and a dynamic path through the function. With diagnostics coming from the optimizer, we have no way to explain why a function has undefined behavior: we throw away the CFG in the process of optimizing the code.

Fair points, all. Thanks for mentioning. (& I love that part of the Static Analyzer - walking through the sequence of execution needed to reach the problematic case, etc)

Dave

preames · March 25, 2017, 9:35pm

Ok, so the discussion in the thread so far as been pretty strong against optimizer influenced warnings. To me, this seems a bit off point. I completely and utterly agree that I don't want my -Werror builds to be influenced by the black box which is the optimizer, but that doesn't mean there isn't a utility to such diagnostics.

It seems like what we need here is a different interface to the warnings. As a strawman, what if we had the compiler produce an output file which contains a list of functions which are known to trip an ubsan assert. By combining all of these files, we can produce a binary wide list of "as built" warnings. This could be integrated into tools like the static analyzer in a couple of obvious ways: a) report them as "must be dead" code, b) highlight calls to such functions as errors, or c) use them to emphasize warnings produced through another chain of logic (i.e. "hey this actually happens!").

Such "as built" warnings seem particular useful for large projects which have to support multiple compiler versions at once. By combining a profile collected on one build with an "as built" warning file from another, you could immediate identify (some of the) the critical bugs which had to be fixed before moving to the new compiler. Worth noting is that this works for miscompiles as well as for application bugs.

Another use would be to intersect "as built" warnings list from two or more built flavors. Any warning which wasn't in all variants is one which probably needs urgent attention. ("Hey, why did my O3 build produce more "as built" warnings than my debug build?!")

Taking things in a different direction, the original idea could also be generalized a bit. Instead of focusing on functions which are post dominated by failure points, let's focus on statements which are post dominated by such. Any such statement produces a region which must be dead and can feed into all of the same uses as those above.

Philip

Topic		Replies	Views
Detect Undefined Behavior on LLVM IR Beginners	0	239	January 11, 2023
[ubsan] Add -fsanitize-warn-once, only emit runtime error once per check Clang Frontend	10	116	December 30, 2012
[release_34] [compiler_rt] UBSAN test failures on i586 Clang Frontend	4	94	December 9, 2013
[ubsan] Check recovery Clang Frontend	6	77	December 1, 2012
How ubsan inserts checking code to detect undefined behaviors? Clang Frontend	1	89	October 16, 2014

Reporting UBSan issues at compile-time

Related topics