Summer of Code idea -- detecting undefined behavior (fwd)

Oops-- just sent the message below to the main LLVM list, it probably would have been better sent to this list in the first place.

I second this as a great project idea.

- Daniel

I would be interested in helping with this project.
I have a couple ideas along these lines as well.
However, this would be a (very) part-time thing for me; since am not a student, and have to work full-time.

-- Marshall

I'll start a wiki page about this...

I can't find a clang-specific wiki anywhere. So I assume a page under "proposed projects" at the LLVM wiki would be the most appropriate location?

John

I'll start a wiki page about this...

I can't find a clang-specific wiki anywhere. So I assume a page under "proposed projects" at the LLVM wiki would be the most appropriate location?

If you want commit access, please send me the info requested in the llvm developer policy, then you can commit directly to the open projects page. I also agree that this would be a great SoC project,

-Chris

I found this idea very interesting, but I don’t have any background on compilers. I know just some basic things and started looking at Clang just some days ago (read the tutorial to get the general idea. I was more focused on llvm). So, I don’t think I’d be a very helpful at SoC, but I’m definitely interested.
If some one can give some pointers (directions, not 0xdeadbeef), I’d very thankful.

Hi,
like Conrado I also wonder about the requirements one would have to meet to
make this project a success.
Is a basic understanding of compilers sufficient?
I imagine such detection mechanisms would be difficult to isolate from other
code generation behavior, so I think it would be quite a challenge to avoid
degrading the design of the current code generation module, or even breaking
working behavior, wouldn't it?

Leopold Walkling

Fun motivation for detecting undefined behavior, direct from Google:

   245 - nativeclient - Native code for web apps - Monorail

John

I've created a page in the LLVM Wiki about this topic:

   http://wiki.llvm.org/Detecting_undefined_behavior_in_Clang

John

Leo, my guess would be that the easy undefined behaviors can be tackled with only a basic knowledge of compilers. Some of these will have large payoff.

The hard undefined behaviors will require a great deal of intimacy with the C standard and with Clang/LLVM. My favorite example is that a program is undefined if "Between two sequence points, an object is modified more than once, or is modified and the prior value is read other than to determine the value to be stored." Yuck.

John

I suspect some of these tests might incur a significant performance penalty
if run as part of compilation. Simple and easy tests are good, but anything
that is too expensive might make more sense to move into the Clang Static
Analyzer instead.

That said, some work from the CSA might be usable here. For example, I
believe that the CSA already detects divide-by-zero. While a full analysis
of the source code would be overkill for the compiler to do every time,
adding the ability to detect divide-by-zero from literals:

int a = 5; int b = 5/0;

or using constant propagation:

int a = 5; int b = 0; int c = a/b;

would certainly be useful and probably relatively easy.

Chris Hacking

Behalf Of John Regehr

Hopefully the new C++ standard's wording will be easier to apply, as
apparently "sequence points" are gone[1], which seems right since I
can't find a mention of them in n3000.

[1] C++ Team Blog

One minor piece of feedback on the list of undefined behavior: it is probably worth removing pieces of this that are implementation defined (e.g. how token pasting works), and pieces that do not manifest as undefined behavior at runtime.

-Chris

Assuming the static analyzer doesn't actually produce executables (why
would it, it's static), it wouldn't be useful here, since finding
undefined behavior in general, is a runtime thing, like:

int i;
cin >> i;
int x = 5 / i;

static analysis can't detect this case of divide by zero. It could
tell that i is runtime defined and not proven to be zero, therefore it
should be tested for zero before used in the denominator of a divide.

That said, you could eliminate checks for cases where it's proven that
undefined behavior could not occur, based on static analysis, which
would result in the resulting program running faster.

Ah, my bad - I have static analysis on the brain and didn't realize you were talking about runtime checks. Those are certainly very handy. It would be interesting to know what the difference in binary size and runtime performance is with those checks.

The hybrid approach (dynamic checks when your static analysis can't tell for sure) would be great, but I'm not sure how one would put that together. Adding the ability to include runtime checks is probably a more important initial step.

I’m interested on this idea, and sure this would be a great project.
I have ever developed a static analyzer just like meta-compilation, from Dawson Engler, Stanford.
But i’m not very familiar with clang
maybe we could do sth. like mygcc http://mygcc.free.fr/
I want be involved in this project if started.

BTW: I ever read Prof. Regehr’s papar about static analysis for interrupt-driven software. In fact, we’re now researching on data race detection for interrupt-driven software. Could you please give me some advice in future?

To be precise: Every behavior in this list is genuine undefined behavior, not implementation-defined behavior. I know this because I took the list directly from the C standard :). Also, every kind of undefined behavior can manifest at runtime, because undefined behavior is undefined.

But of course I take yor point. I've split the list into two parts: behaviors that seem to require runtime checks, and behaviors that do not.

I don't want to remove any behaviors from the list because every behavior undefined by C should be checked somewhere. Token pasting problems should be checked in the frontend, for example. If the ideal case of a compiler that fully checks every item on the list remains unrealized, then fine, but I still like the list.

Next steps include:

- dividing the undefined behaviors into groups that share an implementation (FP-related, file-related, signal-related, memory safety, etc.)

- prioritizing these groups

I appreciate any help.

John