summer of code idea — checking bounds overflow bugs

Hi,

Some days ago I am interested in detecting undefined behaviors

in C programs based on Clang. After several days’ investigation, I think

checking bounds overflow bugs is more interesting, because bounds

overflow is one of the most frequently encountered errors in C programs.

For example, performing pointer arithmetic without checking bounds

can cause bounds overflow. To increase the accuracy of finding bugs,

I want to write several passes, based on slicing, inline and summary function

/ (partial) transition function, to implement intre-procedural analysis.

Does some person have interest in the project? I need a mentor,

and wait for your reply.

Best Reagards!

Qiuping Yi

Qiuping,

Have you looked at what has already been done? I would expect that taking previous work such as this:

   http://llvm.org/pubs/2006-05-24-SAFECode-BoundsCheck.html

and integrating into current LLVM would be a better idea than starting over.

John

John Regehr wrote:

Qiuping,

Have you looked at what has already been done? I would expect that taking previous work such as this:

   Backwards-Compatible Array Bounds Checking for C with Very Low Overhead

and integrating into current LLVM would be a better idea than starting over.
  
This code is publicly available from the SAFECode project (see http://safecode.cs.illinois.edu to see how to get it). However, it has not been maintained well over the years and is currently disabled. Getting it to work again with LLVM 2.6 or replacing it with something better would be nice.

I'm writing up a response to this project idea as I'm willing to mentor it; I'll send it out shortly.

-- John T.

©ö¬îµÓ wrote:

Hi,

Some days ago I am interested in detecting undefined behaviors

in C programs based on Clang. After several days¡¦ investigation, I think

checking bounds overflow bugs is more interesting, because bounds

overflow is one of the most frequently encountered errors in C programs.

For example, performing pointer arithmetic without checking bounds

can cause bounds overflow. To increase the accuracy of finding bugs,

I want to write several passes, based on slicing, inline and summary
function

/ (partial) transition function, to implement intre-procedural analysis.

Does some person have interest in the project? I need a mentor,

and wait for your reply.

The SAFECode project (http://safecode.cs.illinois.edu) is very
interested in having a static array bounds checking analysis pass.
However, we're not interested in a static analysis tool, and we're not
interested in something that works on Clang's IR. What we want is a one
or more LLVM IR analysis passes that tells us which getelementptr (GEP)
instructions in a program are statically known not to overflow (this
allows us to eliminate run-time checks for such instructions).

Such a pass exists in the SAFECode project's source tree, but it has not
been maintained well over the years and has fallen into disuse (it also
had some engineering limitations, such as using exec() to repeatedly
execute the Omega compiler for constraint solving). Either getting that
code to work again or replacing it with something better would be
beneficial.

If you'd be willing to work on something that works with SAFECode, I'd
be willing to mentor your project. I do, however, have one condition:
you need to find a specific static array bounds checking algorithm and
understand how it works. I don't see a specific algorithm or paper
reference above.

A good starting point for algorithms might be Dinakar's paper (see
Section 5): Memory Safety Without Garbage Collection for Embedded Applications. There has
also been talk of implementing the ABCD algorithm in LLVM
(http://portal.acm.org/citation.cfm?id=349342); you may want to read
about that algorithm as well.

As an aside, I've written a pass that finds the inter-procedural static
backwards slice of an LLVM value (disclaimer: it uses DSA to find the
targets of indirect function calls). I'm pretty sure we could release it
to the public if you find that you need it for your project.

-- John T.

John-- a couple questions:

Can you explain the SAFECode model in a bit more detail? I am getting conflicting information. On one hand, some of the papers describe a system that is primarily designed to hide safety violations. On the other hand, the 2006 ICSE paper that I cited earlier today seems to be talking about catching violations. These are very different goals! What does the code in the SAFECode repository actually do?

Can you comment on the speed of LLVM when shelling out to Omega? My guess would be that this will result in unacceptable compile times for large software, and that something fast and relatively simple like ABCD is a better choice for general usage.

Finally a comment: it's a clear that a comprehensive system for trapping undefined behavior in Clang is a multi-year project. Some parts of this must live in Clang. Some parts, such as bounds check optimizations, should go into LLVM passes. Anyway I'm just saying that the project you outlines seems to fit very well into the overall vision of detecting undefined behavior in C programs.

John

Sounds an good idea, is that means lowerinng down the SAFECode project
from the higher level(clang)to lower level for an more general work on
bound check? I aslo want to know is it possoble to detecting memory
leak at the very low(llvm ir) level to detecting memory leaks? Or at
llvm ir level to providing an stackfull hooks? It's very useful to
have such an feature. The stack hooks can help us to print extra stack
info in the exec period without modify the original code, to help us
to find bugs easier:)

I think for memory leaks, we just need to modify the internal method
llvm.malloc, add hooks to this method

For stack hooks, we can modify the calling conventions:)

罗勇刚(Yonggang Luo) wrote:

Sounds an good idea, is that means lowerinng down the SAFECode project
from the higher level(clang)to lower level for an more general work on
bound check?

SAFECode has always worked on the LLVM IR.

What I am saying is that my preference is to have LLVM passes that do static array bounds checking instead of Clang passes that do static array bounds checking. The problem that I see with implementing static array bounds checking in Clang is that it benefits only languages utilizing Clang's libraries. That means that VMKit, llvm-gcc/g++, and other potential frontends can't benefit from it. SAFECode won't derive any benefit except when it is used in conjunction with Clang. That's okay but not ideal.

Also, SAFECode, being a set of LLVM passes, uses LLVM passes better than Clang passes. If static array bounds checking were implemented in Clang, then a Clang-based transform would need to insert information into the LLVM IR to communicate to SAFECode which GEP instructions stayed within bounds. If static array bounds checking is implemented as an LLVM pass, then SAFECode will just need to add it as a prerequisite and query the results.

Now, having said that, static array bounds checking in Clang is probably a very good thing for the Clang static analyzer, and having strong static analysis tools for finding bugs is a good thing, so if anyone wants to build static array bounds checking for Clang, go for it. However, I can't mentor such a project (I have no experience with Clang analyses), and it won't benefit my project (SAFECode) very easily.

I aslo want to know is it possoble to detecting memory
leak at the very low(llvm ir) level to detecting memory leaks?

I don't see why not. I believe Valgrind does it on assembly code; you could probably build an LLVM transform that does what Valgrind does but does it more efficiently (primarily because using LLVM as a static compiler removes the dynamic binary translation overhead).

Or at
llvm ir level to providing an stackfull hooks? It's very useful to
have such an feature. The stack hooks can help us to print extra stack
info in the exec period without modify the original code, to help us
to find bugs easier:)
  
I'm not sure what you mean here. Can you clarify?

-- John T.

P.S. I use the term "Clang IR" to mean whatever data structures Clang uses to represent code. I believe it uses Abstract Syntax Trees (ASTs). Perhaps I should have said ASTs...

John-- a couple questions:

Can you explain the SAFECode model in a bit more detail? I am getting
conflicting information. On one hand, some of the papers describe a
system that is primarily designed to hide safety violations. On the other
hand, the 2006 ICSE paper that I cited earlier today seems to be talking
about catching violations.

SAFECode has both capabilities. The complete array bounds checking described in the ICSE paper is an option. So are the complete dangling pointer checks described in a DSN 2006 paper (that one has higher overhead for many programs). The guarantees in the PLDI 2006 paper -- partial type safety, control flow integrity, and a sound operational semantics -- hold regardless of whether those two options are enabled or not.

The code in the public repository defaults to a debugging version that (a) turns on complete bounds checking; (b) I think turns on dangling pointer checks; but (c) does not use Automatic Pool Allocation, so it does not guarantee the soundness of pointer analysis. It also does much more information tracking to give useful feedback on bugs. Of course, it has much higher overhead than if we had made other choices, e.g., for use in production code.

--Vikram

Adve, Vikram Sadanand wrote:

John-- a couple questions:

Can you explain the SAFECode model in a bit more detail? I am getting conflicting information. On one hand, some of the papers describe a system that is primarily designed to hide safety violations. On the other hand, the 2006 ICSE paper that I cited earlier today seems to be talking about catching violations.
    
SAFECode has both capabilities. The complete array bounds checking described in the ICSE paper is an option. So are the complete dangling pointer checks described in a DSN 2006 paper (that one has higher overhead for many programs). The guarantees in the PLDI 2006 paper -- partial type safety, control flow integrity, and a sound operational semantics -- hold regardless of whether those two options are enabled or not.

The code in the public repository defaults to a debugging version that (a) turns on complete bounds checking; (b) I think turns on dangling pointer checks;

Dangling pointer detection (DSN 2006) is disabled by default. A command line option in the sc tool will enable it.

but (c) does not use Automatic Pool Allocation, so it does not guarantee the soundness of pointer analysis.

I have comments to John's other questions below.

  It also does much more information tracking to give useful feedback on bugs. Of course, it has much higher overhead than if we had made other choices, e.g., for use in production code.

--Vikram

These are very different goals! What does the code in the SAFECode repository actually do?

Can you comment on the speed of LLVM when shelling out to Omega? My guess would be that this will result in unacceptable compile times for large software, and that something fast and relatively simple like ABCD is a better choice for general usage.
    
I don't have numbers, but it was unusable for doing static array bounds checking on the Linux kernel during the Secure Virtual Architecture (SVA) work. There might be performance numbers in one of the SAFECode publications.

The immediate engineering problem with using the Omega constraint solver is that SAFECode does multiple fork() and exec() calls to execute a separate binary program to do constraint solving; this occurs multiple times during the static array bounds checking pass. Performance could be greatly improved by using the Omega code as a library and linking it straight into the static array bounds checking pass. The overheads of constraint generation and solving would still be there, but calling into the Omega code would be much faster.

Regarding the use of alternative algorithms, there are new, simple static array bounds checking passes within SAFECode that utilize analysis group chaining. When one analysis cannot determine that a GEP remains within bounds, it can ask a more powerful pass to return a result. This could potentially be used to reduce the use of the Omega method to just those GEPs that cannot be proven safe using simpler methods.

Finally a comment: it's a clear that a comprehensive system for trapping undefined behavior in Clang is a multi-year project. Some parts of this must live in Clang. Some parts, such as bounds check optimizations, should go into LLVM passes. Anyway I'm just saying that the project you outlines seems to fit very well into the overall vision of detecting undefined behavior in C programs.
    
Can you elaborate on the undefined behaviors you have in mind? SAFECode (when its Automatic Pool Allocation option is working) should provide sound operational semantics, so all undefined behavior due to memory safety should be gone. I'm sure there's other undefined behaviors (e.g., divide by zero), but I haven't thought much about them.

I'm curious why you think some undefined behavior detectors need to be built in Clang. It seems to me that any static analysis could be built using either LLVM or Clang; there are just tradeoffs to each approach. What advantages does Clang provide?

-- John T.

SAFECode has both capabilities. The complete array bounds checking described in the ICSE paper is an option.

Thanks Vikram, this is great to hear.

This would be super useful as a part of mainline LLVM. I'm happy to help with testing and mentoring for any effort to get this integrated.

pointer checks; but (c) does not use Automatic Pool Allocation, so it does not guarantee the soundness of pointer analysis.

As far as I'm concerned, unsound analysis is only a problem if it may lead to erroneous compilation of conforming programs.

John

John,

Thanks for the detailed response.

Regarding the use of alternative algorithms, there are new, simple static array bounds checking passes within SAFECode that utilize analysis group chaining. When one analysis cannot determine that a GEP remains within bounds, it can ask a more powerful pass to return a result. This could potentially be used to reduce the use of the Omega method to just those GEPs that cannot be proven safe using simpler methods.

I think this is the right engineering choice, with a command line option to simply disable the heavyweight decision procedure for people who want faster compile times.

I'm curious why you think some undefined behavior detectors need to be built in Clang. It seems to me that any static analysis could be built using either LLVM or Clang; there are just tradeoffs to each approach. What advantages does Clang provide?

Some checks must live in Clang because too much information has been lost by the time LLVM sees the code. There are many examples but here is the canonical one. A program has undefined behavior if "between two sequence points, an object is modified more than once, or is modified and the prior value is read other than to determine the value to be stored."

To implement this check in LLVM, we would have to answer the question: Where, in the LLVM code, are the sequence points?

John

Some checks must live in Clang because too much information has been lost
by the time LLVM sees the code. There are many examples but here is the
canonical one. A program has undefined behavior if "between two sequence
points, an object is modified more than once, or is modified and the prior
value is read other than to determine the value to be stored."

To implement this check in LLVM, we would have to answer the question:
Where, in the LLVM code, are the sequence points?

By the way I can hear readers of this list saying to themselves "this does not seem like a useful check to implement." Perhaps this is correct, but let's consider the tradeoffs:

- This is a relatively simple, localized check that should not be too hard to implement.

- Almost all of the added checks would be destroyed by LLVM after simple queries to the alias analyzer, so applications running with this check turned on will not slow down much.

- Common optimizing compilers change the meaning of a computation that makes this mistake.

My guess is that this check would find problems in real apps...

John

John Regehr wrote:

Some checks must live in Clang because too much information has been lost
by the time LLVM sees the code. There are many examples but here is the
canonical one. A program has undefined behavior if "between two sequence
points, an object is modified more than once, or is modified and the prior
value is read other than to determine the value to be stored."

To implement this check in LLVM, we would have to answer the question:
Where, in the LLVM code, are the sequence points?
    
By the way I can hear readers of this list saying to themselves "this does not seem like a useful check to implement." Perhaps this is correct, but let's consider the tradeoffs:

- This is a relatively simple, localized check that should not be too hard to implement.

- Almost all of the added checks would be destroyed by LLVM after simple queries to the alias analyzer, so applications running with this check turned on will not slow down much.
  
I'm not sure if the above is true. For example, consider the code:

void foo (int * a, int * b) {
    *a = *b++;
}

void bar (int a) {
    foo (&a, &a);
}

I think this is undefined behavior in foo() (two writes within a set of sequence points), but it will take inter-procedural alias analysis to determine whether the check can be dropped.

Is this correct, or am I missing something?

- Common optimizing compilers change the meaning of a computation that makes this mistake.

My guess is that this check would find problems in real apps...
  
That would be an interesting experiment.

-- John T.

<snip>

I'm curious why you think some undefined behavior detectors need to be built
in Clang. It seems to me that any static analysis could be built using
either LLVM or Clang; there are just tradeoffs to each approach. What
advantages does Clang provide?

Some checks must live in Clang because too much information has been lost
by the time LLVM sees the code. There are many examples but here is the
canonical one. A program has undefined behavior if "between two sequence
points, an object is modified more than once, or is modified and the prior
value is read other than to determine the value to be stored."

I agree. There are a number of such semantic rules that must be checked in the front end, another common example being type checking rules for types that are lowered down to the IR (e.g., all the class related rules in C++). SAFECode as it currently stands doesn't try to address such properties.

To implement this check in LLVM, we would have to answer the question:
Where, in the LLVM code, are the sequence points?

John

--Vikram

- Almost all of the added checks would be destroyed by LLVM after simple queries to the alias analyzer, so applications running with this check turned on will not slow down much.

I'm not sure if the above is true. For example, consider the code:

void foo (int * a, int * b) {
  *a = *b++;
}

void bar (int a) {
  foo (&a, &a);
}

I think this is undefined behavior in foo() (two writes within a set of sequence points), but it will take inter-procedural alias analysis to determine whether the check can be dropped.

Is this correct, or am I missing something?

You're right, I was imagining that this kind of code would be inlined...

John

2010-04-01

Did you mean implementing a new static array bouns checking algorithm
with SAFECode is a better idea than that with LLVM?
I am not sure whether it's feasible to finish it within a summer, under the
condition that I have little knowledge of SAFECode project.

SAFECode is built as a set of LLVM passes, so there is no difference. E.g., Dinakar's earlier work that John mentioned is an LLVM pass used by SAFECode.

--Vikram
Associate Professor, Computer Science
University of Illinois at Urbana-Champaign
http://llvm.org/~vadve

Hi, John Criswell!

You have said to me that SAFECode had not been maintained for several years,
now I have submitted my proposal for updating the SAFCode project to the new LLVM APIs.
If you are still interested in the topic and willing to guid my project, I will be very happy.
Now I’m waiting for you comments.

Here is my proposal:
http://socghop.appspot.com/gsoc/student_proposal/show/google/gsoc2010/easyqiu/t127038894856

2010-04-07

yiqiuping1986 wrote:

Hi£¬ John Criswell£¡
You have said to me that SAFECode had not been maintained for several
years,

Just to clarify, SAFECode *has* been and *is* maintained (primarily by
me). The release_26 branch in the SVN repository works with LLVM 2.6,
and mainline is working (with some regressions) with the upcoming LLVM
2.7. You can subscribe to the SVA Commits mailing list
(http://lists.cs.uiuc.edu/mailman/listinfo/sva-commits) to get email
whenever a commit is made to SAFECode.

However, there is an inter-procedural static array bounds checking pass
that comprises a small component of SAFECode. It is an optional
component, so SAFECode works without it, but SAFECode would work better
if this static array bounds checking pass were working.

This inter-procedural static array bounds checking pass in SAFECode has
not been maintained and needs to 1) be updated to work with LLVM
2.6/2.7; 2) have its bugs fixed; and 3) improved so that it has better
performance (i.e., no repeated fork()/exec() for constraint solving).

now I have submitted my proposal for updating the SAFCode project to
the new LLVM APIs.
If you are still interested in the topic and willing to guid my
project, I will be very happy.
Now I'm waiting for you comments.
Here is my proposal:
http://socghop.appspot.com/gsoc/student_proposal/show/google/gsoc2010/easyqiu/t127038894856

Looking back at my previous email, I see that I was unclear. SAFECode,
as a whole, is up-to-date with LLVM. However, it is missing its static
array bounds checking pass (because that specific pass has not been
maintained). What I was suggesting is that, as a GSoC project, you could
update the static array bounds checking code from the SAFECode project.

With that in mind, here is what I think you should do:

1) Update your "Check bounds overflow bugs in C programs based on LLVM"
proposal
(http://socghop.appspot.com/gsoc/student_proposal/show/google/gsoc2010/easyqiu/t126993884556).
In it, state that
a) you are specifically interested in implementing *static* array bounds
checking
b) there is old code in the SAFECode project which you may reuse to
build your static array bounds checking pass
c) you will investigate whether any of the static array bounds checking
code in SAFECode is useful. If it is, you can reuse it. Otherwise, you
will develop something new.
d) remove the comment about Edwin Torok's pass being intra-procedural.
He has told us (or at least me) that his analysis is inter-procedural.
e) you can list me as a potential mentor, if you like.

2) Delete the "Update the SAFECode project to the new LLVM API"
proposal. SAFECode, as a whole, works with LLVM 2.6/2.7. Reusing the old
static array bounds checking pass from SAFECode can be merged into your
bounds checking proposal (as I stated above).

3) In all of your proposals, you should state why the mentor would be a
good mentor. In my case, you can state that I am a past LLVM contributor
and that I am the primary maintainer of the SAFECode project (which uses
LLVM extensively).

4) In your ABCD proposal, did you copy the text describing the ABCD
algorithm from the ABCD paper? If so, don't do that; describe the
algorithm in your own words (even if your English isn't perfect). It
will show that you understand the algorithm and prevent you from being
accused of plagiarism.

5) Cite sources. If your proposal states that you are going to implement
an algorithm described in a paper, specify exactly which paper it is.
That way, the proposal reviewer can look at the paper.

-- John T.

Hi, John Criswell!

Thank you for your detailed and fast reply.
I will update my proposal following your suggestion as soon as possible.

Best Regards!
Qiuping

2010-04-08