Google Summer of Code proposal: Adding memory safety checks to the LLVM bitcodes

Dear LLVMers,

My name is Raphael Ernani, and I am doing my MsC at the Federal
University of Minas Gerais, Brazil. I have been using LLVM for a
while, and I would like to participate in this year’s Summer of Code.
One particular idea, in your “open projects” page caught my eye, and I
decided to write a proposal about it. The line that I liked in the
page was “Create an LLVM pass that adds memory safety checks to code,
like Valgrind does for binaries, or like mudflap does for gcc compiled
code.”, and my proposal is written below:

Hi Raphael!

I haven’t read your proposal in detail, but there are actually three projects that implement memory safety by transforming LLVM bitcode:

  1. ASAN is integrated into LLVM. It is meant as a low-overhead debugging tool in the spirit of Valgrind. It’s primary technique is to place guard memory in between memory objects and check loads and stores to see if they return a poisoned bit-patter from a guard. It isn’t sound (certain dangling pointer errors and array bound violations can get past it), but it’s easy to implement.

  2. SAFECode (my project: ) aims to provide sound memory safety checks and is a sub-project of LLVM. It inserts function pointer checks, precise structure and array bounds checks, load/store checks, and checks on C library function calls into code. It is also designed to work with automatic pool allocation to provide sound points-to analysis and partial type-safety, but these features are currently disabled because they are not sufficiently robust. While initially designed to protect applications during production, SAFECode has a generic pass to add debug information to its run-time checks, essentially make it a valgrind replacement like ASAN. 3) SoftBound and its CETS extension () have been integrated into the SAFECode compiler and can be enabled with options in SAFECode’s clang. It provides array bounds checking and, with CETS, optional dangling pointer detection. While I don’t think we need another memory safety checker for LLVM, I do think that there’s a lot of interesting projects in making each of these tools better. Some ideas that come to mind: o Creating common instrumentation passes for ASAN, SAFECode, and SoftBound to permit code reuse. All three essentially place checks on loads, stores, and atomic operations. If they use the same run-time check names, then the same optimizations can improve the speed of all three. o Creating or improving optimizations for memory safety checks. Static array bounds checking alone provides numerous options, such as implementing ABCD, using value-range analysis (mentioned in another GSoC post), investigating SMT solver-based techniques, etc, etc. Other optimizations include hoisting checks out of loops, polishing and re-enabling the SAFECode type-safety optimization, and removing redundant load/store checks. o Polishing and enabling the Baggy Bounds Checking (BBC) run-time. This run-time can speed up the run-time checks and should be faster than the splay-tree approach that SAFECode currently uses. I’ll go ahead and create a list of open projects on the SAFECode web page (), and I’ll remove the “Valgrind” tool project off the open projects page as there are now several memory safety tools built using LLVM. If you want my opinion, I think the static array bounds checker or the monotonic loop optimization make nice, self-contained projects. Finally, you might be interested in the Memory Safety Menagerie (). This web page contains a whole list of papers on the subject of memory safety transforms. – John T.

Dear All,

Okay, the SAFECode open projects page is done: http://sva.cs.illinois.edu/projects.html.

-- John T.

Dear John,

in the last e\-mail we'd exchanged I forgot to CC the llvmdev list,

so I am copying the e-mail back to the list.

One thing I should point out is that SAFECode and SoftBound (ideally) should be using a static array bounds checking algorithm that is sound with respect to integer overflow. For example, x < x+5 is always true if x is an integer, but it's not always true if x is a 32-bit two's complement number; additional constraints on x are needed to prove that x < x+5 is true.

Does your analysis take two's complement overflow/underflow into account?

No. The less-than lattice only reports if a variable is less than
another, assuming that your lattice is {-inf, ... -2, -1, 0, 1, 2, ...
+inf}. But I know of another lattice that takes overflows into
consideration. It is called the Pentagon domain, and it was published
by Logozzo and Fahndrich in the paper [1]:

[1] Francesco Logozzo, Manuel Fähndrich: Pentagons: A weakly
relational abstract domain for the efficient validation of array
accesses. Sci. Comput. Program. 75(9): 796-807 (2010)

Pentagons combine the less-than lattice with the interval lattice to
have an analysis that is sound in terms of wrapping arithmetics. We
have an implementation of range analysis here at UFMG. It also uses
the lattice {-inf, ... -1, 0, 1, ... +inf}, but it is safe to use
limits that are bound by values different than infinity, e.g., [10,
100], for instance. On the other hand, [10, +inf] could wrap around,
thus compromising the lower bound, and it is not safe to use it.

Just to make sure I understand: you're using abstract interpretation to map SSA values to elements in some lattice. In your standard value range analysis, you're assigning concrete values ranges (e.g., (4,10)) to each SSA value. In the "less-than" domain, you're mapping each LLVM SSA value to a set of SSA values which are greater than or equal to the SSA value (e.g., x -> [a, y, z] means that x <=a and x <= y and x <=z). Is this correct?

Yes!

Interesting. One problem with on-demand analysis (at least the way it's done in LLVM for things like alias analysis) is that it gets tricky for inter-procedural analyses.

With the less-than domain, I suppose what you could do is have the "user" of the analysis first specify which variable relationships they want to query. With that information, your 'less-than' analysis runs as usual with the exception that it ignores variables not queried by the "user", thus making the sets of variables smaller. The run-time is still cubic, but the input size is smaller.

Yes: that is the idea of doing it on demand. And the analysis must be
context-sensitive too. Venet and Brat had discussed it before in the
paper [2]. One way to get a bit of context sensitiveness is with
function inlining, that LLVM already does for us.

[2] Precise and Efficient Static Array Bound Checking for Large
Embedded C Programs, Arnaud Venet and Guillaume Brat.

Thus, to summary my GSoC project, I would implement an array bounds
check elimination engine, after the paper [1], using the Pentagon
domain, which combines the less-than lattice with the interval
lattice. The analysis would be context sensitive. I think to do all of
it in three months is too much, but I could implement the pentagon
domain, and then use function inlining to get some sort of context
sensitiveness. I would pursue this project further, as my Master
dissertation. Then I would try to implement true context
sensitiveness.

What do you think? Could I give it a try and rewrite my initial proposal?

By the way, I've been following your discussion regarding Victor's
proposal. We worked together in the implementation of the range
analysis engine.

regards,

Raphael.

Dear LLVMers,

My name is Raphael Ernani, and I am doing my MsC at the Federal
University of Minas Gerais, Brazil. I have been using LLVM for a
while, and I would like to participate in this year’s Summer of Code.
One particular idea, in your “open projects” page caught my eye, and I
decided to write a proposal about it. The line that I liked in the
page was “Create an LLVM pass that adds memory safety checks to code,
like Valgrind does for binaries, or like mudflap does for gcc compiled
code.”, and my proposal is written below:

Hi Raphael!

I haven’t read your proposal in detail, but there are actually three projects that implement memory safety by transforming LLVM bitcode:

  1. ASAN is integrated into LLVM. It is meant as a low-overhead debugging tool in the spirit of Valgrind. It’s primary technique is to place guard memory in between memory objects and check loads and stores to see if they return a poisoned bit-patter from a guard. It isn’t sound (certain dangling pointer errors and array bound violations can get past it), but it’s easy to implement.

  2. SAFECode (my project: http://safecode.cs.illinois.edu) aims to provide sound memory safety checks and is a sub-project of LLVM. It inserts function pointer checks, precise structure and array bounds checks, load/store checks, and checks on C library function calls into code. It is also designed to work with automatic pool allocation to provide sound points-to analysis and partial type-safety, but these features are currently disabled because they are not sufficiently robust. While initially designed to protect applications during production, SAFECode has a generic pass to add debug information to its run-time checks, essentially make it a valgrind replacement like ASAN.

  3. SoftBound and its CETS extension (http://www.cis.upenn.edu/acg/softbound) have been integrated into the SAFECode compiler and can be enabled with options in SAFECode’s clang. It provides array bounds checking and, with CETS, optional dangling pointer detection.

While I don’t think we need another memory safety checker for LLVM, I do think that there’s a lot of interesting projects in making each of these tools better. Some ideas that come to mind:

o Creating common instrumentation passes for ASAN, SAFECode, and SoftBound to permit code reuse. All three essentially place checks on loads, stores, and atomic operations. If they use the same run-time check names, then the same optimizations can improve the speed of all three.

o Creating or improving optimizations for memory safety checks. Static array bounds checking alone provides numerous options, such as implementing ABCD, using value-range analysis (mentioned in another GSoC post), investigating SMT solver-based techniques, etc, etc. Other optimizations include hoisting checks out of loops, polishing and re-enabling the SAFECode type-safety optimization, and removing redundant load/store checks.

“removing redundant load/store checks” could be quite useful for both
AddressSanitizer (http://clang.llvm.org/docs/AddressSanitizer.html) and
ThreadSanitizer (http://code.google.com/p/data-race-test/wiki/ThreadSanitizer2).
Some basic examples here: http://code.google.com/p/address-sanitizer/wiki/CompileTimeOptimizations
I am looking into having some simple and human verifiable analysis with lots of unit tests,
rather than something heavyweight and exhaustive.

–kcc

Dear LLVMers,

I wrote a new proposal, to improve the static array bounds checking in
SAFEcode, as follows:

Improving static array bounds checking in SAFEcode

Dear LLVMers,

I wrote a new proposal, to improve the static array bounds checking in
SAFEcode, as follows:

I haven't had time to read over this carefully; hopefully I can do that tomorrow. However, one quick comment is that SAFECode already has a pass that instruments GEPs (the array indexing instruction in LLVM) with run-time checks. You can see the code at the URL below (or you can check out SAFECode and examine the source code):

http://llvm.org/viewvc/llvm-project/safecode/trunk/lib/InsertPoolChecks/GEPChecks.cpp?revision=136382&view=markup

SAFECode is moving to a design in which the Clang front-end inserts run-time checks and then later passes (including passes within libLTO) optimize those checks away. It makes the code easier to maintain and understand, and it makes experiments easier.

Since your analysis will need to be run within libLTO *after* SAFECode has inserted its run-time checks, I recommend that you write a pass that *removes* run-time checks using your analysis.

I think you should read the (very early draft of the) SAFECode Software Architecture Manual located within the SAFECode source tree (you'll need LaTeX to compile it to a PDF file). It describes the source code layout and provides an overview of how the analysis and instrumentation passes are integrated into Clang/libLTO.

-- John T.

Dear LLVMers,

I wrote a new proposal, to improve the static array bounds checking in
SAFEcode, as follows:

Improving static array bounds checking in SAFEcode

Objective
---------

the main objective of this project is to improve the static array
bounds checking engine used in SAFECode. It was written after the open
project athttp://safecode.cs.illinois.edu/projects.html, which reads
as follows:

"Improve static array bounds checking: SAFECode used to have an
inter-procedural static array bounds checking pass, but it suffered
serious scalability issues and was eventually disabled. Improving the
old code or writing a new pass would be extremely beneficial."

I plan to build a new array bounds checking algorithm from scratch.

I think what you want to say here is that you're going to build a new static array bounds checking pass for SAFECode/LLVM but that the code may also be useful for LLVM sub-projects like VMKit and DragonEgg's Ada support.

As far as organization goes, think you should provide the background section first and then describe how you will solve the problem.

How it will be done
-------------------

I will build an array bounds checking algorithm based on the ABCD
algorithm that was described by Bodik in the paper [3]. The key
insight in the ABCD was to use a lattice that some researchers, e.g.,
[2], call the "less-than" lattice. Instead of determining precise
ranges for the variables using for instance, the interval lattice,
Bodik's less-than lattice finds, for each variable v, the set of other
variables that are less-than it.
    It is a requirement that the algorithm must be relient against the

Replace reliant with resilient.

wrapping semantics of 2-complement's arithmetic. The original
algorithm proposed by Bodik is not. However, it is possible to limit
Bodik's algorithm in such a way that only overflow resilient
equalities are taken into consideration. In other words, if the source
program contains a comparison such as:

if (x< y) {
  // p1
} else {
  // p2
}

   then we are guaranteed that inside p1 we have that x< y.

This example is not clear, does not demonstrate integer overflow, and is not related to memory safety. I would use an example like this:

p = malloc ((x * sizeof (struct foo)) + 1);
p[x] = ...

It is not necessarily the case that 0 <= x <= ((x * sizeof (struct foo)) + 1). Therefore, this array access, which looks safe at first glance, may not be safe.

    I propose an analysis that, contrary to the original ABCD

s/contrary to/unlike

algorithm is inter-procedural. To obtain context sensitiveness, which

s/sensitiveness/sensitivity

is very important, I will initially use function inlining, which is
available in LLVM. If time allows it, I will implement true context

If time permits

sensitiveness using an "on-demand" approach on the call graph of
functions. Context sensitiveness is remarkably important in this type
of domain because it is necessary to match array sizes and identifiers
across function calls in order to have a precise analysis, as
described in the paper [1].
    I do not plan to couple my analysis with the interval domain, but,
if the LLVM community judge that it should be done, than I can
implement my algorithm after the pentagon domain, proposed by Logozzo
and Fahndrich in 2010 [2]. The pentagon domain is a different approach
to get an analysis that is overflow resilient. It combines the
interval domain {-inf, ..., -1, 0, 1, ... +inf} with the "less-than"
domain. If some variable is bound to an interval like [l, u], where l
-inf and u< +inf, then we know that operations that add up to
MAX_INT - u to it will not produce overflow. Same for the lower limit.
I am reticent to use the pentagon domain because Logozzo et al report
some relatively long runtimes (in the order of minutes) and so this
approach might not attack the scalability issue that the SAFECode
maintainers would like to handle.

I think you should remove the last paragraph; it doesn't seem relevant. Chances are good that you know more about these static analyses than the judges do. There's no point in letting them select which algorithm you implement.

Background
----------

Invalid memory access is a major source of program failures. If a
program statement dereferences a pointer that points to an invalid
memory cell, the program is either aborted by the operating system or,
often worse, the program continues to run with an undefined behavior.

You can make an even stronger argument: out of bounds pointers are used to subvert system security.

To avoid the latter, one can perform checks before every memory access
at run time. For some programming languages (e.g., Java) this is done
automatically by the compiler/run-time environment. For the language
C, neither the compiler nor the run-time environment enforces
memory-safety policies [1].

An example of access violation is given by the program below. When
this program runs, probably no visible error happens. However, the
program violates the semantics of well defined C code: it is accessing
the 11-th cell of an array that was declared with only 10 cells. Thus,
the behavior of this program is undefined.

#include<stdlib.h>
#include<stdio.h>
int main(int argc, char** argv){
       int i;
       int* A;
       A = (int*)calloc(10,sizeof(int));
       for(i=0;i<=11;i++)
               A[i] = i;
       printf("%d\n",A[11]);
       return 0;
}

This type of behavior can be the source of some very hard-to-find
bugs. If no explicit error occurs, other variables can have their
values changed without any reasonable explanation. Finding and fixing
this type of bug can be really hard, especially if the program runs
with more than one active thread. There exists tools that have been
built specially to track access violations like the one in the program
above. Valgrind is probably the best known example. These tools are
dynamic: they are used during the execution of the program, either
emulating it, or instrumenting it to catch errors at runtime. We would
like to implement the detection of array bounds violation statically.

First, Valgrind does not do bounds checking in its default mode; it merely finds invalid loads and stores. Its ptrcheck tool does bounds checking but is slow and can sometimes be unreliable.

Second, you should actually mention that tools for dynamic array bounds checking do exist (primarily SAFECode and SoftBound). You should emphasize that you're trying to optimize the speed at which SAFECode/SoftBound instrumented programs can run by optimizing away run-time array bounds checks.

Third, I don't think you need a C code example to demonstrate your point. Everyone knows what a buffer overflow is. I think you want to argue that buffer overflows cause security vulnerabilities and hard-to-fix bugs, there are tools that catch these errors at run-time, and you're going to try to optimize the run-time checks by using static array bounds checking.

Fourth, I think you should point out that your work can easily be modified to work with other tools such as VMKit, SoftBound, and the Dragonegg Ada front-end even if it targets SAFECode first.

Which results the analysis will generate
-------------------------------------------

This analysis will classify each array access site into one of two categories:
(i) always safe
(ii) possibly unsafe

The locations classified as always safe will never be able to cause an
array access violation, even if we take the wrapping semantics of
integer arithmetics of C/C++.

Additionally, I would like to implement an algorithm that inserts
safety checks in some possibly unsafe locations. The less-than domain
keeps, for each array, the variables that are equal to its allocated
size. Thus, if this information is available in the sites that the
array is used, we can insert a safety check there. For instance,
consider the program below:

As I mentioned before, SAFECode and SoftBound already insert the checks. Your analysis would be used to remove those that are unneeded.

#include<stdlib.h>
#include<string.h>
#include<stdio.h>

char* alloca2(int size) {
  return (char*) malloc(size * sizeof(char));
}

int main() {
  char* c = alloca2(10);
  char* x;
  strcpy(c, "hello!");
  x = c;
  while (*x != '\0') {
    putc(*x, stdout);
    x++;
  }
}

The inter-procedural analysis should be able to relate 10 with the
size of array c. Then it could insert the following test in the code:

int main() {
  char* c = alloca2(10);
  char* x;
  strcpy(c, "hello!");
  x = c;
  while (*x != '\0') {
    if (x> c + 10) {
      abort();
    }
    putc(*x, stdout);
    x++;
  }
}

It is not possible to secure statically every array access site, but
we should be able to secure at least a good percentage of them.

Remove the code example and the last sentence.

Timeline
--------

This is a three months project, and we would like to schedule the time
in the following way (schedule given in weeks):

[1-4]: Implement the less-than relational domain.

Will you be reusing infrastructure from your research group? If so, make sure to say that. Also specify the open-source license under which that infrastructure is released. Finally, be sure to explain how that infrastructure is going to help you build the less-than analysis more quickly.

[5-7] Implement static array bounds checking. The analysis should
point that a violation might happen in a given array access site.
[8-10] Implement a diagnosis tool, that inserts access checks whenever
we have the information of the limits of an array available.

You could do that, but then you're not helping make SAFECode faster. Your static array bounds checking pass should be used to remove run-time checks on GEPs that your analysis shows are always safe.

[11-12] Final tests and report
- Correctly compile the whole LLVM test suite.
- Generate performance numbers.
- Write the final report.

My background
-------------

I am currently a first year master student at the Federal University
of Minas Gerais, under orientation of professor Fernando Pereira. I
have graduated in this school on the year of 2011.

I assume you have graduated with a Bachelor's. If so, you should state that explicitly.

Also, it's "in the year" and not "on the year."

  I've worked with
software development for five years, in different projects. I built a
Data Warehouse viewer and Business Intelligence graphics and gauges
using Delphi; Commercial programs in MS Visual FoxPro; Improvements in
a logistics planning system, in Excel VBA. I have five years of
experience with SQL. Now, I'm working with LLVM, writing passes to
support scientifical researchers.

s/scientifical researchers/scientific research

There are two problems with the above paragraph:

1) Your work in industry doesn't really argue for your strengths. Can you tell us how large these programs are? Are they still used today? Providing a few key details can make this stronger.

2) You don't highlight your C++ and LLVM programming skills, and nothing in your experience suggests that you have a strong compiler background. Can you explain more about what work you've done on the range analysis with LLVM? Do you have any other C++ programming experience?

Why I can do well in this project
---------------------------------

I believe that I am a good candidate for this Summer of Code for four
reasons. Firstly, I am a master student of computer science in the
Federal
University of Minas Gerais (UFMG), which is one of the best schools of
computer science in Brazil. To support this claim, notice that UFMG
has the maximum grade in the ranking of CAPES (Coordenação de
Aperfeiçoamento de Pessoal de Nivel Superior - the Evaluation Body
under the Ministry of Education). I am working in the compiler's lab,

"working in the compiler lab"

under the advice of Fernando Pereira, who was a Summer of Coder

"under the supervision"

himself in 2006.

"who was a Google Summer of Code participant in 2006."

  Second, I have good experience with programming, with
almost six years of experience in many languages (Java, C, C++, C#,
VB, VBA, Object Pascal, SQL, etc...). Third, I'm already working with
LLVM passes. I already know some LLVM tools and I already have the
basic knowledge to start this project. I have written a pass to
instrument LLVM bitcodes, in order to find the minimum and maximum
values that each variable assumes at runtime.

Did this pass find the dynamic minimum and maximum values when the program was executed?

  This code has been used
to check the precision of our range analysis. It is publicly available
at (http://code.google.com/p/range-analysis/source/browse/#svn%2Ftrunk%2Fsrc%2FRAInstrumentation).
Finally, in the lab where I work there are six other people
who work everyday with LLVM; thus, in addition to getting help in the
forum, I can easily talk about my project to my colleagues.

I would drop the last sentence. Having smart colleagues is great, but it's a weak argument when trying to convince someone to hire you.

References
----------

[1] Dirk Beyer, Thomas A. Henzinger, Ranjit Jhala and Rupak Majumdar.
Checking Memory Safety with Blast. FASE 2005, LNCS 3442, pages 2-18,
Springer-Verlag, 2005

[2] Francesco Logozzo, Manuel Fähndrich: Pentagons: A weakly
relational abstract domain for the efficient validation of array
accesses. Sci. Comput. Program. 75(9): 796-807 (2010)

[3] Rastislav Bodík, Rajiv Gupta, Vivek Sarkar: ABCD: eliminating
array bounds checks on demand. PLDI 2000: 321-333

You didn't provide any references for SAFECode. See http://sva.cs.illinois.edu/pubs.html.

-- John T.

Second, I have good experience with programming, with
almost six years of experience in many languages (Java, C, C++, C#,
VB, VBA, Object Pascal, SQL, etc...). Third, I'm already working with
LLVM passes. I already know some LLVM tools and I already have the
basic knowledge to start this project. I have written a pass to
instrument LLVM bitcodes, in order to find the minimum and maximum
values that each variable assumes at runtime.

Did this pass find the dynamic minimum and maximum values when the program
was executed?

Yes. We're using it to measure our static range analysis precision.

John,

Thank you very much for your feedback. It was very important to me to
produce the official proposal, aligned with the actual needs of
SAFEcode. I tried to answer all your questions and follow your
advices.

The proposal can be found at:
http://www.google-melange.com/gsoc/proposal/review/google/gsoc2012/raphaelernani/1

Regards,

Raphael Ernani