Pool allocator + safecode

Hi,
I'm trying to get the pool allocator and safe code building against llvm
trunk. I've run into a build error, and I see that in the past another
user was told just not to build the pool allocator for use with safecode
[1]. However, I really want the pool allocator transforms, so I just
wanted to check why the suggestion was not to use it. Has it been
superseded in some way by something else? Or is it just broken at the
moment? I'll assume the latter for now and see if I can fix it...

Cheers

[1] http://lists.llvm.org/pipermail/llvm-dev/2015-July/088739.html

Dear Ed,

First, someone has updated the DSA code in the poolalloc project to LLVM 3.7, and a Master's student worked for me over the summer to update a large chunk of SAFECode to LLVM 3.7. However, the update to LLVM 3.7 isn't finished (we need to finish integrating SAFECode back into Clang), and my student has opted to focus on his studies instead of finishing the update, so the current status of SAFECode for LLVM 3.7 is that it's nearly done but suspended.

If you want the currently updated code, you can get it at https://github.com/jtcriswell/safecode-llvm37.

Second, if you need the pool allocation transform, then I must sadly disappoint. The Automatic Pool Allocation (APA) transformation has suffered bit-rot over the years, and we've stopped using it. We haven't updated it to work with LLVM 3.7 and currently have no plans to do so.

The code is still publicly available, so if you are sufficiently motivated, you can update it and fix it if you would like.

Just out of curiosity, what are you trying to accomplish? Depending on what you need, there may be simpler approaches that will require less engineering effort.

Regards,

John Criswell

Thanks for the fast response John.

Dear Ed,

First, someone has updated the DSA code in the poolalloc project to LLVM
3.7, and a Master's student worked for me over the summer to update a
large chunk of SAFECode to LLVM 3.7. However, the update to LLVM 3.7
isn't finished (we need to finish integrating SAFECode back into Clang),
and my student has opted to focus on his studies instead of finishing
the update, so the current status of SAFECode for LLVM 3.7 is that it's
nearly done but suspended.

If you want the currently updated code, you can get it at
https://github.com/jtcriswell/safecode-llvm37.

It's great that someone has begun work on this! I will look at it now.

Can I ask what happened to the llvm-gcc (or dragonegg) front end? I
haven't looked into how it works too much yet (I'm just starting out
with LLVM/clang etc), but if integration with the clang front end works
do you get dragonegg integration for free? From the web page I'm not
sure if dragonegg is also bit-rotting, but a gcc front end would be
useful for me.

Second, if you need the pool allocation transform, then I must sadly
disappoint. The Automatic Pool Allocation (APA) transformation has
suffered bit-rot over the years, and we've stopped using it. We haven't
updated it to work with LLVM 3.7 and currently have no plans to do so.

The code is still publicly available, so if you are sufficiently
motivated, you can update it and fix it if you would like.

I do need the APA transform. I'm a bit confused as to how the DSA code
is useful without APA? I'll cross my fingers and hope that the API
changes to 3.7 aren't too bad. Am I right in thinking that llvm-3.2 is
the last time this worked? Does the above github repo include the latest
APA stuff (albeit disabled)?

Just out of curiosity, what are you trying to accomplish? Depending on
what you need, there may be simpler approaches that will require less
engineering effort.

We really want the all singing all dancing safecode framework with APA
as detailed in the 2005 TECS SafeCode paper (Memory Safety Without
Garbage Collection for Embedded Applications). We are trying to build a
C based embedded system that is type safe at the lowest possible run
time cost. So I am also going to modify the uninitialized pointer MMU
based stuff to work with the ARM Cortex M3 MPU. I don't think there are
any shortcuts here (I'd be happy to be proved wrong though) - we need
APA.

Thanks again,
Ed

Thanks for the fast response John.

Dear Ed,

First, someone has updated the DSA code in the poolalloc project to LLVM
3.7, and a Master's student worked for me over the summer to update a
large chunk of SAFECode to LLVM 3.7. However, the update to LLVM 3.7
isn't finished (we need to finish integrating SAFECode back into Clang),
and my student has opted to focus on his studies instead of finishing
the update, so the current status of SAFECode for LLVM 3.7 is that it's
nearly done but suspended.

If you want the currently updated code, you can get it at
https://github.com/jtcriswell/safecode-llvm37.

It's great that someone has begun work on this! I will look at it now.

Can I ask what happened to the llvm-gcc (or dragonegg) front end? I
haven't looked into how it works too much yet (I'm just starting out
with LLVM/clang etc), but if integration with the clang front end works
do you get dragonegg integration for free? From the web page I'm not
sure if dragonegg is also bit-rotting, but a gcc front end would be
useful for me.

The original llvm-gcc (in which we modified gcc to produce LLVM bitcode) no longer exists. When GCC was extended to use plugins, someone created the DragonEgg plugin. I am not sure if that still exists.

Is there a reason you need a GCC-based front-end? Clang is intended to be a drop-in replacement for GCC and supports many of the GCC extensions.

FWIW, integrating SAFECode into Clang does not automatically give you DragonEgg support. Integrating into Clang simply makes it easier to run the SAFECode transformation passes after the LLVM passes have been run.

Second, if you need the pool allocation transform, then I must sadly
disappoint. The Automatic Pool Allocation (APA) transformation has
suffered bit-rot over the years, and we've stopped using it. We haven't
updated it to work with LLVM 3.7 and currently have no plans to do so.

The code is still publicly available, so if you are sufficiently
motivated, you can update it and fix it if you would like.

I do need the APA transform. I'm a bit confused as to how the DSA code
is useful without APA? I'll cross my fingers and hope that the API
changes to 3.7 aren't too bad. Am I right in thinking that llvm-3.2 is
the last time this worked? Does the above github repo include the latest
APA stuff (albeit disabled)?

DSA is a points-to and call graph analysis. As such, it is used for many things in addition to APA and SAFECode. For example, the SMACK verifier uses DSA.

The APA code has been bit rotting for awhile. While you can compile and run it with LLVM 3.2, it won't work as well as it did for the paper, and it probably needs some work to make it robust. Please remember that APA has always been a research prototype.

Just out of curiosity, what are you trying to accomplish? Depending on
what you need, there may be simpler approaches that will require less
engineering effort.

We really want the all singing all dancing safecode framework with APA
as detailed in the 2005 TECS SafeCode paper (Memory Safety Without
Garbage Collection for Embedded Applications). We are trying to build a
C based embedded system that is type safe at the lowest possible run
time cost. So I am also going to modify the uninitialized pointer MMU
based stuff to work with the ARM Cortex M3 MPU. I don't think there are
any shortcuts here (I'd be happy to be proved wrong though) - we need
APA.

I see.

If you just need to isolate processes on your system so that they don't overwrite each other, it looks like the ARM Cortex M3 MPU can do that. Is there a reason why that won't work?

However, if you want to isolate processes using inferred type safety, you will need to address three significant challenges:

1) Due to changes in LLVM, DSA has a difficult time inferring types. In a nutshell, some LLVM optimizations turned typed-getelementptr (GEP) instructions into GEPs that use byte-level indexing. DSA was written with the assumption that GEPs carried high-level type information.

I think this is fixable by refactoring DSA's type inference code to act more like Value Set Analysis: it would use a map between a 4-tuple and a type. The 4-tuple (a, b, c, d) describes a formula ax + b in which a and b denote offset and stride and c and d place a limit on the lower and upper bounds of x. c and d can be +- infinity, allowing the 4-tuple to denote a type that occurs in an unbounded array.

As an FYI, there are multiple research groups interested in accurate points-to analysis results (mine included). I'll be holding a BoF at the LLVM Developer's meeting this year to discuss who needs what and if there's a way to develop it.

2) Making APA robust. In my personal opinion, the original APA may be more complicated than necessary for your application. APA currently creates context sensitive pools, which means that it needs to pass pool handles around. This requires transforming function signatures which makes inter-operating with external code a real pain. It also makes the transform complicated.

A simpler design would be to create one pool for each type that DSA infers. The points-to analysis would not be sound (unless it unified all points-to sets that have the same type), but it would allow all pools (or nearly all pools) to be global variables, greatly simplifying the transformation. It would also continue to prevent the application from accessing data outside its allocated memory.

3) The version of SAFECode in that paper used the Omega constraint solver to prove that array accesses are safe. That code has long bit-rotted away, and its implementation was not the most efficient (it exec()'ed the Omega solver for every query). A better approach today would be to integrate the constraint solver into the compiler proper. Additionally, you now have other tools available, such as CVC4, Z3, and SMACK/Boogie, for building and solving the constraints.

As an FYI, later versions of SAFECode use run-time checks for type-unsafe pools and array accesses so that it can support the full generality of C (see Dhurjati's PLDI 2007 paper and my SVA publications). You could probably do something simple in which type-safe data goes into pools and type-unsafe data goes into a large area for which loads and stores are subjected to simple SFI-like instrumentation (e.g., Google Native Client). There may be solutions in the middle of the spectrum between fast-but-difficult-to-implement and slow-but-easy-to-implement.

Regards,

John Criswell

Thanks again for the really nice response John.

The original llvm-gcc (in which we modified gcc to produce LLVM bitcode)
no longer exists. When GCC was extended to use plugins, someone created
the DragonEgg plugin. I am not sure if that still exists.

Is there a reason you need a GCC-based front-end? Clang is intended to
be a drop-in replacement for GCC and supports many of the GCC extensions.

The only reason was to make it easier to drop into the toolchain. I
don't anticipate that being a huge amount of work to switch to clang
anyway, but I saw that SafeCode was originally using llvm-gcc so I
thought that might make life easy.

Integrating into Clang simply makes it easier to run
the SAFECode transformation passes after the LLVM passes have been run.

So basically it means you don't have to run safecode manually?

DSA is a points-to and call graph analysis. As such, it is used for
many things in addition to APA and SAFECode. For example, the SMACK
verifier uses DSA.

Got it.

The APA code has been bit rotting for awhile. While you can compile and
run it with LLVM 3.2, it won't work as well as it did for the paper, and
it probably needs some work to make it robust. Please remember that APA
has always been a research prototype.

I really want to bring all this into trunk. I've spent this afternoon
chugging through minor API incompatibilities trying to get APA to build.
Not there yet!

> We really want the all singing all dancing safecode framework with APA
> as detailed in the 2005 TECS SafeCode paper (Memory Safety Without
> Garbage Collection for Embedded Applications). We are trying to build a
> C based embedded system that is type safe at the lowest possible run
> time cost. So I am also going to modify the uninitialized pointer MMU
> based stuff to work with the ARM Cortex M3 MPU. I don't think there are
> any shortcuts here (I'd be happy to be proved wrong though) - we need
> APA.

I see.

If you just need to isolate processes on your system so that they don't
overwrite each other, it looks like the ARM Cortex M3 MPU can do that.
Is there a reason why that won't work?

Maybe I should have been a bit clearer; we're really interested in full
memory and type safety. We want to harden the system against memory
corruption vulnerabilities. Process isolation isn't an issue, as we are
in an embedded context where we don't have processes. I was really
talking about setting uninitialised pointers to a value that is
configured in the MPU to be inaccessible, and using a handler to abort
in the case where an attempt is made to read/write to any of these
pointers.

However, if you want to isolate processes using inferred type safety,
you will need to address three significant challenges:

1) Due to changes in LLVM, DSA has a difficult time inferring types. In
a nutshell, some LLVM optimizations turned typed-getelementptr (GEP)
instructions into GEPs that use byte-level indexing. DSA was written
with the assumption that GEPs carried high-level type information.

I think this is fixable by refactoring DSA's type inference code to act
more like Value Set Analysis: it would use a map between a 4-tuple and a
type. The 4-tuple (a, b, c, d) describes a formula ax + b in which a
and b denote offset and stride and c and d place a limit on the lower
and upper bounds of x. c and d can be +- infinity, allowing the 4-tuple
to denote a type that occurs in an unbounded array.

This sounds like a good idea. I'm willing to give this a go, time
permitting.

As an FYI, there are multiple research groups interested in accurate
points-to analysis results (mine included). I'll be holding a BoF at
the LLVM Developer's meeting this year to discuss who needs what and if
there's a way to develop it.

That doesn't surprise me. If someone would like to do this it would be
very handy from my point of view too!

2) Making APA robust. In my personal opinion, the original APA may be
more complicated than necessary for your application. APA currently
creates context sensitive pools, which means that it needs to pass pool
handles around. This requires transforming function signatures which
makes inter-operating with external code a real pain. It also makes the
transform complicated.

A simpler design would be to create one pool for each type that DSA
infers. The points-to analysis would not be sound (unless it unified
all points-to sets that have the same type), but it would allow all
pools (or nearly all pools) to be global variables, greatly simplifying
the transformation. It would also continue to prevent the application
from accessing data outside its allocated memory.

We aren't going to need to inter-operate with external code all that
much (if at all), so perhaps this will not be an issue. DSA and the
context sensitive pools of APA (despite the ugly transform) were
originally what attracted us to SafeCode; we would like to remain sound.

3) The version of SAFECode in that paper used the Omega constraint
solver to prove that array accesses are safe. That code has long
bit-rotted away, and its implementation was not the most efficient (it
exec()'ed the Omega solver for every query). A better approach today
would be to integrate the constraint solver into the compiler proper.
Additionally, you now have other tools available, such as CVC4, Z3, and
SMACK/Boogie, for building and solving the constraints.

We had already noticed that the limit to linear array access was way
behind state of the art (we have a lot of experience with constraint
solvers). Using an SMT solver would make sense, as you can use the
theory that suits the particular problem - linear isn't good enough?
You've got IDL, octagons etc.

To start off with we'd like to get something running, even if it has
limitations. Unless you think it's a really bad idea I will for the
moment try to get the old APA working to some extent again. It isn't
really an issue if I have to run APA and safecode as separate parts of
the build process for the moment, or if parts of it are clunky, from our
point of view it would just be good just to get some toy examples
working relatively quickly.

As an FYI, later versions of SAFECode use run-time checks for
type-unsafe pools and array accesses so that it can support the full
generality of C (see Dhurjati's PLDI 2007 paper and my SVA
publications). You could probably do something simple in which
type-safe data goes into pools and type-unsafe data goes into a large
area for which loads and stores are subjected to simple SFI-like
instrumentation (e.g., Google Native Client). There may be solutions in
the middle of the spectrum between fast-but-difficult-to-implement and
slow-but-easy-to-implement.

We want speed speed speed! If we only support a (preferably large)
subset of C it doesn't matter - this is better than having additional
run time checks. Is there an option to warn whenever a run-time check is
inserted? I'll check out some of the other papers. I read the TECS paper
and gave a few of the others a skim, but looks like I missed some stuff.

Cheers,
Ed