source-to-source transformation to insert instrumentation calls

Hi

I'm trying to use CLANG to build a standalone source-to-source transformation tool. My first goal is to insert instrumentation calls for every memory access (wherever possible).

For example, from this:

int test(int *a)
{
     int b;
     b = 0x10;
     *a = b;

     if(*a == 0x10)
        return 1;
     return 0;
}

To this:

int test(int *a )
{
   int b ;

   b = 0x10;
   instrument_write(*a, 4); // 4 is width
   *a = b;

   instrument_read(*a, 4);
   if (*a == 0x10) {
     return (1);
   }
   return (0);
}

I currently have a MatchFinder filter that matches declRefExpr(). The problem is that I can't insert the instrumentation call at the point where it matched. So I tried to make a 'wider' match:
  compoundStmt( has( stmt( hasDescendant( declRefExpr() ) ) ) ) and then use the location of the 'stmt' to insert the call. This also doesn't work as good as I wished.

How would you advice me to find the right location where I can insert the call ? Are there any other ways I can do this? Perhaps using the MatchFinder is not the best way to do this?

Ultimately I would like to create a source-to-source transformation tool that applies the StackGuard principle and can move stack variables to the heap. (similar to what CIL does.)
I'm aware of other tools like Asan but it's not the implementation I'm looking for.

Thank you

  - Jan

Hi

I'm trying to use CLANG to build a standalone source-to-source
transformation tool. My first goal is to insert instrumentation calls for
every memory access (wherever possible).

For example, from this:

int test(int *a)
{
    int b;
    b = 0x10;
    *a = b;

    if(*a == 0x10)
       return 1;
    return 0;
}

To this:

int test(int *a )
{
  int b ;

  b = 0x10;
  instrument_write(*a, 4); // 4 is width
  *a = b;

  instrument_read(*a, 4);
  if (*a == 0x10) {
    return (1);
  }
  return (0);
}

I currently have a MatchFinder filter that matches declRefExpr(). The
problem is that I can't insert the instrumentation call at the point where
it matched. So I tried to make a 'wider' match:
compoundStmt( has( stmt( hasDescendant( declRefExpr() ) ) ) ) and then use
the location of the 'stmt' to insert the call. This also doesn't work as
good as I wished.

Yea, it needs quite a bit of special casing. Doing something very
generic is an areas we still have to work on. That said, it also looks
like the llvm level would be much better suited for the kind of
instrumentation you want to do.

Cheers,
/Manuel

How would you advice me to find the right location where I can insert the
call ? Are there any other ways I can do this? Perhaps using the
MatchFinder is not the best way to do this?

Ultimately I would like to create a source-to-source transformation tool
that applies the StackGuard principle and can move stack variables to the
heap. (similar to what CIL does.)
I'm aware of other tools like Asan but it's not the implementation I'm
looking for.

Cheers,
/Manuel

Hi

I'm trying to use CLANG to build a standalone source-to-source transformation tool. My first goal is to insert instrumentation calls for every memory access (wherever possible).

For example, from this:

int test(int *a)
{
    int b;
    b = 0x10;
    *a = b;

    if(*a == 0x10)
       return 1;
    return 0;
}

To this:

int test(int *a )
{
  int b ;

  b = 0x10;
  instrument_write(*a, 4); // 4 is width
  *a = b;

  instrument_read(*a, 4);
  if (*a == 0x10) {
    return (1);
  }
  return (0);
}

I currently have a MatchFinder filter that matches declRefExpr(). The problem is that I can't insert the instrumentation call at the point where it matched. So I tried to make a 'wider' match:
compoundStmt( has( stmt( hasDescendant( declRefExpr() ) ) ) ) and then use the location of the 'stmt' to insert the call. This also doesn't work as good as I wished.

How would you advice me to find the right location where I can insert the call ? Are there any other ways I can do this? Perhaps using the MatchFinder is not the best way to do this?

Ultimately I would like to create a source-to-source transformation tool that applies the StackGuard principle and can move stack variables to the heap. (similar to what CIL does.)

First, why do you want to do a source-to-source level transformation instead of an LLVM IR transformation? If it's because you need to feed the transformed source into a C compiler for a special hardware target, it may be easier to write an LLVM IR transform and to get the C backend up and running again.

If you can explain why you need a source-to-source transform, someone on the list may be able to provide ideas for a workable solution.

Second, as an FYI, SAFECode has a pass that will promote potentially escaping stack allocations into heap allocations. The transform hasn't been updated to LLVM mainline yet, but doing so should be relatively easy.

-- John T.

Hi

I'm trying to use CLANG to build a standalone source-to-source
transformation tool. My first goal is to insert instrumentation calls
for every memory access (wherever possible).

<snip>

How would you advice me to find the right location where I can insert
the call ? Are there any other ways I can do this? Perhaps using the
MatchFinder is not the best way to do this?

Ultimately I would like to create a source-to-source transformation
tool that applies the StackGuard principle and can move stack
variables to the heap. (similar to what CIL does.)

First, why do you want to do a source-to-source level transformation
instead of an LLVM IR transformation? If it's because you need to feed
the transformed source into a C compiler for a special hardware target,
it may be easier to write an LLVM IR transform and to get the C backend
up and running again.

Our main platform is MIPS and we're not convinced that the LLVM (mips) code generation is mature and well tested enough. Unfortunately I don't have the resources to get that done.
I think a source-to-source transformation tool with CLANG is the only short term solution.

If you can explain why you need a source-to-source transform, someone on
the list may be able to provide ideas for a workable solution.

I have changed my test code to a RecursiveASTVisitor and I have DeclRefExpr Visitor. Now I'm still looking for a way to determine a good location where I can insert my call.
Should I walk up the AST tree until I find some sort of Statement and then use that location to insert the call ? Would that be a good place?
How do I walk up the tree ?

Hi Jan,

Just a note: I hope you understand that such source-level
instrumentation will not be precise. Optimizations eliminate memory
accesses. For example, in this case there will be only one store:

define i32 @test(i32* nocapture %a) nounwind {
  store i32 16, i32* %a, align 4, !tbaa !0
  ret i32 1
}

Dmitri

That's OK. I'm aware that I won't be 100% accurate.
Now I just need to figure out a good location to insert the call! :slight_smile:

Thanks for the feedback.

That's OK. I'm aware that I won't be 100% accurate.

OK, good to know.

Now I just need to figure out a good location to insert the call! :slight_smile:

I think that there are cases where inserting instrumentation calls
will require non-trivial rewriting. For example:

int f(int, int);
int g(int *)
int test(int *a)
{
    f(*a, g(a));
}

Function argument evaluation order is unspecified, so we can not
simply insert an instrumentation call for *a before f() because *a can
be evaluated after g(a), which can change *a.

So we need to make an arbitrary decision ourselves and rewrite this to
something like:

int test(int *a)
{
    int tmp1 = *a;
    int tmp2 = g(a);
    f(tmp1, tmp2);
}

I didn't give this much thought, but I suspect there are much more
cases like this.

Dmitri

Yes, for example with side effect operators like ++ or -=, etc.

To deal with this kind of stuff in our own source-to-source compiler we
use the "," sequence operator to capture some values right from inside
the battle field.

For example you could compile this to:
int f(int, int);
int g(int *);
int test(int *a) {
   f((instrument_read(*a, 4), *a), (instrument_read(a, 4), g(a)));
}

That should work if the function formal argument evaluation order is
consistent for any call.

But, well, I guess we have enough work to deal with well-written
programs before dealing with all this wicked stuff... :slight_smile:

I realize that it won't be easy to handle these exceptional things, my basic aim is at the trivial assignments, compares, things inside if/while/for etc. The basic stuff.

So I would like to know how I can do just that.
As I said, I think I need to walk up the AST tree and find the intersection point with the CompountStmt. How can I do that? How do I get the parent?

Thanks again.

The AST doesn't have parent - you can either wait for me to implement
hasParent in the matchers, or build up the parent map yourself
thorough a RecursiveASTVisitor.

Cheers,
/Manuel