Dataflow analysis with LLVM/Clang

Hello list!

I'm working on an MSc project and I need to detect conflictive actions between different threads in a program through static analysis. The idea is to detect actions like a mutex lock/unlock (among others) and insert code into the program to notify a "manager" about the next conflictive action that is going to happen. This insertion should be done as early as possible in the program's execution flow (so the "manager" gets the info as early as possible), but it also has to be accurate (so if a conflictive action is performed in one of the branches of an if, for example, the insertion has to be made at the beginning of that branch, because if the branch is not taken at runtime, the action will not be performed).
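To make the placement concrete, here is a minimal sketch of what the instrumented C code might look like; the notify() function and its argument are hypothetical stand-ins for the real manager API:

```cpp
#include <pthread.h>

static pthread_mutex_t lock_a = PTHREAD_MUTEX_INITIALIZER;
static int notify_count = 0;

// Hypothetical stub for the "manager" notification API.
static void notify(const char * /*action*/) { ++notify_count; }

static void handle_request(int cond) {
    if (cond) {
        // Inserted at the beginning of the branch: it runs only when the
        // branch is actually taken, so the notification stays accurate,
        // yet it still precedes the conflictive action itself.
        notify("lock_a");
        pthread_mutex_lock(&lock_a);
        /* ... critical section ... */
        pthread_mutex_unlock(&lock_a);
    }
}
```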

I was wondering if it's better to do this in Clang, since one of its intended uses is static code analysis and code transformation, or if it's better to do this over the LLVM IR, as an LLVM pass. AFAIK I'll need to work over a control-flow graph and call graph in order to do the analysis, but I'll also need some line information to make the code insertion (maybe this is in the AST?). I'm very new to LLVM (and also new to compiler-related stuff), so I would like to hear some opinions on this, so I can make a good decision and (if applicable) contribute this work to LLVM/Clang. Also, if someone can point me to a Clang or LLVM pass that does similar stuff, that would be great.

Thanks in advance.

My take: if you want to use clang's Analysis engine (include/Analysis), you can't avoid using clang. It isn't clear to me from your description whether you need to use it, however. If all you want to do is insert code and some trivial analysis, and LLVM bitcode contains everything you need to do your work, then you'd probably want to just stick with LLVM.

If you gave an example of the most complex reasoning you want to perform, that might help us tell you what part of llvm/clang can help the most.

Hello list!

hi :-)

Unless you have a very strong reason to do it, I'd suggest doing this at the LLVM IR level. Doing transformations and instrumentation there is generally easier than working with C ASTs. Working with C ASTs means you have to handle a lot of gross stuff that C has, which gets lowered away by the time you get to LLVM IR.

Reasons to use Clang ASTs are if you want to edit the original user source, or if you need fine-grained location info so that you can indicate to the user that some subexpression was involved.

You can find several example instrumentations in the llvm/lib/Transforms/Instrumentation directory.

-Chris

I think Mike's comments are pretty much spot on. To me this really amounts to listing out your requirements and what you are trying to accomplish. At a high level, it sounds like what you want to do is program transformation. If the goal of the transformation is to change runtime behavior, then you can perform the transformation at either the LLVM IR level or by rewriting source code using Clang. If the goal is to modify the original source code so that users are now working with an instrumented source file, then obviously this has to be done using Clang.

I'm going to assume that your goal is simply to modify runtime behavior. If that is the case, my gut feeling is that it is better to do it at the LLVM IR level if you really don't require any specific knowledge about C. The lowered representation of the LLVM IR marginalizes out details of the high-level language that may be really superfluous for your task; C is a "rich" language with many constructs, so your analysis would have to reason about many edge cases. There are many other tradeoffs that we can go into if you are interested.

I think the other thing to keep in mind is how the concurrency primitives whose uses you are interested in monitoring are represented both in Clang's AST and the LLVM IR. If you can easily identify when such primitives are used at the LLVM IR level, then doing your transformations there makes the most amount of sense to me (given the information I know about what you are trying to do).

I'm not 100% certain how you wanted to use line information. Certainly Clang has rich information about the locations of expressions within a source file, but LLVM IR can capture some debugging information that may be useful for constructing the line information you need (others can chime in here, since I'm not an expert on this topic). It all depends on what you are trying to do.

I think Mike's comments are pretty much spot on. To me this really amounts
to listing out your requirements and what you are trying to accomplish.

Sorry guys, re-reading my email I agree I wasn't clear enough. To be extremely honest, the supervisor of this project also didn't give many details about it. I'm doing this as a "research experience" at the Technical University of Madrid, but actually I'm an MSc student at the University of Campinas (Brazil), so I just got here and just got in touch with the project. I asked him some questions and now have more details about what I need to do. I also don't want to go too OT, but I'll try to explain a bit about the intended use of this info.

This analysis is going to be part of a project that tries to obtain deterministic behavior from multi-threaded replicas of a replicated server (since determinism is one of the requirements for active replication). Assuming all the requests arrive in the same order at all the replicas (which can be achieved with a total-order, reliable multicast infrastructure), there is a scheduler which creates a new thread to handle each request and controls the scheduling of the threads (it's implemented as an API with sleeps over the thread API). This scheduler needs info about the next conflictive action each thread will perform, so it can enforce an order of scheduling between the conflictive actions which is going to be the same at all replicas. The info is going to be sent by each thread through an API of the scheduler (something like notify(lock_a)), and it needs to be sent as early as possible so the scheduler can have info about the future actions of each thread.
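A minimal single-threaded sketch of that ordering idea (all names here, Scheduler, notify, grant_next, are made up; the real scheduler API is not specified in the thread): each thread announces its next conflictive action, and the scheduler grants permission in announcement order, so every replica that sees the same notification order produces the same schedule.

```cpp
#include <deque>
#include <string>
#include <utility>
#include <vector>

struct Scheduler {
    std::deque<std::pair<int, std::string>> pending;   // (thread id, action), in notification order
    std::vector<std::pair<int, std::string>> granted;  // order in which actions were allowed to run

    // Called by each thread, as early as possible, to announce its next
    // conflictive action (e.g. notify(tid, "lock_a")).
    void notify(int tid, const std::string &action) {
        pending.emplace_back(tid, action);
    }

    // Grant the oldest announced action; an identical notification order at
    // every replica yields an identical grant order, i.e. determinism.
    void grant_next() {
        if (pending.empty()) return;
        granted.push_back(pending.front());
        pending.pop_front();
    }
};
```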

At a high level, it sounds like what you want to do is program transformation. If the goal of the transformation is to change runtime behavior, then you can perform the transformation at either the LLVM IR level or by rewriting source code using Clang. If the goal is to modify the original source code so that users are now working with an instrumented source file, then obviously this has to be done using Clang.

The final product has to be instrumented C source code that can be compiled with any C compiler. The insertions have to be easily verifiable, and it seems to me that the easiest way to achieve that is to make the insertions over the original C code. I've read that there is an LLVM backend that translates LLVM to C, but I'm not sure how similar the generated code is to the original program (I think it's not intended to resemble the original program).

I'm going to assume that your goal is simply to modify runtime behavior. If
that is the case, my gut feeling is that it is better to do it at the LLVM
IR level if you really don't require any specific knowledge about C. The
lowered representation of the LLVM IR marginalizes out details of the
high-level language that may be really superfluous for your task; C is a
"rich" language with many constructs, so your analysis would have to reason
about many edge cases. There are many other tradeoffs that we can go into
if you are interested.

I can make assumptions to simplify the analysis, such as that there are no typedefs or macros in the program, and then try to relax these constraints. The idea at first is to make a proof of concept which just handles pthread_mutex_lock() and pthread_mutex_unlock(). Then later I can try to make it deal with more generic C code.

I think the other thing to keep in mind is how the concurrency primitives
whose uses you are interested in monitoring are represented both in Clang's
AST and the LLVM IR. If you can easily identify when such primitives are
used at the LLVM IR level, then doing your transformations there makes the
most amount of sense to me (given the information I know about what you are
trying to do).

The definition of a conflictive action needs to be flexible. At first, I'll look for statements like pthread_mutex_lock() and pthread_mutex_unlock(), but maybe for some reason I'll have a program where the conflictive action would be foo(), so I'll need to be able to configure my tool (maybe through a config file) with which primitives to look for. Maybe clang would be more flexible on this point, since I imagine (and please correct me here if I'm wrong) that each statement in clang's AST is similar to its form in the source code. If using LLVM, I imagine I'll have to figure out, for each new type of conflictive action that may appear, its representation in the LLVM IR.
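The configurable part itself does not need compiler support at all. A sketch, assuming a config file with one function name per line (load_conflictive_actions and is_conflictive are made-up names), which either a Clang-based or an LLVM-based tool could reuse:

```cpp
#include <istream>
#include <set>
#include <sstream>
#include <string>

// Read one conflictive-action name per line (e.g. "pthread_mutex_lock").
std::set<std::string> load_conflictive_actions(std::istream &config) {
    std::set<std::string> names;
    std::string line;
    while (std::getline(config, line))
        if (!line.empty()) names.insert(line);
    return names;
}

// The analysis would call this with the callee name of each call it visits.
bool is_conflictive(const std::set<std::string> &actions,
                    const std::string &callee) {
    return actions.count(callee) != 0;
}
```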

I'm not 100% certain how you wanted to use line information. Certainly
Clang has rich information about the locations of expressions within a
source file, but LLVM IR can capture some debugging information that may be
useful for constructing the line information you need (others can chime in
here, since I'm not an expert on this topic). It all depends on what you
are trying to do.

The line info would be used to know exactly at what point of the source file the code insertion should be made. Maybe it's enough to insert statements into clang's AST, and I will not need to handle this by hand. Or maybe I can do the analysis and figure out the transformations over LLVM, and later use line info to modify the source code by hand.

Right now it seems to me that for what I need, clang is more appropriate, but I'm still not completely sure. I would also like to hear any comments about whether this kind of work can be useful to LLVM/Clang, because if it's similar to something you guys need, I can try to write it in such a way that it can be extended or modified to suit your needs.

The final product has to be instrumented C source code that can be compiled with any C compiler.

Hum... This is so old skool... but possible.

Normally we'd try and talk you out of this, and just say, portability to any system on which clang runs should be sufficient.

The next question is, do you want readable C code that matches the input, or what looks like assembly code output of an optimizing compiler translated into C?

If the former, you can use the clang rewriter. If the latter, you can use the C back end of llvm and implement your code as an llvm pass.

The insertions have to be easily verifiable,

Hum, strikes me as weird, but if you want that, then I'd predict you want to go the way of the rewriter. The output of the C backend for clang doesn't yet contain a proof of transformation that you can easily verify. :-)

The final product has to be instrumented C source code that can be compiled with any C compiler.

Hum... This is so old skool... but possible.

I understand that, but unfortunately I can't make all the decisions :-(

Normally we'd try and talk you out of this, and just say, portability to any
system on which clang runs should be sufficient.

The next question is, do you want readable C code that matches the input, or
what looks like assembly code output of an optimizing compiler translated
into C?

Readable C

If the former, you can use the clang rewriter. If the latter, you can use the C back end of llvm and implement your code as an llvm pass.

The insertions have to be easily verifiable,

Hum, strikes me as weird, but if you want that, then I'd predict you want to go the way of the rewriter. The output of the C backend for clang doesn't yet contain a proof of transformation that you can easily verify. :-)

Ok, thanks a bunch for all the info. I really have a better idea of
how this should work right now.

I was looking at Clang's source code, and I found out there are two
ways of calling my analysis:

A) Write a stand-alone program, which would link to clang's libs;
B) Add a command-line option to clang driver, which would call my analysis code.

AFAIK, my analysis code will have to do something like this:

1) Parse the source file(s), and generate the AST;
2) Generate a CFG from the AST;
3) Do a dataflow analysis over the CFG, and insert some statements on
it (which I think, in turn, modifies the AST also);
4) Re-generate C code from the modified AST;
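For step 3, the core analysis can be phrased as a classic worklist fixed-point computation. Here is a toy, self-contained sketch (the Block struct is hypothetical, not Clang's CFG classes) of a backward "may" analysis that computes, for each block, the set of conflictive actions still reachable from it; the stricter "occurs on every path" variant, which is what makes an early notify() safe, would intersect the successor sets instead of uniting them:

```cpp
#include <set>
#include <string>
#include <vector>

// Hypothetical toy CFG node; real code would walk Clang's CFG instead.
struct Block {
    std::vector<int> succs;         // indices of successor blocks
    std::set<std::string> actions;  // conflictive actions performed in this block
};

// Backward "may" dataflow: out[b] = actions(b) joined with the union of
// out[s] over all successors s, iterated to a fixed point.
std::vector<std::set<std::string>> may_reach(const std::vector<Block> &cfg) {
    std::vector<std::set<std::string>> out(cfg.size());
    bool changed = true;
    while (changed) {
        changed = false;
        for (int b = (int)cfg.size() - 1; b >= 0; --b) {
            std::set<std::string> s = cfg[b].actions;
            for (int succ : cfg[b].succs)
                s.insert(out[succ].begin(), out[succ].end());
            if (s != out[b]) { out[b] = s; changed = true; }
        }
    }
    return out;
}
```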

Since steps 1, 2 and 4 are already done in clang, it seems to make sense to go with choice B, but I couldn't find out how difficult it would be to add a command-line option to the driver. On the other hand, going with A will make my code more independent of changes to clang's driver, and it seems I can copy most of steps 1 and 2 from the driver. What do you guys think would be the best option? Also, is there any other example of use of Clang's Analysis engine, besides the driver?

Thanks and regards!

I was looking at Clang's source code, and I found out there are two
ways of calling my analysis:

A) Write a stand-alone program, which would link to clang's libs;
B) Add a command-line option to clang driver, which would call my analysis code.

It's easy to do B using the AnalysisConsumer interface (see below). Doing 'A' is only as hard as coming up with the ASTs to feed to the analysis routines.

AFAIK, my analysis code will have to do something like this:

1) Parse the source file(s), and generate the AST;
2) Generate a CFG from the AST;
3) Do a dataflow analysis over the CFG, and insert some statements on
it (which I think, in turn, modifies the AST also);

The CFG is just a "view" of the control-flow relationships of the statements in the AST. Inserting statements into the CFG (which you cannot do) does not modify the original AST. You can insert statements into the original AST, but those changes will not be reflected in the CFG until you rebuild the CFG.

4) Re-generate C code from the modified AST;

Since steps 1, 2 and 4 are already done in clang,

3 is also done. The "DataflowSolver" class implements a flow-sensitive, intra-procedural dataflow analysis engine. Check out the LiveVariables and UninitializedValues examples.

Alternatively, GRExprEngine is a path-sensitive dataflow engine.

it seems to make sense to go with choice B, but I couldn't find out how difficult it would be to add a command-line option to the driver. On the other hand, going with A will make my code more independent of changes to clang's driver, and it seems I can copy most of steps 1 and 2 from the driver. What do you guys think would be the best option? Also, is there any other example of use of Clang's Analysis engine, besides the driver?

It's fairly easy to write an analysis that uses the "AnalysisConsumer" interface. Look at AnalysisConsumer.cpp and Analyses.def in the "Driver/" directory. The Analyses.def file allows you to declaratively define your analysis option for the driver, the scope of the analysis (e.g., does it run on functions, does it require the entire translation unit, etc.), and the "Action" function called by AnalysisConsumer.

For example, here is one action function defined in AnalysisConsumer.cpp:

static void ActionWarnObjCDealloc(AnalysisManager& mgr) {
   if (mgr.getLangOptions().getGCMode() == LangOptions::GCOnly)
     return;

   BugReporter BR(mgr);

   CheckObjCDealloc(cast<ObjCImplementationDecl>(mgr.getCodeDecl()),
                    mgr.getLangOptions(), BR);
}

and here is the corresponding line in Analyses.def:

ANALYSIS(WarnObjCDealloc, "warn-objc-missing-dealloc",
  "Warn about Objective-C classes that lack a correct implementation of -dealloc",
  ObjCImplementation)

In the future we may move to a more dynamic, plug-in model for defining new analyses for the analyzer. By modifying these two files (which are very localized changes), I don't think you'll have many issues with merging, and it's fairly easy to plug a new analysis into the driver within a couple of minutes.