Enhancing the DataFlowSolver in clang

Hi all,

I am currently trying to implement few simple dataflow analysis, namely dominators and reaching definitions with the purpose to actually understand how the clang's DataFlowSolver works.

I got stuck with the dominator tree: http://en.wikipedia.org/wiki/Dominator_(graph_theory)
The theory it's quite simple, but when I tried to implement it on the solver I started to have some problems.

The issue is that I need to set an initial state for each of the CFG blocks, and more important I need to specify a different initial state for each of the blocks (actually only the root node needs to have a different initialization)

For what I understand by reverse engineering the code, the method void setTopValue(DominatorGraph::ValTy& V) should be used to set the initial state of a dataflow problem.

However, in the dominators analysis I need to initialize the bitvectors for each of the blocks with different values and in order to be able to do that I need a different signature for the method which would be something like:

void SetTopValue(DominatorGraph::ValTy& V, clang::CFGBlock const& B){ }

This modification requires only small changes in the DataFlowSolver and implementation of dominators gets simpler.

Is there any other way to have this kind of behaviour or it is really missing? If you all agree with this change I can provide the patch!

regards, S. Pellegrini

Hi Simone,

setTopValue() is probably misnamed; it is used by DataflowSolver to determine the "top" value when computing a merge at a confluence point. It's not actually used anywhere else.

If you want to specially initialize your bitvector values, the DataflowValues class does have an InitializeValues() method that is called by solver at the beginning of the analysis:

  /// InitializeValues - Invoked by the solver to initialize state needed for
  /// dataflow analysis. This method is usually specialized by subclasses.
  void InitializeValues(const CFG& cfg) {}

The LiveVariables analysis doesn't actually use this method since the default bitvector values are fine for initial value, but in InitializeValues you can iterate over the CFG and set the values as you like.

Would this work?

Ted

Hi Simone,

setTopValue() is probably misnamed; it is used by DataflowSolver to determine the "top" value when computing a merge at a confluence point. It's not actually used anywhere else.

If you want to specially initialize your bitvector values, the DataflowValues class does have an InitializeValues() method that is called by solver at the beginning of the analysis:

   /// InitializeValues - Invoked by the solver to initialize state needed for
   /// dataflow analysis. This method is usually specialized by subclasses.
   void InitializeValues(const CFG& cfg) {}

The LiveVariables analysis doesn't actually use this method since the default bitvector values are fine for initial value, but in InitializeValues you can iterate over the CFG and set the values as you like.

Would this work?
   

Yes it solves my problem! Thanks again!

Hi Simone,

setTopValue() is probably misnamed; it is used by DataflowSolver to determine the "top" value when computing a merge at a confluence point. It's not actually used anywhere else.

If you want to specially initialize your bitvector values, the DataflowValues class does have an InitializeValues() method that is called by solver at the beginning of the analysis:

   /// InitializeValues - Invoked by the solver to initialize state needed for
   /// dataflow analysis. This method is usually specialized by subclasses.
   void InitializeValues(const CFG& cfg) {}

The LiveVariables analysis doesn't actually use this method since the default bitvector values are fine for initial value, but in InitializeValues you can iterate over the CFG and set the values as you like.

Would this work?
   

Hi again,
I just found out that using InitializeValue() to initialize the state of the dataflow solver is not enough.

Let's for example consider an easy dataflow analysis: dominator trees. I need to create a bitvector where each position refers to a block of the CFG. Before starting the solver I need to initialize the values in the following way:

n0 (root node) = {0, 0, 0, n0, 0, ...}
N-n0 (remaining nodes) = {1,1,1,1..,1,1}

and this is what I am doing in the InitializeValue function:

/// Initializes the CFGBlock values for Dominators analysis i.e.:
/// * Dom(root) = {root}
/// * for each n in N - {root}, Dom(n) = N
void Dominators::InitializeValues(const CFG& cfg) {
   for(CFG::const_iterator I = cfg.begin(), E = cfg.end(); I != E; ++I) {
     Dominators::ValTy& D = getBlockDataMap()[*I];
     if(cfg.getEntry().getBlockID() == (*I)->getBlockID()) {
         D.resetValues(getAnalysisData());
         D((*I)->getBlockID(), getAnalysisData()) = IsDominator;
     } else
         D.setValues(getAnalysisData());
   }
}

My transfer function looks like this:

class TransferFuncs : public CFGRecStmtVisitor<TransferFuncs> {
   Dominators::AnalysisDataTy& AD;
   Dominators::ValTy Dominance;
public:
   TransferFuncs(Dominators::AnalysisDataTy& ad) : AD(ad) { Dominance.resetValues(AD); }

   Dominators::ValTy& getVal() { return Dominance; }
   CFG& getCFG() { return AD.getCFG(); }

   void VisitTerminator(CFGBlock* B) {
     Dominance(B->getBlockID(), AD) = IsDominator; // add this as dominator
   }

   void SetTopValue(Dominators::ValTy& V) { }
};

nothing difficult, but when I run the solver it gives me wrong results. :frowning:
I investigated the problem and it seems that the initial values I set in the initialize function are overwritten the first time the ProcessMerge() function is called by the DataflowSolver.

For what I understood from the code, the first time ProcessMerge() is called there is no edge data, therefore the last instruction in the function will always overwrite the existing block data with the one returned by the function TF.SetTopValue(), and in my case I don't have any TopValue to set.

To solve the problem I introduced a flag which track whether edge data exists. If not, the D.getBlockDataMap()[B] will not be overwritten.

What do you think?

cheers, Simone

This seems reasonable. Does it work with the existing analyses? (i.e., do all the clang tests pass)

it works for mine :slight_smile:
eheh... just joking, I will run the tests and let you know.

cheers, Simone

   

For what I understood from the code, the first time ProcessMerge() is called there is no edge data, therefore the last instruction in the function will always overwrite the existing block data with the one returned by the function TF.SetTopValue(), and in my case I don't have any TopValue to set.

To solve the problem I introduced a flag which track whether edge data exists. If not, the D.getBlockDataMap()[B] will not be overwritten.

This seems reasonable. Does it work with the existing analyses? (i.e., do all the clang tests pass)
     

it works for mine :slight_smile:
eheh... just joking, I will run the tests and let you know.
   

I run make tests from the root clang folder and none of the existing tests have been broken.

In attachment the patch file with the fix to the DataFlowSolver.h

cheers, Simone

DataFlowSolver.patch (1.19 KB)

I think this patch is mostly there, but the following assumes that ValTy is a bitvector:

+ // if there the block value has been initialized, and this is the first
+ // iteration, use the initialized value
+ if(noEdges and !D.getBlockDataMap()[B].sizesEqual(ValTy()))
+ Merge(V, D.getBlockDataMap()[B]);

'sizesEqual' should not appear anywhere in DataSolver.h. The solver needs to be generic to not assume specific details of ValTy.

A performance nit: here 'getBlockDataMap()' is called twice. That's wasteful as it requires two lookups in the hashtable instead of one.

Also, please use spaces instead of tabs.

I run make tests from the root clang folder and none of the existing tests have been broken.

In attachment the patch file with the fix to the DataFlowSolver.h
     

I think this patch is mostly there, but the following assumes that ValTy is a bitvector:

+ // if there the block value has been initialized, and this is the first
+ // iteration, use the initialized value
+ if(noEdges and !D.getBlockDataMap()[B].sizesEqual(ValTy()))
+ Merge(V, D.getBlockDataMap()[B]);

'sizesEqual' should not appear anywhere in DataSolver.h. The solver needs to be generic to not assume specific details of ValTy.
   

Hello again,
I removed the sizesEqual check. Now, I just search for the block B in the map, if there is no entry in the BlockDataMap it means the user didn't provide any initialization. In this way we didn't assume any specific detail of ValTy.

A performance nit: here 'getBlockDataMap()' is called twice. That's wasteful as it requires two lookups in the hashtable instead of one.
   

I also optimized the code by reducing the number of lookups to 1.

Also, please use spaces instead of tabs.

ok.

cheers, Simone

DataFlowSolver.patch (1.48 KB)

Hi Simone,

This is definitely a better solution, but this actually still does two hash table lookups. The 'insert' causes a second lookup in the DenseMap. That said, I think this is reasonable to go into the tree as is. The patch still contains tabs, but I've gone and corrected that. I've committed this patch in r109243.

If we want to remove the second lookup, consider that DataFlowSolver is based on generic programming. This means that we just need to expect that there is something generic that we expect in all ValTys:

ValTy &blockV = D.getBlockDataMap()[B];
bool isInitialized = blockV.isInitialized();
// If no edges have been found, it means this is the first time the solver
// has been called on block B, we copy the initialization values (if any)
// as current value for V (which will be used as edge data)
if(noEdges && isInitialized)
  Merge(V, blockV);
...

and then add 'isInitialized' to the specific ValTy class.

Looking at this solver again, I think this would be better done with a trait class. For example, instead of adding 'isInitialized()' we could have:

  bool isInitialized = dataflow::ValTrait<ValTy>::isInitialized(blockV);

Unfortunately, it looks like DataflowSolver (which I wrote a while ago) expects that ValTy has a 'copyValues' method (and others), so this same design issue exists in multiple places. This requires us to always define new classes for ValTy just for use in the DataflowSolver, and prohibits us from using basic types such as "int" or even pointers. A better approach would be to use a trait, e.g.: dataflow::ValTrait<ValTy>::copyValues(...,...). This way we could define these operations without requiring special classes for ValTy. This would be a nice cleanup.

Thanks for the patch!

Cheers,
Ted