A howto or template for adding a transformation pass

Hello,

I’d like to find the best method of adding a transformation pass that uses the AST and the CFG. Is there a template or a how-to on this?

My goal in summary is to:

  1. Determine declarations of certain new statements (beginD and endD) in the C source (without changing the front-end; they are simple macros).
  2. Replace the statements with certain macros with additional labels.
  3. Output the changed C file.

I’ve successfully tried the following:

  • Added an Analysis consumer, but I noticed that this Analysis consumer method is invoked twice; once for each function in the C source. I’d like to have access to the full Analysis and then walk it using walker methods.
  • Added an ASTConsumer, but again, this is invoked twice. Once per function.

Your suggestions would be much appreciated.

~ Hiren

Hello,

I'd like to find the best method of adding a transformation pass that uses the AST and the CFG. Is there a template or a how-to on this?

My goal in summary is to:

1. Determine declarations of certain new statements (beginD and endD) in the C source (without changing the front-end; they are simple macros).

I admit that I'm a little confused. I'm not quite certain what you mean by "determine declarations of certain new statements." Are you talking about determining where in a function/method body you would want to insert code, or are you scanning for particular statements to replace?

2. Replace the statements with certain macros with additional labels.

Which statements are you replacing? Inserting macros should be fairly straightforward by using the rewriter (this is a textual translation by editing the source code, not editing the ASTs).

4. Output the changed C file.

If you are using the rewriter to do your edits, this is very straightforward. I would like at how the Objective-C -> C translator (RewriteObjC) uses the Rewriter to make edits to the source file and then blast the new source to disk.

I've successfully tried the following:

- Added an Analysis consumer, but I noticed that this Analysis consumer method is invoked twice; once for each function in the C source. I'd like to have access to the full Analysis and then walk it using walker methods.
- Added an ASTConsumer, but again, this is invoked twice. Once per function.

I'm not certain what you mean by the "invoked twice" part. ASTConsumer implements an action/visitor interface where it receives callbacks for each top level declaration (HandleTopLevelDecl) as well as a single call back once the entire translation unit has been parsed (HandleTranslationUnit).

I don't have enough context for what you are trying to accomplish, so it's difficult for me to advise you on the best course of action. For example, I'm not certain what you need the CFGs for, so it's hard for me to say whether you should just implement another callback function for AnalysisConsumer (to implement a new analysis) or implement a new ASTConsumer. The Objective-C rewriter, for example, is implemented as an ASTConsumer, where the major rewriting work is done in HandleTranslationUnit.

My gut feeling is that you want to use the Objective-C rewriter as your model, and that you want to rewrite pieces of the C source file by removing some statements and inserting others. You're probably just going to want to write a fresh ASTConsumer that contains a Rewriter object. You can either scan for functions/methods to edit using HandleTopLevelDecl, or you can wait until HandleTranslationUnit is called to scan all the declarations at once. In HandleTranlsationUnit you would also call the appropriate methods on the Rewriter to emit your changed source code to disk. If you need CFGs, it's very easy to create a CFG using a single method: CFG::BuildCFG (just pass as an argument the Stmt* that represents the body of the function or method you want to analyze).

Best,
Ted

Hello,

I'd like to find the best method of adding a transformation pass that uses the AST and the CFG. Is there a template or a how-to on this?

My goal in summary is to:

1. Determine declarations of certain new statements (beginD and endD) in the C source (without changing the front-end; they are simple macros).

I admit that I'm a little confused. I'm not quite certain what you mean by "determine declarations of certain new statements." Are you talking about determining where in a function/method body you would want to insert code, or are you scanning for particular statements to replace?

2. Replace the statements with certain macros with additional labels.

Which statements are you replacing? Inserting macros should be fairly straightforward by using the rewriter (this is a textual translation by editing the source code, not editing the ASTs).

4. Output the changed C file.

If you are using the rewriter to do your edits, this is very straightforward. I would like at how the Objective-C -> C translator (RewriteObjC) uses the Rewriter to make edits to the source file and then blast the new source to disk.

I've successfully tried the following:

- Added an Analysis consumer, but I noticed that this Analysis consumer method is invoked twice; once for each function in the C source. I'd like to have access to the full Analysis and then walk it using walker methods.
- Added an ASTConsumer, but again, this is invoked twice. Once per function.

I'm not certain what you mean by the "invoked twice" part. ASTConsumer implements an action/visitor interface where it receives callbacks for each top level declaration (HandleTopLevelDecl) as well as a single call back once the entire translation unit has been parsed (HandleTranslationUnit).

I don't have enough context for what you are trying to accomplish, so it's difficult for me to advise you on the best course of action. For example, I'm not certain what you need the CFGs for, so it's hard for me to say whether you should just implement another callback function for AnalysisConsumer (to implement a new analysis) or implement a new ASTConsumer. The Objective-C rewriter, for example, is implemented as an ASTConsumer, where the major rewriting work is done in HandleTranslationUnit.

My gut feeling is that you want to use the Objective-C rewriter as your model, and that you want to rewrite pieces of the C source file by removing some statements and inserting others. You're probably just going to want to write a fresh ASTConsumer that contains a Rewriter object. You can either scan for functions/methods to edit using HandleTopLevelDecl, or you can wait until HandleTranslationUnit is called to scan all the declarations at once. In HandleTranlsationUnit you would also call the appropriate methods on the Rewriter to emit your changed source code to disk. If you need CFGs, it's very easy to create a CFG using a single method: CFG::BuildCFG (just pass as an argument the Stmt* that represents the body of the function or method you want to analyze).

Best,
Ted

My apologies for the vague question formulation. I'll explain my objective with an example.

My apologies for the vague question formulation. I'll explain my objective with an example.

My initial source is as follows:
unsigned int fib(unsigned int m) {
unsigned int f0 = 0 ;
unsigned int f1 = 1, f2, i ;
unsigned int l1 = 0;
if (m <= 1) {
MARKSTART();
printf("Return\n");
MARKEND();
return m;
} else {
for (i = 2; i <=m; i++) {
f2 = f0 + f1;
f0 = f1;
f1 = f2;
for (l1 = 0; l1 <5; l1++) {
printf("woof\n");
MARKSTART();
}
MARKEND();
MARKSTART();
}
MARKEND();
return f2;
}
}

The MARKSTART and MARKEND represent MACROs that implement a certain instruction in a particular processor. The semantics of the MARKEND and MARKSTART can be simply thought of as starting and stopping cycle timers for segments of code between the MARKSTART and MARKEND. The resulting code after the transformation is as follows:
unsigned int fib(unsigned int m) {
unsigned int f0 = 0 ;
unsigned int f1 = 1, f2, i ;
unsigned int l1 = 0;
if (m <= 1) {
MARKSTART0("fib_seq0");
printf("Return\n");
MARKEND0("fib_seq0");
return m;
} else {
for (i = 2; i <=m; i++) {
MARKSTARTCOUNT("bb_k");
f2 = f0 + f1;
f0 = f1;
f1 = f2;
MARKENDCOUNT("bb_k");
for (l1 = 0; l1 <5; l1++) {
printf("woof\n");
MARKSTART2("fib_loop1");
}
MARKEND2("fib_loop1");
MARKSTART1("fib_loop0");
}
MARKEND1("fib_loop0");
return f2;
}
}

The number after a MARKSTART and MARKEND denotes a timer in the processor and the string, a unique label identifying blocks of interest used during profiling this code. We have certain rules that form a block, such as, a MARKSTART must always follow by a MARKEND and use of timers must be correctly nested. Notice the MARK*COUNT as only one example of labeling one basic block in the source. I would like to add these for every BB in the CFG as well. So the tasks here are:

Hi Hiren,

My apologies for the delay in my reply. I’ll address each of your points inline. In general, I think there are two views you can take of the source:

  1. The constructed ASTs.

  2. A token stream, which literally consists of the stream of tokens passed to/returned from the preprocessor. A good example of this is in RewriteMacros.cpp in Driver.

We have accurate SourceLocation information for both, but both are different views of the source code. Since macros can expand to arbitrary amounts of code (including code that contains control flow), their is a tricky mapping from macros to their expanded statements, or even the basic blocks in the CFG.

I assume that by default MARKSTART and MARKEND expand to nothing. Using the token stream approach, you could record the SourceLocations for the approprate macros, and then using the AST approach you can conceivably search for the AST Stmt* that appears closest to those recorded locations (this may require two passes). This might be really tricky. An easier approach might be to have both macros expand (when using clang) to the “annotate” attribute, which can then be affixed to any AST node. Your analysis could then look for Stmt* with the annotation that you are interested in.

Once you have computed what new MARKSTARTXXX and MARKENDXXX text you want to insert, however, it is easy to add these back to the source. Using the rewriter, you can splice in text at any arbitrary SourceLocation.

- Identify MARKSTART and MARKEND, allocate a cycle timer such that there are no conflicts between the START and END of a block.

I’m not quite certain what you mean by “conflicts”, but I assume you are referring to the nesting property of MARKSTART and MARKEND above. Once you determine the locations of the MARKSTART and MARKEND macros, determine their association to the appropriate statements, the allocating a cycle timer, etc., may just result in doing a walk on the AST, since the AST is hierarchical and represents the nested structure of the code.

- Identify MARKSTART and MARKEND, add a unique label used in profiling. The labels much match the START and END that form a block.
- For every basic block, add MARKSTARTCOUNT and MARKENDCOUNT respectively with a unique BB label.

Note that Basic Blocks in the CFG include lots of control-flow constructs: if statements, loops, ternary operators (?), short circuit evaluation (&&, ||), gotos, etc. If you just want to profile things like loops, you might not even need to go to the CFG at all unless you want to support gotos. With gotos, you then have to identify the loop structure yourself. I think we have analyses like this at the LLVM IR level, but nothing like that yet for source-level analysis.

Right now the CFG also has some oddities with DeclStmts. If a Decl represents a multiple variable declaration (e.g., int x, y, z), in the CFG the single DeclStmt in the AST is transformed into three DeclStmts to explicitly represent their control-flow dependency. These new DeclStmts don’t appear in the original AST, but they should have the correct sourcelocation information. This solution is likely to change soon.

That said, each basic block is uniquely numbered. I don’t know all the details of your analysis, but I think the hardest part of your task will be identifying the locations of the original macros and mapping those to the nearest AST Stmt*.