GSoC Proposal: Inter-Procedure Program Slicing in LLVM

Hi all,

This is a GSoC 2013 proposal for LLVM project. Please see the formatted version at here: http://pacman.cs.tsinghua.edu.cn/~liuml07/files/gsoc2013-proposal-program-slicing.pdf

Program slicing has been used in many applications, the criteria of which is a pair of statement and variables. I would like to write an inter-procedural program slicing pass in LLVM, which is able to calculate C program slices of source code effectively. There is no previous work implemented in LLVM, which considers both the dynamic program slicing and source code of the sliced program. Program slice contains all statements in a program that directly or indirectly act the value of a variable occurrence [5]. Program slicing has been used in many applications, e.g., program verification, testing, maintenance, automatic parallelization of program execution, automatic integration of program versions.

While it’s straightforward to implement the slicer in the back-end of compiler using SSA form, the source code of the original program instead of intermediate
representation is preferred in most cases. Moreover, we can further narrow the notion of slice, which contains statements that influence the value of a
variable occurrence for speci c program inputs. This is referred as dynamic program slicing [1]. However, there is no previous work implemented in LLVM
which solved the two problems.

There are two public projects which implement the backwards static slicing in LLVM.

  • Giri Written by John Criswell from UIUC, a subproject of LLVM. The Giri code contains the static backwards intra-procedure slicing passes, and runs with an older version of LLVM. It also only backtracks until it hits a load. Additional code must be written to backtrack further to find potentially reaching stores.
  • LLVMSlicer This implementation is a complete static backwards slicer from Masaryk University. It works on the well de ned data and control flow equations in a white paper by F. Tip [4]. However, this code was written for special purpose, thus it’s not general enough to be use by others. They implemented the Andersen’s alias algorithm [2], callgraph, and modi es analysis to support the slicer, instead of using the LLVM APIs.

They eliminate the useless IR statements and keep the ones a ect the values of the criteria. However, neither of them generates the compilable source code
slice, which is heavily needed in reality. There are several ways to do this. One is to generate the source code from sliced IR using llc tool. The issue is that
the IR is not concerned with high-level semantic. The generated source code is different from the original program and not suitable for human reading. An-
other approach is deleting the source code according to sliced IR, with line number information (in meta-data of each instruction). The naive script deleting
sliced source code one by one fails to handle tricky cases. I think a better source code slicer is to take use of the AST info.

There are three main parts of this work.

  1. First, borrow an implementation of static back-wards slicing from Giri or LLVMSlicer, and use the LLVM callgraph, mod/ref and alias interfaces as much as possible.
  2. Second, implement the dynamic program slicing using the approach 3 in the paper [1].
  3. Third, generate the source code of the sliced program. To make the sliced source code compile directly, we need to employ clang front-end

The final result of this summer of code is to make this pass work e effectively and documented well. Further, I’ll write test cases and behave as the active
maintainer for this project. My long-term plan is to add more features, e.g. Objective-C/C++ support, thing slicing [3], to this project.

Any comment is highly appreciated.

[1] H AGRAWAL. Dynamic Program Slicing. In ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI),1990.

2] L.O. Andersen. Program analysis and specialization for the C programming language. PhD thesis, University of Cophenhagen, Germany, 1994.
[3] M. Sridharan, S.J. Fink, and R. Bodik. Thin slicing. In PLDI’07, 2007.
[4] F. Tip. A survey of program slicing techniques.Journal of Programming Languages, 1995.
[5] M. Weiser. Program slicing. In Proceedings of the 5th International Conference on Software Engineering, ICSE, pages 439{449. IEEE, 1981.

I wanted to comment up-front to our GSoC mentors this year that an LLVM dynamic slicing pass is one of the most often requested components we get from the research community. Actually, there is a dynamic slicing tool built using LLVM, and it maps LLVM IR statements to source-level statements for its output using the debug metadata. Swarup Sahoo and I built it and used it in our recent ASPLOS paper (). The tool is named Giri. As an aside, the “Giri” code you described below is a static slicing pass that we originally intended to add to our dynamic slicing code and release as one project called “Giri.” That project has stalled, and if it doesn’t get revitalized, I think it should be moved elsewhere (perhaps github?). What I think would be good for your project would be to take the Giri code that we’ve developed, update it to LLVM mainline, and make it more robust. We can provide the code to you if you get selected for GSoC. Unfortunately, I won’t be mentoring this year. Perhaps Swarup (CC’ed above) or someone else will be interested. If you want to generate compilable source code, then I think you either need to go with the first option (using llc with the C backend) or you need to implement part or all of the slicing code within Clang. The second option you describe (using debug meta-data) will probably be too fragile to work in practice; debug information can be removed, and it may not capture things like the inclusion of header files. That said, can you provide citations showing that being able to compile the slice is needed? Does the slice need to be readable or just compilable? It is not clear to me that a compilable and human-readable slice is needed by most applications needing dynamic slicing. I suspect only one or the other is needed. For what purpose would you use the static analysis? Would you use it to optimize the instrumentation needed for dynamic slicing? You should take a look at Giri’s algorithm in our ASPLOS paper. While not novel, it does have a nifty feature that uses the SSA graph to reduce the number of events that it needs to trace to create the dynamic slice. What is your experience with LLVM IR and/or Clang ASTs? Taking our Giri code and making it robust seems doable over the summer. Building something with Clang AST’s and LLVM IR and having it solid and well documented by end of summer seems a bit ambitious. I think your proposal should make a case that dynamic slicing is a very useful thing. For example, you should cite some papers that make use of it. While I know that dynamic slicing is useful (we keep making one-off distributions of our dynamic slicing code for researchers), that is not the most convincing argument. If you can find real-world tools that use it (or could benefit from it), that’s even better. Second, your proposal should list your qualifications for the project. Why are you the right person for this task? Do you have experience writing code? Working with dynamic slicing? Have you written LLVM/Clang passes before? Third, you should explain how all the parts fit together. If you’re working on dynamic slicing, it’s not clear to me why you want a static inter-procedural slicing pass. Fourth, you should discuss how your passes will integrate into the LLVM toolchain. Will you be adding an option to clang? Will you build a replacement libLTO to do the inter-procedural slicing, or do you expect people to use opt to run it? – John T.

Hi all,

I had a second thought of the dynamic slicing, as well as the source code generating.

Firstly, the dynamic slicing is very useful to software community (I’ll illustrate more in the refined proposal later), but it’s already implemented by Swarup and John Criswell from UIUC. The static slicing code has been released as Giri project in LLVM, and they would kindly release the dynamic slicing too if this proposal is accepted.

I’d like to study and enhance the Giri code to make it better. However, I don’t know the exact part to which I can contribute. I’ll ask Swarup (I cc this email to him) for help. I would be honored to be directed by him if this proposal is selected.

Moreover, I don’t know how the static and dynamic slicing fit together. Does the static code of Giri need some improvements? For example, taking into consideration the pointer aliases. Our group need the static slicing to generate I/O benchmarks (see below please).

Secondly, to generate source code, which is human-readable and compilable seems a bit ambitious. My previous scripts in python can generate the source code of the original program by deleting useless line of code. It’s kind of tricky. We used dozens of regular expressions to match the boundary of blocks, loops and functions to avoid deleting lines unexpectedly. I thought employing clang was a better idea. I don’t know whether you think the source code genration is a good idea or not. I can release our simple script and improve it later. So I have more time/energy to contribute to the dynamic slicer.

I think the slicing pass can be loaded by the opt. There is no need to change the front-end or other passes. I’m not sure whether the link time optimization can be exploited. We borrowed the LLVMSlicer code previously without LTO.

It is not immediately obvious to me that dynamic slicing can be improved by a static slicing analysis. As far as improvements to the dynamic slicing Giri code, I can think of several: 1) Adding support to correctly handle asynchronous events (e.g., signal handlers). 2) Updating the code to LLVM mainline and putting it into the giri SVN repository 3) Reducing the trace size. Giri currently records the execution of each basic block, load and store, call, and return instruction. There are things you can do to make the trace smaller (e.g., creating one trace record for control-equivalent basic blocks). 4) Making the giri run-time library thread safe and the slicing thread/process aware. Right now, I think the events of all processes and threads get thrown together in one trace file. Each should have a separate trace file, or trace records should indicate which thread is performing a particular operation. 5) Improving the giri run-time. The current run-time library maps a portion of the trace file into memory, writes trace records into it, and then synchronously munmaps it and then maps in the next portion of the trace file. This design ensures that we don’t swamp memory (the application can produce trace records faster than the OS can write them to disk), but it doesn’t overlap computation with I/O very well. The run-time also has a static size for how many trace records to hold in memory before flushing to disk; this value should be computed dynamically. In short, I think you would need to: a) Make Giri up to date with LLVM and make it robust. b) Profile it to see where to improve performance c) Make improvements that improve performance or reduce trace size Your final proposal should cite other applications that use (or could benefit from) dynamic slicing. Our ASPLOS 2013 paper would be an example, but you should look for and cite several papers about several different applications from several different groups. Industry groups would be a plus. Saying that one or two people use dynamic slicing isn’t all that convincing; saying that x different groups use it, including y industry groups, would be far more compelling. To get you started, you may want to look at this paper by Johnson et. al.: . – John T.

We already have a way to handle asynchronous events though it is not perfect.

One more improvement I can think of is handling external library calls which is not complete for some calls.

Thanks,
Swarup.