LLVM instrumentation overhead

Hi,

I need to write a transform pass which instruments the target program to output the name of each function executed, and the rdtsc counter along with it.

Can anyone give me an idea of how to go about it? (I've worked with the LLVM pass framework and opt to do static analysis, but would like to do lightweight instrumentation.) Also, can anyone give an approximate idea of the overhead of such instrumentation?

Thanks
Nipun

Doing this in LLVM is really straightforward. You simply iterate through all the functions in a module and add instructions to their entry basic blocks to do whatever it is that you want to do.

I believe you already know how to find all the functions and their entry blocks. Review the Programmer's Guide and the doxygen docs on llvm::Module and llvm::Function if there's something you don't understand.

The only other question is how to insert instructions. For that, you can take one of two approaches. First, you can use the IRBuilder class (http://llvm.org/doxygen/classllvm_1_1IRBuilder.html). Second, you can use the constructors or static Create methods of the Instruction classes directly to create and insert the instructions that you want. I believe IRBuilder is now the preferred way to do things as its API changes less often.

For the instrumentation that you want to do, the easiest thing to do would be to insert a call in every function to a function that you implement in a run-time library that does whatever the instrumentation should do. This makes the compiler transform very simple.
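To give a rough idea of the run-time side, here is a minimal sketch of such a hook. The names (trace_function_entry, TraceEvent, g_events) are hypothetical and not part of LLVM, and the sketch assumes a single-threaded program and a fixed-size in-memory buffer. On x86 it reads the counter with the __rdtsc() intrinsic; elsewhere it falls back to a steady clock.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>
#include <cstdint>
#include <cstring>

#if defined(__x86_64__) || defined(__i386__)
#include <x86intrin.h>
// Read the CPU time-stamp counter via the compiler intrinsic.
inline uint64_t read_timestamp() { return __rdtsc(); }
#else
// Assumption: on non-x86 targets a steady clock stands in for RDTSC.
inline uint64_t read_timestamp() {
  return static_cast<uint64_t>(
      std::chrono::steady_clock::now().time_since_epoch().count());
}
#endif

// One logged event: which function was entered, and when.
struct TraceEvent {
  const char* function;
  uint64_t timestamp;
};

constexpr std::size_t kMaxEvents = 1024;  // hypothetical buffer size
TraceEvent g_events[kMaxEvents];
std::size_t g_event_count = 0;

// The pass would insert a call to this at the top of every function,
// passing a pointer to a constant string holding the function's name.
// Not thread-safe; events are dropped once the buffer is full.
extern "C" void trace_function_entry(const char* function_name) {
  if (g_event_count < kMaxEvents) {
    g_events[g_event_count].function = function_name;
    g_events[g_event_count].timestamp = read_timestamp();
    ++g_event_count;
  }
}
```

The extern "C" linkage keeps the symbol name unmangled, so the pass only has to declare a function with that name and signature in the module and emit a call to it.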

To make things faster, you could compile your run-time library as a static library and link it in using clang's/libLTO's link-time optimization. The functions in your run-time library can then be inlined into the program that you are instrumenting.

-- John T.

Hi John,

Thanks for the detailed answer, this gives me a good starting point to look into.

I was also wondering if you could give an idea (as a percentage) of the overhead one can expect with such instrumentation. I want something really lightweight and simple that can possibly be applied to production systems, so overhead is a concern.

Thanks
Nipun

I don't really know what the overhead would be (I'm terrible at guessing these things), but I imagine it would degrade performance sufficiently that at least some people would consider it too slow for production.

On a related note, we built a dynamic tracing tool called Giri which records, in a file, the execution of functions and basic blocks. The code is available from the llvm.org SVN repository (https://llvm.org/svn/llvm-project/giri/trunk) but is not actively maintained at present.

There are a few ideas from the Giri work that you may find useful:

1) We used mmap() to map the log file into application memory instead of using the write() system call to write data to the log. Using mmap() should improve performance because the OS doesn't have to copy the data between user-space and kernel-space; instead, when you unmap or msync the virtual page, the OS kernel just dumps the data to disk directly.

2) A significant issue with Giri was controlling how much RAM the instrumentation used. We opted to build Giri so that it would mmap() part of the log file into memory, write to that memory, and then unmap that region of the log and map in the next. Because the program can write to RAM much faster than the OS can flush to disk, we found that if we let the OS sync the data to disk asynchronously whenever it wanted, we would exhaust memory and slow things down. What we opted to do instead was to make the unmap synchronous, meaning that all data would be written to disk before proceeding to the next section of the log. This made controlling the memory consumption easier.
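The two points above can be sketched roughly as follows. This is not Giri's actual code; WindowedLog and its methods are hypothetical names, and the sketch assumes a POSIX system and a window size that is a multiple of the page size. It maps one window of the log at a time and does a synchronous msync() before moving to the next window.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

class WindowedLog {
 public:
  // window_bytes must be a multiple of the page size, because mmap()
  // requires page-aligned file offsets.
  WindowedLog(const char* path, std::size_t window_bytes)
      : fd_(open(path, O_RDWR | O_CREAT | O_TRUNC, 0644)),
        window_bytes_(window_bytes) {
    if (fd_ >= 0) map_window();
  }
  ~WindowedLog() { close_log(); }
  WindowedLog(const WindowedLog&) = delete;
  WindowedLog& operator=(const WindowedLog&) = delete;

  bool ok() const { return fd_ >= 0 && window_ != nullptr; }

  // Copy len bytes into the log, moving to the next window when full.
  bool append(const void* data, std::size_t len) {
    const char* src = static_cast<const char*>(data);
    while (len > 0) {
      if (!ok()) return false;
      if (used_ == window_bytes_ && !advance()) return false;
      std::size_t n = std::min(len, window_bytes_ - used_);
      std::memcpy(window_ + used_, src, n);
      used_ += n;
      src += n;
      len -= n;
    }
    return true;
  }

  // Flush the current window and trim the file to the bytes written.
  bool close_log() {
    if (fd_ < 0) return false;
    off_t total = file_offset_ + static_cast<off_t>(used_);
    unmap_window();
    bool trimmed = (ftruncate(fd_, total) == 0);
    close(fd_);
    fd_ = -1;
    return trimmed;
  }

 private:
  bool map_window() {
    // Grow the file so the whole window is backed by file storage.
    if (ftruncate(fd_, file_offset_ + static_cast<off_t>(window_bytes_)) != 0)
      return false;
    void* p = mmap(nullptr, window_bytes_, PROT_READ | PROT_WRITE,
                   MAP_SHARED, fd_, file_offset_);
    if (p == MAP_FAILED) return false;
    window_ = static_cast<char*>(p);
    used_ = 0;
    return true;
  }
  void unmap_window() {
    if (window_ == nullptr) return;
    msync(window_, window_bytes_, MS_SYNC);  // block until data is on disk
    munmap(window_, window_bytes_);
    window_ = nullptr;
  }
  bool advance() {
    unmap_window();
    file_offset_ += static_cast<off_t>(window_bytes_);
    return map_window();
  }

  int fd_;
  std::size_t window_bytes_;
  char* window_ = nullptr;
  std::size_t used_ = 0;
  off_t file_offset_ = 0;
};
```

The window size is the knob that bounds memory use: at most one window of dirty log pages is outstanding at any time, at the cost of stalling the program during each synchronous flush.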

Some other ideas:

1) Don't log function names; assign each function a numeric ID and log that ID. That will reduce the amount of data you need to log during execution to a 32-bit number and the RDTSC value.

2) Consider using a helper thread to write data to disk.

3) You might be able to play some games using the call graph. For example, if you know that function A, when called, will always call function B which will always call function C, then you only need to instrument function A instead of A, B, and C.

4) There was work at PLDI 2010 (IIRC) on creating hashes of the call stack (i.e., a single hash value could tell you the current function, its caller, the caller's caller, etc). Utilizing this technique may reduce the number of instrumentation points in the program.
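As an illustration of idea 1, here is a minimal sketch of a name-to-ID table plus a compact 12-byte log record (a 32-bit ID followed by the 64-bit counter value). FunctionIdTable, encode_record, and decode_record are hypothetical names. In a real pass I would expect the IDs to be assigned at compile time and the table written to a side file, so that only the numeric ID is passed to the run-time hook.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <string>
#include <unordered_map>
#include <vector>

// Assigns each distinct function name a small sequential ID and keeps the
// reverse mapping so a log can be translated back to names offline.
class FunctionIdTable {
 public:
  uint32_t id_for(const std::string& name) {
    auto it = ids_.find(name);
    if (it != ids_.end()) return it->second;
    uint32_t id = static_cast<uint32_t>(names_.size());
    ids_.emplace(name, id);
    names_.push_back(name);
    return id;
  }
  const std::string& name_of(uint32_t id) const { return names_.at(id); }

 private:
  std::unordered_map<std::string, uint32_t> ids_;
  std::vector<std::string> names_;
};

// Each record is 4 bytes of ID followed by 8 bytes of timestamp, stored in
// host byte order with no padding.
constexpr std::size_t kRecordBytes = sizeof(uint32_t) + sizeof(uint64_t);

inline void encode_record(uint32_t id, uint64_t tsc, unsigned char* out) {
  std::memcpy(out, &id, sizeof(id));
  std::memcpy(out + sizeof(id), &tsc, sizeof(tsc));
}

inline void decode_record(const unsigned char* in, uint32_t* id,
                          uint64_t* tsc) {
  std::memcpy(id, in, sizeof(*id));
  std::memcpy(tsc, in + sizeof(*id), sizeof(*tsc));
}
```

Using memcpy rather than a struct avoids the padding a compiler would normally insert after the 32-bit field, which would otherwise grow each record to 16 bytes.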

-- John T.