Hi John,
Thanks for the detailed answer, this gives me a good starting point to look into.
I was also wondering if you could give an idea (in terms of %ge) the overhead one can expect with such an instrumentation. I want something really lightweight and simple which can possible be applied to production systems, so overhead is a concern.
I don't really know what the overhead would be (I'm terrible at guessing these things), but I imagine it would degrade performance sufficiently that at least some people would consider it too slow for production.
On a related note, we built a dynamic tracing tool called giri which records, in a file, the execution of functions and basic blocks. The code is available from the llvm.org SVN repository (https://llvm.org/svn/llvm-project/giri/trunk) but is not actively maintained at present.
There are a few ideas from the Giri work that you may find useful:
1) We used mmap() to map the log file into application memory instead of using the write() system call to write data to the log. Using mmap() should improve performance because the OS doesn't have to copy the data between user-space and kernel-space; instead, when you unmap or msync the virtual page, the OS kernel just dumps the data to disk directly.
2) A significant issue with giri was controlling how much RAM the instrumentation used. We opted to build giri so that it would mmap() part of the log file into memory, write it that memory, and then unmap that region of the log and map in the next. Since RAM is always faster than disk, we found that if we let the OS sync the data to disk asynchronously whenever it wanted, we would exhaust memory and slow things down. What we opted to do instead was to make the unmap synchronous, meaning that all data would be written to disk before proceeding to the next section of the log. This made controlling the memory consumption easier.
Some other ideas:
1) Don't log function names; assign each function a numeric ID and log that ID. That will reduce the amount of data you need to log during execution to a 32-bit number and the RDTSC value.
2) Consider using a helper thread to write data to disk.
3) You might be able to play some games using the call graph. For example, if you know that function A, when called, will always call function B which will always call function C, then you only need to instrument function A instead of A, B, and C.
4) There was work at PLDI 2010 (IIRC) on creating hashes of the call stack (i.e., a single hash value could tell you the current function, its caller, the caller's caller, etc). Utilizing this technique may reduce the number of instrumentation points in the program.
-- John T.