Measure execution time of each basic block

Guys,

Does anyone have any idea how I could measure the execution time of each LLVM basic block of a program?

I tried some profiling tools like gcov and perf, but as far as I know they can only tell me how often each basic block is executed.

I was thinking about writing a pass that adds PAPI calls to each LLVM basic block. Do you think that is a good idea?

Thanks

Vanderson M. Rosario

I think if you do this, you're quickly going to realize that there's quite a lot of overhead in getting the time stamps needed to record basic block duration, so you're not going to get accurate results except for really big basic blocks.

Cheers,

Jon

Jon,
I need to create a database of basic blocks and their execution times. The only thing I care about is whether a block A is more expensive than a block B.
Do you think that even with the overhead I would be able to get the A > B information?
Like: overhead + time(A) > overhead + time(B) => A > B.
If so, I’m not too concerned about the accuracy.

Not sure if I was clear,

Thanks again

Depends on your platform. Different counters have different resolutions, anywhere from milliseconds down to tens of cycles, and the overhead itself varies from one reading to the next, which means that 'overhead + time(A) > overhead + time(B) => A > B' is not necessarily true. When overhead >> time(A) (or time(B)), you're not going to get a meaningful measurement.
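To make the point about variable overhead concrete, here is a minimal sketch in Python. It is only an illustration, not PAPI or an LLVM pass: time.perf_counter_ns stands in for whatever counter you end up using, and the helper names timer_read_deltas and ordering_is_guaranteed are made up for this example. The assumed model is measured(X) = time(X) + overhead, where the overhead of each measurement falls somewhere between overhead_min and overhead_max.

```python
import statistics
import time


def timer_read_deltas(samples=10_000):
    """Return the gaps (in ns) between back-to-back timer reads.

    Nothing runs between the two reads, so each gap is the cost of
    reading the timer plus whatever jitter the platform adds.
    """
    deltas = []
    for _ in range(samples):
        start = time.perf_counter_ns()
        end = time.perf_counter_ns()
        deltas.append(end - start)
    return deltas


def ordering_is_guaranteed(time_a, time_b, overhead_min, overhead_max):
    """Check whether the measured ordering of A and B can be trusted.

    Assumed model: measured(X) = time(X) + overhead, with each overhead
    somewhere in [overhead_min, overhead_max]. The measured ordering is
    only certain to match the true one when the true difference is
    larger than the spread of the overhead.
    """
    return abs(time_a - time_b) > (overhead_max - overhead_min)


if __name__ == "__main__":
    deltas = timer_read_deltas()
    print("min gap (ns):   ", min(deltas))
    print("median gap (ns):", statistics.median(deltas))
    print("max gap (ns):   ", max(deltas))
    # Hypothetical block times of 110 ns and 100 ns, checked against
    # the spread of the overhead observed on this machine.
    print("A vs B ordering trustworthy:",
          ordering_is_guaranteed(110, 100, min(deltas), max(deltas)))
```

If the spread between the smallest and largest gap is larger than the difference between two blocks, a single measurement cannot order them; repeating the measurement many times and comparing medians is one way to narrow that spread.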

Cheers,

Jon

If you are on x86, you can use the rdtsc/rdtscp LLVM intrinsics:
http://permalink.gmane.org/gmane.comp.compilers.llvm.cvs/185208

With RDTSCP you can fence both ends of your basic block, so you can be sure that out-of-order (OOO) scheduling does not pull in instructions that you don’t intend to measure. Note, however, that this adds a huge overhead.

RDTSC has a much lower overhead since it only reads the TSC register, but if you are measuring a small basic block it won’t be accurate.

thanks,
sathvik

rdtsc is pretty good, but keep in mind that the TSC register is not kept consistent across cores. In general, it is consistent across cores on one CPU socket but not across sockets. That said, I’ve seen cases with a huge difference across cores on the same CPU socket (this was because of BIOS settings used during the boot cycle). If you can lock your thread(s) to specific cores, you can avoid these issues.
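As a rough sketch of the core-locking idea, assuming Linux and using Python as a launcher around the instrumented program: os.sched_getaffinity and os.sched_setaffinity are real standard-library calls (Linux only), while the helper name pinned_to_one_core is made up for this example.

```python
import contextlib
import os


@contextlib.contextmanager
def pinned_to_one_core(core=None):
    """Pin the caller to a single core, then restore the old mask.

    If no core is given, the lowest-numbered core the caller is
    currently allowed to run on is used.
    """
    if not hasattr(os, "sched_setaffinity"):
        raise OSError("CPU affinity control is not available here")
    original = os.sched_getaffinity(0)  # 0 means "the caller"
    target = min(original) if core is None else core
    os.sched_setaffinity(0, {target})
    try:
        yield target
    finally:
        # Put the original set of allowed cores back.
        os.sched_setaffinity(0, original)


if __name__ == "__main__":
    with pinned_to_one_core() as core:
        print("running on core", core)
        # Timestamps read in here all come from the same core.
```

A program started from inside the with block (for example with subprocess.run) inherits the affinity mask, so every TSC read it performs comes from that one core.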

-bean

Hi,

For perf, this really depends on what you’re sampling on. For your use case, it seems that sampling CPU cycles would give the data that you want.

If you run something like perf record -e cycles -c 1000000 /path/to/program, perf will collect one sample every 1,000,000 CPU cycles. On a 2GHz CPU, that means one sample every 0.5ms, so the time taken by a basic block is roughly the number of samples in that block times 0.5ms.
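The arithmetic can be written down as a small Python sketch. The function names are made up for this example, and the count of 40 samples is purely illustrative, not a measurement; the estimate also assumes the clock stays at its nominal frequency for the whole run.

```python
def seconds_per_sample(sample_period_cycles, cpu_hz):
    """Time covered by one perf sample, in seconds."""
    return sample_period_cycles / cpu_hz


def estimate_block_seconds(samples_in_block, sample_period_cycles, cpu_hz):
    """Rough time spent in a basic block: samples times time per sample."""
    return samples_in_block * seconds_per_sample(sample_period_cycles, cpu_hz)


if __name__ == "__main__":
    period = 1_000_000       # cycles between samples (perf record -c 1000000)
    cpu_hz = 2_000_000_000   # 2GHz CPU
    print("seconds per sample:", seconds_per_sample(period, cpu_hz))
    # Illustrative only: a block that collected 40 samples.
    print("estimated block time (s):",
          estimate_block_seconds(40, period, cpu_hz))
```

Blocks that collect only a handful of samples get a very coarse estimate, so a longer run or a smaller sample period helps when comparing short blocks.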

gcov, on the other hand, will give you the number of times each basic block was executed, which does not tell you much about the running time.

Cheers,
Jonas

PS: try to just run perf stat /path/to/program to see the relationships between cycles, runtime, and other metrics.