Motivation
To collect source-based code coverage, Clang profiling instrumentation inserts eight byte counters that are incremented when a part of the program is executed. When we are only interested in whether a line is executed rather than in its execution count, we can use a one byte counter instead of an eight byte counter. The Lightweight Instrumentation idea was originally proposed to reduce the overhead of IR instrumentation for PGO. Since PGO and coverage use the same IR instrumentation, we repurpose this idea for coverage.
Proposal
We propose adding a mode to use single byte counters in source-based code coverage. We implemented a prototype in llvm/llvm-project Pull Request #75425, "[InstrProf] Single byte counters in coverage" by gulfemsavrun (previously D126586, "[InstrProf] Single byte counters in coverage"). Eight byte counter mode infers counters by adding or subtracting two execution counts whenever there is a control-flow merge. For example, the picture below shows counter insertion for an if statement. Clang inserts counter 0 in the parent block and counter 1 in the if.then block, and it can infer the counter for the if.else block by subtracting the if.then counter from the parent counter.
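The subtraction described above can be sketched in a few lines of C++. This is only an illustration of the idea, not the code Clang emits: the struct, field, and function names are hypothetical stand-ins for counter 0 and counter 1.

```cpp
#include <cstdint>

// Hypothetical stand-ins for the two eight byte counters of one function
// that contains a single if statement.
struct IfCounters {
    std::uint64_t parent = 0;      // counter 0: parent block
    std::uint64_t then_count = 0;  // counter 1: if.then block
};

// Instrumented body: only the parent and if.then blocks carry a counter.
void runInstrumented(IfCounters &c, bool cond) {
    ++c.parent;
    if (cond) {
        ++c.then_count;
    }
    // The if.else path has no counter of its own.
}

// The if.else count is derived afterwards, not measured directly.
std::uint64_t inferredElseCount(const IfCounters &c) {
    return c.parent - c.then_count;
}
```

For instance, after one call with a true condition and two calls with a false condition, the inferred if.else count is 3 - 1 = 2.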
However, we cannot infer counters in single byte coverage mode, because a one byte counter records only whether a region was executed, not how many times. Therefore, we conservatively insert additional counters for the cases where we would otherwise add or subtract counters. Future work might improve this with a better algorithm that inserts fewer counters in single byte counters mode.
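A minimal sketch of why inference breaks down with one byte counters, using hypothetical names and assuming, purely for illustration, that a flag value of 1 means "executed at least once". Two different runs leave the parent and if.then flags identical, so only an additional flag in if.else can tell them apart.

```cpp
#include <cstdint>
#include <vector>

// Hypothetical one byte coverage flags for a function with one if statement.
// The encoding (1 = covered) is an assumption made for this sketch.
struct IfFlags {
    std::uint8_t parent = 0;
    std::uint8_t then_flag = 0;
    std::uint8_t else_flag = 0;  // the additional counter single byte mode needs
};

void runOnce(IfFlags &f, bool cond) {
    f.parent = 1;
    if (cond) {
        f.then_flag = 1;
    } else {
        f.else_flag = 1;
    }
}

// Runs the instrumented function once per condition value.
IfFlags runAll(const std::vector<bool> &conds) {
    IfFlags f;
    for (bool c : conds) {
        runOnce(f, c);
    }
    return f;
}
```

Running with conditions {true} and with {true, false} gives the same parent and then_flag values in both cases; only else_flag differs, which is why it cannot be derived from the other two.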
Evaluation
We evaluated the performance and size impact of our prototype. We used Clang version 18.0.0 with no instrumentation as our baseline, which we refer to as "No coverage". We report performance speedups at optimization level O2, but we also measured performance at O1 and O3: we saw greater speedups at O1, and O3 performance is very similar to O2. We present results for single byte counters on x86 and ARM architectures in 32-bit and 64-bit modes. Our hardware configuration consists of an x86-based Intel Xeon system, an ARMv7 system that uses Cortex-A53 cores, and an AArch64 system that uses ThunderX2 cores. We used Dhrystone from llvm-test-suite and CoreMark to measure the performance benefit of single byte counters. We also measured the size of the Clang binary to evaluate the size impact of single byte counters.
Code Snippets
The table above shows the instrumentation code snippets generated for the Proc4 method in Dhrystone in our implementation.
- x86-32
In x86 32-bit mode, we generate two instructions (addl and adcl) for eight byte counter instrumentation, and a single instruction (movb) in single byte coverage mode.
- x86-64
In x86 64-bit mode, eight byte counter instrumentation is a single instruction (inc), and it remains a single instruction (movb) in single byte coverage mode.
- ARMv7
We generate a sequence of nine instructions for instrumentation code, and when we enable single byte counters mode, we generate four instructions (ldr, add, mov and strb).
- AArch64
Instrumentation code consists of a sequence of four instructions (adrp, ldr, add, and str). When we enable single byte counters mode, it corresponds to two instructions (adrp and strb).
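To make the instruction counts above more concrete, here is a rough C++ model of the two update operations on a 32-bit target. It is a sketch under assumptions, not the generated code: the names are hypothetical, the 64-bit counter is split into two 32-bit halves to mimic an add followed by an add-with-carry, and the value written by the byte store is an illustrative choice.

```cpp
#include <cstdint>

// A 64-bit counter as a 32-bit target sees it: two 32-bit halves.
struct Counter64On32 {
    std::uint32_t lo = 0;
    std::uint32_t hi = 0;
};

// Eight byte counter update: add 1 to the low half, then propagate the
// carry into the high half (the role played by addl and adcl on x86-32).
void incrementCounter(Counter64On32 &c) {
    std::uint32_t old_lo = c.lo;
    c.lo = old_lo + 1;                // wraps to 0 on overflow
    c.hi += (c.lo < old_lo) ? 1 : 0;  // carry out of the low half
}

// Single byte counter update: one unconditional byte store, with no load
// and no carry to propagate (the role played by movb or strb).
void markCovered(std::uint8_t &flag) {
    flag = 1;
}
```

The byte store is idempotent, so it needs no read-modify-write sequence, which is consistent with the shorter sequences reported for ARMv7 and AArch64.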
Performance
The table above reports the speedups of single byte counters over eight byte counters on Dhrystone. We increased the default iteration count (LOOPS) in Dhrystone from 100,000,000 to 1,000,000,000. We achieved a 3% speedup on x86-32 and a 5% speedup on x86-64. We saw greater speedups on ARM: 62% on ARMv7 and 42% on AArch64.
The table above shows our results from the CoreMark benchmark. CoreMark is a performance benchmark designed to stress the CPU pipeline, and it reports a score (Iterations/Sec) upon execution. We observed a 29% better score in x86 32-bit mode, but a 7% worse score in x86 64-bit mode. The reason for the poor performance in x86 64-bit mode is that CoreMark is a heavily control-flow-bound benchmark, so we insert many more counters in single byte counters mode. This hurts performance, but it might be improved with a better algorithm that inserts fewer counters in single byte counters mode. Single byte counters achieved a 27% and 39% better score on ARMv7 and AArch64, respectively.
Code Size
We compared the code (.text) size of the Clang binary to measure the size impact of single byte counters. As shown in the table above, single byte counters reduce the code size by 20% in x86 32-bit mode because we generate fewer instructions. In contrast, they increase the code size by 4% in x86 64-bit mode. The reason is that single byte coverage mode does not reduce the number of static instructions per counter update on x86-64, while the additional counters we insert add instrumentation code. Single byte counters reduce the code size by 8% on AArch64, since we typically save multiple instructions per counter update there.