[RFC] Single Byte Counters for Source-based Code Coverage

Motivation

Clang profiling instrumentation inserts eight byte counters that are incremented whenever a part of the program executes in order to collect source-based code coverage. When we are only interested in whether a line is executed rather than how many times, a one byte counter is sufficient instead of an eight byte counter. The Lightweight Instrumentation idea was originally proposed to reduce the overhead of IR instrumentation for PGO. Since PGO and coverage use the same IR instrumentation, we repurpose this idea for coverage.

Proposal

We propose adding a mode to use single byte counters in source-based code coverage. We implemented a prototype in [InstrProf] Single byte counters in coverage by gulfemsavrun · Pull Request #75425 · llvm/llvm-project · GitHub (previously ⚙ D126586 [InstrProf] Single byte counters in coverage). Eight byte counter mode infers counters by adding or subtracting two execution counts whenever there is a control-flow merge. For example, the picture below shows the counters inserted for an if statement. Clang inserts counter 0 in the parent block and counter 1 in the if.then block, and it can infer the counter for the if.else block by subtracting the if.then count from the parent count.

However, we cannot infer counters in single byte coverage mode. Therefore, we conservatively insert additional counters for the cases where we would otherwise add or subtract counters. This might be improved in future work with a better algorithm that inserts fewer counters in single byte counters mode.
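As a rough illustration (the counter arrays and region numbering here are hypothetical, not the exact instrumentation Clang emits), the difference between the two modes looks roughly like this:

long counters[2];   // eight byte counter mode: execution counts
bool covered[3];    // single byte counters mode: one executed/not-executed flag per region

void f_eight_byte_mode(bool cond) {
  ++counters[0];                 // counter 0: parent block
  if (cond)
    ++counters[1];               // counter 1: if.then block
  // if.else needs no counter: its count is counters[0] - counters[1].
}

void f_single_byte_mode(bool cond) {
  covered[0] = true;             // parent block
  if (cond)
    covered[1] = true;           // if.then block
  else
    covered[2] = true;           // extra counter: if.else can no longer be inferred
}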

Evaluation

We evaluated the performance and size impact of our prototype. We used Clang version 18.0.0 with no instrumentation as our baseline, which we refer to as “No coverage”. We report the performance speedups at optimization level O2, but we measured O1 and O3 as well; we saw greater speedups at O1, and O3 performance is very similar to O2. We present the results of single byte counters for x86 and ARM architectures in 32-bit and 64-bit modes. Our hardware configuration consisted of an x86-based Intel Xeon system, an ARMv7 system with Cortex-A53 cores, and an AArch64 system with ThunderX2 cores. We used Dhrystone from llvm-test-suite and CoreMark to measure the performance benefit of single byte counters. We also measured the code size of the Clang binary to evaluate the size impact of single byte counters.

Code Snippets

The table above shows the instrumentation code snippets our implementation generates for the Proc4 method in Dhrystone.

  • x86-32

We generate two instructions (addl and adcl) for the instrumentation in x86 32-bit mode, and a single instruction (movb) in single byte coverage mode.

  • x86-64

Instrumentation code corresponds to a single instruction (inc), and it is still a single instruction (movb) in single byte coverage mode.

  • ARMv7

We generate a sequence of nine instructions for instrumentation code, and when we enable single byte counters mode, we generate four instructions (ldr, add, mov and strb).

  • AArch64

Instrumentation code consists of a sequence of four instructions (adrp, ldr, add, and str). When we enable single byte counters mode, it corresponds to two instructions (adrp and strb).

Performance

The table above reports the speedups of single byte counters over eight byte counters on Dhrystone. We increased the default iteration count (LOOPS) in Dhrystone from 100,000,000 to 1,000,000,000. We achieved a 3% speedup on x86-32 and a 5% speedup on x86-64. We see greater speedups on ARM: 62% on ARMv7 and 42% on AArch64.

The table above shows our results from the CoreMark benchmark. CoreMark is a performance benchmark designed to stress the CPU pipeline, and it reports a score (Iterations/Sec) upon execution. We observed a 29% better score in x86 32-bit mode, but a 7% worse score in 64-bit mode. The reason for the poor performance in x86 64-bit mode is that CoreMark is a significantly control-heavy benchmark and we insert many more counters in single byte counters mode. This hurts performance, but might be improved if we come up with a better algorithm that inserts fewer counters in single byte counters mode. Single byte counters achieved a 27% and 39% better score on ARMv7 and AArch64, respectively.

Code Size

We compared the code (.text) size of the clang binary to measure the size impact of single byte counters. As shown in the table above, single byte counters reduce the code size by 20% in x86 32-bit mode because we generate fewer instructions. In contrast, they increase the code size by 4% in x86 64-bit mode. The reason is that we do not reduce the number of static instructions in single byte coverage mode, and we insert additional instrumentation code, which increases the code size in x86 64-bit mode. Single byte counters reduce the code size by 8% on AArch64 since we typically save multiple instructions per instrumentation site there.

3 Likes

@davidxl @evodius96 @ellishg @ayzhao @ZequanWu @rnk @hansw2000 @petrhosek @MaskRay

Since PGO and coverage use the same IR instrumentation, we repurpose this idea for coverage.

Is the idea to do this for coverage in addition to PGO, or just for coverage now?

We propose adding a mode to use single byte counters in source-based code coverage.

Sounds great to me :slight_smile:

Someone might ask: if it’s just 0 or 1, why not 1-bit counters?

Binary counters also bring the possibility of doing conditional updates (if a counter is already reset, don’t reset it again) to reduce write contention when multiple threads update counters on the same cache lines. IIRC @ayzhao experimented with that but I don’t remember the exact results.

I’m not familiar with how the profiling counting works, but my instinctive answer is that multithreading can lose information in a read-update-write race. I don’t know whether the current counter scheme guards against this, but the data loss is less harmful in that case (because the counter will always end up non-zero, even if the count is wrong). Raciness on 1-bit updates can reset ones back to zero, which is very bad.

This concern goes away if the instrumentation uses atomic updates, doesn’t it?

1 Like

We’ve studied how to find the smallest number of blocks that need to be instrumented to infer coverage for the whole CFG. Details are in [2208.13907] Minimum Coverage Instrumentation. We also implemented this for IRPGO in ⚙ D124490 [InstrProf] Minimal Block Coverage ([InstrProf] Minimal Block Coverage · llvm/llvm-project@167e8f8 · GitHub) and in the end we found that only ~60% of basic blocks need to be instrumented. That implementation uses BlockCoverageInference as a framework to decide which blocks to instrument and compute the coverage of the whole CFG. Right now it uses LLVM Functions and BasicBlocks, but I think it could be generalized to work with Clang too.

I actually considered doing this, but the size and performance overhead would likely be worse. The size overhead of instrumenting N blocks on arm64 is 8*N bytes for code and N bytes for data. So saving 7 bits of data out of 9 bytes of overhead isn’t a lot.
Also, I don’t know of an arm64 instruction that will store a single bit. Instead, you’ll need to do a load, an and, then a store, which has way more code overhead. See this example: Compiler Explorer. This is also what prevents single-bit counters from being thread-safe. A nice advantage of single-byte counters is that they are already thread-safe without needing atomics.
As far as I can tell, LLVM doesn’t even support single-bit data anyway. When I try to use an i1 global, it treats it as a byte: Compiler Explorer.
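To make the code-size and thread-safety point concrete, here is a rough C++ sketch (array names are hypothetical, and 1 is used here to mean “executed”, whereas the prototype stores 0):

#include <cstdint>

uint8_t byte_counters[1024];     // single byte counters: one byte per region
uint8_t bit_counters[1024 / 8];  // hypothetical single-bit counters: eight regions per byte

// Byte counter update: a single plain store. It is idempotent and tolerates
// races, because every racing thread stores the same value.
void cover_byte(unsigned i) { byte_counters[i] = 1; }

// Bit counter update: a non-atomic load/or/store. Two threads touching
// different bits of the same byte can silently drop one of the updates.
void cover_bit(unsigned i) { bit_counters[i / 8] |= uint8_t(1u << (i % 8)); }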

1 Like

Can this algorithm be adapted for Clang’s AST, or does it require a CFG? Clang doesn’t have a canonical CFG.

Consider that the coverage mapping data is constructed in the frontend. If we ran your algorithm as a mandatory pass on the unoptimized IR, it would need to refer back to the AST to recover source locations. I think, for implementation practicality, we should leave this counter placement optimization work for later. I think it will be challenging to adapt this to the context of the frontend, but I’m open to being wrong.

I’ll note that if we had clang-ir.org in tree, that would give us a CFG, and we would use that to place counters and construct the coverage mapping.


Regarding the conversation around bits-vs-bytes and concurrency concerns, I think it’s a code size trade-off. Both x86_64 and ARM64 can update a byte flag with a single additional instruction, reducing instrumentation size. This may be the right tradeoff for size-constrained embedded and mobile apps.

If we are willing to spend the extra code size, then it makes sense to explore conditional double-checked locking and bits over bytes, since the steady-state fast path is a check of “does this counter still need to be updated? no, continue”, and the slow path can use expensive atomic bit-set instructions or a compare-and-swap loop.
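As a hedged sketch (not a concrete implementation proposal), that fast path / slow path split could look something like this:

#include <atomic>
#include <cstdint>

// Double-checked bit update: the steady-state fast path is a relaxed load
// that skips the write once the bit is already set; the cold slow path uses
// an atomic RMW so that concurrent updates to neighboring bits are not lost.
inline void cover_bit_atomic(std::atomic<uint8_t> &word, uint8_t mask) {
  if ((word.load(std::memory_order_relaxed) & mask) == 0)
    word.fetch_or(mask, std::memory_order_relaxed);
}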

In conclusion, I think bits, conditional updates, and atomic instructions are good future directions, but there is a place for unconditional, racy, byte-sized coverage flags, and we should move forward with the current plan.

2 Likes

From what I remember, every block must reach some terminal node, i.e., there must be a single entry block and a single exit block. Otherwise we can bail and instrument every block. It would be interesting to see whether ineligible CFGs are common in Clang’s AST and if they can be somehow fixed.

Yes, we would need to somehow store the CFG so coverage could be inferred. This is a trade-off of complexity for size overhead. I just wanted to make sure this work was known; it’s certainly not required.

I didn’t think about that. Yes, I think clang-ir would greatly simplify things. :slight_smile: CC @bcardosolopes @lanza

1 Like

Instrumentation is shared between coverage and PGO, but they are done at different stages:

  1. Coverage adds instrumentation during Clang Codegen before IR transformations, which is why I had to change several Codegen functions to integrate single byte counters into coverage.
  2. PGO adds instrumentation during a transformation pass, so it operates on the LLVM IR.

PGO already supports single byte counters, but we cannot just reuse that support in coverage, which is why we need to repurpose the idea for coverage. Here is the original proposal, which covers the history of that work.

Thanks for pointing that out @ellishg; we are aware of that work. AFAIK, PGO uses a minimum spanning tree approach on a CFG to reduce the instrumentation cost, and this algorithm expects a CFG. So I don’t think it can be directly applied to the AST.

Chrome’s use case is that our coverage builds run our tests with online counter merging (i.e. %m in LLVM_PROFILE_FILE) enabled, and our tests run in parallel, with the number of jobs equal to the number of CPU threads. I discovered that the cache coherency / write contention cost of unconditionally writing to counters is much greater (~4.5x more elapsed time) than that of reading and conditionally updating a counter’s value.

Methodology
The test was performed on an AMD Ryzen Threadripper PRO 3995WX 64-Core (128 threads).

For both conditional and unconditional writes, I built Chrome’s base_unittests with the following gn args:

# Build arguments go here.
# See "gn args <out_dir> --list" for available build arguments.
dcheck_always_on = true
devtools_skip_typecheck = false
enable_backup_ref_ptr_feature_flag = true
enable_dangling_raw_ptr_checks = true
enable_dangling_raw_ptr_feature_flag = true
ffmpeg_branding = "Chrome"
is_component_build = false
is_debug = false
proprietary_codecs = true
symbol_level = 1
use_remoteexec = false
llvm_force_head_revision=true
clang_use_chrome_plugins=false
clang_base_path="/path/to/llvm-project/build"
use_clang_coverage=true

To generate conditional writes, I applied the following quick and dirty patch to [InstrProf] Single byte counters in coverage by gulfemsavrun · Pull Request #75425 · llvm/llvm-project · GitHub

diff --git a/llvm/lib/Transforms/Instrumentation/InstrProfiling.cpp b/llvm/lib/Transforms/Instrumentation/InstrProfiling.cpp
index fe5a0578bd97..fea37e1c5c71 100644
--- a/llvm/lib/Transforms/Instrumentation/InstrProfiling.cpp
+++ b/llvm/lib/Transforms/Instrumentation/InstrProfiling.cpp
@@ -601,6 +601,7 @@ bool InstrLowerer::lowerIntrinsics(Function *F) {
       } else if (auto *IPC = dyn_cast<InstrProfCoverInst>(&Instr)) {
         lowerCover(IPC);
         MadeChange = true;
+        break; // because this splits and terminates the current basic block
       } else if (auto *IPVP = dyn_cast<InstrProfValueProfileInst>(&Instr)) {
         lowerValueProfileInst(IPVP);
         MadeChange = true;
@@ -908,7 +909,16 @@ Value *InstrLowerer::getBitmapAddress(InstrProfMCDCTVBitmapUpdate *I) {

 void InstrLowerer::lowerCover(InstrProfCoverInst *CoverInstruction) {
   auto *Addr = getCounterAddress(CoverInstruction);
+  auto &Ctx = CoverInstruction->getParent()->getContext();
+  Instruction *SplitBefore = CoverInstruction->getNextNode();
   IRBuilder<> Builder(CoverInstruction);
+  auto *Int8Ty = llvm::Type::getInt8Ty(Ctx);
+  Value *Load = Builder.CreateLoad(Int8Ty, Addr, "pgocount");
+  Value *CondV = Builder.CreateICmpNE(Load, ConstantInt::get(Int8Ty, 0),
+                                      "pgocount.ifnonzero");
+  Instruction *ThenBranch =
+      SplitBlockAndInsertIfThen(CondV, SplitBefore, false);
+  Builder.SetInsertPoint(ThenBranch);
   // We store zero to represent that this block is covered.
   Builder.CreateStore(Builder.getInt8(0), Addr);
   CoverInstruction->eraseFromParent();

After building, I ran the tests with

$ export LLVM_PROFILE_FILE=/tmp/default/default%4m%c.profraw
$ perf stat path/to/chrome/build/directory/base_unittests

Results

Tests with unconditional counter writes finish in 299 seconds:

 Performance counter stats for 'out/bool/base_unittests':

      8,043,730.05 msec task-clock:u                     #   26.919 CPUs utilized
                 0      context-switches:u               #    0.000 /sec
                 0      cpu-migrations:u                 #    0.000 /sec
         8,481,198      page-faults:u                    #    1.054 K/sec
30,675,466,783,723      cycles:u                         #    3.814 GHz                         (83.35%)
 1,044,934,579,313      stalled-cycles-frontend:u        #    3.41% frontend cycles idle        (83.35%)
21,705,281,344,861      stalled-cycles-backend:u         #   70.76% backend cycles idle         (83.35%)
 1,317,213,595,303      instructions:u                   #    0.04  insn per cycle
                                                  #   16.48  stalled cycles per insn     (83.35%)
   158,169,143,258      branches:u                       #   19.664 M/sec                       (83.35%)
     1,700,438,177      branch-misses:u                  #    1.08% of all branches             (83.35%)

     298.811929869 seconds time elapsed

    7758.253230000 seconds user
     254.695185000 seconds sys

while tests with conditional writes finish in 66 seconds:

 Performance counter stats for 'out/condbool/base_unittests':

      1,184,423.49 msec task-clock:u                     #   17.979 CPUs utilized
                 0      context-switches:u               #    0.000 /sec
                 0      cpu-migrations:u                 #    0.000 /sec
         8,190,784      page-faults:u                    #    6.915 K/sec
 3,272,203,056,799      cycles:u                         #    2.763 GHz                         (83.39%)
   774,378,413,185      stalled-cycles-frontend:u        #   23.67% frontend cycles idle        (83.38%)
    99,682,562,762      stalled-cycles-backend:u         #    3.05% backend cycles idle         (83.45%)
 2,059,281,971,699      instructions:u                   #    0.63  insn per cycle
                                                  #    0.38  stalled cycles per insn     (83.38%)
   615,859,467,850      branches:u                       #  519.966 M/sec                       (83.40%)
    12,000,644,695      branch-misses:u                  #    1.95% of all branches             (83.39%)

      65.878658223 seconds time elapsed

     871.587761000 seconds user
     303.454896000 seconds sys

Can you try using select instead of a branch:

%r1 = load i8, i8* %p
%cmp = icmp ne i8 %r1, 0
%r2 = select i1 %cmp, i8 0, i8 %r1
store i8 %r2, i8* %p

I’d expect it to be lowered to cmov which could further improve performance.

It looks like select gets lowered to an unconditional write

LLVM diff:

diff --git a/llvm/lib/Transforms/Instrumentation/InstrProfiling.cpp b/llvm/lib/Transforms/Instrumentation/InstrProfiling.cpp
index fe5a0578bd97..f1b68417d575 100644
--- a/llvm/lib/Transforms/Instrumentation/InstrProfiling.cpp
+++ b/llvm/lib/Transforms/Instrumentation/InstrProfiling.cpp
@@ -908,9 +908,16 @@ Value *InstrLowerer::getBitmapAddress(InstrProfMCDCTVBitmapUpdate *I) {

 void InstrLowerer::lowerCover(InstrProfCoverInst *CoverInstruction) {
   auto *Addr = getCounterAddress(CoverInstruction);
+  auto &Ctx = CoverInstruction->getParent()->getContext();
+  auto *Int8Ty = llvm::Type::getInt8Ty(Ctx);
   IRBuilder<> Builder(CoverInstruction);
+  Value *Load = Builder.CreateLoad(Int8Ty, Addr, "pgocount");
+  Value *CondV = Builder.CreateICmpNE(Load, ConstantInt::get(Int8Ty, 0),
+                                      "pgocount.ifnonzero");
+  Value *Sel = Builder.CreateSelect(CondV, Builder.getInt8(0), Load,
+                                    "pgocount.select");
   // We store zero to represent that this block is covered.
-  Builder.CreateStore(Builder.getInt8(0), Addr);
+  Builder.CreateStore(Sel, Addr);
   CoverInstruction->eraseFromParent();
 }

I have a test file simple.cc which contains a function foo() that does nothing:

void foo() {
}

If we compile with -O0 we see that there is indeed a select instruction:

 ~/src/llvm-project/build/bin/clang -c -S -emit-llvm -O0 -fprofile-instr-generate -mllvm -runtime-counter-relocation=true -mllvm  -enable-single-byte-coverage=true ~/src/tests/simple.cc -o -

(other bits omitted)

; Function Attrs: mustprogress noinline nounwind optnone uwtable
define dso_local void @_Z3foov() #0 {
  %1 = load i64, ptr @__llvm_profile_counter_bias, align 8
  %2 = add i64 ptrtoint (ptr @__profc__Z3foov to i64), %1
  %3 = inttoptr i64 %2 to ptr
  %4 = load i8, ptr %3, align 1
  %5 = icmp ne i8 %4, 0
  %6 = select i1 %5, i8 0, i8 %4
  store i8 %6, ptr %3, align 1
  ret void
}

-O3 lowers the select to an unconditional store in LLVM IR

define dso_local void @_Z3foov() local_unnamed_addr #0 {
  %1 = load i64, ptr @__llvm_profile_counter_bias, align 8
  %2 = add i64 %1, ptrtoint (ptr @__profc__Z3foov to i64)
  %3 = inttoptr i64 %2 to ptr
  store i8 0, ptr %3, align 1
  ret void
}

which gets lowered into an unconditional movb $0 store on x86:

_Z3foov:                                # @_Z3foov
        .cfi_startproc
# %bb.0:
        movq    __llvm_profile_counter_bias(%rip), %rax
        leaq    .L__profc__Z3foov(%rip), %rcx
        movb    $0, (%rax,%rcx)
        retq

OTOH, with the branch-based patch, LLVM preserves the conditional branch:

_Z3foov:                                # @_Z3foov
        .cfi_startproc
# %bb.0:
        movq    __llvm_profile_counter_bias(%rip), %rax
        leaq    .L__profc__Z3foov(%rip), %rcx
        cmpb    $0, (%rax,%rcx)
        je      .LBB0_2
# %bb.1:
        movb    $0, (%rax,%rcx)
.LBB0_2:
        retq

perf stat on base_unittests also confirms that no conditional counter updates were made:

 Performance counter stats for 'out/selectcondbool/base_unittests':

      8,138,966.68 msec task-clock:u                     #   28.346 CPUs utilized
                 0      context-switches:u               #    0.000 /sec
                 0      cpu-migrations:u                 #    0.000 /sec
         8,400,131      page-faults:u                    #    1.032 K/sec
31,061,247,596,477      cycles:u                         #    3.816 GHz                         (83.35%)
 1,312,779,320,466      stalled-cycles-frontend:u        #    4.23% frontend cycles idle        (83.35%)
22,075,624,483,530      stalled-cycles-backend:u         #   71.07% backend cycles idle         (83.36%)
 1,330,903,366,283      instructions:u                   #    0.04  insn per cycle
                                                  #   16.59  stalled cycles per insn     (83.34%)
   159,776,601,808      branches:u                       #   19.631 M/sec                       (83.34%)
     1,694,924,175      branch-misses:u                  #    1.06% of all branches             (83.35%)

     287.130240302 seconds time elapsed

    7853.346829000 seconds user
     258.224291000 seconds sys

EDIT: missed a retq instruction

I thought the rule of thumb was to prefer branches when they’re likely to be predictable. Also, wouldn’t a cmov potentially defeat the purpose of avoiding writing to the cache line?

Anyway, I don’t want to derail the thread with implementation details, but I think Alan’s numbers highlight that conditional updates are important for concurrent workloads, so we should keep it in mind as a future extension.

Agree.

Using bit counters directly seems bad for code size (4+ instructions for an update for many architectures), but we can postpone counter updates to the function exit. (This is similar to counter promotion in PGO).
One 64-bit register can serve 64 counters! Many functions just need one register.

// function entry
delta0 = 0;
delta1 = 0;

// enter a region
delta0 |= 1;

// enter another region
delta0 |= 4;

// function exit, write back
counter[0] |= delta0;
counter[1] |= delta1; // This is a no-op. Is a conditional write beneficial here?

If Clang could utilize block coverage inference, it would be better :)

Setting a bit in a register takes 4 bytes on AArch64, 4/6/8 bytes on x86-64 (-Oz), 4/6 bytes on RISCV64, 4/8 bytes on PPC64.

For small (with fewer basic blocks)/cheap functions, byte counters seem optimal.
For larger/expensive functions, bit counters with postponed updates may be better.

I don’t know whether we want to support mixed counters, which might achieve an optimal solution for both cases.
However, making this optimization decision in clang codegen could be awkward without an IR.

Moreover, bit counters somewhat negate the advantages of continuous mode or runtime counter relocation, both of which are beneficial for code coverage in death tests.