Motivation
When collecting profile counts for code in a tight loop in a multithreaded program, the counters become heavily contended. Because the counters are not incremented atomically by default, increments can be lost and the recorded values can be inaccurate. In fact, because the count of an ‘else’ block in an if statement is computed as the parent block’s count minus the ‘if’ block’s count, this can even produce negative counter values: if lost updates leave the recorded ‘if’ count higher than the recorded parent count, the derived ‘else’ count goes negative. Even when that does not occur, the skew can change what appears to be the hot path.
There are existing solutions to this problem: -fprofile-update=atomic, or, if you are only interested in whether a line is covered and not how often, -mllvm -enable-single-byte-coverage. However, the former can worsen performance and the latter gives a lower resolution of data.
Proposal
This patch explores a no-compromise approach which can deliver fully accurate profile counts with no reduction in performance: thread-local profile counters, atomically added to the global counts on thread exit.
This is accomplished by generating a new __llvm_tls_prf_cnts section for each module and emitting counter writes to offsets within that section rather than into the global __llvm_prf_cnts section.
A new compiler-rt runtime, composed of a static library handling module-local state and a shared library handling global state, atomically adds the values in __llvm_tls_prf_cnts to __llvm_prf_cnts on thread exit (detected via a pthread_create interceptor). This runtime is used in addition to the regular profiling runtime.
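To make the mechanism concrete, here is a minimal sketch of what the interceptor and thread-exit flush could look like. This is not the actual runtime: the __llvm_tls_prf_cnts_begin/__llvm_tls_prf_cnts_end bounds symbols and the flush entry point are assumptions for illustration only.

#define _GNU_SOURCE
#include <dlfcn.h>
#include <pthread.h>
#include <stdint.h>
#include <stdlib.h>

/* Assumed bounds of this module's TLS counter section and its global twin. */
extern __thread uint64_t __llvm_tls_prf_cnts_begin[], __llvm_tls_prf_cnts_end[];
extern uint64_t __llvm_prf_cnts_begin[];

/* Fold the calling thread's local counts into the global counters. */
static void flush_tls_counters(void) {
  uint64_t n = __llvm_tls_prf_cnts_end - __llvm_tls_prf_cnts_begin;
  for (uint64_t i = 0; i < n; i++)
    __atomic_fetch_add(&__llvm_prf_cnts_begin[i],
                       __llvm_tls_prf_cnts_begin[i], __ATOMIC_RELAXED);
}

struct wrapped { void *(*fn)(void *); void *arg; };

static void *start_routine_wrapper(void *p) {
  struct wrapped w = *(struct wrapped *) p;
  free(p);
  void *ret = w.fn(w.arg);   /* run the user's start routine */
  flush_tls_counters();      /* flush before this thread's TLS disappears */
  return ret;
}

int pthread_create(pthread_t *thread, const pthread_attr_t *attr,
                   void *(*fn)(void *), void *arg) {
  static int (*real_create)(pthread_t *, const pthread_attr_t *,
                            void *(*)(void *), void *);
  if (!real_create)
    real_create = (int (*)(pthread_t *, const pthread_attr_t *,
                           void *(*)(void *), void *))
                      dlsym(RTLD_NEXT, "pthread_create");
  struct wrapped *w = malloc(sizeof *w);
  w->fn = fn;
  w->arg = arg;
  return real_create(thread, attr, start_routine_wrapper, w);
}

A real implementation would also need to flush on pthread_exit and cancellation (for example via pthread_cleanup_push), which this sketch omits.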
Evaluation
Given an example program in which the profiling counters are highly contended, I ran it with four configurations: -fprofile-thread-local, -fprofile-update=atomic, -mllvm -enable-single-byte-coverage, and no additional options.
The program spawns 16 threads and was run on an AMD Ryzen 9 7940HS (8 cores, 16 hardware threads):
clang -o liblib.so -fuse-ld=lld -O2 <extra_profile_option> -fprofile-instr-generate -fcoverage-mapping lib.c -shared -fPIC
clang -o performance_test -fuse-ld=lld -lpthread -O2 -L$(readlink -f .) -Wl,-rpath,. <extra_profile_option> -fprofile-instr-generate -fcoverage-mapping main.c
lib.c:
unsigned char determine_value_dyn(unsigned char c) {
  if (c < 0x80) {
    return c;
  } else {
    return -c;
  }
}
main.c:
#include <pthread.h>
#include <stdint.h>
#include <stdlib.h>
#include <stdio.h>
#include <dlfcn.h>

struct thread_arg {
  uint64_t buf_size;
  char const *buf;
  uint64_t iteration_counter;
  uint64_t output;
};

unsigned char (*determine_value_dyn)(unsigned char) = NULL;

void *thread_fn(void *arg_ptr) {
  struct thread_arg *arg = (struct thread_arg *) arg_ptr;
  for (uint64_t i = 0; i < arg->buf_size; i++) {
    unsigned char c = (unsigned char) arg->buf[i];
    arg->output += determine_value_dyn(c);
    arg->iteration_counter++;
  }
  return NULL;
}

int main() {
  const uint64_t n_threads = 16;
  const uint64_t len = 10000000;
  const char *dynlib_name = "liblib.so";
  const char *dynlib_sym = "determine_value_dyn";
  void *handle = dlopen(dynlib_name, RTLD_LAZY);
  if (handle == NULL) {
    printf("dlopen error on: %s: %s\n", dynlib_name, dlerror());
    exit(2);
  }
  determine_value_dyn = dlsym(handle, dynlib_sym);
  if (determine_value_dyn == NULL) {
    printf("dlsym error on: %s : %s\n", dynlib_name, dynlib_sym);
    exit(2);
  }
  /* n_threads is not a constant expression, so these arrays are VLAs and
     cannot take initializers; both are fully assigned before use. */
  pthread_t threads[n_threads];
  struct thread_arg args[n_threads];
  char *example_string = (char *) malloc(sizeof(char) * len);
  int high = 0;
  for (uint64_t i = 0; i < len; i++) {
    if (high == 2) {
      example_string[i] = (char) 0xff;
      high = 0;
    } else {
      example_string[i] = 0x0;
      high++;
    }
  }
  for (uint64_t i = 0; i < n_threads; i++) {
    struct thread_arg a = {
      len,
      example_string,
      0,
      0,
    };
    args[i] = a;
    if (pthread_create(&threads[i], NULL, thread_fn, &args[i]) != 0) {
      printf("Failed to spawn thread %lu, exiting\n", i);
      exit(1);
    }
  }
  int rc = 0;
  for (uint64_t i = 0; i < n_threads; i++) {
    void *retval = NULL;
    if (pthread_join(threads[i], &retval) != 0) {
      printf("Failed to join thread %lu, continuing\n", i);
      rc = 1;
    }
    printf("Thread %lu output:\n"
           "iteration_counter: %lu\n"
           "output: %lx\n\n",
           i,
           args[i].iteration_counter,
           args[i].output);
  }
  return rc;
}
tls: -fprofile-thread-local
real 0m0.121s
user 0m1.610s
sys 0m0.011s
coverage of liblib.so:
FN:14,determine_value_dyn
FNDA:160000000,determine_value_dyn
FNF:1
FNH:1
DA:14,160000000
DA:15,160000000
DA:16,106666672
DA:17,106666672
DA:18,53333328
DA:19,53333328
DA:20,160000000
LF:7
LH:7
atomic: -fprofile-update=atomic
real 0m3.380s
user 0m52.238s
sys 0m0.014s
coverage of liblib.so:
FN:14,determine_value_dyn
FNDA:160000000,determine_value_dyn
FNF:1
FNH:1
DA:14,160000000
DA:15,160000000
DA:16,106666672
DA:17,106666672
DA:18,53333328
DA:19,53333328
DA:20,160000000
LF:7
LH:7
standard: (no additional options)
real 0m3.453s
user 0m52.418s
sys 0m0.012s
coverage of liblib.so:
FN:14,determine_value_dyn
FNDA:24782378,determine_value_dyn
FNF:1
FNH:1
DA:14,24782378
DA:15,24782378
DA:16,22864224
DA:17,22864224
DA:18,1918154
DA:19,1918154
DA:20,24782378
LF:7
LH:7
singlebyte: -mllvm -enable-single-byte-coverage
real 0m1.668s
user 0m26.032s
sys 0m0.010s
coverage of liblib.so:
FN:14,determine_value_dyn
FNDA:1,determine_value_dyn
FNF:1
FNH:1
DA:14,1
DA:15,1
DA:16,1
DA:17,1
DA:18,1
DA:19,1
DA:20,1
LF:7
LH:7
I was surprised by just how much of a difference this made for performance. I only expected it to be on par with standard unsynchronized counters, but in this example there was no difference at all between standard counters and -fprofile-update=atomic. I now understand that this is likely because x86_64 makes some memory-model guarantees even for unsynchronized operations on the same address, so contended plain increments pay much of the same cache-coherence cost as atomic ones.
With -fprofile-thread-local, coverage ran much faster than even -mllvm -enable-single-byte-coverage did.
Limitations
This patch is currently in an experimental state. So far it is only supported on Linux and tested on x86_64, though it may work on other 64-bit hardware as well.
However, the real glaring issue is that it only works on programs with fairly tame behavior: it produces accurate coverage values only when no threads are left running and the main thread calls ‘exit’ or returns. There is no guarantee that this instrumentation will work when not all modules are compiled with -fprofile-thread-local, though it seems to be well behaved when at least the executable itself is instrumented this way, even if shared libraries are not.
The main problem in the way of a more general solution is that, when the program is exiting, I need to be able to iterate over each module’s thread-local storage for every thread that has not already exited. Because any arbitrary thread can call exit or otherwise run module destructors, one thread ends up accessing the TLS of the others. I believe this functionality will require relying on ld.so-internal behavior; TSan already does that on some platforms. Alternatively, solutions involving signal handlers or ptrace may be explored to get each thread to do its own cleanup, as sketched below.
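For example, a signal-handler variant might look roughly like the following. This is only a sketch of the idea, not implemented code; the live-thread registry is an assumption, and flush_tls_counters is the per-thread flush from the earlier sketch.

#include <pthread.h>
#include <semaphore.h>
#include <signal.h>

void flush_tls_counters(void);  /* per-thread flush, as sketched earlier */

static sem_t flush_done;

/* Runs inside the signaled thread, so it sees that thread's own TLS. */
static void flush_handler(int sig) {
  (void) sig;
  flush_tls_counters();
  sem_post(&flush_done);        /* async-signal-safe */
}

/* Called by whichever thread is tearing the process down; the runtime
   would have to track the set of live threads itself. */
static void flush_all_threads(pthread_t *live, int n) {
  sem_init(&flush_done, 0, 0);
  signal(SIGUSR1, flush_handler);
  for (int i = 0; i < n; i++)
    pthread_kill(live[i], SIGUSR1);
  for (int i = 0; i < n; i++)
    sem_wait(&flush_done);
}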
Libraries that are dlclose’d may cause the current cleanup handler to use memory after it has been freed. This should be fixable by intercepting dlopen and adding the RTLD_NODELETE flag. Doing so could interfere with the tiny subset of programs that intentionally unload libraries to re-initialize static variables, but it is probably the best way to handle this issue.
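A minimal sketch of that interceptor, assuming the profiling runtime’s shared library is allowed to interpose dlopen:

#define _GNU_SOURCE
#include <dlfcn.h>

void *dlopen(const char *filename, int flags) {
  static void *(*real_dlopen)(const char *, int);
  if (!real_dlopen)
    real_dlopen = (void *(*)(const char *, int)) dlsym(RTLD_NEXT, "dlopen");
  /* RTLD_NODELETE keeps the library mapped after dlclose, so the cleanup
     handler never reads counters out of unmapped memory. */
  return real_dlopen(filename, flags | RTLD_NODELETE);
}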
I plan on exploring and resolving these limitations, but I want to get community feedback on the current architecture of the change before committing additional time to the current solution.
Possible future work
I think it may be useful to have an option to specify the TLS model used for the counters. Some libraries may be better off writing to thread-local counters with the initial-exec model rather than the default general-dynamic model.
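For illustration, the difference at the source level; the tls_model attribute is standard GCC/Clang, and the variable names here are made up:

#include <stdint.h>

/* Default for a shared library: general-dynamic, where each access may go
   through a __tls_get_addr call. */
static __thread uint64_t counters_gd[4];

/* initial-exec: a direct offset-from-thread-pointer access, which is faster
   but restricts how a library using it can be dlopen'd. */
static __thread uint64_t counters_ie[4]
    __attribute__((tls_model("initial-exec")));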