Motivation
When collecting profile counts for code in a tight loop in a multithreaded program, the counters become heavily contended. Because the counters are not incremented atomically by default, increments can be lost and the recorded values can be inaccurate. In fact, because the count of an ‘else’ block in an if statement is computed as the parent block’s count minus the ‘if’ block’s count, this can even produce negative counter values: if lost updates leave the recorded ‘if’ count higher than the recorded parent count, the derived ‘else’ count goes negative. Even when that does not occur, the skew can change what appears to be the hot path.
There are existing solutions to this problem: -fprofile-update=atomic, or, if you are only interested in whether a line is covered and not how often, -mllvm -enable-single-byte-coverage. However, the former can worsen performance and the latter gives a lower resolution of data.
Proposal
This patch explores a no-compromise approach which can deliver fully accurate profile counts with no reduction in performance: thread-local profile counters, atomically added to the global counts on thread exit.
This is accomplished by generating a new __llvm_tls_prf_cnts section for each module and emitting counter writes to offsets within that section rather than into the global __llvm_prf_cnts section.
A new compiler-rt runtime, composed of a static library handling module-local state and a shared library handling global state, atomically adds the values in __llvm_tls_prf_cnts to __llvm_prf_cnts on thread exit (detected via a pthread_create interceptor). This runtime is used in addition to the regular profiling runtime.
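To make the mechanism concrete, here is a minimal sketch of what the interceptor and thread-exit flush could look like. This is not the actual runtime: the __llvm_tls_prf_cnts_begin/__llvm_tls_prf_cnts_end bounds symbols and the flush entry point are assumptions for illustration only.

#define _GNU_SOURCE
#include <dlfcn.h>
#include <pthread.h>
#include <stdint.h>
#include <stdlib.h>

/* Assumed bounds of this module's TLS counter section and its global twin. */
extern __thread uint64_t __llvm_tls_prf_cnts_begin[], __llvm_tls_prf_cnts_end[];
extern uint64_t __llvm_prf_cnts_begin[];

/* Fold the calling thread's local counts into the global counters. */
static void flush_tls_counters(void) {
  uint64_t n = __llvm_tls_prf_cnts_end - __llvm_tls_prf_cnts_begin;
  for (uint64_t i = 0; i < n; i++)
    __atomic_fetch_add(&__llvm_prf_cnts_begin[i],
                       __llvm_tls_prf_cnts_begin[i], __ATOMIC_RELAXED);
}

struct wrapped { void *(*fn)(void *); void *arg; };

static void *start_routine_wrapper(void *p) {
  struct wrapped w = *(struct wrapped *) p;
  free(p);
  void *ret = w.fn(w.arg);   /* run the user's start routine */
  flush_tls_counters();      /* flush before this thread's TLS disappears */
  return ret;
}

int pthread_create(pthread_t *thread, const pthread_attr_t *attr,
                   void *(*fn)(void *), void *arg) {
  static int (*real_create)(pthread_t *, const pthread_attr_t *,
                            void *(*)(void *), void *);
  if (!real_create)
    real_create = (int (*)(pthread_t *, const pthread_attr_t *,
                           void *(*)(void *), void *))
                      dlsym(RTLD_NEXT, "pthread_create");
  struct wrapped *w = malloc(sizeof *w);
  w->fn = fn;
  w->arg = arg;
  return real_create(thread, attr, start_routine_wrapper, w);
}

A real implementation would also need to flush on pthread_exit and cancellation (for example via pthread_cleanup_push), which this sketch omits.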
Evaluation
Given an example program in which the profiling counters are highly contended, I ran it with four configurations: -fprofile-thread-local, -fprofile-update=atomic, -mllvm -enable-single-byte-coverage, and no additional options.
The program spawns 16 threads and was run on an AMD Ryzen 9 7940HS (8 cores, 16 hardware threads):
clang -o liblib.so -fuse-ld=lld -O2 <extra_profile_option> -fprofile-instr-generate -fcoverage-mapping lib.c -shared -fPIC
clang -o performance_test -fuse-ld=lld -lpthread -O2 -L$(readlink -f .) -Wl,-rpath,. <extra_profile_option> -fprofile-instr-generate -fcoverage-mapping main.c
lib.c:
unsigned char determine_value_dyn(unsigned char c) {
  if (c < 0x80) {
    return c;
  } else {
    return -c;
  }
}
main.c:
#include <pthread.h>
#include <stdint.h>
#include <stdlib.h>
#include <stdio.h>
#include <dlfcn.h>

struct thread_arg {
  uint64_t buf_size;
  char const *buf;
  uint64_t iteration_counter;
  uint64_t output;
};

unsigned char (*determine_value_dyn)(unsigned char) = NULL;

void *thread_fn(void *arg_ptr) {
  struct thread_arg *arg = (struct thread_arg *) arg_ptr;
  for (uint64_t i = 0; i < arg->buf_size; i++) {
    unsigned char c = (unsigned char) arg->buf[i];
    arg->output += determine_value_dyn(c);
    arg->iteration_counter++;
  }
  return NULL;
}

int main() {
  const uint64_t n_threads = 16;
  const uint64_t len = 10000000;
  const char *dynlib_name = "liblib.so";
  const char *dynlib_sym = "determine_value_dyn";
  void *handle = dlopen(dynlib_name, RTLD_LAZY);
  if (handle == NULL) {
    printf("dlopen error on: %s: %s\n", dynlib_name, dlerror());
    exit(2);
  }
  determine_value_dyn = dlsym(handle, dynlib_sym);
  if (determine_value_dyn == NULL) {
    printf("dlsym error on: %s : %s\n", dynlib_name, dynlib_sym);
    exit(2);
  }
  /* n_threads is not a constant expression, so these arrays are VLAs and
     cannot take initializers; both are fully assigned before use. */
  pthread_t threads[n_threads];
  struct thread_arg args[n_threads];
  char *example_string = (char *) malloc(sizeof(char) * len);
  int high = 0;
  for (uint64_t i = 0; i < len; i++) {
    if (high == 2) {
      example_string[i] = (char) 0xff;
      high = 0;
    } else {
      example_string[i] = 0x0;
      high++;
    }
  }
  for (uint64_t i = 0; i < n_threads; i++) {
    struct thread_arg a = {
      len,
      example_string,
      0,
      0,
    };
    args[i] = a;
    if (pthread_create(&threads[i], NULL, thread_fn, &args[i]) != 0) {
      printf("Failed to spawn thread %lu, exiting\n", i);
      exit(1);
    }
  }
  int rc = 0;
  for (uint64_t i = 0; i < n_threads; i++) {
    void *retval = NULL;
    if (pthread_join(threads[i], &retval) != 0) {
      printf("Failed to join thread %lu, continuing\n", i);
      rc = 1;
    }
    printf("Thread %lu output:\n"
           "iteration_counter: %lu\n"
           "output: %lx\n\n",
           i,
           args[i].iteration_counter,
           args[i].output);
  }
  return rc;
}
tls: -fprofile-thread-local
real 0m0.121s
user 0m1.610s
sys 0m0.011s
coverage of liblib.so:
FN:14,determine_value_dyn
FNDA:160000000,determine_value_dyn
FNF:1
FNH:1
DA:14,160000000
DA:15,160000000
DA:16,106666672
DA:17,106666672
DA:18,53333328
DA:19,53333328
DA:20,160000000
LF:7
LH:7
atomic: -fprofile-update=atomic
real 0m3.380s
user 0m52.238s
sys 0m0.014s
coverage of liblib.so:
FN:14,determine_value_dyn
FNDA:160000000,determine_value_dyn
FNF:1
FNH:1
DA:14,160000000
DA:15,160000000
DA:16,106666672
DA:17,106666672
DA:18,53333328
DA:19,53333328
DA:20,160000000
LF:7
LH:7
standard: (no additional options)
real 0m3.453s
user 0m52.418s
sys 0m0.012s
coverage of liblib.so:
FN:14,determine_value_dyn
FNDA:24782378,determine_value_dyn
FNF:1
FNH:1
DA:14,24782378
DA:15,24782378
DA:16,22864224
DA:17,22864224
DA:18,1918154
DA:19,1918154
DA:20,24782378
LF:7
LH:7
singlebyte: -mllvm -enable-single-byte-coverage
real 0m1.668s
user 0m26.032s
sys 0m0.010s
coverage of liblib.so:
FN:14,determine_value_dyn
FNDA:1,determine_value_dyn
FNF:1
FNH:1
DA:14,1
DA:15,1
DA:16,1
DA:17,1
DA:18,1
DA:19,1
DA:20,1
LF:7
LH:7
I was surprised by just how much of a difference this made for performance. I only expected it to be on par with standard unsynchronized counters, but in this example there was no difference at all between standard counters and -fprofile-update=atomic. I now understand that this is likely because x86_64 makes some memory-model guarantees even for unsynchronized operations on the same address, so contended plain increments pay much of the same cache-coherence cost as atomic ones.
With -fprofile-thread-local, coverage ran much faster than even -mllvm -enable-single-byte-coverage did.
Limitations
This patch is currently in an experimental state. So far it is only supported on Linux and tested on x86_64, though it may work on other 64-bit hardware as well.
However, the real glaring issue is that it only works on programs with fairly tame behavior: it produces accurate coverage values only when no threads are left running and the main thread calls ‘exit’ or returns. There is no guarantee that this instrumentation will work when not all modules are compiled with -fprofile-thread-local, though it seems to be well behaved when at least the executable itself is instrumented this way, even if shared libraries are not.
The main problem in the way of a more general solution is that, when the program is exiting, I need to be able to iterate over each module’s thread-local storage for every thread that has not already exited. Because any arbitrary thread can call exit or otherwise run module destructors, one thread ends up accessing the TLS of the others. I believe this functionality will require relying on ld.so-internal behavior; TSan already does that on some platforms. Alternatively, solutions involving signal handlers or ptrace may be explored to get each thread to do its own cleanup, as sketched below.
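For example, a signal-handler variant might look roughly like the following. This is only a sketch of the idea, not implemented code; the live-thread registry is an assumption, and flush_tls_counters is the per-thread flush from the earlier sketch.

#include <pthread.h>
#include <semaphore.h>
#include <signal.h>

void flush_tls_counters(void);  /* per-thread flush, as sketched earlier */

static sem_t flush_done;

/* Runs inside the signaled thread, so it sees that thread's own TLS. */
static void flush_handler(int sig) {
  (void) sig;
  flush_tls_counters();
  sem_post(&flush_done);        /* async-signal-safe */
}

/* Called by whichever thread is tearing the process down; the runtime
   would have to track the set of live threads itself. */
static void flush_all_threads(pthread_t *live, int n) {
  sem_init(&flush_done, 0, 0);
  signal(SIGUSR1, flush_handler);
  for (int i = 0; i < n; i++)
    pthread_kill(live[i], SIGUSR1);
  for (int i = 0; i < n; i++)
    sem_wait(&flush_done);
}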
Libraries that are dlclose’d may cause the current cleanup handler to use memory after it has been freed. This should be fixable by intercepting dlopen and adding the RTLD_NODELETE flag. Doing so could interfere with the tiny subset of programs that intentionally unload libraries to re-initialize static variables, but it is probably the best way to handle this issue.
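A minimal sketch of that interceptor, assuming the profiling runtime’s shared library is allowed to interpose dlopen:

#define _GNU_SOURCE
#include <dlfcn.h>

void *dlopen(const char *filename, int flags) {
  static void *(*real_dlopen)(const char *, int);
  if (!real_dlopen)
    real_dlopen = (void *(*)(const char *, int)) dlsym(RTLD_NEXT, "dlopen");
  /* RTLD_NODELETE keeps the library mapped after dlclose, so the cleanup
     handler never reads counters out of unmapped memory. */
  return real_dlopen(filename, flags | RTLD_NODELETE);
}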
I plan on exploring and resolving these limitations, but I want to get community feedback on the current architecture of the change before committing additional time to the current solution.
Possible future work
I think it may be useful to have an option to specify the TLS model used for the counters. Some libraries may be better off writing to thread-local counters with the initial-exec model rather than the default general-dynamic model.
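For illustration, the difference at the source level; the tls_model attribute is standard GCC/Clang, and the variable names here are made up:

#include <stdint.h>

/* Default for a shared library: general-dynamic, where each access may go
   through a __tls_get_addr call. */
static __thread uint64_t counters_gd[4];

/* initial-exec: a direct offset-from-thread-pointer access, which is faster
   but restricts how a library using it can be dlopen'd. */
static __thread uint64_t counters_ie[4]
    __attribute__((tls_model("initial-exec")));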