Non-deterministic runtime of addq instruction

I have the following (admittedly somewhat odd) function written in assembly:

        .p2align 4
        .globl  f2
        .type   f2, @function
f2:
.LFB5700:
        .cfi_startproc
        addq    $1, (%rdi)
        addq    $1, (%rsi)
        addq    $1, (%rdi)
        addq    $1, (%rsi)
        addq    $1, (%rdi)
        addq    $1, (%rsi)
        addq    $1, (%rdi)
        addq    $1, (%rsi)
        ret
        .cfi_endproc
.LFE5700:
        .size   f2, .-f2

I call this function a few million times from a loop written in C and measure the runtime using __rdtscp. To remove the overhead of the loop and the function calls, I then subtract the runtime of another loop that calls an empty function (one that does nothing but ret). The difference is then divided by the number of operations to obtain the average runtime of this function.
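
For reference, the measurement described above can be sketched roughly as follows. This is a minimal illustration rather than my exact harness: the names f2_standin, empty_fn, time_loop and cycles_per_op are made up for this sketch, f2_standin is a C stand-in for the assembly f2, and treating "number of operations" as calls times 8 increments per call is an assumption.

```c
#include <stdint.h>
#include <x86intrin.h>   /* __rdtscp */

/* Hypothetical signature shared by the measured and the empty function. */
typedef void (*bench_fn)(uint64_t *, uint64_t *);

/* Stand-in for the assembly f2: eight alternating in-memory increments.
   volatile keeps the compiler from merging them into two additions. */
__attribute__((noinline)) void f2_standin(uint64_t *a, uint64_t *b) {
    volatile uint64_t *va = a, *vb = b;
    for (int i = 0; i < 4; i++) {
        *va += 1;
        *vb += 1;
    }
}

/* Baseline: does nothing. The empty asm statement keeps the compiler
   from treating the function as side-effect free and deleting the calls. */
__attribute__((noinline)) void empty_fn(uint64_t *a, uint64_t *b) {
    (void)a;
    (void)b;
    __asm__ volatile("" ::: "memory");
}

/* Time `calls` invocations of fn; returns elapsed TSC ticks. */
static uint64_t time_loop(bench_fn fn, uint64_t *a, uint64_t *b, uint64_t calls) {
    unsigned int aux;
    uint64_t start = __rdtscp(&aux);
    for (uint64_t i = 0; i < calls; i++)
        fn(a, b);
    uint64_t end = __rdtscp(&aux);
    return end - start;
}

/* Subtract the empty-function loop from the measured loop, then divide
   by the total number of operations (assumed: calls * ops_per_call). */
double cycles_per_op(bench_fn fn, uint64_t calls, uint64_t ops_per_call) {
    uint64_t x = 0, y = 0;
    uint64_t with_work = time_loop(fn, &x, &y, calls);
    uint64_t baseline  = time_loop(empty_fn, &x, &y, calls);
    return ((double)with_work - (double)baseline)
           / (double)(calls * ops_per_call);
}
```

A call such as cycles_per_op(f2_standin, 10000000, 8) would then give the kind of per-operation average quoted below. Because the baseline is measured separately, the result can be noisy or even slightly negative for short runs.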

The result is quite surprising. On each run of the program, the resulting average is either 3 cycles or approximately 1.68 cycles. It is always one of these two values, never anything else. It feels as if a random switch is set at program start that determines the cost of this function.

Any insights into this would be much appreciated!

I’m using clang 11.1 on an AMD Ryzen 9 5950X, in case that matters.

I just noticed that this depends on how the program is compiled. The behaviour described above appears when compiling with cc. When using clang instead, the average is neither of the two previous numbers but always 0.68 cycles.

Quite strange, I thought the two commands were the same…

$ cc --version
clang version 11.1.0
Target: x86_64-unknown-linux-gnu
Thread model: posix
InstalledDir: /nix/store/lw5h02if67ypzghacqjja6b6q4wj4qbf-clang-11.1.0/bin
$ clang --version
clang version 11.1.0
Target: x86_64-unknown-linux-gnu
Thread model: posix
InstalledDir: /nix/store/lw5h02if67ypzghacqjja6b6q4wj4qbf-clang-11.1.0/bin