AVX2 popcount regression

With -march=x86-64-v3, clang compiles the code

#include <inttypes.h>

int popcount8(uint64_t data[8]) {
  int count = 0;
  for (int i = 0; i < 8; ++i)
    count += __builtin_popcountll(data[i]);
  return count;
}

to some fairly complicated AVX2 code (Compiler Explorer), while gcc just issues eight popcnt instructions (plus an extra xor to clear the output register, working around the old popcnt false-dependency bug). At least on my machine (AMD Ryzen 7 3700U), the popcnt version is 3x faster in a simple benchmark:

#include <inttypes.h>
#include <stdio.h>
#include <stdlib.h>

/* presumably the same popcount8, compiled separately by gcc and clang
   and linked in */
extern int popcount8_gcc(uint64_t *);
extern int popcount8_clang(uint64_t *);

int main() {
        uint64_t xs[8];
        FILE *fd = fopen("/dev/urandom", "rb");
        if (!fd || fread(xs, 1, 64, fd) != 64) exit(1);
        for (long i = 0; i < 1000000000; ++i) {
                popcount8_clang(xs); /* swap in popcount8_gcc to compare */
        }
        return 0;
}

llvm-mca in some cases estimates that the gcc alternative is faster. Is there any way to encourage LLVM to just use popcnt with -march=x86-64-v3?
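
One blunt workaround is to stop the vectorizers from firing at all. A minimal sketch (assuming the loop pragma is honored here; since the AVX2 sequence may come from the SLP vectorizer after the loop is fully unrolled, -fno-slp-vectorize or -fno-vectorize on the command line are the other knobs to try):

#include <stdint.h>

int popcount8_scalar(uint64_t data[8]) {
  int count = 0;
  /* This pragma only constrains the loop vectorizer; if the SLP
     vectorizer still produces the AVX2 expansion, build with
     -fno-slp-vectorize instead. */
#pragma clang loop vectorize(disable)
  for (int i = 0; i < 8; ++i)
    count += __builtin_popcountll(data[i]);
  return count;
}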

It looks like the issue isn’t the vectorized popcount expansion but the truncation feeding the reduction:

define i32 @popcount8(ptr nocapture noundef readonly %data)  {
entry:
  %0 = load <8 x i64>, ptr %data, align 8
  %1 = tail call <8 x i64> @llvm.ctpop.v8i64(<8 x i64> %0)
  %2 = trunc <8 x i64> %1 to <8 x i32>
  %3 = tail call i32 @llvm.vector.reduce.add.v8i32(<8 x i32> %2)
  ret i32 %3
}
declare <8 x i64> @llvm.ctpop.v8i64(<8 x i64>) #1
declare i32 @llvm.vector.reduce.add.v8i32(<8 x i32>) #1
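
The trunc is the expensive part: AVX2 has no single instruction that narrows <8 x i64> to <8 x i32>, so the backend has to synthesize it from shuffles. A rough intrinsics sketch of the shape involved (an illustration of the narrowing cost, not clang's exact output):

#include <immintrin.h>

/* Pack the low 32 bits of eight 64-bit lanes (split across two ymm
   registers) into one ymm register. shuffle_ps only operates within
   128-bit lanes, so a cross-lane permute is needed afterwards to
   restore element order. */
static inline __m256i trunc_v8i64_to_v8i32(__m256i lo, __m256i hi) {
  __m256 packed = _mm256_shuffle_ps(_mm256_castsi256_ps(lo),
                                    _mm256_castsi256_ps(hi),
                                    _MM_SHUFFLE(2, 0, 2, 0));
  return _mm256_permute4x64_epi64(_mm256_castps_si256(packed),
                                  _MM_SHUFFLE(3, 1, 2, 0));
}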

A scalar i64 -> i32 truncation, by contrast, is effectively free, so if we instead reduce in i64 and truncate the final scalar, we get a notable boost:

define i32 @popcount8(ptr nocapture noundef readonly %data)  {
entry:
  %0 = load <8 x i64>, ptr %data, align 8
  %1 = tail call <8 x i64> @llvm.ctpop.v8i64(<8 x i64> %0)
  %2 = tail call i64 @llvm.vector.reduce.add.v8i64(<8 x i64> %1)
  %3 = trunc i64 %2 to i32
  ret i32 %3
}
declare <8 x i64> @llvm.ctpop.v8i64(<8 x i64>) #1
declare i64 @llvm.vector.reduce.add.v8i64(<8 x i64>)

Please can you try this instead (keeping the accumulator as uint64_t avoids the vector truncation entirely):

#include <inttypes.h>

uint64_t popcount8(uint64_t data[8]) {
  uint64_t count = 0;
  for (int i = 0; i < 8; ++i)
    count += __builtin_popcountll(data[i]);
  return count;
}

I’ve raised [X86] Prefer trunc(reduce(x)) over reduce(trunc(x)) (llvm/llvm-project#81469) for the missing reduce(trunc(x)) -> trunc(reduce(x)) fold.


Thanks for looking into this. The non-truncating version doesn’t appear to make a difference in runtime on my machine (3.09s for gcc, 9.41s for clang). llvm-mca reports that the clang version will take 51.1 cycles per iteration and that the gcc alternative (8 popcnt and 7 add instructions) will take 17.9 cycles.

Please can you post a godbolt link to your mca codegen?

This is what I’m seeing: Compiler Explorer (x86-64-v3 defaults to Haswell model stats)

Sure, sorry about the late response. Here are some examples with the regression:

  • zen3
  • zen1; this is closer to the numbers I posted.

You can also see a discrepancy for Haswell with -iterations=1.