AVX2 popcount regression

With -march=x86-64-v3, clang compiles the code

#include <inttypes.h>

int popcount8(uint64_t data[8]) {
  int count = 0;
  for (int i = 0; i < 8; ++i)
    count += __builtin_popcountll(data[i]);
  return count;
}

to some fairly complicated AVX2 code (Compiler Explorer), while gcc just issues eight popcnt instructions (plus an extra xor to clear the output register, working around the old popcnt false-dependency bug). At least on my machine (AMD Ryzen 7 3700U), the popcnt version is 3x faster in a simple benchmark:

#include <inttypes.h>
#include <stdio.h>
#include <stdlib.h>

/* presumably the same popcount8, compiled separately by gcc and clang
   and linked in */
extern int popcount8_gcc(uint64_t *);
extern int popcount8_clang(uint64_t *);

int main() {
        uint64_t xs[8];
        FILE *fd = fopen("/dev/urandom", "rb");
        if (!fd || fread(xs, 1, 64, fd) != 64) exit(1);
        for (long i = 0; i < 1000000000; ++i) {
                popcount8_clang(xs); /* swap in popcount8_gcc to compare */
        }
        return 0;
}

llvm-mca in some cases estimates that the gcc alternative is faster. Is there any way to encourage LLVM to just use popcnt with -march=x86-64-v3?
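
One blunt workaround is to stop the vectorizers from firing at all. A minimal sketch (assuming the loop pragma is honored here; since the AVX2 sequence may come from the SLP vectorizer after the loop is fully unrolled, -fno-slp-vectorize or -fno-vectorize on the command line are the other knobs to try):

#include <stdint.h>

int popcount8_scalar(uint64_t data[8]) {
  int count = 0;
  /* This pragma only constrains the loop vectorizer; if the SLP
     vectorizer still produces the AVX2 expansion, build with
     -fno-slp-vectorize instead. */
#pragma clang loop vectorize(disable)
  for (int i = 0; i < 8; ++i)
    count += __builtin_popcountll(data[i]);
  return count;
}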

It looks like the issue isn’t the vectorized popcount expansion but the truncation feeding the reduction:

define i32 @popcount8(ptr nocapture noundef readonly %data)  {
entry:
  %0 = load <8 x i64>, ptr %data, align 8
  %1 = tail call <8 x i64> @llvm.ctpop.v8i64(<8 x i64> %0)
  %2 = trunc <8 x i64> %1 to <8 x i32>
  %3 = tail call i32 @llvm.vector.reduce.add.v8i32(<8 x i32> %2)
  ret i32 %3
}
declare <8 x i64> @llvm.ctpop.v8i64(<8 x i64>) #1
declare i32 @llvm.vector.reduce.add.v8i32(<8 x i32>) #1
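
The trunc is the expensive part: AVX2 has no single instruction that narrows <8 x i64> to <8 x i32>, so the backend has to synthesize it from shuffles. A rough intrinsics sketch of the shape involved (an illustration of the narrowing cost, not clang's exact output):

#include <immintrin.h>

/* Pack the low 32 bits of eight 64-bit lanes (split across two ymm
   registers) into one ymm register. shuffle_ps only operates within
   128-bit lanes, so a cross-lane permute is needed afterwards to
   restore element order. */
static inline __m256i trunc_v8i64_to_v8i32(__m256i lo, __m256i hi) {
  __m256 packed = _mm256_shuffle_ps(_mm256_castsi256_ps(lo),
                                    _mm256_castsi256_ps(hi),
                                    _MM_SHUFFLE(2, 0, 2, 0));
  return _mm256_permute4x64_epi64(_mm256_castps_si256(packed),
                                  _MM_SHUFFLE(3, 1, 2, 0));
}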

A scalar i64 -> i32 truncation, by contrast, is effectively free, so if we instead reduce in i64 and truncate the final scalar, we get a notable boost:

define i32 @popcount8(ptr nocapture noundef readonly %data)  {
entry:
  %0 = load <8 x i64>, ptr %data, align 8
  %1 = tail call <8 x i64> @llvm.ctpop.v8i64(<8 x i64> %0)
  %2 = tail call i64 @llvm.vector.reduce.add.v8i64(<8 x i64> %1)
  %3 = trunc i64 %2 to i32
  ret i32 %3
}
declare <8 x i64> @llvm.ctpop.v8i64(<8 x i64>) #1
declare i64 @llvm.vector.reduce.add.v8i64(<8 x i64>)

Please can you try this instead (keeping the accumulator as uint64_t avoids the vector truncation entirely):

#include <inttypes.h>

uint64_t popcount8(uint64_t data[8]) {
  uint64_t count = 0;
  for (int i = 0; i < 8; ++i)
    count += __builtin_popcountll(data[i]);
  return count;
}

I’ve raised [X86] Prefer trunc(reduce(x)) over reduce(trunc(x)) (llvm/llvm-project#81469) for the missing reduce(trunc(x)) -> trunc(reduce(x)) fold.


Thanks for looking into this. The non-truncating version doesn’t appear to make a difference in runtime on my machine (3.09s for gcc, 9.41s for clang). llvm-mca reports that the clang version will take 51.1 cycles per iteration and that the gcc alternative (8 popcnt and 7 add instructions) will take 17.9 cycles.

Please can you post a godbolt link to your mca codegen?

This is what I’m seeing: Compiler Explorer (x86-64-v3 defaults to Haswell model stats)

Sure, sorry about the late response. Here are some examples with the regression:

  • zen3
  • zen1; this is closer to the numbers I posted.

You can also see a discrepancy for Haswell with -iterations=1.