Help needed on vectorization strangeness and stuff

I happened to have a program that seemed to compile into quite
non-optimal machine code, and some spare time, so I decided this was a
good opportunity to learn more about optimization passes.

Now I think I figured out what is going on - but I’m stuck and would
appreciate some help on how to continue. Please check that my
conclusions are correct and answer my questions towards the end - or
tell me that I’m asking the wrong questions. Or just tell me what I can
do to fix the bug.

Given this simple program:

// ---8<---------- [interesting.c]
unsigned char dst[DEFINEME] __attribute__((aligned (64)));
unsigned char src[DEFINEME] __attribute__((aligned (64)));

void copy_7bits(void)
{
    for (int i = 0; i < DEFINEME; i++)
        dst[i] = src[i] & 0x7f;
}
// ---8<----------------------------------

compiled with:

clang -march=haswell -O3 -S -o - interesting.c -DDEFINEME=160

it generates some interesting stuff which basically amounts to:

vmovaps .LCPI0_0(%rip), %ymm0 # ymm0 = [127,…,127]
vandps src(%rip), %ymm0, %ymm1
vandps src+32(%rip), %ymm0, %ymm2
vandps src+64(%rip), %ymm0, %ymm3
vandps src+96(%rip), %ymm0, %ymm0
vmovaps %ymm1, dst(%rip)
vmovaps %ymm2, dst+32(%rip)
vmovaps %ymm3, dst+64(%rip)
vmovaps %ymm0, dst+96(%rip)

This looks OK, and with -DDEFINEME=128 this is the entire result. But
I compiled with -DDEFINEME=160, so there are another 32 bytes to be
processed. Those end up looking like this:

movb src+128(%rip), %al
andb $127, %al
movb %al, dst+128(%rip)
movb src+129(%rip), %al
andb $127, %al
movb %al, dst+129(%rip)

… Guess I don’t need to show 84 more instructions
… here to get my point across

movb src+158(%rip), %al
andb $127, %al
movb %al, dst+158(%rip)
movb src+159(%rip), %al
andb $127, %al
movb %al, dst+159(%rip)

From what I can tell, the loop vectorizer concludes that it is a good
idea to interleave the loop 4 times. The loop has a trip count of 160,
which is 5 trips after vectorization (32 bytes per ymm register).
Interleaving by 4 covers 4 of those trips (128 bytes) and leaves a
remainder of 32 scalar trips, which does not get vectorized.

This remainder then gets unrolled in a later unrolling pass.
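
The arithmetic behind that remainder can be sketched in a few lines of C. This is only an illustration of the numbers in this mail, not LLVM code: the helper names are made up, and the vectorization factor of 32 and interleave count of 4 are assumptions read off the generated assembly.

```c
/* Sketch only: hypothetical helpers, not part of LLVM. */

/* Elements consumed by one iteration of the interleaved vector loop. */
static unsigned vector_step(unsigned vf, unsigned ic)
{
    return vf * ic;
}

/* Number of iterations the interleaved vector loop executes. */
static unsigned vector_iterations(unsigned tc, unsigned vf, unsigned ic)
{
    return tc / vector_step(vf, ic);
}

/* Elements left over for the scalar remainder loop. */
static unsigned scalar_remainder(unsigned tc, unsigned vf, unsigned ic)
{
    return tc % vector_step(vf, ic);
}
```

With VF = 32 and an interleave count of 4, scalar_remainder(128, 32, 4) is 0, while scalar_remainder(160, 32, 4) is 32, which matches the 32 byte-sized load/and/store groups above.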

  • Question on interleaving

What is the reasoning behind the value 128 for
TinyTripCountInterleaveThreshold? Is it measured in number of trips
before vectorization? Shouldn’t it be the number of trips after
vectorization, as that is the trip count that is relevant once the
loop is vectorized?

E.g:

--- a/lib/Transforms/Vectorize/LoopVectorize.cpp
+++ b/lib/Transforms/Vectorize/LoopVectorize.cpp
@@ -6382,7 +6382,7 @@ unsigned LoopVectorizationCostModel::selectInterleaveCount(bool OptForSize,
 
   // Do not interleave loops with a relatively small trip count.
   unsigned TC = PSE.getSE()->getSmallConstantTripCount(TheLoop);
-  if (TC > 1 && TC < TinyTripCountInterleaveThreshold)
+  if (TC > 1 && (TC / VF) < TinyTripCountInterleaveThreshold)
     return 1;
 
   unsigned TargetNumRegisters = TTI.getNumberOfRegisters(VF > 1);
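
To see what the two variants of the check do for the example in this mail, here is a small stand-alone C sketch. It is not LLVM code: the function names are hypothetical, and the threshold constant is assumed to be 128 as mentioned in the question.

```c
#include <stdbool.h>

/* Assumed value, taken from the "128" mentioned in the question. */
enum { TINY_TRIP_COUNT_INTERLEAVE_THRESHOLD = 128 };

/* Existing check: compares the scalar trip count with the threshold.
 * Returns true when interleaving is skipped (the "return 1" path). */
static bool skips_interleave_current(unsigned tc)
{
    return tc > 1 && tc < TINY_TRIP_COUNT_INTERLEAVE_THRESHOLD;
}

/* Modified check: compares the trip count after vectorization. */
static bool skips_interleave_proposed(unsigned tc, unsigned vf)
{
    return tc > 1 && (tc / vf) < TINY_TRIP_COUNT_INTERLEAVE_THRESHOLD;
}
```

For TC = 160 and VF = 32, the existing check does not skip interleaving (160 is not below 128), while the modified check does (160 / 32 = 5 is below 128). Note that with an unchanged threshold the modified check would only allow interleaving from TC = 4096 upwards at VF = 32, so the threshold value itself would probably need revisiting as well.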