SCEV and LoopStrengthReduction Formulae

I am attempting to implement a minor loop strength reduction optimization for
targets that support compare-and-jump fusion, specifically targets for which
TTI::canMacroFuseCmp() returns true. My approach might be wrong, so I am
soliciting feedback on the idea before implementing it, to make sure I get it
right. My plan is to add a supplemental LSR formula to LoopStrengthReduce.cpp
that optimizes the following case, though perhaps this should stand alone as its
own pass:

// Example which can be optimized via cmp/jmp fusion.
// clang -O3 -S test.c
extern void g(int);
void f(int *p, long long n) {
    do {
        g(*p++);
    } while (--n);
}

LLVM currently generates the following sequence for x86_64 targets:
.LBB0_1:
  movl (%r15,%rbx,4), %edi
  callq g
  addq $1, %rbx
  cmpq %rbx, %r14
  jne .LBB0_1

LLVM can perform compare-jump fusion, and it already does so in certain cases, but not
in the case above. We could remove the cmp above by performing
the following transformation:
1.0) Initialize the induction variable, %rbx, to 'n' instead of zero.
1.1) Negate the induction variable, so that the loop counts up from -n to zero.
2.0) Set the base pointer, %r15, to point just past the last element that will
       be accessed through 'p'.
2.1) In the loop preheader, set %r15 = <base> + n * <element size>.
3.0) Remove the cmpq.

We can remove the cmp since we are now counting from -n up to zero; the
termination value for the loop is now zero. The add instruction will set
the zero flag when %rbx reaches zero (target specific, of course).

The pointer passed to g() will also be correct, since we access the same items
we normally would have; the accesses are simply made relative to the new base
(just past the last item) and indexed by the negative induction
variable. I imagine the result would look something like the following:
    movq %rsi, %rbx ; rbx becomes the value n
    leaq (%rdi, %rbx, 4), %r15 ; <base> = p + (n * 4)
    negq %rbx ; set rbx to be -n, so we count from -n to 0
.LBB0_1:
    movl (%r15,%rbx,4), %edi
    callq g
    addq $1, %rbx
    jne .LBB0_1
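
At the C level, the transformation amounts to roughly the following hand-written
sketch (illustrative only, not actual compiler output; 'base' and 'i' are names
I am introducing here):

extern void g(int);
void f(int *p, long long n) {
    int *base = p + n;   /* base points just past the last element of p */
    long long i = -n;    /* induction variable counts from -n up to 0 */
    do {
        g(base[i]);      /* base[-n .. -1] visits the same elements as p[0 .. n-1] */
    } while (++i);       /* loop exits when i reaches 0; on x86 the increment
                            sets the zero flag, so no separate cmp is needed */
}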

I realize this is a micro-optimization, saving perhaps a single cycle, but it reduces the instruction count: one less
instruction to decode in a potentially hot path. If this all makes sense and seems like a reasonable addition
to LLVM, would it make sense to implement this as a supplemental LSR formula, or as a separate pass?

-Matt

From: Gopalasubramanian, Ganesh <Ganesh.Gopalasubramanian@amd.com>
Sent: Wednesday, April 4, 2018 1:11 AM
To: Davis, Matthew <Matthew.Davis@sony.com>

> cmpq %rbx, %r14
> jne .LBB0_1
>
> LLVM can perform compare-jump fusion, and it already does so in certain
> cases, but not in the case above. We could remove the cmp above by
> performing the following transformation:

Do you mean branch-fusion (https://en.wikichip.org/wiki/macro-operation_fusion)?

Thanks for your reply.

After looking at your link, the x86 instruction manual, and the Intel
Architecture Optimization manual (section 3.4.2.2), I realized that the fusion
is performed automatically by the hardware on the cmp+jne pair. What I am trying
to achieve is the removal of the cmp, which would leave one less instruction in
the add/cmp/jmp sequence. Since 'add' sets the zero flag and the jmp checks that
flag, the change would replace the fusible cmp+jmp with just add+jmp in this case.

On Sandy Bridge and later microarchitectures, which introduced the ability to
fuse ADD (and a few other instructions) with a conditional jump, this proposed
optimization would remove the cmp from the add/cmp/jmp sequence, leaving just
add+jmp, which is fusible. Removing the cmp eliminates a single fetch/decode in
a potentially hot path (the loop). The benefit is one less instruction to
decode, and less pressure on the decoder if the decoded instruction cache fills
up or is flushed.

Is there any other limitation why these two are not fused?

Excuse my confusion here. The hardware might already have been fusing the cmp+jmp;
I was hoping to just remove the cmp and avoid the instruction entirely.

-Matt

> I realize this is a micro-optimization, saving perhaps a single cycle, but it reduces the instruction count: one less
> instruction to decode in a potentially hot path. If this all makes sense and seems like a reasonable addition
> to LLVM, would it make sense to implement this as a supplemental LSR formula, or as a separate pass?

This seems reasonable to me so long as rbx has no other uses that would complicate the problem; I’m not sure how much this occurs in hot code (a loop with an induction variable that isn’t used in the loop), but if it does, I don’t see why not.

As a side note, in a past life, when I used to do x86 SIMD optimization for a living, I did similar tricks pretty much everywhere in DSP functions. It’d be pretty nice if the compiler could do it too.

There is one alternate approach that I recall, which looks like this:

Original code (example, pseudocode):

int add_delta_256(uint8 *in1, uint8 *in2) {
    int accum = 0;
    for (int i = 0; i < 16; ++i) {
        uint8x16 a = load16(in1 + i * 16); // NOTE: takes an extra addressing op because x86
        uint8x16 b = load16(in2 + i * 16); // NOTE: takes an extra addressing op because x86
        accum += psadbw(a, b);
    }
    return accum;
}

end of loop:
inc i
cmp i, 16
jl loop

LSR’d code:

int add_delta_256(uint8 *in1, uint8 *in2) {
    int accum = 0;
    for (int i = 0; i < 16; ++i, in1 += 16, in2 += 16) {
        uint8x16 a = load16(in1);
        uint8x16 b = load16(in2);
        accum += psadbw(a, b);
    }
    return accum;
}

end of loop:
add in1, 16
add in2, 16
inc i
cmp i, 16
jl loop

your code:

int add_delta_256(uint8 *in1, uint8 *in2) {
    int accum = 0;
    for (int i = -16; i < 0; ++i, in1 += 16, in2 += 16) {
        uint8x16 a = load16(in1);
        uint8x16 b = load16(in2);
        accum += psadbw(a, b);
    }
    return accum;
}

end of loop:
add in1, 16
add in2, 16
inc i
jl loop

ideal code:

int add_delta_256(uint8 *in1, uint8 *in2) {
    int accum = 0;
    in1 += 256;
    in2 += 256;
    for (int i = -256; i < 0; i += 16) {
        uint8x16 a = load16(in1 + i);
        uint8x16 b = load16(in2 + i);
        accum += psadbw(a, b);
    }
    return accum;
}

end of loop:
add i, 16
jl loop

I don’t know, however, if it’s reasonable to teach the compiler to do the clever nonsense necessary to do the last one (requires enough understanding of x86 addressing modes, for one).
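
For concreteness, here is a compilable sketch of the "ideal" variant, assuming
SSE2 intrinsics on x86-64 and inputs of exactly 256 bytes (the function name and
types here are illustrative, not part of the pseudocode above):

#include <emmintrin.h>
#include <stdint.h>

int add_delta_256_sse2(const uint8_t *in1, const uint8_t *in2) {
    __m128i acc = _mm_setzero_si128();
    in1 += 256;                              /* advance bases past the last element */
    in2 += 256;
    for (long i = -256; i < 0; i += 16) {    /* byte offset counts up toward zero */
        __m128i a = _mm_loadu_si128((const __m128i *)(in1 + i));
        __m128i b = _mm_loadu_si128((const __m128i *)(in2 + i));
        acc = _mm_add_epi64(acc, _mm_sad_epu8(a, b));   /* psadbw, then accumulate */
    }
    /* psadbw leaves two 64-bit partial sums; fold them together. */
    return (int)(_mm_cvtsi128_si64(acc) +
                 _mm_cvtsi128_si64(_mm_srli_si128(acc, 8)));
}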

—escha

From: fglaser@apple.com <fglaser@apple.com> On Behalf Of escha@apple.com
Sent: Saturday, April 7, 2018 8:22 AM

>> I realize this is a micro-optimization, saving perhaps a single cycle, but it reduces the instruction count: one less
>> instruction to decode in a potentially hot path. If this all makes sense and seems like a reasonable addition
>> to LLVM, would it make sense to implement this as a supplemental LSR formula, or as a separate pass?

> This seems reasonable to me so long as rbx has no other uses that would complicate the problem; I’m not sure how much
> this occurs in hot code (a loop with an induction variable that isn’t used in the loop), but if it does, I don’t see why not.

Thanks for this response and the examples! I too think it could be a win, and I've explored both options: implementing this as a pass (a machine function pass) and as a SCEV expression for LSR (an LSR formula). I'm still on the fence about which approach is 'best', or more widely accepted. To me it seems more advantageous to add this as an LSR formula, since that pass provides a cost model and is target independent. However, it does rely on the fact that x86 sets the zero flag for inc/add; I'm not sure about other architectures. Additionally, this change would be fairly similar to what LSR already does.

-Matt

Changes to x86 LSR can lead to surprising perf outcomes. Test any changes carefully across multiple uarches.

I added a small bit to LSR recently:
https://reviews.llvm.org/rL324289
to solve:
https://bugs.llvm.org/show_bug.cgi?id=35681

…and as noted in the trailing comments, that change needs refinement to account for uarch diffs.

Later, I hit this (still undocumented AFAIK) perf problem related to LSR:

https://bugs.llvm.org/show_bug.cgi?id=36180