Spills and values present in both registers & stack

One piece of code I'm writing has a lot of intermediates, and I'm
trying to optimize down the number of memory accesses. Here's a
snippet from the start of the function, where I think there is some
low-hanging fruit:

# BB#0:
  pushq %rbp
  pushq %r15
  pushq %r14
  pushq %r13
  pushq %r12
  pushq %rbx
  movq %rdx, %rcx
  movq %rdi, -16(%rsp) # 8-byte Spill
  movq (%rsi), %rdi
  movq 8(%rsi), %r8
  movq 8(%rcx), %rax
  movq %rax, -24(%rsp) # 8-byte Spill
  movq 16(%rcx), %rax
  movq %rax, -8(%rsp) # 8-byte Spill
  movq %rdi, %rax
  mulq -24(%rsp) # 8-byte Folded Reload

You'll note that rbx,r12,r13,r14,r15,rbp are all dead after the
pushes. But the spill code still insists on using rax to load the
spilled values, forcing them to be reloaded later. Is the register
allocator (pbqp, I think) capable of having values in registers and on
the stack at the same time?

One piece of code I'm writing has a lot of intermediates, and I'm
trying to optimize down the number of memory accesses. Here's a
snippet from the start of the function, where I think there is some
low-hanging fruit:

# BB#0:
pushq %rbp
pushq %r15
pushq %r14
pushq %r13
pushq %r12
pushq %rbx
movq %rdx, %rcx
movq %rdi, -16(%rsp) # 8-byte Spill
movq (%rsi), %rdi
movq 8(%rsi), %r8
movq 8(%rcx), %rax
movq %rax, -24(%rsp) # 8-byte Spill
movq 16(%rcx), %rax
movq %rax, -8(%rsp) # 8-byte Spill
movq %rdi, %rax
mulq -24(%rsp) # 8-byte Folded Reload

You'll note that rbx,r12,r13,r14,r15,rbp are all dead after the
pushes. But the spill code still insists on using rax to load the
spilled values, forcing them to be reloaded later.

I'm not the most familiar with this sort of thing - but a small
example (of llvm bitcode) & the optimization flags you used, etc,
might be helpful (& I might be able to have a go at explaining it, if
no one else does).

Is the register
allocator (pbqp, I think) capable of having values in registers and on
the stack at the same time?

I can at least confirm that the default register allocator on x86
isn't PBQP, it's the greedy allocator (unless your'e compiling with
-O0, in which case it's the fast allocator).

- David

I'm not the most familiar with this sort of thing - but a small
example (of llvm bitcode) & the optimization flags you used, etc,
might be helpful (& I might be able to have a go at explaining it, if
no one else does).

The actual C code is shorter. (You'll have to unwrap the lines.) Bonus
points if you can identify the algorithm. }:>

#include <stdint.h>
typedef unsigned int uint128_t __attribute__((mode(TI)));

typedef struct{
    uint64_t l[5];
} s;

void f(s * restrict r, const s * restrict x, const s * restrict y) {
    uint128_t t[5] = {0, 0, 0, 0, 0};
    #define BODY(i,j) { int i_ = i < j ? i : j; int j_ = i < j ? j :
i; uint128_t m = (uint128_t) x->l[i_] * (y->l[j_] * (i + j > 4 ? 19 :
1)); if (i + j > 4) { t[i + j - 5] += m; } else { t[i + j] += m; } }
    #define LOOP(i) BODY(i, 0); BODY(i, 1); BODY(i, 2); BODY(i, 3); BODY(i, 4);
    LOOP(0); LOOP(1); LOOP(2); LOOP(3); LOOP(4);
    #define FOLD1(i) r->l[i] = ((uint64_t) t[i] & (1LL << 51) - 1) +
(i == 0 ? 19 : 1) * (uint64_t)(t[(i + 4) % 5] >> 51)
    FOLD1(0); FOLD1(1); FOLD1(2); FOLD1(3); FOLD1(4);
    #define FOLD2(i) r->l[(i + 1) % 5] += (i == 4 ? 19 : 1) * (r->l[i]

51); r->l[i] &= (1LL << 51) - 1;

    FOLD2(0); FOLD2(1); FOLD2(2); FOLD2(3); FOLD2(4);
}

% clang -O4 -S -o f.l f.c
% llc -O3 --pre-RA-sched=list-hybrid --regalloc=pbqp -o f.s f.l

Hi Taral,

You'll note that rbx,r12,r13,r14,r15,rbp are all dead after the
pushes. But the spill code still insists on using rax to load the
spilled values, forcing them to be reloaded later. Is the register
allocator (pbqp, I think) capable of having values in registers and on
the stack at the same time?

There are a couple of problems here:
(1) PBQP doesn't split live intervals, so those array elements get
spilled everywhere, even in places where there are free registers.
(2) The PBQP allocator always introduces a stack slot for spilled
values, even when they're already on the stack.

I definitely want to tackle these issues, but they'll have to wait for
a month or so while I write up my thesis.

Cheers,
Lang.