Hello List!

I am looking at top-down vs. bottom-up list scheduling for simple(r) in-order

cores. First, for some context, below is a fairly representative pseudo-code

example of the sort of DSP-like codes I am looking at:

uint64_t foo(int *pA, int *pB, unsigned N, unsigned C) {

uint64_t sum = 0;

while (N-- > 0) {

A1 = *pA++;

A2 = *pA++;

B1 = *pB++;

B2 = *pB++;

...

sum += ((uint64_t) A1 * B1) >> C;

sum += ((uint64_t) A2 * B2) >> C;

...

}

return sum;

}

These kernels are very tight loops. In this sort of legacy codes it's not

uncommon that loops are manually tuned and unrolled with the available

registers in mind, so that all values just about fit in registers. In the

example above, we read from input streams A and B, and do 2 multiply-accumulate

operations. The loop is unrolled 2 times in this example, but 4, 8 or 16 would

more realistic, resulting in very high register pressure.

But legacy apps written in this way are not the only culprit; we see that with

(aggressive) loop unrolling we can end up in exactly the same situation: the

loopunroller does exactly what it needs to do, but it results in very high

register pressure.

Here's the (obvious) problem: all live values in these loops should just about

fit in registers, but suboptimal/wrong decisions are made resulting in a lot of

spills/reloads; the machine scheduler is making the life of the register allocator

very difficult. I am looking at regressions of about 30 - 40% for more

than a handful of kernels, and am thus very interested in what I could do about

this.

The first observation is that it looks like we default to bottom-up scheduling.

Starting bottom-up, all these "sum" variables are scheduled first, and after

that the loads (this is simplifying things a bit). And thus it looks like we

create a lot of very long live-ranges, causing the problems mentioned above.

When instead we start top-down, the loads are picked up first, then the MAC, and

this repeated for all MACs. The result is a sequence LD, MAC, LD, MAC, etc.,

which is let's say a sequence more in program-order, also with shorter

live-ranges. This does exactly what I want for these sort of kernels, it generates

exactly the code we want.

While playing with this bottom-up/top-down order, it didn't take long to see

that the top-down approach is excellent for these kind of codes, but not for

some other benchmarks, and so solving this issue not just a matter of

defaulting to a new scheduling policy.

I am thinking about prototyping an approach where we start with the bottom-up

approach (under a new misched option), but when a register-pressure/live-range

trashhold value is reached, we bail and fall back to the top-down approach.

This is my first rough idea, but I haven't looked at this problem for very

long. My first few data points is suggesting this might be benificial, but I

could be missing a lot here. And also I am sure that I am looking at an

old/classic problem here and I'm sure others have looked at this problem

before. Thus I am wondering if people have experiences/opinions on this.

Cheers,

Sjoerd.

IMPORTANT NOTICE: The contents of this email and any attachments are confidential and may also be privileged. If you are not the intended recipient, please notify the sender immediately and do not disclose the contents to any other person, use it for any purpose, or store or copy the information in any medium. Thank you.