thinking about timing-test-driven scheduler

Hi,

I've been thinking about how to implement a framework for attempting
instruction scheduling of small blocks of code by using (GA/simulated
annealing/etc.) controlled timing-test evaluations of various
orderings. (I'm particularly interested in small-ish numerical
inner-loop code on low-power CPUs like the Atom and various ARMs,
where the CPU doesn't have the ability to "massage" the compiler's
scheduling.) I had been thinking that the minimal-changes approach
would be to attach the desired ordering to the bitcode as metadata
nodes, use the standard JIT interface, and just add a tiny bit of code
in the scheduler to recognise the metadata and add the orderings as
extra dependencies via addPred().
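(In case it makes the idea clearer, roughly what I had in mind; the
!sched.order key is invented for this sketch and the scheduler-side
SUnit lookup is elided, but setMetadata() and SUnit::addPred() are the
existing APIs:)

#include "llvm/IR/BasicBlock.h"
#include "llvm/IR/Constants.h"
#include "llvm/IR/Instruction.h"
#include "llvm/IR/LLVMContext.h"
#include "llvm/IR/Metadata.h"
#include "llvm/IR/Type.h"

// Tag each instruction in the block with the position the search
// engine wants it to occupy.  "sched.order" is a made-up key.
void tagDesiredOrder(llvm::BasicBlock &BB) {
  llvm::LLVMContext &Ctx = BB.getContext();
  unsigned Pos = 0;
  for (llvm::Instruction &I : BB)
    I.setMetadata("sched.order",
                  llvm::MDNode::get(
                      Ctx, llvm::ConstantAsMetadata::get(llvm::ConstantInt::get(
                               llvm::Type::getInt32Ty(Ctx), Pos++))));
}

// Scheduler side (conceptually): having recovered the SUnits A and B
// whose tags say A must precede B, an artificial edge stops the list
// scheduler from reordering them:
//
//   B->addPred(llvm::SDep(A, llvm::SDep::Artificial));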

However, looking at things in detail, it seems instruction selection
does things that would make propagating a bitcode-instruction ordering
through to the MachineInstr ordering pretty invasive. It looks like it
may actually be less invasive overall to put the "try different
orderings" controller code at the point where the SelectionDAG is just
about to be converted to a list of MachineInstrs, essentially
repeatedly calling reg-alloc, executing, and using the run-time
information to suggest a new candidate ordering. Are there any obvious
reasons why this is actually a bad approach to the problem? On the
assumption that it's a reasonable idea, it involves repeatedly calling
reg-alloc and executing machine code from an unexpected point in the
LLVM system, so is there any documentation of any "destructive
updating" or "single-use" assumptions in the
compilation-and-execution codepaths? (I haven't found anything obvious
on the LLVM website.)
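(To make "unexpected point" concrete, the outer loop I'm imagining
would clone the module for every trial, precisely because I don't know
which parts of the codegen path assume they run once per function.
CloneModule() is the real API from llvm/Transforms/Utils/Cloning.h;
applyOrdering() and timeExecution() are placeholders for exactly the
plumbing I'm asking about:)

#include "llvm/IR/Module.h"
#include "llvm/Transforms/Utils/Cloning.h"
#include <algorithm>
#include <cstdint>
#include <limits>
#include <memory>
#include <vector>

// Hypothetical: impose one candidate ordering on the cloned function.
void applyOrdering(llvm::Module &M, const std::vector<unsigned> &Order);
// Hypothetical: JIT the module, run the kernel, return a TSC delta.
uint64_t timeExecution(std::unique_ptr<llvm::Module> M);

uint64_t bestOfCandidates(const llvm::Module &Original,
                          const std::vector<std::vector<unsigned>> &Candidates) {
  uint64_t Best = std::numeric_limits<uint64_t>::max();
  for (const std::vector<unsigned> &Order : Candidates) {
    // Fresh copy per trial, so any destructive updating in the
    // compile-and-execute path only ever sees a single-use module.
    std::unique_ptr<llvm::Module> Trial = llvm::CloneModule(Original);
    applyOrdering(*Trial, Order);
    Best = std::min(Best, timeExecution(std::move(Trial)));
  }
  return Best;
}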

Many thanks for any suggestions,
David Steven Tweed

Hi,

> I've been thinking about how to implement a framework for attempting
> instruction scheduling of small blocks of code by using (GA/simulated
> annealing/etc.) controlled timing-test evaluations of various
> orderings.

This sounds interesting.

> (I'm particularly interested in small-ish numerical inner-loop code
> on low-power CPUs like the Atom and various ARMs, where the CPU
> doesn't have the ability to "massage" the compiler's scheduling.)

Have you tried to do some benchmarking here? E.g. high-end x86 vs. Atom
processors? Does the current scheduler do a bad job here?

> Are there any obvious reasons why this is actually a bad approach to
> the problem? On the assumption that it's a reasonable idea, it
> involves repeatedly calling reg-alloc and executing machine code
> from an unexpected point in the LLVM system,

This sort of framework would be nice to have also for (especially
for) processors that one must cross-compile for (or that don't have a
JIT). Just a thought...

kalle

Some caveats: I was looking at LLVM because its JIT capability ought
to make it useful for doing GA-based stuff, rather than because of the
intrinsic quality of the LLVM scheduler. I'm more of an image
processing/scientific computing guy than a compiler guy, but I'd
expect that for common unpredictable branchy code (especially on
out-of-order CPUs) the difference between a good schedule and a
close-to-optimal schedule is probably in the noise. For the kind of
non-branchy numeric code I'm interested in, I'd expect that IF THE
STOCHASTIC SEARCH CAN BE MADE TO ACTUALLY WORK you might be able to
get 5-10 percent consistently.
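To make "stochastic search" concrete: the controller would be a
bog-standard simulated-annealing loop over orderings, with the
measured cycle count as the cost. A minimal sketch, where timeIt()
stands in for the whole linearise/reg-alloc/JIT/execute/read-TSC
pipeline and neighbour() is required to return only
dependence-respecting permutations:

#include <cmath>
#include <cstdint>
#include <functional>
#include <random>
#include <utility>
#include <vector>

// One candidate schedule: a permutation of instruction indices.
typedef std::vector<unsigned> Order;

Order anneal(Order cur,
             const std::function<uint64_t(const Order &)> &timeIt,
             const std::function<Order(const Order &, std::mt19937 &)> &neighbour,
             unsigned iters) {
  std::mt19937 rng(42);
  std::uniform_real_distribution<double> unif(0.0, 1.0);
  uint64_t curCost = timeIt(cur);
  Order best = cur;
  uint64_t bestCost = curCost;
  for (unsigned i = 0; i < iters; ++i) {
    double T = 1.0 - double(i) / iters;  // simple linear cooling schedule
    Order cand = neighbour(cur, rng);    // e.g. swap two independent instrs
    uint64_t cost = timeIt(cand);
    // Always accept improvements; accept regressions with a probability
    // that shrinks as the temperature drops.
    double relWorse = (double(cost) - double(curCost)) / double(curCost);
    if (cost < curCost || unif(rng) < std::exp(-relWorse / (T + 1e-9))) {
      cur = std::move(cand);
      curCost = cost;
      if (curCost < bestCost) { best = cur; bestCost = curCost; }
    }
  }
  return best;
}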

I don't have access to a high-end x86 at the moment; here are some
benchmark results on an Atom N270 that I don't completely understand,
and which may well be wrong. I need to dig in further and see
precisely what's happening. The program evaluates both
#include <emmintrin.h>  /* SSE2 intrinsics: __m128i, _mm_add_epi16 */
#include <stdint.h>
/* rdtscll() (read the timestamp counter) and LIM (the iteration
   count) are defined elsewhere in the harness. */

uint64_t
intrinsicsFn(__m128i* __restrict in,__m128i* __restrict out){
    uint64_t startTSC,endTSC;
    int i=0;
    rdtscll(startTSC);
    do{
        /* Indices are written as byteOffset/16 to mirror the byte
           offsets in the asm version below. */
        __m128i xmm0=in[0/16];
        __m128i xmm1=in[16/16];
        __m128i xmm2=in[32/16];
        __m128i xmm3=in[48/16];
        xmm0=_mm_add_epi16(in[64/16],xmm0);
        xmm1=_mm_add_epi16(in[80/16],xmm1);
        xmm2=_mm_add_epi16(in[96/16],xmm2);
        xmm3=_mm_add_epi16(in[112/16],xmm3);
        /* Reduction tree: (0+1) + (2+3). */
        xmm0=_mm_add_epi16(xmm1,xmm0);
        xmm2=_mm_add_epi16(xmm3,xmm2);
        xmm0=_mm_add_epi16(xmm2,xmm0);
        out[0/16]=xmm0;
        in+=8;out+=1;
    }while(++i<LIM);
    rdtscll(endTSC);
    return endTSC-startTSC;
}

and the 9072 different valid reorderings of the (hopefully equivalent)
inline-assembly-style function

uint64_t f9072(__m128i * __restrict in,__m128i * __restrict out){
    uint64_t startTSC,endTSC;
    int i=0;
    rdtscll(startTSC);
    do{
        /* Both pointers are plain inputs; the "memory" clobber tells
           the compiler that the asm reads *in and writes *out behind
           its back. */
        __asm__ __volatile__(
            "\tmovdqa 0(%1),%%xmm0\n"
            "\tmovdqa 16(%1),%%xmm1\n"
            "\tmovdqa 32(%1),%%xmm2\n"
            "\tmovdqa 48(%1),%%xmm3\n"
            "\tpaddw 64(%1),%%xmm0\n"
            "\tpaddw 80(%1),%%xmm1\n"
            "\tpaddw 96(%1),%%xmm2\n"
            "\tpaddw 112(%1),%%xmm3\n"
            "\tpaddw %%xmm1,%%xmm0\n"
            "\tpaddw %%xmm3,%%xmm2\n"
            "\tpaddw %%xmm2,%%xmm0\n"
            "\tmovdqa %%xmm0,(%0)\n"
            :
            : "r" (out), "r" (in)
            : "xmm0","xmm1","xmm2","xmm3","xmm4","xmm5","xmm6","xmm7","memory");
        in+=8;out+=1;
    }while(++i<LIM);
    rdtscll(endTSC);
    return endTSC-startTSC;
}
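
(rdtscll is the usual read-the-timestamp-counter macro, assumed to
expand to roughly the following on x86:)

#define rdtscll(val)                                          \
    do {                                                      \
        uint32_t lo_, hi_;                                    \
        __asm__ __volatile__("rdtsc" : "=a"(lo_), "=d"(hi_)); \
        (val) = ((uint64_t)hi_ << 32) | lo_;                  \
    } while (0)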

Note that my inline asm isn't that good, so the array-pointer
increments are forced to come at the end of the loop body in the asm
versions, which is probably not good for speed.
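(For anyone reproducing this: one way to let those increments migrate
in amongst the SSE instructions is to make both pointers read-write
("+r") operands and do the bumps inside the template. An untested
sketch, replacing the asm statement and the trailing "in+=8;out+=1;":)

__asm__ __volatile__(
    "\tmovdqa 0(%1),%%xmm0\n"
    "\tmovdqa 16(%1),%%xmm1\n"
    "\tmovdqa 32(%1),%%xmm2\n"
    "\tmovdqa 48(%1),%%xmm3\n"
    "\tpaddw 64(%1),%%xmm0\n"
    "\tpaddw 80(%1),%%xmm1\n"
    "\tpaddw 96(%1),%%xmm2\n"
    "\tpaddw 112(%1),%%xmm3\n"
    "\tpaddw %%xmm1,%%xmm0\n"
    "\tpaddw %%xmm3,%%xmm2\n"
    "\tpaddw %%xmm2,%%xmm0\n"
    "\tmovdqa %%xmm0,(%0)\n"
    "\tadd $16,%0\n"    /* out += 1 __m128i = 16 bytes  */
    "\tadd $128,%1\n"   /* in  += 8 __m128i = 128 bytes */
    : "+r" (out), "+r" (in)
    :
    : "xmm0","xmm1","xmm2","xmm3","memory","cc");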

I'd like to try a bigger inner loop, but the number of combinations
obviously grows too rapidly to try exhaustively this way rather than
via a stochastic sampling approach. Running those through g++-4.4
(which DOESN'T have an Atom pipeline model) and clang++ trunk gives
the attached graph (y is proportional to runtime, whilst x is the
scheduling instance, ordered by increasing y value).
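(As a sanity check on the 9072: the dependence DAG forces the final
paddw and the store to come last, and the remaining ten instructions
split into two independent five-instruction halves with 6 internal
orders each, giving 6*6*C(10,5) = 9072 interleavings. A brute-force
enumerator over topological orders of the DAG reproduces this:)

#include <cstdio>
#include <vector>

// Count the topological orders of a small dependence DAG by repeatedly
// placing any instruction whose predecessors have all been placed.
// deps[i] lists the instructions that must precede instruction i.
static long countOrders(const std::vector<std::vector<int>> &deps,
                        std::vector<bool> &placed, int remaining) {
  if (remaining == 0) return 1;
  long total = 0;
  for (size_t i = 0; i < deps.size(); ++i) {
    if (placed[i]) continue;
    bool ready = true;
    for (int p : deps[i])
      if (!placed[p]) { ready = false; break; }
    if (!ready) continue;
    placed[i] = true;
    total += countOrders(deps, placed, remaining - 1);
    placed[i] = false;
  }
  return total;
}

int main() {
  // 0-3: the four movdqa loads; 4-7: the paddw-from-memory ops (4 needs
  // 0, etc.); 8: paddw xmm1,xmm0; 9: paddw xmm3,xmm2; 10: the final
  // paddw xmm2,xmm0; 11: the store.
  std::vector<std::vector<int>> deps = {
      {}, {}, {}, {},
      {0}, {1}, {2}, {3},
      {4, 5}, {6, 7},
      {8, 9}, {10}};
  std::vector<bool> placed(deps.size(), false);
  std::printf("%ld\n", countOrders(deps, placed, (int)deps.size()));  // 9072
  return 0;
}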

The minimum over the g++ inline-assembly orderings:       73200 TSC ticks
The g++ -O3 time for the intrinsics function:             93712 TSC ticks
The minimum over the clang++ inline-assembly orderings:   79275 TSC ticks
The clang++ -O3 time for the intrinsics function:         72581 TSC ticks

Speculations: the clang intrinsics version is finding some
optimisation that it is blocked from applying to the asm version, so
it's not just the SSE scheduling that's being measured. g++ is
systematically applying some optimisation to the asm versions that
clang isn't, hence the rough "curve offset". For some reason g++ is
picking a noticeably worse ordering for the intrinsics; I need to find
an Ubuntu package for g++-4.5 and see whether its Atom pipeline model
improves things.

(I can make the hacked together benchmark code available by email if
anyone wants it.)

This probably raises as many questions as it answers... but it
suggests the idea may be worthwhile for really intense inner loop
code.

Regards,
David Steven Tweed

timeForSchedules.ps (250 KB)