Instruction Scheduling

Hi, guys,

I am comparing the performance of the default scheduler (which seems to be the one that minimizes register pressure) against no scheduling (-pre-RA-sched=none), and I got the numbers below. The ratio is low_reg_pressure/none, that is, the lower the number, the better the performance with low register pressure (a short sketch of how I compute the ratio follows the table):

Benchmark                         Ratio
CFP2000/177.mesa/177.mesa         1.00
CFP2000/179.art/179.art           0.98
CFP2000/183.equake/183.equake     1.00
CFP2000/188.ammp/188.ammp         0.98
CINT2000/164.gzip/164.gzip        0.97
CINT2000/175.vpr/175.vpr          0.97
CINT2000/176.gcc/176.gcc          n/a   // crashed!
CINT2000/181.mcf/181.mcf          1.02
CINT2000/186.crafty/186.crafty    1.00
CINT2000/197.parser/197.parser    1.01
CINT2000/252.eon/252.eon          n/a   // never runs
CINT2000/253.perlbmk/253.perlbmk  1.05
CINT2000/254.gap/254.gap          0.97
CINT2000/255.vortex/255.vortex    1.00
CINT2000/256.bzip2/256.bzip2      0.98
CINT2000/300.twolf/300.twolf      0.92
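
In case it helps to check my setup, here is a minimal sketch of how I compute the ratio. The timings are made up for illustration (only the resulting ratios match two rows of the table above):

    # Hypothetical wall-clock times in seconds; the real numbers come
    # from the nightly tester, not from this script.
    time_low_reg_pressure = {"300.twolf": 184.0, "253.perlbmk": 126.0}
    time_none             = {"300.twolf": 200.0, "253.perlbmk": 120.0}

    for bench in time_low_reg_pressure:
        ratio = time_low_reg_pressure[bench] / time_none[bench]
        # ratio < 1.0: the register-pressure scheduler was faster;
        # ratio > 1.0: it was slower than no scheduling.
        print(f"{bench}: {ratio:.2f}")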

In three cases, I got a ratio above 1, which must mean that scheduling for low register pressure had a negative impact on performance. I only ran the tests once, but I was wondering whether this could make sense, or whether I am setting up the tests wrongly. I am running the nightly test Makefile on a 32-bit x86 Linux machine.

best,

Fernando

It's hard to say. As I remember, -pre-RA-sched=none (when it still existed) did a depth-first traversal of the DAG and translated the nodes in that order. It's not particularly good at anything.
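
To make "depth-first order" concrete, here is a toy sketch with a made-up DAG (an illustration of the idea, not LLVM's actual SelectionDAG code). The walk just recurses through each node's operands and emits the node right after them, with no heuristic for register pressure or latency:

    # Hypothetical dependency DAG: each node lists the operands
    # that must be emitted before it.
    dag = {
        "load_a": [],
        "load_b": [],
        "add":    ["load_a", "load_b"],
        "load_c": [],
        "mul":    ["add", "load_c"],
        "store":  ["mul"],
    }

    def dfs_schedule(root, dag, emitted=None, order=None):
        """Emit nodes depth first: each node right after its operands,
        in whatever order the traversal happens to visit them."""
        if emitted is None:
            emitted, order = set(), []
        if root in emitted:
            return order
        emitted.add(root)
        for op in dag[root]:
            dfs_schedule(op, dag, emitted, order)
        order.append(root)
        return order

    print(dfs_schedule("store", dag))
    # -> ['load_a', 'load_b', 'add', 'load_c', 'mul', 'store']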

I assume your target of choice is x86. In that case, yes, the default is burr (bottom-up register reduction). On modern x86 CPUs, it's far more important to avoid register spills and restores; scheduling for latency before register allocation hasn't proven to be a win. Benchmarking x86 is very, very tricky: in many cases, obviously better code ended up being slower. Hidden hazards like loop alignment, instructions crossing the instruction dispatch buffer, etc. are very hard to model. If the scheduler ends up reducing the number of instructions (and loads and stores), then it's doing its job. It's probably more useful to look at those counts than at the raw runtime.
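
As a rough illustration of why scheduling for register pressure matters here (a toy Sethi-Ullman-style sketch of my own, not burr's actual heuristic): evaluating the register-hungrier operand of an expression first keeps fewer values live at once than a naive left-to-right order, which on x86's few registers can be the difference between staying in registers and spilling:

    # Toy expression tree: ("op", left, right), or a string leaf.
    def need(node):
        """Sethi-Ullman number: registers needed to evaluate the
        subtree without spills. A leaf needs 1; an internal node
        needs the max of its children if they differ, else child + 1."""
        if isinstance(node, str):
            return 1
        _, l, r = node
        nl, nr = need(l), need(r)
        return max(nl, nr) if nl != nr else nl + 1

    def regs_used(node, hungry_first):
        """Registers a given evaluation order actually uses: while
        the second child is evaluated, the first child's result
        stays live in one extra register."""
        if isinstance(node, str):
            return 1
        _, l, r = node
        first, second = (r, l) if hungry_first and need(r) > need(l) else (l, r)
        return max(regs_used(first, hungry_first),
                   regs_used(second, hungry_first) + 1)

    # (a + b) needs 2 registers; ((c + d) + (e + f)) needs 3.
    expr = ("+", ("+", "a", "b"),
                 ("+", ("+", "c", "d"), ("+", "e", "f")))
    print(regs_used(expr, hungry_first=False))  # 4: left-to-right order
    print(regs_used(expr, hungry_first=True))   # 3: hungrier side first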

Also, not all x86 CPUs perform the same. Are you seeing these results on a current generation of x86 CPUs? And are you using the latest LLVM release (which I guess you are not, since -pre-RA-sched=none is gone)?

Evan