costing optimisations

Hi, I've been having some of the usual problems with C compiler optimisations:
they work fine on small functions but drop dead on big ones.

I think this is roughly because some function level optimisations are
worse than O(N) in the number of instructions.

Changing the command line -O level is not a solution.
The optimiser should adapt to ensure reasonable compile times.

Has anyone thought about costing optimisation passes and using
a linear optimisation formula to select when to apply them?

TeX does something like this to calculate the best places to put
line breaks when typesetting.

Roughly, a pass would calculate the cost of running on some
unit (translation unit, function), and a "goodness" estimating
how much improvement might be expected. The optimiser
then tries to simultaneously maximise goodness and minimise
cost by selecting some sequence of passes.
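
To make this concrete, here is a rough sketch of the kind of driver
I have in mind. Everything in it (PassEstimate, the greedy rule) is
invented for illustration -- it's just a cheap greedy approximation
to the knapsack problem this really is:

#include <algorithm>
#include <vector>

// Hypothetical pass descriptor -- not a real LLVM interface.
struct PassEstimate {
    const char *name;
    double cost;      // estimated running time on this unit (seconds, > 0)
    double goodness;  // estimated improvement (arbitrary units)
};

// Pick a subset of passes maximising total goodness within a time budget,
// greedily by goodness-per-unit-cost.
std::vector<PassEstimate>
selectPasses(std::vector<PassEstimate> passes, double budget) {
    std::sort(passes.begin(), passes.end(),
              [](const PassEstimate &a, const PassEstimate &b) {
                  return a.goodness / a.cost > b.goodness / b.cost;
              });
    std::vector<PassEstimate> chosen;
    double spent = 0.0;
    for (const PassEstimate &p : passes) {
        if (spent + p.cost <= budget) {
            chosen.push_back(p);
            spent += p.cost;
        }
    }
    return chosen;
}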

Another way to do this would be to calculate a time limit on
a pass and abort it if it exceeded the limit.

Some optimisations could be coded to do this kind of thing
internally, i.e. partition a function and optimise the pieces,
rather than the whole function, to avoid time overruns.
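
Sketched very roughly (Piece and optimisePiece are hypothetical
stand-ins, not any real pass API), the combination of the two ideas
might look like this -- pieces already processed keep their
improvements, and the rest are simply emitted unoptimised:

#include <chrono>
#include <vector>

// Hypothetical: one independently optimisable partition of a function.
struct Piece { /* instructions, locals, ... */ };

// Hypothetical stand-in for the expensive (possibly superlinear) transform.
void optimisePiece(Piece &) { /* ... */ }

// Optimise pieces one at a time, stopping cleanly when the budget runs out.
// The output is still correct, just less optimised.
void optimiseWithDeadline(std::vector<Piece> &pieces,
                          std::chrono::milliseconds budget) {
    auto deadline = std::chrono::steady_clock::now() + budget;
    for (Piece &p : pieces) {
        if (std::chrono::steady_clock::now() >= deadline)
            break; // over budget: leave the remaining pieces alone
        optimisePiece(p);
    }
}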

FYI: I have a product which generates big functions.
It does this for two reasons: one is performance;
the other is the lack of suitable semantics in C, roughly
coroutine calls (exchange of control), and probably
continuation passing. LLVM also lacks these fundamentals
and is too high level to implement them, so the implementation
is done natively, but the result is big functions.

Adding LLVMdev, since this is intimately related to the optimization passes.

I think this is roughly because some function level optimisations are
worse than O(N) in the number of instructions.

Please profile this and mail llvmdev regarding passes with
significantly superlinear behavior (e.g. O(n^2)). My understanding is
that these kinds of algorithmic problems are generally considered bugs
in the optimizer and will be fixed.

Roughly, a pass would calculate the cost of running on some
unit (translation unit, function), and a "goodness" estimating
how much improvement might be expected. The optimiser
then tries to simultaneously maximise goodness and minimise
cost by selecting some sequence of passes.

I doubt there is a way to compute expected benefit that is
significantly cheaper than actually performing the optimization
itself. You also have to consider that passes interact with each other
in complicated and "poorly understood" ways so it's hard to
extrapolate one pass's estimated benefit to overall code improvement
in the end. I say "poorly understood" because as far as I know there
is no formal or scientific underpinning to the choice and ordering of
passes which are run as the "standard compile opts" (e.g. -O1, -O2,
-O3, etc.) besides very high-level, extremely rough heuristics.

Another way to do this would be to calculate a time limit on
a pass and abort it if it exceeded the limit.

No way. This would make the optimizations nondeterministic.

-- Sean Silva

Hi,

Here's a personal perspective on the issue:

Firstly, I don't think TeX does pre-estimation of work; it's just the
standard stepwise "refinement stuff". However, the idea itself isn't
necessarily a bad one.

In the academic community there's a set of people (John Cavazos, Louis-Noël
Pouchet, others I've forgotten) doing work on estimating which sets of
optimizations are most likely to produce a good speed-up using estimation
functions. (Note they're concerned primarily with the speed of the generated
code, and secondarily with the fact that there are just too many combinations
of optimizations to evaluate directly.) There are a couple of important
differences: they're using estimation functions found by machine learning on
"scientific-ish" code, which probably has more uniformity than completely
unrestricted programs. It's also based on the idea that there's just not the
manpower available to understand the complicated interactions between
components on the plethora of modern machines. From reading papers (and I've
not yet implemented enough of this stuff to test how well it really works) it
looks like a reasonable argument.

However, it's also potentially introducing highly non-predictable
behaviour (contrary to the principle of least surprise) into the compiler,
particularly since llvm can be used as a JIT. So I suspect it wouldn't be
something most people would want for mainstream clang/llvm, but it might be
interesting if someone were to implement some passes like this (probably
out-of-tree) and see if it proves beneficial.

Of course Sean's point that superlinear behaviour ought to be reported as
it's probably a bug is very true.

Regards,
Dave

Adding LLVMdev, since this is intimately related to the optimization passes.

I think this is roughly because some function level optimisations are
worse than O(N) in the number of instructions.

Please profile this and mail llvmdev regarding passes with
significantly superlinear behavior (e.g. O(n^2)). My understanding is
that these kinds of algorithmic problems are generally considered bugs
in the optimizer and will be fixed.

I'll try, but at the moment I'm having trouble building Clang/LLVM.
Bug report in Bugzilla. 10 hours of build time followed by failure
on the Release version isn't much fun (and it also uses a lot of
power -- everything here is solar/wind powered). The Debug+Asserts
version did build; that's what is taking so long. I thought perhaps
the assertions or lack of optimisation might slow things down,
so I tried to build a Release version.

I think I built Debug+Asserts using clang 3.1. However, once installed,
the Release version is being built by 3.3svn/Debug+Asserts.

You also have to consider that passes interact with each other
in complicated and "poorly understood" ways so it's hard to
extrapolate one pass's estimated benefit to overall code improvement
in the end

I agree. It's also hard to tell how good linear passes are compared
to recursive optimisations. However, one has to have some way
of choosing which optimisation passes to run, when to run them,
and how many times, so some judgement has to be made anyhow.
Making a numerical judgement and using a formula to choose by
weighting can't be worse (once suitable numbers are found).
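
For instance (passScore and its weights are invented for illustration;
suitable values would have to be found empirically), even a crude linear
weighting makes the trade-off explicit and tunable:

// Hypothetical: score a candidate pass from its estimated goodness and
// cost. The weights w_g and w_c are free parameters to be fitted against
// measured compile times and speed-ups.
double passScore(double goodness, double cost, double w_g, double w_c) {
    return w_g * goodness - w_c * cost;
}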

Another way to do this would be to calculate a time limit on
a pass and abort it if it exceeded the limit.

No way. This would make the optimizations nondeterministic.

Not if you set the time to infinity.

At present I have a test suite which imposes a timeout on each test.
This is in case a bug makes the test run forever. But it also kills
the test if the compile runs too long.

So I have non-deterministic behaviour already. Sometimes a test
runs and sometimes not (because part of the compilation is cached,
it may succeed on a second attempt).

Firstly, I don't think TeX does pre-estimation of work; it's just the
standard stepwise "refinement stuff".

I didn't mean that example to be taken so literally.
When TeX formats a paragraph, it has to decide
where to put line breaks. Breaking a line on a space
has less badness than hyphenating a word. If I recall
correctly, hyphenations can have different badnesses. The thing is
that stretching whitespace also has a badness dependent on
how much you stretch it.

So TeX actually considers several possible break points,
and calculates a badness for each one. But there's more.
The choice of a break point in a line affects the next line,
and the one after that, etc. TeX tries to solve for optimal layout
of the whole paragraph. That's why it produces such awesome
results.

However, it's also potentially introducing highly non-predictable
behaviour (contrary to the principle of least surprise) into the compiler,
particularly since llvm can be used as a JIT. So I suspect it wouldn't be
something most people would want for mainstream clang/llvm, but it might be
interesting if someone were to implement some passes like this (probably
out-of-tree) and see if it proves beneficial.

My problem is that I get quantum non-determinism :)
Either the compiler finishes and does a "reasonable job"
or it takes so long the whole thing gets killed.

Of course, I could re-run the compiler with a lower optimisation
switch in that case. I think I might try that.

Of course Sean's point that superlinear behaviour ought to be reported as
it's probably a bug is very true.

Isn't data flow analysis O(N^3)?
You cannot do proper alias analysis without it.

So TeX actually considers several possible break points,
and calculates a badness for each one. But there's more.
The choice of a break point in a line affects the next line,
and the one after that, etc. TeX tries to solve for optimal layout
of the whole paragraph. That's why it produces such awesome
results.

Yep; the project part of my MSc was on dynamic programming problems similar
to the TeX one. However, it doesn't actually first attempt to estimate how
much work a paragraph is likely to take to break, it just uses a clever
dynamic programming algorithm to build up a globally optimal set of breaks
regardless of work cost. Only if the badness is still too high does it try
to do extra breaks with hyphenations (and it was a long time ago I looked at
this, but I think once it's decided to hyphenate a given word it stops caring
about the whitespace badness of the break in favour of a "non-misleading"
hyphenation -- such as avoiding weeknights becoming wee-knights). Anyway,
this is becoming a digression.
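
For the curious, the core of that dynamic programme is tiny. Here's a toy
sketch in C++, where fixed word widths and a cubed-slack badness are crude
stand-ins for the real Knuth-Plass rules:

#include <limits>
#include <vector>

// Minimal total badness over all ways to break a run of words into lines.
// Badness of a line = cube of the leftover space (a stand-in for TeX's rule).
double bestBreaks(const std::vector<double> &w, double lineWidth) {
    const double INF = std::numeric_limits<double>::infinity();
    size_t n = w.size();
    std::vector<double> best(n + 1, INF);
    best[0] = 0.0;                            // empty prefix costs nothing
    for (size_t j = 1; j <= n; ++j) {         // best layout of first j words
        double len = 0.0;
        for (size_t i = j; i-- > 0; ) {       // try a break after word i
            len += w[i] + (len > 0 ? 1 : 0);  // word plus inter-word space
            if (len > lineWidth) break;       // line i..j-1 no longer fits
            double slack = lineWidth - len;
            double b = best[i] + slack * slack * slack;
            if (b < best[j]) best[j] = b;
        }
    }
    return best[n];                           // optimal badness, whole paragraph
}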

However, it's also potentially introducing highly non-predictable
behaviour (contrary to the principle of least surprise) into the compiler,
particularly since llvm can be used as a JIT. So I suspect it wouldn't be
something most people would want for mainstream clang/llvm, but it might be
interesting if someone were to implement some passes like this (probably
out-of-tree) and see if it proves beneficial.

My problem is that I get quantum non-determinism :)
Either the compiler finishes and does a "reasonable job"
or it takes so long the whole thing gets killed.

That really sounds like a bug (either software or hardware): whether it's
fast or slow, the compilation time (for the same options) should be
essentially the same for the same piece of code.

Regards,
Dave

{Veering a bit off-topic}

Adding LLVMdev, since this is intimately related to the optimization passes.

I think this is roughly because some function level optimisations are
worse than O(N) in the number of instructions.

Please profile this and mail llvmdev regarding passes with
significantly superlinear behavior (e.g. O(n^2)). My understanding is
that these kinds of algorithmic problems are generally considered bugs
in the optimizer and will be fixed.

I cannot do that yet; however, here's some more info:

Felix compile time: 5 seconds.
C++ compile and run times:

With -O3:

/usr/local/bin/clang++ -c -fno-common -fno-strict-aliasing -std=c++11 -fPIC ....
Elapsed: 106.058, Result code 0

/usr/local/bin/clang++ -fno-common -fno-strict-aliasing -std=c++11 -dynamiclib -O3 ....
Elapsed: 0.049, Result code 0

env DYLD_LIBRARY_PATH=build/release/lib/rtl:$DYLD_LIBRARY_PATH build/release/bin/flx_run '/Users/johnskaller/felix/test/regress/rt/tuple-02.dylib'
Elapsed: 0.007, Result code 0

With -O2:
Compile: Elapsed: 106.918, Result code 0
Link: Elapsed: 0.048, Result code 0
Run: Elapsed: 0.010, Result code 0

With -O1:
Compile: Elapsed: 13.664, Result code 0
Link: Elapsed: 0.042, Result code 0
Run: Elapsed: 0.007, Result code 0

So on this particular test case, -O3 and -O2 take the same time, but -O1 is 8 times faster.
Link and run times are roughly the same in all cases.

I made a modified program; this is Felix code but should be comprehensible.
The Felix compiler generates C++. This particular test checks printing and
comparison of tuples and arrays. What is relevant here is the "inline" keyword.
This tells Felix to inline the procedure, so the result is a single big C++ function.

/////////////////
var z = 1, (2.2, "xx");
println$ str z;
var z2 = 1, 2.2, "xx", 3, 4.3;
println$ z2;

var z3 = 1,23.3, z, z2, 99, "Jah";
println$ z3;
a := 1,2,3,4,5,6,7,8,9,10;

inline proc A () {
println$ 1,2;
println$ 1,2,3;
println$ 1,2,3,4;
println$ 1,2,3,4,5;
println$ (a, a) :>> (int ^ 20);
}
A;

inline proc B () {
println$ (1,"x") == (1,"x");
println$ (1,"x",4.2) == (1,"x",4.2);
println$ (1,"x",4.2,"x") == (1,"x",4.2,"y");
x1 := (1,"x",4.2,"x");
x2 := (1,"x",4.2,"y");
println$ x1 == x2;
}
B;

inline proc C() {
println$ "-" * 20;
y1 := (1,"x",4.2,"x",55,"hello",4.222);
y2 := (1,"x",4.2,"y",55,"hello",4.222);
println$ y1 == y2; // false
println$ y1 != y2; // true
println$ y1 < y2; // true
println$ y1 <= y2; // true
println$ y1 > y2; // false
println$ y1 >= y2; // false
}
C;

println$ a == a;

b:= 1,2;
println$ b == b;

inline proc D() {
println$ (1,2) == (1,2);
println$ (1,2,3) == (1,2,3);
println$ (1,2,3,4) == (1,2,3,4);
println$ (1,2,3,4,5) == (1,2,3,4,5);
println$ (1,2,3,4,5,6) == (1,2,3,4,5,6);

println$ ("1",2,3,4,5,6) == ("1",2,3,4,5,6);
}
D;
/////////////////

The elapsed time for the compile on this variant with -O2 is 120 seconds.

Now, watch what happens if I change "inline" to "noinline" which makes
each procedure A,B,C,D a separate C++ function:

Elapsed time: 47 seconds.

With -O1:

inline version: 14 seconds.
noinline version: 11 seconds

So the -O1 compilation takes a bit longer with a single big function.
The -O2 compilation is almost a decimal order of magnitude slower
with the inline version, but more than twice as fast when the function
is split up into 4 pieces (plus some other stuff).

The actual C++ here:

    1729 5498 93095 /Users/johnskaller/.felix/cache/text/Users/johnskaller/felix/tup1.cpp
     753 2246 17518 /Users/johnskaller/.felix/cache/text/Users/johnskaller/felix/tup1.hpp

plus library files. Felix optimises the use of headers so only required library
headers are actually used.

100 seconds for 2000 lines is only 20 lines a second.
At that speed a 2 million line program would take 100K seconds to compile
(about one day).

Actually, Clang + LLVM took over 10 hours (using Clang as the compiler).

OK .. so I split it up again like this:

/////////////////////////////
noinline proc A1 () {
println$ 1,2;
}
noinline proc A2() {
println$ 1,2,3;
}

noinline proc A3() {
println$ 1,2,3,4;
}
noinline proc A4() {
println$ 1,2,3,4,5;
}
noinline proc A5() {
println$ (a, a) :>> (int ^ 20);
}
A1; A2; A3; A4; A5;
......
////////////////////////

Now with -O1 the compilation time is the same: 11 seconds.
But with -O2 we're down to 30 seconds ( -O3 is the same).

The thing to note here is that at a coarse grain, the program
contains no branches. It's straight line code. Of course there
may be local branches at a lower level.

Roughly: the optimiser isn't smart enough to recognise sequential
computations. It should be able to partition the control flow graph
of a function into a sequence of "super blocks" if you like, which
can be optimised independently and the results merged.
This would reduce the cost of a non-linear algorithm significantly.
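
Very roughly, as a toy sketch (Block is an invented type, only forward
branches are handled -- this is nothing like LLVM's real IR): scan the
block list for points that no branch jumps across, and cut there.

#include <vector>

struct Block {                       // hypothetical CFG node
    std::vector<int> successors;     // indices of blocks this one branches to
};

// Split a function's block list at points no branch crosses. Each piece
// is then single-entry/single-exit and can be optimised independently,
// with the results concatenated afterwards. (Backward branches would need
// extra care; the straight-line case described above has none.)
std::vector<std::vector<Block>>
partition(const std::vector<Block> &blocks) {
    std::vector<std::vector<Block>> pieces;
    std::vector<Block> current;
    int maxTarget = -1;              // furthest forward branch target seen
    for (size_t i = 0; i < blocks.size(); ++i) {
        current.push_back(blocks[i]);
        for (int s : blocks[i].successors)
            if (s > maxTarget) maxTarget = s;
        if (maxTarget <= (int)i + 1) {   // nothing jumps over this point
            pieces.push_back(current);
            current.clear();
        }
    }
    if (!current.empty()) pieces.push_back(current);
    return pieces;
}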

I have no idea how to do this at the moment. However, in the bug-finding
code I worked on, there was a natural way to "clump" the control flow graph.
The graph was built directly on the C AST, so in general each assignment

  x = expr

would be a clump. Inside the expr there is control flow, but it is usually
entirely linear. So when doing path analysis, you could basically look
at the C level control flow, and disregard the flow in the expression.
Most of the time, data flow followed because sub-expressions only
assigned to temporaries (so these variables are created and destroyed
within the statement and can usually be ignored).

Obviously LLVM cannot do that (the higher level stuff has been compiled
away) but it could use heuristics to search for the patterns that would
be left behind. I do that in my compiler with some success.
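
As a toy illustration of that clumping idea (Stmt is invented, and this
is nothing like LLVM's actual representation): keep only variables whose
references cross a statement boundary; purely intra-statement temporaries
drop out of the inter-clump data-flow problem entirely.

#include <set>
#include <string>
#include <vector>

struct Stmt {                          // hypothetical: one C-level statement
    std::set<std::string> defs, uses;  // variables written / read by it
};

// Variables worth tracking between clumps: anything referenced by more
// than one statement. Everything else lives and dies inside a single
// statement and can be ignored by inter-statement analysis.
std::set<std::string> interStatementVars(const std::vector<Stmt> &stmts) {
    std::set<std::string> result;
    for (size_t i = 0; i < stmts.size(); ++i)
        for (const std::string &v : stmts[i].defs)
            for (size_t j = 0; j < stmts.size(); ++j)
                if (j != i && (stmts[j].uses.count(v) || stmts[j].defs.count(v)))
                    result.insert(v);
    return result;
}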

[snip -- previous message quoted in full]
The actual C++ here:

   1729 5498 93095 /Users/johnskaller/.felix/cache/text/Users/johnskaller/felix/tup1.cpp
    753 2246 17518 /Users/johnskaller/.felix/cache/text/Users/johnskaller/felix/tup1.hpp

How about providing a preprocessed version of those files as a test case? I don't think there are many people on this list that speak Felix fluently and/or are willing to install your compiler. You may have found a pathological case in the optimizer that can be fixed by putting in a clever heuristic, but we'll never know without a test case.

You can also pass -ftime-report to clang to get a breakdown on where the compile time goes.

- Ben

Here we go:

/usr/local/bin/clang++ -c -fno-common -fno-strict-aliasing -std=c++11 -fPIC -Wall -Wfatal-errors -Wno-invalid-offsetof -Wno-logical-op-parentheses -Wno-bitwise-op-parentheses -Wno-parentheses-equality -Wno-return-stack-address -Wno-tautological-compare -Wno-return-type-c-linkage -Wno-unused-variable -fPIC -O2 -ftime-report -DTARGET_BUILD '-Ibuild/release/lib/rtl' '-Ibuild/release/config/target' -Ilib/rtl '/Users/johnskaller/.felix/cache/text/Users/johnskaller/felix/test/regress/rt/tuple-02.cpp' -o '/Users/johnskaller/.felix/cache/text/Users/johnskaller/felix/test/regress/rt/tuple-02.os'

I have attached the pre-processed file. Hope sending it to the list
is not the wrong thing to do.

tmp.cpp (946 KB)