Improving performance with optimization passes

I'm toying with benchmarks on my HLVM and am unable to get any performance
improvement from optimization passes. Moreover, some of my programs generate
a lot of redundant code (e.g. alloca a struct, store a struct into it and
read only one field without using the rest of the struct) and this does not
appear to be optimized away.

I simply copied the use of PassManager from the Kaleidoscope tutorial:

  let pm = PassManager.create_function mp in
  TargetData.add (ExecutionEngine.target_data ee) pm;

  (* Do simple constant propagation. *)
  add_constant_propagation pm;

  (* Do simple "peephole" optimizations and bit-twiddling optzn. *)
  add_instruction_combining pm;

  (* Reassociate expressions. *)
  add_reassociation pm;

  (* Eliminate Common SubExpressions. *)
  add_gvn pm;

  (* Simplify the control flow graph (deleting unreachable blocks, etc). *)
  add_cfg_simplification pm;

  (* Promote allocas to SSA registers. *)
  add_memory_to_register_promotion pm;

and then I apply "PassManager.run_function" to every function after it is
validated.

Any idea what I might be doing wrong? Has anyone else got this functionality
giving performance boosts from OCaml?

Hi Jon,

> I'm toying with benchmarks on my HLVM and am unable to get any performance
> improvement from optimization passes. I simply copied the use of PassManager
> from the Kaleidoscope tutorial:
>
> Any idea what I might be doing wrong? Has anyone else got this functionality
> giving performance boosts from OCaml?

That's a pretty barren optimization pipeline.

http://llvm.org/docs/Passes.html

See opt --help and look into what -std-compile-opts is. That's the usual
starting point for new front ends, although it's a whole-module pass pipeline.
But it's only a starting point, since it's tuned for llvm-gcc's codegen and
yours will probably differ.

> Moreover, some of my programs generate a lot of redundant code (e.g. alloca
> a struct, store a struct into it and read only one field without using the
> rest of the struct) and this does not appear to be optimized away.

I think first-class aggregates are mostly used for passing arguments in
llvm-gcc and clang; maybe mem2reg can't unravel the loads. Have you tried
emitting loads and stores of the scalar elements to see if mem2reg can
eliminate the allocas then?

— Gordon

P.S. This is not a trivial problem domain. Here's an interesting paper on the
subject.

COLE: Compiler Optimization Level Exploration
http://users.elis.ugent.be/~leeckhou/papers/cgo08.pdf

I just disassembled some of the IR before and after optimization. This example
function squares a complex number:

  let zsqr(r, i) = (r*r - i*i, 2*r*i)

My compiler is generating:

define fastcc i32 @zsqr({ double, double }*, { double, double }) {
entry:
  %2 = alloca { double, double } ; <{ double, double }*> [#uses=2]
  %3 = getelementptr { double, double }* %2, i32 0 ; <{ double, double }*> [#uses=1]
  store { double, double } %1, { double, double }* %3
  %4 = getelementptr { double, double }* %2, i32 0, i32 0 ; <double*> [#uses=1]
  %5 = load double* %4 ; <double> [#uses=1]
  %6 = alloca { double, double } ; <{ double, double }*> [#uses=2]
  %7 = getelementptr { double, double }* %6, i32 0 ; <{ double, double }*> [#uses=1]
  store { double, double } %1, { double, double }* %7
  %8 = getelementptr { double, double }* %6, i32 0, i32 0 ; <double*> [#uses=1]
  %9 = load double* %8 ; <double> [#uses=1]
  %10 = mul double %5, %9 ; <double> [#uses=1]
  %11 = alloca { double, double } ; <{ double, double }*> [#uses=2]
  %12 = getelementptr { double, double }* %11, i32 0 ; <{ double, double }*> [#uses=1]
  store { double, double } %1, { double, double }* %12
  %13 = getelementptr { double, double }* %11, i32 0, i32 1 ; <double*> [#uses=1]
  %14 = load double* %13 ; <double> [#uses=1]
  %15 = alloca { double, double } ; <{ double, double }*> [#uses=2]
  %16 = getelementptr { double, double }* %15, i32 0 ; <{ double, double }*> [#uses=1]
  store { double, double } %1, { double, double }* %16
  %17 = getelementptr { double, double }* %15, i32 0, i32 1 ; <double*> [#uses=1]
  %18 = load double* %17 ; <double> [#uses=1]
  %19 = mul double %14, %18 ; <double> [#uses=1]
  %20 = sub double %10, %19 ; <double> [#uses=1]
  %21 = alloca { double, double } ; <{ double, double }*> [#uses=2]
  %22 = getelementptr { double, double }* %21, i32 0 ; <{ double, double }*> [#uses=1]
  store { double, double } %1, { double, double }* %22
  %23 = getelementptr { double, double }* %21, i32 0, i32 0 ; <double*> [#uses=1]
  %24 = load double* %23 ; <double> [#uses=1]
  %25 = mul double 2.000000e+00, %24 ; <double> [#uses=1]
  %26 = alloca { double, double } ; <{ double, double }*> [#uses=2]
  %27 = getelementptr { double, double }* %26, i32 0 ; <{ double, double }*> [#uses=1]
  store { double, double } %1, { double, double }* %27
  %28 = getelementptr { double, double }* %26, i32 0, i32 1 ; <double*> [#uses=1]
  %29 = load double* %28 ; <double> [#uses=1]
  %30 = mul double %25, %29 ; <double> [#uses=1]
  %31 = alloca { double, double } ; <{ double, double }*> [#uses=3]
  %32 = getelementptr { double, double }* %31, i32 0, i32 0 ; <double*> [#uses=1]
  store double %20, double* %32
  %33 = getelementptr { double, double }* %31, i32 0, i32 1 ; <double*> [#uses=1]
  store double %30, double* %33
  %34 = getelementptr { double, double }* %31, i32 0 ; <{ double, double }*> [#uses=1]
  %35 = load { double, double }* %34 ; <{ double, double }> [#uses=1]
  %36 = getelementptr { double, double }* %0, i32 0 ; <{ double, double }*> [#uses=1]
  store { double, double } %35, { double, double }* %36
  ret i32 0
}

But those LLVM optimization passes only reduce it to:

define fastcc i32 @zsqr({ double, double }*, { double, double }) {
entry:
  %2 = alloca { double, double } ; <{ double, double }*> [#uses=2]
  store { double, double } %1, { double, double }* %2, align 8
  %3 = getelementptr { double, double }* %2, i32 0, i32 0 ; <double*> [#uses=1]
  %4 = load double* %3, align 8 ; <double> [#uses=1]
  %5 = alloca { double, double } ; <{ double, double }*> [#uses=2]
  store { double, double } %1, { double, double }* %5, align 8
  %6 = getelementptr { double, double }* %5, i32 0, i32 0 ; <double*> [#uses=1]
  %7 = load double* %6, align 8 ; <double> [#uses=1]
  %8 = mul double %4, %7 ; <double> [#uses=1]
  %9 = alloca { double, double } ; <{ double, double }*> [#uses=2]
  store { double, double } %1, { double, double }* %9, align 8
  %10 = getelementptr { double, double }* %9, i32 0, i32 1 ; <double*> [#uses=1]
  %11 = load double* %10, align 8 ; <double> [#uses=1]
  %12 = alloca { double, double } ; <{ double, double }*> [#uses=2]
  store { double, double } %1, { double, double }* %12, align 8
  %13 = getelementptr { double, double }* %12, i32 0, i32 1 ; <double*> [#uses=1]
  %14 = load double* %13, align 8 ; <double> [#uses=1]
  %15 = mul double %11, %14 ; <double> [#uses=1]
  %16 = sub double %8, %15 ; <double> [#uses=1]
  %17 = alloca { double, double } ; <{ double, double }*> [#uses=2]
  store { double, double } %1, { double, double }* %17, align 8
  %18 = getelementptr { double, double }* %17, i32 0, i32 0 ; <double*> [#uses=1]
  %19 = load double* %18, align 8 ; <double> [#uses=1]
  %20 = mul double %19, 2.000000e+00 ; <double> [#uses=1]
  %21 = alloca { double, double } ; <{ double, double }*> [#uses=2]
  store { double, double } %1, { double, double }* %21, align 8
  %22 = getelementptr { double, double }* %21, i32 0, i32 1 ; <double*> [#uses=1]
  %23 = load double* %22, align 8 ; <double> [#uses=1]
  %24 = mul double %20, %23 ; <double> [#uses=1]
  %25 = alloca { double, double } ; <{ double, double }*> [#uses=3]
  %26 = getelementptr { double, double }* %25, i32 0, i32 0 ; <double*> [#uses=1]
  store double %16, double* %26, align 8
  %27 = getelementptr { double, double }* %25, i32 0, i32 1 ; <double*> [#uses=1]
  store double %24, double* %27, align 8
  %28 = load { double, double }* %25, align 8 ; <{ double, double }> [#uses=1]
  store { double, double } %28, { double, double }* %0
  ret i32 0
}

So the optimization passes are at least doing something, but they are a long
way from generating optimal code. Does LLVM have any optimization passes that
would promote these structs out of the stack and replace the loads with
extractvalue instructions?

The ideal result is probably:

define fastcc i32 @zsqr({ double, double }*, { double, double }) {
entry:
  %2 = extractvalue { double, double } %1, 0
  %3 = extractvalue { double, double } %1, 1
  %4 = mul double %2, %2
  %5 = mul double %3, %3
  %6 = sub double %4, %5
  %7 = getelementptr { double, double }* %0, i32 0, i32 0
  store double %6, double* %7, align 8
  %8 = mul double %2, 2.000000e+00
  %9 = mul double %8, %3
  %10 = getelementptr { double, double }* %0, i32 0, i32 1
  store double %9, double* %10, align 8
  ret i32 0
}

To add to what Gordon said, the SROA pass, aka -scalarrepl, aka
scalar replacement of aggregates, is the main pass which splits
struct allocas into fields that the rest of the optimizer can
work with.

Dan
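
For reference, SROA can be added from OCaml alongside the passes in the original message. A minimal sketch, assuming the Llvm_scalar_opts module from the LLVM OCaml bindings exposes add_scalar_repl_aggregation (function names may differ between LLVM releases):

```ocaml
(* Sketch: extend the function pass pipeline with SROA before mem2reg.
   Assumes Llvm_scalar_opts from the LLVM OCaml bindings. *)
open Llvm_scalar_opts

let add_struct_passes pm =
  (* Split struct allocas into per-field scalar allocas (-scalarrepl). *)
  add_scalar_repl_aggregation pm;
  (* Then promote the resulting scalar allocas to SSA registers (-mem2reg). *)
  add_memory_to_register_promotion pm
```

Running these two before instcombine/GVN gives the rest of the pipeline scalar SSA values to work with.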

> Hi Jon,
>
>> I'm toying with benchmarks on my HLVM and am unable to get any
>> performance improvement from optimization passes. I simply copied
>> the use of PassManager from the Kaleidoscope tutorial:
>>
>> Any idea what I might be doing wrong? Has anyone else got this
>> functionality giving performance boosts from OCaml?
>
> That's a pretty barren optimization pipeline.

Right, but am I correct in believing that what I have done is pretty much all
that you can do from OCaml right now?

> http://llvm.org/docs/Passes.html
>
> See opt --help and look into what -std-compile-opts is. That's the
> usual starting point for new front ends, although it's a whole-module
> pass pipeline. But it's only a starting point, since it's tuned for
> llvm-gcc's codegen and yours will probably differ.

Thanks.

>> Moreover, some of my programs generate a lot of redundant code (e.g.
>> alloca a struct, store a struct into it and read only one field
>> without using the rest of the struct) and this does not appear to be
>> optimized away.
>
> I think first-class aggregates are mostly used for passing arguments
> in llvm-gcc and clang; maybe mem2reg can't unravel the loads. Have
> you tried emitting loads and stores of the scalar elements to see if
> mem2reg can eliminate the allocas then?

If I could do that I wouldn't be in this mess! ;-)

I am only generating these temporary structs on the stack because I cannot
load and store struct elements any other way: the OCaml bindings do not yet
have insertvalue and extractvalue.

> P.S. This is not a trivial problem domain. Here's an interesting paper
> on the subject.
>
> COLE: Compiler Optimization Level Exploration
> http://users.elis.ugent.be/~leeckhou/papers/cgo08.pdf

I'll check it out, thanks.

>> I'm toying with benchmarks on my HLVM and am unable to get any performance
>> improvement from optimization passes...
>>
>> I just disassembled some of the IR before and after optimization. This
>> example function squares a complex number:

Something is definitely wrong with the way you're using optimization passes.

>> The ideal result is probably:

It is indeed so:

./opt -std-compile-opts test.bc | ./llvm-dis
; ModuleID = '<stdin>'

define fastcc i32 @zsqr({ double, double }* nocapture, { double, double }) nounwind {
entry:
  %2 = extractvalue { double, double } %1, 0 ; <double> [#uses=1]
  %3 = extractvalue { double, double } %1, 0 ; <double> [#uses=1]
  %4 = mul double %2, %3 ; <double> [#uses=1]
  %5 = extractvalue { double, double } %1, 1 ; <double> [#uses=1]
  %6 = extractvalue { double, double } %1, 1 ; <double> [#uses=1]
  %7 = mul double %5, %6 ; <double> [#uses=1]
  %8 = sub double %4, %7 ; <double> [#uses=1]
  %9 = extractvalue { double, double } %1, 0 ; <double> [#uses=1]
  %10 = mul double %9, 2.000000e+00 ; <double> [#uses=1]
  %11 = extractvalue { double, double } %1, 1 ; <double> [#uses=1]
  %12 = mul double %10, %11 ; <double> [#uses=1]
  %insert = insertvalue { double, double } undef, double %8, 0 ; <{ double, double }> [#uses=1]
  %insert2 = insertvalue { double, double } %insert, double %12, 1 ; <{ double, double }> [#uses=1]
  store { double, double } %insert2, { double, double }* %0
  ret i32 0
}

Path of least resistance would definitely be to add some new bindings.

— Gordon
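
For context, bindings for these two instructions would mirror the C API's LLVMBuildExtractValue and LLVMBuildInsertValue. A hypothetical interface sketch, following the style of the existing build_* functions in llvm.mli (the exact signatures here are an assumption, not the shipped API):

```ocaml
(* Hypothetical additions to the Llvm module's instruction builders.
   build_extractvalue agg idx name b     -- reads field idx of aggregate agg.
   build_insertvalue agg v idx name b    -- returns agg with field idx set to v. *)
val build_extractvalue :
  llvalue -> int -> string -> llbuilder -> llvalue
val build_insertvalue :
  llvalue -> llvalue -> int -> string -> llbuilder -> llvalue
```

With these, the compiler could read and write struct fields directly on SSA values instead of bouncing every aggregate through an alloca.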

Jon, make sure you always emit allocas into the entry block of the function. Otherwise mem2reg and SROA won't hack on them.

-Chris
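
Chris's point is the idiom the Kaleidoscope tutorial uses: build every alloca with a builder positioned at the top of the function's entry block, not at the current insertion point. A sketch against the Llvm OCaml module (newer bindings thread an llcontext through builder_at; adjust for your release):

```ocaml
(* Create an alloca in the entry block of the_function, so that
   mem2reg and SROA will consider promoting it. Assumes the Llvm module. *)
let create_entry_block_alloca the_function var_name ty =
  let builder =
    Llvm.builder_at (Llvm.global_context ())
      (Llvm.instr_begin (Llvm.entry_block the_function))
  in
  Llvm.build_alloca ty var_name builder
```
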

Anton Korobeynikov wrote:

>>> I'm toying with benchmarks on my HLVM and am unable to get any performance
>>> improvement from optimization passes...
>>>
>>> I just disassembled some of the IR before and after optimization. This
>>> example function squares a complex number:
>>
>> Something is definitely wrong with the way you're using optimization passes.
>>
>>> The ideal result is probably:
>>
>> It is indeed so:
>>
>> ./opt -std-compile-opts test.bc | ./llvm-dis
>> ; ModuleID = '<stdin>'
>>
>> define fastcc i32 @zsqr({ double, double }* nocapture, { double, double }) nounwind {
>> entry:
>>   %2 = extractvalue { double, double } %1, 0 ; <double> [#uses=1]
>>   %3 = extractvalue { double, double } %1, 0 ; <double> [#uses=1]

I've filed llvm.org/PR3623 to track the fact that these aren't merged.

Nick