Hoisting elements of array argument into registers

Hi,

The Glasgow Haskell Compiler LLVM backend is generating a lot of code which
appears to be optimised by LLVM quite poorly. The problem is demonstrated by
this C source code:

#include <stdio.h>

int wf(int *sp) {
    if (sp[0] == 0) {
        return sp[2] + sp[3];
    } else {
        sp[3] = sp[3] + (sp[1] * 5);
        sp[2] = (sp[2] + sp[0]) + 1;
        sp[1] = sp[1] - 1;
        sp[0] = sp[0] - 1;
        return wf(sp);
    }
}

int g(int a) {
    int sp[] = { a, a, 0, 0 };
    return wf(sp);
}

int main(void) {
    printf("%d\n", g(5));
    return 0;
}

As you can see, wf takes its arguments via a pointer. Nonetheless, we would
like to do loop optimisations on wf -- in particular, we would like to eliminate
the duplicated computations on sp[0] and sp[1] (GHC's own optimiser does not do
any classical loop optimisations like this).

Unfortunately, LLVM (opt -O3: I tried v2.7 and HEAD) doesn't seem to get very
far. It eliminates the tail call for wf() correctly, but the arguments are
carried between each iteration of the loop by storing into the sp array. It
would be better if the arguments were kept in registers during the loop and
only stored into the array upon exit. (These stores could then be eliminated
easily since there is no later load from sp).

If I manually rewrite the code to replace the array argument with its
constituent scalars, as follows, LLVM obviously does the right thing:

#include <stdio.h>

int wf(int a, int b, int c, int d) {
    if (a == 0) {
        return c + d;
    } else {
        return wf(a - 1, b - 1, (c + a) + 1, d + (b * 5));
    }
}

int g(int a) {
    return wf(a, a, 0, 0);
}

int main(void) {
    printf("%d\n", g(5));
    return 0;
}

The question is, is there any way to get LLVM to do this rewriting for us? It
seems like this would be a generally useful optimisation.

I guess you could also achieve this by sinking the stores to sp into the next
loop iteration/the loop exit, where they would meet the corresponding loads and
so good things would happen. The current instruction sinking pass doesn't seem
to get this case, though.

Thanks in advance for any help!
Max

p.s. Our real code is a little more complicated because sp is really a pointer
to the top of the Haskell runtime's stack (which is of unknown length), not a
4-element array, which might make the optimisation a bit harder. Perhaps LLVM
would have to promote just those elements of the input array that are
accessed in the loop into scalars.

Hi Max,

> The Glasgow Haskell Compiler LLVM backend is generating a lot of code which
> appears to be optimised by LLVM quite poorly. The problem is demonstrated by
> this C source code:

if I run this at -O3 under llvm-gcc from top-of-tree on x86-64 linux then
(1) it computes that g(5) is equal to 95, and main is transformed to just print
95.
(2) g is transformed into code with no loop, i.e. the result is computed
directly from the value of "a" in straight-line code (there is a special
case for when a is zero).
(3) wf itself is not transformed as well: values are loaded from memory and
stored back on each loop iteration. And there is still a loop.

I see the same with clang. I'm not sure why the optimizers do so much better
when they can see that sp is a local array (the special initial values don't
matter).

Ciao,

Duncan.

> I see the same with clang. I'm not sure why the optimizers do so much better
> when they can see that sp is a local array (the special initial values don't
> matter).

It is the scalar replacement of aggregates pass that puts everything into
registers when sp is a local array. What happens is: the tail recursion in
wf is eliminated. wf is inlined into g. scalarrepl turns the (local) array
accesses in g into registers. Once everything is in registers, later passes
clean everything up.

Ciao,

Duncan.

Duncan Sands <baldrick <at> free.fr> writes:

> I see the same with clang. I'm not sure why the optimizers do so much better
> when they can see that sp is a local array (the special initial values don't
> matter).

> It is the scalar replacement of aggregates pass that puts everything into
> registers when sp is a local array.

Yes, I had hoped that scalar replacement would get the array case. What
surprised me is that it didn't. However, I can reproduce your result (good
optimisation for g()) with LLVM HEAD. My earlier tests used "opt -O2", but
once I tried again with (HEAD) Clang -O3 I got an optimised g().

However, scalar replacement can't help with functions like a (non-inlined) wf()
where the structure of sp is unknown. Is there any hope for LLVM optimising
such functions? Some combination of passes that will do what I want? This
problem is essentially killing all opportunities for loop optimisation in
Haskell right now, so we would dearly like to have a solution :-)

Cheers,
Max

I am seeing the wf loop get optimized just fine with llvm 2.8 (and almost as well with HEAD). I'm running on Mac OS X 10.6 with an Apple-supplied llvm-gcc and a self-compiled llvm 2.8. When I run

    $ llvm-gcc -emit-llvm -S M.c
    $ opt -O2 M.s | llvm-dis

I see that:

1. Tail recursion has been eliminated from wf
2. The accesses to sp have been promoted to registers
3. The loop has been translated to straight-line code that computes the result directly based on the induction variables.

I am surprised that others are not seeing the same thing. If I download the llvm-gcc binary from the LLVM download page and run

    $ llvm-gcc -O2 -emit-llvm -S

I see the same results as running `opt -O2` by hand. However, if I use the Apple-supplied llvm-gcc and run `llvm-gcc -O2`, I see that the loop does not get optimized. I am assuming this is because the Apple-supplied llvm-gcc links against an older version of the LLVM optimizations.

If I optimize with LLVM HEAD I see basically the same results, except that it leaves in one redundant load. I'm curious why others are seeing such different results.

I don't see anything in wf that should prevent the sp accesses from being promoted to registers. I would think they should be promoted to registers, but we will still have to write the results back to sp before returning from wf because in general we can't prove that sp is not accessed after returning from wf.

-David

David Peixotto <dmp <at> rice.edu> writes:

> I am seeing the wf loop get optimized just fine with llvm 2.8 (and almost
> as good with head).

I rechecked this and I am actually seeing the same results as you. I think I
must have made a stupid mistake in my tests before - sorry for the noise.

However, I found that we have a phase ordering problem which is preventing us
getting as much optimisation from LLVM as we might expect. For example, I get
this code for g(int):

"""
define i32 @g(i32 %a) nounwind readnone ssp {
entry:
  %sp = alloca [4 x i32], align 4
  %0 = getelementptr inbounds [4 x i32]* %sp, i64 0, i64 0
  store i32 %a, i32* %0, align 4
  %1 = getelementptr inbounds [4 x i32]* %sp, i64 0, i64 1
  store i32 %a, i32* %1, align 4
  %2 = getelementptr inbounds [4 x i32]* %sp, i64 0, i64 2
  store i32 0, i32* %2, align 4
  %3 = getelementptr inbounds [4 x i32]* %sp, i64 0, i64 3
  store i32 0, i32* %3, align 4
  %4 = icmp eq i32 %a, 0
  br i1 %4, label %wf.exit, label %bb.nph.i

bb.nph.i: ; preds = %entry
  %.promoted1.i = load i32* %1, align 4
  %tmp12.i = add i32 %a, -1
  %tmp13.i = zext i32 %tmp12.i to i33
  %tmp14.i = add i32 %a, -2
  %tmp15.i = zext i32 %tmp14.i to i33
  %tmp16.i = mul i33 %tmp13.i, %tmp15.i
  %tmp17.i = lshr i33 %tmp16.i, 1
  %tmp18.i = trunc i33 %tmp17.i to i32
  %tmp20.i = mul i32 %.promoted1.i, 5
  %tmp21.i = add i32 %tmp20.i, -5
  %tmp22.i = mul i32 %tmp21.i, %tmp12.i
  %tmp9.i = mul i32 %a, %a
  %.promoted2.i = load i32* %2, align 4
  %tmp25.i = mul i32 %tmp18.i, 5
  %tmp.i = sub i32 %.promoted1.i, %a
  %tmp10.i = add i32 %tmp9.i, 1
  %tmp11.i = sub i32 %tmp10.i, %tmp18.i
  %tmp19.i = add i32 %tmp11.i, %.promoted2.i
  %tmp23.i = sub i32 %tmp20.i, %tmp25.i
  %tmp26.i = add i32 %tmp23.i, %tmp22.i
  store i32 0, i32* %0, align 4
  store i32 %tmp19.i, i32* %2, align 4
  store i32 %tmp.i, i32* %1, align 4
  store i32 %tmp26.i, i32* %3, align 4
  br label %wf.exit

wf.exit: ; preds = %entry, %bb.nph.i
  %5 = phi i32 [ %tmp26.i, %bb.nph.i ], [ 0, %entry ]
  %6 = load i32* %2, align 4
  %7 = add nsw i32 %6, %5
  ret i32 %7
}
"""

As you can see, we have "load i32* %1, align 4" which is dominated by
"store i32 %a, i32* %1, align 4", and similarly for the load/store to %2.
What is more, it is clear that no intervening store aliases with that
store, because all the other stores are to different pointers - and indeed
LLVM can even see that the memory they store to is alloca'd.

In fact there are at least 3 missed opportunities:
* No forwarding of the stores to loads: if we did this, we could spot that
%tmp.i is just 0, and %tmp19.i is just the same as %tmp11.i
* The stores to %2 could be forwarded to the final "load i32* %2, align 4"
by introducing a new phi node for the wf.exit block
* All of the stores to %sp+N could then be eliminated, since the memory is
alloca'd and there are no loads from any of it before the function returns

However, it turns out that if you just run "opt -O3" twice then all of this
stuff gets eliminated and you get truly optimal code - pretty awesome!

Cheers,
Max

> However, it turns out that if you just run "opt -O3" twice then all of this
> stuff gets eliminated and you get truly optimal code - pretty awesome!

Yes, if I pipe the output from "dragonegg -O3" into "opt -O3" then the loop
is eliminated in "wf". It would be interesting to find out why, and improve
the current list of passes.

Ciao,

Duncan.