Impressive performance result for LLVM: complex arithmetic

Following a recent discussion about numerical performance on comp.lang.functional,
I tried a simple C Mandelbrot benchmark that uses C99's complex arithmetic,
compiled with gcc and llvm-gcc, on a 2.1GHz Opteron 2352:

gcc: 5.727s
llvm-gcc: 1.393s

There is still about 20% room for improvement, but LLVM is more than 4x faster than gcc here.

Here's the code:

#include <stdio.h>
#include <stdlib.h>
#include <complex.h>

int max_i = 65536;

double sqr(double x) { return x*x; }

double cnorm2(complex z) { return sqr(creal(z)) + sqr(cimag(z)); }

int loop(complex c) {
    complex z=c;
    int i=1;
    while (cnorm2(z) <= 4.0 && i++ < max_i)
        z = z*z + c;
    return i;
}

int main() {
    for (int j = -39; j < 39; ++j) {
        for (int i = -39; i < 39; ++i)
            printf(loop(j/40.0-0.5 + i/40.0*I) > max_i ? "*" : " ");
    }
    return 0;
}

On gcc's side, this is a simple missed optimization in builtin lowering.
As a result, the gcc code ends up with a call to __muldc3 (the libgcc routine
that multiplies two double-precision complex numbers) and the llvm code doesn't.
This should be quick to fix in GCC, and with that fixed there is no
appreciable performance difference between the two.
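
To make the cost of that call concrete, here is a rough sketch of the multiply
written out by hand, assuming all values stay finite. The names mul_naive and
loop_naive are made up for illustration and are not part of the benchmark. The
library routine also has to recover sensible results when NaNs or infinities
show up, and this version deliberately skips that, so it is not a drop-in
replacement for z*z in general.

```c
#include <complex.h>

/* Hypothetical helper (not from the thread): the textbook product
   (a+bi)(c+di) = (ac-bd) + (ad+bc)i, with no NaN/infinity recovery. */
double complex mul_naive(double complex x, double complex y) {
    double a = creal(x), b = cimag(x);
    double c = creal(y), d = cimag(y);
    return (a*c - b*d) + (a*d + b*c)*I;
}

/* Same escape-time loop as the benchmark, but with the product spelled
   out and max_i passed as a parameter (an assumption made here so the
   function stands alone). */
int loop_naive(double complex c, int max_i) {
    double complex z = c;
    int i = 1;
    while (creal(z)*creal(z) + cimag(z)*cimag(z) <= 4.0 && i++ < max_i)
        z = mul_naive(z, z) + c;
    return i;
}
```

For points that never escape, loop_naive returns max_i + 1, which is the case
the benchmark's `> max_i` test prints as "*".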

FYI, gcc 4.3.3 gets the same performance as llvm-gcc with -O3. I reproduced Jon's gcc
results on 4.3.3 with -O0.