Performance regression on ARM

Folks,

First win of the benchmark buildbot!

http://llvm.org/perf/db_default/v4/nts/graph?plot.0=49.128.2&highlight_run=31861

It seems mandel-2 had a huge regression a few commits ago and, based
on a quick look, it may have to do with the InstCombine changes. I
haven't investigated yet, but this is the first time I've spotted a
regression on the test-suite, so I'd first like to congratulate the
people who worked on making that a reality!

Now, I'd like to have been warned of that automatically, by email,
based on some simple heuristics... But that's a topic for the
Performance/Benchmarking BoF in a few weeks... :)

cheers,
--renato

From: "Renato Golin" <renato.golin@linaro.org>
To: "LLVM Dev" <llvmdev@cs.uiuc.edu>
Sent: Thursday, October 16, 2014 2:47:11 AM
Subject: [LLVMdev] Performance regression on ARM


Interesting. Looks like the problem is in (219545, 219569].

Based on the commit logs, it does seem like the best candidates are:
  r219566, r219567, r219568 - InstCombine bug fixes re: no-wrap flags
  r219550, r219562 - Fixes to trip-count computations in the vectorizer/unroller

and given the magnitude of the change, I think that the trip-count changes are more likely. Of course, all of these things are bug fixes :( -- so how do we follow up on this?

-Hal

Interesting. Looks like the problem is in (219545, 219569].

Yes.

and given the magnitude of the change, I think that the trip-count changes are more likely.

Good point.

Of course, all of these things are bug fixes :( -- so how do we follow up on this?

Correctness before performance. Always.

I'll create a bug for this pointing to the delta for someone to
investigate. It doesn't have to be me or the committer; the idea is
that this is not high priority. At least, not for now.

Once we have a way to track this more consistently, I think a good
approach is to be pragmatic. So, something along the lines of working
with the implementer, trying to understand the reason for the
regression. It could be a bad implementation or just a target-specific
reason for the regression. Depending on the importance of the
regression and of the patch, I'd consider turning it off for ARM (for
example, PR18996) and investigating why later.

I'd only consider reverting the patch in extreme circumstances, for
example, if we're close to a release AND the regression is big AND the
patch is a new feature, etc. I believe that's what you were concerned
about. :)

cheers,
--renato

Interesting. Looks like the problem is in (219545, 219569].

Yes.

Chandler’s complex arithmetic changes are also in the range: r219557 in clang. We saw it change the code in mandel-2 significantly.

Adam

Folks,

First win of the benchmark buildbot!

Yay!

Out of curiosity, how long has it/you been monitoring for regressions?

-- Sean Silva

I would not at all be surprised by regression in some cases here.

Complex arithmetic with reals got *much* faster. Complex arithmetic with
other complex numbers may have gotten slower. If there were fastmath flags,
*crazy preposterously slower*.

Note that I have a patch out needing review which addresses most of the
regression in this case, and should address all of the regressions if
fastmath flags are in play.

I'm not. :)

I was looking for a few examples for my talk at LPC and found this by
accident. That's why I wanted to have some automated warnings on a
low-noise regression score.

cheers,
--renato

Chandler’s complex arithmetic changes are also in the range: r219557 in clang. We saw it change the code in mandel-2 significantly.

mandel-2 is broken on hard FP ABI systems, btw. The reason is simple:
we're emitting the call to __muldc3 with the AAPCS-VFP calling
convention; however, the function expects the softfp (AAPCS) calling
convention and reads garbage from the GP registers.

I'm working on a fix.

I submitted r219517 (not in the suspect range), which improved SCEV by enabling computation of trip counts when the loop has multiple exits. One of the things this enables is loop unrolling for loops with multiple exits. Then Chandler submitted r219550 and r219562 (in the suspect range) to fix some latent bugs that my change exposed (thanks, btw). The regressing benchmark (http://llvm.org/klaus/test-suite/blob/release_32/SingleSource/Benchmarks/Misc/mandel-2.c) does have a loop with multiple exits (the core loop of the benchmark). However, it's not unrolled at -O3 because the trip count of the loop latch is not computable at compile time. Nor is it vectorizable (the vectorizer uses the trip-count computation). So, I'd speculate that the reason for the regression lies elsewhere (complex arithmetic?).

Mark

Thanks for looking at this, Anton.

I don't really know what signal should be used here. Several initial
attempts at this didn't work because of ABI and calling convention issues
and I thought I had figured out a sound solution by directly forming and
calling the library routines using Clang's code generation to handle all of
the ABI issues.... but apparently not.

If you need to back out the patch temporarily or take any other steps to
mitigate, by all means. =/ Sorry for the fallout.

Hi Chandler,

It's embarrassing how weird this part of clang is. I have a provisional patch which fixes the problem but underlines clang's problems. I will submit it tonight for comments.

One possible approach is the attached patch. It is not completely
correct, as it doesn't handle the possible exceptions for WoA and iOS,
but it might be a starting point. The real question for me is whether
this logic belongs in clang at all. Who can speak up for the Fortran-related
behavior of complex arithmetic? If it matches the desired C
rules, especially with regard to real and imaginary numbers, that would
strongly support putting it into IR properly.

Joerg

CGExprComplex.cpp.diff (968 Bytes)

I apologize that I haven't been able to follow this thread entirely, but if
someone gives me a Fortran testcase I can check what Fortran+llvm would do
currently and maybe give more feedback.

Short version: if you multiply or divide a complex number by a real
number, is it valid to just do the trivial component-wise computation?
What about multiplying by / dividing by a pure imaginary number?

Joerg

I'll try to get a solid answer, but for clarity: Fortran has the types REAL
and CMPLX. I'm not sure if you mean INT or REAL when you say "real" number...
https://gcc.gnu.org/onlinedocs/gfortran/CMPLX.html
https://gcc.gnu.org/onlinedocs/gfortran/REAL.html

I am using "real" in the mathematical sense, so cast/promotion from
either INT or REAL would fit.

Joerg

I feel like I'm opening a can full of worms here. Here is the full story:

1. On ARM, all the EABI library functions need to use the AAPCS calling
convention, even on hard floating point targets.
2. __mulsc3, __muldc3 and friends are not defined by the EABI; however,
both compiler-rt and libgcc build them with the AAPCS, not the AAPCS-VFP,
calling convention.

Right now clang creates calls to the __mul?c3 routines but uses the C calling
convention, which on hard FP targets gets translated into the AAPCS-VFP CC,
which is certainly not what the implementation expects.

The obvious fix is to make sure clang specifies proper calling
convention for such calls. And here we have plenty of funny details.

1. There is CodeGenModule::CreateRuntimeFunction(), which is used to
emit calls to runtime functions. Surely, we cannot use it here,
because we would not like to force a calling convention on routines like
__cxa_throw, etc. (they do not use any FP arguments, so it's technically
safe on ARM, but...).
2. Even CodeGenModule::CreateRuntimeFunction() itself is not
enough. The reason is that we need to specify the calling convention on
*both* the call and the function. Usually we're using
CodeGenFunction::EmitRuntimeCall() simply because creating a function
requires one to specify an LLVM calling convention, while
CodeGenFunction::CreateCall() expects clang's calling convention ;)
3. So, for now I decided to separate calling conventions used to emit
calls to runtime functions and compiler builtins.

A patch showing this approach (and the whole problem) is attached :)

cc.diff (6.32 KB)

test.c (356 Bytes)

A patch showing this approach (and the whole problem) is attached :)

I wonder if it wouldn't be better to add an intrinsic. LLVM already
deals quite happily with emitting compiler-rt calls having different
calling conventions (and names -- it's entirely possible ARM will add an
__aeabi_mul*c3, for example). It seems a bit nasty to make Clang
understand all that too, unless we have to.

Cheers.

Tim.

I also thought about an intrinsic; however, I decided not to go this way,
mostly due to possible ABI issues - it's much better to deal with
them in the frontend than in the backend. That said, I'm not opposed to
this solution.

Mathematically, a*(b+ic) = ab + iac, where a, b, and c are arbitrary real numbers. Division by nonzero a is identical to multiplication by 1/a. I have no idea what language specs say, however.

In general, multiplying complex numbers in rectangular form is pretty straightforward:
(a+ib)(c+id) = ac - bd + i(ad+bc).

Division is a little more of a hassle:
(a+ib)/(c+id) = (a+ib)(c-id)/(c^2 + d^2)
= (ac + bd)/(c^2 + d^2) + i(bc - ad)/(c^2 + d^2).

(Polar form is more convenient for multiplication and division:
a*exp(is) * b*exp(it) = ab*exp(i(s+t)),
and division by nonzero a*exp(is) is the same as multiplication by (1/a)*exp(-is). Addition and subtraction become annoying, though.)