[RFC] Making -mcpu=generic the default for ARM armv7a and armv8a rather than -mcpu=cortex-a8 or -mcpu=cortex-a53

Motivation

At the moment, when targeting armv7a, clang defaults to generating code as if -mcpu=cortex-a8 was specified. When targeting armv8a, it defaults to generating code as if -mcpu=cortex-a53 was specified.

This leads to surprising code generation: the compiler optimizes for a specific micro-architecture, whereas the user probably intended to generate code that is “blended” for all the cores implementing the requested architecture. One example of a user being surprised like this is at https://bugs.llvm.org//show_bug.cgi?id=27219, where vmla instructions are not produced, in order to optimize for a Cortex-A8-specific micro-architectural behaviour, even though the user didn’t ask to optimize specifically for Cortex-A8.

It would be much cleaner conceptually if clang defaulted to -mcpu=generic when no specific cpu is specified.

What is the impact of this change on execution speed?

I think the main reason to be hesitant to change the default CPU for ARM to -mcpu=generic is the potential impact on performance of generated code.

I’ve measured quite a wide selection of benchmarks with this change, on the following cores: Cortex-A9, Cortex-A53, Cortex-A57, Cortex-A72.

Impact on execution speed, for each core, when using -march=armv7a, after changing the default cpu from cortex-a8 to generic is as follows.
A positive number means a speedup, a negative number means a slow-down. These are the geomean results over 350 programs coming from benchmark suites such as the test-suite, SPEC2000, SPEC2006 and a range of proprietary suites.

Cortex-A9: 0.96%
Cortex-A53: -0.64%
Cortex-A57: 1.04%
Cortex-A72: 1.17%
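
A minimal sketch of how a geomean figure like the ones above can be computed, assuming per-program execution times are available for both the old and the new default. The function name and the sample timings are hypothetical and are not measurements from this thread; defining the per-program speedup as the ratio of old time to new time is also an assumption.

```python
import math

def geomean_speedup_percent(baseline_times, new_times):
    """Geometric mean of per-program speedups, as a percentage.

    Positive means the new configuration is faster, negative means slower.
    """
    if len(baseline_times) != len(new_times) or not baseline_times:
        raise ValueError("need two non-empty lists of equal length")
    # Per-program ratio: > 1.0 means the new build runs faster (assumed definition).
    log_sum = sum(math.log(b / n) for b, n in zip(baseline_times, new_times))
    geomean_ratio = math.exp(log_sum / len(baseline_times))
    return (geomean_ratio - 1.0) * 100.0

# Hypothetical timings in seconds for three programs (illustrative only).
baseline = [10.0, 20.0, 5.0]
new = [9.8, 20.2, 4.9]
print(f"{geomean_speedup_percent(baseline, new):+.2f}%")
```

Averaging in log space keeps a single outlier program from dominating the result, which is why a geomean is the usual choice for summarising ratios across a large benchmark set.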

Impact on execution speed, for each core, when using -march=armv8a, after changing the default cpu from cortex-a53 to generic:

(Cortex-A9 is an armv7a core, so can’t execute armv8a binaries)

Cortex-A53: -0.09%
Cortex-A57: -0.12%
Cortex-A72: 0.03%

Should we enable scheduling for an in-order core even for -mcpu=generic?

The above measurements show that the biggest negative impact is with -march=armv7a on Cortex-A53: -0.64%.
It seems that the in-order Cortex-A53 core loses quite a bit of performance when the instructions aren’t scheduled, which is to be expected.
Therefore, I also experimented with scheduling instructions according to the Cortex-A8 pipeline model even for -mcpu=generic, to find out whether scheduling for an in-order core is more beneficial than not scheduling at all.

Measurement results:

-march=armv7a

Cortex-A9: 1.57% (up from 0.96%)
Cortex-A53: 0.47% (up from -0.64%)
Cortex-A57: 1.74% (up from 1.04%)
Cortex-A72: 1.72% (up from 1.17%)

-march=armv8a (Note that there isn’t a pipeline model for Cortex-A53 in the 32-bit ARM backend):

(Cortex-A9 is an armv7a core, so can’t execute armv8a binaries)

Cortex-A53: 0.49% (up from -0.09%)
Cortex-A57: 0.09% (up from -0.12%)
Cortex-A72: 0.20% (up from 0.03%)

Conclusion: for all the in-order and out-of-order cores I measured, it’s beneficial to get the instructions scheduled using the Cortex-A8 pipeline model in combination with -mcpu=generic.
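
The comparison behind that conclusion can be tabulated with a short sketch. The percentages are copied from the two sets of measurements above; the dictionary layout and function name are hypothetical, and subtracting two geomean percentages is only an approximation of the relative gain between the two configurations.

```python
# Geomean results quoted in this thread, in percent, relative to the old defaults.
# Each entry maps core -> (generic without scheduling, generic with Cortex-A8 scheduling).
RESULTS = {
    "armv7a": {
        "Cortex-A9": (0.96, 1.57),
        "Cortex-A53": (-0.64, 0.47),
        "Cortex-A57": (1.04, 1.74),
        "Cortex-A72": (1.17, 1.72),
    },
    "armv8a": {
        "Cortex-A53": (-0.09, 0.49),
        "Cortex-A57": (-0.12, 0.09),
        "Cortex-A72": (0.03, 0.20),
    },
}

def scheduling_gain(results):
    """Return {(march, core): gain in percentage points from enabling scheduling}."""
    return {
        (march, core): round(with_sched - without, 2)
        for march, cores in results.items()
        for core, (without, with_sched) in cores.items()
    }

gains = scheduling_gain(RESULTS)
for (march, core), gain in gains.items():
    print(f"-march={march} {core}: {gain:+.2f} points")
```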

Taking into account the above measurements, my conclusions are:

  1. We should make -mcpu=generic the default cpu, not Cortex-A8 or Cortex-A53, for -march=armv7a and -march=armv8a.
  2. We probably want to let the compiler schedule instructions using the Cortex-A8 pipeline model for -mcpu=generic, since it gives a bit of speedup on all cores tested.

Do people agree with these conclusions?
Any objections against implementing this?
Any other potential impact this may have that I forgot to consider above?

Thanks,

Kristof

> Taking into account the above measurements, my conclusions are:
> 1. We should make -mcpu=generic the default cpu, not Cortex-A8 or Cortex-A53
> for march=armv7a and march=armv8a.

Using -mcpu=native makes more sense to me, if at all possible to
detect, falling back to generic, which doesn't hurt.

> 2. We probably want to let the compiler schedule instructions using the
> Cortex-A8 pipeline model for -mcpu=generic, since it gives a bit of speedup
> on all cores tested.

Same here, I'd use the schedule of the detected CPU, if any, or fall
back to A8 (which seems fine).

But yeah, it's time we get rid of the A8/A53 defaults.

While we're at it, we may think about ARMv7's NEON default. Generating
only VFP is slower on boards with NEON, but generating NEON crashes
with SIGILL on boards that don't have it.

I'd be happy if Clang could detect CPU/FPU and set the flags
accordingly, or fall back to "generic"/A8-schedule/VFP defaults.
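
A minimal sketch of that "detect, else fall back to generic" idea, assuming a Linux host where /proc/cpuinfo exposes a "CPU part" field. This is an illustration and not how Clang implements -mcpu=native: the helper name and the small lookup table are hypothetical, and only cores mentioned in this thread are listed. FPU detection is left out.

```python
# Hypothetical mapping from the "CPU part" field in Linux /proc/cpuinfo
# to a -mcpu value; only the cores mentioned in this thread are listed.
KNOWN_PARTS = {
    0xC08: "cortex-a8",
    0xC09: "cortex-a9",
    0xD03: "cortex-a53",
    0xD07: "cortex-a57",
    0xD08: "cortex-a72",
}

def pick_mcpu(cpuinfo_text):
    """Return a -mcpu value for the first recognised core, else "generic"."""
    for line in cpuinfo_text.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip() == "CPU part":
            try:
                part = int(value.strip(), 16)
            except ValueError:
                continue  # Malformed field: keep looking.
            if part in KNOWN_PARTS:
                return KNOWN_PARTS[part]
    # Nothing recognised: fall back to the predictable default.
    return "generic"

# Illustrative input; on a real board the text would come from /proc/cpuinfo.
sample = "processor\t: 0\nCPU part\t: 0xd03\n"
print("-mcpu=" + pick_mcpu(sample))
```

On a system with mixed cores this sketch simply takes the first recognised one, which shows how the result depends on the build machine.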

cheers,
--renato

>> Taking into account the above measurements, my conclusions are:
>>
>>   1. We should make -mcpu=generic the default cpu, not Cortex-A8 or Cortex-A53
>>     for march=armv7a and march=armv8a.
>
> Using -mcpu=native makes more sense to me, if at all possible to
> detect, falling back to generic, which doesn’t hurt.

Ultimately either solution is fine with me. If Kristof wanted to switch it to generic while getting the autodetection stuff up that would also be ok.

-eric

Agreed.

For the sake of predictability, methinks that it'd make more sense for the default to always mean the same thing for everyone, as Kristof suggested.

Hi, Kristof.

I think that it makes sense. Your results also somehow corroborate the model adopted in GCC for the generic tuning, especially WRT scheduling in order.

Thank you,

Wow, these are some fantastic results! Android is definitely in favor of fixing the defaults, so this proposal looks great from our perspective.

Thanks,
Steve

I fully agree. Non-predictable builds are a major PITA for debugging.

Joerg

Date: Wed, 31 May 2017 10:23:14 -0500
From: Evandro Menezes via llvm-dev <llvm-dev@lists.llvm.org>

>> Taking into account the above measurements, my conclusions are:
>> 1. We should make -mcpu=generic the default cpu, not Cortex-A8 or Cortex-A53
>> for march=armv7a and march=armv8a.
> Using -mcpu=native makes more sense to me, if at all possible to
> detect, falling back to generic, which doesn't hurt.

For the sake of predictability, methinks that it'd make more sense for
the default to always mean the same thing for everyone, as Kristof
suggested.

Seconded. Predictable builds are pretty important for us on OpenBSD,
as people regularly do their own (native) builds of the complete OS on
different kinds of hardware.

Thanks to everyone for giving their feedback!
I saw pretty unanimous support for making -mcpu=generic the default and making -mcpu=generic schedule for an in-order CPU (Cortex-A8 in this case).
I’ll be making those changes shortly.

I think the comments also make clear that it’s less obvious whether we’d want -mcpu=native to become a default. It’s probably good for some use cases, but really not good for others. I won’t be making that change, nor advocating for it.

Thanks!

Kristof

That was just me and I am now thoroughly convinced it's not a good idea. :)

Please, proceed as planned.

Thanks Kristof, for the detailed investigation and everyone for their comments.

cheers,
--renato

Hi, Kristof.

It sounds like a good plan, but one thing is not clear to me from your post. Which pipeline model will be used for AArch64, A53's (i.e., none)?

Thank you,

Hi Evandro,

For now, I’m only looking at AArch32, not AArch64.
Indeed, we could also perform in-order scheduling for -mcpu=generic on AArch64, and Cortex-A53 seems to be the best/only choice available.
But making that change would first require another round of extensive benchmarking.

So in summary: I’ll put the idea on my backlog, but I probably won’t have time to get all the benchmarking done in the very near future.

Thanks,

Kristof