[RFC][AArch64] Make -mcpu=generic schedule for an in-order core

Hello folks,

We would like to start pushing -mcpu=generic for AArch64 towards enabling a set of features that are believed to be beneficial in general: ones that improve performance for some CPUs without hurting it on any others. In other words, a blend of performance options that is hopefully beneficial to all CPUs.

The largest part of that is enabling in-order scheduling using the Cortex-A55 schedule model. This is similar to the Arm backend change in eecb353d0e25ba, which made -mcpu=generic perform in-order scheduling using the Cortex-A8 scheduling model.

The idea is that in-order CPUs require the most help with instruction scheduling, whereas out-of-order CPUs can, for the most part, reorder instructions in hardware to work around different codegen. Our benchmarking suggests that this hypothesis holds, with in-order performance benefiting from the scheduling by between 1% and 4% geomean. Out-of-order performance was quite noisy and the results were within the noise margins, tending towards a slight improvement in general.
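As a rough illustration of how a geomean figure like the one above is typically computed from per-benchmark run times, here is a minimal Python sketch. The benchmark names and timings are made up purely for illustration; they are not the measurements behind this RFC, and the helper names are hypothetical.

```python
import math


def geomean(ratios):
    """Geometric mean of positive ratios, computed via logs for stability."""
    if not ratios:
        raise ValueError("need at least one ratio")
    return math.exp(sum(math.log(r) for r in ratios) / len(ratios))


def geomean_improvement(baseline_times, new_times):
    """Percent geomean speedup of new over baseline (positive means faster)."""
    speedups = [baseline_times[name] / new_times[name] for name in baseline_times]
    return (geomean(speedups) - 1.0) * 100.0


# Hypothetical run times in seconds; NOT data from the patch.
baseline = {"bench_a": 10.0, "bench_b": 20.0, "bench_c": 5.0}
scheduled = {"bench_a": 9.8, "bench_b": 19.0, "bench_c": 5.0}

print(f"geomean improvement: {geomean_improvement(baseline, scheduled):.2f}%")
```

The geometric mean is used rather than the arithmetic mean so that no single benchmark with a large ratio dominates the summary.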

When specifying an Apple target, clang will set "-target-cpu apple-a7" on the command line, so Apple targets should not be affected by this change when compiling with clang. The change also doesn't enable more runtime unrolling like -mcpu=cortex-a55 does; it only changes the schedule used.
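One way to inspect the effect on a particular file is to compare the assembly clang emits with -mcpu=generic against -mcpu=cortex-a55. The Python sketch below only builds the two command lines; the source file name kernel.c, the target triple, and the helper name are placeholders chosen for illustration.

```python
import shlex


def clang_asm_command(source, mcpu, target="aarch64-linux-gnu", opt="-O2"):
    """Build a clang command line that writes assembly to stdout for one -mcpu value."""
    return ["clang", f"--target={target}", opt, f"-mcpu={mcpu}", "-S", source, "-o", "-"]


# Print one command per CPU setting; run them and diff the output to compare schedules.
for cpu in ("generic", "cortex-a55"):
    print(shlex.join(clang_asm_command("kernel.c", cpu)))
```

Any differences that remain between the two outputs after the change would be expected to come from tuning other than the schedule, such as the runtime unrolling mentioned above.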

The patch that makes the change is at https://reviews.llvm.org/D110830, with extra details about the performance changes and all the tests that are updated.

Let us know if you have comments.

Thanks
Dave

> Hello folks,
>
> We would like to start pushing -mcpu=generic for AArch64 towards enabling a set of features that are believed to be beneficial in general: ones that improve performance for some CPUs without hurting it on any others. In other words, a blend of performance options that is hopefully beneficial to all CPUs.

Hi David,

This is the usual LLVM definition of “generic”, so working on that goal is always good.

> The largest part of that is enabling in-order scheduling using the Cortex-A55 schedule model. This is similar to the Arm backend change in eecb353d0e25ba, which made -mcpu=generic perform in-order scheduling using the Cortex-A8 scheduling model.

I think this makes sense because the A55 scheduling model is more likely to benefit the chips produced nowadays than the A8’s.

> When specifying an Apple target, clang will set “-target-cpu apple-a7” on the command line, so Apple targets should not be affected by this change when compiling with clang. The change also doesn’t enable more runtime unrolling like -mcpu=cortex-a55 does; it only changes the schedule used.

Thinking out loud, what do people think of creating an additional “ooo” target? So, “generic” is the same as “in-order”, but the “ooo” (or “unordered”, whatever) would pick a base OOO target, like A57, A72, etc.

A few years ago, when I was doing benchmarks for OpenBLAS changes on Arm, I realised that picking a base OOO target like this was beneficial to most targets, often beaten only by specifying the correct target.

cheers,
–renato

Hi Renato,

> > The largest part of that is enabling in-order scheduling using the Cortex-A55 schedule model. This is similar to the Arm backend change in eecb353d0e25ba, which made -mcpu=generic perform in-order scheduling using the Cortex-A8 scheduling model.
>
> I think this makes sense because the A55 scheduling model is more likely to benefit the chips produced nowadays than the A8’s.

Just to be explicit, eecb353d0e25ba was for the ARM backend, so AArch32; this change is for AArch64. But I agree the ARM backend could benefit from an update too.

> Thinking out loud, what do people think of creating an additional “ooo” target? So, “generic” is the same as “in-order”, but the “ooo” (or “unordered”, whatever) would pick a base OOO target, like A57, A72, etc.

Sounds interesting!