Proposal: AArch64/ARM64 merge from EuroLLVM

Hi all,

A bunch of us met at EuroLLVM to discuss the planned merge of the two
current AArch64 backends in the tree. The primary question was which
backend should form the basis of the merge (since the core .td files
aren't directly mergeable), with code being cherry-picked from the
other on a case-by-case basis.

There were factors to consider both ways, but I think the key points
of interest were agreed on by everyone:

1. That getting the merge done as quickly as possible was important to
avoid duplicated effort and confusion among our users.

2. That neither performance nor correctness was a particularly useful
discriminator between the backends. Both are good enough to form the
basis on those grounds.

Ana Pazos had managed to run some benchmarks on Cortex-A53 (an
in-order CPU) which showed that porting a few simple cases across
could reduce differences to low single digits, with winners in both
directions.

Similarly, people from ARM had managed to resolve most known
correctness issues since the initial commits last week.

That leaves long-term maintenance and features as the remaining
factors to make the decision: we want to spend as little effort as
possible (in total) to do things on the backend, both now and in
future.

In the short term, ARM64 is the clear choice; it simply has more
features now: ELF, FastISel, and the two NEON syntaxes were mentioned.
On the other side, AArch64 has incipient big-endian support and handles
CPUs with different sub-features (NEON/FP/...).

Longer term, the question is much more difficult. Maintainability is
often a matter of taste and there are issues with both backends (which
we should do our best to resolve!): ARM64 has horrific handling of
aliases and hacks in the various MC components (AsmParser,
Disassembler, ...); AArch64 has similar contortions in the .td files
(see loads/stores & instruction proliferation for aliases). ARM64 has
a clean implementation of calling conventions; AArch64 has its sysreg
lookup. I don't think either has fundamental barriers to a clean
design in future, personally, though AArch64 probably needs a couple
more pushes to get there.

The tentative conclusion was that we probably have all the information
we need available and should propose using the ARM64 backend as the
basis on the list and continue discussion here.

Tim.

Hi again,

In my original message I was attempting to summarise the key arguments
as I saw them. Other points came up in the discussion, which Ana
kindly recorded and I'll summarise here:

First, extra arguments brought up in favour of each backend (I'll
mention duplicates too so that the list is as complete as possible):

+ Register class usage in ARM64 is cleaner.
+ FastISel is on ARM64, but not AArch64. Some TableGen work will be
needed to enable it because of how patterns are written there.
+ There is no macro support in AArch64.
+ Both NEON syntax variants (general & iOS) are supported by ARM64 now.
+ ARM64 assumes NEON is enabled by default, and indeed has no notion that
a CPU might not have NEON. Instructions will need to be predicated to
check NEON is present, and some corresponding .cpp changes will probably
be needed where it was also assumed.
+ Inline asm is possibly better in ARM64.
+ Anecdotal evidence suggests it's easier to debug MC layer issues on
ARM64 than on AArch64.

Other important points that we discussed:

+ We need to set up a buildbot for performance using some real hardware
(volunteers with hardware?) so patches can be validated on the
supported targets, and also one for correctness using QEMU.

+ Google is working on a framework to build and run benchmarks – to be
available soon? It should enable the buildbot setup from the item above.

+ We need to sort out differences between the Cortex-A53 and Cyclone model
descriptions (both use the new approach for MI scheduler, but one
requires annotating instructions and the other does not). We should
pin down Andy and get him to describe the perfect machine model.

Cheers.

Tim

Hi folks,

As Tim pointed out, we recently had the opportunity to collect 64-bit benchmark performance data for the GCC 4.9, AArch64 and ARM64 compilers on real hardware: a Cortex-A53 device. For proprietary reasons we cannot share the full hardware configuration.

The preliminary results were shared at the hackers lab at EuroLLVM yesterday. For those who could not make it, below is the summarized performance data.

A positive number means the ARM64 run is better by that percentage. A negative number means the baseline (GCC 4.9 or AArch64) is better by that percentage.
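To make the sign convention concrete, here is a minimal Python sketch. It assumes the underlying measurements are scores where higher is better (the message does not say whether scores or run times were compared). The function names and sample numbers are hypothetical and are not taken from the actual runs.

```python
import math

def percent_better(arm64_score, baseline_score):
    """Positive: ARM64 is better; negative: the baseline is better.

    Assumes higher scores are better (an assumption, not stated in the thread).
    """
    return (arm64_score / baseline_score - 1.0) * 100.0

def geomean_percent(arm64_scores, baseline_scores):
    """Geometric mean of per-benchmark score ratios, expressed as a percentage."""
    ratios = [a / b for a, b in zip(arm64_scores, baseline_scores)]
    geomean = math.exp(sum(math.log(r) for r in ratios) / len(ratios))
    return (geomean - 1.0) * 100.0

# Hypothetical scores, for illustration only.
print(round(percent_better(110.0, 100.0), 1))  # ARM64 ahead: positive
print(round(percent_better(90.0, 100.0), 1))   # baseline ahead: negative
```

If the raw data were run times instead, the ratio would be inverted (baseline time divided by ARM64 time) to keep the same sign convention.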

Tuning of AArch64 backend on this processor has not been completely done yet (some initial work has started on modeling cortex-a53). But we quickly investigated the bad vectorized code in some of the tests (Linpack for example) and identified straightforward fixes that improved AArch64 performance (similar patches are present in ARM64, e.g. loop unroll default limit, unaligned memory accesses, etc.). These patches are going to the AArch64 commits list for review.

This experiment indicates that from the point of view of correctness and performance either ARM64 or AArch64 could be the base compiler of choice if the known correctness issues (in ARM64) and lack of performance tuning (in AArch64) are addressed.

However, much more work has to be done to catch up with GCC 4.9 middle-end and backend optimizations.



| Benchmark | ARM64 vs GCC 4.9 % | ARM64 vs AArch64 % | ARM64 vs AArch64 patched % |
| - | - | - | - |
| EEMBC (no consumer) geomean | -17 | 1 | -2 |
| EEMBC (consumer only) geomean | -21 | -2 | -5 |
| Linpack Double | -29 | 45 | -1 |
| Linpack Single | -51 | 40 | 1 |
| SPEC2000 geomean | -6 | 0 | 1 |

Thanks,

Ana.

Hi Ana,

Could you share the SPEC2000 data per suite and per benchmark?

Thanks
Gerolf

Hi again,

Having heard no howls of protest, those of us remaining on the
Wednesday decided to get down to planning a few more details of the
merge.

David Kipping very kindly took notes, and we've produced the summary
of the discussion below:

On Wednesday after the EuroLLVM meeting, a group met to continue
discussing the ARMv8 backend merge and how to accelerate completion.
Attending were James, Bradley, Tim, Jiangning, Kristof, Vinod,
Chandler, Pierre, Ana, and David.

EuroLLVM provided a timely and convenient opportunity to meet in
person to discuss this topic. But it is important to note that this is
only one meeting and some issues likely have been missed, and that not
everyone involved in the discussion was at EuroLLVM; everything below
is open for further discussion and revision on the community lists.

Later in the mail are details on the work to complete the merge, but
there is a lot and participation from the community is warmly welcome.
This is an excellent opportunity if you want to learn more about
backends, the ARMv8 architecture, or just want to ensure that the
community ARMv8 backend is of the highest quality and performance.
Some of the areas that have been identified needing help are:

- Code reviews (there will be lots of changes and quality of review
and timeliness is critical)
- Merging regression tests from both ARMv8 backends - Tim will lead
this effort but is looking for help
- Inline ASM (I think Eric said at the Hackers Lab that he might be
willing to do this)
- Fix bugs
- For others who want to help test, compiling and running your
codebases on QEMU (no crypto extensions)
- Code coverage analysis of backend
- Clean up the codebase (C++11-ify it, for example) - Jim will lead
this effort
- In addition, any of the work items identified later in this mail

- Inline ASM (I think Eric said at the Hackers Lab that he might be
willing to do this)

I am, yes.

- For others who want to help test, compiling and running your
codebases on QEMU (no crypto extensions)

Some reasonable description of how this works would be awesome.

- Feature parity - to the level found in the ARM64 and AArch64 backends today

As a note this should definitely be "Today", as in the day you sent
the email/had the meeting/etc. No new work should be considered part
of the final sign off - basically a gentle chide for people to stop
putting new features into the existing AArch64 backend :)

-eric

On Ubuntu 13.10, start from here

Nifty, thanks!

-eric

This sounds reasonable. Thanks, all.

- CSE of ADRP optimization (Jiangning)

Quentin may have some input here. He’s done quite a lot of optimizations for ADRP sequences.

-Jim

Hi Jim,

Hi Tim,

I just read this thread and I see that you mentioned the buildbot and my name. 

- LLVM test suite enabled in the buildbot and testing ARM64 (Gabor)

What exactly can I do to help you with the merge process?

Best regards,
Gabor Ballabas

Hi Gabor,

I think Kristof will talk to you after his vacation. Basically, we are hoping you can change the current buildbot to use QEMU. Current data shows the Foundation Model is 30x to 50x slower than QEMU.

Thanks,
-Jiangning

On 2014-4-15 at 8:30 PM, "Gabor Ballabas" <gaborb@inf.u-szeged.hu> wrote:

Hi Gabor,

This change also needs to cover the triple arm64-linux-gnuabi in the short term; in the long term we should only use aarch64-linux-gnuabi, I think.

Thanks,
-Jiangning

On 2014-4-15 at 9:16 PM, "Jiangning Liu" <liujiangning1@gmail.com> wrote:

Hi Jiangning,

+Quentin.

Hi Jiangning,

Hi Jim,

I think this is orthogonal. If you happen to merge globals they will have the same base address (i.e., the same pseudo instruction) but different offsets.
CSE and such will work like a charm for the pseudos.

Assuming you emit the right instructions at isel time, you will create ADRP, LOADGot, or ADD with symbols. Since you do not know anything about the symbols, CSE will match only the ones that are identical.
You will have a finer granularity to do CSE, but I am not sure it will help that much.
On the other hand, you lose the rematerialization capability, because that feature can only handle one instruction at a time. So you will still be able to rematerialize ADRP but not the LOADGot and ADD with symbols.
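The interaction between global merging and CSE described above can be illustrated with a toy model. This is a sketch in Python, not LLVM code: `count_base_computations` and the symbol names (including `merged`) are hypothetical, and CSE is modelled simply as sharing base-address computations whose symbol operands are identical.

```python
def count_base_computations(accesses, cse=True):
    """Count base-address computations needed for (symbol, offset) accesses.

    Toy model: with CSE, computations on an identical symbol are shared;
    nothing is known about distinct symbols, so they never match.
    """
    if not cse:
        return len(accesses)
    return len({symbol for symbol, _offset in accesses})

# Three separate globals: each needs its own base address.
separate = [("a", 0), ("b", 0), ("c", 0)]

# The same globals after merging: one base symbol, different offsets.
merged = [("merged", 0), ("merged", 4), ("merged", 8)]

print(count_base_computations(separate))
print(count_base_computations(merged))
```

In this model the merged case needs a single base computation because all three accesses name the same symbol, which is the situation where CSE on the pseudo instructions pays off.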

The LOH solution is also orthogonal. You can see that as a last chance way to optimize those accesses.
That said, if you CSE the ADRP and not the LOADGot, you will indeed create far fewer candidates for the LOHs because you will have ADRPs with several uses, which is not supported by LOHs.

FYI, the LOH optimization is not a link-time optimization in the LLVM sense; it is literally a link-time optimization, performed on the binary.

The bottom line is that whatever you are doing with merge globals, it is orthogonal to LOHs.
That said I think it is best to keep the pseudo instructions.

Of course I may be wrong and the best way to check would be to measure what happens if you get rid of the pseudo instructions. Do not be too concerned with the impact on the LOHs.

Thanks,
-Quentin

Hi Quentin,

Thanks for your feedback!

Hi Jiangning,

Interesting.
Looks like we are too clever here.
I would have expected ISel to generate one base address and one displacement.

I believe that if we fix that, both the LOHs and the global merge become orthogonal. My guess is that we should be less aggressive at folding offsets if there are several uses.

Sure, but it can help :).

Let us try to fix the codegen problem while keeping the pseudos.

Well, this is something that should be measured. Your patch does not kill the LOHs, it may just reduce the number of potential candidates. For each candidate that your patch removes, it means we at least spare one ADRP instruction. The trade-off does not seem bad.

I suggest we:

  1. Fix the ISel of pseudo (making the folding less aggressive).
  2. Measure the performance with your patch.

I can definitely help with the measurements with the LOHs enabled in parallel with your patch.
If you want I can help with #1 too.

Side question, did you happen to measure any performance improvement/regression with your patch?
I’d like to know which tests would be good candidates to measure the impact of your patch + LOHs enabled.

Thanks,
-Quentin

Hi Quentin,

Thanks for your kind help!

Hi Quentin,

BTW, the command-line options to enable CSE of ADRP should be:

-fno-common -mllvm -global-merge-on-external=true -mllvm -global-merge=true

If you want to measure the baseline, the command line should be:

-mllvm -global-merge=false
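For anyone scripting the comparison, here is a minimal Python sketch that assembles the two compile commands from the flag sets above. The flags are the ones quoted in this message; the `clang` driver name, the `-c` compile-only flag, the `test.c` file name and the helper `compile_command` are assumptions made for illustration.

```python
import shlex

# Flag sets as quoted in the message above.
CSE_ADRP_FLAGS = "-fno-common -mllvm -global-merge-on-external=true -mllvm -global-merge=true"
BASELINE_FLAGS = "-mllvm -global-merge=false"

def compile_command(flags, source="test.c", compiler="clang"):
    """Build an argv list for one configuration (hypothetical helper).

    The compiler name and source file are placeholders; adjust them to the
    toolchain and benchmark actually being measured.
    """
    return [compiler, *shlex.split(flags), "-c", source]

for name, flags in (("cse-adrp", CSE_ADRP_FLAGS), ("baseline", BASELINE_FLAGS)):
    print(name, " ".join(compile_command(flags)))
```

The resulting lists can be passed to `subprocess.run` unchanged, so each benchmark is built once per configuration with otherwise identical options.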

Thanks,
-Jiangning