Newer Cortex scheduling files for LLVM? A77/A78/X1?

Hello!
Sorry about the cold emailing here

I was looking to use llvm-mca for some static analysis of codegen I am
running and noticed that I was getting only A57 scheduling
information.
Turns out all Cortex cores past A57 are just using the A57 scheduler.
I've started writing a custom Cortex-A77 scheduler file from the
guidance in the public optimization guide for the core.
Sadly I have very quickly run into a problem: for multi-uop
instructions, the guide doesn't say how many uops are generated per pipeline.
Considering I don't have access to ARM's internal documentation here,
it's hard to generate a schedule file for these cores that will end up
being correct enough for static analysis.

I see David Penry (whom I've CC'd) has recently created a schedule file
for the M7.
Does anyone have schedule files coming for these CPU cores?
Or maybe the documentation about which uops are generated for which
instructions can be made available? That way schedule files can be
created publicly?
Cortex-A77 is the ideal choice in this case, but A78 and subsequently
X1 will be interesting targets as time moves forward.

Hi Ryan,

It may be worth trying to run llvm-exegesis on your target (assuming that you have access to the hardware).
As far as I understand, exegesis should already support aarch64. That being said, I have personally never tried it on arm/aarch64, so I don’t know what the level of support is for your particular processor.

-Andrea

Hello,

There have been no newer scheduling models built for Cortex cores past Cortex-A57 because we haven't been able to see performance uplifts from providing a more accurate schedule for these kinds of big out-of-order cores. Speaking very generally, in most cases the newer Cortex cores are wide enough and have a large enough OoO window that, most of the time, scheduling roughly is almost as good as (within 1% or so of) trying to schedule optimally.

Static analysis is, of course, quite a different use case and probably wants even more accurate models than what has been contributed for Cortex-A57.

The most detailed information which Arm releases is in the Software Optimization Guides, which, judging from your message, you already have access to; for others, they can be found on our website, for example for the Cortex-A78 processor at Documentation – Arm Developer. These detail the utilized pipelines, latency, and throughput for each instruction. While the correlation isn't always perfect, this information can allow you to calculate something approximating "uops" using something like (1 / (throughput / n_pipelines)). Of course, for operations like floating point division this approximation is not appropriate, and there are subtleties around operations which use different classes of pipeline.
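
As a rough illustration of that approximation, here is a minimal Python sketch. The function name approx_uops and the sample numbers are hypothetical and are not taken from any Software Optimization Guide; it assumes throughput is given in instructions per cycle and n_pipelines is the number of pipelines the instruction can issue to.

```python
from fractions import Fraction


def approx_uops(throughput, n_pipelines):
    """Estimate uops as 1 / (throughput / n_pipelines).

    throughput  -- instructions per cycle, as listed in the guide (assumed)
    n_pipelines -- number of pipelines the instruction can use (assumed)

    This is only the rough heuristic described above; it is not
    appropriate for operations such as floating point division.
    """
    if throughput <= 0 or n_pipelines <= 0:
        raise ValueError("throughput and n_pipelines must be positive")
    # Fractions keep values such as 1/2 exact instead of rounding.
    return 1 / (Fraction(throughput) / Fraction(n_pipelines))


# Hypothetical entries, for illustration only.
print(approx_uops(2, 2))               # two pipelines, two per cycle
print(approx_uops(1, 2))               # two pipelines, one per cycle
print(approx_uops(Fraction(1, 2), 2))  # two pipelines, one every two cycles
```

A result of 1 suggests a single uop that can go down either pipeline, while larger results suggest the instruction occupies several pipeline slots. The estimate says nothing about which pipelines the individual uops actually use.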

This goes back to the question of how accurate our scheduling models need to be. For code-generation, this would probably be good enough. For llvm-mca it might give misleading answers.

llvm-exegesis is an excellent idea for measuring details directly. I know that it has worked in the past on AArch64 for latency calculations, but unfortunately it may have bit-rotted in the meantime.

Thanks,
Dave

Dang, llvm-exegesis is a great option, but my board currently has
broken perf events, so I can't gather this data right now anyway.
I could probably try to fix this and then start profiling, but I'm
already burning a bunch of time just trying to generate a sched file.
Was worth a try, hopefully we can get more of these schedule files
generated in the future.