[GlobalISel][AArch64] Toward flipping the switch for O0: Please give it a try!

Hi,

GlobalISel, the SelectionDAG replacement, has come a long way since initially discussed on the mailing list and its last discussion at the EuroLLVM BoF (https://etherpad.net/p/GlobalISel).
We believe we are close to the point of enabling it by default on AArch64 at O0. We now would like to enlist your help to get there.

*** Quick Status ***

On iOS we are at 100% pass rate in 00 g for the LLVM test suite, standard benchmarks and unit tests. In about 5% of all functions GlobalIsel falls back to SDIsel.
(Kristof Beyls would have the linux numbers.)
The self host compiler correctly builds and runs the LLVM test suite in O0.

*** We Need Your Help ***

Please try GlobalISel for AArch64 at O0 (preferably O0 g) and file PR for:

  • Performance problems (compile time, runtime, code size)
  • Miscompile
  • Crashers
  • Poor debug information
  • etc.

There is a component GlobalISel in llvm.org/bugs for that.

GlobalISel cannot handle some inputs (e.g., some vector construction), but it falls back to SDISel when it hits such cases. We are also interested to know whether or not GlobalISel fallback on your favorite workloads and why (see next section for the actual options to run). So please file PRs for that too.

As always, patches are welcome!

*** Concretely What Do I Run ***

Please test your favorite workload/scenario on AArch64 at O0 using one of the following additional category of options:
Recommended: (-llvm) -global-isel (-mllvm) -global-isel-abort=2
OkToFallBack:: (-llvm) -global-isel (-mllvm) -global-isel-abort=0
AbortOnFallBack: (-llvm) -global-isel

The Recommended way will issue a warning if it falls back to SelectionDAG, the OkToFallBack will silently fallback to SDISel if it needs to, and the AbortOnFallBack will kill the compiler if GlobalISel cannot handle the input completely.

*** What Is Next? ***

We would like to turn on GobalISel by default in the next couple of weeks. Please help identify any critical issue that needs to be resolved before that can happen. We expect that minor hiccups and outliers in any metrics can be fixed quickly, but are worried about harmful run-time (miscompile) or compile-time crashes.

Many thanks,
-Quentin

Having done no tests at all on my side, I think we need to have
similar numbers on Linux to be able to flip across the board.

I don't want to flip it only for Darwin and not Linux, as that will
fragment the effort too much.

I'll check with Diana and Kristof to know what's the best way forward,
but it should be reasonably quick.

cheers,
--renato

Hi Renato,

If Kristof is busy I can make runs on AArch64 Linux (Cortex-A53 and Cortex-57).

Thanks,
Evgeny Astigeevich
Senior Compiler Engineer
Compilation Tools
ARM

Hi,

Just a heads-up for anyone building things with CMake: if you use the
default build type (Release), it will put CMAKE_CXX_FLAGS_RELEASE
after the CMAKE_CXX_FLAGS, so you'll end up running -O3 instead of
-O0. You might want to check your command lines.

Kristof has also found an issue in the test-suite where it actually
runs -O2 due to some weirdness in the build system, we'll try to
commit a fix for that.

Cheers,
Diana

Hi Diana,

Thank you for information. Yes, I know about the problem with O options.
I've been hit by this whilst I was testing my changes for Oz.

Thanks,
Evgeny

I’ve kicked off a run to compare “-O0 -g” versus “-O0 -g -mllvm -global-isel -mllvm -global-isel-abort=2”.
I’ve selected the test-suite (albeit a version which is a couple of months old now) and a few short-running proprietary benchmarks to get data back quickly for an initial feel of where things are.
This was running on Cortex-A57 AArch64 Linux.

I saw one assertion failure in GlobalISel, see http://bugs.llvm.org/show_bug.cgi?id=32471. This is in a program compiled at -O2 (my out-dated test-suite still overrides -O0 and instead uses -O for that program). The root cause of the failure seems to be due to LowLevelType not supporting vectors of pointers. I think this demonstrates that for correctness, we should be trying to test more than -O0, or even more than just LLVM-IR produced by clang, as other front-ends could run into this even at -O0.

Due to this assertion failure and the infrastructure I used, the numbers below do not include test-suite/MultiSource/Benchmarks results.

On the non-correctness aspects, LNT tells me that:

  • The programs that report execution time, on geomean are about 17% slower.
  • The programs that report scores, on geomean are about 21% slower.
  • Code size is up on geomean about 11%.
    I’m afraid I don’t have compile time numbers, nor any feel for debug info quality.

I’ll need quite a bit more time to dig into the details to come up with something actionable, although the fact that LowLevelType doesn’t support vectors of pointers is already actionable.
Nevertheless, I thought to share what I see as is, to see if others see similar results so far.

I thought Diana was going to look into fallback rate on the test-suite on AArch64 linux?

Thanks,

Kristof

It does actually work if you add "-fno-vectorize -fno-slp-vectorize"
to CMAKE_CXX_FLAGS_RELASE (and C). Not a standard configuration, but
probably throwing more varied IR at it than just doing -O0.

Cheers.

Tim.

Hi,

Here's my results so far:

On the test-suite, we get 2 timeouts during execution (paq8p and
scimark2). Other than that, everything seems to run just fine. We have
about 7410 unique fallbacks out of 52367 unique functions. Assuming I
counted right (please let me know if this looks fishy, I personally
find the total number of functions terrifyingly small).

On a stage 2 build of clang, we run check-all successfully. We have
about 64784 unique fallbacks out of 661461 functions.

I'm currently trying to run a stage 3 build to compare binaries (just
for kicks) and I'll also try to do some runs without fallbacks and
count what problems we run into most often.

The way I've been counting the total number of functions was to run
objdump -t on all the .o files in the build directory, grab everything
with an "F" flag and remove duplicates. If anyone knows a better way
to do this I'm all ears.

Thanks,
Diana

On the test-suite, we get 2 timeouts during execution (paq8p and
scimark2).

Interesting. I'd not seen those failures in the configurations I'd
run. I'll look into them (other than that my best bet for debugging is
a kernel panic, this has to be easier!).

On the test-suite, we get 2 timeouts during execution (paq8p and
scimark2). Other than that, everything seems to run just fine. We have
about 7410 unique fallbacks out of 52367 unique functions. Assuming I
counted right (please let me know if this looks fishy, I personally
find the total number of functions terrifyingly small).

It's about 275000 before uniqueing (which roughly matches an earlier
measurement I did). The duplication seems to be dominated by the
halide tests, though you have to be a little careful with "main"
(which occurs about 500 times). At a glance I'd put the total closer
to 53000 (52420 + 492 extra copies of main), but that's a small
difference to your figure.

Tim.

I just ran the tests on Linux. Those two were last to finish, but they
did terminate eventually for me. Bother, back to the kernel.

Tim.

I’ll try a run here shortly, it’s just going to take a while.

-eric

Would this completely replace FastISel (on AArch64) then? So that we have to look for bugs only in GlobalISel and SelectionDAG?

  • Matthias

I’ve been digging a little bit deeper into the biggest performance regressions I’ve observed.

What I’ve observed so far is:

  • A lot of the biggest regressions are caused by unnecessarily moving floating point values through general purpose registers. I’ve raised http://bugs.llvm.org/show_bug.cgi?id=32550 for this. I think this one definitely needs fixing before enabling GlobalISel by default at -O0.
  • FastISel seems to transform division-by-constant-power-of-2 into right shift (see https://github.com/llvm-mirror/llvm/blob/master/lib/CodeGen/SelectionDAG/FastISel.cpp#L456-L468). GlobalISel does not. It seems to me that at -O0 there may be reasons not perform this transformation, but maybe there is a good reason why FastISel does this?
  • FastISel doesn’t seem to handle functions with switch statements, so it falls back to DAGISel. DAGISel produces code that’s a lot better than GlobalISel for switch statement at -O0. I’m not sure if we need to do something here before enabling GlobalISel by default. I’m thinking we may need to add a smarter way to lower switch statements rather than just a cascaded sequence of conditional branches.

I’ll try to add the above content to the document Diana created at https://goo.gl/IS2Bdw too.

Thanks,

Kristof

I've been digging a little bit deeper into the biggest performance
regressions I've observed.

What I've observed so far is:
* A lot of the biggest regressions are caused by unnecessarily moving
floating point values through general purpose registers. I've raised
32550 – GlobalISel -O0 for AArch64 moves floating point values through GPRs way too often for this. I think this one
definitely needs fixing before enabling GlobalISel by default at -O0.
* FastISel seems to transform division-by-constant-power-of-2 into right
shift (see
https://github.com/llvm-mirror/llvm/blob/master/lib/CodeGen/SelectionDAG/FastISel.cpp#L456-L468).
GlobalISel does not. It seems to me that at -O0 there may be reasons not
perform this transformation, but maybe there is a good reason why FastISel
does this?

So, FastISel on AArch64 isn't really an "O0" selector: it has a lot
of smarts and peepholes, because some JIT users had it as the main
optimizing selector for a while.

In that sense, it's a pretty aggressive target that IMO we don't have to match.

* FastISel doesn't seem to handle functions with switch statements, so it
falls back to DAGISel. DAGISel produces code that's a lot better than
GlobalISel for switch statement at -O0. I'm not sure if we need to do
something here before enabling GlobalISel by default. I'm thinking we may
need to add a smarter way to lower switch statements rather than just a
cascaded sequence of conditional branches.

D31080 seems promising, I've been wanting to take a look, hoping we
can use that to emit an optimized lowering. I'm not sure we want that
at O0 though (even if only for FastISel+DAGISel parity).

I'll try to add the above content to the document Diana created at
https://goo.gl/IS2Bdw too.

Thanks for the investigation! These are also some of the biggest
problems I've seen (in particular the FP regbanks).

I'll make sure I find the time to file bugs for all the other issues
I'm aware of. (sorry I haven't done that earlier!)

-Ahmed

On the test-suite, we get 2 timeouts during execution (paq8p and
scimark2).

Interesting. I'd not seen those failures in the configurations I'd
run. I'll look into them (other than that my best bet for debugging is
a kernel panic, this has to be easier!).

I just ran the tests on Linux. Those two were last to finish, but they
did terminate eventually for me. Bother, back to the kernel.

But how do the compile-times compare vs FastIsel?

I’ve been digging a little bit deeper into the biggest performance
regressions I’ve observed.

What I’ve observed so far is:

So, FastISel on AArch64 isn’t really an “O0” selector: it has a lot
of smarts and peepholes, because some JIT users had it as the main
optimizing selector for a while.

In that sense, it’s a pretty aggressive target that IMO we don’t have to match.

OK, that makes sense to me now, and indeed it doesn’t seem a good idea to try and do lots
of peepholes at -O0.

  • FastISel doesn’t seem to handle functions with switch statements, so it
    falls back to DAGISel. DAGISel produces code that’s a lot better than
    GlobalISel for switch statement at -O0. I’m not sure if we need to do
    something here before enabling GlobalISel by default. I’m thinking we may
    need to add a smarter way to lower switch statements rather than just a
    cascaded sequence of conditional branches.

D31080 seems promising, I’ve been wanting to take a look, hoping we
can use that to emit an optimized lowering. I’m not sure we want that
at O0 though (even if only for FastISel+DAGISel parity).

I wasn’t aware of D31080: good to know!
My thinking here is that one of the reasons people use -O0 is they want a pretty straightforward
mapping between source code and the generated assembly code. For switch statements,
mapping to a cascaded sequence of conditional branches, a jump table, a binary search tree,
or any of the other ways to lower switch statements is equally good from this
perspective, I think. So if one of these other lowering schemes is as good as a cascaded sequence
of branches for aspects such as compile time and debug info quality, I think it’s best to choose
one of these alternative lowering schemes.
It seems to me that e.g. on MultiSource/Applications/sqlite3/sqlite3, this may be the cause
of the almost 2x slowdown compared to non-globalisel -O0.

I’ll try to add the above content to the document Diana created at
https://goo.gl/IS2Bdw too.

Thanks for the investigation! These are also some of the biggest
problems I’ve seen (in particular the FP regbanks).

I’ll make sure I find the time to file bugs for all the other issues
I’m aware of. (sorry I haven’t done that earlier!)

I’ve seen you added 2 bugs so far. I’ve slotted them in to https://goo.gl/IS2Bdw.
I’m starting to think that it may be easiest if we had a “Meta bug” in bugzilla that combines
all the issues we think should be fixed before GlobalISel can be enabled by default at -O0
for AArch64. In the same style as e.g. http://bugs.llvm.org/show_bug.cgi?id=32061.

What do you think?

Thanks!

Kristof

+1, we already do this for lots of other things (inline assembly bugs,
bugs building the Linux kernel etc) and it's a good way to keep things
organized.

Hi Kristof,

I’ve been digging a little bit deeper into the biggest performance regressions I’ve observed.

What I’ve observed so far is:

  • A lot of the biggest regressions are caused by unnecessarily moving floating point values through general purpose registers. I’ve raised http://bugs.llvm.org/show_bug.cgi?id=32550 for this. I think this one definitely needs fixing before enabling GlobalISel by default at -O0.

I commented in the PR. This is a known problem and we have a solution. Given this is an optimization in the sense that it does not affect the correctness of the program, we didn’t push for fixing it now.

For O0 we wanted to focus ourselves on generating correct code. Unless the regressions you are seeing are preventing debugging/running of the program, I wouldn’t block the flip of the switch on that.

What do you think?

I think FastISel tries to generate the best code it can no matter what. For GISel O0 however, not doing this optimization sounds sensible to me.
Now, I would say that the same remark as the previous bullet point apply: we shouldn’t do it unless it gets in the way of running/debugging the program.

  • FastISel doesn’t\ seem to handle functions with switch statements, so it falls back to DAGISel. DAGISel produces code that’s a lot better than GlobalISel for switch statement at -O0. I’m not sure if we need to do something here before enabling GlobalISel by default. I’m thinking we may need to add a smarter way to lower switch statements rather than just a cascaded sequence of conditional branches.

Sounds optimization-ish to me. Same remark.

Hi Quentin,

Hi Kristof,

I’ve been digging a little bit deeper into the biggest performance regressions I’ve observed.

What I’ve observed so far is:

  • A lot of the biggest regressions are caused by unnecessarily moving floating point values through general purpose registers. I’ve raised http://bugs.llvm.org/show_bug.cgi?id=32550 for this. I think this one definitely needs fixing before enabling GlobalISel by default at -O0.

I commented in the PR. This is a known problem and we have a solution. Given this is an optimization in the sense that it does not affect the correctness of the program, we didn’t push for fixing it now.

For O0 we wanted to focus ourselves on generating correct code. Unless the regressions you are seeing are preventing debugging/running of the program, I wouldn’t block the flip of the switch on that.

What do you think?

For O0, I think most users care about fast compile time and an excellent debug experience.
Next to that, I am told that some users also use -O0 to have a straightforward, simple, mapping between source code and the generated assembly.

Out of the issues I’ve seen so far, I think this is the single “optimization” issue that I feel gives a negative first impression of GlobalISel.
If users look at the generated assembly for floating point code it looks more like active de-optimization rather than “no optimization”.
My guess is also that this may be the biggest reason for the about 20% performance slow-down and 11% code size increase I measured earlier.
OTOH, I clearly agree this is an optimization issue, not a correctness issue.
Combining the above, I think it would be best if this issue got fixed before we turn on GlobalISel by default at -O0, or we may end up hearing quite a few (non-critical) complaints about this from users.
Basically I think this is a tradeoff between giving a better first impression of GlobalISel vs getting more people to use and test it earlier.

Thanks for the write-up on the PR, that is very useful.
Do you have any rough idea how much effort would be involved in getting this fixed? I got the impression Daniel is making good progress on the tablegen generation, which is key to getting this issue fixed?
I think that no matter whether we decide to switch the default before or after fixing this issue, this is one of the most urgent issues to fix as far as I can see.

If there is some way I can help contribute to fixing PR32550, I would like to help; but with the dependency on tablegen generation, I’m not sure what the best way is to help make that PR get fixed faster?

I think FastISel tries to generate the best code it can no matter what. For GISel O0 however, not doing this optimization sounds sensible to me.
Now, I would say that the same remark as the previous bullet point apply: we shouldn’t do it unless it gets in the way of running/debugging the program.

I agree that these optimizations should not be done at -O0. I think not doing them is actually an improvement: you give the user what they asked, i.e. “no optimization”, and an as-straigthforward-as-possible mapping from source to assembly.

  • FastISel doesn’t\ seem to handle functions with switch statements, so it falls back to DAGISel. DAGISel produces code that’s a lot better than GlobalISel for switch statement at -O0. I’m not sure if we need to do something here before enabling GlobalISel by default. I’m thinking we may need to add a smarter way to lower switch statements rather than just a cascaded sequence of conditional branches.

Sounds optimization-ish to me. Same remark.

Agreed, optimization-ish. In comparison to the above point on FastISel “peepholes”, I think that lowering switch statements to something else than a cascaded sequence of conditional branches doesn’t make the generated code harder to map to the source. So, in comparison to the above point on FastISel “peepholes”, I’m not actively against being smarter about lowering switch statements at -O0.
But I agree, this shouldn’t hold up turning on GlobalISel by default at -O0.

Other than the above remarks, before turning on GlobalISel by default, we’d better test/verify that debug info quality remains good.
I haven’t looked into that at all, but am hoping to start looking into that soon, with the help of the DIVA tool (https://github.com/SNSystems/DIVA) presented at last EuroLLVM (http://llvm.org/devmtg/2017-03//assets/slides/diva_debug_information_visual_analyzer.pdf). I don’t recall anyone so far making any statements about the quality of the debug info they’ve measured or experienced with GlobalISel enabled?

Other than the all of the above, I just wanted to mention that Oliver recently started running csmith on AArch64 GlobalISel and found one issue so far that already got fixed (https://bugs.llvm.org//show_bug.cgi?id=32733).
If he finds other correctness issues, I’m sure he’ll keep on reporting them.

Hi Kristof,

Hi Quentin,

Hi Kristof,

I’ve been digging a little bit deeper into the biggest performance regressions I’ve observed.

What I’ve observed so far is:

  • A lot of the biggest regressions are caused by unnecessarily moving floating point values through general purpose registers. I’ve raised http://bugs.llvm.org/show_bug.cgi?id=32550 for this. I think this one definitely needs fixing before enabling GlobalISel by default at -O0.

I commented in the PR. This is a known problem and we have a solution. Given this is an optimization in the sense that it does not affect the correctness of the program, we didn’t push for fixing it now.

For O0 we wanted to focus ourselves on generating correct code. Unless the regressions you are seeing are preventing debugging/running of the program, I wouldn’t block the flip of the switch on that.

What do you think?

For O0, I think most users care about fast compile time and an excellent debug experience.
Next to that, I am told that some users also use -O0 to have a straightforward, simple, mapping between source code and the generated assembly.

Out of the issues I’ve seen so far, I think this is the single “optimization” issue that I feel gives a negative first impression of GlobalISel.
If users look at the generated assembly for floating point code it looks more like active de-optimization rather than “no optimization”.
My guess is also that this may be the biggest reason for the about 20% performance slow-down and 11% code size increase I measured earlier.
OTOH, I clearly agree this is an optimization issue, not a correctness issue.
Combining the above, I think it would be best if this issue got fixed before we turn on GlobalISel by default at -O0, or we may end up hearing quite a few (non-critical) complaints about this from users.
Basically I think this is a tradeoff between giving a better first impression of GlobalISel vs getting more people to use and test it earlier.

Thanks for the write-up on the PR, that is very useful.
Do you have any rough idea how much effort would be involved in getting this fixed?

I’d say 2-3 weeks. Could actually be shorter if we don’t do all the refactoring I have in mind to make the table for the alternative mappings smaller, but I don’t think it is worth taking any shortcut here.

I got the impression Daniel is making good progress on the tablegen generation, which is key to getting this issue fixed?

To be accurate, Daniel’s work on table gen is for the select phase, not regbankselect. In other words, right now, all the table for mappings for regbankselect are hand written and Daniel’s work is not changing that. The (weak) rationale is that it is not on the critical path :).

I think that no matter whether we decide to switch the default before or after fixing this issue, this is one of the most urgent issues to fix as far as I can see.

Agree. This is one of the top items of my todo list.

If there is some way I can help contribute to fixing PR32550, I would like to help; but with the dependency on tablegen generation, I’m not sure what the best way is to help make that PR get fixed faster?

Thanks for offering to help. I agree that with the dependency on the tabelgen generation in the way, this is probably not a good use of your time. Depending when I can spare some time on this, I’ll either do it or explain the refactoring in the google doc.

In the meantime, I’d suggest to focus on validating the debug info on your side.

I think FastISel tries to generate the best code it can no matter what. For GISel O0 however, not doing this optimization sounds sensible to me.
Now, I would say that the same remark as the previous bullet point apply: we shouldn’t do it unless it gets in the way of running/debugging the program.

I agree that these optimizations should not be done at -O0. I think not doing them is actually an improvement: you give the user what they asked, i.e. “no optimization”, and an as-straigthforward-as-possible mapping from source to assembly.

  • FastISel doesn’t\ seem to handle functions with switch statements, so it falls back to DAGISel. DAGISel produces code that’s a lot better than GlobalISel for switch statement at -O0. I’m not sure if we need to do something here before enabling GlobalISel by default. I’m thinking we may need to add a smarter way to lower switch statements rather than just a cascaded sequence of conditional branches.

Sounds optimization-ish to me. Same remark.

Agreed, optimization-ish. In comparison to the above point on FastISel “peepholes”, I think that lowering switch statements to something else than a cascaded sequence of conditional branches doesn’t make the generated code harder to map to the source. So, in comparison to the above point on FastISel “peepholes”, I’m not actively against being smarter about lowering switch statements at -O0.
But I agree, this shouldn’t hold up turning on GlobalISel by default at -O0.

Other than the above remarks, before turning on GlobalISel by default, we’d better test/verify that debug info quality remains good.
I haven’t looked into that at all, but am hoping to start looking into that soon, with the help of the DIVA tool (https://github.com/SNSystems/DIVA) presented at last EuroLLVM (http://llvm.org/devmtg/2017-03//assets/slides/diva_debug_information_visual_analyzer.pdf). I don’t recall anyone so far making any statements about the quality of the debug info they’ve measured or experienced with GlobalISel enabled?

We ran the lldb test suite with GISel. IIRC, Tim has the details, GISel was on part with SDISel.

Other than the all of the above, I just wanted to mention that Oliver recently started running csmith on AArch64 GlobalISel and found one issue so far that already got fixed (https://bugs.llvm.org//show_bug.cgi?id=32733).
If he finds other correctness issues, I’m sure he’ll keep on reporting them.

Great!

Thanks,
-Quentin