[GlobalISel][AArch64] Toward flipping the switch for O0: Please give it a try!

Hi Kristof,

I made a fix for PR32550, that is much less involved in r302453.

In particular, it does not rely on the greedy mode of the regbankselect pass, thus there is no compile time implication.

Could you try it and check where we stand?

Thanks,
-Quentin

Great Quentin :).

I’ve rerun the benchmarks comparing “-O0 -g” with “-O0 -g -mllvm -global-isel -mllvm -global-isel-abort=2” on r302453, on AArch64 Cortex-A57.
I indeed see almost no moves between GPR and FPR registers anymore (see details below for where I still see some).
On geomean, I see 13% slow down (down from 17% on my previous run).
On geomean, code size increase is about 3% (down from 11% on my previous run).
I also checked debug info quality with the DIVA tool on the test-suite; and asked one of our DS-5 validation guys to test debug experience for GlobalISel in combination with the ARM DS-5 Development Studio debugger. Both of us didn’t find any issues with the debug info produced at -O0 -g with GlobalISel.
I didn’t check compile time, nor do I intend to check that as I don’t have a good infrastructure set up for that.

Overall, my impression is that GlobalISel is ready to be enabled by default for AArch64 at -O0, if others also believe it’s ready.

Thanks,

Kristof

Detailed observations:

  1. Debug info analysis with DIVA:
    I used DIVA options “-a --show-generated --show-level --show-location” to dump and diff debug info differences for the test-suite for “-O0 -g” comparing GlobalISel with non-GlobalISel.
    The only difference I found was on MultiSource/Benchmarks/tramp3d-v4:
    001 37577 {Function} “operator*” → “Zero”
    • No declaration
    • Template
      002 {TemplateParameter} “T” ← “double”
    • 002 37577 {Parameter} ← “Zero”
      002 37577 {Parameter} ← “const double &”
      Which indicates that for non-GlobalISel, the Zero parameter isn’t present in the debug info for the instantiation of the below template function with T=double:
      template
      inline Zero operator*(Zero, const T&) { return Zero(); }
      In other words, the debug info looks ever so slightly better with GlobalISel.
  2. Quick analysis of reasons for slow down for the 10 programs that regress the most when enabling GlobalISel at -O0:
    • SingleSource/Benchmarks/BenchmarkGame/nsieve-bits (114% slowdown): no conversion of divide by power-of-2 to shift right. I think this is an improvement for -O0: no such optimization should be done when not requesting optimizations.
    • MultiSource/Benchmarks/MallocBench/gs/gs (88%): “interp” function: switch statement lowered as cascaded-sequence-of-conditional-branches.
      Same issue causes MultiSource/Applications/sqlite3/sqlite3 (71%).
      Same issue causes MultiSource/Applications/lua/lua (46%).
    • SingleSource/Benchmarks/Misc/flops-2 (75%): Poor lowering of fneg:
      • FastISel:
        ldur d0, [x29,#-16]
        fneg d0, d0
        stur d0, [x29,#-16]
      • GlobalISel:
        ldur d0, [x29,#-64]
        orr x8, xzr, #0x8000000000000000
        fmov d1, x8
        fsub d0, d1, d0
        fmov x8, d0
        stur x8, [x29,#-64]
    • MultiSource/Benchmarks/Prolangs-C++/city/city (74%): a call to memcpy for copying 4 bytes is present with GlobalISel that isn’t present with FastISel, in function vehicle::select_move().
      Same issue causes SingleSource/Benchmarks/Shootout-C++/Shootout-C+±moments (58.5%)
    • MultiSource/Applications/siod/siod (72%): Seems to be mainly due to loading constants in the entry BB, but I’m not sure that indeed is the biggest cause. (https://bugs.llvm.org//show_bug.cgi?id=32561)
    • MultiSource/Benchmarks/MiBench/consumer-typeset/consumer-typeset (48%): Due to creating all constants in the entry BB, see function CopyObject. (https://bugs.llvm.org//show_bug.cgi?id=32561)
    • MultiSource/Benchmarks/mediabench/mpeg2/mpeg2dec/mpeg2decode (46%): Function Reference_IDCT: Probably due to creating all constants in the entry BB + spilling floating point data through an X register:
      • FastISel:
        fadd d0, d1, d0
        str d0, [sp,#528]
      • GlobalISel:
        fadd d0, d1, d0
        fmov x9, d0
        stur x9, [x29,#-48]
  3. Other overall impression when comparing code generation: The code size increase is probably mainly due to creating constants always in the entry BB of the function. For functions with many constants, this leads to lots and lots of constant creation and then immediately spilling it to the stack. (https://bugs.llvm.org//show_bug.cgi?id=32561)

Hi Kristof,

Great Quentin :).

I’ve rerun the benchmarks comparing “-O0 -g” with “-O0 -g -mllvm -global-isel -mllvm -global-isel-abort=2” on r302453, on AArch64 Cortex-A57.
I indeed see almost no moves between GPR and FPR registers anymore (see details below for where I still see some).
On geomean, I see 13% slow down (down from 17% on my previous run).
On geomean, code size increase is about 3% (down from 11% on my previous run).
I also checked debug info quality with the DIVA tool on the test-suite; and asked one of our DS-5 validation guys to test debug experience for GlobalISel in combination with the ARM DS-5 Development Studio debugger. Both of us didn’t find any issues with the debug info produced at -O0 -g with GlobalISel.
I didn’t check compile time, nor do I intend to check that as I don’t have a good infrastructure set up for that.

Overall, my impression is that GlobalISel is ready to be enabled by default for AArch64 at -O0, if others also believe it’s ready.

Sounds good to me!

Renato, Diana, how does that sound?

Thanks,

Kristof

Detailed observations:

  1. Debug info analysis with DIVA:
    I used DIVA options “-a --show-generated --show-level --show-location” to dump and diff debug info differences for the test-suite for “-O0 -g” comparing GlobalISel with non-GlobalISel.
    The only difference I found was on MultiSource/Benchmarks/tramp3d-v4:
    001 37577 {Function} “operator*” → “Zero”
    • No declaration
    • Template
      002 {TemplateParameter} “T” ← “double”
    • 002 37577 {Parameter} ← “Zero”
      002 37577 {Parameter} ← “const double &”
      Which indicates that for non-GlobalISel, the Zero parameter isn’t present in the debug info for the instantiation of the below template function with T=double:
      template
      inline Zero operator*(Zero, const T&) { return Zero(); }
      In other words, the debug info looks ever so slightly better with GlobalISel.
  2. Quick analysis of reasons for slow down for the 10 programs that regress the most when enabling GlobalISel at -O0:
    • SingleSource/Benchmarks/BenchmarkGame/nsieve-bits (114% slowdown): no conversion of divide by power-of-2 to shift right. I think this is an improvement for -O0: no such optimization should be done when not requesting optimizations.
    • MultiSource/Benchmarks/MallocBench/gs/gs (88%): “interp” function: switch statement lowered as cascaded-sequence-of-conditional-branches.
      Same issue causes MultiSource/Applications/sqlite3/sqlite3 (71%).
      Same issue causes MultiSource/Applications/lua/lua (46%).
    • SingleSource/Benchmarks/Misc/flops-2 (75%): Poor lowering of fneg:
      • FastISel:
        ldur d0, [x29,#-16]
        fneg d0, d0
        stur d0, [x29,#-16]
      • GlobalISel:
        ldur d0, [x29,#-64]
        orr x8, xzr, #0x8000000000000000
        fmov d1, x8
        fsub d0, d1, d0
        fmov x8, d0
        stur x8, [x29,#-64]
    • MultiSource/Benchmarks/Prolangs-C++/city/city (74%): a call to memcpy for copying 4 bytes is present with GlobalISel that isn’t present with FastISel, in function vehicle::select_move().
      Same issue causes SingleSource/Benchmarks/Shootout-C++/Shootout-C+±moments (58.5%)
    • MultiSource/Applications/siod/siod (72%): Seems to be mainly due to loading constants in the entry BB, but I’m not sure that indeed is the biggest cause. (https://bugs.llvm.org//show_bug.cgi?id=32561)
    • MultiSource/Benchmarks/MiBench/consumer-typeset/consumer-typeset (48%): Due to creating all constants in the entry BB, see function CopyObject. (https://bugs.llvm.org//show_bug.cgi?id=32561)
    • MultiSource/Benchmarks/mediabench/mpeg2/mpeg2dec/mpeg2decode (46%): Function Reference_IDCT: Probably due to creating all constants in the entry BB + spilling floating point data through an X register:
      • FastISel:
        fadd d0, d1, d0
        str d0, [sp,#528]
      • GlobalISel:
        fadd d0, d1, d0
        fmov x9, d0
        stur x9, [x29,#-48]

Good finding, I forgot to do stores in my previous fix. I’ll do them shortly.

  1. Other overall impression when comparing code generation: The code size increase is probably mainly due to creating constants always in the entry BB of the function. For functions with many constants, this leads to lots and lots of constant creation and then immediately spilling it to the stack. (https://bugs.llvm.org//show_bug.cgi?id=32561)

About a month (r299283), I mistakenly committed a prototype I was working on to solve this kind of issue. I reverted it in r299287, but you can give it a try.

Basically, that’s an additional generic pass that runs only at O0. It localizes (as in basic block local) the definition of the constants to workaround the limitations of O0 regalloc.

Cheers,
-Quentin

Hi Quentin,

Hi Kristof,

Great Quentin :).

I’ve rerun the benchmarks comparing “-O0 -g” with “-O0 -g -mllvm -global-isel -mllvm -global-isel-abort=2” on r302453, on AArch64 Cortex-A57.
I indeed see almost no moves between GPR and FPR registers anymore (see details below for where I still see some).
On geomean, I see 13% slow down (down from 17% on my previous run).
On geomean, code size increase is about 3% (down from 11% on my previous run).
I also checked debug info quality with the DIVA tool on the test-suite; and asked one of our DS-5 validation guys to test debug experience for GlobalISel in combination with the ARM DS-5 Development Studio debugger. Both of us didn’t find any issues with the debug info produced at -O0 -g with GlobalISel.
I didn’t check compile time, nor do I intend to check that as I don’t have a good infrastructure set up for that.

Overall, my impression is that GlobalISel is ready to be enabled by default for AArch64 at -O0, if others also believe it’s ready.

Sounds good to me!

Renato, Diana, how does that sound?

I’d definitely like to give this a run through on my side as well before you flip the switch. I’ll try to do that this week.

-eric

Hi Eric,

Hi Quentin,

Hi Kristof,

Great Quentin :).

I’ve rerun the benchmarks comparing “-O0 -g” with “-O0 -g -mllvm -global-isel -mllvm -global-isel-abort=2” on r302453, on AArch64 Cortex-A57.
I indeed see almost no moves between GPR and FPR registers anymore (see details below for where I still see some).
On geomean, I see 13% slow down (down from 17% on my previous run).
On geomean, code size increase is about 3% (down from 11% on my previous run).
I also checked debug info quality with the DIVA tool on the test-suite; and asked one of our DS-5 validation guys to test debug experience for GlobalISel in combination with the ARM DS-5 Development Studio debugger. Both of us didn’t find any issues with the debug info produced at -O0 -g with GlobalISel.
I didn’t check compile time, nor do I intend to check that as I don’t have a good infrastructure set up for that.

Overall, my impression is that GlobalISel is ready to be enabled by default for AArch64 at -O0, if others also believe it’s ready.

Sounds good to me!

Renato, Diana, how does that sound?

I’d definitely like to give this a run through on my side as well before you flip the switch. I’ll try to do that this week.

Thanks for doing this. Waiting for your inputs then.

Cheers,
-Quentin

Hi Quentin,

I'll defer to Diana, as she was looking into this. If she's happy, I'm happy. :slight_smile:

cheers,
--renato

Hi Kristof,

Great Quentin :).

I’ve rerun the benchmarks comparing “-O0 -g” with “-O0 -g -mllvm -global-isel -mllvm -global-isel-abort=2” on r302453, on AArch64 Cortex-A57.
I indeed see almost no moves between GPR and FPR registers anymore (see details below for where I still see some).
On geomean, I see 13% slow down (down from 17% on my previous run).
On geomean, code size increase is about 3% (down from 11% on my previous run).
I also checked debug info quality with the DIVA tool on the test-suite; and asked one of our DS-5 validation guys to test debug experience for GlobalISel in combination with the ARM DS-5 Development Studio debugger. Both of us didn’t find any issues with the debug info produced at -O0 -g with GlobalISel.
I didn’t check compile time, nor do I intend to check that as I don’t have a good infrastructure set up for that.

Overall, my impression is that GlobalISel is ready to be enabled by default for AArch64 at -O0, if others also believe it’s ready.

Sounds good to me!

Renato, Diana, how does that sound?

Thanks,

Kristof

Detailed observations:

  1. Debug info analysis with DIVA:
    I used DIVA options “-a --show-generated --show-level --show-location” to dump and diff debug info differences for the test-suite for “-O0 -g” comparing GlobalISel with non-GlobalISel.
    The only difference I found was on MultiSource/Benchmarks/tramp3d-v4:
    001 37577 {Function} “operator*” → “Zero”
    • No declaration
    • Template
      002 {TemplateParameter} “T” ← “double”
    • 002 37577 {Parameter} ← “Zero”
      002 37577 {Parameter} ← “const double &”
      Which indicates that for non-GlobalISel, the Zero parameter isn’t present in the debug info for the instantiation of the below template function with T=double:
      template
      inline Zero operator*(Zero, const T&) { return Zero(); }
      In other words, the debug info looks ever so slightly better with GlobalISel.
  2. Quick analysis of reasons for slow down for the 10 programs that regress the most when enabling GlobalISel at -O0:
    • SingleSource/Benchmarks/BenchmarkGame/nsieve-bits (114% slowdown): no conversion of divide by power-of-2 to shift right. I think this is an improvement for -O0: no such optimization should be done when not requesting optimizations.
    • MultiSource/Benchmarks/MallocBench/gs/gs (88%): “interp” function: switch statement lowered as cascaded-sequence-of-conditional-branches.
      Same issue causes MultiSource/Applications/sqlite3/sqlite3 (71%).
      Same issue causes MultiSource/Applications/lua/lua (46%).
    • SingleSource/Benchmarks/Misc/flops-2 (75%): Poor lowering of fneg:
      • FastISel:
        ldur d0, [x29,#-16]
        fneg d0, d0
        stur d0, [x29,#-16]
      • GlobalISel:
        ldur d0, [x29,#-64]
        orr x8, xzr, #0x8000000000000000
        fmov d1, x8
        fsub d0, d1, d0
        fmov x8, d0
        stur x8, [x29,#-64]
    • MultiSource/Benchmarks/Prolangs-C++/city/city (74%): a call to memcpy for copying 4 bytes is present with GlobalISel that isn’t present with FastISel, in function vehicle::select_move().
      Same issue causes SingleSource/Benchmarks/Shootout-C++/Shootout-C+±moments (58.5%)
    • MultiSource/Applications/siod/siod (72%): Seems to be mainly due to loading constants in the entry BB, but I’m not sure that indeed is the biggest cause. (https://bugs.llvm.org//show_bug.cgi?id=32561)
    • MultiSource/Benchmarks/MiBench/consumer-typeset/consumer-typeset (48%): Due to creating all constants in the entry BB, see function CopyObject. (https://bugs.llvm.org//show_bug.cgi?id=32561)
    • MultiSource/Benchmarks/mediabench/mpeg2/mpeg2dec/mpeg2decode (46%): Function Reference_IDCT: Probably due to creating all constants in the entry BB + spilling floating point data through an X register:
      • FastISel:
        fadd d0, d1, d0
        str d0, [sp,#528]
      • GlobalISel:
        fadd d0, d1, d0
        fmov x9, d0
        stur x9, [x29,#-48]

Good finding, I forgot to do stores in my previous fix. I’ll do them shortly.

Should be fixed by r302679

Thanks Quentin,

That reduces the slow-down when enabling globalisel at -O0 from 13% (on r302453) to 9.5% (on r302679) in my experiments.
The code size increase also reduces from just over 3% to 2.8%.

Kristof

Hi all,

I'm still running some validation on this, I'll send an email when
it's done. If that goes well I don't have anything against making the
switch.

For the record, here's a summary of issues that were deferred for
later on (some of which are optimization-ish and we might decide to
never do at -O0 at all):
* Crash in RegBankSelect for half fp types:
https://bugs.llvm.org/show_bug.cgi?id=32560
* Improving constant placement: http://bugs.llvm.org/show_bug.cgi?id=32561
* Fancy switch lowering
* Transforming division-by-constant-power-of-2 into right shift

AFAICT all the other issues that were brought up were fixed (yay!).

Cheers,
Diana

Hi Diana,

Thanks for the summary.

Hi all,

I’m still running some validation on this, I’ll send an email when
it’s done. If that goes well I don’t have anything against making the
switch.

For the record, here’s a summary of issues that were deferred for
later on (some of which are optimization-ish and we might decide to
never do at -O0 at all):

I’ll have a look.

I’ve commented in the PR to mention the localizer technic I was playing with, if someone wants to give it a try.

  • Fancy switch lowering
  • Transforming division-by-constant-power-of-2 into right shift

AFAICT all the other issues that were brought up were fixed (yay!).

Cheers,
Diana

Cheers,
-Quentin

Hi,

I ran into a little snag on the test-suite:
https://bugs.llvm.org/show_bug.cgi?id=33021
It boils down to GlobalISel generating calls to fabs instead of using
FABSDr (so we get undefined references).

Cheers,
Diana

It turns out that can be fixed by adding -lm to the link line, so I
will probably convert it into a test-suite bug.

I don't suppose it's crucial to handle the fabs intrinsic nicely at -O0.

Agreed. That was probably just luck before :slight_smile:

-eric

Got another one: https://bugs.llvm.org/show_bug.cgi?id=33036

I'm still investigating whether this is an actual GlobalISel issue or
something else (I'll also start a build on a more recent revision to
see how that behaves).

Turns out it really is a GlobalISel issue - we eat up too much stack
space because all the constants used in the switches are stored on the
stack. We need to fix this somehow before changing the default. I'll
try to give it a run with Quentin's r299283 on top to see if it helps.

Cheers,
Diana

FWIW, I ran Quentin’s r299283 Localizer patch (see also http://bugs.llvm.org/show_bug.cgi?id=32561) on my benchmark set.
This is taking the commit in r299283 + adding the pass to the pipeline right after RegBankSelect.

It reduces the slow-down when enabling globalisel at -O0 from 9.5% (on r302453) to 6.3% (on r302679) in my experiments.
The code size increase also reduces from just over 2.8% to 1.8%.

I haven’t measured the impact on compile-time.

I think those are nice improvements; but I wouldn’t hold up enabling-by-default for those improvements.
However, the increase in stack size usage being so huge that a bootstrap fails seems like something that should be addressed before enabling-by-default.
I think Diana found that when enabling r299283, the bootstrap failed with llvm-tblgen segfaulting.
So there clearly is some work required there.

Thanks,

Kristof

I think Diana found that when enabling r299283, the bootstrap failed with
llvm-tblgen segfaulting.
So there clearly is some work required there.

Indeed.

@Quentin, what is the status of that patch? Have you been working on
it since then? Should I investigate the segfault more and send you a
reproducer? Is this patch the way forward, or do you have other ideas
for reducing the stack usage?

Thanks,
Diana

Hi Diana,

I think Diana found that when enabling r299283, the bootstrap failed with
llvm-tblgen segfaulting.
So there clearly is some work required there.

Indeed.

@Quentin, what is the status of that patch?

Initially it was meant as a prototype to investigate how we could address those issues at O0. I didn’t mean to publish it.

Have you been working on
it since then?

No, I haven’t touched it and I honestly didn’t plan to do that.

Should I investigate the segfault more and send you a
reproducer?

Depends, do you guys want to pursue in that direction?
My though was that it is probably best to rely on a better reg allocator (basic or greedy) scheme for O0. I believe the only concern may be compile time but it may just fly (+ I have ideas how to make it better pretty easy).

Is this patch the way forward, or do you have other ideas
for reducing the stack usage?

Better reg alloc scheme for O0 :).

Hi Diana,

I think Diana found that when enabling r299283, the bootstrap failed with
llvm-tblgen segfaulting.
So there clearly is some work required there.

Indeed.

@Quentin, what is the status of that patch?

Initially it was meant as a prototype to investigate how we could address those issues at O0. I didn’t mean to publish it.

Have you been working on
it since then?

No, I haven’t touched it and I honestly didn’t plan to do that.

Should I investigate the segfault more and send you a
reproducer?

Depends, do you guys want to pursue in that direction?

Not necessarily, if you think a better reg alloc scheme will do the
trick then I see no point in complicating the pass pipeline for now.

My though was that it is probably best to rely on a better reg allocator (basic or greedy) scheme for O0. I believe the only concern may be compile time but it may just fly (+ I have ideas how to make it better pretty easy).

Is this patch the way forward, or do you have other ideas
for reducing the stack usage?

Better reg alloc scheme for O0 :).

Ok, let's go with that then :slight_smile:

Hi Diana,

Hi Diana,

I think Diana found that when enabling r299283, the bootstrap failed with
llvm-tblgen segfaulting.
So there clearly is some work required there.

Indeed.

@Quentin, what is the status of that patch?

Initially it was meant as a prototype to investigate how we could address those issues at O0. I didn’t mean to publish it.

Have you been working on
it since then?

No, I haven’t touched it and I honestly didn’t plan to do that.

Should I investigate the segfault more and send you a
reproducer?

Depends, do you guys want to pursue in that direction?

Not necessarily, if you think a better reg alloc scheme will do the
trick then I see no point in complicating the pass pipeline for now.

My though was that it is probably best to rely on a better reg allocator (basic or greedy) scheme for O0. I believe the only concern may be compile time but it may just fly (+ I have ideas how to make it better pretty easy).

Is this patch the way forward, or do you have other ideas
for reducing the stack usage?

Better reg alloc scheme for O0 :).

Ok, let’s go with that then :slight_smile:

Sounds good, let me push a patch that would allow to use the greedy allocator at O0 and see if that’s sufficient. Then, we will look at the compile time that change will imply.

Cheers,
-Quentin