[RFC] AArch64: Should we disable GlobalMerge?

Hi all,

I've started looking at the GlobalMerge pass, enabled by default on
ARM and AArch64. I think we should reconsider that, at least for
AArch64.

As is, the pass simply merges all globals together, in groups of up
to 4KB on AArch64 (128B on ARM).
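
To make the grouping concrete, here is a minimal sketch in Python of greedy packing under a size limit. It is only an illustration of the idea, not the pass's actual code: the function name, the input format, and the absence of any alignment or sorting logic are all assumptions.

```python
# Hypothetical sketch: pack globals into merged groups bounded by a size limit.
# MAX_OFFSET mirrors the 4KB AArch64 figure quoted above; alignment and
# ordering heuristics are deliberately ignored (simplifying assumption).
MAX_OFFSET = 4096


def pack_globals(globals_with_size, max_offset=MAX_OFFSET):
    """Return a list of groups; each group maps a global's name to its byte offset."""
    groups = []
    current, offset = {}, 0
    for name, size in globals_with_size:
        # Start a new group when this global would not fit in the current one.
        if current and offset + size > max_offset:
            groups.append(current)
            current, offset = {}, 0
        current[name] = offset
        offset += size
    if current:
        groups.append(current)
    return groups


# Hypothetical globals: (name, size in bytes).
layout = pack_globals([("a", 8), ("b", 4000), ("c", 100), ("d", 8)])
```

Here `a` and `b` end up in one group, `c` and `d` in another, regardless of which functions actually use them together. That indifference to usage is the problem discussed below.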

At the time it was enabled, the general thinking was "it's almost
free, it doesn't affect performance much, we might as well use it".
Now, it's preventing some link-time optimizations (as acknowledged in
one of the FIXMEs).

-- Performance impact
Overall, it isn't that profitable on the test-suite, and actually
degrades performance on a lot of other - "non-benchmark" - projects I
tried (where the main reason to use a global is file- or function-
static variables, only accessed through a single getter function).

Across several runs on the entire test-suite, when disabling the pass,
I measured:
without LTO, a -0.19% geomean improvement
with LTO, a +0.11% geomean regression.

As for just SPEC2006, there are two big regressions: 400.perlbench
(10.6% w/ LTO, 2.7% w/o) and 471.omnetpp (2.3% w/, 3.9% w/o).

Numbers are attached.

-- A way forward
One obvious way to improve it is: look at uses of globals, and try to
form sets of globals commonly used together. The tricky part is to
define heuristics for "commonly". Also, the pass then becomes much
more expensive. I'm currently looking into improving it, and will
report if I come up with a good solution. But this shouldn't stop us
from disabling it, for now.

Also, the pass seems like a good candidate for
-O3/CodeGenOpt::Aggressive. However, the latter is implied by LTO,
which IMO shouldn't include these not-always-profitable optimizations.
That's another problem though.

Right now, I think we should disable the pass by default, until it's
deemed profitable enough.

-Ahmed

With the numbers!
-Ahmed

disable_globalmerge_aarch64_LTO.txt (24.9 KB)

disable_globalmerge_aarch64.txt (27.8 KB)

Hi Ahmed,

Did you run these experiments on a platform with a linker that makes
use of the AArch64CollectLOH-pass-produced information?
I'm guessing that the AArch64CollectLOH-pass information and a linker
that makes use of that information could affect the profitability of
the GlobalMerge pass?

Thanks,

Kristof

Hi Ahmed,

Before "moving forward", it would be good to understand what in
GlobalMerge is impacting what in LTO.

With LTO becoming more important nowadays, I agree we have to balance
the compiler optimisations to work well with it, but by turning things
off we might be impacting unknown code in an unknown way.

We'll never know how unknown code behaves, but if at least we
understand what of GM affects what of LTO, then people using unknown
code will have a more informed view on what to disable, when.

cheers,
--renato

Hi Ahmed,

Yes. I share Kristof’s and Renato’s concerns: the impact of, and dependence on, link-time tools should be clarified before disabling this pass.

On the other hand, tests on our hardware show that disabling this pass, without LTO, causes big regressions on some SPEC benchmarks (positive is bad):

spec.cpu2000.ref.253_perlbmk 3.27%
spec.cpu2000.ref.254_gap 3.18%

although I do see some improvements, like those below (negative is good):

spec.cpu2006.ref.400_perlbench -1.90%
spec.cpu2006.ref.471_omnetpp -1.64%
spec.cpu2006.ref.482_sphinx3 -1.03%

Thanks,
-Jiangning

Hi Kristof,

Our tests are on iOS, which definitely uses the LOH optimizations for ARM64.

-Jim

Hi Ahmed,

> Did you run these experiments on a platform with a linker that makes
> use of the AArch64CollectLOH-pass-produced information?

As Jim says, I'm on iOS, so yes. However, I'm mostly running tests
with the pass disabled.

> I'm guessing that the AArch64CollectLOH-pass information and a linker
> that makes use of that information could affect the profitability of
> the GlobalMerge pass?

It could, and does, from what I've seen (beware anecdata):
- reusing the adrp base prevents optimizing it (the various
Adrp*{ldr,str} LOHs).
- reusing the adrp+add MergedGlobal pointer, with indexed addressing,
doesn't prevent the AdrpAdd optimization.

All in all, whether GlobalMerge is profitable or not (by increasing
register pressure, or adding another indirection), whenever the LOH
optimizations fire, they reduce its usefulness.

AFAICT, the only case where LOHs help GlobalMerge is when the
MergedGlobal base is closer to the adrp sequence than the actual
global. Given that we only merge 4k of globals, on a 1MB range this
doesn't happen very often.

Which brings us to my fallback proposal: what about disabling the
pass on darwin only? Various darwin-enabled features (e.g., LOHs)
help mitigate the adrp problem, and global usage is usually frowned
upon in those circles (except for singletons, class-/function-statics
and whatnot, which I'm trying to address in an upcoming patch).

As for other targets, as a first step, making the pass run under -O3
rather than -O1 is hopefully agreeable to everyone? After all, it is
"aggressive", and isn't always profitable. That's pretty much the
description of -O3.
We can still run into problematic cases under LTO, though.

-Ahmed

Hi Ahmed,

Which brings us to my fallback proposal: what about disabling the
pass on darwin only? Various darwin-enabled features (e.g., LOHs)
help mitigate the adrp problem, and global usage is usually frowned
upon in those circles (except for singletons, class-/function-statics
and whatnot, which I’m trying to address in an upcoming patch).

Before disabling it on darwin only, I’d like to see some analysis of the regressions/improvements. Has anyone looked at the code for those yet?

As for other targets, as a first step, making the pass run under -O3
rather than -O1 is hopefully agreeable to everyone? After all, it is
“aggressive”, and isn’t always profitable. That’s pretty much the
description of -O3.
We can still run into problematic cases under LTO, though.

Seems reasonable to me, but we probably want to see what happens with the above questions first.

-eric

> Which brings us to my fallback proposal: what about disabling the
> pass on darwin only?

That's a decision for Jim/Evan. I'm ok if they are.

> As for other targets, as a first step, making the pass run under -O3
> rather than -O1 is hopefully agreeable to everyone?

Sounds reasonable.

Even though it conflicts with LTO, that's what O3 means, as you said,
instability. People at O3 might want to fiddle with the passes
(on/off) to get the best performance for their own code/workload.

cheers,
--renato

Hi Ahmed,

Before "moving forward", it would be good to understand what in
GlobalMerge is impacting what in LTO.

With LTO becoming more important nowadays, I agree we have to balance
the compiler optimisations to work well with it, but by turning things
off we might be impacting unknown code in an unknown way.

We'll never know how unknown code behaves, but if at least we
understand what of GM affects what of LTO, then people using unknown
code will have a more informed view on what to disable, when.

Fair enough. First, a couple things to note:
- GlobalMerge runs as a pre-ISel pass, so very late in the mid-level pipeline.
- GlobalMerge (by default) only looks at internal globals.

Internal globals come up with file- or function- static variables. In
LTO, all module-level globals are internalized, and are eligible for
merging.

So, we can generally group global usage into a few categories:
- a function that uses a local static variable (say, llvm::outs())
- a function that uses several globals at once. For instance,
400.perlbench's interpreter has a bunch of those, as does its
parser/lexer.
- a set of functions that share a few common globals (say, an inlined
reference to a function-local static variable), but otherwise each use
several other globals (again, perl's interpreter).

GlobalMerge is only ever a win if we are able to share base pointers.
This requires:
- several globals being referenced
- the references being close enough (otherwise we'll just
rematerialize the base, or worse, increase register pressure)

There is one obvious special case for the first requirement: if a
global is only ever used alone, there's no point in merging it
anywhere. (this is improvement #1).
Once we can determine the set of used globals for each function, we
can try to merge those sets only. (#2)
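
As a rough illustration of #1 and #2, the sketch below computes candidate sets from per-function usage. It is a toy model in Python, not code from the pass: the input format (a map from function name to the set of globals it references) and all names are hypothetical.

```python
# Hypothetical sketch of improvements #1 and #2.
# `uses` maps each function to the set of globals it references; in the real
# pass this would come from walking the users of each GlobalVariable.


def merge_candidates(uses):
    """Return (candidate_sets, never_merged).

    candidate_sets: per-function used sets with at least two globals (#2).
    never_merged:   globals that are only ever used alone (#1).
    """
    candidates = set()
    for globals_used in uses.values():
        if len(globals_used) >= 2:
            candidates.add(frozenset(globals_used))
    used_together = set().union(*candidates)
    all_globals = set().union(*uses.values())
    return candidates, all_globals - used_together


# Hypothetical program: a getter around a singleton, plus two functions
# sharing one global.
uses = {
    "get_stream": {"stream"},
    "interp": {"sp", "op"},
    "lexer": {"bufptr", "op"},
}
candidates, never_merged = merge_candidates(uses)
```

With this input, `stream` is left unmerged, and the two candidate sets are `{sp, op}` and `{bufptr, op}`. Note that the sets overlap on `op`; deciding what to do with such overlaps is where the heuristics get harder.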

We can try to better handle the second requirement, by having some
more precise metric for distance between uses. One trivially
available such metric is grouping used sets by parent basic-block
rather than function (#3).

Experimentally, #1 catches a lot of the singleton-ish globals out
there, which is the majority in some of the more "modern" code I've
looked at. It leaves the legitimate merging in perl alone.

#2 (and even more so #3) is actually too aggressive, and doesn't catch
a lot/most of the profitable cases in perl. Consider:
- a "g_log" global (or, say, LLVM's outs/dbgs/errs), used pretty much everywhere
- several sets of globals, used in different parts of the program
(perl's interpreter vs parser)

You'd pick one of the latter sets, and add the "g_log" global to it.
Now you made it more expensive everywhere you use "g_log", without the
benefit of base sharing in all the other functions.

So you need to be smart when picking the sets. You can combine some
of them, using some cost metric. (#4) This is where it gets
complicated.
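
To illustrate the "g_log" problem, here is a toy cost metric in Python. It is an assumption-laden sketch, not a proposed heuristic: it only counts address-forming instructions per function, assumes unmerged accesses each need their own adrp (no adrp CSE), assumes a merged base costs one adrp plus one add, and ignores register pressure and the distance between uses. All names are hypothetical.

```python
# Hypothetical toy cost model for merging one set of globals.
# Per function using n >= 1 globals of the set:
#   unmerged: n adrp               (one per access, no CSE assumed)
#   merged:   1 adrp + 1 add       (shared MergedGlobal base)
# so the instruction delta is 2 - n: +1 for a lone use, 0 for two, a win beyond.


def merge_cost(merged_set, uses):
    """Net instruction delta of merging `merged_set` (negative is a win)."""
    delta = 0
    for globals_used in uses.values():
        hits = len(merged_set & globals_used)
        if hits >= 1:
            delta += 2 - hits
    return delta


# Hypothetical program: g_log is used everywhere, a/b/c only in the interpreter.
uses = {
    "log_helper_1": {"g_log"},
    "log_helper_2": {"g_log"},
    "interp": {"g_log", "a", "b", "c"},
    "parser": {"g_log", "x", "y"},
}
cost_without_log = merge_cost({"a", "b", "c"}, uses)
cost_with_log = merge_cost({"a", "b", "c", "g_log"}, uses)
```

Under these assumptions, merging `{a, b, c}` alone is a small win, while adding `g_log` to that set turns it into a net loss, because every function that only touches `g_log` now pays for the extra add.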

I'll try measuring some of those, see what happens on benchmarks.
Again, that shouldn't stop us from enabling GlobalMerge less often.
Hopefully it's clear that the pass isn't always a win, so -O3 should
be OK. I'm less comfortable with disabling it on Darwin only, but
that seems like the obvious next step.

Thanks for the feedback!

-Ahmed

Hi Ahmed,

> Yes. I'd share with Kristof and Renato's concerns, and the impact/dependence
> upon link-time tool should be clarified before disabling this pass.
>
> On the other hand, actually the test on our hardware shows disabling this
> pass without LTO considered, some spec benchmarks would have big
> regressions, (positive is bad)
>
> spec.cpu2000.ref.253_perlbmk 3.27%
> spec.cpu2000.ref.254_gap 3.18%
>
> although I do see some improvements like below, (negative is good)
>
> spec.cpu2006.ref.400_perlbench -1.90%
> spec.cpu2006.ref.471_omnetpp -1.64%
> spec.cpu2006.ref.482_sphinx3 -1.03%

Interesting! Can you share geomean SPEC2006/2000 numbers, perhaps?

-Ahmed

Which brings us to my fallback proposal: what about disabling the
pass on darwin only? Various darwin-enabled features (e.g., LOHs)
help mitigate the adrp problem, and global usage is usually frowned
upon in those circles (except for singletons, class-/function-statics
and whatnot, which I'm trying to address in an upcoming patch).

Before making the disabling darwin only I'd like to see some analysis of the
regressions/improvements. Has anyone looked at the code for those yet?

Yep, I put a quick analysis in my other reply.

As for other targets, as a first step, making the pass run under -O3
rather than -O1 is hopefully agreeable to everyone? After all, it is
"aggressive", and isn't always profitable. That's pretty much the
description of -O3.
We can still run into problematic cases under LTO, though.

Seems reasonable to me, but probably want to see what happens with the above
questions first.

Fair enough. Bottom line is:
- disabling it without LTO is a slight win on the test-suite, a solid
win everywhere else I've looked.
- disabling it with LTO regresses quite a few SPEC benchmarks, and is
overall a slight regression on the test-suite.

-Ahmed

Which brings us to my fallback proposal: what about disabling the
pass on darwin only?

That's a decision for Jim/Evan. I'm ok if they are.

Jim, thoughts?

As for other targets, as a first step, making the pass run under -O3
rather than -O1 is hopefully agreeable to everyone?

Sounds reasonable.

Great!

Even though it conflicts with LTO, that's what O3 means, as you said,
instability. People at O3 might want to fiddle with the passes
(on/off) to get the best performance for their own code/workload.

By the way, I'm not convinced LTO being either -O3 or -O0 is sensible.
But that's a discussion for another day =)

-Ahmed

Hi Ahmed,

Which brings us to my fallback proposal: what about disabling the
pass on darwin only? Various darwin-enabled features (e.g., LOHs)
help mitigate the adrp problem, and global usage is usually frowned
upon in those circles (except for singletons, class-/function-statics
and whatnot, which I’m trying to address in an upcoming patch).

Before making the disabling darwin only I’d like to see some analysis of the
regressions/improvements. Has anyone looked at the code for those yet?

Yep, I put a quick analysis in my other reply.

The LOH/ADRP bit?

As for other targets, as a first step, making the pass run under -O3
rather than -O1 is hopefully agreeable to everyone? After all, it is
“aggressive”, and isn’t always profitable. That’s pretty much the
description of -O3.
We can still run into problematic cases under LTO, though.

Seems reasonable to me, but probably want to see what happens with the above
questions first.

Fair enough. Bottom line is:

  • disabling it without LTO is a slight win on the test-suite, a solid
    win everywhere else I’ve looked.
  • disabling it with LTO regresses quite a few SPEC benchmarks, and is
    overall a slight regression on the test-suite.

Ah, I meant an analysis of the code, not just the numbers. I think the ADRP/LOH commentary really helps. It might only be a decent LTOish optimization, but I’m still curious how it’s helping there over other optimizations.

Anyhow, FWIW I’m in favor of pulling it out of the non-LTO pipeline universally.

-eric

Duncan tells me there is a plan to put -mno-global-merge into module
flags for this precise reason, so this would disable it for LTO as
well, when -O3 wasn't specified. This takes care of our non-O3
concerns; I'll have a look!

-Ahmed

Which brings us to my fallback proposal: what about disabling the
pass on darwin only?

That's a decision for Jim/Evan. I'm ok if they are.

Jim, thoughts?

I would prefer Darwin not differ in this regard, but I don’t feel incredibly strongly about it. Just a general preference to keeping platform dependencies and differences to a minimum. Whatever y’all decide is fine with me.

> Fair enough. First, a couple things to note:
> - GlobalMerge runs as a pre-ISel pass, so very late in the mid-level pipeline.

To be precise, GlobalMerge is registered as a pre-ISel pass, but it still runs very early in the pipeline, because all its work is done during doInitialization… Pretty broken, I know.

-Quentin

I would prefer Darwin not differ in this regard, but I don’t feel incredibly strongly about it. Just a general preference to keeping platform dependencies and differences to a minimum.

Same.

Whatever y’all decide is fine with me.

We might not need this after all =) With a module-level flag, we
could only enable it under -O3 even for LTO, which is fine no matter
the platform.

-Ahmed

> To be precise, GlobalMerge is registered as a pre-ISel pass, but still it runs very early in the pipeline, because all its work is done during doInitialization… Pretty broken, I know.

Oh god, I forgot about this... it actually runs pretty early, not
sure when exactly..

-Ahmed

> Before making the disabling darwin only I'd like to see some analysis of
> the regressions/improvements. Has anyone looked at the code for those yet?

Yep, I put a quick analysis in my other reply.

The LOH/ADRP bit?

>
>>
>> As for other targets, as a first step, making the pass run under -O3
>> rather than -O1 is hopefully agreeable to everyone? After all, it is
>> "aggressive", and isn't always profitable. That's pretty much the
>> description of -O3.
>> We can still run into problematic cases under LTO, though.
>>
>
> Seems reasonable to me, but probably want to see what happens with the
> above questions first.

Fair enough. Bottom line is:
- disabling it without LTO is a slight win on the test-suite, a solid
win everywhere else I've looked.
- disabling it with LTO regresses quite a few SPEC benchmarks, and is
overall a slight regression on the test-suite.

Ah, I meant an analysis of the code, not just the numbers. I think the
ADRP/LOH commentary really helps. It might only be a decent LTOish
optimization, but I'm still curious how it's helping there over other
optimizations.

Basically - and I think this is what Renato asks as well - it doesn't
really interact with later optimizations. Throughout most of the
backend, we keep global references (e.g., adrp+add) together, as a
pseudo instruction (MOVaddr, LOADgot, ...). Very late we expand it to
adrp+add/.... So, the only thing that helps is the LOH linker
optimizations, which try to simplify some of the adrp sequences.
Really, the backend is oblivious to the fact that global references
aren't trivial. We don't try to CSE the adrp's, for instance (I
believe there was a patch for that, Quentin and Jiangning might know
more). Does that clarify a bit?

Looking at the code, you have two main problematic situations:
- the register pressure tradeoff:

Consider:

adrp x8, 133
ldr x8, [x8, #3568]
...
adrp x8, 133
ldr x0, [x8, #3576]

Turning into:

adrp x19, 133
add x19, x19, #3392
ldr x8, [x19, #192]
...
ldr x0, [x19, #200]

- an additional instruction when only one global from a merged set is
accessed (or when the LOH optimizations fired)

Consider the similar:

adrp x20, 133
ldr x8, [x20, #3432]
...
str x0, [x20, #3432]

Turning into:

adrp x20, 133
add x20, x20, #3392
ldr x8, [x20, #56]
...
str x0, [x20, #56]

One positive case is explained in the GlobalMerge.cpp comments: it
reduces register pressure in a loop, by using a single base register
for multiple globals.

Another positive is that merging globals effectively CSEs the base
address computation.

Anyhow, FWIW I'm in favor of pulling it out of the non-LTO pipeline
universally.

I tend to agree, but it's still sometimes useful in non-LTO. One case
that came up in benchmarks was a bunch of file-static globals used
pervasively in a single file (I believe lex/yacc can generate this
kind of thing). There it's very beneficial, even without LTO. Hence,
-O3 and -mno-global-merge, if necessary.

-Ahmed