Removing AvailableExternal values in GlobalDCE (was Re: RFC: ThinLTO Impementation Plan)

I found allowing GlobalDCE to remove referenced
AvailableExternallyLinkage values earlier passes all the clang/llvm
tests (see patch below). AFAICT, it shouldn't be necessary to keep
these functions around past the first call to GlobalDCE (which is
after inlining), but let me know if I missed something. Although they
will be removed eventually by MachineFunctionPass as David pointed
out, so this is more of a compile-time improvement than anything. It
will affect ThinLTO more significantly in compile time because there
will be more functions with this linkage type, depending on how many
static functions are imported that we have to promote.

Also, what about available_externally variable
definitions/initializations? I'm not sure what source construct
currently can generate those, but are these suppressed/turned into
declarations at some point? Note that ThinLTO imported file static
variables always need to be promoted, and will be marked
available_external.

Index: lib/Transforms/IPO/GlobalDCE.cpp

>>
>>
>>>
>>> >
>>> >
>>> >>
>>> >> Looking back through my GlobalDCE changes, it looks like one of the
>>> >> places I had changed (where we mark defined globals in runOnModule)
>>> >> already has a guard for !hasAvailableExternallyLinkage and
>>> >> !isDiscardableIfUnused, so my additional guard against marking
>>> >> imported functions is unnecessary. But the other place I had to
change
>>> >> was in GlobalIsNeeded where it walks through the function and
>>> >> recursively marks any referenced global as needed. Here there was no
>>> >> guard against marking a global that is available externally as
needed
>>> >> if it is referenced. I had added a check here to not mark imported
>>> >> functions as needed on reference unless they were discardable (i.e.
>>> >> link once). Is this a bug - should this have a guard against marking
>>> >> available externally function refs as needed?
>>> >
>>> >
>>> > Duncan's probably got a better idea of what the right answer is
here. I
>>> > suspect "yes".
>>> >
>>> > The trick with available_externally is to ensure we keep these around
>>> > for
>>> > long enough that their definitions are useful (for inlining, constant
>>> > prop,
>>> > all that good stuff) but remove them before we actually do too much
work
>>> > on
>>> > them, or at least before we emit them into the final object file.
>>>
>>> Yep, and that is exactly how long I want my imported functions to
>>> stick around. Currently I have the ThinLTO import pass added at the
>>> start of addLTOOptimizationPasses, just after the AA passes. So the
>>> imported functions stick around through global opt, constant merge,
>>> combining, and inlining. The call to GlobalDCE is just after inlining
>>> and a second globalopt.
>>>
>>> >
>>> > I imagine if GlobalDCE isn't removing available_externally functions
>>> > it's
>>> > because they're still useful in the optimization pipeline and
something
>>> > further down the pipe removes them (because they do, ultimately, get
>>> > removed).
>>>
>>> That would be good to know, since if they are useful later on in the
>>> pipe then presumably my imported functions could be as well. But I
>>> wonder who is removing them since GlobalDCE isn't (if there are refs)?
>>
>>
>> Yep - seems no one removes them (well, probably if there are no calls
to the
>> function at all they get removed as usual (eg: if all the calls are
inlined
>> we probably drop the function as we would for a linkonce_odr
function)). If
>> they're still around come codegen time, we just skip them:
>>
>> lib/CodeGen/MachineFunctionPass.cpp, MachineFunctionPass::runOnFunction:
>>
>> 33| bool MachineFunctionPass::runOnFunction(Function &F) {
>> 34| // Do not codegen any 'available_externally' functions at all,
they
>> have
>> 35| // definitions outside the translation unit.
>> 36| if (F.hasAvailableExternallyLinkage())
>> 37| return false;
>> 38|
>> 39| MachineFunction &MF =
getAnalysis<MachineFunctionAnalysis>().getMF();
>> 40| return runOnMachineFunction(MF);
>> 41| }
>
> Ok, thanks for digging this up. I guess that would work for the
> imported functions as well, but it seems like a waste of compile time
> if they are not useful after inlining/global opt. Is there any reason
> to keep them past GlobalDCE or should I try changing GlobalDCE so that
> it removes them?
>

I found allowing GlobalDCE to remove referenced
AvailableExternallyLinkage values earlier passes all the clang/llvm
tests (see patch below).

The LLVM regression tests might not catch the sort of regression this
change could cause. Usually we test each pass in isolation and tend towards
white box testing (so we test the cases that are "interesting" according to
the algorithm) - in this case, you're adding a condition (adding an
"interesting" case that wasn't interesting before - so didn't need to be
tested explicitly) that changes the behavior of GlobalDCE. This behavior
may've been depended upon by further down-stream optimizations for overall
performance.

AFAICT, it shouldn't be necessary to keep
these functions around past the first call to GlobalDCE (which is
after inlining), but let me know if I missed something.

It's possible they're still useful for optimizations other than inlining.
We can still do constant propagation through such functions (hey, look,
this function always returns 3, etc).

It might be necessary/worth actually running perf numbers on such a change
(or manually (possibly by asking those who know - not me, unfortunately)
checking the passes that come after GlobalDCE to see if any would do cross
function constant prop, etc. Perhaps someone else knows this off-the-cuff.

Although they
will be removed eventually by MachineFunctionPass as David pointed
out,

Interestingly they're not exactly /removed/, they're just not emitted. I
don't know whether that's important.

so this is more of a compile-time improvement than anything. It
will affect ThinLTO more significantly in compile time because there
will be more functions with this linkage type, depending on how many
static functions

Hmm - why static functions in particular? I imagine this attribute would be
used for non-static functions too in ThinLTO.

are imported that we have to promote.

Also, what about available_externally variable
definitions/initializations? I'm not sure what source construct
currently can generate those,

I believe we emit C++ vtables as available_externally when the type has a
key function. This allows devirtualization opportunities without bloating
object file size with duplicate vtable data.

but are these suppressed/turned into
declarations at some point?

Maybe? Again, I would guess (and you can check some of this with
-print-after-all to see how the IR changes as it goes through the
optimizations) they might just be omitted from the generated code, not
necessarily removed from the IR at any point.

>>
>>
>>>
>>> >
>>> >
>>> >>
>>> >> Looking back through my GlobalDCE changes, it looks like one of the
>>> >> places I had changed (where we mark defined globals in runOnModule)
>>> >> already has a guard for !hasAvailableExternallyLinkage and
>>> >> !isDiscardableIfUnused, so my additional guard against marking
>>> >> imported functions is unnecessary. But the other place I had to
>>> >> change
>>> >> was in GlobalIsNeeded where it walks through the function and
>>> >> recursively marks any referenced global as needed. Here there was
>>> >> no
>>> >> guard against marking a global that is available externally as
>>> >> needed
>>> >> if it is referenced. I had added a check here to not mark imported
>>> >> functions as needed on reference unless they were discardable (i.e.
>>> >> link once). Is this a bug - should this have a guard against
>>> >> marking
>>> >> available externally function refs as needed?
>>> >
>>> >
>>> > Duncan's probably got a better idea of what the right answer is
>>> > here. I
>>> > suspect "yes".
>>> >
>>> > The trick with available_externally is to ensure we keep these
>>> > around
>>> > for
>>> > long enough that their definitions are useful (for inlining,
>>> > constant
>>> > prop,
>>> > all that good stuff) but remove them before we actually do too much
>>> > work
>>> > on
>>> > them, or at least before we emit them into the final object file.
>>>
>>> Yep, and that is exactly how long I want my imported functions to
>>> stick around. Currently I have the ThinLTO import pass added at the
>>> start of addLTOOptimizationPasses, just after the AA passes. So the
>>> imported functions stick around through global opt, constant merge,
>>> combining, and inlining. The call to GlobalDCE is just after inlining
>>> and a second globalopt.
>>>
>>> >
>>> > I imagine if GlobalDCE isn't removing available_externally functions
>>> > it's
>>> > because they're still useful in the optimization pipeline and
>>> > something
>>> > further down the pipe removes them (because they do, ultimately, get
>>> > removed).
>>>
>>> That would be good to know, since if they are useful later on in the
>>> pipe then presumably my imported functions could be as well. But I
>>> wonder who is removing them since GlobalDCE isn't (if there are refs)?
>>
>>
>> Yep - seems no one removes them (well, probably if there are no calls
>> to the
>> function at all they get removed as usual (eg: if all the calls are
>> inlined
>> we probably drop the function as we would for a linkonce_odr
>> function)). If
>> they're still around come codegen time, we just skip them:
>>
>> lib/CodeGen/MachineFunctionPass.cpp,
>> MachineFunctionPass::runOnFunction:
>>
>> 33| bool MachineFunctionPass::runOnFunction(Function &F) {
>> 34| // Do not codegen any 'available_externally' functions at all,
>> they
>> have
>> 35| // definitions outside the translation unit.
>> 36| if (F.hasAvailableExternallyLinkage())
>> 37| return false;
>> 38|
>> 39| MachineFunction &MF =
>> getAnalysis<MachineFunctionAnalysis>().getMF();
>> 40| return runOnMachineFunction(MF);
>> 41| }
>
> Ok, thanks for digging this up. I guess that would work for the
> imported functions as well, but it seems like a waste of compile time
> if they are not useful after inlining/global opt. Is there any reason
> to keep them past GlobalDCE or should I try changing GlobalDCE so that
> it removes them?
>

I found allowing GlobalDCE to remove referenced
AvailableExternallyLinkage values earlier passes all the clang/llvm
tests (see patch below).

The LLVM regression tests might not catch the sort of regression this change
could cause. Usually we test each pass in isolation and tend towards white
box testing (so we test the cases that are "interesting" according to the
algorithm) - in this case, you're adding a condition (adding an
"interesting" case that wasn't interesting before - so didn't need to be
tested explicitly) that changes the behavior of GlobalDCE. This behavior
may've been depended upon by further down-stream optimizations for overall
performance.

AFAICT, it shouldn't be necessary to keep
these functions around past the first call to GlobalDCE (which is
after inlining), but let me know if I missed something.

It's possible they're still useful for optimizations other than inlining. We
can still do constant propagation through such functions (hey, look, this
function always returns 3, etc).

I thought this was done by passes such as IPSCCPPass and GlobalOpt
which happen earlier.

It might be necessary/worth actually running perf numbers on such a change
(or manually (possibly by asking those who know - not me, unfortunately)
checking the passes that come after GlobalDCE to see if any would do cross
function constant prop, etc. Perhaps someone else knows this off-the-cuff.

I can collect some perf data.

Although they
will be removed eventually by MachineFunctionPass as David pointed
out,

Interestingly they're not exactly /removed/, they're just not emitted. I
don't know whether that's important.

Yes, true, suppressed not removed.

so this is more of a compile-time improvement than anything. It
will affect ThinLTO more significantly in compile time because there
will be more functions with this linkage type, depending on how many
static functions

Hmm - why static functions in particular? I imagine this attribute would be
used for non-static functions too in ThinLTO.

Arg, yes, I made the same mistake in some earlier email - all "global"
functions (originally so and those static promoted) will be
available_externally. So this is a significant amount for ThinLTO.

are imported that we have to promote.

Also, what about available_externally variable
definitions/initializations? I'm not sure what source construct
currently can generate those,

I believe we emit C++ vtables as available_externally when the type has a
key function. This allows devirtualization opportunities without bloating
object file size with duplicate vtable data.

but are these suppressed/turned into
declarations at some point?

Maybe? Again, I would guess (and you can check some of this with
-print-after-all to see how the IR changes as it goes through the
optimizations) they might just be omitted from the generated code, not
necessarily removed from the IR at any point.

Ok, I will investigate.

Thanks,
Teresa

>
>
>>
>> >>
>> >>
>> >>>
>> >>> >
>> >>> >
>> >>> >>
>> >>> >> Looking back through my GlobalDCE changes, it looks like one of
the
>> >>> >> places I had changed (where we mark defined globals in
runOnModule)
>> >>> >> already has a guard for !hasAvailableExternallyLinkage and
>> >>> >> !isDiscardableIfUnused, so my additional guard against marking
>> >>> >> imported functions is unnecessary. But the other place I had to
>> >>> >> change
>> >>> >> was in GlobalIsNeeded where it walks through the function and
>> >>> >> recursively marks any referenced global as needed. Here there was
>> >>> >> no
>> >>> >> guard against marking a global that is available externally as
>> >>> >> needed
>> >>> >> if it is referenced. I had added a check here to not mark
imported
>> >>> >> functions as needed on reference unless they were discardable
(i.e.
>> >>> >> link once). Is this a bug - should this have a guard against
>> >>> >> marking
>> >>> >> available externally function refs as needed?
>> >>> >
>> >>> >
>> >>> > Duncan's probably got a better idea of what the right answer is
>> >>> > here. I
>> >>> > suspect "yes".
>> >>> >
>> >>> > The trick with available_externally is to ensure we keep these
>> >>> > around
>> >>> > for
>> >>> > long enough that their definitions are useful (for inlining,
>> >>> > constant
>> >>> > prop,
>> >>> > all that good stuff) but remove them before we actually do too
much
>> >>> > work
>> >>> > on
>> >>> > them, or at least before we emit them into the final object file.
>> >>>
>> >>> Yep, and that is exactly how long I want my imported functions to
>> >>> stick around. Currently I have the ThinLTO import pass added at the
>> >>> start of addLTOOptimizationPasses, just after the AA passes. So the
>> >>> imported functions stick around through global opt, constant merge,
>> >>> combining, and inlining. The call to GlobalDCE is just after
inlining
>> >>> and a second globalopt.
>> >>>
>> >>> >
>> >>> > I imagine if GlobalDCE isn't removing available_externally
functions
>> >>> > it's
>> >>> > because they're still useful in the optimization pipeline and
>> >>> > something
>> >>> > further down the pipe removes them (because they do, ultimately,
get
>> >>> > removed).
>> >>>
>> >>> That would be good to know, since if they are useful later on in the
>> >>> pipe then presumably my imported functions could be as well. But I
>> >>> wonder who is removing them since GlobalDCE isn't (if there are
refs)?
>> >>
>> >>
>> >> Yep - seems no one removes them (well, probably if there are no calls
>> >> to the
>> >> function at all they get removed as usual (eg: if all the calls are
>> >> inlined
>> >> we probably drop the function as we would for a linkonce_odr
>> >> function)). If
>> >> they're still around come codegen time, we just skip them:
>> >>
>> >> lib/CodeGen/MachineFunctionPass.cpp,
>> >> MachineFunctionPass::runOnFunction:
>> >>
>> >> 33| bool MachineFunctionPass::runOnFunction(Function &F) {
>> >> 34| // Do not codegen any 'available_externally' functions at all,
>> >> they
>> >> have
>> >> 35| // definitions outside the translation unit.
>> >> 36| if (F.hasAvailableExternallyLinkage())
>> >> 37| return false;
>> >> 38|
>> >> 39| MachineFunction &MF =
>> >> getAnalysis<MachineFunctionAnalysis>().getMF();
>> >> 40| return runOnMachineFunction(MF);
>> >> 41| }
>> >
>> > Ok, thanks for digging this up. I guess that would work for the
>> > imported functions as well, but it seems like a waste of compile time
>> > if they are not useful after inlining/global opt. Is there any reason
>> > to keep them past GlobalDCE or should I try changing GlobalDCE so that
>> > it removes them?
>> >
>>
>> I found allowing GlobalDCE to remove referenced
>> AvailableExternallyLinkage values earlier passes all the clang/llvm
>> tests (see patch below).
>
>
> The LLVM regression tests might not catch the sort of regression this
change
> could cause. Usually we test each pass in isolation and tend towards
white
> box testing (so we test the cases that are "interesting" according to the
> algorithm) - in this case, you're adding a condition (adding an
> "interesting" case that wasn't interesting before - so didn't need to be
> tested explicitly) that changes the behavior of GlobalDCE. This behavior
> may've been depended upon by further down-stream optimizations for
overall
> performance.
>
>>
>> AFAICT, it shouldn't be necessary to keep
>> these functions around past the first call to GlobalDCE (which is
>> after inlining), but let me know if I missed something.
>
>
> It's possible they're still useful for optimizations other than
inlining. We
> can still do constant propagation through such functions (hey, look, this
> function always returns 3, etc).

I thought this was done by passes such as IPSCCPPass and GlobalOpt
which happen earlier.

You might well be right - I know next to nothing about these things. Having
someone a double-check/review from someone more familiar would be nice.

I'll leave the rest of the review to any such someone who might be
sufficiently informed about the pass pipeline to decide whether this is the
right call. (and/or perf numbers that demonstrate it)

Hopefully my naive review has at least covered the basics.

- David

>
>
>>
>> >>
>> >>
>> >>>
>> >>> >
>> >>> >
>> >>> >>
>> >>> >> Looking back through my GlobalDCE changes, it looks like one of
>> >>> >> the
>> >>> >> places I had changed (where we mark defined globals in
>> >>> >> runOnModule)
>> >>> >> already has a guard for !hasAvailableExternallyLinkage and
>> >>> >> !isDiscardableIfUnused, so my additional guard against marking
>> >>> >> imported functions is unnecessary. But the other place I had to
>> >>> >> change
>> >>> >> was in GlobalIsNeeded where it walks through the function and
>> >>> >> recursively marks any referenced global as needed. Here there
>> >>> >> was
>> >>> >> no
>> >>> >> guard against marking a global that is available externally as
>> >>> >> needed
>> >>> >> if it is referenced. I had added a check here to not mark
>> >>> >> imported
>> >>> >> functions as needed on reference unless they were discardable
>> >>> >> (i.e.
>> >>> >> link once). Is this a bug - should this have a guard against
>> >>> >> marking
>> >>> >> available externally function refs as needed?
>> >>> >
>> >>> >
>> >>> > Duncan's probably got a better idea of what the right answer is
>> >>> > here. I
>> >>> > suspect "yes".
>> >>> >
>> >>> > The trick with available_externally is to ensure we keep these
>> >>> > around
>> >>> > for
>> >>> > long enough that their definitions are useful (for inlining,
>> >>> > constant
>> >>> > prop,
>> >>> > all that good stuff) but remove them before we actually do too
>> >>> > much
>> >>> > work
>> >>> > on
>> >>> > them, or at least before we emit them into the final object file.
>> >>>
>> >>> Yep, and that is exactly how long I want my imported functions to
>> >>> stick around. Currently I have the ThinLTO import pass added at the
>> >>> start of addLTOOptimizationPasses, just after the AA passes. So the
>> >>> imported functions stick around through global opt, constant merge,
>> >>> combining, and inlining. The call to GlobalDCE is just after
>> >>> inlining
>> >>> and a second globalopt.
>> >>>
>> >>> >
>> >>> > I imagine if GlobalDCE isn't removing available_externally
>> >>> > functions
>> >>> > it's
>> >>> > because they're still useful in the optimization pipeline and
>> >>> > something
>> >>> > further down the pipe removes them (because they do, ultimately,
>> >>> > get
>> >>> > removed).
>> >>>
>> >>> That would be good to know, since if they are useful later on in
>> >>> the
>> >>> pipe then presumably my imported functions could be as well. But I
>> >>> wonder who is removing them since GlobalDCE isn't (if there are
>> >>> refs)?
>> >>
>> >>
>> >> Yep - seems no one removes them (well, probably if there are no
>> >> calls
>> >> to the
>> >> function at all they get removed as usual (eg: if all the calls are
>> >> inlined
>> >> we probably drop the function as we would for a linkonce_odr
>> >> function)). If
>> >> they're still around come codegen time, we just skip them:
>> >>
>> >> lib/CodeGen/MachineFunctionPass.cpp,
>> >> MachineFunctionPass::runOnFunction:
>> >>
>> >> 33| bool MachineFunctionPass::runOnFunction(Function &F) {
>> >> 34| // Do not codegen any 'available_externally' functions at all,
>> >> they
>> >> have
>> >> 35| // definitions outside the translation unit.
>> >> 36| if (F.hasAvailableExternallyLinkage())
>> >> 37| return false;
>> >> 38|
>> >> 39| MachineFunction &MF =
>> >> getAnalysis<MachineFunctionAnalysis>().getMF();
>> >> 40| return runOnMachineFunction(MF);
>> >> 41| }
>> >
>> > Ok, thanks for digging this up. I guess that would work for the
>> > imported functions as well, but it seems like a waste of compile time
>> > if they are not useful after inlining/global opt. Is there any reason
>> > to keep them past GlobalDCE or should I try changing GlobalDCE so
>> > that
>> > it removes them?
>> >
>>
>> I found allowing GlobalDCE to remove referenced
>> AvailableExternallyLinkage values earlier passes all the clang/llvm
>> tests (see patch below).
>
>
> The LLVM regression tests might not catch the sort of regression this
> change
> could cause. Usually we test each pass in isolation and tend towards
> white
> box testing (so we test the cases that are "interesting" according to
> the
> algorithm) - in this case, you're adding a condition (adding an
> "interesting" case that wasn't interesting before - so didn't need to be
> tested explicitly) that changes the behavior of GlobalDCE. This behavior
> may've been depended upon by further down-stream optimizations for
> overall
> performance.
>
>>
>> AFAICT, it shouldn't be necessary to keep
>> these functions around past the first call to GlobalDCE (which is
>> after inlining), but let me know if I missed something.
>
>
> It's possible they're still useful for optimizations other than
> inlining. We
> can still do constant propagation through such functions (hey, look,
> this
> function always returns 3, etc).

I thought this was done by passes such as IPSCCPPass and GlobalOpt
which happen earlier.

You might well be right - I know next to nothing about these things. Having
someone a double-check/review from someone more familiar would be nice.

I'll leave the rest of the review to any such someone who might be
sufficiently informed about the pass pipeline to decide whether this is the
right call. (and/or perf numbers that demonstrate it)

Hopefully my naive review has at least covered the basics.

I just did some SPEC cpu2006 performance testing with this change, at
-O2 both with and without LTO. There were no measurable performance
differences. In fact, the final binaries were largely identical
(always the same size before and after, generated code looks
identical, slight differences in label naming are the only differences
in the nm and objdump output for a couple benchmarks - most are
totally identical).

Can someone take a look at the patch and approve if possible? While
currently this doesn't affect large numbers of functions, with ThinLTO
we want to avoid keeping around the larger numbers of
available-externally imported objects unnecessarily. This change
avoids the need of special casing here for ThinLTO imported
functions/variables.

Thanks,
Teresa

I'd be interested in performance of a clang bootstrap or of Chromium (or
some other large C++ program built with -flto).

Moreover, I suspect this will cause a problem for full LTO, where a lot
of inlining takes place at link time. There have even been proposals
(but no one has collected any data to share yet) to run just the
always-inliner at compile time, deferring almost all inlining until
later.

But if we can have different pass pipelines for "normal" vs. "LTO"
compiles, then "ThinLTO" can have its own pass pipeline as well. Seems
like it already needs one. Why not add a late pass there to drop
available_externally functions?

Looking back through my GlobalDCE changes, it looks like one of
the
places I had changed (where we mark defined globals in
runOnModule)
already has a guard for !hasAvailableExternallyLinkage and
!isDiscardableIfUnused, so my additional guard against marking
imported functions is unnecessary. But the other place I had to
change
was in GlobalIsNeeded where it walks through the function and
recursively marks any referenced global as needed. Here there
was
no
guard against marking a global that is available externally as
needed
if it is referenced. I had added a check here to not mark
imported
functions as needed on reference unless they were discardable
(i.e.
link once). Is this a bug - should this have a guard against
marking
available externally function refs as needed?

Duncan's probably got a better idea of what the right answer is
here. I
suspect "yes".

The trick with available_externally is to ensure we keep these
around
for
long enough that their definitions are useful (for inlining,
constant
prop,
all that good stuff) but remove them before we actually do too
much
work
on
them, or at least before we emit them into the final object file.

Yep, and that is exactly how long I want my imported functions to
stick around. Currently I have the ThinLTO import pass added at the
start of addLTOOptimizationPasses, just after the AA passes. So the
imported functions stick around through global opt, constant merge,
combining, and inlining. The call to GlobalDCE is just after
inlining
and a second globalopt.

I imagine if GlobalDCE isn't removing available_externally
functions
it's
because they're still useful in the optimization pipeline and
something
further down the pipe removes them (because they do, ultimately,
get
removed).

That would be good to know, since if they are useful later on in
the
pipe then presumably my imported functions could be as well. But I
wonder who is removing them since GlobalDCE isn't (if there are
refs)?

Yep - seems no one removes them (well, probably if there are no
calls
to the
function at all they get removed as usual (eg: if all the calls are
inlined
we probably drop the function as we would for a linkonce_odr
function)). If
they're still around come codegen time, we just skip them:

lib/CodeGen/MachineFunctionPass.cpp,
MachineFunctionPass::runOnFunction:

33| bool MachineFunctionPass::runOnFunction(Function &F) {
34| // Do not codegen any 'available_externally' functions at all,
they
have
35| // definitions outside the translation unit.
36| if (F.hasAvailableExternallyLinkage())
37| return false;
38|
39| MachineFunction &MF =
getAnalysis<MachineFunctionAnalysis>().getMF();
40| return runOnMachineFunction(MF);
41| }

Ok, thanks for digging this up. I guess that would work for the
imported functions as well, but it seems like a waste of compile time
if they are not useful after inlining/global opt. Is there any reason
to keep them past GlobalDCE or should I try changing GlobalDCE so
that
it removes them?

I found allowing GlobalDCE to remove referenced
AvailableExternallyLinkage values earlier passes all the clang/llvm
tests (see patch below).

The LLVM regression tests might not catch the sort of regression this
change
could cause. Usually we test each pass in isolation and tend towards
white
box testing (so we test the cases that are "interesting" according to
the
algorithm) - in this case, you're adding a condition (adding an
"interesting" case that wasn't interesting before - so didn't need to be
tested explicitly) that changes the behavior of GlobalDCE. This behavior
may've been depended upon by further down-stream optimizations for
overall
performance.

AFAICT, it shouldn't be necessary to keep
these functions around past the first call to GlobalDCE (which is
after inlining), but let me know if I missed something.

It's possible they're still useful for optimizations other than
inlining. We
can still do constant propagation through such functions (hey, look,
this
function always returns 3, etc).

I thought this was done by passes such as IPSCCPPass and GlobalOpt
which happen earlier.

You might well be right - I know next to nothing about these things. Having
someone a double-check/review from someone more familiar would be nice.

I'll leave the rest of the review to any such someone who might be
sufficiently informed about the pass pipeline to decide whether this is the
right call. (and/or perf numbers that demonstrate it)

Hopefully my naive review has at least covered the basics.

I just did some SPEC cpu2006 performance testing with this change, at
-O2 both with and without LTO. There were no measurable performance
differences. In fact, the final binaries were largely identical
(always the same size before and after, generated code looks
identical, slight differences in label naming are the only differences
in the nm and objdump output for a couple benchmarks - most are
totally identical).

Can someone take a look at the patch and approve if possible? While
currently this doesn't affect large numbers of functions, with ThinLTO
we want to avoid keeping around the larger numbers of
available-externally imported objects unnecessarily. This change
avoids the need of special casing here for ThinLTO imported
functions/variables.

I'd be interested in performance of a clang bootstrap or of Chromium (or
some other large C++ program built with -flto).

Moreover, I suspect this will cause a problem for full LTO, where a lot
of inlining takes place at link time. There have even been proposals
(but no one has collected any data to share yet) to run just the
always-inliner at compile time, deferring almost all inlining until
later.

Note that one of the sets of SPEC cpu2006 data I collected was with
full LTO. I'm trying to understand where this would be an issue for
full LTO, since GlobalDCE is running after inlining in the link time
LTO pipeline.

Just to be clear, are you concerned about the following:

1) GlobalDCE run after inlining in the "-flto -O2 -c" build removes an
unreferenced available externally function
2) LTO link step would have inlined that available externally function
somewhere (must be a different module since it was unreferenced in
step 1) but now can't

I'm not sure when 2) can happen since it had to be unreferenced in its
module to be removed with my changes. For an available externally
function, any reference from a different module would have needed its
own copy to start with (at least that is the case with C inline
functions). It looks like available externally C inline functions are
already removed when unreferenced by the compile step (e.g. when
inlined into that module's references during the "-flto -O2 -c"
compile).

But if we can have different pass pipelines for "normal" vs. "LTO"
compiles, then "ThinLTO" can have its own pass pipeline as well. Seems
like it already needs one. Why not add a late pass there to drop
available_externally functions?

Sure that's possible, but I am trying to understand what circumstance
there is where a full LTO would need to keep around available
externally functions after GlobalDCE where ThinLTO wouldn't.
Presumably if they are useful for optimization after GlobalDCE then we
would want to keep them around longer for ThinLTO as well. I just
haven't identified any need to do so.

Thanks,
Teresa

You might well be right - I know next to nothing about these things. Having
someone a double-check/review from someone more familiar would be nice.

I'll leave the rest of the review to any such someone who might be
sufficiently informed about the pass pipeline to decide whether this is the
right call. (and/or perf numbers that demonstrate it)

Hopefully my naive review has at least covered the basics.

I just did some SPEC cpu2006 performance testing with this change, at
-O2 both with and without LTO. There were no measurable performance
differences. In fact, the final binaries were largely identical
(always the same size before and after, generated code looks
identical, slight differences in label naming are the only differences
in the nm and objdump output for a couple benchmarks - most are
totally identical).

Can someone take a look at the patch and approve if possible? While
currently this doesn't affect large numbers of functions, with ThinLTO
we want to avoid keeping around the larger numbers of
available-externally imported objects unnecessarily. This change
avoids the need of special casing here for ThinLTO imported
functions/variables.

I'd be interested in performance of a clang bootstrap or of Chromium (or
some other large C++ program built with -flto).

Moreover, I suspect this will cause a problem for full LTO, where a lot
of inlining takes place at link time. There have even been proposals
(but no one has collected any data to share yet) to run just the
always-inliner at compile time, deferring almost all inlining until
later.

Note that one of the sets of SPEC cpu2006 data I collected was with
full LTO. I'm trying to understand where this would be an issue for
full LTO, since GlobalDCE is running after inlining in the link time
LTO pipeline.

Just to be clear, are you concerned about the following:

1) GlobalDCE run after inlining in the "-flto -O2 -c" build removes an
unreferenced available externally function
2) LTO link step would have inlined that available externally function
somewhere (must be a different module since it was unreferenced in
step 1) but now can't

I'm not sure when 2) can happen since it had to be unreferenced in its
module to be removed with my changes.

That's not what it looks like to me. Here's what I think the relevant
part of your patch is:

@@ -231,8 +237,10 @@ void GlobalDCE::GlobalIsNeeded(GlobalValue *G) {
    for (Function::iterator BB = F->begin(), E = F->end(); BB != E; ++BB)
      for (BasicBlock::iterator I = BB->begin(), E = BB->end(); I != E; ++I)
        for (User::op_iterator U = I->op_begin(), E = I->op_end(); U != E; ++U)
- if (GlobalValue *GV = dyn_cast<GlobalValue>(*U))
- GlobalIsNeeded(GV);
+ if (GlobalValue *GV = dyn_cast<GlobalValue>(*U)) {
+ if (!GV->hasAvailableExternallyLinkage())
+ GlobalIsNeeded(GV);
+ }
          else if (Constant *C = dyn_cast<Constant>(*U))
            MarkUsedGlobalsAsNeeded(C);
  }

IIUC, this changes the logic from "if it's referenced, keep it" to
"if it's referenced and not available_externally, keep it".

In particular, I'm worried about GlobalDCE removing a *referenced*
available_externally function, particularly if/when we stop inlining at
the compile step when -flto.

You might well be right - I know next to nothing about these things. Having
someone a double-check/review from someone more familiar would be nice.

I'll leave the rest of the review to any such someone who might be
sufficiently informed about the pass pipeline to decide whether this is the
right call. (and/or perf numbers that demonstrate it)

Hopefully my naive review has at least covered the basics.

I just did some SPEC cpu2006 performance testing with this change, at
-O2 both with and without LTO. There were no measurable performance
differences. In fact, the final binaries were largely identical
(always the same size before and after, generated code looks
identical, slight differences in label naming are the only differences
in the nm and objdump output for a couple benchmarks - most are
totally identical).

Can someone take a look at the patch and approve if possible? While
currently this doesn't affect large numbers of functions, with ThinLTO
we want to avoid keeping around the larger numbers of
available-externally imported objects unnecessarily. This change
avoids the need of special casing here for ThinLTO imported
functions/variables.

I'd be interested in performance of a clang bootstrap or of Chromium (or
some other large C++ program built with -flto).

Moreover, I suspect this will cause a problem for full LTO, where a lot
of inlining takes place at link time. There have even been proposals
(but no one has collected any data to share yet) to run just the
always-inliner at compile time, deferring almost all inlining until
later.

Note that one of the sets of SPEC cpu2006 data I collected was with
full LTO. I'm trying to understand where this would be an issue for
full LTO, since GlobalDCE is running after inlining in the link time
LTO pipeline.

Just to be clear, are you concerned about the following:

1) GlobalDCE run after inlining in the "-flto -O2 -c" build removes an
unreferenced available externally function
2) LTO link step would have inlined that available externally function
somewhere (must be a different module since it was unreferenced in
step 1) but now can't

I'm not sure when 2) can happen since it had to be unreferenced in its
module to be removed with my changes.

That's not what it looks like to me. Here's what I think the relevant
part of your patch is:

@@ -231,8 +237,10 @@ void GlobalDCE::GlobalIsNeeded(GlobalValue *G) {
    for (Function::iterator BB = F->begin(), E = F->end(); BB != E; ++BB)
      for (BasicBlock::iterator I = BB->begin(), E = BB->end(); I != E; ++I)
        for (User::op_iterator U = I->op_begin(), E = I->op_end(); U != E; ++U)
- if (GlobalValue *GV = dyn_cast<GlobalValue>(*U))
- GlobalIsNeeded(GV);
+ if (GlobalValue *GV = dyn_cast<GlobalValue>(*U)) {
+ if (!GV->hasAvailableExternallyLinkage())
+ GlobalIsNeeded(GV);
+ }
          else if (Constant *C = dyn_cast<Constant>(*U))
            MarkUsedGlobalsAsNeeded(C);
  }

IIUC, this changes the logic from "if it's referenced, keep it" to
"if it's referenced and not available_externally, keep it".

Sorry, you're right, I conflated a couple of different things when I
looked at this again this morning. For ThinLTO I don't in fact need
this change for anything that is fully inlined after importing.

In particular, I'm worried about GlobalDCE removing a *referenced*
available_externally function, particularly if/when we stop inlining at
the compile step when -flto.

Ok, that makes sense. In that case, I agree that doing this for
ThinLTO imported functions is the right way to go. My initial approach
was to mark the functions that were ThinLTO-imported and check that
here. I think we will likely want to mark the imported functions as
such regardless, since we may want to apply different optimization
thresholds for imported functions, or at least use for debugging and
tracing, so we could consider using that approach here.

Another possibility as you mentioned was to drop in a new pass in the
ThinLTO backend optimization pipeline just for this. Or perhaps the
GlobalDCE pass invocation could take a flag indicating that it should
remove the avail extern functions, that we can set in the ThinLTO
backend case. It seems like most of the code would be shared between
the current GlobalDCE and any new ThinLTO-specific version that
setting GlobalDCE up to do this optionally (either per-imported
function or with a configuration flag on the pass) made sense.

Teresa

Since the compiler is always free to delete available_externally
functions, I think you could just add a pass to the -flto=thin pipeline
that deletes all of them (referenced or not) -- it's just a single loop
through all the functions deleting the bodies of those with the right
linkage. I imagine there are other pass pipelines that might want to do
something similar. I don't really like having GlobalDCE delete them
(even behind a flag) because they aren't actually dead, and I think a
separate pass makes it easier to test and all that. (I haven't actually
worked much with pass pipelines, though, so there's a chance I'm on my
own here?)

You make an interesting point about applying different thresholds to
imported functions, but is there any reason not to change all
available_externally functions in the some way? Moreover, it feels like
thin-LTO's imported functions are a new source of functions with
available_externally linkage, but otherwise they aren't interestingly
different.

In particular, I'm worried about GlobalDCE removing a *referenced*
available_externally function, particularly if/when we stop inlining at
the compile step when -flto.

Ok, that makes sense. In that case, I agree that doing this for
ThinLTO imported functions is the right way to go. My initial approach
was to mark the functions that were ThinLTO-imported and check that
here. I think we will likely want to mark the imported functions as
such regardless, since we may want to apply different optimization
thresholds for imported functions, or at least use for debugging and
tracing, so we could consider using that approach here.

Another possibility as you mentioned was to drop in a new pass in the
ThinLTO backend optimization pipeline just for this. Or perhaps the
GlobalDCE pass invocation could take a flag indicating that it should
remove the avail extern functions, that we can set in the ThinLTO
backend case. It seems like most of the code would be shared between
the current GlobalDCE and any new ThinLTO-specific version that
setting GlobalDCE up to do this optionally (either per-imported
function or with a configuration flag on the pass) made sense.

Since the compiler is always free to delete available_externally
functions, I think you could just add a pass to the -flto=thin pipeline
that deletes all of them (referenced or not) -- it's just a single loop
through all the functions deleting the bodies of those with the right
linkage. I imagine there are other pass pipelines that might want to do
something similar. I don't really like having GlobalDCE delete them
(even behind a flag) because they aren't actually dead, and I think a
separate pass makes it easier to test and all that. (I haven't actually
worked much with pass pipelines, though, so there's a chance I'm on my
own here?)

Ok, sounds reasonable to do as a separate ThinLTO specific pass.

You make an interesting point about applying different thresholds to
imported functions, but is there any reason not to change all
available_externally functions in the some way? Moreover, it feels like
thin-LTO's imported functions are a new source of functions with
available_externally linkage, but otherwise they aren't interestingly
different.

I suspect those cases are going to be due to summary information
recorded for the imported functions anyway, which we haven't defined
the format of yet anyway. So I think we can separate that from this
aspect of imported function handling, which can be based on the
linkage type alone, and defer that discussion until later.

Thanks for your help,
Teresa

Since the compiler is always free to delete available_externally
functions, I think you could just add a pass to the -flto=thin pipeline
that deletes all of them (referenced or not) -- it's just a single loop
through all the functions deleting the bodies of those with the right
linkage. I imagine there are other pass pipelines that might want to do
something similar. I don't really like having GlobalDCE delete them
(even behind a flag) because they aren't actually dead, and I think a
separate pass makes it easier to test and all that. (I haven't actually
worked much with pass pipelines, though, so there's a chance I'm on my
own here?)

It's possible to get into situations where available_externally functions
or globals (dllimported vftable) reference linkonce_odr functions (virtual
destructor thunks), so a single pass to drop available_externally function
bodies isn't as strong as adding a flag to GlobalDCE.

Personally, I think the right approach is to add a bool to
createGlobalDCEPass defaulting to true named something like
IsAfterInlining. In most standard pass pipelines, GlobalDCE runs after
inlining for obvious reasons, so the default makes sense. The special case
is the LTO pass pipeline, where the code will be reloaded and reoptimized
later. IMO it's natural to put the customization there.

You make an interesting point about applying different thresholds to

imported functions, but is there any reason not to change all
available_externally functions in the some way? Moreover, it feels like
thin-LTO's imported functions are a new source of functions with
available_externally linkage, but otherwise they aren't interestingly
different.

Right, ThinLTO functions and C99 inline functions aren't interestingly
different. Having GlobalDCE drop bodies of available_externally functions
in the standard -O2 pipeline avoids wasting time running IR passes like
CodeGenPrepare on them.

Since the compiler is always free to delete available_externally
functions, I think you could just add a pass to the -flto=thin pipeline
that deletes all of them (referenced or not) -- it's just a single loop
through all the functions deleting the bodies of those with the right
linkage. I imagine there are other pass pipelines that might want to do
something similar. I don't really like having GlobalDCE delete them
(even behind a flag) because they aren't actually dead, and I think a
separate pass makes it easier to test and all that. (I haven't actually
worked much with pass pipelines, though, so there's a chance I'm on my
own here?)

It's possible to get into situations where available_externally functions or globals (dllimported vftable) reference linkonce_odr functions (virtual destructor thunks), so a single pass to drop available_externally function bodies isn't as strong as adding a flag to GlobalDCE.

Unless I'm missing something, running the pass to drop
available_externally function bodies just before running GlobalDCE
is equivalent to incorporating it into GlobalDCE.

Personally, I think the right approach is to add a bool to createGlobalDCEPass defaulting to true named something like IsAfterInlining. In most standard pass pipelines, GlobalDCE runs after inlining for obvious reasons, so the default makes sense. The special case is the LTO pass pipeline, where the code will be reloaded and reoptimized later. IMO it's natural to put the customization there.

That's a good point. But I still like these as separate passes.

Since the compiler is always free to delete available_externally
functions, I think you could just add a pass to the -flto=thin pipeline
that deletes all of them (referenced or not) -- it's just a single loop
through all the functions deleting the bodies of those with the right
linkage. I imagine there are other pass pipelines that might want to do
something similar. I don't really like having GlobalDCE delete them
(even behind a flag) because they aren't actually dead, and I think a
separate pass makes it easier to test and all that. (I haven't actually
worked much with pass pipelines, though, so there's a chance I'm on my
own here?)

It's possible to get into situations where available_externally functions
or globals (dllimported vftable) reference linkonce_odr functions (virtual
destructor thunks), so a single pass to drop available_externally function
bodies isn't as strong as adding a flag to GlobalDCE.

Personally, I think the right approach is to add a bool to
createGlobalDCEPass defaulting to true named something like
IsAfterInlining. In most standard pass pipelines, GlobalDCE runs after
inlining for obvious reasons, so the default makes sense. The special case
is the LTO pass pipeline, where the code will be reloaded and reoptimized
later. IMO it's natural to put the customization there.

+1. That is what I was going to suggest too. The flag can be more general
to indicate bodies are needed for referenced functions. For instance if
there are cloning phase after inlining, then DCE before the cloning still
needs to keep the bodies.

You make an interesting point about applying different thresholds to

imported functions, but is there any reason not to change all
available_externally functions in the some way? Moreover, it feels like
thin-LTO's imported functions are a new source of functions with
available_externally linkage, but otherwise they aren't interestingly
different.

Right, ThinLTO functions and C99 inline functions aren't interestingly
different

I think they are more like GNU extern inline functions.

. Having GlobalDCE drop bodies of available_externally functions in the
standard -O2 pipeline avoids wasting time running IR passes like
CodeGenPrepare on them.

yes.

thanks,

David

Hm, yeah, that's pretty sane. So, add a pass to downgrade
available_externally global definitions to declarations with external
linkage, add it to the standard -O2 pass pipeline, ensure that it's not
part of LTO, and then call it done?

SGTM.

Agreed. Although I assume you mean invoke the new pass under a
ThinLTO-only option so that avail extern are not dropped in the
compile pass before the LTO link?

Teresa

No, this pass would actually be an improvement to the standard -O2
pipeline. The special case is the regular LTO pass pipeline, which wants to
run GlobalDCE but doesn't want to drop available_externally function
definitions until after linker-stage inlining.

Consider this test case:

declare void @blah()
define i32 @main() {
  call void @foo()
  ret i32 0
}
define available_externally void @foo() noinline {
  call void @bar()
  ret void
}
define linkonce_odr void @bar() noinline {
  call void @blah()
  ret void
}

If I run opt -O2 on this and feed it to llc, it actually generates code for
bar, even though there are no actual references to bar in the final code:

main: # @main
        pushq %rax
        callq foo
        xorl %eax, %eax
        popq %rdx
        retq

bar: # @bar
        jmp blah # TAILCALL

This corner case happens to come up organically with dllimported classes,
which is why I happened to think of it. :slight_smile: I'm happy with a flag to
globaldce for LTO and the original patch, honestly.

Agreed. Although I assume you mean invoke the new pass under a
ThinLTO-only option so that avail extern are not dropped in the
compile pass before the LTO link?

No, this pass would actually be an improvement to the standard -O2 pipeline.
The special case is the regular LTO pass pipeline, which wants to run
GlobalDCE but doesn't want to drop available_externally function definitions
until after linker-stage inlining.

Ok. It looks to me like the LTO compile step with -O2 -flto -c uses
this same -O2 optimization pipeline as without LTO. Clang communicates
the fact that we are doing an LTO compile to llvm via the
-emit-llvm-bc flag, which just tells it to emit bitcode and exit
before codegen. So I would either need to key off of that or setup a
new flag to indicate to llvm that we are doing an LTO -c compile. Or
is there some other way that I am missing?

Incidentally, we'll also want to skip this new pass and keep any
referenced avail extern functions in the ThinLTO -c compile step for
the same reasons (and there are no imported functions yet at that
point).

Teresa

Agreed. Although I assume you mean invoke the new pass under a
ThinLTO-only option so that avail extern are not dropped in the
compile pass before the LTO link?

No, this pass would actually be an improvement to the standard -O2 pipeline.
The special case is the regular LTO pass pipeline, which wants to run
GlobalDCE but doesn’t want to drop available_externally function definitions
until after linker-stage inlining.

Ok. It looks to me like the LTO compile step with -O2 -flto -c uses
this same -O2 optimization pipeline as without LTO. Clang communicates
the fact that we are doing an LTO compile to llvm via the
-emit-llvm-bc flag, which just tells it to emit bitcode and exit
before codegen. So I would either need to key off of that or setup a
new flag to indicate to llvm that we are doing an LTO -c compile. Or
is there some other way that I am missing?

Incidentally, we’ll also want to skip this new pass and keep any
referenced avail extern functions in the ThinLTO -c compile step for
the same reasons (and there are no imported functions yet at that
point).

Ultimately for any planned LTO build we’re going to want to do a different pass pipeline, it’s probably worth considering what should be done before and during LTO.

-eric