Best way to clean up empty global_ctors

Hi,

I'd like to fix PR19590, which is about llvm.global_ctors containing
functions that end up being empty after optimization (which causes the
linker to add useless init_array entries to the output binary).
globalopt removes empty functions from llvm.global_ctors, but by the
time the function becomes empty globalopt has already run and it
doesn't run again.

I'm wondering what the best fix is:
1. Should globalopt run once more after all other passes have run?
2. Or should the llvm.global_ctors optimization code in globalopt be
moved into a helper function somewhere that's called from both
globalopt and a new optimization pass cdtoropt that does nothing but
remove empty functions from llvm.global_ctors and llvm.global_dtors?
Then only this new pass would be added after all other passes. (This
new pass should run very quickly, it doesn't have to do much.)

Thanks for any advice,
Nico

Can we strengthen globalopt to symbolically execute at most one direct call to any given function, or is that crazy talk?

I'm not sure I understand what you mean. The unoptimized IR is

    @llvm.global_ctors = appending global [1 x { i32, void ()* }] [{
i32, void ()* } { i32 65535, void ()* @_GLOBAL__I_a }]

    define internal void @_GLOBAL__I_a() section
"__TEXT,__StaticInit,regular,pure_instructions" {
    entry:
      call void @__cxx_global_var_init()
      ret void
    }

    define internal void @__cxx_global_var_init() section
"__TEXT,__StaticInit,regular,pure_instructions" {
    entry:
      %0 = load i32* @_ZN3Bar18LINKER_INITIALIZEDE, align 4
      call void @_ZN3FooC1E17LinkerInitialized(%class.Foo* @foo, i32 %0)
      ret void
    }

    ; Function Attrs: ssp uwtable
    define linkonce_odr void
@_ZN3FooC1E17LinkerInitialized(%class.Foo* %this, i32) unnamed_addr #0
align 2 {
    entry:
      %this.addr = alloca %class.Foo*, align 8
      %.addr = alloca i32, align 4
      store %class.Foo* %this, %class.Foo** %this.addr, align 8
      store i32 %0, i32* %.addr, align 4
      %this1 = load %class.Foo** %this.addr
      %1 = load i32* %.addr, align 4
      call void @_ZN3FooC2E17LinkerInitialized(%class.Foo* %this1, i32 %1)
      ret void
    }

…which is quite a bit of stuff. But other passes successfully reduce this to

    @llvm.global_ctors = appending global [1 x { i32, void ()* }] [{
i32, void ()* } { i32 65535, void ()* @_GLOBAL__I_a }]

    ; Function Attrs: nounwind readnone
    define internal void @_GLOBAL__I_a() #1 section
"__TEXT,__StaticInit,regular,pure_instructions" {
    entry:
      ret void
    }

which is much nicer, and globalopt is able to reduce the latter bit to
an empty global_ctors list – the problem is that it runs before the IR
is optimized that much. Just doing globalopt once more (or doing just
the global_ctor bits of globalopt once more) is enough to fix the
problem. The existing machinery seems to do all that's needed, it just
needs to be used in a slightly different order as far as I understand.

How late do you need to move globalopt to get this to work? Do you get
any performance regressions if you just move it there?

Cheers,
Rafael

I talked about this with Nick in person months ago, and my understanding is that GlobalOpt is also an enabling optimization that needs to run early. For example, if we can eliminate an initializer to an internal global with no other stores to it, we can propagate the result.

Maybe we should run it twice.

That is probably fine, we just have to make sure compile time doesn't
surfer too much.

Cheers,
Rafael

Looking at GlobalOpt, it seems to do full symbolic evaluation of all
constructors, which is more than what's needed here. Just walking
llvm.global_ctors and removing calls to functions that are just "ret"
is all that's needed. Unless folks think that's a terrible idea, I'll
try to write a lightweight pass for doing just that.

Looking at GlobalOpt, it seems to do full symbolic evaluation of all
constructors, which is more than what's needed here. Just walking
llvm.global_ctors and removing calls to functions that are just "ret"
is all that's needed. Unless folks think that's a terrible idea, I'll
try to write a lightweight pass for doing just that.

Not a bad idea, we just normally avoid small passes like that. Looking
at -O2, looks like we run -globaldce really late, maybe this could be
done in it?

Cheers,
Rafael

Looking at GlobalOpt, it seems to do full symbolic evaluation of all
constructors, which is more than what's needed here. Just walking
llvm.global_ctors and removing calls to functions that are just "ret"
is all that's needed. Unless folks think that's a terrible idea, I'll
try to write a lightweight pass for doing just that.

Not a bad idea, we just normally avoid small passes like that.

That's good to know, thanks. Is there a some kind of 'philosophical
overview' document for llvm's pass design somewhere that explains
things like this?

Looking
at -O2, looks like we run -globaldce really late, maybe this could be
done in it?

I'll take a look, thanks!

There are many cases where another run of globalopt at the end would clean
things up.

I have a few concerns. One is that globalopt can miscompile, and I'm afraid
of making it do more or learn new tricks because that may expose its
miscompiley underbelly. The second is that a late run of globalopt opens up
all manner of new opportunities for the rest of the scalar optimizers,
which we don't run again. That's a sign that we should be designing it
differently, I just haven't thought about how exactly.

Teaching -globaldce about the ctors and dtors list sounds like a great plan
to me.

Nick

OK, you've said you don't want to teach globalopt any new tricks, so I'm
going to suggest teaching it a new trick :wink:

If globalopt sees the load of a variable but doesn't have a definition of
that variable, could it succeed and track that it has an unusable value
(but fail if that value is actually used anywhere)? This is perhaps just
addressing the symptom rather than the root cause, but it'd work for this
case (where the only reason globalopt can't symbolically execute the
function is because it loads and ignores an undefined global).