[RFC] Propeller: A Framework for Post-Link Optimizations

From: Sriraman Tallam <tmsriram@google.com>
Sent: Friday, September 27, 2019 9:43 AM
To: Eli Friedman <efriedma@quicinc.com>
Cc: Xinliang David Li <xinliangli@gmail.com>; llvm-dev <llvm-dev@lists.llvm.org>
Subject: [EXT] Re: [llvm-dev] [RFC] Propeller: A frame work for Post Link
Optimizations

Why are you proposing to add a bunch of options to clang to manipulate LLVM
code generation, given none of those options are actually used for Propeller
workflows?

Where do you suggest the labelling and section options should exist? We
intended basic block sections to be similar to function sections in
terms of option handling.

Generating bitcode with/without propeller doesn’t actually affect the
generated bitcode, right? So you could say that Propeller is enabled with “-Wl,--enable-propeller”, and not change clang at all. I’m not a fan of adding options
just because we can. If basic block sections are only used as a sort of secret
handshake between the propeller compiler and the propeller linker, we can
change that interface later, if we want; if we expose it as a clang option, we’re
committing to making basic block sections continue to work the same way until
the end of time.

The generated MIR code is slightly different, as the later passes have
more CFI instructions, basic block labels, and extra direct jumps that
would otherwise have been fall-throughs. So, some LLVM option is
needed.

At link (LTO codegen) time, yes. But clang’s bitcode generation doesn’t change; only LTO code generation in lld changes.

After having spent more time on this, I think what you are suggesting is to always impose this workflow for Propeller:

Step 1: generate labels binary

clang -emit-llvm -c -O2 foo.cc bar.cc (generates optimized foo.bc and bar.bc)
./lld foo.bc bar.bc --basicblock-labels -o a.out.labels (generates labels binary)

Step 2: Profile into perf.data just like now

Step 3: Convert profiles just like now

Step 4: Re-optimize with propeller by invoking lld directly
./lld foo.bc bar.bc --basicblock-sections --profile=perf.propeller -o a.out

If this workflow is adopted, we might be able to get away without clang options, but I still think Step 1 could use a clang option to combine the two steps into one, and one more option in Step 4 might be useful if we want the invocation to go through clang.

Notice how this workflow is very similar to ThinLTO’s, and even ThinLTO needs a command-line option, -flto=thin, to break the process down into steps.

However, the issue is that this imposes an LTO-like workflow on Propeller. Propeller is LTO-agnostic and works even without LTO or regular FDO. The above does not support the simple workflow where you build each module with -O3, dump the native object files, and link. I think it would be beneficial to allow Propeller to be used everywhere by modifying makefiles to adopt it rather than constraining the workflow.
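For concreteness, a minimal sketch of that simpler per-module workflow is below. The clang spellings -fbasicblock-labels and -fbasicblock-sections are extrapolated from the options named in this thread, and the --profile linker flag is the one used in Step 4 above; none of this is meant as a fixed interface.

clang -O2 -c -fbasicblock-labels foo.cc bar.cc     # native objects, basic block labels only
clang -fuse-ld=lld foo.o bar.o -o a.out.labels     # Step 1: labels binary
perf record -o perf.data -- ./a.out.labels         # Step 2: profile as usual
# Step 3: convert perf.data to perf.propeller as usual
clang -O2 -c -fbasicblock-sections foo.cc bar.cc   # rebuild with one section per basic block
clang -fuse-ld=lld -Wl,--profile=perf.propeller foo.o bar.o -o a.out   # Step 4: optimized link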

Thanks
Sri


I see what you mean here.

I envisioned that basic block sections could also be useful on their own,
outside of Propeller. Hence, we made them like function sections, with
a separate option, -fbasicblock-sections. The Propeller option is an
umbrella option to make it easy to invoke Propeller.

I don’t think this is really true. Basic block labels means “insert labels that are useful for propeller profiling”, and basic block sections means “split the function into multiple chunks that are convenient for propeller optimization”. So it’s really only useful for propeller-like workflows. And I’d be concerned that future changes to propeller could affect the behavior of these options in the future, if the behavior isn’t specified in some rigorous way.

The idea of basic block sections was seeded by Rui Ueyama. When basic
block sections were originally looked at, Propeller had not yet been designed.
We looked at basic block sections as a finer granularity than function
sections. Section-prefix-based code layout, where you just create
smaller code sections and let the linker do what it does today, would
have much better results with basic block sections. Using a symbol ordering
file with basic block sections to do random orderings can be done
without invoking Propeller. Function sections have also found uses after they
were invented, such as Identical Code Folding.
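As a small illustration of that finer granularity, independent of Propeller (the -fbasicblock-sections spelling is the one used in this thread; the exact interface may change):

clang -O2 -c -fbasicblock-sections foo.cc -o foo.o
llvm-objdump --section-headers foo.o | grep '\.text'   # one .text.* section per eligible basic block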

I’m not sure that function sections are entirely an apt comparison
here for a few reasons:

  • function sections mostly gave an easy way to do ICF - using MIR or
    LLVM IR are also fairly effective means as is just using the linker to

I did not mean to suggest we should do folding with basic block sections and I am aware of the issues, though I am not ruling out the possibility. We had briefly brainstormed this.

AFAICT, function sections were not invented to do ICF. Function sections were used for linker garbage collection and global code layout; when we first worked on ICF for gold much later, we found function sections useful, and without them it would have been much harder.

compare bytes. Using function sections can still run into language
specific problems as well, hence why icf=safe vs icf=all. Though pcc’s

Agreed, and I would like to humbly suggest that I added those options to gold originally and am aware of that.

recent work helps out a little bit via significant address
determination in the front end

  • order files and other similar mechanisms work based on externally
    visible symbols and are fairly straightforward to use in an ABI safe
    way (no function → function fallthrough for example)
  • basic block sections can run into the multiple entry point problems
    depending on how they’re laid out as well as other ABI and visibility
    issues

This can likely be done safely, but there are definitely some concerns
here with how you want to enable the use of basic block
labels/sections.

I would like to expose basic block sections independently, and let me talk about one more effort we are pursuing. Independent of Propeller, we are looking at a machine-learning-based approach to code layout at link time. I won’t go into the details of the features, but having basic block sections allows us to naturally feed the native object files into a black box that would try several different reorderings of the sections and learn a model. We looked at basic block sections as the building block on top of which Propeller was built. I don’t think it will be the only application; just my opinion.

Thanks
Sri


I missed this point. Just to be clear: with the linker relaxation, a simple order file of basic block symbols can be used to reorder basic blocks in any manner. This can be done with the existing patches now, and we have used it extensively in our work to validate our layout decisions.
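As a sketch, assuming the labels binary exposes one symbol per basic block section (the symbol names below are made up purely for illustration), this uses lld's existing --symbol-ordering-file support:

cat > bb_order.txt <<'EOF'
foo
foo.bb.2
foo.bb.1
bar
EOF
clang -fuse-ld=lld -Wl,--symbol-ordering-file=bb_order.txt foo.o bar.o -o a.out.reordered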


I'm not sure that function sections are entirely an apt comparison
here for a few reasons:

- function sections mostly gave an easy way to do ICF - using MIR or
LLVM IR are also fairly effective means as is just using the linker to

I did not mean to suggest we should do folding with basic block sections and I am aware of the issues, though I am not ruling out the possibility. We had briefly brainstormed this.

AFAICT, Function Sections was not invented to do ICF. Function sections was used for linker garbage collection and global code layout and when we first worked on ICF for gold much later, we found function sections to be useful without which it would have been much harder.

They're both in the same arena.

compare bytes. Using function sections can still run into language
specific problems as well, hence why icf=safe vs icf=all. Though pcc's

Agreed, and I would like to humbly suggest that I added those options to gold originally and am aware of that.

Of course, but context is helpful for discussion and for others in the
future understanding the discussion.

recent work helps out a little bit via significant address
determination in the front end
- order files and other similar mechanisms work based on externally
visible symbols and are fairly straightforward to use in an ABI safe
way (no function -> function fallthrough for example)
- basic block sections can run into the multiple entry point problems
depending on how they're laid out as well as other ABI and visibility
issues

This can likely be done safely, but there are definitely some concerns
here with how you want to enable the use of basic block
labels/sections.

I would like to expose basic block sections independently and let me talk more about one more effort we are trying to do. Independent of Propeller, we are looking at a machine learning based approach to do code layout at link time. I won't go into the details of the features but having basic block sections allows us to naturally feed the native object files into a black box that would try several different reorderings of the sections and learn a model. We looked at basic block sections as the building block on top of which Propeller was built. I don't think it would be the only application, just my opinion.

Again a question of "why at the object level rather than a low level
IR" seems like something we should really explore the tradeoffs for at
more length.

-eric


This is so that we can do inter-procedural basic block ordering. Doing this at link time with all the basic blocks exposes a lot more inter-procedural ordering opportunities. At a low-level IR, we are still limited to intra-module basic block reordering.

Thanks
Sri


Seems like you could spend just as much work and reliability on making
MIR/late stage LTO compilation handle this for you?

-eric


Full LTO, yes; ThinLTO … no. We want this to be scalable, and ThinLTO is what we are more interested in.

Thanks
Sri


As Sri said, we need to be able to use this for ThinLTO, and I’m not sure how you would do whole-program BB layout in ThinLTO, which does its whole-program analysis much, much earlier than MIR. Eric, did you have something else in mind, or were you talking about full LTO?

Teresa

I guess Eric means whole-program/cross-module optimization using MIR. This is in theory workable in the full LTO style, but not in the ThinLTO style, which works on summary data. As we have discussed, eliminating monolithic-style optimization is the key design goal.

This was also briefly discussed in one of the previous replies I sent. There are other benefits of doing this in the linker, such as handling prebuilt libraries like libc.

David

One could imagine using MIR as an “object-file” format from the compiler to the linker – and then within the linker, doing block layout on the MIR and lowering to actual instructions.

I don’t know that I’d say that’s simpler than what you’ve described, however.

One could imagine using MIR as an “object-file” format from the compiler to the linker – and then within the linker, doing block layout on the MIR and lowering to actual instructions.

Sure, but this moves some of the currently parallel compilation (in ThinLTO at least) into a monolithic phase (the linker), which is what Propeller is trying to avoid in order to achieve scalability.

Teresa

Hi Sri and the team,

Thank you for sharing your proposal. It helps bring awareness to the importance
of context-sensitive and link-time optimizations in the compiler toolchain for
boosting the performance of large applications. The proposal also shows that
different approaches are possible to achieve a similar goal.

Putting basic blocks into separate sections is an ambitious approach with
expected challenges. For the first time, the real overheads of this approach
were measured at scale, and the results look quite staggering. I can imagine
how it might work for Google where the overhead of the 2-stage re-compilation
could be masked out by a distributed build system, and dealing with C++
exceptions is not an issue due to software development practices. At the same
time, it should be clear that for users without access to a distributed build
system, the actual build-time overhead will far exceed that of BOLT.
Considering the proposal in its current form, I'm not convinced it's the best
approach even for Google in the long run.

Since the proposal references BOLT in multiple places, and since I’m directly
involved with BOLT and hence potentially biased, I decided to break my feedback
into two parts, with the first part being unrelated to BOLT and hopefully being
as objective as possible.

Here’s a quick summary of my concerns, followed by more detailed feedback.

PERFORMANCE OF TARGET APPLICATIONS
  * Poor optimization of C++ code bases with exceptions
  * Pessimization/overhead for stack unwinding used by system-wide profilers and
    for exception handling
  * Increased memory footprint at application runtime
  * No clear plan for adding optimizations in the future without the need for
    more RFCs
  * Effectiveness of the relaxation pass
  * Inconsistencies in performance data
  * Debugging overhead

BUILD SYSTEM
  * Requirement for two different builds and hidden overhead of Propeller
  * Step 1 overheads
  * Enormous increase in object file size
  * Lack of scalability for actual code reordering during the optimization phase

PROFILING
  * Propeller-optimized binary for continuous deployment

PERFORMANCE OF TARGET APPLICATIONS

*Poor optimization for C++ code bases with exceptions*

From what I understand, your list of large real-world applications (you exclude
SPEC from the list, which I totally agree with) does not include a single one
that uses C++ exceptions. This is based on the assumption that exceptions are
not used at Google, and Clang is compiled without exception support by
default. I consider this a major omission from the evaluation and a drawback
of the proposal.

A lot of exception-handling code is generated by the compiler “behind the scenes”
and is invisible to the user, but is exposed to Propeller. Even if such code is
never executed, i.e. exceptions are not thrown, it is important to be able to
reorder the code in the function, including these blocks. The RFC says: “basic
blocks that are involved in exception handling, that is, they either throw or
catch an exception, are grouped together and placed in a single section. Unique
sections are not created for such basic blocks.” Further down the paragraph,
you say: “benchmarks which spend a significant amount of time handling
exceptions might not get the optimization benefits of Propeller.” The proposal,
intentionally or not, makes it look like applications that are compiled with
exception support but use exceptions only rarely, i.e. for error handling, are
not affected. Based on the limitations you describe above, I believe the code
reordering, and thus the optimization benefits, will be limited for such
applications.

If you cannot find any large real-world benchmark that uses C++ exceptions, then
at the very minimum I would like to see Clang itself compiled with
exception support and optimized by Propeller. Granted, it will not exercise
any exception-handling code, but at least the results will give an idea of how
well Propeller handles optimizing code with the mechanism enabled.
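For reference, such a build can be configured with the standard LLVM CMake options below; the generator and the rest of the configuration are just one plausible setup, not a requirement:

cmake -G Ninja -DCMAKE_BUILD_TYPE=Release \
      -DLLVM_ENABLE_PROJECTS=clang \
      -DLLVM_ENABLE_EH=ON -DLLVM_ENABLE_RTTI=ON ../llvm
ninja clang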

*Pessimization/overhead for stack unwinding used by system-wide profilers and
for exception handling*

Larger CFI programs put an extra burden on unwinding at runtime as more CFI
(and thus native) instructions have to be executed. This will cause more
overhead for any profiler that records stack traces, and, as you correctly note
in the proposal, for any program that heavily uses exceptions.

*Increased application memory footprint*

The increase of the .eh_frame section is up to 17x, which is quite significant.
Since it is part of the runtime image of the application, it is important to
know how much larger the application memory footprint gets. That is, in addition
to the “Total” size of the binary in the comparison table, I would like to see
“Total Allocatable” size data.
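A rough way to produce that number from the final binaries, assuming GNU readelf and gawk (the binary name is a placeholder):

readelf -lW a.out.propeller | gawk '$1 == "LOAD" { total += strtonum($6) } END { print total, "allocatable bytes" }'
readelf -SW a.out.propeller | grep '\.eh_frame'    # .eh_frame and .eh_frame_hdr sizes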

*No clear plan for adding optimizations in the future without the need for
additional RFCs*

Some optimizations other than basic block reordering are best executed at link
time. One example would be an optimization for macro-op fusion on x86-64 CPUs.
We’ve seen a real-world application, bzip2, regress by 5% in the case of
“unlucky” code placement. The optimization requires an analysis of instructions
(in this case, instruction pairs) and modification of the instruction stream,
such as insertion of nops. For the optimization to be performed in the
compiler, it would require functions to be aligned at 64 bytes, which is
generally not practical; hence the opportunity for link-time/binary
optimization. What is the plan for Propeller to handle the above and similar
cases?

*Effectiveness of the relaxation pass*

When you describe the relaxation pass, you do not mention whether you run it until it
converges, or run it conservatively, leaving larger-than-necessary versions
of jump instructions.

*Inconsistencies in performance data*

Performance data in bar charts indicate that Propeller is not achieving all
possible improvements for Clang, the only open-source benchmark in the
evaluation where code-layout optimizations matter. Cycles, I-Cache, and
branches-not-taken indicate that BOLT is doing better. At the same time, the
summary table shows 9% improvement for all. Which data set is correct?

*Debugging overhead*

As Propeller has to support debug info generation on a per-basic-block basis,
it increases DWARF sections corresponding to address ranges, location lists,
and line number info. How much larger are these sections with Propeller flags?
We are currently pushing the limits of 32-bit DWARF, and I wonder if you
encounter any related issues considering the Propeller bloat?

With larger debug info, the overhead flows to debuggers and tools using debug
information for annotation purposes, e.g. profilers reporting line numbers
corresponding to binary addresses. Depending on their usage of the debug info,
these tools will likely use more memory and spend more time parsing DWARF. Have
you considered/measured this overhead?

By the way, I didn’t see DWARF location lists specifically mentioned in the
proposal. I just want to make sure you account for the extra entries for those too.

BUILD SYSTEM

*Requirement for two different builds and hidden overhead of Propeller*

While discussing Memory Overhead and Runtime Overhead in the RFC, you chose to
include the overhead incurred only during link time. However, compared to a
regular highly-optimized AutoFDO+(Thin)LTO build, you require a second full
build of the application with (Thin)LTO. Even with the caching of IR, there’s
an extra cost of generating code not reflected in your evaluation. With tight
integration with the build system, it is possible to hide these costs
partially. In either case, since “all the benchmarks were built for the X86_64
platform on a 18 core Intel machine with 192 G of RAM“, you should be able to
measure extra time for compilation spent on this machine.

*Step 1 overheads*

Furthermore, wouldn’t the LLVM compiler incur extra overhead when compiling
with Propeller flags because of larger CFIs, debug info, etc.? I think it would
be fair to measure and report the resulting compile-time and memory overhead
for both builds (steps 1 and 4) versus a regular build.

Additionally, what are the runtime and memory overheads of linking in Step 1? Do
they correspond to the “All” column in the Experiments section? If you choose to
compare just the link-time overhead versus the baseline, it should come from
the combined link times of steps 1 and 4, since you have to link twice for
Propeller.

*Enormous increase in object file size*

Even with the optimized on-demand approach, the size of object files almost
doubles. Since in a distributed build system object files are cached and
shared, I’m not sure it is fair to describe this approach as being “friendly to
distributed build systems” since it increases the existing load.

What is the breakdown of the increase in object file sizes? As you mention in
the proposal, DWARF natively supports address ranges. However, specifying a
separate range for every basic block will introduce an overhead. Is this
increase included in the total overhead for the “Object File Sizes Bloat”?

*Lack of scalability for actual code reordering during the optimization phase*

When you talk about Propeller being a distributed and scalable system, you
mention that it has “a distributed part and a whole-program part”. Does this
mean that the part where you need to recompile the application is distributed
(if you are using a distributed build system), but the actual link and the code
reordering optimization are happening serially?

PROFILING

Real-world systems built around profile-feedback optimizations, such
as AutoFDO, are designed to benefit from feedback collected on a previous
revision of the highly optimized binary running in production. Does Propeller support
profiling binaries optimized by Propeller for the purpose of future
optimizations?

BOLT

Now to the BOLT part.

As you know, BOLT was originally designed to handle applications built using
an arbitrary toolchain, without any modification to the toolchain itself or the
build process (other than the “--emit-relocs” linker flag for maximum gains).
Integration with any particular toolchain, such as LLVM+lld, obviously brings
immediate advantages to the design, and thus, at least in theory, comparing BOLT
to such a system would not be apples-to-apples. However, I think BOLT stands
fairly competitive even in its current form. With a few enhancements and a
little help from the linker, the issues raised in the Propeller proposal could be
addressed without the need for the radical section-per-basic-block approach.

Recently (July 2019), we parallelized many passes in BOLT and improved its
runtime. If you ran experiments using an old version, you should try a more
recent one. For example, optimizing Clang 7.0, with a “.text” of ~50MB, takes
less than one minute. Hardly a use case requiring any further optimization or
justifying the need for a distributed system.

One of your main arguments is that Propeller has 20% of the memory footprint of
BOLT. This is achieved by limiting optimization to profiled functions, i.e.
typically less than 10% of all functions. BOLT’s main memory overhead comes
from the intermediate representation of all functions in the binary in the form
of MCPlus. If we decide to keep IR only for functions with profile information,
i.e. less than 10% of all functions, there is fundamentally no reason why BOLT’s
memory usage wouldn’t be decimated and brought on par with Propeller’s at link
time, perhaps even lower. We never found that to be a restriction in our
environment and haven’t heard of it being an issue from anyone other than you.

The following statement about BOLT in your proposal does not sound right: “with
BOLT, even small source changes invalidate the profile information, an error if
the binary’s build id does not match the profile [sic]. Hence, the binary
corresponding to the source change must be re-built and re-profiled.“ Quite the
opposite, BOLT is built to tolerate binary differences, including those
resulting from LTO, and even supports profiling previously BOLTed binaries. The
only place we check for the build ID is during the profile conversion phase
corresponding to Step 3 in your proposal. This is done to avoid any possible
user error, as the collection (Step 2) and conversion often happen on different
machines. It might be a good idea to implement this check in Propeller too.
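For what it is worth, that kind of check can be approximated with standard tools (file names are placeholders): compare the binary's GNU build ID against the build IDs recorded in the profile.

readelf -n ./a.out.labels | grep 'Build ID'    # build ID embedded in the binary
perf buildid-list -i perf.data                 # build IDs recorded in the profile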

Comparison on “Search1” includes data gathered with huge pages enabled for
Propeller and disabled for BOLT. For a direct comparison, you can include a
line with and without huge pages. If you cannot enable huge pages for BOLT, it
is fair to exclude the result.

I should also note that BOLT performs a set of optimizations beyond the code
layout that can benefit from context-sensitive profile information, such as
indirect call promotion, inlining, shrink-wrapping etc. For the full set of
optimizations implemented in BOLT, please check our CGO’19 paper
(https://research.fb.com/publications/bolt-a-practical-binary-optimizer-for-data-centers-and-beyond/).

BOLT’s additional benefits include support for code instrumentation and code
protection/hardening, e.g. Spectre/Meltdown mitigations, including for
assembly-written code.

Obviously, this is not the right place for an alternative proposal, but I
strongly believe that integrating BOLT with a linker, such as lld, would
address the majority of the issues and provide a seamless solution to the users
of the LLVM toolchain without the need for radical changes to the emitted code
and LLVM codegen. If scalability is indeed a concern, which so far is not
what we’ve seen, then we might consider changing LLVM itself to emit more data
or adding integration with a distributed build system. But again, IMHO this would be
a solution to a non-problem.

The number of CFI instructions that have to be executed when unwinding any given stack stays the same. The CFI instructions for a function have to be duplicated in every basic block section, but when performing unwinding only one such set is executed – the copy for the current basic block. However, this copy contains precisely the same CFI instructions as the ones that would have to be executed if there were no basic block sections.

Thanks for clarifying. This means once you move to the next basic block (or any other basic block in the function) you have to execute an entirely new set of CFI instructions except for the common CIE part. While indeed this is not as bad, on average, the overall active memory footprint will increase.

Creating one FDE per basic block means that .eh_frame_hdr, an allocatable section, will be bloated too. This will increase the FDE lookup time. I don’t see .eh_frame_hdr being mentioned in the proposal.

Maksim

I’m a bit confused by this subthread – doesn’t BOLT have the exact same CFI bloat issue? From my cursory reading of the Propeller doc, the CFI duplication is necessary to represent discontiguous functions, not anything particular to the way Propeller happens to generate those discontiguous functions.

And emitting discontiguous functions is a fundamental goal of this, right?

You’re correct, except that, in Propeller, CFI duplication happens for every basic block, as it operates under the conservative assumption that a block can be put anywhere by the linker. That’s a significant bloat that is not cleaned up later. So, at link time, if N blocks from the same function are contiguous in the final layout, as should happen most of the time for any sane BB order, we would have several FDEs for a region that needs only one. The bloat goes into the final binary (a lot more FDEs, specifically one FDE per basic block).
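A quick way to see the difference being discussed is to compare FDE counts in the two binaries (names are placeholders; readelf -wf dumps the frame sections):

readelf -wf a.out.baseline | grep -c FDE
readelf -wf a.out.propeller | grep -c FDE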

BOLT will only split a function into two parts, and only if it has a profile. Most of the time, a function is not split. It also has an option not to split at all. For internally reordered basic blocks of a given function, it has CFI deduplication logic (it will interpret and build the CFI states for each block and rewrite the CFIs in a way that uses the minimum number of instructions to encode the states for each block).

Maks and team, thanks for the detailed feedback; we will address all of your
concerns. Let’s begin with CFI and debug info, since those are already
being discussed.

TLDR; clang is pathological and the actual CFI bloat will go down from 7x to
2x.

Let us present the CFI bloat for each benchmark with the default option, which
is creating basic block sections only for functions with samples. For clang,
it is 7x and *not 17x* (the 17x number is for the all-sections option), and for the
large benchmarks it is less than 30%, or 1.3x. Among the large benchmarks, Storage is
the highest, going from 18M to 23M, ~30%. Clang is almost pathological here.
This is because 10% of all functions in clang have samples (are touched by the
profile information), and all of these functions get full basic block sections,
whereas for the large benchmarks only 0.5% of functions have samples. Now,
for clang, the 10% of functions that have samples also happen to contain 35% of all
basic blocks. This means we are creating sections for 35% of all basic blocks,
and the CFI bloat clearly shows.

Now, the data for clang also shows that only 7% of the basic blocks have
samples. We are working on restricting basic block sections to only those basic
blocks that have samples. The rest of the basic blocks (cold) in that function
would share the same section. With this, we would reduce the bloat of CFI
from 7x to 2x. This is not hard to do and we will follow up with this patch.

Also, for object-size bloat with regard to .eh_frame, the reasoning is similar.
Restricting section creation to only the basic blocks that have profiles will
reduce this a lot more.

Importantly, if CFI support were available for discontiguous ranges, we wouldn't
have to duplicate CFI FDEs and the bloat would be near minimal.

BOLT parses CFI and DWARF and generates compact information by rewriting it.
Propeller, on the other hand, uses lld, which uses relocations and sections to fix
things up but does not rewrite them. This is by design: lld is not DWARF- and CFI-aware.
We designed basic block sections to work just like function sections. The
compiler produces a bunch of sections and relocations; the linker patches the
relocations, and the debug info and CFI come out right. That's it. For CFI, since
there is no support for discontiguous ranges, we have to duplicate and dedup
FDEs, but only for blocks with sections. We are asking that CFI support
discontiguous ranges, which would make this look even simpler. Alternatively, if lld
were made DWARF- and CFI-aware, we could rewrite them compactly like BOLT does.
Either of these would help with the object-size and binary-size bloat.

We would also like to address the misconceptions around CFI instructions.

There are two things that need to be clarified here:

1) Extra CFI FDE entries for basic blocks do not mean more dynamic
instructions are executed. In fact, they do not increase at all. Krys
talked about this earlier.
2) We deduplicate the common static CFI instructions in the FDEs
and move them to the CIE. Hence, moving to a new basic block does not
mean a completely new set of CFI instructions is executed.

The next concern we want to address is about C++ exceptions, which was
considered a major omission:

TLDR; This can be implemented in Propeller easily. It was considered
low priority because exception-handling code is generally assumed not to be
performance-critical.

Currently, we do not create basic block sections for any block that is
involved in exception handling. There is nothing fundamentally
difficult about exception-handling basic blocks. The exception table
uses address offsets which currently do not use relocations, and we
must modify it to use relocations in order to enable sections for
exception-handling basic blocks.

We will re-prioritize and send out a patch to handle exception basic blocks.

Thanks
Sri

Some more information about the relaxation pass whose effectiveness
and convergence guarantees were listed as a concern:

TLDR; Our relaxation pass is similar to what LLVM’s MCAssembler does
but with a caveat for efficiency. Our experimental results show it is
efficient and convergence is guaranteed.

Our relaxation pass is very similar to what the MCAssembler does, as it
needs to solve the same problem: laying out basic blocks and
optimizing branch instruction sizes based on jump offsets. The optimal
algorithm for this is quadratic and involves computing
section offsets multiple times. Our relaxation pass has a caveat for
efficiency: it recomputes section offsets only after
all the sections have been processed. This is repeated until
convergence, whereas the MCAssembler recomputes section offsets
every time a section is modified. Our pass works well in the linker because
it minimizes section offset recomputation, which is more expensive
to do in the linker (whole program) than in the assembler
(per function).

We shrink jump instructions aggressively and then fix the wrongly
shrunk branches by running a pass that grows them back. We guarantee
convergence because we always shrink in the first round and always
grow in the second.

Data shows that our relaxation is efficient. For example, for clang,
the aggressive shrinking affected ~995,000 branches in 6 iterations, and
only 4 branches had to be grown back.