[RFC] Making .eh_frame more linker-friendly

Keeping .eh_frame separated should still simplifies the linker because
until the last step of building .eh_frame and .eh_frame_hdr, we don't
really need to parse .eh_frame sections. So, if we have separate .eh_frame
sections on -ffunction-sections, all we have to do is (1) garbage-collect
sections and (2) construct .eh_frame and .eh_frame_hdr sections from live
.eh_frame sections. At step 1, .eh_frame can be handled as a blob of data.
That is much simpler than (1) parsing all .eh_frame sections beforehand,
(2) garbage-collect .eh_frame shards, and (3) re-construct .eh_frame and
.eh_frame_hdr from .eh_frame shards.

In addition to that, keeping .eh_frame separated should greatly simplifies
the logic of identical code folding when comparing not only function
contents but exception handler information.

I tried to investigate how to implement this and suddenly found that Rafael
looks already tried to to the same in 2015: '[FYI patch] Produce multiple .eh_frame sections' - MARC.

Basing on comments, approach worked but had slowdowns for bfd and crashed
gold (or made it slower too). I can try to investigate this again and either reimplement
or ressurrect that code to check how multiple .eh_frame sections behave
nowadays, but few things are unclear for me.

Rafael, you wrote in description that you would not land that patch that time, do
you also think now we should try this again ? (since we have production ready LLD now).

(Assuming we want to try this again)
What are we going to do with possible slowdowns of GNU linkers ? I guess some option can be introduced
to keep old behavior for them, but than also not sure how LLD should handle 2 different types of
inputs (with single/multiple .eh_frame).

Any considerations of what should we try/check at first ? (I guess perfomance numbers for bfd/gold/LLD
is major point of interest, size of output, anything else ?).

George.

I tried to investigate how to implement this and suddenly found that Rafael
looks already tried to to the same in 2015: '[FYI patch] Produce multiple .eh_frame sections' - MARC.

Clarification: not the same, but something close. At least there are multiple .eh_frame sections in output.

As far I understand what we need for this experiment is for each text section emit its own .eh_frame and set
SHF_LINK_ORDER and sh_link field for the latter pointing to .text section. Looks code from patch above
can be adopted to do that.

George.

>Keeping .eh_frame separated should still simplifies the linker because
>until the last step of building .eh_frame and .eh_frame_hdr, we don't
>really need to parse .eh_frame sections. So, if we have separate .eh_frame
>sections on -ffunction-sections, all we have to do is (1) garbage-collect
>sections and (2) construct .eh_frame and .eh_frame_hdr sections from live
>.eh_frame sections. At step 1, .eh_frame can be handled as a blob of data.
>That is much simpler than (1) parsing all .eh_frame sections beforehand,
>(2) garbage-collect .eh_frame shards, and (3) re-construct .eh_frame and
>.eh_frame_hdr from .eh_frame shards.
>
>In addition to that, keeping .eh_frame separated should greatly simplifies
>the logic of identical code folding when comparing not only function
>contents but exception handler information.

I tried to investigate how to implement this and suddenly found that Rafael
looks already tried to to the same in 2015: MARC: Mailing list ARChives
commits&m=144683596826489.

Basing on comments, approach worked but had slowdowns for bfd and crashed
gold (or made it slower too). I can try to investigate this again and
either reimplement
or ressurrect that code to check how multiple .eh_frame sections behave
nowadays, but few things are unclear for me.

Rafael, you wrote in description that you would not land that patch that
time, do
you also think now we should try this again ? (since we have production
ready LLD now).

(Assuming we want to try this again)
What are we going to do with possible slowdowns of GNU linkers ? I guess
some option can be introduced
to keep old behavior for them, but than also not sure how LLD should
handle 2 different types of
inputs (with single/multiple .eh_frame).

Thank you for taking a look. I think that the answer depends on how much
slower GNU linkers are with separate .eh_frame sections. If it is not too
slow, it may make sense to generate split .eh_frame sections
unconditionally. Otherwise, we might want to add a new option so that clang
doesn't produce split .eh_frame sections by default.

Either way, my aim is to make lld handle only split .eh_frame sections to
do gc-sections or ICF. Supporting both non-split and split section doesn't
simplify it at all and therefore doesn't make sense. It'd rather increase
the work we need to do.

Any considerations of what should we try/check at first ? (I guess

Thank you for taking a look. I think that the answer depends on how much slower GNU linkers are with separate .eh_frame sections. If it is not too slow, it may make >sense to generate split .eh_frame sections unconditionally. Otherwise, we might want to add a new option so that clang doesn’t produce split .eh_frame sections by >default.

I’ll start investigating the implementation and try to update/reimplement and check that all.​

(Can probably take some time because want to investigate llvm/mc code a little as it is new for me).

Just finished lib/MC patch ​that creates .eh_frame section for each text sections,

sets SHF_LINK_ORDER flag and proper sh_link field.

Patch based on Rafael’s code mentioned earlier in this thread, uploaded it just

in case anyone is interested to look at current version,

it is here: https://reviews.llvm.org/D40352.

Now going to start testing of how it works with GNU and LLD linkers.

George.

I performed tests basing on first diff of https://reviews.llvm.org/D40352.
(Creates .eh_frame for each .text.*, sets SHF_LINK_ORDER and .sh_link of created
.eh_frame section to point to corresponding .text.)

With use of GNU ld (GNU Binutils) 2.29.51.20171006 it reports errors when linking sample apps:
~/LLVM/Release/bin/clang++ test.cpp -ffunction-sections -o test.o
/usr/local/bin/ld: .eh_frame has both ordered [.eh_frame' in /tmp/test-dbc52e.o] and unordered [.eh_frame’ in /usr/lib/gcc/x86_64-linux-gnu/5.4.1/…/…/…/x86_64-linux-gnu/crt1.o] sections
/usr/local/bin/ld: final link failed: Bad value

Looks it’s code explicitly restricts mixing sections with link order flag and without:
https://github.com/gittup/binutils/blob/gittup/bfd/elflink.c#L9991

With GNU gold (GNU Binutils 2.29.51.20171006) 1.14 have an assert:
~/LLVM/Release/bin/clang++ test.cpp -ffunction-sections -o test.o
/usr/local/bin/ld: internal error in layout_eh_frame_section, at …/./…/gold/object.cc:1309
It is that place: https://github.com/gittup/binutils/blob/gittup/gold/object.cc#L1372
Did not investigate it, but it looks it is place (=?big5?b?RG91ZyBLd2F - [PATCH] Gold: Do not emit locals from discard .eh_frame sections.)
mentioned in comment for '[FYI patch] Produce multiple .eh_frame sections' - MARC.

LLD ~head fails here https://github.com/llvm-mirror/lld/blob/master/ELF/InputFiles.cpp#L392 as
we are trying to do cast to InputSection, though .eh_frame is not regular InputSection but EhInputSection.

Basing on all above I supposed it should be fine to not emit SHF_LINK_ORDER flags for start.
I changed implementation so that it emits multiple .eh_frames and sets sh_link field for
SHT_X86_64_UNWIND sections, but don’t set the SHF_LINK_ORDER flag. In that way only
x64 target is affected and SHT_X86_64_UNWIND sections are known to be .eh_frame so can be
handled on linker side not by name, what is also nice for start it seems.
(Uploaded current version as diff 2 here: https://reviews.llvm.org/D40352)

With such change LLD and bfd links fine, and I was able to test clang link time. Clang does not
use exceptions, but x64 ABI still requires .eh_frame sections for functions, and result objects
contains many of them so I think it is reasonable test. What I did: built clang with and without
patch, used result compiler binaries for producing 2 sets of release clang objects
(flags were -fPIC -std=c++11 -ffunction-sections -fdata-sections), used them for testing link time
with use of ld.bfd and LLD.

Results (30 runs for each):

  1. ld.bfd link time before patch: 2,922310802 seconds time elapsed ( ± 0,13% )
    Size: 80 667 832 bytes
  2. ld.bfd link time after patch: 3,773565826 seconds time elapsed ( ± 0,13% )
    Size: 80 667 832 bytes
  3. LLD link time before patch: 0,419247946 seconds time elapsed ( ± 0,48% )
    Size: 80 738 240 bytes
  4. LLD link time after patch: 0,460139012 seconds time elapsed ( ± 0,44% )
    Size: 80 738 240 bytes

There is no difference in size (that is probably expected as linkers do deduplication optimization, so looks right for me),
slowdown for bfd is about 29%, for LLD about 9.7%.

What we gonna do next ? My plan is to prepare demo patch for LLD to stop parsing .eh_frames
for GC step and benchmark the results with use of -gc-sections. It also can show amount of LLD
code reduction/simplification we can have.

George.

In that way only x64 target is affected and SHT_X86_64_UNWIND sections are known to be .eh_frame so can be

handled on linker side not by name, what is also nice for start it seems.
(Uploaded current version as diff 2 here: https://reviews.llvm.org/D40352)

Taking this back. If we will emit multiple eh_frame we anyways anyways will need either use some flag or
handle “.eh_frame” by name for other targets, so doing that for SHT_X86_64_UNWIND did not have much sence.

George.

What we gonna do next ? My plan is to prepare demo patch for LLD to stop parsing .eh_frames

for GC step and benchmark the results with use of -gc-sections. It also can show amount of LLD
code reduction/simplification we can have.

George.

Demo patch for LLD that stops parsing .eh_frame during GC stage

is: https://reviews.llvm.org/D40484. With it LLD GC code is slightly simpler.

I tested it both together with lib/mc patch for emiting multiple .eh_frames

(https://reviews.llvm.org/D40352) and along, it looks it has no any visible effect on

perfomance by itself.

George.

With GNU gold (GNU Binutils 2.29.51.20171006) 1.14 have an assert:
~/LLVM/Release/bin/clang++ test.cpp -ffunction-sections -o test.o
/usr/local/bin/ld: internal error in layout_eh_frame_section, at
.././../gold/object.cc:1309
It is that place:
https://github.com/gittup/binutils/blob/gittup/gold/object.cc#L1372
Did not investigate it, but it looks it is place
(=?big5?b?RG91ZyBLd2F - [PATCH] Gold: Do not emit locals from discard .eh_frame sections.)
mentioned in comment for
'[FYI patch] Produce multiple .eh_frame sections' - MARC.

I've committed a patch in gold that should fix this problem:

   Cary Coutant - [gold commit] Allow multiple .eh_frame sections per object file

Can you try gold again with this patch applied? You should at least
get a little further.

If it still doesn't work, could I trouble you for a sample object file?

-cary

With GNU gold (GNU Binutils 2.29.51.20171006) 1.14 have an assert:
~/LLVM/Release/bin/clang++ test.cpp -ffunction-sections -o test.o
/usr/local/bin/ld: internal error in layout_eh_frame_section, at
.././../gold/object.cc:1309
It is that place:
https://github.com/gittup/binutils/blob/gittup/gold/object.cc#L1372
Did not investigate it, but it looks it is place
(=?big5?b?RG91ZyBLd2F - [PATCH] Gold: Do not emit locals from discard .eh_frame sections.)
mentioned in comment for
'[FYI patch] Produce multiple .eh_frame sections' - MARC.

I've committed a patch in gold that should fix this problem:

  Cary Coutant - [gold commit] Allow multiple .eh_frame sections per object file

Can you try gold again with this patch applied? You should at least
get a little further.

If it still doesn't work, could I trouble you for a sample object file?

-cary

I'll try it soon and return with results, thanks !

Just a small clarification: your commit message saying that "LLVM is experimenting with placing
.eh_frame sections in the COMDAT group", that is true for original experiment which faced this gold issue
('[FYI patch] Produce multiple .eh_frame sections' - MARC), current approach I am experimented
with (https://reviews.llvm.org/D40352) sets sh_link of .eh_frame to corresponding .text section instead.
Though anyways issue looks to be that gold did not like to have multiple .eh_frame sections and that
is what both approaches faced with and your patch seeems fixes that.

George.

With GNU gold (GNU Binutils 2.29.51.20171006) 1.14 have an assert:
~/LLVM/Release/bin/clang++ test.cpp -ffunction-sections -o test.o
/usr/local/bin/ld: internal error in layout_eh_frame_section, at
.././../gold/object.cc:1309
It is that place:
https://github.com/gittup/binutils/blob/gittup/gold/object.cc#L1372
Did not investigate it, but it looks it is place
(=?big5?b?RG91ZyBLd2F - [PATCH] Gold: Do not emit locals from discard .eh_frame sections.)
mentioned in comment for
'[FYI patch] Produce multiple .eh_frame sections' - MARC.

I've committed a patch in gold that should fix this problem:

  Cary Coutant - [gold commit] Allow multiple .eh_frame sections per object file

Can you try gold again with this patch applied? You should at least
get a little further.

If it still doesn't work, could I trouble you for a sample object file?

-cary

I'll try it soon and return with results, thanks !

I can confirm your patch fixes gold behavior.

I built latest binutils and performed benchmark tests again today.
Following versions of linkers were used:
* GNU ld (GNU Binutils) 2.29.51.20171129
* GNU gold (GNU Binutils 2.29.51.20171129) 1.14
* LLD 6.0.0 (trunk 319302) (compatible with GNU linkers)

--no-threads was set for gold and LLD tests.

Clang link time with single .eh_frame section in objects.
* ld.bfd: 2,940055212 seconds time elapsed ( +- 0,17% ) (t1)
* ld.gold: 0,994370076 seconds time elapsed ( +- 0,11% ) (t2)
* LLD: 0,445566042 seconds time elapsed ( +- 0,32% ) (t3)

Clang link time with multiple .eh_frame sections in objects.
(latest diff 3 of https://reviews.llvm.org/D40352 applied to produce them).
* ld.bfd: 3,792698701 seconds time elapsed ( +- 0,19% ) (1,29*t1)
* ld.gold: 1,135187654 seconds time elapsed ( +- 0,10% ) (1,1416*t2)
* LLD: 0,506076638 seconds time elapsed ( +- 0,31% ) (1,1358*t3)

George.

I am very interested in reviving this.

Did anyone get any further with these ideas?

@Grimar: Did you do any profiling of the code? Were the slowdowns

you were seeing fundamental (i.e. due to IO) or could a more optimal

implementation reduce the slowdown? Did you do any end to end

timings for compilation + link time?

The same issues arise for all metadata sections:

.eh_frame

.debug_*

.stack_sizes

etc…

In our proprietary linker we’ve had to implement special handling

for each of these sections, which we’d love to be able to remove or

reduce.

One fundamental problem is overhead. I posted about

this on the gabi list:

https://groups.google.com/d/msg/generic-abi/A-1rbP8hFCA/EDA7Sf3KBwAJ

Take the .stack_sizes section. There is an llvm bug

which suggests that we should produce multiple .stack_size

sections rather than one monolithic .stack_size section:

https://bugs.llvm.org/show_bug.cgi?id=36717.

However, for .stack_sizes the payload is on average 10 bytes

per function. Splitting into multiple sections on elf x86-64 adds

a overhead of 128 bytes per function. An order of magnitude

increase. However, I don’t know of any way to avoid this increase

without adding special case code to the linker?

Another thought is that although the gnu linkers are a concern

upstream, ​on our platform (and others where we are fully in control),

we could use this approach for .eh_frame. We would be able to test

and maintain the separate code paths in the backend.

@Grimar: Did you do any profiling of the code? Were the slowdowns

you were seeing fundamental (i.e. due to IO) or could a more optimal

implementation reduce the slowdown? Did you do any end to end

timings for compilation + link time?

No, as far I remember I did not profile this. All results I had were about linker
timing for linking clang (posted in this thread).

I think the slowdown is natural. The more input sections we have the slower we are.
Since LLD is very fast by nature, honestly I do not really believe there is a huge
chance to significantly boost the observed ~12(?) percents slowdown.

Otherwise, that would probably be done earlier for the common case.

George.

I performed additional benchmarking today and also profiled this.

Previously I tried to link clang. I decided to take something much larger today.
So what I did was: took chromium sources and built 2 sets of objects.
One of them was built with vanilla clang and one with clang+https://reviews.llvm.org/D40352 patch applied.
As a result, the second set of object contained multiple .eh_frames (for each text section),
and the first set was just regular set.

Link times after 200 runs were:

  1. ~2332ms for the regular set.

  2. ~2464ms for “D40352” set.

The difference is about 6%. Does not look too scary as ~10% for clang link time I had earlier actually.
Note that I did not apply any other patches than D40352. For example, there is a draft patch for linker
that could use the benefit of objects with multiple .eh_frames to significantly simplify and even slightly
improve the -gc-sections implementation: https://reviews.llvm.org/D40484.
It could save some time for the case with GC probably.

Also, I tried to profile the difference observed but did not found any visible bottlenecks.
We seem just become a bit slower everywhere in the linker, I believe this is caused by natural
reasons: more sections, larger inputs → slower result.

(I shared both reproduce sets used here: https://drive.google.com/open?id=15tGIypHOATiodxISRCAbg5iiGTBFXAtc).​