On LLD performance

I spent a week optimizing LLD performance and just wanted to share what I found. Also, if anyone has a good idea on how to make it faster, I'd like to hear it.

My focus is mainly on Windows, but the optimizations are generally platform-neutral. I'm aiming at both single-thread and multi-thread performance.

r231434 is the change with the largest impact. It greatly reduced the time to link large binaries, as it reduced the order of the number of symbols we need to process for files within --start-group/--end-group. If you have many library files in a group, you will see a significant performance gain. That's probably not often the case on Unix. On Windows (and IIRC on Darwin), however, we move all library files to the end of the input file list and group them together, so it's effective there. It improves single-thread performance.

r231454 applies relocations in parallel in the writer using parallel_for. Because the number of relocations is usually pretty large and each application is independent, you get a basically linear performance gain from using threads. For example, applying 10M relocations previously took about 2 seconds on my machine (the number in the commit message is wrong); it now takes 300 milliseconds. This technique should be applicable to other ports.
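
To give a feel for the technique, here is a minimal, platform-neutral sketch (the Relocation type and the fixup function are hypothetical stand-ins; the real code uses parallel_for rather than raw threads):

```cpp
// Each relocation application is independent, so the flat relocation array
// can be split into chunks and processed by one thread per core.
#include <algorithm>
#include <cstdint>
#include <cstring>
#include <thread>
#include <vector>

struct Relocation {
  uint64_t offset;     // where in the output buffer to patch
  uint64_t targetAddr; // resolved address of the target symbol
};

static void applyRelocation(uint8_t *buf, const Relocation &r) {
  // Example fixup: write the low 32 bits of the target address.
  uint32_t v = static_cast<uint32_t>(r.targetAddr);
  std::memcpy(buf + r.offset, &v, sizeof(v));
}

void applyAllRelocations(uint8_t *buf, const std::vector<Relocation> &relocs) {
  unsigned n = std::max(1u, std::thread::hardware_concurrency());
  size_t chunk = (relocs.size() + n - 1) / n;
  std::vector<std::thread> workers;
  for (unsigned t = 0; t < n; ++t) {
    size_t begin = t * chunk, end = std::min(relocs.size(), begin + chunk);
    if (begin >= end)
      break;
    workers.emplace_back([&relocs, buf, begin, end] {
      for (size_t i = begin; i < end; ++i)
        applyRelocation(buf, relocs[i]);
    });
  }
  for (std::thread &w : workers)
    w.join();
}
```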

r231585 changes the algorithm for creating base relocations so that we can use parallel_sort. Unfortunately, base relocations are Windows-specific, so I don't think this technique is applicable to other ports.
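
For the curious, the shape of the problem is roughly this (a sketch of the PE .reloc layout as I understand it, not the actual LLD code; std::sort stands in for parallel_sort):

```cpp
// Gather the RVAs that need base relocations, sort them, then emit one
// .reloc block per 4K page with 12-bit page offsets.
#include <algorithm>
#include <cstdint>
#include <vector>

struct BaseRelocBlock {
  uint32_t pageRVA;
  std::vector<uint16_t> entries; // type (high 4 bits) | page offset (low 12 bits)
};

std::vector<BaseRelocBlock> createBaseRelocs(std::vector<uint32_t> rvas) {
  std::sort(rvas.begin(), rvas.end()); // this is the parallelizable step
  std::vector<BaseRelocBlock> blocks;
  for (uint32_t rva : rvas) {
    uint32_t page = rva & ~0xFFFu;
    if (blocks.empty() || blocks.back().pageRVA != page)
      blocks.push_back({page, {}});
    const uint16_t IMAGE_REL_BASED_HIGHLOW = 3; // 32-bit fixup type
    blocks.back().entries.push_back(
        static_cast<uint16_t>((IMAGE_REL_BASED_HIGHLOW << 12) | (rva & 0xFFF)));
  }
  return blocks;
}
```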

r231584 and r231432 are effective but minor improvements.

At this point, the resolver is the bottleneck. Readers introduce surprisingly little delay when linking large binaries, probably thanks to parallel file loading and archive member preloading, or maybe simply because file parsing is an easy task. The preloading hit rate is >99%, so when you need a symbol from an archive file, its member is almost always already parsed and ready to be used, and ArchiveFile::find() returns immediately with a result. Writers and other post-resolver passes seem reasonably fast. The dominant factor is the resolver.

What the resolver does is, roughly speaking, read files until all symbols are resolved, putting every symbol received from a file into a hash table. That's a kind of tight loop. In r231549 I cut the number of hash table lookups, but it looks like it's hard to optimize it much beyond that.
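
The kind of saving involved looks roughly like this (a toy sketch with made-up types, not the actual resolver code): do the lookup and the insertion with a single hash-table probe instead of a find() followed by an insert().

```cpp
#include <string>
#include <unordered_map>

struct Symbol {
  bool isDefined;
};

// Hypothetical precedence rule: a defined symbol beats an undefined one.
static bool shouldReplace(const Symbol *existing, const Symbol *candidate) {
  return !existing->isDefined && candidate->isDefined;
}

enum class Resolution { Inserted, KeptExisting, Replaced };

Resolution resolve(std::unordered_map<std::string, Symbol *> &table,
                   const std::string &name, Symbol *newSym) {
  auto result = table.insert({name, newSym}); // one hash lookup, not two
  if (result.second)
    return Resolution::Inserted;              // first time we see the name
  if (shouldReplace(result.first->second, newSym)) {
    result.first->second = newSym;
    return Resolution::Replaced;
  }
  return Resolution::KeptExisting;
}
```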

One idea to make the resolver faster would be to use a concurrent hash map to insert new symbols in parallel. Assuming symbols from the same file don't conflict with each other (I think that's a valid assumption), this can be parallelized. I wanted a single-thread performance gain, though. (Also, concurrent hash maps are not currently available in LLVM.)

Another idea is to eliminate the preprocessing pass that creates reverse edges in the atom graph. Some edges in the graph need to be treated as bi-directional, so that all connected atoms become live or dead as a group regardless of edge direction (so that depended-on symbols are not reclaimed by the garbage collector). We should probably just add two edges for each bi-directional edge in the first place, eliminating the need for the preprocessing pass entirely. It's definitely doable.
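
A sketch of the "always add both edges" idea, with made-up types:

```cpp
#include <vector>

struct Atom {
  std::vector<Atom *> successors; // atoms this atom keeps alive
};

// When a reference must keep both endpoints alive together, record the
// reverse edge at construction time, so the mark phase of garbage
// collection never needs a separate pass to build reverse edges.
void addEdge(Atom *from, Atom *to, bool bidirectional) {
  from->successors.push_back(to);
  if (bidirectional)
    to->successors.push_back(from); // reverse edge added up front
}
```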

An interesting idea that unfortunately didn't work is interning symbol names. I thought that computing the hash value of a symbol name was the bottleneck in the hash table, since C++ symbols can be very long. So I wrote a thread-safe string pool to intern (uniquify) symbol strings, so that symbol equivalence can be checked by pointer comparison. String interning was done in the reader, which is parallelized, so I expected it to improve the single-thread performance of the resolver. But I didn't observe a meaningful difference in performance.
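
A minimal thread-safe string pool along these lines looks something like this (a simplified sketch, not the actual patch I benchmarked):

```cpp
#include <mutex>
#include <string>
#include <unordered_set>

class StringPool {
public:
  // Returns a stable pointer: two symbols are equal iff their interned
  // pointers are equal.
  const std::string *intern(const std::string &s) {
    std::lock_guard<std::mutex> lock(mutex_);
    auto it = pool_.insert(s).first;
    return &*it; // element addresses in unordered_set stay valid on rehash
  }

private:
  std::mutex mutex_;
  std::unordered_set<std::string> pool_;
};
```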

Bottlenecks were not where I expected, as always (I've learned that again and again, yet I'm surprised every time). Maybe we need to revisit the "optimized" code in LLD to see whether it is actually premature optimization. If it is, we should rewrite it with simple code.

I'm currently trying the two ideas above. This mail is just FYI, but if you have any recommendations or ideas or whatever, hit "reply".

Instead of using a concurrent hash map you could fill multiple hash maps in parallel and then merge them afterwards. If the implementation caches the hash values (which it should for string keys), merging maps doesn't require recomputing the hash values and can be a relatively fast operation when there is only a small overlap between the maps.

More generally, if hash map operations make up a large share of the run time, it might be worth looking for situations where hash values are unnecessarily recomputed, e.g. when an item is conditionally inserted or when a hash map is grown or copied. (This is just a general observation, I don't know anything about LLD's implementation.)
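
To illustrate, a sketch with standard containers and hypothetical Symbol/File types (a real linker would use a thread pool rather than one thread per file):

```cpp
#include <string>
#include <thread>
#include <unordered_map>
#include <utility>
#include <vector>

struct Symbol {};
struct File {
  std::vector<std::pair<std::string, Symbol *>> symbols; // (name, symbol)
};

using SymbolTable = std::unordered_map<std::string, Symbol *>;

SymbolTable buildTable(const std::vector<File> &files) {
  // Each worker fills its own private map.
  std::vector<SymbolTable> partial(files.size());
  std::vector<std::thread> workers;
  for (size_t i = 0; i < files.size(); ++i)
    workers.emplace_back([&files, &partial, i] {
      for (const auto &kv : files[i].symbols)
        partial[i].emplace(kv.first, kv.second);
    });
  for (std::thread &w : workers)
    w.join();

  // Merge on the main thread.
  SymbolTable table;
  for (SymbolTable &p : partial)
    table.merge(p); // C++17: moves nodes; entries already in `table` win
  return table;
}
```

Whether the hash values survive the merge depends on whether the implementation caches them in the nodes, which is exactly the property worth checking.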

- Stephan

Yeah - I'd wonder more about the process here - each file has
non-overlapping items? Are the possibly-overlapping items between files
know-able? (only certain kinds of symbols) maybe you can use that to your
advantage in some way? (yeah, I know - super vague)

I tried benchmarking it on linux by linking clang Release+asserts (but
lld itself with no asserts). The first things I noticed were:

missing options:

warning: ignoring unknown argument: --no-add-needed
warning: ignoring unknown argument: -O3
warning: ignoring unknown argument: --gc-sections

I just removed them from the command line.

Looks like --hash-style=gnu and --build-id are just ignored, so I
removed them too.

Looks like --strip-all is ignored, so I removed and ran strip manually.

Looks like .note.GNU-stack is incorrectly added, neither gold nor
bfd.ld adds it for clang.

Looks like .gnu.version and .gnu.version_r are not implemented.

Curiously lld produces a tiny got.dyn (0x0000a0 bytes), not sure why
it is not included in .got.

Gold produces a .data.rel.ro.local. lld produces a .data.rel.local.
bfd puts everything in .data.rel. I have to research a bit to find out
what this is. For now I just added the sizes into a single entry.

.eh_frame_hdr is effectively empty on lld. I removed --eh-frame-hdr
from the command line.

With all that, the sections that increased in size the most when using lld were:

.rodata: 9 449 278 bytes bigger
.eh_frame: 438 376 bytes bigger
.comment: 77 797 bytes bigger
.data.rel.ro: 48 056 bytes bigger

The comment section is bigger because it has multiple copies of

clang version 3.7.0 (trunk 232021) (llvm/trunk 232027)

The lack of duplicate entry merging would also explain the size
difference of .rodata and .eh_frame. No idea why .data.rel.ro is
bigger.

So, with the big warning that both linkers are not doing exactly the
same thing, the performance numbers I got were:

lld:

       1961.842991 task-clock (msec)        #    0.999 CPUs utilized           ( +-  0.04% )
             1,152 context-switches         #    0.587 K/sec
                 0 cpu-migrations           #    0.000 K/sec                   ( +-100.00% )
           199,310 page-faults              #    0.102 M/sec                   ( +-  0.00% )
     5,893,291,145 cycles                   #    3.004 GHz                     ( +-  0.03% )
     3,329,741,079 stalled-cycles-frontend  #   56.50% frontend cycles idle    ( +-  0.05% )
   <not supported> stalled-cycles-backend
     6,255,727,902 instructions             #    1.06  insns per cycle
                                            #    0.53  stalled cycles per insn ( +-  0.01% )
     1,295,893,191 branches                 #  660.549 M/sec                   ( +-  0.01% )
        26,760,734 branch-misses            #    2.07% of all branches         ( +-  0.01% )

       1.963705923 seconds time elapsed                                        ( +-  0.04% )

gold:

        990.708786 task-clock (msec)        #    0.999 CPUs utilized           ( +-  0.06% )
                 0 context-switches         #    0.000 K/sec
                 0 cpu-migrations           #    0.000 K/sec                   ( +-100.00% )
            77,840 page-faults              #    0.079 M/sec
     2,976,552,629 cycles                   #    3.004 GHz                     ( +-  0.02% )
     1,384,720,988 stalled-cycles-frontend  #   46.52% frontend cycles idle    ( +-  0.04% )
   <not supported> stalled-cycles-backend
     4,105,948,264 instructions             #    1.38  insns per cycle
                                            #    0.34  stalled cycles per insn ( +-  0.00% )
       868,894,366 branches                 #  877.043 M/sec                   ( +-  0.00% )
        15,426,051 branch-misses            #    1.78% of all branches         ( +-  0.01% )

        0.991619294 seconds time elapsed                                       ( +-  0.06% )

The biggest difference that shows up is that lld has 1,152 context
switches, but the cpu utilization is still < 1. Maybe there is just a
threading bug somewhere?

From your description, we build a hash of symbols in an archive and for each undefined symbol in the overall link check if it is there. It would probably be more efficient to walk the symbols defined in an archive and check if it is needed by the overall link status, no? That would save building a hash table for each archive.

One big difference in how lld works is the atom model. It basically creates one Atom per symbol. That is inherently more work than what is done by gold. IMHO it is confusing what atoms are and how they should be specified in the object files.

It would be interesting to define an atom as the smallest thing that cannot be split. It could still have multiple symbols in it, for example, and there would be no such thing as an AbsoluteAtom, just an AbsoluteSymbol. In this model, the MachO reader would use symbols to create atoms, but that is just one way to do it. The ELF reader would create one atom per regular section and special-case SHF_MERGE and .eh_frame (but we should really fix this one in LLVM too).

The atoms created in this way (for ELF at least) would be freely
movable, further reducing the cost.

Cheers,
Rafael

Yeah - I'd wonder more about the process here - each file has
non-overlapping items? Are the possibly-overlapping items between files
know-able? (only certain kinds of symbols) maybe you can use that to your
advantage in some way? (yeah, I know - super vague)

Each file has non-overlapping symbols. We may be able to know the possibly-overlapping symbols between files, but I don't know if we can exploit that, because figuring it out is similar to what the linker actually does.

We can model an object file using two sets, U and D, containing the names of undefined symbols and defined symbols, respectively. For each file, U and D are non-overlapping. The linker has two sets, U' and D', containing the names of undefined symbols that still need to be resolved and the defined symbols it has collected so far. It reads files until U' is empty. For each file, undefined symbols are added or removed by U' = union(U' - D, U - D'), and defined symbols are added by D' = union(D', D). D' grows as we read more files, and U' shrinks.

And there are library files, which give the linker a choice of whether or not to read a member file. The linker reads only those member files that resolve undefined symbols.
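
In code, that model looks roughly like the following sketch (hypothetical types; std::set just keeps it short):

```cpp
#include <set>
#include <string>

struct InputFile {
  std::set<std::string> undefined; // U
  std::set<std::string> defined;   // D
};

struct LinkState {
  std::set<std::string> undefined; // U'
  std::set<std::string> defined;   // D'

  void addFile(const InputFile &f) {
    // U' = union(U' - D, U - D')
    for (const std::string &s : f.defined)
      undefined.erase(s);
    for (const std::string &s : f.undefined)
      if (!defined.count(s))
        undefined.insert(s);
    // D' = union(D', D)
    defined.insert(f.defined.begin(), f.defined.end());
  }

  // An archive member is worth loading only if it defines something in U'.
  bool resolvesSomething(const InputFile &member) const {
    for (const std::string &s : member.defined)
      if (undefined.count(s))
        return true;
    return false;
  }
};
```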

There are also odd things such as COMDAT groups or merge-not-by-name-but-by-content sections that may complicate the model. (I haven't thought about those yet.)

It feels to me that what the linker does is straightforward and hard to improve, but I can't prove it. There might still be room for improvement.

Not always; with weak symbols you can have overlap. You mentioned COMDAT as the other case.

Joerg

For comdats (on ELF) you should be able to avoid even reading the bits
from subsequent files once a comdat of a given name has been found.
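
Something along these lines, just to sketch the idea (the types are made up):

```cpp
#include <string>
#include <unordered_set>

struct SectionGroup {
  std::string signature; // COMDAT group signature symbol name
};

class ComdatTable {
public:
  // Returns true if this is the first (kept) group with this signature;
  // later files can skip the group's members without parsing them.
  bool shouldKeep(const SectionGroup &g) {
    return seen_.insert(g.signature).second;
  }

private:
  std::unordered_set<std::string> seen_;
};
```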

Cheers,
Rafael

Symbols are not resolved as part of reading. So this is not achievable with lld.

Shankar Easwaran

Looks like a design decision that might cost us some performance.

Cheers,
Rafael

Rafael,

This is very good information and extremely useful.

Curiously lld produces a tiny got.dyn (0x0000a0 bytes), not sure why
it is not included in .got.

I have a fix for this. Will merge it.

With all that, the sections that increased in size the most when using lld were:

.rodata: 9 449 278 bytes bigger
.eh_frame: 438 376 bytes bigger
.comment: 77 797 bytes bigger
.data.rel.ro: 48 056 bytes bigger

Did you try --merge-strings with lld? And --gc-sections?

The biggest difference that shows up is that lld has 1,152 context
switches, but the cpu utilization is still < 1. Maybe there is just a
threading bug somewhere?

lld apparently is highly multithreaded, but I see your point. Maybe trying this exercise on /dev/shm would show more cpu utilization?

Shankar Easwaran

The implementation of the threading class inside LLD is different between Windows and other platforms. On Windows, it's just a wrapper for the Microsoft ConcRT threading library. On other platforms, we have a simple implementation that mimics it. So, first of all, I don't know much about the numbers measured on Unix. (I didn't test that.)

But 1,152 context switches is a small number, I guess? It's unlikely that that number of context switches would make LLD two times slower than gold. I believe the bottleneck is something else. No one has really optimized the ELF reader, passes, and writers, so there might be some bad code there, but I probably shouldn't guess; I should measure.

One big difference in how lld works is the atom model. It basically creates one Atom per symbol. That is inherently more work than what is done by gold. IMHO it is confusing what atoms are and how they should be specified in the object files.

It would be interesting to define an atom as the smallest thing that cannot be split. It could still have multiple symbols in it, for example, and there would be no such thing as an AbsoluteAtom, just an AbsoluteSymbol. In this model, the MachO reader would use symbols to create atoms, but that is just one way to do it. The ELF reader would create one atom per regular section and special-case SHF_MERGE and .eh_frame (but we should really fix this one in LLVM too).

The atoms created in this way (for ELF at least) would be freely
movable, further reducing the cost.

I think I agree. Or, at least, the term "atom" is odd because it's not atomic. What we call an atom is a symbol with associated data. Usually all atoms created from the same section will be linked or excluded as a group; the section is what is actually indivisible (or atomic).

We don't have a notion of sections in the resolver. Many linker features are defined in terms of sections, so in order to handle them in the atom model we need to do something not straightforward. (For example, we copy section attributes to atoms so that they are preserved during the linking process, which means we copy the same attribute values to every atom created from the same section.)

Curiously lld produces a tiny got.dyn (0x0000a0 bytes), not sure why
it is not included in .got.

I have a fix for this. Will merge it.

Thanks.

.rodata: 9 449 278 bytes bigger
.eh_frame: 438 376 bytes bigger
.comment: 77 797 bytes bigger
.data.rel.ro: 48 056 bytes bigger

Did you try --merge-strings with lld? And --gc-sections?

I got

warning: ignoring unknown argument: --gc-sections

I will do a run with --merge-strings. This should probably be the default to match other ELF linkers.

The biggest difference that shows up is that lld has 1,152 context
switches, but the cpu utilization is still < 1. Maybe there is just a
threading bug somewhere?

lld apparently is highly multithreaded, but I see your point. Maybe trying this exercise on /dev/shm would show more cpu utilization?

Yes, the number just under 1 CPU utilized is very suspicious. As Rui points out, there is probably some issue in the threading implementation on Linux. One interesting experiment would be timing gold and lld linking ELF on Windows (but I have only a Windows VM and no idea what the "perf" equivalent is on Windows).

I forgot to mention, the tests were run on tmpfs already.

Cheers,
Rafael

Unfortunately, --gc-sections isn't implemented on the GNU driver. I
tried to enable it but I hit quite a few issues I'm slowly fixing. At
the time of writing the Resolver reclaims live atoms.

I think we can make an effort to reduce the number of context switches. In particular, we might try to switch to a model where the task is the basic unit of computation and a thread pool of workers is responsible for executing those tasks. This way we can tune the number of threads fighting for the CPU at the same time, maybe with a reasonable default that can be overridden by the user via command-line options. That said, since this would require some substantial changes, I wouldn't go down that path until we have strong evidence that the change will improve performance significantly. I feel that while context switches may have some impact on the final numbers, they can hardly account for a large part of the performance loss.

Another thing that comes to mind is that the relatively high number of context switches might be an effect of lock contention. If somebody has access to a VTune license and can run a "lock analysis" on it, that would be greatly appreciated. I don't have a Linux laptop/setup, but I'll try to collect some numbers on FreeBSD and investigate further over the weekend.

Thanks,

We do split tasks that way. Please take a look at include/lld/Core/Parallel.h. ThreadExecutor is a class that executes tasks, which you submit by calling its add() method; tasks are arbitrary callable objects. The number of threads we spawn for each ThreadExecutor is the same as std::thread::hardware_concurrency(), and we instantiate only one ThreadExecutor, so the threads shouldn't compete against each other for processor time slots (unless there's a bug).

I will do a run with --merge-strings. This should probably be the default to match other ELF linkers.

Trying --merge-strings with today's trunk I got

* comment got 77 797 bytes smaller.
* rodata got 9 394 257 bytes smaller.

Comparing with gold, comment now has the same size and rodata is 55 021 bytes bigger.

Amusingly, merging strings seems to make lld a bit faster. With
today's files I got:

We can significantly improve string merging performance by delaying the merge until after sections/atoms are garbage-collected. We currently do it very early, in the reader.
Are you using oprofile to get these stats?

Amusingly, merging strings seems to make lld a bit faster.

As a side note this is a really good thing. The idea is that the linker should largely be I/O bound and not processor bound.

-eric

We can significantly improve string merging performance by delaying the merge until after sections/atoms are garbage-collected. We currently do it very early, in the reader.

From my experience, I can say it's very hard to predict whether doing something will significantly improve performance just by looking at the code and thinking about it.

Are you using oprofile to get these stats?

No, perf. The exact command line is in the script I included in the previous email.

Cheers,
Rafael

As a side note this is a really good thing. The idea is that the linker should largely be I/O bound and not processor bound.

+1

This is very cool, everyone. Thank you for your work here.