LLD: time to enable --threads by default

LLD supports multi-threading, and it seems to be working well, as you can see in a recent result. In short, LLD runs 30% faster with the --threads option and more than 50% faster if you are also using --build-id (your mileage may vary depending on your computer). However, I don’t think most users even know about it, because --threads is not a default option.

I’m thinking of enabling --threads by default. We now have real users, and they’ll be happy about the performance boost.

Any concerns?

I can’t think of any problems with it, but I want to add a few notes:

  • We still need to focus on single-thread performance rather than multi-threaded performance, because it is hard to make a slow program faster just by using more threads.

  • We shouldn’t do “too clever” things with threads. Currently, we are using multiple threads in only two places, where the work is highly parallelizable by nature (namely, copying and applying relocations for each input section, and computing the build-id hash). We are using parallel_for_each, which is very simple and easy to understand. I believe this was the right design choice, and I don’t think we want something like the workqueues/tasks in GNU gold, for example. (A sketch of this pattern follows the list.)

  • Run benchmarks with --no-threads if you are not focusing on multi-thread performance.
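
For illustration, here is a minimal, self-contained sketch of that parallel for-each pattern in plain C++. It is not LLD's actual code (LLD has its own parallel_for_each helper, and real input sections are more involved); the Section type below is a hypothetical stand-in.

// Illustrative sketch only -- not LLD's implementation. Split an
// independent range across hardware threads and apply the same function
// to each element, which is the shape of the two parallel loops above.
#include <algorithm>
#include <cstddef>
#include <thread>
#include <vector>

template <typename Iter, typename Func>
void parallelForEach(Iter Begin, Iter End, Func F) {
  unsigned NumWorkers = std::max(1u, std::thread::hardware_concurrency());
  std::size_t Total = static_cast<std::size_t>(End - Begin);
  std::size_t Chunk = (Total + NumWorkers - 1) / NumWorkers;
  std::vector<std::thread> Workers;
  for (std::size_t I = 0; I < Total; I += Chunk) {
    Iter First = Begin + I;
    Iter Last = Begin + std::min(Total, I + Chunk);
    Workers.emplace_back([=] { std::for_each(First, Last, F); });
  }
  for (std::thread &T : Workers)
    T.join();
}

// Hypothetical, simplified stand-in for an input section: each one is
// copied to its own fixed offset in the output buffer, so the loop
// bodies never touch the same bytes and need no locking.
struct Section {
  std::size_t OutOffset;
  std::vector<char> Data;
};

void writeSections(std::vector<Section> &Sections, char *OutBuf) {
  parallelForEach(Sections.begin(), Sections.end(), [&](Section &S) {
    std::copy(S.Data.begin(), S.Data.end(), OutBuf + S.OutOffset);
  });
}

The important property is that each element can be written without coordinating with the others, which is why such a simple split is enough and no work queue is needed.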

I will do a quick benchmark run.

Other than the observations you have, my only concern is the situation
where many lld invocations run in parallel, like in an llvm build where
there are many outputs in bin/. Our task system doesn't know about load,
so I worry that it might degrade performance in that case.

Cheers,
Rafael

> I'm thinking of enabling --threads by default. We now have real users, and
> they'll be happy about the performance boost.

Will it detect single-core computers and disable threading? What is the
minimum number of threads that can run in that mode?

Is the penalty on dual-core computers less than the gains? If you
could have a VM with only two cores, where the OS is running on one
and LLD threads are running on both, it'd be good to measure the
slowdown.

Rafael's concern is also very real. I/O and memory consumption are
important factors on small-footprint systems, though I'd be happy to
have a different default per architecture, or even carry the burden of
forcing a --no-threads option on every run if the benefits are
substantial.

If those issues are not a concern, then I'm in favour!

> - We still need to focus on single-thread performance rather than
> multi-threaded performance, because it is hard to make a slow program
> faster just by using more threads.

Agreed.

> - We shouldn't do "too clever" things with threads. Currently, we are using
> multiple threads in only two places, where the work is highly parallelizable
> by nature (namely, copying and applying relocations for each input section,
> and computing the build-id hash). We are using parallel_for_each, which is
> very simple and easy to understand. I believe this was the right design
> choice, and I don't think we want something like the workqueues/tasks in
> GNU gold, for example.

Strongly agreed.

cheers,
--renato

> I'm thinking of enabling --threads by default. We now have real users, and
> they'll be happy about the performance boost.

> Will it detect single-core computers and disable threading? What is the
> minimum number of threads that can run in that mode?
>
> Is the penalty on dual-core computers less than the gains? If you
> could have a VM with only two cores, where the OS is running on one
> and LLD threads are running on both, it'd be good to measure the
> slowdown.

As a quick test, I ran the benchmark again with "taskset -c 0" to use only
one core. LLD still spawns 40 threads because my machine has 40 cores (20
physical cores), so all 40 threads ran on one core.

With --no-threads (one thread on a single core), it took 6.66 seconds to
self-link. With --threads (40 threads on a single core), it took 6.70
seconds. I guess they are mostly within the error margin, so I don't think
this would hurt single-core machines.

Rafael may be running his own benchmarks and will post his results.

> I'm thinking of enabling --threads by default. We now have real users, and
> they'll be happy about the performance boost.

> Will it detect single-core computers and disable threading? What is the
> minimum number of threads that can run in that mode?
>
> Is the penalty on dual-core computers less than the gains? If you
> could have a VM with only two cores, where the OS is running on one
> and LLD threads are running on both, it'd be good to measure the
> slowdown.

> Rafael's concern is also very real. I/O and memory consumption are
> important factors on small-footprint systems, though I'd be happy to
> have a different default per architecture, or even carry the burden of
> forcing a --no-threads option on every run if the benefits are
> substantial.

On such a computer, you don't want to enable threads at all, no? If so, you
can build LLVM without LLVM_ENABLE_THREADS.

ARM hardware varies greatly; you don't want to restrict things that much.

Some boards have one core and 512MB of RAM, others have 8 cores and
1GB, others 8 cores and 32GB, others 96 cores, and so on.

You have mostly answered my questions, though.

Going from single-threaded to multi-threaded on a single core is mostly
noise, and the number of threads is detected from the number of available
cores. That should be fine in most cases, even on old ARM.

cheers,
--renato
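
For reference, the core-count detection described above boils down to something like this minimal sketch; the helper name is made up, and LLD's actual logic may differ.

// Sketch: pick a default thread count from the detected core count.
#include <algorithm>
#include <thread>

unsigned defaultThreadCount() {
  // hardware_concurrency() may return 0 when the value is unknown,
  // so fall back to a single thread in that case.
  return std::max(1u, std::thread::hardware_concurrency());
}

Note that this reports the number of cores the hardware exposes; as the taskset experiment earlier in the thread suggests, the detected count may not shrink when the process is pinned to fewer cores.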

On a Mac Pro (running Linux), these are the results I got with all cores available:

firefox
  master 7.146418217
  patch 5.304271767 1.34729488437x faster
firefox-gc
  master 7.316743822
  patch 5.46436812 1.33899174824x faster
chromium
  master 4.265597914
  patch 3.972218527 1.07385781648x faster
chromium fast
  master 1.823614026
  patch 1.686059427 1.08158348205x faster
the gold plugin
  master 0.340167513
  patch 0.318601465 1.06768973269x faster
clang
  master 0.579914119
  patch 0.520784947 1.11353855817x faster
llvm-as
  master 0.03323043
  patch 0.041571719 1.251013574x slower
the gold plugin fsds
  master 0.36675887
  patch 0.350970944 1.04498356992x faster
clang fsds
  master 0.656180056
  patch 0.591607603 1.10914743602x faster
llvm-as fsds
  master 0.030324313
  patch 0.040045353 1.32056917497x slower
scylla
  master 3.23378908
  patch 2.019191831 1.60152642773x faster

With only 2 cores:

firefox
  master 7.174839911
  patch 6.319808477 1.13529388384x faster
firefox-gc
  master 7.345525844
  patch 6.493005841 1.13129820362x faster
chromium
  master 4.180752414
  patch 4.129515199 1.01240756179x faster
chromium fast
  master 1.847296843
  patch 1.78837299 1.0329483018x faster
the gold plugin
  master 0.341725451
  patch 0.339943222 1.0052427255x faster
clang
  master 0.581901114
  patch 0.566932481 1.02640284955x faster
llvm-as
  master 0.03381059
  patch 0.036671392 1.08461260215x slower
the gold plugin fsds
  master 0.369184003
  patch 0.368774353 1.00111084189x faster
clang fsds
  master 0.660120583
  patch 0.641040511 1.02976422187x faster
llvm-as fsds
  master 0.031074029
  patch 0.035421531 1.13990789543x slower
scylla
  master 3.243011681
  patch 2.630991522 1.23261958615x faster

With only 1 core:

firefox
  master 7.174323116
  patch 7.301968002 1.01779190649x slower
firefox-gc
  master 7.339104117
  patch 7.466171668 1.01731376868x slower
chromium
  master 4.176958448
  patch 4.188387233 1.00273615003x slower
chromium fast
  master 1.848922713
  patch 1.858714219 1.00529578978x slower
the gold plugin
  master 0.342383846
  patch 0.347106743 1.01379415838x slower
clang
  master 0.582476955
  patch 0.600524655 1.03098440178x slower
llvm-as
  master 0.033248459
  patch 0.035622988 1.07141771593x slower
the gold plugin fsds
  master 0.369510236
  patch 0.376390506 1.01861997133x slower
clang fsds
  master 0.661267753
  patch 0.683417482 1.03349585535x slower
llvm-as fsds
  master 0.030574688
  patch 0.033052779 1.08105041006x slower
scylla
  master 3.236604638
  patch 3.325831407 1.02756801617x slower

Given that we have an improvement even with just two cores available, LGTM.

Cheers,
Rafael

By the way, while running the benchmark, I found that our SHA1 function seems much slower than the one in gold. gold slowed down by only 1.3 seconds when computing a SHA1 of the output, but we spent 6.0 seconds to do the same thing (I believe). Something doesn’t seem right.

Here are the times to link the same binary with --no-threads and --build-id={none,md5,sha1}. The numbers are in seconds.

        LLD    gold
none    7.82   13.78
MD5     9.68   14.56
SHA1   13.85   15.05
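
For anyone who wants to compare SHA1 implementations outside the linker, a rough harness along these lines is enough. The HashFn callback here is a placeholder for whichever implementation is under test; it is not a specific LLVM or gold API.

// Fill a large synthetic buffer (standing in for the output image that
// --build-id hashes) and time one pass of the given hash over it.
#include <chrono>
#include <cstddef>
#include <cstdint>
#include <cstdio>
#include <functional>
#include <vector>

using HashFn = std::function<void(const std::uint8_t *Data, std::size_t Size)>;

void timeHash(const char *Name, HashFn Hash, std::size_t Size) {
  std::vector<std::uint8_t> Buf(Size, 0xAB);
  auto Start = std::chrono::steady_clock::now();
  Hash(Buf.data(), Buf.size());
  std::chrono::duration<double> Sec = std::chrono::steady_clock::now() - Start;
  std::printf("%-10s %8.3f s  %8.1f MB/s\n", Name, Sec.count(),
              Size / (1024.0 * 1024.0) / Sec.count());
}

Hashing a buffer of roughly the output size should make it obvious whether the gap in the table comes from the hash itself or from how the linker drives it.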

Thanks for the extensive benchmarking! :-)

LGTM, too.

cheers,
--renato

SHA1 in LLVM is very naive; any improvement is welcome there!
I think Amaury pointed it out originally, and he had an alternative implementation IIRC.

Can we just copy-and-paste optimized code from somewhere?

The current implementation was “copy/pasted” from somewhere (it was explicitly public domain).

I should’ve asked: do you know of an optimized SHA1 implementation that we can use?

No, otherwise I’d have picked it instead of this one ;-)

The alternative plan was either to:

  1. reach out to someone who has written an optimized one and convince him to contribute it to LLVM
  2. “optimize” the one in tree.

But it’s not high enough on my priority list.

Have you considered using OpenSSL’s implementation? (just wondering, I don’t know much about OpenSSL, but I guess their code is optimized.)

The NetBSD version is also PD and uses much more aggressive loop
unrolling:

https://github.com/jsonn/src/blob/trunk/common/lib/libc/hash/sha1/sha1.c

It's still a bit slower than an optimised assembler version, but
typically good enough.

Joerg
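
As a generic illustration of the loop-unrolling technique (deliberately not SHA1 itself, just the shape of the optimization): an unrolled body does the same work as the simple loop but executes far fewer branches and exposes more independent operations to the CPU.

// Generic example of manual 4-way loop unrolling (not SHA1): the
// unrolled version computes the same XOR-sum as the simple loop but
// with a quarter of the loop-control overhead and four independent
// accumulators the CPU can schedule in parallel.
#include <cstddef>
#include <cstdint>

std::uint32_t xorSumSimple(const std::uint32_t *W, std::size_t N) {
  std::uint32_t Acc = 0;
  for (std::size_t I = 0; I < N; ++I)
    Acc ^= W[I];
  return Acc;
}

std::uint32_t xorSumUnrolled(const std::uint32_t *W, std::size_t N) {
  std::uint32_t A0 = 0, A1 = 0, A2 = 0, A3 = 0;
  std::size_t I = 0;
  for (; I + 4 <= N; I += 4) { // main unrolled body
    A0 ^= W[I + 0];
    A1 ^= W[I + 1];
    A2 ^= W[I + 2];
    A3 ^= W[I + 3];
  }
  for (; I < N; ++I)           // leftover elements
    A0 ^= W[I];
  return A0 ^ A1 ^ A2 ^ A3;
}

In SHA1 itself the rounds are data-dependent, so an unrolled implementation wins mainly by removing loop overhead and the per-round rotation of the five state variables rather than by running rounds in parallel.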

What is the total time consumed, not just the real time? When building
a large project, linking is often done in parallel with other tasks, so
wasting a lot of CPU to save a bit of real time is not necessarily a net
win.

Joerg
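
That distinction is easy to instrument directly: wall-clock time is what the user waits for, while total CPU time is what the rest of a parallel build loses. A minimal sketch of measuring both around a piece of work, assuming Linux/glibc where std::clock() accumulates CPU time across all threads of the process:

// Run a piece of work and report both wall-clock time and total process
// CPU time. A parallel region shows CPU time well above wall time;
// behavior of std::clock() can differ on other platforms.
#include <chrono>
#include <cstdio>
#include <ctime>

template <typename Work>
void measure(const char *Name, Work W) {
  auto WallStart = std::chrono::steady_clock::now();
  std::clock_t CpuStart = std::clock();
  W();
  double CpuSec = double(std::clock() - CpuStart) / CLOCKS_PER_SEC;
  std::chrono::duration<double> WallSec =
      std::chrono::steady_clock::now() - WallStart;
  std::printf("%s: %.2f s wall, %.2f s CPU\n", Name, WallSec.count(), CpuSec);
}

The perf numbers later in the thread show exactly this trade-off: task-clock (total CPU) rises from about 12.7 s to 19.0 s with 20 threads, while elapsed time drops from about 12.7 s to 8.8 s.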

Did you see this http://llvm.org/viewvc/llvm-project?view=revision&revision=287140 ? Interpreting these numbers may be tricky because of hyper-threading, though.

Can you try that with a CPU set that explicitly doesn't include the HT
cores? That's more likely to give a reasonable answer for "what is the
thread overhead".

Joerg

Here is the result of running 20 threads on 20 physical cores (40 virtual cores).

  19002.081139  task-clock (msec)        #   2.147 CPUs utilized           ( ± 2.88% )
        23,006  context-switches         #   0.001 M/sec                   ( ± 2.24% )
         1,491  cpu-migrations           #   0.078 K/sec                   ( ± 22.50% )
     2,607,076  page-faults              #   0.137 M/sec                   ( ± 0.83% )
56,818,049,785  cycles                   #   2.990 GHz                     ( ± 2.54% )
41,072,435,357  stalled-cycles-frontend  #  72.29% frontend cycles idle    ( ± 3.36% )
                stalled-cycles-backend
41,090,608,917  instructions             #   0.72 insns per cycle
                                         #   1.00 stalled cycles per insn  ( ± 0.46% )
 7,621,825,115  branches                 # 401.105 M/sec                   ( ± 0.52% )
   139,383,452  branch-misses            #   1.83% of all branches         ( ± 0.18% )

   8.848611242 seconds time elapsed                                        ( ± 2.72% )

and this is the single-thread result.

  12738.416627  task-clock (msec)        #   1.000 CPUs utilized           ( ± 5.04% )
         1,283  context-switches         #   0.101 K/sec                   ( ± 5.49% )
             3  cpu-migrations           #   0.000 K/sec                   ( ± 55.20% )
     2,614,435  page-faults              #   0.205 M/sec                   ( ± 2.52% )
41,732,843,312  cycles                   #   3.276 GHz                     ( ± 5.76% )
26,816,171,736  stalled-cycles-frontend  #  64.26% frontend cycles idle    ( ± 8.48% )
                stalled-cycles-backend
39,776,444,917  instructions             #   0.95 insns per cycle
                                         #   0.67 stalled cycles per insn  ( ± 0.84% )
 7,288,624,141  branches                 # 572.177 M/sec                   ( ± 1.02% )
   135,684,171  branch-misses            #   1.86% of all branches         ( ± 0.12% )

  12.734335840 seconds time elapsed                                        ( ± 5.03% )