I have no objections to changing the default, as multi-threading can
always be turned off and most users stand to benefit from it.
One thing to consider is whether multi-threading would increase
memory usage. I'm most concerned about virtual address space, as this
can get eaten up very quickly on a 32-bit machine, particularly when
debug info is used. Given that the data set isn't increased when enabling
multiple threads, I speculate that the biggest risk would be different
threads mmapping overlapping parts of the files in a non-shared way.
It will be worth keeping track of how much memory is being used, as
people may need to adjust their maximum number of parallel link jobs to
compensate. From prior experience, building clang with debug info on a
16 GB machine using -j8 will bring it to a halt.
Peter
Sounds like threading isn't beneficial much beyond the second CPU...
Maybe blindly creating one thread per core isn't the best plan...
--renato
parallel.h is pretty simplistic at the moment. Currently it creates
one thread per SMT thread. Creating one per physical core, and doing so
lazily, would probably be a good thing, but threading is already
beneficial, and improving parallel.h would be a welcome improvement.
Cheers,
Rafael
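For reference, here is a minimal sketch of what such a parallel_for_each
amounts to: a static partition of the range with one worker per
std::thread::hardware_concurrency(), i.e. one per SMT thread. This is an
illustration, not the actual parallel.h code, and fn must be safe to call
concurrently:

  #include <algorithm>
  #include <cstddef>
  #include <iterator>
  #include <thread>
  #include <vector>

  // Split [begin, end) into roughly equal chunks, one worker thread each.
  template <class Iter, class Func>
  void parallel_for_each(Iter begin, Iter end, Func fn) {
    unsigned n = std::max(1u, std::thread::hardware_concurrency());
    std::size_t total = std::distance(begin, end);
    std::size_t chunk = (total + n - 1) / n; // ceil(total / n)
    std::vector<std::thread> workers;
    while (begin != end) {
      Iter next = begin;
      std::advance(next,
                   std::min<std::size_t>(chunk, std::distance(begin, end)));
      workers.emplace_back(
          [begin, next, &fn] { std::for_each(begin, next, fn); });
      begin = next;
    }
    for (std::thread &t : workers)
      t.join();
  }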
Instead of using std::thread::hardware_concurrency (which is one per SMT
thread), you may be interested in using the facility I added for setting
the default ThinLTO backend parallelism, llvm::heavyweight_hardware_concurrency(),
which creates one per physical core (see D25585 and r284390). The name is
meant to indicate that this is the concurrency that should be used for
heavier-weight tasks (e.g. tasks that may use a lot of memory).
Teresa
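A hedged sketch of the distinction Teresa describes; getPhysicalCoreCount()
is only a placeholder for what llvm::heavyweight_hardware_concurrency()
computes internally (the real implementation queries the CPU, currently on
x86 only), and the 2-way-SMT assumption is purely illustrative:

  #include <algorithm>
  #include <thread>

  // Placeholder: assume 2-way SMT; the real code detects physical cores.
  unsigned getPhysicalCoreCount() {
    unsigned hw = std::max(1u, std::thread::hardware_concurrency());
    return std::max(1u, hw / 2);
  }

  // One thread per SMT thread for cheap tasks, one per physical core for
  // heavyweight (e.g. memory-hungry) tasks.
  unsigned defaultThreadCount(bool heavyweightTasks) {
    return heavyweightTasks
               ? getPhysicalCoreCount()
               : std::max(1u, std::thread::hardware_concurrency());
  }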
That's an average. When it's at peak, it's using more than two cores.
I submitted r287237 to make --threads the default. You can disable it with --no-threads. Thanks!
Sorry for my ignorance, but what's the point of running the same number of
threads as the number of physical cores instead of HT virtual cores? If we
can get better throughput by not running more than one thread per physical
core, it feels like HT is a useless technology.
It depends on the use case: with ThinLTO we scale linearly with the number of physical cores. When you go beyond the number of physical cores you still get some improvement, but it is no longer linear.
The profitability question is a tradeoff: for example, if each of your tasks is very memory intensive, you may not want to overcommit the cores and increase the memory pressure per physical core.
To take some numbers as an example: if your average user has an 8 GB machine with 4 cores (8 virtual cores with HT), and you know that each of your parallel tasks consumes 1.5 GB of memory on average, then having 4 parallel worker threads processing your tasks leads to a peak memory of 6 GB, while having 8 parallel threads leads to a peak of 12 GB and the machine will start to swap.
Another consideration is that having the linker spawn threads behind the back of the build system isn't great: the build system is supposed to exploit the parallelism. If it spawns 10 linker jobs in parallel, how many threads are competing for the hardware?
So HT is not useless, but it is not universally applicable or universally efficient in the same way.
Hope it makes sense!
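The arithmetic above as a tiny snippet, with the numbers taken straight
from the example:

  #include <cstdio>
  #include <initializer_list>

  int main() {
    const double memPerTaskGB = 1.5; // per-task footprint from the example
    const double ramGB = 8.0;        // the hypothetical user's machine
    for (unsigned workers : {4u, 8u}) {
      double peakGB = workers * memPerTaskGB;
      std::printf("%u workers -> %.1f GB peak%s\n", workers, peakGB,
                  peakGB > ramGB ? " (starts swapping)" : "");
    }
    return 0;
  }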
Thank you for the explanation! That makes sense.
Unlike ThinLTO, each thread in LLD consumes a very small amount of memory
(probably just a few megabytes), so that's not a problem for me. At the
final stage of linking, we spawn threads to copy section contents and apply
relocations, and I guess that causes a lot of memory traffic because it is
basically memcpy'ing input files into an output file, so memory bandwidth
could be a limiting factor there. But I do not see a reason to limit the
number of threads to the number of physical cores. For LLD, it seems like we
can just spawn as many threads as HT provides.
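Roughly, that final stage looks like the following sketch, using a
parallel_for_each like the one sketched earlier; InputSection here is a
hypothetical, simplified stand-in for lld's real data structures:

  #include <cstddef>
  #include <cstdint>
  #include <cstring>
  #include <vector>

  // Simplified view of an input section that already knows where its
  // bytes land inside the mmap'd output buffer.
  struct InputSection {
    const std::uint8_t *data; // contents in the mmap'd input file
    std::size_t size;
    std::uint8_t *outPos;     // destination inside the output mapping
    void relocate() { /* patch words inside [outPos, outPos + size) */ }
  };

  void writeSections(std::vector<InputSection> &sections) {
    // Every section writes to a disjoint range of the output, so the loop
    // is embarrassingly parallel; memory bandwidth (and, per the discussion
    // later in the thread, page-fault handling) is the practical limit.
    parallel_for_each(sections.begin(), sections.end(), [](InputSection &s) {
      std::memcpy(s.outPos, s.data, s.size);
      s.relocate();
    });
  }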
Ok, sure - I was just suggesting it based on Rafael's comment above about lld
currently creating one thread per SMT thread and possibly wanting one per
core instead. It will definitely depend on the characteristics of your
parallel tasks (which is why the name of the interface was changed to
include "heavyweight", i.e. large and memory intensive, since the
implementation may return something other than the number of physical cores
for other architectures - right now it is only implemented for x86 and
otherwise returns std::thread::hardware_concurrency()).
Teresa
Indeed, with HT two hardware threads share the same core's cache and bus,
so memory accesses are likely to be contended. Linkers are memory hungry,
which adds to the I/O bottleneck and makes most of the gain disappear.
Furthermore, the FP unit is also shared between the sibling threads, so
FP-intensive code does not make good use of HT. That's not the case here,
though.
cheers,
-renato
We literally have no variables of type float or double in LLD. It would
work fine on a 486SX.
It is quite common for SMT to *not* be profitable. I did notice some
small wins by not using it. On an Intel machine you can do a quick
check by running with half the threads, since they always have 2x SMT.
Cheers,
Rafael
I had the same experience. Ideally I would like to have a way to
override the number of threads used by the linker.
gold has a plethora of options for doing that, e.g.:
--thread-count COUNT Number of threads to use
--thread-count-initial COUNT
Number of threads to use in initial pass
--thread-count-middle COUNT Number of threads to use in middle pass
--thread-count-final COUNT Number of threads to use in final pass
I don't think we need the full generality/flexibility of
initial/middle/final, but --thread-count could be useful (at least for
experimenting). The current interface of `parallel_for_each` doesn't
allow specifying the number of threads to run, so, assuming lld
goes that route (it may not), it should be extended accordingly.
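One possible shape of that extension, purely as a sketch rather than a
proposal for the real interface (0 keeps today's "one per SMT thread"
default):

  #include <algorithm>
  #include <cstddef>
  #include <iterator>
  #include <thread>
  #include <vector>

  // Same static partition as before, but with an explicit thread count so
  // a --thread-count style option could be plumbed down to it.
  template <class Iter, class Func>
  void parallel_for_each(Iter begin, Iter end, Func fn,
                         unsigned numThreads = 0) {
    if (numThreads == 0)
      numThreads = std::max(1u, std::thread::hardware_concurrency());
    std::size_t total = std::distance(begin, end);
    std::size_t chunk = (total + numThreads - 1) / numThreads;
    std::vector<std::thread> workers;
    while (begin != end) {
      Iter next = begin;
      std::advance(next,
                   std::min<std::size_t>(chunk, std::distance(begin, end)));
      workers.emplace_back(
          [begin, next, &fn] { std::for_each(begin, next, fn); });
      begin = next;
    }
    for (std::thread &t : workers)
      t.join();
  }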
I agree that these options would be useful for testing, but I'm reluctant
to expose them as user options, because I wish LLD would just work out of
the box without requiring lots of knobs to be tuned.
I share your view that lld should work fine out of the box. An alternative
might be to make the option hidden. The set of users tinkering with linker
options is not large, although there are some people who like to
override/"tune" the linker anyway, so IMHO we should expose a sane default
and let users decide whether they care (a similar example is what we do for
--thinlto-threads or --lto-partitions, even if in the latter case we still
have it set to 1 because it's not entirely clear what a reasonable number is).
I've seen a case where the linker was pinned to a specific subset of the CPUs
and many linker invocations were launched in parallel (actually, that is the
only time I've seen --threads for gold used). I personally don't expect this
to be the common use case, but it's not hard to imagine complex build systems
adopting a similar strategy.
Sure. If you want to add --thread-count (but not other options, such as --thread-count-initial), that’s fine with me.
LLD supports multi-threading, and it seems to be working well, as you can
see in a recent result
<http://llvm.org/viewvc/llvm-project?view=revision&revision=287140>. In
short, LLD runs 30% faster with the --threads option and more than 50% faster
if you are also using --build-id (your mileage may vary depending on your
computer). However, I don't think most users even know about that because
--threads is not enabled by default.
I'm thinking of enabling --threads by default. We now have real users, and
they'll be happy about the performance boost.
Any concerns?
I can't think of any problems with that, but I want to add a few notes:
- We still need to focus on single-threaded performance rather than
multi-threaded performance, because it is hard to make a slow program faster
just by using more threads.
- We shouldn't do "too clever" things with threads. Currently, we use
multiple threads only in two places that are highly parallelizable by nature
(namely, copying and applying relocations for each input section, and
computing the build-id hash; a rough sketch of the latter follows below). We
use parallel_for_each, which is very simple and easy to understand. I believe
this was the right design choice, and I don't think we want something like
the workqueues/tasks in GNU gold, for example.
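For the build-id case, the pattern boils down to hashing fixed-size chunks
independently and then hashing the per-chunk digests, reusing a
parallel_for_each like the one above. This is only a rough sketch: fnv1a
stands in for the real digest, and the chunking scheme is illustrative, not
lld's exact one:

  #include <algorithm>
  #include <cstddef>
  #include <cstdint>
  #include <vector>

  // Stand-in for the real digest function.
  static std::uint64_t fnv1a(const std::uint8_t *data, std::size_t size,
                             std::uint64_t h = 1469598103934665603ull) {
    for (std::size_t i = 0; i < size; ++i) {
      h ^= data[i];
      h *= 1099511628211ull;
    }
    return h;
  }

  std::uint64_t parallelHash(const std::uint8_t *buf, std::size_t size) {
    const std::size_t chunkSize = 1 << 20; // 1 MiB chunks
    std::size_t numChunks = (size + chunkSize - 1) / chunkSize;
    std::vector<std::uint64_t> digests(numChunks);
    std::vector<std::size_t> indices(numChunks);
    for (std::size_t i = 0; i < numChunks; ++i)
      indices[i] = i;
    // Each chunk is hashed independently -- a natural parallel_for_each.
    parallel_for_each(indices.begin(), indices.end(), [&](std::size_t i) {
      std::size_t off = i * chunkSize;
      digests[i] = fnv1a(buf + off, std::min(chunkSize, size - off));
    });
    // Combine by hashing the per-chunk digests.
    return fnv1a(reinterpret_cast<const std::uint8_t *>(digests.data()),
                 digests.size() * sizeof(std::uint64_t));
  }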
Sorry for the late response.
Copying and applying relocations is actually not as parallelizable as
you would imagine in current LLD. The reason is that there is an implicit
serialization when mutating the kernel's VA map (which happens any time
there is a minor page fault, i.e. the first time you touch a page of an
mmap'd input). Since threads share the same VA, there is an implicit
serialization across them. Separate processes are needed to avoid this
overhead (note that the separate processes would still have the same output
file mapped, so, at least with fixed partitioning, there is no need for
complex IPC).
For `ld.lld -O0` on a Mac host, I measured <1 GB/s copying speed, even though
the machine I was running on had something like 50 GB/s of DRAM bandwidth; so
the VA overhead is on the order of a 50x slowdown for this copying operation
in this extreme case, and Amdahl's law indicates that there will be
practically no speedup for this copy operation from adding multiple threads.
I've also DTrace'd this and seen massive contention on the VA lock. Linux
will be better, but no matter how good, it is still a serialization point and
Amdahl's law will limit your speedup significantly.
-- Sean Silva
Interesting. It might be worth giving another try to the idea of creating
the file in anonymous memory and using write to output it.
Cheers,
Rafael
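For illustration, a rough POSIX-flavored sketch of that idea (error handling
omitted; buildOutput() is a placeholder for the link steps that currently
target the mmap'd output file, not lld code):

  #include <fcntl.h>
  #include <sys/mman.h>
  #include <unistd.h>

  // Stand-in for the link steps that fill in the output image.
  void buildOutput(unsigned char *buf, size_t size);

  // Build the image in anonymous memory, then stream it out with write(2)
  // instead of mutating a file-backed output mapping.
  void writeViaAnonymousBuffer(const char *path, size_t size) {
    void *mem = mmap(nullptr, size, PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    unsigned char *buf = static_cast<unsigned char *>(mem);
    buildOutput(buf, size);

    int fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0777);
    size_t done = 0;
    while (done < size) {
      ssize_t n = write(fd, buf + done, size - done);
      if (n <= 0)
        break; // real code would retry on EINTR and report errors
      done += static_cast<size_t>(n);
    }
    close(fd);
    munmap(mem, size);
  }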