lld and thread over-subscription

Hi Rui, et al.,

I was experimenting yesterday with building lld on my POWER7 PPC64/Linux machine, and ran into an unfortunate problem. When running the regression tests under lit, almost all of the tests fail like this:

terminate called after throwing an instance of 'std::system_error'
  what(): Resource temporarily unavailable
...
5 libc.so.6 0x00000080b7847238 abort + 4293480680
6 libstdc++.so.6 0x00000fff94f0f004 __gnu_cxx::__verbose_terminate_handler() + 4294099316
7 libstdc++.so.6 0x00000fff94f0bc84
8 libstdc++.so.6 0x00000fff94f0bccc std::terminate() + 4294087956
9 libstdc++.so.6 0x00000fff94f0c0c4 __cxa_throw + 4294088780
10 libstdc++.so.6 0x00000fff94f816e0 std::__throw_system_error(int) + 4294526808
11 libstdc++.so.6 0x00000fff94f83d30 std::thread::_M_start_thread(std::shared_ptr<std::thread::_Impl_base>) + 4294534936
12 lld 0x000000001002a278
...

which seems to indicate a core problem with how we deal with thread-resource exhaustion. For almost all tests, running them individually (or using lit -j 1) works without a problem. We could deal with this by limiting the number of threads lld uses when running regression tests, or by limiting the number of threads that lit uses when running the lld tests (as we currently do with the OpenMP runtime tests), but I'm somewhat concerned that users will run into this problem regardless with heavily parallelized builds.

We could try to catch exceptions that otherwise come from ThreadPoolExecutor's constructor, but do we compile with exceptions enabled?

Thanks again,
Hal

I guess we do not want to enable exceptions to deal with the issue. Are
COFF tests failing, or just ELF tests? If ELF tests for the old LLD are
failing, the best way would be to not use threads in the old LLD. It has
lingering threading issues.
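
Independent of whether the C++ build enables exceptions, the fallback being discussed (try to start a worker thread, and settle for fewer workers when the OS refuses) looks roughly like the sketch below. This is illustrative Python only, not lld's ThreadPoolExecutor; in Python a failed thread start surfaces as a RuntimeError rather than a std::system_error.

import threading

def start_workers(target, desired, work_queue):
    """Try to start up to `desired` worker threads, degrading gracefully
    if the OS refuses to create more (e.g. RLIMIT_NPROC is exhausted)."""
    workers = []
    for _ in range(desired):
        t = threading.Thread(target=target, args=(work_queue,), daemon=True)
        try:
            t.start()
        except RuntimeError:
            # Thread creation failed; make do with the workers that did start
            # instead of taking the whole process down.
            break
        workers.append(t)
    if not workers:
        # Worst case: no worker could be started, so run the work inline.
        target(work_queue)
    return workers

A real lld-side fix would presumably retry with a smaller pool rather than simply stop growing it, but the point is only that resource exhaustion can be contained instead of aborting.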

To provide a data point, my default environment has this:

$ ulimit -a | grep proc
max user processes (-u) 1024

This machine has 48 cores, so lit running 48 tests in parallel leaves each test with only about 20 available threads, far fewer than the 48 each test believes it can use.
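
Spelling the arithmetic out (a throwaway snippet; the numbers are the ones from the output above):

cores = 48         # lit runs this many tests at once, and each lld wants this many threads
nproc_soft = 1024  # the soft `ulimit -u` value shown above

total_demand = cores * cores    # 2304 threads wanted in the worst case
per_test = nproc_soft // cores  # ~21 actually available to each test
print(total_demand, per_test)   # 2304 21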

This is somewhat non-deterministic, but I just reran things both ways, and here's what I see:

During my last run, these tests fail when run under lit with many parallel tests, but do not fail when run otherwise:

    lld :: elf2/basic.s
    lld :: elf/AArch64/general-dyn-tls-0.test
    lld :: elf/AArch64/initial-exec-tls-0.test
    lld :: elf/AArch64/rel-prel32-overflow.test
    lld :: elf/AArch64/rel-prel64.test
    lld :: elf/AMDGPU/hsa.test
    lld :: elf/ARM/arm-symbols.test
    lld :: elf/ARM/dynamic-symbols.test
    lld :: elf/ARM/entry-point.test
    lld :: elf/ARM/exidx.test
    lld :: elf/ARM/header-flags.test
    lld :: elf/ARM/mapping-code-model.test
    lld :: elf/ARM/mapping-symbols.test
    lld :: elf/ARM/missing-symbol.test
    lld :: elf/ARM/plt-dynamic.test
    lld :: elf/ARM/plt-ifunc-interwork.test
    lld :: elf/ARM/plt-ifunc-mapping.test
    lld :: elf/ARM/rel-arm-call.test
    lld :: elf/ARM/rel-arm-jump24-veneer-b.test
    lld :: elf/ARM/rel-arm-mov.test
    lld :: elf/ARM/rel-arm-prel31.test
    lld :: elf/ARM/rel-arm-target1.test
    lld :: elf/ARM/rel-arm-thm-interwork.test
    lld :: elf/ARM/undef-lazy-symbol.test
    lld :: elf/Hexagon/dynlib-data.test
    lld :: elf/Mips/exe-dynamic.test
    lld :: elf/Mips/exe-dynsym.test
    lld :: elf/Mips/exe-fileheader-64.test
    lld :: elf/Mips/exe-fileheader-micro-64.test
    lld :: elf/Mips/exe-fileheader-n32.test
    lld :: elf/Mips/exe-got-micro.test
    lld :: elf/Mips/exe-got.test
    lld :: elf/Mips/got16-2.test
    lld :: elf/Mips/got16-micro.test
    lld :: elf/Mips/got-page-32-micro.test
    lld :: elf/Mips/got-page-64-micro.test
    lld :: elf/Mips/got-page-64.test
    lld :: elf/X86_64/sectionchoice.test
    lld :: elf/X86_64/sectionmap.test
    lld :: mach-o/arm-interworking.yaml
    lld :: mach-o/arm-shims.yaml
    lld :: mach-o/data-only-dylib.yaml
    lld :: mach-o/executable-exports.yaml
    lld :: mach-o/exe-offsets.yaml
    lld :: mach-o/exported_symbols_list-undef.yaml
    lld :: mach-o/fat-archive.yaml
    lld :: mach-o/flat_namespace_undef_error.yaml
    lld :: mach-o/flat_namespace_undef_suppress.yaml
    lld :: mach-o/force_load-x86_64.yaml
    lld :: mach-o/got-order.yaml
    lld :: mach-o/hello-world-arm64.yaml
    lld :: mach-o/hello-world-armv6.yaml
    lld :: mach-o/hello-world-x86_64.yaml
    lld :: mach-o/hello-world-x86.yaml
    lld :: mach-o/keep_private_externs.yaml
    lld :: mach-o/lazy-bind-x86_64.yaml
    lld :: mach-o/library-rescan.yaml
    lld :: mach-o/mh_bundle_header.yaml
    lld :: mach-o/mh_dylib_header.yaml
    lld :: mach-o/objc_export_list.yaml
    lld :: mach-o/order_file-basic.yaml
    lld :: mach-o/parse-aliases.yaml
    lld :: mach-o/parse-cfstring32.yaml
    lld :: mach-o/parse-cfstring64.yaml
    lld :: mach-o/parse-compact-unwind32.yaml
    lld :: mach-o/parse-compact-unwind64.yaml
    lld :: mach-o/parse-data-in-code-armv7.yaml
    lld :: mach-o/parse-data-in-code-x86.yaml
    lld :: mach-o/parse-data-relocs-arm64.yaml
    lld :: mach-o/parse-data-relocs-x86_64.yaml
    lld :: mach-o/parse-data.yaml
    lld :: mach-o/parse-eh-frame-relocs-x86_64.yaml
    lld :: mach-o/parse-eh-frame-x86-anon.yaml
    lld :: mach-o/parse-eh-frame-x86-labeled.yaml
    lld :: mach-o/parse-eh-frame.yaml
    lld :: mach-o/parse-function.yaml
    lld :: mach-o/parse-initializers32.yaml
    lld :: mach-o/parse-initializers64.yaml
    lld :: mach-o/parse-literals-error.yaml
    lld :: mach-o/parse-literals.yaml
    lld :: mach-o/parse-non-lazy-pointers.yaml
    lld :: mach-o/parse-relocs-x86.yaml
    lld :: mach-o/parse-section-no-symbol.yaml
    lld :: mach-o/parse-tentative-defs.yaml
    lld :: mach-o/parse-text-relocs-x86_64.yaml
    lld :: mach-o/parse-tlv-relocs-x86-64.yaml
    lld :: mach-o/re-exported-dylib-ordinal.yaml
    lld :: mach-o/rpath.yaml
    lld :: mach-o/run-tlv-pass-x86-64.yaml
    lld :: mach-o/sectalign.yaml
    lld :: mach-o/twolevel_namespace_undef_dynamic_lookup.yaml
    lld :: mach-o/usage.yaml
    lld :: mach-o/use-simple-dylib.yaml
    lld :: mach-o/write-final-sections.yaml
    lld :: mach-o/wrong-arch-error.yaml
    lld-Unit :: CoreTests/CoreTests/Range.conversion_to_pointer_range
    lld-Unit :: CoreTests/CoreTests/Range.slice
    lld-Unit :: CoreTests/CoreTests/Range.user1
    lld-Unit :: CoreTests/CoreTests/Range.user2

Of these, the following tests don't fail, but are reported as 'Unresolved' (which does not happen if I run lit -j 1):

    lld :: elf/ARM/mapping-code-model.test
    lld :: elf/ARM/mapping-symbols.test
    lld :: elf/ARM/missing-symbol.test
    lld :: elf/ARM/plt-ifunc-interwork.test
    lld :: elf/ARM/rel-arm-jump24-veneer-b.test
    lld :: elf/Mips/exe-got-micro.test
    lld :: elf/Mips/exe-got.test
    lld :: elf/Mips/got16-micro.test
    lld :: mach-o/parse-cfstring64.yaml
    lld :: mach-o/parse-compact-unwind32.yaml
    lld :: mach-o/parse-compact-unwind64.yaml
    lld :: mach-o/parse-data-in-code-armv7.yaml
    lld :: mach-o/parse-data-in-code-x86.yaml
    lld :: mach-o/parse-data-relocs-arm64.yaml
    lld :: mach-o/parse-data-relocs-x86_64.yaml
    lld :: mach-o/parse-data.yaml
    lld :: mach-o/parse-eh-frame-relocs-x86_64.yaml
    lld :: mach-o/parse-eh-frame-x86-anon.yaml
    lld :: mach-o/parse-eh-frame-x86-labeled.yaml
    lld :: mach-o/parse-eh-frame.yaml
    lld :: mach-o/parse-function.yaml
    lld :: mach-o/parse-initializers32.yaml
    lld :: mach-o/parse-initializers64.yaml
    lld :: mach-o/parse-literals-error.yaml
    lld :: mach-o/parse-literals.yaml
    lld :: mach-o/parse-non-lazy-pointers.yaml
    lld :: mach-o/parse-relocs-x86.yaml
    lld :: mach-o/parse-section-no-symbol.yaml
    lld :: mach-o/parse-tentative-defs.yaml
    lld :: mach-o/parse-text-relocs-arm64.yaml
    lld :: mach-o/parse-text-relocs-x86_64.yaml
    lld :: mach-o/parse-tlv-relocs-x86-64.yaml
    lld :: mach-o/rpath.yaml
    lld :: mach-o/run-tlv-pass-x86-64.yaml
    lld :: mach-o/twolevel_namespace_undef_dynamic_lookup.yaml
    lld :: mach-o/usage.yaml
    lld-Unit :: CoreTests/CoreTests/Range.conversion_to_pointer_range
    lld-Unit :: CoreTests/CoreTests/Range.slice
    lld-Unit :: CoreTests/CoreTests/Range.user1
    lld-Unit :: CoreTests/CoreTests/Range.user2

These are listed as unresolved for the same underlying reason, for example:

From: "Rui Ueyama" <ruiu@google.com>
To: "Hal Finkel" <hfinkel@anl.gov>
Cc: "LLVM Developers" <llvm-dev@lists.llvm.org>, "Rafael Espindola" <rafael.espindola@gmail.com>
Sent: Thursday, October 1, 2015 12:55:20 PM
Subject: Re: lld and thread over-subscription

> I honestly think that the ulimit of 1024 max threads is too strict
> for a 48-core machine. Processes are independent of each other, so it
> is not strange for them to spawn as many threads as the number of
> cores.

It is an understandable misconfiguration, but not something desirable in production.

> What's the reason you cannot increase the limit?

It is a soft limit, and I can. Running 'ulimit -u 3072' and then re-running lit causes these failures to go away. My concern is that a soft process limit of 1024 is a common default (at least on any RedHat-derived Linux distribution) regardless of the number of cores on the machine. And, obviously, parallel makes are still very common.

Regardless, do you think it would be reasonable for lit to adjust the soft process limit by default to avoid these kinds of issues, at least when running our regression tests?

Thanks again,
Hal

> From: "Rui Ueyama" <ruiu@google.com>
> To: "Hal Finkel" <hfinkel@anl.gov>
> Cc: "LLVM Developers" <llvm-dev@lists.llvm.org>, "Rafael Espindola" <
rafael.espindola@gmail.com>
> Sent: Thursday, October 1, 2015 12:55:20 PM
> Subject: Re: lld and thread over-subscription
>
>
> I honestly think that the ulimit of 1024 max threads is too strict
> for 48 core machine. Processes are independent each other, so it is
> not strange for them to spawn as many threads as the number of
> cores.

It is an understandable misconfiguration, but not something desirable in
production.

> What's the reason you cannot increase the limit?
>

It is a soft limit, and I can. Running 'ulimit -u 3072' and then
re-running lit causes these failures to go away. My concern is that a soft
process limit of 1024 is a common default (at least on any RedHat-derived
Linux distribution) regardless of the number of cores on the machine. And,
obviously, parallel makes are still very common.

> Regardless, do you think it would be reasonable for lit to adjust the soft
> process limit by default to avoid these kinds of issues, at least when
> running our regression tests?

Yes, I do. If we can avoid the issue by adjusting the soft limit in lit, I
don't see any reason to not do that.
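
For concreteness, bumping the soft limit from a Python test driver could look roughly like the following. This is a sketch of the idea only, not the actual lit change; RLIMIT_NPROC and the resource module are not available on every platform, so a real patch would need to guard for that.

import resource

def raise_soft_nproc_limit(wanted):
    """Raise the soft process/thread limit toward `wanted`, capped at the
    hard limit. Only the soft limit changes, so no privileges are needed."""
    soft, hard = resource.getrlimit(resource.RLIMIT_NPROC)
    if soft == resource.RLIM_INFINITY or soft >= wanted:
        return
    new_soft = wanted if hard == resource.RLIM_INFINITY else min(wanted, hard)
    resource.setrlimit(resource.RLIMIT_NPROC, (new_soft, hard))

# Roughly the programmatic equivalent of 'ulimit -u 3072' before running the tests.
raise_soft_nproc_limit(3072)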

We've seen a similar failure on OS X running tests in another LLVM project besides lld. Filipe may remember what it was.

I think the right place for the improvement here is lit, since it can know the limit and schedule work accordingly.

Alex
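
The scheduling idea would mean deriving lit's worker count from the limit instead of assuming one job per core, along these lines. This is a sketch under the assumption that each running test may spawn up to one thread per core, plus a little slack for lit itself and the commands a test launches.

import multiprocessing
import resource

def safe_worker_count():
    cores = multiprocessing.cpu_count()
    soft, _hard = resource.getrlimit(resource.RLIMIT_NPROC)
    if soft == resource.RLIM_INFINITY:
        return cores
    # Each concurrently running test may need up to `cores` threads, plus a
    # few extra processes for lit and the shell commands a test spawns.
    return max(1, min(cores, soft // (cores + 4)))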

From: "Rui Ueyama" <ruiu@google.com>
To: "Hal Finkel" <hfinkel@anl.gov>
Cc: "LLVM Developers" <llvm-dev@lists.llvm.org>, "Rafael Espindola" <rafael.espindola@gmail.com>
Sent: Thursday, October 1, 2015 1:48:34 PM
Subject: Re: lld and thread over-subscription

> From: "Rui Ueyama" < ruiu@google.com >
> To: "Hal Finkel" < hfinkel@anl.gov >
> Cc: "LLVM Developers" < llvm-dev@lists.llvm.org >, "Rafael
> Espindola" < rafael.espindola@gmail.com >
> Sent: Thursday, October 1, 2015 12:55:20 PM
> Subject: Re: lld and thread over-subscription
>
>
> I honestly think that the ulimit of 1024 max threads is too strict
> for 48 core machine. Processes are independent each other, so it is
> not strange for them to spawn as many threads as the number of
> cores.

It is an understandable misconfiguration, but not something desirable
in production.

> What's the reason you cannot increase the limit?
>

It is a soft limit, and I can. Running 'ulimit -u 3072' and then
re-running lit causes these failures to go away. My concern is that
a soft process limit of 1024 is a common default (at least on any
RedHat-derived Linux distribution) regardless of the number of cores
on the machine. And, obviously, parallel makes are still very
common.

Regardless, do you think it would be reasonable for lit to adjust the
soft process limit by default to avoid these kinds of issues, at
least when running our regression tests?

> Yes, I do. If we can avoid the issue by adjusting the soft limit in
> lit, I don't see any reason to not do that.

http://reviews.llvm.org/D13389

Thanks again,
Hal

> I honestly think that the ulimit of 1024 max threads is too strict for a
> 48-core machine. Processes are independent of each other, so it is not
> strange for them to spawn as many threads as the number of cores. What's
> the reason you cannot increase the limit?

Yeah, this is it. We've run into this internally on our Linux bots.
Basically, the threading abstractions inside LLD spawn #cores threads for
their thread pool as one of the very first things. So if your build is
#cores wide, you end up with #cores ^ 2 threads total.

The simplest solution is just upping the ulimit. This may be something we
can even do inside lit so users automatically see it.

Beyond that, changes to LLD could ameliorate this; fundamentally, though, it
has to do with thread pools knowing how many threads they need to spin up.
A nasty solution could be an environment variable like LLD_NUM_THREADS. We
could also add a command-line flag and do something like `%lld` in the
tests, the way we do `%clang_cc1` for clang, where some extra arguments are
inserted in the expansion telling lld to use a smaller thread count (for
the tests, --num-threads=1 would be fine, I think).

-- Sean Silva
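
On the test-suite side, the substitution idea above would be a small lit config change, roughly like the following. Note that --num-threads and LLD_NUM_THREADS are only proposals in this thread, not existing lld options, and %lld is a hypothetical substitution named here for illustration.

# In lld's test lit.cfg (sketch); `config` is the object lit provides when it
# loads this file, the same way clang's test config defines %clang_cc1.
config.substitutions.append(('%lld', 'lld --num-threads=1'))

# Or, if an environment variable were used instead of a flag:
config.environment['LLD_NUM_THREADS'] = '1'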

From: "Sean Silva" <chisophugis@gmail.com>
To: "Rui Ueyama" <ruiu@google.com>
Cc: "Hal Finkel" <hfinkel@anl.gov>, "LLVM Developers" <llvm-dev@lists.llvm.org>
Sent: Friday, October 2, 2015 9:37:17 PM
Subject: Re: [llvm-dev] lld and thread over-subscription

> The simplest solution is just upping the ulimit. This may be
> something we can even do inside lit so users automatically see it.

r249161 should do exactly this.

Thanks again,
Hal

> From: "Sean Silva" <chisophugis@gmail.com>
> To: "Rui Ueyama" <ruiu@google.com>
> Cc: "Hal Finkel" <hfinkel@anl.gov>, "LLVM Developers" <
llvm-dev@lists.llvm.org>
> Sent: Friday, October 2, 2015 9:37:17 PM
> Subject: Re: [llvm-dev] lld and thread over-subscription
>
>
>
>
>
> I honestly think that the ulimit of 1024 max threads is too strict
> for 48 core machine. Processes are independent each other, so it is
> not strange for them to spawn as many threads as the number of
> cores. What's the reason you cannot increase the limit?
>
>
> Yeah, this is it. We've run into this internally on our linux bots.
> Basically, the threading abstractions inside LLD spawn #cores
> threads for their thread pool as one of the very first things. So if
> your build is #cores wide, you end up with #cores ^ 2 threads total.
>
>
> The simplest solutions is just upping the ulimit. This may be
> something we can even do inside lit so users automatically see it.

> r249161 should do exactly this.

Thanks. Apologies for the issue, ... when we ran into this internally we
should have made just such a fix to avoid the yak shaving for the next guy.
Not sure why that didn't happen.

-- Sean Silva