(Thin)LTO llvm build

I can look into and check why Arch Linux has it configured like that.

In the meantime, Mehdi's suggestion to explicitly pass BINUTILS_INCDIR
restored the previous configure behavior, and the new llvm build has
lib/LLVMgold.so. Thanks to both of you for pointing out the missing cmake
flag.

I've checked the configure step and it didn't fail as it did before, but before
I try to build in ThinLTO mode: since the configure step checks for the gold
plugin, is it safe to assume that I don't have to change the default system
ld to gold for ThinLTO to work, or is gold a build requirement for
bootstrapping llvm in ThinLTO mode?

Try to build llvm-tblgen, you’ll know quite quickly :slight_smile:

Also, you should limit the number of parallel link jobs: cmake -DLLVM_PARALLEL_LINK_JOBS=1
And use ninja if you don’t already: cmake -GNinja
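
For example, a configure invocation along these lines should pick up both settings (the build and source directory names are just placeholders):

$ cd build && cmake -GNinja -DCMAKE_BUILD_TYPE=Release -DLLVM_PARALLEL_LINK_JOBS=1 ../llvm
$ ninja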

You probably missed -DLLVM_BINUTILS_INCDIR.

See: The LLVM gold plugin — LLVM 18.0.0git documentation
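
For reference, the flag is passed at cmake configure time; something like this, assuming plugin-api.h lives in /usr/include as it does on many distributions:

$ cmake -GNinja -DLLVM_BINUTILS_INCDIR=/usr/include ../llvm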

plugin-api.h is in /usr/include, so I'd expect it to be found, but I
can explicitly set BINUTILS_INCDIR and re-bootstrap with gcc 6.2.1.

I have ld.gold, but I'm not sure if /usr/bin/ld uses it, though I'd expect
it to since gold has been included in binutils for a couple of releases now.

$ ld -v
GNU ld (GNU Binutils) 2.27
$ ld.bfd -v
GNU ld (GNU Binutils) 2.27
$ ld.gold -v
GNU gold (GNU Binutils 2.27) 1.12
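
One quick way to see which linker /usr/bin/ld actually resolves to (on most distributions it is a symlink):

$ readlink -f /usr/bin/ld
# typically prints /usr/bin/ld.bfd unless the symlink was repointed at ld.gold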

Looks like your default ld is GNU ld.bfd not ld.gold. You can either change your
/usr/bin/ld (which probably is a link to /usr/bin/ld.bfd) to point instead to
/usr/bin/ld.gold, or if you prefer, set your PATH before the stage1 compile to a
location that has ld linked to ld.gold.
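
For the PATH variant, a minimal sketch (the directory name is arbitrary):

$ mkdir -p ~/gold-bin
$ ln -sf /usr/bin/ld.gold ~/gold-bin/ld
$ export PATH=~/gold-bin:$PATH   # set this before running the stage1 compile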

I can look into and check why Arch Linux has it configured like that.

In the meantime, Mehdi's suggestion to explicitly pass BINUTILS_INCDIR
restored the previous configure behavior, and the new llvm build has
lib/LLVMgold.so. Thanks to both of you for pointing out the missing cmake
flag.

I've checked the configure step and it didn't fail as it did before, but before
I try to build in ThinLTO mode: since the configure step checks for the gold
plugin, is it safe to assume that I don't have to change the default system
ld to gold for ThinLTO to work, or is gold a build requirement for
bootstrapping llvm in ThinLTO mode?

Yeah, perhaps this is working somehow anyway.

Try to build llvm-tblgen, you’ll know quite quickly :slight_smile:

Also, you should limit the number of parallel link jobs: cmake -DLLVM_PARALLEL_LINK_JOBS=1
And use ninja if you don’t already: cmake -GNinja

Yes and to add on - the ThinLTO backend by default will kick off
std::thread::hardware_concurrency # of threads, which I'm finding is too
much for machines with hyperthreading. If that ends up being an issue I can
give you a workaround (I've been struggling to find a way that works on
various OS and arches to compute the max number of physical cores to fix
this in the source).

Teresa

Yes and to add on - the ThinLTO backend by default will
kick off std::thread::hardware_concurrency # of threads, which I'm finding is

Is it just me or does that sound not very intuitive or at least a
little unexpected?
It's good that it uses the resources eagerly, but in terms of build systems this
is certainly surprising if there's no control of that parameter via
make/ninja/xcode.

too much for machines with hyperthreading. If that ends up being an issue I can
give you a workaround (I've been struggling to find a way that works on various
OS and arches to compute the max number of physical cores to fix this in the source).

I've been using ninja -jN so far. I suppose when building with ThinLTO I should
run ninja -j1. Would that

What's the workaround?

Yes and to add on - the ThinLTO backend by default will
kick off std::thread::hardware_concurrency # of threads, which I'm finding is

Is it just me or does that sound not very intuitive or at least a
little unexpected?
It's good that it uses the resources eagerly, but in terms of build systems this
is certainly surprising if there's no control of that parameter via
make/ninja/xcode.

You can control the parallelism used by the linker, but the option is linker dependent
(On MacOS: -Wl,-mllvm,-threads=1)
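
For example, a ThinLTO link through the clang driver on macOS might look roughly like this (the object file names are placeholders):

$ clang++ -flto=thin -Wl,-mllvm,-threads=1 foo.o bar.o -o foo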

too much for machines with hyperthreading. If that ends up being an issue I can
give you a workaround (I've been struggling to find a way that works on various
OS and arches to compute the max number of physical cores to fix this in the source).

I've been using ninja -jN so far. I suppose when building with ThinLTO I should
run ninja -j1. Would that

What's the workaround?

Seems like you missed my previous email: cmake -DLLVM_PARALLEL_LINK_JOBS=1
Also, ninja is parallel by default, so no need to pass -j.

This way you get nice parallelism during the compile phase, and ninja will issue only one link job at a time.

I should add: this is an issue for LLVM because we link *a lot* of binaries.
However, I don’t believe it is the case for most projects, which is why we have this default (which matches ninja’s default).

Yes and to add on - the ThinLTO backend by default will
kick off std::thread::hardware_concurrency # of threads, which I'm finding is

Is it just me or does that sound not very intuitive or at least a
little unexpected?
It's good that it uses the resources eagerly, but in terms of build systems this
is certainly surprising if there's no control of that parameter via
make/ninja/xcode.

You can control the parallelism used by the linker, but the option is linker dependent
(On MacOS: -Wl,-mllvm,-threads=1)

That's what I meant. Maybe lld can gain support for that and allow us to use
the same ld pass-through via the compile driver so that it works on Linux and
BSD too.

too much for machines with hyperthreading. If that ends up being an issue I can
give you a workaround (I've been struggling to find a way that works on various
OS and arches to compute the max number of physical cores to fix this in the source).

I've been using ninja -jN so far. I suppose when building with ThinLTO I should
run ninja -j1. Would that

What's the workaround?

Seems like you missed my previous email: cmake -DLLVM_PARALLEL_LINK_JOBS=1

I didn't miss that and it will hopefully help with limiting LTO link phase
resource use, but I do wonder if it means it's either linking one binary
or compiling objects, and not both in parallel.

Also, ninja is parallel by default, so no need to pass -j.

This way you get nice parallelism during the compile phase,
and ninja will issue only one link job at a time.

I know, but I use ninja for C++ projects because those are the most
frequent CMake users, and compiling C++ often requires limiting it
to less than NUM_CORES.

Yes and to add on - the ThinLTO backend by default will
kick off std::thread::hardware_concurrency # of threads, which I’m finding is

Is it just me or does that sound not very intuitive or at least a
little unexpected?
It’s good that it uses the resources eagerly, but in terms of build systems this
is certainly surprising if there’s no control of that parameter via
make/ninja/xcode.

You can control the parallelism used by the linker, but the option is linker dependent
(On MacOS: -Wl,-mllvm,-threads=1)

That’s what I meant. Maybe lld can gain support for that and allow us to use
the same ld pass-through via the compile driver so that it works on Linux and
BSD too.

too much for machines with hyperthreading. If that ends up being an issue I can
give you a workaround (I’ve been struggling to find a way that works on various
OS and arches to compute the max number of physical cores to fix this in the source).

I’ve been using ninja -jN so far. I suppose when building with ThinLTO I should
run ninja -j1. Would that

What’s the workaround?

Seems like you missed my previous email: cmake -DLLVM_PARALLEL_LINK_JOBS=1

I didn’t miss that and it will hopefully help with limiting LTO link phase
resource use, but I do wonder if it means it’s either linking one binary
or compiling objects, and not both in parallel.

I believe it only limits the number of concurrent links, without interacting with the compile phase, but I’d have to check to be sure.

Also, ninja is parallel by default, so no need to pass -j.

This way you get nice parallelism during the compile phase,
and ninja will issue only one link job at a time.

I know, but I use ninja for C++ projects because those are the most
frequent CMake users, and compiling C++ often requires limiting it
to less than NUM_CORES.

ThinLTO is quite lean on the memory side. What’s your bottleneck that makes you throttle your -j?

>
>
>> Yes and to add on - the ThinLTO backend by default will
>> kick off std::thread::hardware_concurrency # of threads, which I'm finding is
>
> Is it just me or does that sound not very intuitive or at least a
> little unexpected?
> It's good that it uses the resources eagerly, but in terms of build systems this
> is certainly surprising if there's no control of that parameter via
> make/ninja/xcode.

You can control the parallelism used by the linker, but the option is linker dependent
(On MacOS: -Wl,-mllvm,-threads=1)

Wait - this is to control the ThinLTO backend parallelism, right? In which
case you wouldn't want to use 1, but rather the number of physical cores.

When using gold the option is -Wl,-plugin-opt,jobs=N, where N is the number
of ThinLTO backend jobs that will be issued in parallel. So you
could try with the default, but if you have HT on then you might want to
try with the number of physical cores instead.
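
For instance, on a box with 16 physical cores, a ThinLTO link driven through clang and gold might look roughly like this (object names are placeholders):

$ clang++ -flto=thin -fuse-ld=gold -Wl,-plugin-opt,jobs=16 foo.o bar.o -o foo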

Well it depends what behavior you want :slight_smile:

I should have used N to match ninja -jN.

How does it affect parallel LTO backends?
(I hope it doesn’t)

>
>
>> Yes and to add on - the ThinLTO backend by default will
>> kick off std::thread::hardware_concurrency # of threads, which I'm finding is
>
> Is it just me or does that sound not very intuitive or at least a
> little unexpected?
> It's good that it uses the resources eagerly, but in terms of build systems this
> is certainly surprising if there's no control of that parameter via
> make/ninja/xcode.

You can control the parallelism used by the linker, but the option is linker dependent
(On MacOS: -Wl,-mllvm,-threads=1)

Wait - this is to control the ThinLTO backend parallelism, right? In which
case you wouldn't want to use 1, but rather the number of physical cores.

Well it depends what behavior you want :slight_smile:

I should have used N to match ninja -jN.

When using gold the option is -Wl,-plugin-opt,jobs=N, where N is the
number of ThinLTO backend jobs that will be issued in parallel. So
you could try with the default, but if you have HT on then you might want
to try with the number of physical cores instead.

How does it affect parallel LTO backends?
(I hope it doesn't)

In regular LTO mode, the option will also affect parallel LTO codegen,
which is off by default. Is that what you meant?

Yes. I’m sad that it is the same option: parallel LTO codegen changes the final binary, which is really not great in my opinion.
In ThinLTO the parallelism level has this important property that the generated code is unchanged!

So, when I embark on the next ThinLTO try build, probably this Sunday,
should I append -Wl,-plugin-opt,jobs=NUM_PHYS_CORES to LDFLAGS
and run ninja without -j or -jNUM_PHYS_CORES?

It's a habit acquired from past builds of heavy users of C++ templates
or some other C++ feature that blows up memory use, where I had to be careful
not to swap. I usually try -jNUM_VIRT_CORES, but could also run plain -j of course.
gcc's memory overhead was optimized significantly for OpenOffice and Mozilla
a couple of releases back, so it's probably less of an issue these days.

ThinLTO is fairly lean on memory: It should not consume more memory per thread than if you launch the same number of clang processes in parallel to process C++ files.

For example when linking the clang binary itself, without debug info it consumes 0.6GB with 8 threads, 0.9GB with 16 threads, and 1.4GB with 32 threads.
With full debug info, we still have room for improvement, right now it consumes 2.3GB with 8 threads, 3.5GB with 16 threads, and 6.5GB with 32 threads.

So I believe that configuring with -DLLVM_PARALLEL_LINK_JOBS=1 should be enough without other constraints, but your mileage may vary.
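
Putting the pieces from this thread together, a ThinLTO bootstrap configure could look roughly like this, assuming a stage1 clang/clang++ on PATH and a tree recent enough to have the LLVM_ENABLE_LTO=Thin cmake option (paths are illustrative):

$ CC=clang CXX=clang++ cmake -GNinja -DCMAKE_BUILD_TYPE=Release \
    -DLLVM_ENABLE_LTO=Thin \
    -DLLVM_PARALLEL_LINK_JOBS=1 \
    -DLLVM_BINUTILS_INCDIR=/usr/include \
    ../llvm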

Sure, I'll try that so as not to introduce too many variables into the
configure changes, though I have to ask whether using lld would make it
possible to have a common -Wl option that works across platforms,
regardless of whether the toolchain is binutils.

If I really wanted to pass that to cmake, overriding LDFLAGS would work, right?

As Mehdi mentioned, ThinLTO backend processes use very little memory, so you may get away without any additional flags (neither -Wl,--plugin-opt=jobs=…, nor -Dxxx for cmake to limit link parallelism) if your build machine has enough memory. Here is some build-time data from linking (with ThinLTO) 52 binaries of a clang build in parallel (linking parallelism equals ninja parallelism). The machine has 32 logical cores and 64GB memory.

  1. Using the default ninja parallelism, the peak 1min load-average is 537. The total elapsed time is 9m43s
  2. Using ninja -j16, the peak load is 411. The elapsed time is 8m26s
  3. ninja -j8 : elapsed time is 8m34s
  4. ninja -j4 : elapsed time is 8m50s
  5. ninja -j2 : elapsed time is 9m54s
  6. ninja -j1 : elapsed time is 12m3s

As you can see, doing serial ThinLTO linking across multiple binaries does not give you the best performance. The build performance peaked at -j16 in this configuration. You may need to find your best LLVM_PARALLEL_LINK_JOBS value.

Having said that, there is definitely room for ThinLTO usability improvement so that the ThinLTO parallel backends can coordinate well with the build system’s parallelism and the user does not need to figure out the sweet spot.

thanks,

David

So, when I embark on the next ThinLTO try build, probably this Sunday,
should I append -Wl,-plugin-opt,jobs=NUM_PHYS_CORES to LDFLAGS
and run ninja without -j or -jNUM_PHYS_CORES?

ThinLTO is fairly lean on memory: It should not consume more memory per thread than if you launch the same number of clang processes in parallel to process C++ files.

For example when linking the clang binary itself, without debug info it consumes 0.6GB with 8 threads, 0.9GB with 16 threads, and 1.4GB with 32 threads.
With full debug info, we still have room for improvement, right now it consumes 2.3GB with 8 threads, 3.5GB with 16 threads, and 6.5GB with 32 threads.

So I believe that configuring with -DLLVM_PARALLEL_LINK_JOBS=1 should be enough without other constraints, but your mileage may vary.

Sure, I'll try that so as not to introduce too many variables into the
configure changes, though I have to ask whether using lld would make it
possible to have a common -Wl option that works across platforms,
regardless of whether the toolchain is binutils.

I’m not sure I understand the question about lld.
Lld will be a different linker, with its own set of options.
Actually we usually rely on the clang driver to hide platform-specific options and provide a common interface to the user.

If I really wanted to pass that to cmake, overriding LDFLAGS would work, right?

I don’t believe LDFLAGS is a valid cmake flag. You need to define both CMAKE_EXE_LINKER_FLAGS and CMAKE_SHARED_LINKER_FLAGS.
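
For example, to pass the gold plugin jobs option through cmake (the value 8 is just an example):

$ cmake -GNinja \
    -DCMAKE_EXE_LINKER_FLAGS="-Wl,-plugin-opt,jobs=8" \
    -DCMAKE_SHARED_LINKER_FLAGS="-Wl,-plugin-opt,jobs=8" \
    ../llvm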

I am trying to build httpd.bc and for this I am configuring it as follows:

./configure --disable-shared CC="llvm-gcc -flto -use-gold-plugin -Wl,-plugin-opt=also-emit-llvm" CFLAGS="-g" RANLIB="ar --plugin /home/awanish/llvm-2.9/llvm-gcc-4.2-2.9.source/libexec/gcc/x86_64-unknown-linux-gnu/4.2.1/LLVMgold.so -s" AR_FLAGS="--plugin /home/awanish/llvm-2.9/llvm-gcc-4.2-2.9.source/libexec/gcc/x86_64-unknown-linux-gnu/4.2.1/LLVMgold.so -cru"

but I am getting an error which states:

checking for gcc... llvm-gcc -flto -use-gold-plugin -Wl,-plugin-opt=also-emit-llvm
checking whether the C compiler works... no
configure: error: in `/home/awanish/PHD/benchmark/httpd-2.2.16/myBuild/srclib/apr':
configure: error: C compiler cannot create executables

I got the reference for configuring like this from . Can anyone please tell me where I am going wrong and what the correct procedure is for generating a .bc for httpd which can be run on klee?