FYI: ENABLE_MODULES would make building faster

I was testing build efficiency with LLVM_ENABLE_MODULES on the clang/llvm tree.

  • Summary

** Efficiency of Modules increases as the degree of parallelism decreases.
For example, with -j8 the Modules build takes 67% of the elapsed time of the no-modules build.

** With higher parallelism, Modules is less efficient.
For example, with -j72 Modules is only about 23 seconds faster than no-modules,
and its processor usage is about 55%.
(Taking (user+sys)/72 as the ideal elapsed time.)

** If the modules themselves do not need to be rebuilt, rebuilding is efficient enough.
For example, rebuilding with -j72 after removing just the *.o files reaches 84% processor usage.

  • Random notes for possible improvements
  • Get rid of -DCLANG_ENABLE_(ARCMT|REWRITER|STATIC_ANALYZER) on the command line
    and move them into a generated clang-config.h (a rough sketch follows this list).
  • Propagate the definitions used in unittests to the whole tree.
    Modules are sensitive to -D options on the command line.
  • Teach CMake and Ninja to rebuild the module cache.
    IIRC, there was a discussion about this for Fortran modules.
  • Parse modules.cache and issue "module rebuild" commands in advance of building the tree.
    As it is, Ninja cannot do anything useful while each compilation unit is waiting on the module lock.
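
As a rough illustration of the first item (the file and variable names here are
hypothetical, not an actual patch), the options could become a configured header
instead of per-TU -D flags:

  # CMakeLists.txt (sketch): generate clang-config.h once, so that compile
  # command lines no longer differ in these -D options and implicit module
  # builds do not multiply module cache variants because of them.
  set(CLANG_ENABLE_ARCMT 1)
  set(CLANG_ENABLE_STATIC_ANALYZER 1)
  configure_file(clang-config.h.cmake ${CMAKE_BINARY_DIR}/clang-config.h)

  # clang-config.h.cmake (sketch) would contain lines such as:
  #   #cmakedefine01 CLANG_ENABLE_ARCMT
  #   #cmakedefine01 CLANG_ENABLE_STATIC_ANALYZER
  # and sources would #include "clang-config.h" rather than rely on -D.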

I expect developers and users would be happier with Modules.
Thanks,
Takumi

Below are results from building clang with "/usr/bin/time ninja -jN clang".
The host compiler is clang with libc++ and lld, -Asserts.
The host is a Xeon with 36 cores (72 logical processors).

Columns are:
N,user,system,elapsed,Ideal:(u+s)/N,(Ideal/elapsed)

N: number of jobs (-jN)
user: user time (sec)
system: system time (sec)
elapsed: elapsed (wall-clock) time (sec)
Ideal:(u+s)/N: ideal elapsed time, assuming no idle time
(Ideal/elapsed): efficiency, i.e. processor usage relative to the ideal
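
For example, the "about 55%" figure in the summary above comes from the -j72 row
of the ENABLE_MODULES=ON table below:

  Ideal   = (user + sys) / N = (6092.58 + 315.70) / 72 = 89.00 sec
  elapsed = 161.55 sec
  Ideal / elapsed = 89.00 / 161.55 = 55.1%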

*ENABLE_MODULES=OFF

96,11959.10,413.57,184.52,128.882,69.8%
80,12000.47,411.62,184.67,155.151,84.0%
72,11952.46,407.66,184.98,171.668,92.8%
64,10970.09,375.14,189.08,177.269,93.8%
48,8716.43,310.69,198.75,188.065,94.6%
41,7651.71,274.48,202.32,193.322,95.6%
40,7496.75,270.23,205.38,194.175,94.5%
39,7377.94,266.18,206.45,196.003,94.9%
38,7227.33,259.33,206.22,197.017,95.5%
37,7068.51,254.84,207.64,197.928,95.3%
36,6914.62,250.31,208.13,199.026,95.6%
35,6815.70,247.86,210.31,201.816,96.0%
34,6728.49,244.93,214.57,205.101,95.6%
33,6608.13,239.37,216.54,207.500,95.8%
32,6585.52,235.59,221.93,213.160,96.0%
28,6502.79,231.50,248.85,240.510,96.6%
24,6451.13,230.06,289.14,278.383,96.3%
20,6386.95,225.27,342.18,330.611,96.6%
16,6183.61,222.80,411.88,400.401,97.2%
8,5558.17,205.07,728.88,720.405,98.8%

*ENABLE_MODULES=ON

96,6396.47,330.73,169.28,70.075,41.4%
88,6249.93,329.12,160.22,74.762,46.7%
80,6259.91,322.27,163.59,82.277,50.3%
72,6092.58,315.70,161.55,89.004,55.1%
64,5727.81,297.64,168.78,94.148,55.8%
56,5421.81,283.95,168.71,101.889,60.4%
48,4896.81,260.07,171.05,107.435,62.8%
40,4375.71,235.90,177.60,115.290,64.9%
32,3959.32,214.67,188.10,130.437,69.3%
24,3892.54,206.40,230.70,170.789,74.0%
16,3690.52,201.41,294.12,243.246,82.7%
8,3298.95,185.68,488.59,435.579,89.2%

*ENABLE_MODULES=ON, rebuilding after removing just the *.o files

96,6898.51,347.36,120.62,75.478,62.6%
88,6908.61,345.52,121.14,82.433,68.0%
80,6823.66,338.48,118.72,89.527,75.4%
72,6819.25,339.82,118.30,99.432,84.1%
64,6311.53,310.03,120.06,103.462,86.2%
56,5729.12,287.76,123.73,107.444,86.8%
48,5108.16,264.21,127.25,111.924,88.0%
40,4449.20,231.17,131.42,117.009,89.0%
32,3933.69,205.94,142.74,129.363,90.6%
24,3844.17,201.83,181.55,168.583,92.9%
16,3669.73,193.59,251.15,241.458,96.1%
8,3225.63,178.68,434.85,425.539,97.9%

Thanks for sharing this summary. Do you have any numbers for the performance
when building with libstdc++?

I assume this is the penalty of using implicit modules. Building modules takes
locks, which might lead to quadratic compile times (we had an issue describing
the problem somewhere in bugzilla). I have also seen in the past, using make,
that we build modules but fail to pick them up. I tried to fix that issue but
didn't test it thoroughly. If you compile with -H you should be able to see
which files are still textually included.

Do you mean we are 84% faster?

+1. Thanks a lot for working on this!

--Vassil

> I was testing build efficiency with LLVM_ENABLE_MODULES on the clang/llvm tree.

Awesome - thanks for trying it out & gathering all this data!

> • Summary
>
> ** Efficiency of Modules increases as the degree of parallelism decreases.
> For example, with -j8 the Modules build takes 67% of the elapsed time of the no-modules build.
>
> ** With higher parallelism, Modules is less efficient.
> For example, with -j72 Modules is only about 23 seconds faster than no-modules,
> and its processor usage is about 55%.
> (Taking (user+sys)/72 as the ideal elapsed time.)

As Vassil mentioned, this is probably the penalty of implicit modules.

I have some hope/aspirations of implementing explicit modules* support in cmake & so it'll be interesting to compare how much more parallelism can be achieved by that. If anyone else is interested in doing/helping with this work, I'd love any help - I've never touched cmake... so it'll be an adventure. (I assume it'll need changes to cmake itself, but I could be wrong)

  • Explicit modules are used at Google and are implemented in Clang (though only accessible via cc1 at the moment) - the build system must make an explicit clang invocation to build each .pcm file, and then pass explicit arguments to the clang invocation of any file that uses those modules, etc.
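
Very roughly, the two kinds of invocations look something like this (flag
spellings from memory; treat this as illustrative rather than the exact cc1
command lines a real build would use):

  # build a module interface into a .pcm
  clang -cc1 -emit-module -fmodules -fmodule-name=Foo -x c++ module.modulemap -o Foo.pcm
  # compile a translation unit that imports it, loading the prebuilt .pcm
  clang -cc1 -emit-obj -fmodules -fmodule-file=Foo.pcm -fmodule-map-file=module.modulemap -x c++ foo.cpp -o foo.o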

> I have some hope/aspirations of implementing explicit modules support in
> cmake & so it'll be interesting to compare how much more parallelism can be
> achieved by that. If anyone else is interested in doing/helping with this
> work, I'd love any help - I've never touched cmake... so it'll be an
> adventure. (I assume it'll need changes to cmake itself, but I could be
> wrong)

I vaguely looked at this at one point in the past, though I got sidetracked
before I could try it out.

Basically, what it seemed like could be done was something like:

1. use a PRE_BUILD add_custom_command for a given add_library command to
build the pcm (
https://cmake.org/cmake/help/v3.0/command/add_custom_command.html)
2. use an INTERFACE target_compile_definitions to add the command line flag
that dependent code needs to add to the clang invocation (
https://cmake.org/cmake/help/v3.0/command/target_compile_definitions.html)
3. an IMPORTED target library could be used to model external system
dependencies, and you would have one such IMPORTED target for each
different libc or C++ standard library

Of course, to have it properly integrated into CMake, ideally add_library
would take a list of headers and do 1. and 2. by itself. CMake would need
to have built-in knowledge of 3. also.
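
To make 1. and 2. concrete, a rough sketch might look like the following
(hypothetical target and file names; it uses target_compile_options rather than
target_compile_definitions, since -fmodule-file= is a flag, not a macro):

  add_library(Foo STATIC Foo.cpp)

  # 1. Build Foo's module interface before Foo's own objects are compiled.
  #    Note that PRE_BUILD is only honored as such by the Visual Studio
  #    generators; elsewhere it is treated as PRE_LINK, which is part of why
  #    proper support probably needs changes in CMake itself.
  add_custom_command(TARGET Foo PRE_BUILD
    COMMAND ${CMAKE_CXX_COMPILER} -cc1 -emit-module -fmodules -x c++
            -fmodule-name=Foo ${CMAKE_CURRENT_SOURCE_DIR}/module.modulemap
            -o ${CMAKE_CURRENT_BINARY_DIR}/Foo.pcm)

  # 2. Anything that depends on Foo picks up the flag needed to load the .pcm.
  target_compile_options(Foo INTERFACE
    -fmodule-file=${CMAKE_CURRENT_BINARY_DIR}/Foo.pcm)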

-- Sean Silva

David Blaikie via cfe-dev <cfe-dev@lists.llvm.org> writes:

> I have some hope/aspirations of implementing explicit modules* support in
> cmake & so it'll be interesting to compare how much more parallelism can be
> achieved by that.

I think it will be hard to implement this "properly" in CMake until the
underlying build systems are module-aware, similar to how (most of) them
are header-aware. Specifically, generating all the .pcm's during some
sort of a pre-build step will hinder parallelism since in a sense you will
have a "barrier" between compiling module interfaces and other sources.
Ideally, you would want to start compiling sources as soon as all the
module interfaces that they actually use are ready. Especially so if
you have a -j72 kind of machine ;-).

Then there is the issue of change detection: you probably don't want to
make all your sources depend on all your module interfaces.

FWIW, we have implemented this "proper" module support (though for
-fmodules-ts only) in build2[1], if you (or anyone else) would like
to try it.

[1] https://build2.org/

Boris

There is a nice presentation from Google on what it takes to enable modules in a massively parallel and distributed build system:

Best regards,
Serge.