I’m not sure if this is the right place for this, so please feel free to move it if there is a better place for it.
I was debugging an issue with UPX compression of libraries (specifically the libraries for LLVM 16 rc4) with the UPX folks, and John Reiser pointed out that the libLLVM-16.so I was building incurs about 600 page faults during initialization, which slows it down by 1-3 seconds.
Using readelf --headers --dynamic libLLVM-16.so > readelf.out for the various architectures, I see that the INIT_ARRAY information has more than 600 entries (an INIT_ARRAYSZ of 4888 bytes at 8 bytes per 64-bit entry is 611 entries). For example on x86_64:
Section Headers:
  [Nr] Name              Type             Address           Offset
       Size              EntSize          Flags  Link  Info  Align
  [23] .init_array       INIT_ARRAY       0000000007c08438  07c06438
       0000000000001318  0000000000000000  WA       0     0     8
Dynamic section at offset 0x7540b68 contains 41 entries:
  Tag                Type            Name/Value
 0x0000000000000019 (INIT_ARRAY)    0x7c08438
 0x000000000000001b (INIT_ARRAYSZ)  4888 (bytes)
and a file dump shows the beginning of the list of addresses of the static initialization subroutines. John Reiser’s analysis:
Notice that the subroutines are almost all on separate 4 KiB pages. This means that during initialization there will be about 600 page faults, and you can save around 1 to 3 seconds by not having so many static initializers. Find a .h header file that is #included into 600 *.o compilation units, and which performs a static initialization that requires run-time execution, and get rid of the run-time execution. For a simple variable you can do this by changing the .h file to say extern, and then creating a new .cpp file which actually uses a static init to set the value (a sketch follows below). Other times it is necessary to do a topological sort of all the static inits, figure out the nest of dependencies, and then make a PREINIT function for the few initializers that will cover all the rest.
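A minimal sketch of the “extern” fix, with hypothetical names (DefaultTriple and computeDefaultTriple are invented for illustration, not from the original post):

#include <string>

std::string computeDefaultTriple(); // hypothetical helper, defined elsewhere

// Before, in a widely-included header: every one of the ~600
// translation units that includes it runs its own copy of the
// dynamic initializer at startup.
//   static std::string DefaultTriple = computeDefaultTriple();

// After, in the header: a declaration only; nothing runs per-TU.
extern std::string DefaultTriple;

// After, in one new .cpp file: the single remaining dynamic initializer.
std::string DefaultTriple = computeDefaultTriple();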
In 1992 (31 years ago), I [John Reiser] presented “Static Initializers: Reducing the Value-Added Tax on Programs” at the USENIX C++ Technical Conference, Portland, Oregon, pp. 171-180. Your current libLLVM-16.so definitely has this problem. It is not a problem of compilation as such, but a problem of using allocation-hook initialization with a static class instance in order to initialize the API. Please read the paper; it is a thorough explanation, complete with 4 pages of quantitative diagrams.
You don’t need to advertise your own paper here about what is these days a well-understood problem, certainly within a compiler community whose documentation already states that this construct should be avoided (see the LLVM Coding Standards). This post could have been a single paragraph stating the statistics of how bad it currently is (even the dump of addresses is pretty useless without symbolication).
There’s a pattern that a number of places use, wrapping the cl::opts like this, but it is far from being applied consistently:
#include "llvm/Support/CommandLine.h"
using namespace llvm;

namespace {
struct LocalOptions {
  // Non-static data members need brace (or '=') initializers; the
  // option name and arguments here are just placeholders.
  cl::opt<bool> MyOption{"my-option", cl::desc("An example option"),
                         cl::init(false)};
  // ... more options ...
};

LocalOptions &getOptions() {
  static LocalOptions O; // constructed on first call, not from .init_array
  return O;
}
} // namespace

// The trick here is that users must remember to call this function:
void registerFooOptions() {
  (void)getOptions();
}
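A hedged usage sketch (the main function below is hypothetical, not from the original posts): the registration call must run before cl::ParseCommandLineOptions, or the lazily constructed options never get registered.

#include "llvm/Support/CommandLine.h"

void registerFooOptions(); // from the snippet above

int main(int argc, char **argv) {
  registerFooOptions(); // force construction of the lazy option set
  llvm::cl::ParseCommandLineOptions(argc, argv);
  // ... run the tool ...
  return 0;
}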
There’s also a question of how much this pattern really saves. The most common ways of using LLVM will genuinely want to initialize most of those options, so just making the initialization optional wouldn’t actually help.
The actual initialization work just moves from the .init_array to the local function, so it wouldn’t save any user-mode time; but hopefully the local function would be calling code that is already paged in, so it saves sys/wall time.
(I am the author and maintainer of UPX for ELF files, the author that satmandu quoted, and the author of the 1992 paper.)
Finding the culprits of so many static initialization functions:
Most projects use GNU Make, CMake, or another build system with a convention that keeps track of #include dependencies, so that only minimal re-compilation is necessary when less than every *.h file changes. Collect the dependencies (the build system that I use puts them into files such as ./CMakeFiles/upx.dir/src/p_lx_elf.cpp.o.d), split all lines at whitespace and punctuation (so that there is only one filename per line), then make a histogram using sort | uniq -c | sort -rn. The heavily-used transitive #include files will be at the top of the list.
I use two instances of gdb. One gdb targets a program which maps the entire libLLVM-16.so into its address space; then I apply the DT_xxxxx entries to find the INIT_ARRAY table at its Offset in the file according to readelf. The first 32 entries are shown earlier in the post that satmandu copied. The second gdb is run on libLLVM-16.so itself: copy an address out of the INIT_ARRAY table with the mouse, and ask gdb what code is there (gdb’s info symbol command names the function containing an address).
In 1992, a Sun SPARC workstation with a spinning hard drive could service 50 page faults per second. Today’s SSDs are something like 10 times faster than a spinning hard drive: 500 page faults per second. At that rate, 600 faults take 600 / 500 ≈ 1.2 seconds; that’s how I estimated 1 to 3 seconds of wall-clock time during initialization. A human user has no trouble noticing such a delay.
A name like _GLOBAL__sub_I_<file>.cpp indicates an initialization function generated for a static variable in a .cpp file, not in a .h file. I’m really not sure why you’re harping on header files so much; static/global variables (with constructors) in general will need the .init_array mechanism, even if zero of them are defined in headers.
(Using two gdb instances seems a bit tedious; I’d expect most systems to have a symbolic dumper or disassembler that could produce the information more simply and directly, e.g. something along the lines of addr2line -f -e libLLVM-16.so <address>, or llvm-symbolizer.)
I suspect a fair number of these .init_array entries are for global const data that ends up being initialized at runtime. We could probably take more advantage of constexpr than we do today, which would help; a sketch of that kind of change follows below. Seems like a tedious but not unreasonable student/intern project to work through the list and get some startup-time improvements.
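A minimal sketch of the constexpr direction, using an invented CRC lookup table as the example (makeCRCTable and the table names are hypothetical, not anything identified in libLLVM); it needs C++17 for constexpr std::array indexing:

#include <array>
#include <cstdint>

// Before: 'const' alone does not prevent a dynamic initializer;
// this global still adds an entry to .init_array.
std::array<std::uint32_t, 256> makeCRCTable(); // defined elsewhere (hypothetical)
const std::array<std::uint32_t, 256> CRCTable = makeCRCTable();

// After: a constexpr builder is evaluated at compile time, so the
// table is emitted as initialized data with no startup work at all.
constexpr std::array<std::uint32_t, 256> makeCRCTableCE() {
  std::array<std::uint32_t, 256> T{};
  for (std::uint32_t I = 0; I < 256; ++I) {
    std::uint32_t C = I;
    for (int K = 0; K < 8; ++K)
      C = (C & 1) ? (0xEDB88320u ^ (C >> 1)) : (C >> 1);
    T[I] = C;
  }
  return T;
}
constexpr auto CRCTableCE = makeCRCTableCE();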
If your process startup time actually takes multiple seconds, sure, humans will notice. I just tried time clang --version on a system I haven’t touched in days, and it reported about 0.43 seconds; rerunning it reported 0.03 seconds. The 0.4 s of cold-start overhead is not great, but it’s not 3 seconds either. (FTR, this was running clang off a hard drive, not an SSD.)
Actually, no, it doesn’t: that estimate assumes your system chooses not to load anything into memory until it’s faulted in, which seems like a particularly inefficient way to start up a process. We knew better in the '80s.
Thanks for reporting this, and it’s clear there’s room for improvement, but the dire thought-experiment predictions don’t seem to be panning out.