might be creating too many page faults

I’m not sure if this is the right place for this… so please feel free to move this if there is a better place for it.

I was debugging an issue with upx compression of libraries (specifically the libraries for llvm 16-rc4) with the upx folks, and John Reiser pointed out that the library I was building creates 600 page faults during initialization… which slows it down by 1-3 seconds.

Copying John Reiser’s comments from linking to upx compressed linux libraries is failing · Issue #659 · upx/upx · GitHub here:

Using readelf --headers --dynamic > readelf.out for the various architectures, I see that the INIT_ARRAY information has more than 600 entries. For example on x86_64:

Section Headers:
  [Nr] Name              Type             Address           Offset
       Size              EntSize          Flags  Link  Info  Align
  [23] .init_array       INIT_ARRAY       0000000007c08438  07c06438
       0000000000001318  0000000000000000  WA       0     0     8

Dynamic section at offset 0x7540b68 contains 41 entries:        
  Tag        Type                         Name/Value 
 0x0000000000000019 (INIT_ARRAY)         0x7c08438
 0x000000000000001b (INIT_ARRAYSZ)       4888 (bytes)

and using a file dump, the beginning of the list of addresses of static initialization subroutines is:

0x107c06438:	0x000000000372bdf0	0x00000000038ebf30
0x107c06448:	0x0000000003920a00	0x0000000003981a30
0x107c06458:	0x00000000039b1160	0x00000000039b24b0
0x107c06468:	0x00000000039dd8e0	0x00000000039ed130
0x107c06478:	0x0000000003a20650	0x0000000003a2e920
0x107c06488:	0x0000000003a40040	0x0000000003a6c1a0
0x107c06498:	0x0000000003a6f4d0	0x0000000003a7d370
0x107c064a8:	0x0000000003a812b0	0x0000000003a8b080
0x107c064b8:	0x0000000003a9f6f0	0x0000000003adde80
0x107c064c8:	0x0000000003b29160	0x0000000003b62800
0x107c064d8:	0x0000000003b6c120	0x0000000003b77b00
0x107c064e8:	0x0000000003b7d880	0x0000000003b8d060
0x107c064f8:	0x0000000003bcb200	0x0000000003be0f00
0x107c06508:	0x0000000003be9cf0	0x0000000003bf4380
0x107c06518:	0x0000000003bf5e30	0x0000000003bfa430
0x107c06528:	0x0000000003c040f0	0x0000000003c09250

Notice that the subroutines are almost all on separate 4KiB pages. This means that during initialization there will be 600 page faults. You can save around 1 to 3 seconds by not having so many static initializers. Find a .h header file that is #included into 600 *.o compilation units, and which performs a static initialization that requires run-time execution, and get rid of the run-time execution. For a simple variable you can do this by changing the .h file to say extern, and then creating a new .cpp file which actually uses a static init to set the value. Other times it is necessary to do a topological sort of all the static inits, figure out the nest of dependencies, then make a PREINIT function for the few initializers that will cover all the rest.

In 1992 (31 years ago), I [John Reiser] presented Static Initializers: Reducing the Value-Added Tax on Programs at the USENIX C++ Technical Conference, Portland, Oregon, pp 171-180: Conference Proceedings and others. Your current build definitely has this problem. It is not a problem of compilation as such, but a problem of using allocation hook initialization with a static class instance, in order to initialize the API. Please read the paper; it is a thorough explanation, complete with 4 pages of quantitative diagrams.

You don’t need to advertise your own paper here about what is these days a well-understood problem, certainly within a compiler community that has documentation stating this construct should be avoided (LLVM Coding Standards — LLVM 17.0.0git documentation). This post could have just been a single paragraph stating the statistics of how bad it currently is (even the dump of addresses is pretty useless without symbolication).

Sorry. It’s not my paper or my comments I was reposting. This was mentioned to me yesterday at linking to upx compressed linux libraries is failing · Issue #659 · upx/upx · GitHub . Sorry if I didn’t make that clear.

It would still be interesting to have a look at what the ~600 static initializers contain… :slight_smile:

Certainly a lot of them are cl::opts.

There’s a pattern that a number of places are using in which the cl::opts are wrapped like this, but it is far from being applied consistently:

namespace {

struct LocalOptions {
  cl::opt<T> MyOption{...};
};

LocalOptions &getOptions() {
  static LocalOptions O;
  return O;
}

} // namespace

// The trick here is that users must remember to call this function:
void registerFooOptions() {
  (void)getOptions();
}
There’s also a question of how much this pattern really saves. The most common ways of using LLVM will genuinely want to initialize most of those options, so just making the initialization optional wouldn’t actually help.

1 Like

The actual initialization work just moves from the .init_array to the local function, so it wouldn’t save any user-mode time; but hopefully the local function would be calling code that is already paged in, so it saves sys/wall time.

(I am the author and maintainer of UPX for ELF files, the author that satmandu quoted, and the author of the 1992 paper.)

Finding the culprits of so many static initialization functions:

  1. Most projects use GNU Make, CMake, or another build system with a convention which keeps track of #include dependencies, so that only minimal re-compilation is necessary when less than every *.h file changes. Collect the dependencies (the build system that I use puts them into files such as ./CMakeFiles/upx.dir/src/p_lx_elf.cpp.o.d), split all lines at whitespace and punctuation (so that there is only one filename per line), then make a histogram using sort | uniq -c | sort -rn. The heavily-used transitive #include files will be at the top of the list.
  2. I use two instances of gdb. One gdb targets a program which maps the entire file into its address space, then I apply the DT_xxxxx entries to look at the INIT_ARRAY table from its Offset in the file according to readelf. The first 32 entries are shown earlier in the post that satmandu copied. The second gdb is run on the binary itself. Use the mouse to copy an address from the INIT_ARRAY table, and ask gdb what code is there:
(gdb) x/5i 0x000000000372bdf0
   0x372bdf0 <frame_dummy>:	endbr64 
   0x372bdf4 <frame_dummy+4>:	jmp    0x372bd70 <register_tm_clones>
   0x372bdf9 <frame_dummy+9>:	add    %al,(%rax)
   0x372bdfb <frame_dummy+11>:	add    %al,(%rax)
   0x372bdfd <frame_dummy+13>:	add    %al,(%rax)
(gdb) x/5i 0x00000000038ebf30
   0x38ebf30 <_GLOBAL__sub_I_Assumptions.cpp>:	push   %r15
   0x38ebf32 <_GLOBAL__sub_I_Assumptions.cpp+2>:	push   %r14
   0x38ebf34 <_GLOBAL__sub_I_Assumptions.cpp+4>:	push   %rbx
   0x38ebf35 <_GLOBAL__sub_I_Assumptions.cpp+5>:	xorps  %xmm0,%xmm0
   0x38ebf38 <_GLOBAL__sub_I_Assumptions.cpp+8>:	mov    0x431e051(%rip),%rbx        # 0x7c09f90

etc. Automating this is a few minutes’ work for a “hacker”.
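Step 1’s histogram can be sketched as a self-contained script (the toy .d files below stand in for the real CMakeFiles/**/*.o.d output):

```shell
#!/bin/sh
# Sketch: histogram of header usage from *.o.d dependency files.
set -e
dir=$(mktemp -d)
printf 'a.o: a.cpp common.h util.h\n' > "$dir/a.o.d"
printf 'b.o: b.cpp common.h\n'        > "$dir/b.o.d"
# Split lines on whitespace/punctuation so each filename stands alone,
# then count occurrences; heavily-included headers float to the top.
hist=$(cat "$dir"/*.o.d | tr -s ' \t:\\' '\n' | grep '\.h$' \
       | sort | uniq -c | sort -rn)
echo "$hist"
rm -rf "$dir"
```

On a real tree, point the `cat` at the build directory’s dependency files instead of the toy ones.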

1 Like

In 1992, a Sun SPARC workstation with a spinning hard drive could service 50 page faults/second. Today’s SSD are something like 10 times faster than a spinning harddrive: 500 page faults/second. That’s how I estimated 1 to 3 seconds of wall-clock time during initialization. A human user has no trouble noticing such a delay.

This name indicates that it is an initialization function generated for a static variable in a .cpp file, not in a .h file. I’m really not sure why you’re harping on header files so much; static/global variables (with constructors) in general will need the .init_array mechanism, even if zero of them are defined in headers.

(Using two gdb instances seems a bit tedious; I’d expect most systems would have a symbolic dumper/disassembler that could produce the information more simply and directly.)

I suspect a fair number of these .init_array entries are for global const data that ends up being initialized at runtime. We could probably take more advantage of constexpr than we do today, which would help. Seems like a tedious but not unreasonable student/intern project to work through the list and get some startup time improvements.

If your process startup time actually takes multiple seconds, sure humans will notice. I just tried time clang --version on a system I haven’t touched in days, and it reported about 0.43 seconds. Rerunning it reported 0.03 seconds. The 0.4s overhead is not great, but it’s not 3 seconds either. (FTR, running clang off a hard drive not SSD.)

Actually, no it doesn’t, unless your system chooses not to load anything into memory until it’s faulted in, which seems like a particularly inefficient way to start up a process. We knew better in the '80s.

Thanks for reporting this, and it’s clear there’s room for improvement, but the dire thought-experiment predictions don’t seem to be panning out.

1 Like

@admins This is probably better placed in LLVM Project, not Code Generation, as it has more to do with coding style than LLVM’s own code generation.

1 Like

Apologies for unnecessarily sounding alarms here.