Reducing code size of Position Independent Executables (PIE) by shrinking the size of dynamic relocations section

+llvm-dev

Discussion here: Reducing code size of Position Independent Executables (PIE) by shrinking the size of dynamic relocations section

Sri and I have been working on this over the past few months, and we've made
some good progress that we'd like to share and get feedback on.

Our work is based on the 'experimental-relr' prototype from Cary that is
available on the 'users/ccoutant/experimental-relr' branch of the binutils
repository [1], and was described earlier in this thread:
https://sourceware.org/ml/gnu-gabi/2017-q2/msg00003.html

We've taken the '.relr.dyn' section from Cary's prototype, and implemented a
custom encoding to compactly represent the list of offsets. We're calling the
new compressed section '.relrz.dyn' (for relocations-relative-compressed).

The encoding used is a simple combination of delta encoding and a bitmap of
offsets. The section consists of 64-bit entries: the upper 8 bits contain the
delta since the last offset, and the lower 56 bits contain a bitmap indicating
which words to apply the relocation to. This is best described by showing the
code for decoding the section:

typedef struct
{
  Elf64_Xword r_data; /* jump and bitmap for relative relocations */
} Elf64_Relrz;

#define ELF64_R_JUMP(val) ((val) >> 56)
#define ELF64_R_BITS(val) ((val) & 0xffffffffffffff)

#ifdef DO_RELRZ
  {
    ElfW(Addr) offset = 0;
    for (; relative < end; ++relative)
      {
        ElfW(Addr) jump = ELFW(R_JUMP) (relative->r_data);
        ElfW(Addr) bits = ELFW(R_BITS) (relative->r_data);
        /* Advance the bitmap base by 'jump' words.  */
        offset += jump * sizeof(ElfW(Addr));
        if (jump == 0)
          {
            /* A jump of 0 means the next entry holds the full offset.  */
            ++relative;
            offset = relative->r_data;
          }
        /* Apply one relocation per set bit, one word apart, starting at
           the bitmap base.  */
        ElfW(Addr) r_offset = offset;
        for (; bits != 0; bits >>= 1)
          {
            if ((bits&1) != 0)
              elf_machine_relrz_relative (l_addr, (void *) (l_addr + r_offset));
            r_offset += sizeof(ElfW(Addr));
          }
      }
  }
#endif

Note that the 8-bit 'jump' encodes the number of _words_ since the last
offset. The case where the jump would not fit in 8 bits is handled by setting
the jump to 0 and emitting the full offset for the next relocation in the
subsequent entry.

The above code is the entirety of the implementation for decoding and
processing '.relrz.dyn' sections in the glibc dynamic loader.

This encoding can represent up to 56 relocation offsets in a single 64-bit
word. For many of the binaries we tested, this encoding provides >40x
compression of the stored offsets compared to the original '.relr.dyn'
section.

For 32-bit targets, we use 32-bit entries: 8-bits for 'jump' and 24-bits for
the bitmap.
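
The encoder side is not shown above; for concreteness, here is a rough sketch
of one possible encoder for the 64-bit format (hypothetical code, not our
actual gold implementation; the name 'encode_relrz' is made up, and it assumes
a sorted array of word-aligned offsets):

#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

#define WORDSZ      8    /* Elf64 word size */
#define JUMP_SHIFT  56
#define JUMP_MAX    0xff
#define BITMAP_BITS 56

static void
emit (uint64_t word)
{
  printf ("0x%016" PRIx64 "\n", word);
}

/* 'offsets' holds n sorted, word-aligned r_offset values.  */
void
encode_relrz (const uint64_t *offsets, size_t n)
{
  uint64_t base = 0;  /* the decoder starts with offset == 0 */
  size_t i = 0;
  while (i < n)
    {
      uint64_t start = offsets[i];
      uint64_t bitmap = 0;
      /* Fold every offset within the 56-word window into one bitmap.  */
      while (i < n && offsets[i] - start < BITMAP_BITS * WORDSZ)
        {
          bitmap |= (uint64_t) 1 << ((offsets[i] - start) / WORDSZ);
          ++i;
        }
      uint64_t jump = (start - base) / WORDSZ;
      if (jump == 0 || jump > JUMP_MAX)
        {
          /* The jump won't fit in 8 bits (or is zero): emit a jump==0
             entry, then the full offset in the subsequent word.  */
          emit (bitmap);
          emit (start);
        }
      else
        emit ((jump << JUMP_SHIFT) | bitmap);
      base = start;
    }
}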

Here are three real world examples that demonstrate the savings:

1) Chrome browser (x86_64, built as PIE):
   File size (stripped): 152265064 bytes (145.21MB)
   605159 relocation entries (24 bytes each) in '.rela.dyn'
   594542 are R_X86_64_RELATIVE relocations (98.25%)
       14269008 bytes (13.61MB) in use in '.rela.dyn' section
         109256 bytes (0.10MB) if moved to '.relrz.dyn' section

   Savings: 14159752 bytes, or 9.29% of original file size.

2) Go net/http test binary (x86_64, 'go test -buildmode=pie -c net/http')
   File size (stripped): 10238168 bytes (9.76MB)
   83810 relocation entries (24 bytes each) in '.rela.dyn'
   83804 are R_X86_64_RELATIVE relocations (99.99%)
       2011296 bytes (1.92MB) in use in .rela.dyn section
         43744 bytes (0.04MB) if moved to .relrz.dyn section

   Savings: 1967552 bytes, or 19.21% of original file size.

3) Vim binary in /usr/bin on my workstation (Ubuntu, x86_64)
   File size (stripped): 3030032 bytes (2.89MB)
   6680 relocation entries (24 bytes each) in '.rela.dyn'
   6272 are R_X86_64_RELATIVE relocations (93.89%)
       150528 bytes (0.14MB) in use in .rela.dyn section
         1992 bytes (0.00MB) if moved to .relrz.dyn section

   Savings: 148536 bytes, or 4.90% of original file size.

Recent releases of Debian, Ubuntu, and several other distributions build
executables as PIE by default. Suprateeka posted some statistics earlier in
this thread on the prevalence of relative relocations in executables residing
in /usr/bin (see 'Re: Reducing code size of Position Independent Executables
(PIE) by shrinking the size of dynamic relocations section').

The third example above shows that using '.relrz.dyn' sections to encode
relative relocations can bring decent savings to executable sizes in /usr/bin
across many distributions.

We have working ld.gold and ld.so implementations for arm, aarch64, and x86_64,
and would be happy to send patches to the binutils and glibc communities for
review.

However, before that can happen, we need agreement on the ABI side for the new
section type and the encoding. We haven't worked on a change of this magnitude
before, one that touches so many different pieces, from the linker and ELF
tools to the dynamic loader. Specifically, we need agreement and/or guidance
on where and how the new section type and its encoding should be documented.
We're proposing new defines for SHT_RELRZ, DT_RELRZ, DT_RELRZSZ, DT_RELRZENT,
and DT_RELRZCOUNT that all the different parts of the toolchains can agree on.
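
For illustration, such defines would sit alongside the existing relocation
constants in elf.h; a sketch with made-up placeholder values (the real values
would have to be assigned through the gABI process), with meanings by analogy
to the existing DT_RELA* tags:

/* Hypothetical sketch only: the numeric values below are placeholders,
   not proposed assignments.  */
#define SHT_RELRZ      0x6fffff00  /* section type for '.relrz.dyn' */
#define DT_RELRZ       0x6ffffe00  /* address of the '.relrz.dyn' section */
#define DT_RELRZSZ     0x6ffffe01  /* total size of the section, in bytes */
#define DT_RELRZENT    0x6ffffe02  /* size of one entry: 8 (64-bit) or 4 (32-bit) */
#define DT_RELRZCOUNT  0x6ffffe03  /* number of entries in the section */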

Thanks,
Rahul

[1]: sourceware.org Git - binutils-gdb.git/shortlog

Sounds like good work.

The place to hold a discussion on ELF ABI issues is
generic-abi@googlegroups.com and gnu-gabi@sourceware.org. Inasmuch as
there is an official ELF ABI any more now that SCO has gone under, it
is maintained on the generic-abi list.

Ian

We've taken the '.relr.dyn' section from Cary's prototype, and implemented a
custom encoding to compactly represent the list of offsets. We're calling the
new compressed section '.relrz.dyn' (for relocations-relative-compressed).

I'd suggest just using .relr.dyn -- your encoding is straightforward
enough that I'd just make that the standard representation for this
section type.

The encoding used is a simple combination of delta encoding and a bitmap of
offsets. The section consists of 64-bit entries: the upper 8 bits contain the
delta since the last offset, and the lower 56 bits contain a bitmap indicating
which words to apply the relocation to. This is best described by showing the
code for decoding the section:

...

The above code is the entirety of the implementation for decoding and
processing '.relrz.dyn' sections in the glibc dynamic loader.

This encoding can represent up to 56 relocation offsets in a single 64-bit
word. For many of the binaries we tested, this encoding provides >40x
compression of the stored offsets compared to the original '.relr.dyn'
section.

For 32-bit targets, we use 32-bit entries: 8-bits for 'jump' and 24-bits for
the bitmap.

Very nice! Simple and effective.

Here are three real world examples that demonstrate the savings:

Impressive numbers. I've gotta admit, the savings are better than I expected.

However, before that can happen, we need agreement on the ABI side for the new
section type and the encoding. We haven't worked on a change of this magnitude
before, one that touches so many different pieces, from the linker and ELF
tools to the dynamic loader. Specifically, we need agreement and/or guidance
on where and how the new section type and its encoding should be documented.
We're proposing new defines for SHT_RELRZ, DT_RELRZ, DT_RELRZSZ, DT_RELRZENT,
and DT_RELRZCOUNT that all the different parts of the toolchains can agree on.

Yes, as Ian mentioned, the generic ABI discussion is at
generic-abi@googlegroups.com. Most people who would be interested are
already on the gnu-gabi@sourceware.org list, but there are a few who
are not, and who may not yet have seen this discussion. I'll support
the proposal.

Thanks for taking this idea the extra mile!

-cary

* Rahul Chaudhry via gnu-gabi:

The encoding used is a simple combination of delta encoding and a bitmap of
offsets. The section consists of 64-bit entries: the upper 8 bits contain the
delta since the last offset, and the lower 56 bits contain a bitmap indicating
which words to apply the relocation to. This is best described by showing the
code for decoding the section:

typedef struct
{
  Elf64_Xword r_data; /* jump and bitmap for relative relocations */
} Elf64_Relrz;

#define ELF64_R_JUMP(val) ((val) >> 56)
#define ELF64_R_BITS(val) ((val) & 0xffffffffffffff)

#ifdef DO_RELRZ
  {
    ElfW(Addr) offset = 0;
    for (; relative < end; ++relative)
      {
        ElfW(Addr) jump = ELFW(R_JUMP) (relative->r_data);
        ElfW(Addr) bits = ELFW(R_BITS) (relative->r_data);
        offset += jump * sizeof(ElfW(Addr));
        if (jump == 0)
          {
            ++relative;
            offset = relative->r_data;
          }
        ElfW(Addr) r_offset = offset;
        for (; bits != 0; bits >>= 1)
          {
            if ((bits&1) != 0)
              elf_machine_relrz_relative (l_addr, (void *) (l_addr + r_offset));
            r_offset += sizeof(ElfW(Addr));
          }
      }
  }
#endif

That data-dependent “if ((bits&1) != 0)” branch looks a bit nasty.

Have you investigated whether some sort of RLE-style encoding would be
beneficial? If there are blocks of relative relocations, it might even
be possible to use vector instructions to process them (although more
than four relocations at a time are probably not achievable in a
power-efficient manner on current x86-64).
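
(One cheap mitigation, assuming GCC/Clang builtins are acceptable, would be to
visit only the set bits with a count-trailing-zeros loop instead of testing
every bit; a minimal sketch, with 'apply_bitmap' being a made-up helper:)

#include <stdint.h>

/* Sketch only, not part of the posted patch: walk the set bits directly.  */
static inline void
apply_bitmap (uint64_t bits, uint64_t base, void (*relocate) (uint64_t))
{
  while (bits != 0)
    {
      int k = __builtin_ctzll (bits);      /* index of the lowest set bit */
      relocate (base + (uint64_t) k * 8);  /* one 8-byte word per bit */
      bits &= bits - 1;                    /* clear the lowest set bit */
    }
}

The loop count is still data-dependent, of course, but the per-bit
taken/not-taken branch is gone.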

Yes, we originally investigated RLE-style encodings, but the key insight that
led us toward the proposed encoding is the following: the offsets that need
relocations are close together, but not necessarily contiguous. We
experimented with an encoding strategy where we would store the initial
offset and the number of words after it that contained dynamic relocations.
This gave us good compression numbers, but the proposed scheme was way
better. I will let Rahul say more, as he ran quite a few experiments with
different strategies.

Thanks
Sri

A simple combination of delta encoding and run-length encoding was one of the
first schemes we experimented with (32-bit entries with a 24-bit 'delta' and an
8-bit 'count'). This gave really good results but, as Sri mentions, we observed
several cases where the relative relocations were not at consecutive offsets.
There were common cases where the relocations applied to alternate words, and
that totally wrecked the scheme (a bunch of entries with delta==16 and
count==1).

I dug up some numbers on how that scheme compared with the current proposal on
the three examples I posted before:

delta+run_length encoding is using 32-bit entries (24-bit delta, 8-bit count).
delta+bitmap encoding is using 64-bit entries (8-bit delta, 56-bit bitmap).

1. Chrome browser (x86_64, built as PIE):
   605159 relocation entries (24 bytes each) in '.rela.dyn'
   594542 are R_X86_64_RELATIVE relocations (98.25%)
       14269008 bytes (13.61MB) in use in '.rela.dyn' section
         385420 bytes (0.37MB) using delta+run_length encoding
         109256 bytes (0.10MB) using delta+bitmap encoding

2. Go net/http test binary (x86_64, 'go test -buildmode=pie -c net/http')
   83810 relocation entries (24 bytes each) in '.rela.dyn'
   83804 are R_X86_64_RELATIVE relocations (99.99%)
       2011296 bytes (1.92MB) in use in .rela.dyn section
        204476 bytes (0.20MB) using delta+run_length encoding
         43744 bytes (0.04MB) using delta+bitmap encoding

3. Vim binary in /usr/bin on my workstation (Ubuntu, x86_64)
   6680 relocation entries (24 bytes each) in '.rela.dyn'
   6272 are R_X86_64_RELATIVE relocations (93.89%)
       150528 bytes (0.14MB) in use in .rela.dyn section
        14388 bytes (0.01MB) using delta+run_length encoding
         1992 bytes (0.00MB) using delta+bitmap encoding

Rahul

Thanks for your encouraging words, Ian and Cary.

We're drafting a more detailed proposal and will post it on the generic-abi
list this week. I'll also post a link here for cross-reference.

Based on Cary's suggestion here, we're renaming '.relrz.dyn' to '.relr.dyn'
in the proposal.

Rahul

For the same issue in a different context, I recently implemented a scheme using run-length-encoding but using a variable stride. So for a run of alternate words, you still get a single entry, but with stride 16 instead of 8. In my application, most cases of strides > 8 are a run of only 2 or 3 but there are a few cases of dozens or hundreds with a stride of 16. My case is a solution tailored to exactly one application (a kernel), so there is a closed sample set that’s all that matters and the trade-off between simplicity of the analysis and compactness of the results is different than the general case you’re addressing (my “analysis” consists of a few lines of AWK). But I wonder if it might be worthwhile to study the effect a variable-stride RLE scheme or adding the variable-stride ability into your hybrid scheme has on your sample applications.

Since we’re talking about specifying a new ABI that will be serving us for many years to come and will be hard to change once deployed, it seems worth spending quite a bit of effort up front to come to the most compact scheme that’s feasible.

A simple combination of delta encoding and run-length encoding was one of the
first schemes we experimented with (32-bit entries with a 24-bit 'delta' and an
8-bit 'count'). This gave really good results but, as Sri mentions, we observed
several cases where the relative relocations were not at consecutive offsets.
There were common cases where the relocations applied to alternate words, and
that totally wrecked the scheme (a bunch of entries with delta==16 and
count==1).

For the same issue in a different context, I recently implemented a scheme using run-length-encoding but using a variable stride. So for a run of alternate words, you still get a single entry, but with stride 16 instead of 8. In my application, most cases of strides > 8 are a run of only 2 or 3 but there are a few cases of dozens or hundreds with a stride of 16. My case is a solution tailored to exactly one application (a kernel), so there is a closed sample set that's all that matters and the trade-off between simplicity of the analysis and compactness of the results is different than the general case you're addressing (my "analysis" consists of a few lines of AWK). But I wonder if it might be worthwhile to study the effect a variable-stride RLE scheme or adding the variable-stride ability into your hybrid scheme has on your sample applications.

Since we're talking about specifying a new ABI that will be serving us for many years to come and will be hard to change once deployed, it seems worth spending quite a bit of effort up front to come to the most compact scheme that's feasible.

I agree. Can you share more details of the encoding scheme that you found
useful (size of each entry, number of bits used for stride/count etc.)?

I just ran some experiments with an encoding using 32-bit entries: 16 bits for
delta, 8 bits for stride, and 8 bits for count. Here are the numbers, alongside
those from the previous schemes for comparison:

1. Chrome browser (x86_64, built as PIE):
   605159 relocation entries (24 bytes each) in '.rela.dyn'
   594542 are R_X86_64_RELATIVE relocations (98.25%)
       14269008 bytes (13.61MB) in use in '.rela.dyn' section
         385420 bytes (0.37MB) using delta+count encoding
         232540 bytes (0.22MB) using delta+stride+count encoding
         109256 bytes (0.10MB) using jump+bitmap encoding

2. Go net/http test binary (x86_64, 'go test -buildmode=pie -c net/http')
   83810 relocation entries (24 bytes each) in '.rela.dyn'
   83804 are R_X86_64_RELATIVE relocations (99.99%)
       2011296 bytes (1.92MB) in use in .rela.dyn section
        204476 bytes (0.20MB) using delta+count encoding
        132568 bytes (0.13MB) using delta+stride+count encoding
         43744 bytes (0.04MB) using jump+bitmap encoding

3. Vim binary in /usr/bin on my workstation (Ubuntu, x86_64)
   6680 relocation entries (24 bytes each) in '.rela.dyn'
   6272 are R_X86_64_RELATIVE relocations (93.89%)
       150528 bytes (0.14MB) in use in .rela.dyn section
        14388 bytes (0.01MB) using delta+count encoding
         7000 bytes (0.01MB) using delta+stride+count encoding
         1992 bytes (0.00MB) using jump+bitmap encoding

delta+count encoding is using 32-bit entries:
  24-bit delta: number of bytes since last offset.
   8-bit count: number of relocations to apply (consecutive words).

delta+stride+count encoding is using 32-bit entries:
  16-bit delta: number of bytes since last offset.
   8-bit stride: stride (in bytes) for applying 'count' relocations.
   8-bit count: number of relocations to apply (using 'stride').

jump+bitmap encoding is using 64-bit entries:
   8-bit jump: number of words since last offset.
  56-bit bitmap: bitmap for which words to apply relocations to.
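
For reference, a decode loop for the delta+stride+count entries might look
like the following sketch ('decode_dsc' is a made-up name, and the exact field
semantics are an assumption on my part; measuring the delta from the start of
the previous run is consistent with the 0x0ff010ff entries shown below):

#include <stdint.h>

/* Sketch of a decoder for the experimental 32-bit delta+stride+count
   entries: high 16 bits delta (bytes from the start of the previous run),
   middle 8 bits stride (bytes), low 8 bits count.  */
void
decode_dsc (const uint32_t *entry, const uint32_t *end,
            void (*apply) (uint64_t offset))
{
  uint64_t offset = 0;
  for (; entry < end; ++entry)
    {
      uint32_t delta  = *entry >> 16;
      uint32_t stride = (*entry >> 8) & 0xff;
      uint32_t count  = *entry & 0xff;
      offset += delta;
      for (uint32_t k = 0; k < count; ++k)
        apply (offset + (uint64_t) k * stride);
    }
}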

While adding a 'stride' field is definitely an improvement over simple
delta+count encoding, it doesn't compare well against the bitmap based
encoding.

I took a look inside the encoding for the Vim binary. There are some instances
in the bitmap based encoding like
  [0x3855555555555555 0x3855555555555555 0x3855555555555555 ...]
that encode sequences of relocations applying to alternate words. The stride
based encoding works very well on these and turns them into the much more
compact
  [0x0ff010ff 0x0ff010ff 0x0ff010ff ...]
using stride==0x10 and count==0xff.
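
Decoding one of those 0x3855555555555555 entries by hand: jump = 0x38 = 56
words (the base advances exactly one full window), and the 0x55555555555555
bitmap sets every even bit, i.e. 28 relocations at every other word. A toy
check:

#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

/* Toy decode of a single jump+bitmap entry from the dump above.  */
int
main (void)
{
  uint64_t entry = 0x3855555555555555ULL;
  uint64_t jump = entry >> 56;                 /* 0x38 = 56 words */
  uint64_t bits = entry & 0xffffffffffffffULL;
  uint64_t offset = jump * 8;                  /* relative to the previous base */
  for (; bits != 0; bits >>= 1, offset += 8)
    if (bits & 1)
      printf ("reloc at previous base + 0x%" PRIx64 "\n", offset);
  return 0;
}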

However, for the vast majority of cases, the stride based encoding ends up with
count <= 2, and that kills it in the end.

I could try something more complex with 16-bit entries, but that can only give
2x improvement at best, so it still won't be better than the bitmap approach.

Thanks,
Rahul

While adding a 'stride' field is definitely an improvement over simple
delta+count encoding, it doesn't compare well against the bitmap based
encoding.

I took a look inside the encoding for the Vim binary. There are some instances
in the bitmap based encoding like
  [0x3855555555555555 0x3855555555555555 0x3855555555555555 ...]
that encode sequences of relocations applying to alternate words. The stride
based encoding works very well on these and turns them into the much more
compact
  [0x0ff010ff 0x0ff010ff 0x0ff010ff ...]
using stride==0x10 and count==0xff.

Have you looked much at where the RELATIVE relocations are coming from?

I've looked at a PIE build of gold, and they're almost all for
vtables, which mostly have consecutive entries with 8-byte strides.
There are a few for the GOT, a few for static constructors (in
.init_array), and a few for other initialized data, but vtables seem
to account for the vast majority. (Gold has almost 19,000 RELATIVE
dynamic relocs, and only about 500 non-RELATIVE dynamic relocs.)

Where do the 16-byte strides come from? Vim is plain C, right? I'm
guessing its RELATIVE relocation count is fairly low compared to big
C++ apps. I'm also guessing that the pattern comes from some large
structure or structures in the source code where initialized pointers
alternate with non-pointer values. I'm also curious about Roland's
app.

In my opinion, the data and my intuition both support your choice of a
jump + bit vector representation.

-cary

I took a look inside vim for the source of the ..5555.. pattern (relative
relocations applying to alternate words). One of the sources is the
"builtin_termcaps" symbol, which is an array of "struct builtin_term":

  struct builtin_term
  {
    int bt_entry;
    char *bt_string;
  };

So the pattern makes sense. An encoding using strides will work really well
here with stride == 0x10.

There is another repeating pattern I noticed in vim: ..9999... One of the
sources behind this pattern is the "cmdnames" symbol, which is an array of
"struct cmdname":

  struct cmdname
  {
    char_u *cmd_name; /* name of the command */
    ex_func_T cmd_func; /* function for this command */
    long_u cmd_argt; /* flags declared above */
    int cmd_addr_type; /* flag for address type */
  };

In this struct, the first two fields are pointers, and the next two are
scalars, so only two of every four words need a relative relocation. This
period-4 layout is what produces the ..9999.. pattern. It is an example where
a stride based encoding does not work well, simply because there is no single
stride: the deltas alternate 8,24,8,24,8,24,...
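
A toy illustration of the resulting bitmap (with the window phase chosen to
start at a relocated word; the ..9999.. digits in the actual dump are the same
pattern at a different bit phase):

#include <stdint.h>
#include <stdio.h>

/* Toy illustration: bitmap for an array of 'struct cmdname' entries,
   each four words long with only the first two words relocated.  */
int
main (void)
{
  uint64_t bits = 0;
  for (int s = 0; s < 14; ++s)  /* 14 four-word structs fill 56 bits */
    bits |= 3ULL << (4 * s);    /* set bits for words 0 and 1 */
  printf ("0x%014llx\n", (unsigned long long) bits);  /* 0x33333333333333 */
  return 0;
}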

I think these two examples demonstrate the main weakness of using a simple
stride based encoding: it is too sensitive to how the data structures are laid
out in the program source.

Rahul

We sent a proposal to generic-abi last week.
Here's the link for cross-reference, as promised:
https://groups.google.com/forum/#!topic/generic-abi/bX460iggiKg

Rahul

The generic-abi thread has gone into broader subjects of the benefits and desirability of the work. I'm willing to take it as given that reducing the encoded size of pure-relative address relocs (i.e., R_*_RELATIVE equivalents), and ultimately the RODATA segment size of a given ET_DYN file, is a worthy goal as the sole metric, and that the ballpark savings ratios we're seeing are worth committing to a new ABI. But I am circumspect about choosing an encoding we will be supporting for decades to come. Whatever we do now will surely be good enough that nobody will want to innovate again for many years just to get a little or a fair bit better. It behooves us to be deliberate in getting it as good as we reasonably can now, for the broad range of ET_DYN files that will appear in years to come.

I tend to share the intuitions people have expressed about what kinds of patterns of offsets are likely. I also have a contrary intuition that there are large codebases with lots of formulaic or generated code and data tables that may well have truly enormous numbers of such relocs that fit highly regular patterns just slightly different from the ones we’re considering most likely.

Moreover I don’t think there is any excuse for relying on our intuition when there are vast quantities of actual data pretty readily available. I don’t mean picking a few “important” real-world binaries and using their real data. Examples like Chrome and Firefox have already been tuned by sophisticated developers to minimize relocation overhead and may well not be representative of other programs of similar size and complexity. I mean collecting data from a large and varied corpus of ET_DYN files across machines, operating systems, and codebases.

A pretty simple analysis tool can extract from any ET_DYN file the essential statistics (byte sizes of segments and relocs) and the list of R_*_RELATIVE reloc r_offset values. (I started writing one in Python and I can finish it if people want to use it.) It's easy enough to feed it with lots of ET_DYN files available in public collections such as Linux distros. The tool is simple to run and the data extracted is not really revealing (beyond rough binary size), so it can be applied to lots of proprietary sets of ET_DYN files and the data contributed to the public corpus, from places like Google's internal binaries, Android system binaries, ET_DYN files in APKs in app stores, etc., across many companies.
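
For a sense of scale, a bare-bones version of such an extractor (a
hypothetical C sketch rather than my Python one, handling x86-64 SHT_RELA
sections only, with most error checking omitted) fits on one screen:

#include <elf.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Dump the r_offset of every R_X86_64_RELATIVE reloc in an ET_DYN file.  */
int
main (int argc, char **argv)
{
  if (argc < 2)
    return 1;
  FILE *f = fopen (argv[1], "rb");
  if (f == NULL)
    return 1;
  fseek (f, 0, SEEK_END);
  long size = ftell (f);
  rewind (f);
  unsigned char *buf = malloc (size);
  if (fread (buf, 1, size, f) != (size_t) size)
    return 1;

  Elf64_Ehdr *eh = (Elf64_Ehdr *) buf;
  if (memcmp (eh->e_ident, ELFMAG, SELFMAG) != 0 || eh->e_type != ET_DYN)
    return 1;

  Elf64_Shdr *sh = (Elf64_Shdr *) (buf + eh->e_shoff);
  for (int i = 0; i < eh->e_shnum; ++i)
    if (sh[i].sh_type == SHT_RELA)
      {
        Elf64_Rela *rela = (Elf64_Rela *) (buf + sh[i].sh_offset);
        size_t n = sh[i].sh_size / sizeof (Elf64_Rela);
        for (size_t j = 0; j < n; ++j)
          if (ELF64_R_TYPE (rela[j].r_info) == R_X86_64_RELATIVE)
            printf ("0x%llx\n", (unsigned long long) rela[j].r_offset);
      }
  return 0;
}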

Given this corpus of "reloc traces" you can code up many competing encoding formats and do serious measurements of their space savings across the entire corpus from simple simulations without having to implement each encoding in an actual toolchain and dynamic linker to do the analysis. This is some work, but I don't think it's a huge undertaking in comparison to the actual full deployment roll-out of a new ELF dynamic-linking ABI feature and its impact, which always winds up being much more than just the actual new code in toolchain and runtime implementations. I think the work that will ripple out all over from the deployment, and the many-year commitment to the new format that will inevitably be incurred, merit a rigorous and public, data-driven approach.

* Roland McGrath:

Given this corpus of "reloc traces" you can code up many competing encoding
formats and do serious measurements of their space savings across the
entire corpus from simple simulations without having to implement each
encoding in an actual toolchain and dynamic linker to do the analysis.

On the other hand, the linker currently assumes that the order in
which relative relocations are emitted does not matter. With a
different encoding scheme, future linkers might reorder the
relocations or data objects themselves, either to reduce relocation
encoding size, or to make relocations more efficient (perhaps to
support vectorization). Unfortunately, the numbers which can be
derived from ET_DYN files will not reflect to what extent such
reordering is possible.

Currently the relocation entries are sorted by address (-z combreloc), but that's irrelevant, since I extract all the individual addresses for the traces that feed the encoding analyses, and the new encodings we're talking about entail exactly such choices as how you order the addresses to be relocated. Link-time reordering of the data itself, to make the data's relocation needs more patterned, is conceivable, I suppose. In practice today the data has some locality by dint of .data.rel.ro and similar section names, but certainly nothing that particularly encourages patterns of relocation offsets beyond, e.g., the simple contiguity of vtables.