[RFC] CREL: A compact relocation format for ELF

MaskRay · March 11, 2024, 8:43pm

The relocation formats REL and RELA for ELF are inefficient. In a release build of Clang for x86-64, .rela.* sections consume a significant portion (approximately 20.9%) of the file size. I have developed an alternative relocation format (named CREL, previous tentative name: RELLEB), which yields a significant reduction: x86-64: 18.0%, aarch64: 18.0%, riscv64: 34.3%!

(Results for RELLEB:x86-64: 17.6%, aarch64: 16.9%, riscv64: 32.4%)
(Results for an older version of RELLEB:x86-64: 17.2%, aarch64: 16.5%, riscv64: 32.4%)

Elf32_Rel and Elf32_Rela sacrifice flexibility to maintain a smaller size, limiting relocation types to a maximum of 255. CREL allows 2**32 relocation types, aligning with Elf64_Rel/Elf64_Rela.

I have analyzed many architectures including Arm (AArch32/AArch64), Power, RISC-V, MIPS, RISC-V, z/Architecture, and x86, written a detailed analysis of the size problem and my solution at A compact relocation format for ELF | MaskRay ,
created a generic ABI proposal https://groups.google.com/g/generic-abi/c/yb0rjw56ORw,
and developed a prototype at GitHub - MaskRay/llvm-project at demo-crel

Define SHT_CREL and DT_CREL
clang
- -mcrel: generate SHT_CREL instead of SHT_REL/SHT_RELA
- -flto -mcrel: pass -plugin-opt=-crel to generate SHT_CREL for relocatable file generation
- -Xclang --compress-relocations={none,zlib,zstd}: compress
  SHT_REL/SHT_RELA/SHT_CREL
ld.lld
- handle SHT_CREL input sections. Convert CREL to RELA for
  relocation scanning
- -z crel: use .crel.dyn instead of .rel.dyn/.rela.dyn for dynamic
  relocations. Can be used together with --pack-dyn-relocs=relr
- -r and --emit-relocs: copy SHT_CREL and rewrite (OutputSection::finalizeNonAllocCrel). supports mixing CREL/RELA
llvm-readelf
- -S: recognize SHT_CREL
- -d: recognize DT_CREL
- -r: dump SHT_CREL sections
llvm-objdump
- -r
- -dr
yaml2obj/obj2yaml
MCTargetOptions
- llvm-mc -crel
- llc -crel
- ld.lld -mllvm -crel for LTO

Example:

# apply https://github.com/MaskRay/llvm-project/tree/demo-crel
clang++ -fuse-ld-lld -Wa,--crel a.cc b.cc

I am looking forward to your thoughts:)

Here are a few use cases that will benefit from a compact relocation type:

Marker relocations (R_RISCV_RELAX, etc)
.llvm_addrsig
.llvm.call-graph-profile
.debug_names Smaller object file cost for .debug_names intended only for linking

MaskRay · March 23, 2024, 12:05am

The format was tentatively named RELLEB. As I refine the original pure LEB-based format, “RELLEB” might not be the most fitting name.

I have switched to SHT_CREL/DT_CREL/.crel and updated
https://maskray.me/blog/2024-03-09-a-compact-relocation-format-for-elf
and my LLVM prototype
https://github.com/MaskRay/llvm-project/tree/demo-crel.

The new format is simpler and better than RELLEB even in the absence of the shifted offset technique.

While alternatives like PrefixVarInt (or a suffix-based variant) might excel when encoding larger integers, LEB128 offers advantages when most integers fit within one or two bytes, as it avoids the need for shift operations in the common one-byte representation.

While we could utilize zigzag encoding (i>>31) ^ (i<<1) to convert SLEB128-encoded type/addend to use ULEB128 instead, the generate code is inferior to or on par with SLEB128 for one-byte encodings.

MaskRay · May 6, 2024, 10:49pm

I’ve heard interests from a few users caring about relocatable file sizes and linker memory usage.

Some psABI groups (including RISC-V and s390x) are receptive to including SHT_CREL in the generic range.
They might be open to adopting the feature once the toolchain support is finalized and user demand picks up.

I’ve sent a subset of https://github.com/MaskRay/llvm-project/tree/demo-crel as [MC,llvm-readobj,yaml2obj] Support CREL relocation format by MaskRay · Pull Request #91280 · llvm/llvm-project · GitHub for review. lld/llvm-objdump/obj2yaml changed will follow.

Specification:

In https://www.sco.com/developers/gabi/latest/ch4.sheader.html, make the following changes.

In Figure 4-9: Section Types,sh_type, append a row

SHT_CREL | 20

Add text:

SHT_CREL - The section holds compact relocation entries with explicit addends. An object file may have multiple relocation sections. See ‘‘Relocation’’ below for details.

In Figure 4-16: Special Sections, append

.crelname | SHT_CREL | see below

Change the text below:

.relname, .relaname, and .crelname

These sections hold relocation information, as described in ‘‘Relocation’’. If the file has a loadable segment that includes relocation, the sections’ attributes will include the SHF_ALLOC bit; otherwise, that bit will be off. Conventionally, name is supplied by the section to which the relocations apply. Thus a relocation section for .text normally would have the name .rel.text, .rela.text, or .crel.text.

In Figure 4-23: Relocation Entries, add:

typedef struct {
  Elf32_Addr r_offset;
  Elf32_Word r_symidx;
  Elf32_Word r_type;
  Elf32_Sword r_addend;
} Elf32_Crel;

typedef struct {
  Elf64_Addr r_offset;
  Elf64_Word r_symidx;
  Elf64_Word r_type;
  Elf64_Sxword r_addend;
} Elf64_Crel;

Add text above “A relocation section references two other sections”:

A SHT_CREL section holds compact relocation entries that decode to Elf32_Crel or Elf64_Crel depending on the object file class (32-bit or 64-bit).
Its content begins with a ULEB128-encoded value count * 8 + addend_bit * 4 + shift (35-bit or 67-bit unsigned), where:

count: Relocation count (32-bit or 64-bit unsigned).
addend_bit: 1 indicates that relocation entries encode addends. 0 indicates implicit addends (stored in the location to be modified).
shift: The shift value (0 to 3) applies to delta_offset in relocation entries.

Relocation entries (which encode r_offset, r_symidx, r_type, and r_addend) follow the header.
Note: r_info in traditional REL/RELA formats has been split into r_symidx and r_type, allowing uint32_t relocation types for ELFCLASS32 as well.

Delta offset and flags (ULEB128): Holds delta_offset * (addend_bit ? 8 : 4) + flags (35-bit or 67-bit unsigned), where:
- delta_offset: Difference in r_offset from the previous entry (truncated to Elf32_Addr or Elf64_Addr), right shifted by shift.
- flags: 0 to 7 if addend_bit is 1; otherwise 0 to 3.
- flags & 1: Indicate if delta symbol index is present.
- flags & 2: Indicate if delta type is present.
- flags & 4: Indicate if delta addend is present.
Delta symbol index (SLEB128, if present): The difference in symbol index from the previous entry, truncated to a 32-bit signed integer.
Delta type (SLEB128, if present): The difference in relocation type from the previous entry, truncated to a 32-bit signed integer.
Delta addend (SLEB128, if present): The difference in addend from the previous entry, truncated to a 32-bit or 64-bit signed integer depending on the object file class.

ULEB128 or SLEB128 encoded values use the canonical representation (i.e., the shortest byte sequence).
For the first relocation entry, the previous offset, symbol index, type, and addend members are treated as zero.

Encoding/decoding delta offset and flags does not need multi-precision arithmetic. We can just unroll and special case the first iteration.
The header can be encoded/decoded in a similar way. An implementation can assume that the relocation count cannot be larger than 2**61 and simplify the code.

Example C++ encoder:

// encodeULEB128(uint64_t, raw_ostream &os);
// encodeSLEB128(int64_t, raw_ostream &os);

const uint8_t addendBit = config->isRela ? 4 : 0, flagBits = config->isRela ? 3 : 2;
Elf_Addr offsetMask = 8, offset = 0, addend = 0;
uint32_t symidx = 0, type = 0;
for (const Elf_Crel &rel : relocs)
  offsetMask |= rel.r_offset;
int shift = std::countr_zero(offsetMask)
encodeULEB128(relocs.size() * 8 + addendBit + shift, os);
for (const Elf_Crel &rel : relocs) {
  Elf_Addr deltaOffset = (rel.r_offset - offset) >> shift;
  uint8_t b = (deltaOffset << flagBits) + (symidx != rel.r_symidx) +
              (type != rel.r_type ? 2 : 0) + (addend != rel.r_addend ? 4 : 0);
  if (deltaOffset < (0x80 >> flagBits)) {
    os << char(b);
  } else {
    os << char(b | 0x80);
    encodeULEB128(deltaOffset >> (7 - flagBits), os);
  }
  if (b & 1) {
    encodeSLEB128(static_cast<int32_t>(rel.r_symidx - symidx), os);
    symidx = rel.r_symidx;
  }
  if (b & 2) {
    encodeSLEB128(static_cast<int32_t>(rel.r_type - type), os);
    type = rel.r_type;
  }
  if (b & 4 & addendBit) {
    encodeSLEB128(std::make_signed_t<Elf_Addr>(rel.r_addend - addend), os);
    addend = rel.r_addend;
  }
}

Example C++ decoder:

// uint64_t decodeULEB128(uint8_t *&p);
// int64_t decodeSLEB128(uint8_t *&p);

const auto hdr = decodeULEB128(p);
const size_t count = hdr / 8, flagBits = hdr & 4 ? 3 : 2, shift = hdr % 4;
Elf_Addr offset = 0, addend = 0;
uint32_t symidx = 0, type = 0;
for (size_t i = 0; i != count; ++i) {
  const uint8_t b = *p++;
  offset += b >> flagBits;
  if (b >= 0x80)
    offset += (decodeULEB128(p) << (7 - flagBits)) - (0x80 >> flagBits);
  if (b & 1)
    symidx += decodeSLEB128(p);
  if (b & 2)
    type += decodeSLEB128(p);
  if (b & 4 & hdr)
    addend += decodeSLEB128(p);
  rels[i] = {offset << shift, symidx, type, addend};
}

Both encoder and decoder can be simplified if the desired addend_bit is hardcoded, making flagBits an integer literal.

In Figure 5-10: Dynamic Array Tags, d_tag, add:

DT_CREL | 38 | d_ptr | optional | optional

Add text below:

DT_CREL - This element is similar to DT_REL, except its table uses the CREL format. The relocation count can be inferred from the header.

Update DT_PLTREL and DT_PLTRELSZ:

DT_PLTRELSZ: This element holds the total size, in bytes, of the relocation entries associated with the procedure linkage table. If an entry of type DT_JMPREL is present and the DT_PLTREL entry value is DT_REL or DT_RELA, a DT_PLTRELSZ must accompany it.
DT_PLTREL: This member specifies the type of relocation entry to which the procedure linkage table refers. The d_val member holds DT_REL, DT_RELA, or DT_CREL, as appropriate. All relocations in a procedure linkage table must use the same relocation type.

debadri · May 8, 2024, 5:05am

Speaking on behalf of a team working on limiting the growth of monolithic binaries to keep different build metric limits below the thresholds imposed by the build infrastructure: Currently our team deals with an extraordinarily large binary whose size has become a growing concern. The amount of input objects passed to the linker is very close to the upper limit supported by the in-house build system. This patch has been tested with that binary and it will cause nearly 20% reduction in the total object size, which will effectively eliminate this problem. The wins are more pronounced for builds including sanitizer instrumentation (~30% in some cases).

Therefore, submission of this patch will be beneficial and critical for many such large binaries.

MaskRay · May 9, 2024, 7:25am

GitHub labels will keep interested parties (https://github.com/orgs/llvm/teams/?query=pr-subscribers) informed about the patches (#91280 has many labels). Additionally, directly tag people to signal boost the message.

@ayermolo @bd1976bris @compnerd @dblaikie @emaste @jh7370 @jyknight @rnk @ruiu @petrhosek @smithp35 @smeenai

ayermolo · May 9, 2024, 9:58pm

Sounds interesting.
What is the impact on link time of this?

MaskRay · May 9, 2024, 10:33pm

I haven’t measured a link time difference. The reduction in peak memory usage seems significant (high single digit percentage).

With my previous work improving relocation scanning (⚙ D133003 [ELF] Parallelize relocation scanning) and section writing for lld/ELF, relocation processing consumes a less significant portion of the link time.

Release+Assert clang

% /tmp/out/custom2/bin/ld.lld --threads=8 @response.txt --time-trace && jq -r '.traceEvents[] | select(.name|contains("Total")) | "\(.dur/1000000) \(.name)"' clang-14.time-trace G 'Exe|relocation'
0.550256 Total ExecuteLinker
0.068888 Total Scan relocations

Debug clang

% /tmp/out/custom2/bin/ld.lld --threads=8 @response.txt --time-trace && jq -r '.traceEvents[] | select(.name|contains("Total")) | "\(.dur/1000000) \(.name)"' clang-14.time-trace G 'Exe|relocation'
2.575125 Total ExecuteLinker
0.076005 Total Scan relocations

MaskRay · July 9, 2024, 6:53am

The lld/ELF patch [ELF] Support relocatable files using CREL by MaskRay · Pull Request #98115 · llvm/llvm-project · GitHub should be the last major piece.

With that clang++ -fuse-ld-lld -Wa,--crel,--allow-experimental-crel a.cc b.cc will work.

igorkudrin · July 12, 2024, 3:37am

It would be great if the specification left some room for further enhancements. The current way of defining flags seems too rigid and does not allow for new flags if we need them later. For example, you mentioned that on s390/s390x offsets can have negative deltas, so ULEB128 cannot represent the sequence. If someone wants to adapt CREL for such targets in the future, they might want to add a flag to switch the encoding for offsets to SLEB128, or zigzag, or something more appropriate but exotic.

What about moving all the flags to a separate byte? It could be the first byte of the sections, and if an unknown bit is set, the parser would reject the section. WDYT?

MaskRay · July 12, 2024, 6:01am

In s390/s390x, the general dynamic/local dynamic TLS code sequences might have local decreasing r_offset members. This is due to legacy reason. This could be fixed it they migrate to TLSDESC

0000000000000016  0000000100000013 R_390_PC32DBL          0000000000000000 .data.rel.ro + 2
000000000000001c  0000000400000014 R_390_PLT32DBL         0000000000000000 __tls_get_offset + 2
000000000000001a  0000000300000026 R_390_TLS_GDCALL       0000000000000008 a + 0

ULEB128 can represent decreasing r_offset. We simply encode uint64_t(-2) as ULEB128. While this costs 10 bytes, R_390_TLS_GDCALL is uncommon, so the waste is negligible. Using ULEB128 instead of SLEB128 for r_offset members has noticeable saving. I have reviewed many psABIs and believe there is no argument for newer architectures to introduce decreasing r_offset.

What about moving all the flags to a separate byte? It could be the first byte of the sections, and if an unknown bit is set, the parser would reject the section. WDYT?

I try to balance savings and complexity. Flags don’t seem to justify their complexity

igorkudrin · July 12, 2024, 6:38am

Well, my point was not about the specific case, but more about the possibility of extensions in the future which we cannot predict right now. I believe that this is important when you propose a global standard.

MaskRay · July 12, 2024, 8:10pm

I explored extension needs. Since we will remain the basic RELA form, to keep the complexity controlled we could only pick LEB128 or alternatives (example).
My survey also includes different object file formats still used today and various architectures for ELF.

I concluded that an extra flag in the header is unnecessary. It would not justify the complexity and potential value.

For dynamic relocations, if ELF ever wants to adopt something like Solaris direct binding and Mach-O two-level namespace,
we would require additional relocation information beyond offset/type/symidx/addend.
Complicating the current CREL format for this uncertain possibility wouldn’t be efficient.
The uncertain dynamic relocation needs would justify a new format, which can be made similar to CREL.

dblaikie · July 15, 2024, 4:22pm

Would it not be better if relocations were always in the same section, with a version/enum that identified what format they were in within that section? (rather than introducing a new section next time someone comes up with a new scheme) - if I understand correctly, one of the issues with introducing a new section is that linkers that don’t know about it won’t know they’re missing relocation information/might silently continue without applying the relocations, yes? Whereas if the section is the same, and the version is unknown, it’d create a clearer failure mode for linkers without support for a given format?

Though that would be a diversion from past precedent where new formats go in new sections and those past sections aren’t versioned in any way, I take it?

MaskRay · July 15, 2024, 10:09pm

I have some write-up here: https://maskray.me/blog/2024-03-09-a-compact-relocation-format-for-elf#linker-notes

GNU ld allows certain unknown section types:

[SHT_LOUSER,SHT_HIUSER] and non-SHF_ALLOC
[SHT_LOOS,SHT_HIOS] and non-SHF_OS_NONCONFORMING

but reports errors and stops linking for others (unless --no-warn-mismatch is specified). When linking a relocatable file using SHT_CREL, you might encounter errors like the following: […]

lld and mold have been patched. They will not report errors for unknown section types as well.

Therefore, we have the desired error model. Adding a version number instead of adding a new section type might provide slight better diagnostics once a tool gets initial CREL support, but I believe this is not significant value to justify diverging from the ELF convention.

Topic		Replies	Views
Reducing code size of Position Independent Executables (PIE) by shrinking the size of dynamic relocations section LLVM Dev List Archives	15	321	January 7, 2018
[RFC] Implicit addends for non-code sections LLVM Project debuginfo , clang , llvm	3	185	March 29, 2024
[LLD] Linker Relaxation LLVM Dev List Archives	23	379	July 17, 2017
Generate ELF in x86 with RELA relocations LLVM Dev List Archives	1	127	February 6, 2014
Increasing address pool reuse/reducing .o file size in DWARFv5 LLVM Dev List Archives	14	159	February 11, 2021

[RFC] CREL: A compact relocation format for ELF

Related topics