RFC: Enhancing function alignment attributes

FYI @efriedma-quic @MaskRay

This RFC describes the proposed changes to ELF and the IR to support a notion of preferred alignment for GlobalObjects which is separate from the minimum alignment.

The initial use case is a recently proposed enhancement for CFI jump tables known as jump table relaxation. CFI jump table relaxation reduces the overhead of a CFI-protected indirect call by inlining the function body into the jump table itself, as long as the body is small enough. To do this, we must know the function’s minimum alignment as well as its preferred alignment. A preferred alignment larger than the minimum alignment generally exists to improve performance, but for jump table relaxation we can expect inlining the function body into the jump table entry (8 bytes on x86) to be better for performance than obeying the function’s preferred alignment (16 bytes on x86). We must also ensure that relaxing the jump table entry does not cause misbehavior at runtime, so we must obey the function’s minimum alignment and refrain from inlining the function body if it requires more alignment than the jump table entry provides.

IR

There is currently a single align field on GlobalObject. Prior to #149444 we had the following logic:

  1. If the align field is unset, a function’s alignment is the backend-determined minimum alignment if -Os or -Oz, otherwise the backend-determined preferred alignment.
  2. If the align field is set, the function’s alignment is the max of the align attribute and the alignment computed by (1).

The alignment via -falign-functions was ignored if it was less than the preferred alignment. This behavior was observed to be inconsistent with GCC. With #149444, step 2 was changed to “If the align field is set, the function’s alignment is the max of the align attribute and the minimum alignment”, bringing the behavior in line with GCC.
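The before and after behavior can be sketched as below. This is only a model of the two rules as described here, not LLVM's actual code: the struct, its field names, and the function names are hypothetical, and alignments are plain integers rather than LLVM's Align/MaybeAlign types.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>

// Hypothetical model of the inputs; not LLVM's real data structures.
struct FnAlignInputs {
  uint64_t AlignAttr;   // value of the align field; 0 models "unset"
  uint64_t BackendMin;  // backend-determined minimum alignment
  uint64_t BackendPref; // backend-determined preferred alignment
  bool OptForSize;      // true for -Os or -Oz
};

// Step 1: the alignment used when the align field is unset.
inline uint64_t defaultAlign(const FnAlignInputs &In) {
  assert(In.BackendMin <= In.BackendPref);
  return In.OptForSize ? In.BackendMin : In.BackendPref;
}

// Step 2 as it was before #149444: max of the attribute and step 1.
inline uint64_t alignBefore149444(const FnAlignInputs &In) {
  if (In.AlignAttr == 0)
    return defaultAlign(In);
  return std::max(In.AlignAttr, defaultAlign(In));
}

// Step 2 with #149444: max of the attribute and the backend minimum.
inline uint64_t alignWith149444(const FnAlignInputs &In) {
  if (In.AlignAttr == 0)
    return defaultAlign(In);
  return std::max(In.AlignAttr, In.BackendMin);
}
```

Assuming a backend minimum of 1 and a backend preferred alignment of 16, a function with align 2 compiled without -Os/-Oz gets 16 under the old rule and 2 under the new one in this model.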

A new minalign field will be added to GlobalObject. This shall have type Align (instead of MaybeAlign) and will default to 1. For brevity (and to avoid churn), minalign will only be printed if it is not equal to 1.

As part of splitting up the attributes, the following is proposed as the logic for deciding a function’s alignment:

  1. A function’s minimum alignment is the max of the minalign attribute and backend-determined minimum alignment.
  2. A function’s preferred alignment is the align attribute if set, otherwise the backend-determined minimum alignment if -Os or -Oz, otherwise the backend-determined preferred alignment. Additionally, the preferred alignment shall be at least the minimum alignment.
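The two proposed rules can be written down as a small model. As before, this is a sketch under assumptions rather than the proposed implementation: the struct and function names are hypothetical, alignments are plain integers, and 0 stands in for an unset align attribute.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>

// Hypothetical model of the inputs; not LLVM's real data structures.
struct ProposedInputs {
  uint64_t MinAlignAttr; // minalign attribute; defaults to 1
  uint64_t AlignAttr;    // align attribute; 0 models "unset"
  uint64_t BackendMin;   // backend-determined minimum alignment
  uint64_t BackendPref;  // backend-determined preferred alignment
  bool OptForSize;       // true for -Os or -Oz
};

// Rule 1: max of the minalign attribute and the backend minimum.
inline uint64_t minimumAlignment(const ProposedInputs &In) {
  assert(In.MinAlignAttr >= 1);
  return std::max(In.MinAlignAttr, In.BackendMin);
}

// Rule 2: the align attribute if set, otherwise a backend default that
// depends on -Os/-Oz; the result is never below the minimum alignment.
inline uint64_t preferredAlignment(const ProposedInputs &In) {
  uint64_t Pref;
  if (In.AlignAttr != 0)
    Pref = In.AlignAttr;
  else if (In.OptForSize)
    Pref = In.BackendMin;
  else
    Pref = In.BackendPref;
  return std::max(Pref, minimumAlignment(In));
}
```

In this model, a function with minalign 2 and no align attribute has a minimum alignment of 2 and, outside -Os/-Oz, still receives the backend's preferred alignment.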

The existing accessor names will be updated for clarity: GlobalObject::{getAlign,setAlignment} shall be renamed GlobalObject::{get,set}PreferredAlignment. The new accessors shall be named GlobalObject::{get,set}MinAlignment().

Clang will be updated to set minalign 2 on member functions instead of align 2. As a result, member functions will usually receive the preferred alignment, fixing the regression from #149444.

Object file (ELF)

To represent the preferred and minimum alignments in ELF, it is proposed to introduce a new SHT_LLVM_MIN_ADDRALIGN section, which is used to specify the minimum alignment of a section where that differs from its preferred alignment. Its sh_link field identifies the section whose alignment is being specified, its sh_addralign field specifies the linked section’s minimum alignment, and the sh_addralign field of the linked section’s section header specifies its preferred alignment. This section has the SHF_EXCLUDE flag so that it is stripped from the final executable or shared library, and the SHF_LINK_ORDER flag so that the sh_link field is updated by tools such as ld -r and objcopy. The contents of the section must be empty.

The new asm directive:

.prefalign n

specifies that the preferred alignment of the current section is determined by taking the maximum of n and the section’s minimum alignment, and causes an SHT_LLVM_MIN_ADDRALIGN section to be emitted if necessary.

The preferred alignment section is an opt-in feature. Because the initial anticipated use case (specifically CFI jump tables) requires LTO, it is expected that LTO clients (linkers) with support for the minimum alignment section will opt in via the API. For the same reason, there will not be a user facing (clang driver) flag for opting in for the time being. If preferred alignment is disabled (or unrepresentable, in the case of non-ELF object formats), the preferred alignment shall be stored as the only alignment in the object file, and CodeGen will emit .balign or .p2align instead of .prefalign.

The proposed ELF extension is backwards compatible with linkers that do not recognize the new section type. Linkers that do not support the section type will read the section’s sh_addralign field containing the preferred alignment and treat it as the minimum alignment, which will result in conservatively correct behavior, as the preferred alignment will always be at least as large as the minimum alignment.
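To illustrate how a consumer might read the two alignments, and why an unaware linker stays conservatively correct, here is a minimal model. It is not linker code: the enum, struct, and function names are hypothetical, and section types are modeled as an enum because the numeric value of SHT_LLVM_MIN_ADDRALIGN is not given here.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical stand-ins for ELF section types (no real numeric values).
enum class SecType { Null, Progbits, LlvmMinAddrAlign };

// Only the section header fields that matter for this sketch.
struct SectionHeader {
  SecType Type;
  uint32_t Link;      // sh_link
  uint64_t AddrAlign; // sh_addralign
};

// Returns the alignment a linker would treat as the minimum for the
// section at Index. A linker that understands the extension looks for a
// companion section whose sh_link points at Index. Any other linker, or
// any section without a companion, falls back to the section's own
// sh_addralign, which holds the preferred alignment.
inline uint64_t minAlignFor(const std::vector<SectionHeader> &Sections,
                            size_t Index, bool UnderstandsExtension) {
  assert(Index < Sections.size());
  if (UnderstandsExtension)
    for (const SectionHeader &S : Sections)
      if (S.Type == SecType::LlvmMinAddrAlign && S.Link == Index)
        return S.AddrAlign;
  return Sections[Index].AddrAlign;
}
```

Because the preferred alignment is always at least the minimum alignment, the fallback path can only over-align, never under-align.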

The initial change to support the ELF extension is #150151. If this RFC is accepted, further changes will be developed to teach CodeGen to emit the new directive, and reimplement part of #147424 to read the new section.

The effect on -falign-functions

-falign-functions shall set both the preferred alignment and minimum alignment attributes, to maintain consistency with GCC.

To set the preferred function alignment on its own, a new flag is proposed, which shall be named -fpreferred-function-alignment.

For reference, I already did some cleanups in Remove GlobalObject::getAlign/setAlignment by efriedma-quic · Pull Request #143188 · llvm/llvm-project · GitHub, so we can modify the Function alignment APIs without impacting other GlobalObjects.

Changing the existing “Alignment” to be the minimum alignment seems obvious: it matches the ways we naturally query alignment. And allowing the frontend to specify a preferred alignment on a per-function basis, which can be overridden by later optimizations if necessary, also seems like a good idea. For example, you can specify your preferred alignment, and let PGO override that preference for cold functions, or something along those lines.

The need for the ELF extension seems predicated on the assumption that we request 16-byte alignment for functions that have size 8 bytes or less. But that seems like something we could fix: there’s not much point to requesting 16-byte alignment for a function of size 8 bytes or less. The primary reason for aligning a function entry is to ensure the beginning of the function doesn’t cross icache boundaries, so we only really need 8-byte alignment for an 8-byte function. We could extend the assembler so it requests less alignment for such functions. (I don’t think there’s any way to write that right now in GNU-style assembly, but I can’t think of any obstacle to implementing an assembler directive.)

If we have that, I don’t think you need the ELF extension for CFI jump tables? And I’m not sure how helpful the ELF extension is outside of that.

I don’t think that changing the meaning of align to no longer mean “minimum alignment”, as done in #149444 (CodeGen: Respect function align attribute if less than preferred alignment.), is appropriate. The meaning of align (on allocations/objects) everywhere in LLVM is a minimum alignment, which can be increased if considered profitable.

If you want to introduce a separate preferred alignment property, the way to do it is to leave align alone and add a separate prefalign. Not to change the meaning of align and add minalign.

Thanks for the feedback. I reverted #149444 while we figure out what to do. Keeping the existing attribute as minimum alignment and adding a preferred alignment attribute sounds reasonable to me.

This might work. We currently pass -falign-functions=32 in our internal builds in order to reduce the measurement bias effect of functions changing size. Without the ELF extension, this flag would also affect sh_addralign and would therefore end up preventing jump table relaxation. But we could also consider setting sh_addralign to the lowest power of 2 >= the function size if the function’s size is between the minimum and preferred alignment. Then instead of passing -falign-functions=32 we could start passing the new flag -fpreferred-function-alignment=32. This may also lead to a general performance improvement due to lower TLB/icache pressure. Let me experiment with that and see how well it works.
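A minimal sketch of the size-based sh_addralign idea described above, assuming the function size is known exactly when the section alignment is chosen. The helper names are hypothetical and this is not the code from any patch.

```cpp
#include <cassert>
#include <cstdint>

inline bool isPowerOf2(uint64_t X) { return X != 0 && (X & (X - 1)) == 0; }

// Smallest power of two >= X; returns 1 for X == 0.
// Callers must keep X <= 2^63 so the shift cannot overflow.
inline uint64_t ceilPowerOf2(uint64_t X) {
  uint64_t P = 1;
  while (P < X)
    P <<= 1;
  return P;
}

// Alignment to store in sh_addralign for a function section: the lowest
// power of two >= Size when Size lies between the minimum and preferred
// alignments, clamped to those two bounds otherwise.
inline uint64_t sectionAlignFor(uint64_t Size, uint64_t MinAlign,
                                uint64_t PrefAlign) {
  assert(isPowerOf2(MinAlign) && isPowerOf2(PrefAlign));
  assert(MinAlign <= PrefAlign);
  if (Size >= PrefAlign)
    return PrefAlign;
  if (Size <= MinAlign)
    return MinAlign;
  return ceilPowerOf2(Size); // here Size < PrefAlign, so no overflow
}
```

With a minimum of 1 and a preferred alignment of 32, a 3-byte function would get alignment 4 and a 40-byte function would get the full 32 in this model.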

I agree that the function attribute align to indicate the minimum alignment (the original behavior before the reverted #149444) is useful. I jotted down some notes in the "Aligning code for performance" chapter of this post: https://maskray.me/blog/2025-08-24-understanding-alignment-from-source-to-object-file

Implementing this as an assembler directive with a complex expression (label difference) operand is likely impractical.
Ideally LLVMCodeGen should estimate the function size and emit a suitable alignment directive:

// rejected as intended today
.p2align 4, , b-a
a:
  nop
b:

The first draft of GCC’s -flimit-function-alignment actually found a lowest power of 2 >= the function size, but it was considered not useful.

Aligning small functions can be inefficient and may not be worth the overhead. To address this, GCC introduced -flimit-function-alignment in 2016. The option sets .p2align directive’s max-skip operand to the estimated function size minus one.

% echo 'int add1(int a){return a+1;}' | gcc -O2 -S -fcf-protection=none -xc - -o - -falign-functions=16 | grep p2align
        .p2align 4
% echo 'int add1(int a){return a+1;}' | gcc -O2 -S -fcf-protection=none -xc - -o - -falign-functions=16 -flimit-function-alignment | grep p2align
        .p2align 4,,3
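The effect of the max-skip operand can be modeled as below. This is a sketch of the documented .p2align semantics (skip the alignment entirely if it would need more padding than max-skip), not assembler code; the function names and the kNoLimit constant are hypothetical.

```cpp
#include <cassert>
#include <cstdint>

// Models an omitted max-skip operand (no limit on padding).
constexpr uint64_t kNoLimit = UINT64_MAX;

// Padding bytes emitted at Offset by an alignment directive that asks for
// Align bytes with the given max-skip. If the padding needed exceeds
// MaxSkip, no padding is emitted at all.
inline uint64_t paddingFor(uint64_t Offset, uint64_t Align, uint64_t MaxSkip) {
  assert(Align != 0 && (Align & (Align - 1)) == 0);
  uint64_t Pad = (Align - (Offset & (Align - 1))) & (Align - 1);
  return Pad > MaxSkip ? 0 : Pad;
}

// Max-skip as described for -flimit-function-alignment: the estimated
// function size minus one.
inline uint64_t limitedMaxSkip(uint64_t EstimatedSize) {
  assert(EstimatedSize >= 1);
  return EstimatedSize - 1;
}
```

For a 4-byte function and 16-byte alignment, max-skip is 3. At offset 12 the function fits in bytes 12..15 without crossing a 16-byte boundary, so the 4 bytes of padding are skipped. At offset 13 it would cross the boundary, so 3 bytes of padding are emitted.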

In LLVM, the x86 backend does not implement TargetInstrInfo::getInstSizeInBytes, making it challenging to implement -flimit-function-alignment.

For anything other than x86, I’d say sure, let the backend estimate it; we have accurate codesize estimates anyway for branch relaxation. But x86 does branch relaxation in the assembler, so from what I recall we don’t have good size estimates, so we might need some cooperation from the assembler to estimate the size.

For our heuristics, especially with functions around 32 bytes (as we consider -falign-functions=32), we can make a simple assumption: all JMP and JCC instructions will fit within 2 bytes (if the 5-byte variant is needed, the function would be larger than 128 bytes). A very rough estimate should suffice, as reaching 80% accuracy is likely achievable and provides enough mitigation for the initial jump table issue.

Extending the .p2align directive to support complex expressions, like label differences, is a tricky path. It could lead to layout non-convergence (.align and .org should avoid utilizing information from a subsequent fragment), and would require using that complex attemptToFoldSymbolOffsetDifference code. I think we should reconsider before we get too deep into it.

Instead of using the CodeGen size estimation, the simplest solution would seem to be to have the assembler increase sh_addralign based on a supplied (via the .prefalign directive) preferred alignment which is tracked separately from the minimum alignment.

The downside vs a size estimation based approach is that this won’t be compatible with -fno-function-sections and external assemblers, but maybe that’s fine; the initial use case for preferred alignment (jump tables) depends on function sections and the integrated assembler anyway.

I ran some experiments internally and found that there was no statistically significant performance difference between a fixed alignment of 32 and an alignment based on the function size.

Sent patches implementing this approach:

The revised approach makes sense to me.

Since I posted the message above some patches were requested to be split up, so for clarity here is the full list of patches in order. This list also includes the CFI jump table relaxation series (last 4 patches).

Just a quick ping on all of the patches, which are still awaiting review.

Post-dev-meeting ping on all the patches :)

New year’s ping on all the patches.

Hi @pcc,

Your initial RFC indicated that this feature would be opt-in. However, after merging the recent changes, we are observing that the alignment of small functions has changed at -O2 in our downstream x86_64 toolchain.

Can you confirm whether this was intentional? Is function alignment now being adjusted based on function size?

Could you please point me to the specific change in your patch series where this behavior was introduced?

Thanks.

Hi @bd1976bris, this was an intentional change. See CodeGen: Emit .prefalign directives based on the prefalign attribute. by pcc · Pull Request #155529 · llvm/llvm-project · GitHub.

The change that introduced this behavior was #155529 (relanded as #182929).

Thanks very much for the clarification and pointer :) I see the size+alignment logic came in with MC: Add directive for specifying a section's preferred alignment. by pcc · Pull Request #150151 · llvm/llvm-project · GitHub. Apologies for not following this in more detail. A worry I have is that the 16-byte alignment has been stable for many years, and I wonder whether certain things (perhaps hot-patching systems?) are assuming that now. I have opened an internal SIE tracker to evaluate this.

To summarize the behavior difference:

% cat a.ll
define void @foo3(i32 %x) {
  ret void
}
define i32 @foo4(i32 %x) {
  ret i32 %x
}

% myllc a.ll -function-sections -filetype=obj -o a.o && readelf -W -S a.o
...
  [Nr] Name              Type            Address          Off    Size   ES Flg Lk Inf Al
...
  [ 3] .text.foo3        PROGBITS        0000000000000000 000040 000001 00  AX  0   0  1
  [ 4] .text.foo4        PROGBITS        0000000000000000 000041 000003 00  AX  0   0  4

# -fno-function-sections is unchanged https://github.com/llvm/llvm-project/pull/155529#issuecomment-3947151340
% myllc a.ll -filetype=obj -o a.o && objdump -dr a.o
...
0000000000000000 <foo3>:
   0:   c3                      ret
   1:   66 2e 0f 1f 84 00 00    cs nopw 0x0(%rax,%rax,1)
   8:   00 00 00
   b:   0f 1f 44 00 00          nopl   0x0(%rax,%rax,1)

0000000000000010 <foo4>:
  10:   89 f8                   mov    %edi,%eax
  12:   c3                      ret

Yes, that’s possible, and those programs would need to start passing -falign-functions. The alignment of 16 was never a guarantee (e.g. -Os/-Oz implied an alignment of 1 even before my changes) and I think these alignment requirements are uncommon enough for it to be reasonable to change the default behavior. A previous version of my change uncovered a bug in the sanitizer runtime (which was in fact hotpatching functions). See: CodeGen: Respect function align attribute if less than preferred alignment. by pcc · Pull Request #149444 · llvm/llvm-project · GitHub

I’ve reviewed previous discussions and agree that the legacy assembler-level handling of preferred vs. minimum alignment was suboptimal.
The newly-introduced .prefalign ELF directive is unusual, as it applies to the section as a whole rather than the current location.

On X86, the preferred alignment is set to 16. However, aligning very small functions to a 16-byte boundary is inefficient. While the current patch addresses this for the -ffunction-sections case, the resulting behavioral gap between -ffunction-sections and -fno-function-sections is inconsistent.
I’ve created [MC,CodeGen] Update .prefalign for symbol-based preferred alignment by MaskRay · Pull Request #184032 · llvm/llvm-project · GitHub to unify the behaviors.

Last, I anticipate some debate regarding the 0 <= body_size < pref_align => ComputedAlign = std::bit_ceil(body_size) rule.

Specifically, we should justify why we chose this heuristic instead of simply ignoring preferred alignment for small functions (the MaxBytesToEmit operand of regular align directives). I mentioned this last August:

The first draft of GCC’s -flimit-function-alignment actually found a lowest power of 2 >= the function size, but it was considered not useful.

Possible explanation: If the cache block size is 64 and the goal is to minimize the number of cache blocks a function spans, it suffices to align the function start to min(64, NextPowerOf2(body_size-1)). That’s the minimum alignment that prevents an unnecessary boundary crossing. On x86 we often use 16 or 32 to prevent excessive wasted padding.
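The boundary-crossing argument can be checked by brute force with a small model. This is a sketch for illustration only: the helper names are hypothetical, and a 64-byte block is used because that is the size assumed in the explanation above.

```cpp
#include <cassert>
#include <cstdint>

// Number of Block-sized, Block-aligned blocks touched by the byte range
// [Start, Start + Size).
inline uint64_t blocksSpanned(uint64_t Start, uint64_t Size, uint64_t Block) {
  assert(Size != 0 && Block != 0);
  return (Start + Size - 1) / Block - Start / Block + 1;
}

// True if every start address that is a multiple of Align keeps a function
// of the given size within the fewest possible blocks. Checking the start
// addresses inside one block is enough because the pattern repeats.
inline bool avoidsExtraCrossing(uint64_t Size, uint64_t Align, uint64_t Block) {
  assert(Align != 0);
  uint64_t Fewest = (Size + Block - 1) / Block;
  for (uint64_t Start = 0; Start < Block; Start += Align)
    if (blocksSpanned(Start, Size, Block) != Fewest)
      return false;
  return true;
}
```

In this model a 3-byte function aligned to 4 never spans two 64-byte blocks, while the same function aligned to 2 can (for example when it starts at offset 62). For a 100-byte function, alignment 64 suffices and alignment 32 does not.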

Thanks for looking at the -fno-function-sections discrepancy! I did not consider that important, but more consistency doesn’t hurt.

My understanding of the historical choice of 16 bytes for the preferred alignment on x86 has always been that it corresponds to the presumed fetch width of the instruction decoder. Looking through Agner’s optimization manual indicates that the fetch width is usually 16-32 bytes. The same manual also indicates that many microarchitectures have chosen to align instruction fetches to the fetch width.

The goal of the implemented preferred alignment formula is to reduce code size while preserving the property that functions smaller than the preferred alignment do not require fetching multiple decode blocks, assuming a microarchitecture where both of the above points are true.

As you point out, the same is true for icache fetches, assuming an icache line size greater than the preferred alignment.