[RFC] Function-Level flatten_depth Attribute for Depth-Limited Inlining

Abstract

This RFC proposes adding a flatten_depth(N) attribute to Clang and LLVM that enables transitive inlining up to a specified depth. Unlike flatten, which only inlines immediate call sites, flatten_depth provides controlled deep inlining of call trees.

Motivation

The existing flatten attribute only inlines immediate call sites (single level). For performance-critical code paths requiring elimination of call overhead across entire call trees, developers currently must either:

  • Use alwaysinline on many functions (affects all call sites globally)

  • Rely on cost-based heuristics that may be insufficient for critical paths

flatten_depth(N) fills this gap by providing controlled transitive inlining within specific calling contexts.

Depth Parameter Design

The primary use case is complete flattening of the call tree - eliminating all function calls within a performance-critical path. The depth parameter serves as a safety mechanism to prevent pathological cases:

  • Recursive functions: Without a limit, recursive calls would cause infinite inlining

  • Unexpectedly deep call trees: Prevents compile-time explosions from very deep call hierarchies

  • Compile-time control: Provides a circuit breaker for cases where full flattening becomes too expensive

In normal usage, developers would set a large depth value (e.g., flatten_depth(50)) with the expectation of flattening the entire call tree, while the limit protects against edge cases.

Design

Syntax

__attribute__((flatten_depth(N)))
void critical_function() {
  // Inline call tree up to depth N
}

Semantics

  • N: Unsigned integer representing target inlining depth

  • Behavior: Hint, not strict constraint - cooperates with other inlining decisions

  • Depth 1: Equivalent to existing flatten

  • Interactions:

    • alwaysinline overrides depth limits
    • noinline respected at any depth
    • Cost model may inline deeper if beneficial
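As a rough illustration of how the depth would be counted under the semantics above, here is a minimal sketch. The FLATTEN_DEPTH macro and all function names are hypothetical; the macro expands to nothing on compilers that do not know the proposed attribute, so the code builds today.

```cpp
// Hypothetical guard macro: flatten_depth is only proposed in this RFC, so
// fall back to an empty expansion when the compiler does not recognise it.
#if defined(__has_attribute)
#  if __has_attribute(flatten_depth)
#    define FLATTEN_DEPTH(n) __attribute__((flatten_depth(n)))
#  endif
#endif
#ifndef FLATTEN_DEPTH
#  define FLATTEN_DEPTH(n)
#endif

static int leaf(int x) { return x + 1; }          // depth 3 seen from hot_path
static int middle(int x) { return leaf(x) * 2; }  // depth 2 seen from hot_path
static int top(int x) { return middle(x) + 3; }   // depth 1 seen from hot_path

// With a depth of 2, the proposal would ask the inliner to fold top() and
// middle() into hot_path(); the call to leaf() is left to the cost model.
FLATTEN_DEPTH(2)
int hot_path(int x) { return top(x); }
```

Under this reading, FLATTEN_DEPTH(1) would touch only the call to top(), which is what the RFC describes as equivalent to the existing flatten.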

Edge Cases

  • Recursive functions: Depth limit prevents infinite inlining

  • Template support: Handles dependent arguments

  • Insufficient depth: Inlines as far as possible

Proposed Implementation

Clang Frontend

  • Add FlattenDepthAttr to Attr.td

  • Support dependent template arguments

LLVM IR

  • Add flatten_depth=N function attribute

  • Update bitcode serialization/deserialization

Inlining Pass

  • Extend AlwaysInliner pass with depth-aware logic

  • Implement call tree traversal with depth tracking

  • The original alwaysinline logic is applied first and takes precedence. The flatten_depth logic runs in the same pass but is applied at the end, after the standard alwaysinline processing.
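The depth-tracked traversal can be sketched on a toy call graph. This is only a model of the idea, not the AlwaysInliner code: ToyFunction, ToyModule and flattenDepth are hypothetical names, and real inlining works on call sites rather than function names.

```cpp
#include <map>
#include <set>
#include <string>
#include <vector>

// Toy stand-in for a function in a module; none of these names are LLVM APIs.
struct ToyFunction {
    std::vector<std::string> callees;  // direct calls, in source order
    bool noinline = false;             // models the noinline attribute
};
using ToyModule = std::map<std::string, ToyFunction>;

// Visits the calls made by `caller`, which sit at level `depth` below the
// attributed root, and records every callee that would be inlined.
static void collectInlined(const ToyModule &m, const std::string &caller,
                           unsigned depth, unsigned maxDepth,
                           std::set<std::string> &inlined) {
    if (depth > maxDepth) return;  // depth limit reached: stop descending
    const auto it = m.find(caller);
    if (it == m.end()) return;     // declaration only: no body to scan
    for (const std::string &callee : it->second.callees) {
        const auto c = m.find(callee);
        if (c == m.end() || c->second.noinline) continue;  // respect noinline
        inlined.insert(callee);
        collectInlined(m, callee, depth + 1, maxDepth, inlined);
    }
}

// Depth 1 means "direct calls of root only", mirroring the existing flatten.
std::set<std::string> flattenDepth(const ToyModule &m, const std::string &root,
                                   unsigned maxDepth) {
    std::set<std::string> inlined;
    collectInlined(m, root, 1, maxDepth, inlined);
    return inlined;
}
```

Because the depth counter grows on every level, a self-recursive callee terminates the walk after at most maxDepth levels, which is the safety property the depth parameter is meant to provide.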

Gradual Introduction

To de‑risk rollout and avoid tying behavior to source APIs prematurely, we can introduce depth‑limited flattening as an LLVM‑side, opt‑in feature. We present two alternative activation mechanisms for discussion.

  • Option A — Parameters (-mllvm flags)

    • Summary: Select target functions and depth from the command line or build rules for quick, reversible trials.

    • Example:

      clang++ -O2 -mllvm -flatten-depth-funcs=my_hot_top -mllvm -flatten-depth=3 source.cc

    • Pros: Fast to prototype, easy A/B and rollback, no new file format.

    • Cons: Harder to manage many functions or per‑function policies at scale.

  • Option B — YAML manifest (application‑owned config)

    • Summary: Pass a small YAML file listing target functions and depth, enabling scalable, per‑function control owned by the application/link stage.

    • Example CLI:

      clang++ -O2 -mllvm -flatten-depth-manifest=flatten.yaml source.cc

    • Example manifest:

      functions:
      - name: my_hot_top
        depth: 3
      - name: foo_bar
        depth: 2

    • Pros: Scales to many entries, clearer ownership at application level, easier long‑term maintenance.

    • Cons: Requires a simple parser and config distribution; slightly higher setup cost.
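To give a feel for the "simple parser" cost of Option B, here is a minimal sketch that reads only the exact manifest shape shown above. parseFlattenManifest is a hypothetical helper; an in-tree implementation would more plausibly build on LLVM's existing YAML I/O support than on hand-rolled string handling.

```cpp
#include <map>
#include <sstream>
#include <string>

// Strips leading and trailing spaces, tabs and carriage returns.
static std::string trim(const std::string &s) {
    const auto b = s.find_first_not_of(" \t\r");
    if (b == std::string::npos) return "";
    const auto e = s.find_last_not_of(" \t\r");
    return s.substr(b, e - b + 1);
}

// Parses only "- name: X" entries followed by "depth: N"; this is not a
// general YAML parser. Returns a map from function name to depth.
std::map<std::string, unsigned> parseFlattenManifest(const std::string &text) {
    std::map<std::string, unsigned> depths;
    std::istringstream in(text);
    std::string line, current;
    while (std::getline(in, line)) {
        std::string t = trim(line);
        if (t.rfind("- ", 0) == 0) t = trim(t.substr(2));  // drop list marker
        const auto colon = t.find(':');
        if (colon == std::string::npos) continue;           // not a key/value line
        const std::string key = trim(t.substr(0, colon));
        const std::string value = trim(t.substr(colon + 1));
        if (key == "name") {
            current = value;  // start of a new entry
        } else if (key == "depth" && !current.empty() && !value.empty()) {
            depths[current] = static_cast<unsigned>(std::stoul(value));
        }
    }
    return depths;
}
```

Feeding it the example manifest yields my_hot_top mapped to 3 and foo_bar mapped to 2; the top-level "functions:" line is skipped because it carries no value.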

Reference Implementation

  • Initial PR: #165777 (Clang + LLVM IR attributes)

  • Draft Implementation: full_flattening (complete prototype with AlwaysInliner changes)

cc: @mtrofin, @efriedma-quic, @erichkeane, @boomanaiden154-1, @WenleiHe, @ych

Echoing some of my comments from the PR, I don’t think this really addresses the stated motivation well. Flattening an entire function is going to increase icache pressure by quite a bit. While the dynamic instruction count will likely also go down, the increased icache pressure will probably end up decreasing performance (we have observed this before when biasing the inlining heuristic when compiling clang). The key to good inlining is selectivity, usually determined with call site hotness information obtained from profiles. Inlining cold callsites (which this attribute will do) will lower performance.

There might be other motivations where this makes more sense, like passing a function to an LLM or other downstream tool that then does not need to deal with as many calls. That does not necessarily need a C++ attribute though, and could be easily achieved with just changes to the always inliner pass. The overall code changes aren’t very large, so I think this is fine if it’s useful to someone and they’re willing to maintain it (which seems like the case), but to me this doesn’t really seem to solve the stated problem well.

Thanks for the feedback.

Passing function names via an LLVM parameter would work for our use case. If there’s consensus that this approach is preferable to a source-level attribute, I’m happy to implement it that way instead.

Two things:

  • I was hoping my comment on the referenced initial PR would be addressed: “An aspect that I hope the RFC could cover is whether this could be initially implemented LLVM-side only (e.g. via some manifest file saying which functions should be treated this way) - effectively a gradual introduction.” In summary: any design alternatives? Why/why not?
  • Suppose this flag exists. Suppose I place it on a function A that accepts a std::function, F, as argument. A calls F. Now someone uses my code and calls A with an F that has a pretty large transitive closure and leads to excessive compilation times. I need to avoid the compilation timeouts. What workarounds do I have? (I’m proposing this as a scenario to discuss tradeoffs of the feature being tied to the definition of an API)
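The second scenario can be made concrete with a small sketch. All names here are hypothetical, and the proposed attribute is left out so that the code builds with current compilers.

```cpp
#include <functional>

// Hypothetical library API (function A in the scenario). Its author cannot
// see which callables future callers will pass in.
int apply_twice(const std::function<int(int)> &f, int x) {
    return f(f(x));
}

// Hypothetical client code with its own call tree behind the callback (F).
static int deep3(int x) { return x + 1; }
static int deep2(int x) { return deep3(x) * 2; }
static int deep1(int x) { return deep2(x) - 3; }

int client(int x) {
    // If the call through std::function has been resolved to a direct call by
    // the time the inliner runs, a transitive flattening request attached to
    // apply_twice() would also cover deep1(), deep2() and deep3().
    return apply_twice([](int v) { return deep1(v); }, x);
}
```

The depth needed to cover this tree is decided by the client's code, while the attribute and its parameter would live on the library's definition of apply_twice().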

Thanks for the feedback!

I added a “Gradual Introduction” section—does it address your first point?

On the second point: I plan to add the flattening logic to AlwaysInliner (per-TU). Indirect calls will not be inlined; template functors and other direct calls may inline. The depth parameter caps transitive inlining to avoid deep call-tree pathologies. Note that flatten and alwaysinline can also encounter pathological cases, even without transitive flattening.

Thanks.

In my example, whether or not the lambda is an indirect call at the time of AlwaysInliner depends on pass ordering. It is conceivable that it’s not indirect.

The API author wouldn’t have knowledge of all possible uses to set the depth parameter correctly.

BTW - to check my understanding - flatten, in Clang/LLVM, really translates to the callsites of the marked function being labeled as “alwaysinline” and does not perform anything recursive; so in this context, the discussion is really about alwaysinline. I don’t disagree that alwaysinline can hit pathological cases, but I’d argue that it does so less, because the effects are local. I think the likelihood of unexpected or hard-to-reason-about effects, in the presence of code evolution (and API authors different from consumers), is higher with a recursive flatten.

Clang question: is there a way to introduce experimental attributes, i.e. something that would allow the community to change its mind at a later date?

I’d imagine an option “C”:

  • add the attribute experimentally
  • make the LLVM behavior opt-in

This should avoid proliferation before sufficient experimentation and maturation. It should allow easy deprecation in the worst case; in the best case, if all works well, it’d just require a bit of search & replace to the “mature” name.

Separately: what are, or should be, the semantics after thinlink?

Oh wait. This is different from GCC’s flatten attribute, which does all levels. This also explains why you have 1 being the same as the current flatten attribute, when I had expected 0 to mean all levels.

https://groups.google.com/g/llvm-dev/c/gGRCEi9g4ac/m/sEKFTnTGAwAJ for old reference here.

Note that I do like the idea of having a flatten_depth attribute; I am just pointing out that the current flatten attributes behave differently between the two compilers. I filed a GCC bug asking for this new attribute to be added there; as I mentioned, I really like the idea of being able to control the depth because of compile time. Since GCC implemented flatten as infinite depth, it has received some bug reports about the attribute in combination with LTO, and even about the compile-time difference between LLVM and GCC when using the attribute.
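The difference between the two compilers can be observed with a small three-level chain. This is a sketch with hypothetical function names; flatten itself is an existing attribute in both GCC and Clang.

```cpp
static int level3(int x) { return x * x; }
static int level2(int x) { return level3(x) + 1; }
static int level1(int x) { return level2(x) + 2; }

// As described in this thread, GCC tries to inline level1, level2 and level3
// into entry(), whereas Clang at the time of the discussion only forces the
// direct call to level1 and leaves the deeper calls to the cost model.
__attribute__((flatten))
int entry(int x) { return level1(x); }
```

With callees this small, the regular cost model will usually inline everything at higher optimization levels anyway, so the divergence is easier to see with larger callees or at low optimization levels.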

I don’t like this. The call depth limit seems problematic for any uses other than “pick a depth large enough to make sure everything is flattened”: the right depth is going to be non-obvious as soon as code outside your control is involved, such as STL functions, especially those accepting callbacks. Worse, the call depth is going to change unpredictably with library updates. Your performance is at risk of fluctuating whenever the libc++ implementation decides to add or remove an extra intermediate function somewhere.

I think if we want to do something in this area, it should probably be to add flatten_recursively (or change flatten) which fully flattens the call-tree and errors on cycles. That at least gives you reliable behavior (though I echo previous concerns that this will often not actually be profitable).


I am strongly opposed to the “gradual introduction” alternative. We should not invent a special snowflake mechanism for this feature where a standard attribute would do.

Thanks everyone for the thoughtful feedback—this is really helping shape the proposal. Building on the discussion, here’s what I’m thinking:

Optional Depth Parameter

What if we make the depth parameter optional?

// Full recursive flattening (matches GCC behavior)
__attribute__((flatten_depth))
void critical_function();

// With optional depth limit
__attribute__((flatten_depth(10)))
void another_function();

This would give us GCC compatibility by default, while still providing an escape hatch for compile-time control when needed. Does this address the concerns about depth unpredictability?

ThinLTO Scope

For the ThinLTO question—I’m inclined to start with within-TU flattening only. This keeps the initial implementation simpler and easier to reason about. If folks find use cases for cross-TU flattening later, we could explore extending it. Does that seem reasonable, or are there scenarios where cross-TU support would be essential from the start?

On Sharp Edges

I hear the concerns about maintainability and unexpected behavior with callbacks/library code. These are valid tradeoffs. My thinking is that this falls into the same category as asm blocks or aggressive pragmas—tools that most code shouldn’t use, but that can be the right choice in specific situations where developers have profiled and understand the risks.

Would clear documentation outlining these tradeoffs be sufficient? Or do folks see a need for additional guardrails?

Gradual Introduction

Given the feedback, I’m leaning toward a standard attribute rather than the YAML/flag-based approach—but I’m open to other perspectives here.

What do you all think? Happy to iterate further on any of these points.

Right now GCC’s flatten attribute under LTO is a full one across all TUs (even in WHOPR mode, which I think is similar to ThinLTO). We (GCC) have talked about having an attribute limited to the current TU, but it has not been implemented yet.

Note that this has been asked in the Rust community as well; see “We may need some sort of #[flatten]” (post #2 by comex, Rust Internals).

However, it would probably require an LLVM change. Clang currently doesn’t seem to respect the recursive aspect of __attribute__((flatten)). It only forces a single level of inlining, unlike GCC.

Frankly, the way flatten is implemented today in Clang almost feels buggy: flatten implies collapsing all levels, and inlining one level isn’t really flattening.

Sure, it’s not for the average C++ developer, but the same can be said of many other tools exposed by the compiler (always inline, asm blocks, etc.). Given that 1) some users have a need for it, 2) the proposed change matches GCC behavior, and 3) there is a separate need for it from the Rust side, I think we should take the RFC/change.

The specifics of flatten_depth vs flatten_recursive can be worked out in the PR. But honestly, I found the current behavior (flatten single level) rather confusing, and it’s probably best to just match GCC, though I also understand the need for backward compatibility.

Another option to consider: instead of (or in addition to) a depth parameter, we could use an instruction limit as an internal guardrail.

Proposal
The __attribute__((flatten_recursively)) attribute would remain simple from the user’s perspective — no parameters needed. Internally, we would track the instruction count of the function being flattened and stop inlining once it exceeds a configurable threshold.

 __attribute__((flatten_recursively))
 void hot_function();

The limit would be controlled via an internal flag:
-mllvm -flatten-recursively-instruction-limit=N
with a sensible default (e.g., 10,000–50,000 instructions) that prevents pathological code bloat while not interfering with legitimate use cases.

Rationale

  • Simpler UX — Users don’t need to reason about call graph depth or tune parameters
  • Direct code size control — Provides a predictable bound on function growth
  • Precedent — A similar instruction limit approach is already used in the sample profile loader pass for its inlining decisions

Diagnostics
When the limit is reached, we could emit an optimization remark (-Rpass-missed=inline) to inform users that flattening was incomplete due to the instruction limit.
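A toy model of the instruction-limit guardrail might look as follows. SizedFunction, BudgetResult and flattenWithBudget are hypothetical names, and the instruction counts are made-up inputs rather than anything computed from IR.

```cpp
#include <algorithm>
#include <map>
#include <string>
#include <vector>

// Toy model of a function body; these types are illustrative, not LLVM APIs.
struct SizedFunction {
    unsigned instructions = 0;         // size of the function's own body
    std::vector<std::string> callees;  // direct calls, in order
};
using SizedModule = std::map<std::string, SizedFunction>;

struct BudgetResult {
    unsigned size = 0;                 // running size of the flattened root
    std::vector<std::string> inlined;  // call sites that were inlined
    std::vector<std::string> skipped;  // call sites kept as calls (remark candidates)
};

static void inlineWithinBudget(const SizedModule &m, const std::string &caller,
                               unsigned limit, BudgetResult &r) {
    const auto it = m.find(caller);
    if (it == m.end()) return;
    for (const std::string &callee : it->second.callees) {
        const auto c = m.find(callee);
        if (c == m.end()) continue;  // no body available
        // Count every inlined body as at least one instruction so that
        // recursive calls always make progress towards the limit.
        const unsigned cost = std::max(1u, c->second.instructions);
        if (r.size + cost > limit) {
            r.skipped.push_back(callee);  // would exceed the limit: keep the call
            continue;
        }
        r.size += cost;
        r.inlined.push_back(callee);
        inlineWithinBudget(m, callee, limit, r);
    }
}

BudgetResult flattenWithBudget(const SizedModule &m, const std::string &root,
                               unsigned limit) {
    BudgetResult r;
    const auto it = m.find(root);
    if (it != m.end()) r.size = it->second.instructions;
    inlineWithinBudget(m, root, limit, r);
    return r;
}
```

The skipped list corresponds to the places where a missed-inline remark could be emitted. Note that which callees end up skipped depends on visiting order, which is one reason the outcome of a size limit can be hard to predict from the source.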

Agree – I didn’t realize Clang diverged from GCC and only flattened a single level, until this thread started.

This attribute is so very rarely used (and even more rarely is such use actually justifiable), that I doubt it’s worthwhile to add a bunch of additional variants or options here. Just make it flatten recursively “if possible” as GCC does, and call it a day.

“Another option to consider: instead of (or in addition to) a depth parameter, we could use an instruction limit as an internal guardrail.”

I’d still prefer a simple depth limit if we ever want a limit. An instruction count is more difficult to use as a limit: it’s less obvious what value should be chosen and, for a given value, which callees or which part of the call tree would be inlined. We could move the depth limit to a command-line flag instead of the attribute, though.

I didn’t either, though I had suspected something because of the compile-time difference between GCC and LLVM on code using the flatten attribute (and yes, there is some code out there that uses it: Highway and Ladybird’s JavaScript interpreter are the examples I currently know of, precisely because of that compile-time difference).

Right. If we can get away with it, I’d like to start by just changing the semantics of the current flatten attribute to match GCC.

Limits to flattening are something we can reconsider later if they turn out to be really necessary, hopefully with a better understanding of the exact requirements for such a limit to be useful and robust.

Hi all,

I’d like to summarize where we are and check if we have consensus on the path forward.

Based on the discussion so far, the proposed plan is:

  1. Change the existing flatten attribute behavior to match GCC — meaning recursive inlining of all calls within the flattened function, not just top-level call sites.

  2. Error out if a cycle is detected — if we encounter recursive calls that would lead to infinite inlining, we’ll emit an error rather than silently truncating or producing unexpected behavior.

  3. No guardrails initially — we won’t add depth limits or other safety mechanisms in the initial implementation. Users who apply flatten are explicitly requesting aggressive inlining and should understand the implications.

Are there any objections to this plan? If not, I’ll proceed with implementation.

Given that the spec is to inline if possible, I’d expect the implementation to simply not inline the same function recursively, rather than throwing an error. That’s the behavior GCC has.
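The "inline if possible, but never the same function recursively" rule can be modeled on a toy call graph. This is a sketch of the behavior as described in this thread, not GCC's or LLVM's actual algorithm; CallGraph, Flattened and flattenRecursively are hypothetical names.

```cpp
#include <algorithm>
#include <map>
#include <string>
#include <vector>

// Maps each function that has a body to its direct callees (toy model only).
using CallGraph = std::map<std::string, std::vector<std::string>>;

struct Flattened {
    std::vector<std::string> inlined;    // callees folded into the root
    std::vector<std::string> keptCalls;  // calls left in place
};

static void flattenInto(const CallGraph &g, const std::string &fn,
                        std::vector<std::string> &active, Flattened &out) {
    const auto it = g.find(fn);
    if (it == g.end()) return;
    for (const std::string &callee : it->second) {
        // A callee already on the current inlining path would recurse forever.
        const bool onPath =
            std::find(active.begin(), active.end(), callee) != active.end();
        if (onPath || g.find(callee) == g.end()) {
            out.keptCalls.push_back(callee);  // recursion or no body: keep call
            continue;
        }
        out.inlined.push_back(callee);
        active.push_back(callee);
        flattenInto(g, callee, active, out);
        active.pop_back();
    }
}

Flattened flattenRecursively(const CallGraph &g, const std::string &root) {
    Flattened out;
    std::vector<std::string> active{root};  // the root counts as being on the path
    flattenInto(g, root, active, out);
    return out;
}
```

Cycles therefore degrade to ordinary calls instead of producing an error, and the walk still terminates without any depth or size limit.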

Makes sense. Let’s match the GCC behavior.

The new PR is llvm/llvm-project Pull Request #174899 by grigorypas: “[LLVM] Add flatten function attribute to LLVM IR and implement recursive inlining in AlwaysInliner”.