[RFC] Towards Disallowing Struct/Array IR Values

tl;dr: LLVM supports arbitrarily large values inside functions, which repeatedly cause problems in the back-end. Therefore, in the long term, disallow struct/array values inside functions (with exceptions for ABI requirements). In the short term, disallow array/struct loads/stores.

LLVM supports structs and arrays as IR values inside functions (i.e., arguments or types of instructions), which can get arbitrarily large (e.g., load [100000 x i32]). Right now, the following IR operations support structs or arrays:

  • Arguments/return values (implying ret, call&friends).
    • Structs are the only way of having multiple return values.
    • Some targets, e.g. AArch64, use arrays to indicate homogeneous aggregates that do not get split. (Some other targets might also have uses.)
  • landingpad. (I don’t know about SEH instructions.)
  • load/store.
  • select.
  • freeze/extractvalue/insertvalue (just needed to support the other instructions).
  • phi nodes.
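For concreteness, the multiple-return-value case from the first bullet looks like this in IR today, using one of the overflow intrinsics (whose result is a literal two-element struct unpacked with extractvalue):

```llvm
declare { i32, i1 } @llvm.sadd.with.overflow.i32(i32, i32)

define i32 @checked_add(i32 %a, i32 %b) {
  ; The intrinsic returns two values packed into a literal struct.
  %pair = call { i32, i1 } @llvm.sadd.with.overflow.i32(i32 %a, i32 %b)
  %sum = extractvalue { i32, i1 } %pair, 0
  %ovf = extractvalue { i32, i1 } %pair, 1
  ; Clamp to -1 on overflow (just to use both results).
  %res = select i1 %ovf, i32 -1, i32 %sum
  ret i32 %res
}
```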

However, apart from multiple return values and ABI considerations (which I don’t want to touch here), struct/array loads/stores are not really useful: accessing or combining elements needs separate instructions anyway, and InstCombine already canonicalizes these into separate loads/stores; the store(load(a), b) and store(zeroinitializer, a) cases are better served by memcpy/memset.
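As a sketch of the memcpy point (the type and the 24-byte size are illustrative, assuming a typical 64-bit layout):

```llvm
%T = type { i64, [4 x i32] }

; Discouraged: whole-aggregate copy through an SSA value.
define void @copy_fca(ptr %dst, ptr %src) {
  %v = load %T, ptr %src
  store %T %v, ptr %dst
  ret void
}

; Preferred: the same copy expressed as a memory intrinsic.
define void @copy_mem(ptr %dst, ptr %src) {
  call void @llvm.memcpy.p0.p0.i64(ptr %dst, ptr %src, i64 24, i1 false)
  ret void
}

declare void @llvm.memcpy.p0.p0.i64(ptr, ptr, i64, i1)
```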

The back-end must decompose aggregate types into their components anyway, but this is generally handled rather poorly, and there are some (undocumented) limitations on the number of supported units (hit most recently here; Rust also hit these before). Aggregate values also carry a substantial compile-time cost: compiling the same code written as simple scalar instructions is faster. We have also discouraged front-end authors from using aggregates for over a decade.

Therefore, I propose to deprecate aggregate load/store in LLVM 21 and remove them in LLVM 22 (or 23).

After this, we could continue with:

  • Remove struct/array select/phi/freeze. After this, aggregate values inside functions only serve ABI purposes for assigning multiple registers.
  • Disallow instructions/arguments of aggregate types other than structs that consist solely of first-class types (i.e., no nesting). insertvalue/extractvalue would only support a single constant integer as index.
  • Limit aggregates in parameter/return types to a certain size (e.g., 64 elements (<= #registers) in total).
  • Once we figure out how to handle the ABI properly: perhaps eliminate them entirely from functions, or impose additional requirements (e.g., extractvalue must immediately follow a call). Very unclear, though.
  • (Maybe put a limit on vector types as well… or is there a point in supporting a <43321 x i32>?)

Advantages:

  • Avoids a common trap for hitting back-end bugs/rarely tested/used code paths when using large aggregates.
  • Avoids a common trap for ending up with long compile times.
  • Possibly simplifies back-ends.

Disadvantages:

  • Front-ends can no longer use aggregates to generate IR more easily.
  • Possibly a lot of breakage, also downstream (e.g., DXIL/GPU, MLIR/Flang, other targets). (Through this RFC, I hope to get an estimate of the affected cases and how hard they’d be to fix.)

I haven’t come up with a more detailed implementation plan (nor can I commit to implementing this right now). My intention with this RFC is to evaluate whether this is feasible/realistic at all and to ask for comments/objections on these changes.


I agree that we should remove support for load and store of aggregates. There is no good reason for these operations to exist, but the fact that they do exist is a common footgun for new frontend authors.

What does “deprecate” here mean in a practical sense?

Generally, I’d expect the first step for this RFC to be removing all places in Clang that emit aggregate load/store, as we cannot remove the operations without that.

For the future plans:

I think these are useful to support as long as we have aggregates as first-class values at all. Otherwise we get annoying additional special cases (e.g. tokens also can’t form phis, but here we would actually want to form a phi, but via unpacking+repacking, so it’s not the same situation).

I’d also add that we should only support literal (i.e. unnamed) structs as args/returns. This should mostly prevent accidental use.

Probably just documenting our intention of removal in release notes and LangRef so that users/downstreams (including our in-tree users like Clang) have more time to prepare.

Hmm, I see that having phi/freeze might make things easier. But select could be rewritten as branches with phis? It’s not that our codegen is particularly good in any case (example), but the non-select code is probably better.
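A sketch of the select-to-branch rewrite suggested here, for a hypothetical two-field struct %T:

```llvm
%T = type { i32, i32 }

; Today: aggregate select on SSA struct values.
define %T @sel(i1 %c, %T %a, %T %b) {
  %r = select i1 %c, %T %a, %T %b
  ret %T %r
}

; Rewritten as a branch plus phi carrying the same value.
define %T @sel_branch(i1 %c, %T %a, %T %b) {
entry:
  br i1 %c, label %t, label %merge
t:
  br label %merge
merge:
  %r = phi %T [ %a, %t ], [ %b, %entry ]
  ret %T %r
}
```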

Good idea.

I would add: having one release where the API is decorated with the LLVM_DEPRECATED macro (a pre-requisite for this is to eliminate all uses of the API in-tree of course).
(it’s not a requirement of course: we don’t have stable APIs in LLVM in general, but nice to have for widely used things when possible I guess?)


I don’t think there is any API that can be deprecated in this instance (this goes through the general load/store APIs).


This will break backwards compatibility with older bitcode. If we’re planning to auto-upgrade such bitcode, why couldn’t we have a “canonicalization” pass that does what the auto-upgrade would have done?

As a front-end author, I think there are some operations that are impossible to express to the LLVM back-end without correctly implemented support for load and store of aggregates, in particular if those aggregates contain a non-integral-addrspace pointer. Even right now, there is a problem that atomic loads and stores fail in the verifier when the addrspace semantics require them to operate on a struct containing a non-integral pointer in order to preserve the correct pointer provenance. There may be no way for the frontend to legally express that operation other than using a first-class struct: wrapping the load and store in ptrtoint/inttoptr to rewrite these ptr addrspace(NI) values as mere integer operations might be invalid. I don’t yet know of any good workaround, but it is a correctness problem we’ve run into recently, caused by these verifier rules, when users try to use atomics on 65-bit to 128-bit sized structs, formed by a pair of a GC-rooted object and anything else.

Although it is a bad idea to feed such a function into the backend, why put such a strict limit on the number of return values for all LLVM IR functions (including always_inline, intrinsics, etc.)?

Do these restrictions really help? How do they prevent accidental use? For defining and calling a function such as __regcall struct S foo() { ... }, a frontend will normally use a named (possibly nested) %struct.S, the same struct type created for other uses of S. Why add the burden of maintaining an expansion into a (flat) literal struct here?

Shouldn’t this only affect atomics? For non-atomic operations, splitting the load/store should always be legal. Aggregate atomics aren’t supported right now and would require design work (essentially, this is about {ptr addrspace(N),i64}). This rare case could perhaps also be solved with intrinsics. I don’t know enough about this area to come up with a good design here.

For always-inline, SROA should promote sret to components. Which intrinsics have such a large number of results?

That’s the point: front-ends should, in my opinion, not do this and not mix the in-memory representation of data with the IR value representation of data. Aggregates at function boundaries almost always need explicit lowering in the front-end anyway.

Yes, but note that atomicrmw and atomic compare exchange currently require returning the nested struct {{oldstruct}, success}. I don’t care too much about whether the nesting gets flattened, though, since nesting is not useful there and generally makes the IR more complicated. If there were instead appendvalue / popvalue intrinsics, I think that would be sufficient.

I’m not sure what you mean by adding a new intrinsic just for atomic FCA loads and stores, since load and store already mostly support this (just not in the verifier when marked atomic). If we disallow FCA loads, it sounds like we’d be deprecating load in order to replace it with new-load and new-load-fca, where the latter is the same as the old load but with duplicate logic everywhere to handle it.

This is not just a verifier limitation, I don’t think any back-end supports aggregate atomics (just checked x86, which runs into an unreachable while lowering the type).

For this case, we essentially need an atomic xchg (the only atomicrmw operation that supports pointers), which can also act as store, and a cmpxchg, which can also act as load, for pairs of (non-integral) pointer and some other pointer-sized type. This is a much more limited type set than arbitrary aggregates. As the use case is apparently very rare, I think that it could be either served by dedicated intrinsics or by a focused expansion of atomicrmw xchg and cmpxchg.
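For context, the single-pointer forms already exist in IR; it is only the pointer-pair case that would need the focused expansion or intrinsics proposed here. Roughly what exists today:

```llvm
; atomicrmw xchg and cmpxchg already accept plain pointer operands:
define ptr @swap_ptr(ptr %slot, ptr %new) {
  %old = atomicrmw xchg ptr %slot, ptr %new seq_cst
  ret ptr %old
}

define ptr @cas_ptr(ptr %slot, ptr %expected, ptr %new) {
  %r = cmpxchg ptr %slot, ptr %expected, ptr %new seq_cst seq_cst
  %old = extractvalue { ptr, i1 } %r, 0
  ret ptr %old
}
```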

In any case, it doesn’t need non-atomic aggregate load/store, whose deprecation and removal is the main point of this RFC.

Forgot to respond here: for most cases (non-volatile load/store of structs/arrays without padding), InstCombine already does this canonicalization.

Then why do we need to remove it from the IR?

I’m strongly opposed to making changes to the IR that affect existing consumers. If the current handling of aggregate values is inadequate, we should improve it. Changing the IR in such a way is a solution of last resort.

We may discourage people from using aggregate values, but since they are allowed in the IR, any bitcode that uses them is valid. We don’t have to promise good performance, but we should still generate correct code. The (perhaps low) quality of our implementation is not a reason to alter the IR.

In any case, it doesn’t need non-atomic aggregate load/store,

The optimizer considers that a valid legalization for some architectures (without SMT) or a valid strength reduction when the memory is provably local, so yes, it is still needed. It also isn’t just a pair, since the struct can be packed with either leading or trailing bytes, and pointers can be 32 bits within 128-bit atomics, so there are a lot of combinations of legal operations whose correct handling you’re proposing to deprecate in LLVM. The volatile case might also be interesting to some users?

To be clear, I am perfectly fine with eliminating the ability to make a nested FCA, and probably combining vector and array too, but there seems likely to be a need for FCA load and store as long as pointer provenance exists in LLVM.

Because this is a feature that provides ~zero benefits but is a common trap for users and exposes known-broken, untested, and slow code paths in the back-ends.

IMO, for every feature that we support, there should be a valid use case that is actually practically important for some users.

My motivation with this RFC is to find users who have actual, concrete use cases that would be unreasonably difficult to handle without this feature. If no compelling use case exists, there’s no point in keeping a known-broken feature where we tell users not to use it. It’s also worth noting that the limitations are well known and nobody has really worked on them in over a decade.

Aggregate-typed values are a design decision that might have made sense 20 years ago, but with the progressive removal of type information from LLVM (e.g., opaque pointers, ptradd; see Nikita’s presentation at this year’s EuroLLVM), there are ever fewer reasons to keep them around.

… and therefore, back-ends must support this and users will use them to shoot themselves in the foot and then complain about LLVM being slow/broken.

For this transformation, the optimizer can just expand the load/store into its components. I don’t see why this is needed.

What is the actual use case here? I assume it’s pointer tracking for GC? Again, this is not something that is expressible in the IR right now and I’d expect aggregate atomics to be not trivial to implement; deprecating non-atomic aggregate load/store isn’t going to take away anything in that regard.

What’s a concrete motivation here? Why would someone need to use a volatile aggregate load/store over loads/stores of the individual components?

Here are my reasons to oppose this.

The IR is the core API for consumers to interact with the compiler. Making backward-incompatible changes will break a lot of existing code. We can make such changes, but they should be well motivated: don’t touch the IR unless we have compelling reasons to do so. This RFC has it reversed: it starts with the claim that aggregates are not very useful and that we should remove them unless there is a reason not to.

In terms of simplicity, this is just kicking the can down the road (or “up” the road in this case). Instead of LLVM, the frontends will now have to deal with generating proper, and now more complex, IR. If anything, we should try to make it easier for frontends to generate correct IR and deal with complexities like this ourselves. That we don’t handle it very well right now is not an argument at all. If our implementation is buggy, we should fix it; instead, this RFC seems to be trying to make it somebody else’s problem. We (as the LLVM community) are best versed to deal with such things, and if we can’t be bothered to get it right, should we expect frontend authors to do it for us?

One thing you’ve never addressed: if we already have code that “canonicalizes” aggregate loads/stores, at least in most cases, why can’t we extend it to handle all cases and just make it a required part of the compilation pipeline? We could have an initial stage where aggregates are allowed and a remaining part where they aren’t.

Probably some more context here is in order. The core problem with aggregate values is that the SSA-based IR representation is fundamentally unsuitable for them. SSA implies that any modification of the aggregate requires creating a new aggregate with a single field modified. This is a model that works nicely if you are dealing with 2-element aggregates, but breaks down terribly if you try to use it for a 10000-element aggregate. This is not a question of the implementation being bad, it’s a question of the model being bad.

The correct way to handle aggregates is in-memory, which does not suffer from this problem. You can load and store individual members of the in-memory aggregate – and SROA can convert this to working on individual scalar SSA values where possible and profitable. But working with the whole aggregate as an SSA value generally does not make a lot of sense.
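A minimal sketch of the in-memory style described here (the type and field layout are made up); SROA turns all of this into plain scalar SSA values when the alloca does not escape:

```llvm
%pair = type { i32, i32 }

define i32 @sum_fields(i32 %a, i32 %b) {
  ; Manipulate the aggregate in memory, one field at a time...
  %p = alloca %pair
  %f0 = getelementptr inbounds %pair, ptr %p, i32 0, i32 0
  store i32 %a, ptr %f0
  %f1 = getelementptr inbounds %pair, ptr %p, i32 0, i32 1
  store i32 %b, ptr %f1
  ; ...and load individual members back; no aggregate-typed SSA
  ; value ever exists. SROA reduces this to just %a and %b.
  %x = load i32, ptr %f0
  %y = load i32, ptr %f1
  %s = add i32 %x, %y
  ret i32 %s
}
```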

Now, we could improve handling of this representation. We could port optimizations that currently work on memory and try to make them work on SSA aggregates. We could have “SSA SROA”, etc. But this would just be wasted effort going down an undesirable road. Our general position is that it doesn’t make sense to significantly improve SSA aggregate handling, because it is not the right representation for the domain.

Now, regarding the canonicalization: what this does is expand aggregate load/store into scalar extractvalue+store or load+insertvalue. The load/store is scalarized, but it still works on SSA aggregates. We could easily provide that functionality in IRBuilder so that frontends don’t have to emit this pattern, but this is not the final outcome we actually want. It is probably useful for some edge cases (e.g., when interfacing with SSA aggregates at ABI boundaries), but generally the frontend needs to emit the IR in a way that never introduces SSA aggregates in the first place.
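The scalarization described here, sketched for a hypothetical two-element struct %T: a single aggregate store becomes per-element extractvalue + store (loads are handled symmetrically with load + insertvalue):

```llvm
%T = type { i32, i64 }

; Before: one aggregate store of an SSA struct value.
define void @store_fca(ptr %p, %T %v) {
  store %T %v, ptr %p
  ret void
}

; After: extract each element and store it through a field GEP.
define void @store_scalarized(ptr %p, %T %v) {
  %e0 = extractvalue %T %v, 0
  %g0 = getelementptr inbounds %T, ptr %p, i32 0, i32 0
  store i32 %e0, ptr %g0
  %e1 = extractvalue %T %v, 1
  %g1 = getelementptr inbounds %T, ptr %p, i32 0, i32 1
  store i64 %e1, ptr %g1
  ret void
}
```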


As a more philosophical comment, I have received this feedback many times: LLVM has a lot of sharp edges to cut yourself on, where it looks like LLVM supports something, but if you actually try to use it, you are going to have a very bad time. It is easier if you get a hard error right away and can implement correct handling from the start, instead of having to revisit your design months down the line, once things start to break down.

It is telling that we run into these problems even in clang and flang – if even experienced developers familiar with LLVM can’t get things right, how can new frontend authors?


Thanks for the explanation. The motivation you presented does make sense to me.

I tried some tests and when the aggregate value came from a load, instcombine successfully replaced the value with a bunch of separate scalar loads (leaving no aggregate values in the IR anymore).

I’m not sure if it’s acceptable for instcombine to do that, but for aggregate values that can’t be traced back to a load (e.g., the poison + insertvalue idiom), we could store them to the stack the first moment they show up and introduce such a load.

Alternatively, we could do the opposite: require that all aggregate values come from loads (i.e., have associated storage). The minimal frontend workarounds for that should be relatively simple.
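For reference, the poison + insertvalue idiom mentioned above, which builds an aggregate value without any backing memory:

```llvm
%T = type { i32, i32 }

; Start from poison and fill in each field; no alloca involved,
; so there is no load to trace the value back to.
define %T @build(i32 %a, i32 %b) {
  %v0 = insertvalue %T poison, i32 %a, 0
  %v1 = insertvalue %T %v0, i32 %b, 1
  ret %T %v1
}
```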

This would be a bad change on both practical and ideological grounds, precisely because SSA is a good thing.

SSA values are magical. Easy to work with. No aliasing problems. Pure abstraction over unknown values. They are by far my favourite design choice in LLVM.

The arguments that aggregates might be large or inconvenient to mutate apply to vectors as well. The cost of dropping that abstraction would be felt immediately.

That backends sometimes fall over badly on aggregates is noteworthy, but primarily means something needs to expand them into a bunch of separate loads/stores, which I’m pretty sure is already implemented. If it isn’t, we should have a pass that backends can call, probably parameterised by how large an object can be before that backend doesn’t want it as a single SSA value.

I want more aggregate SSA values in LLVM, as they’re the obvious representation for manipulating tensor operations of various sorts. I’d go so far as to say that if we can’t have first class representations of the data structures used by the majority of machine compute today, they should use a different compiler.

^ I’m essentially taking exactly the opposite stance to this. Working with in-memory representations of values is the “wrong” way, whether the values are i32, vectors, or aggregates. I’m pretty sure the general fix for GPU shared-memory representation I’ve been fretting about involves lifting the ‘global’ values to something closer to SSA. However, I would agree that the LLVM operations for acting on the parts of the values are inconvenient/inefficient, and I would welcome improvements to those that don’t lose the capability.

Probably worth qualifying that I don’t mind there being an upper size (in bytes) to what we accept, but I’d want that limit to be of the order of the register file size of the machines we’re generating code for.

I think it’s pretty important to distinguish what we call “aggregates” (structs and arrays) and “vectors” – while those are also aggregate-like, the way you work with them is actually scalar-like.

A key distinction is that when you are working with vectors, you are usually operating on the whole vector. Yes, you can also manipulate individual elements with insertelement/extractelement, but that’s not the primary purpose of vector types.

(Struct and array) aggregates on the other hand literally only support element-wise operations (and copying). And the element-wise operations are exactly the problematic ones.

I don’t think it’s a problem to have large SSA values per-se, as long as the value is largely used as a unit, rather than manipulated element-wise.

I probably should have qualified that more. Of course we want to lift things from memory to scalar SSA values as much as possible. We try very, very hard to do that. My point here was that if you can’t directly emit IR in terms of scalar SSA values (which most frontends, including clang, can’t), then the correct thing to do is manipulate them in memory and let LLVM handle the promotion to scalar SSA values where possible (where “scalar” here includes vectors, as noted above).

I suppose the difference of opinion here is I also think of the primary purpose of an aggregate being operating on the whole thing, largely through function calls / intrinsics, and mutating individual fields being the edge case.

I strongly don’t want that ripped out of IR, because instead of moving one value around and passing it through call chains, one would get a spew of separate alloca/load/store on each side of the call, for the benefit of working around some backends that don’t want to expand the values? Bad tradeoff.

I suppose I’d characterise this as: we should do a better job of promoting aggregates from memory to SSA values. SROA is a good trick, but it’s not always applicable. That seems much better than declaring that load/store no longer work on aggregate values.

Vaguely related to this, I do not love the design choice where Clang bursts structs into pieces for function calls based on the target ABI, instead of leaving it as an aggregate for the backend to disassemble.