[RFC] De-type-ification of LLVM IR: why?

Can somebody explain for what good the LLVM IR gets lower and lower every day (opaque pointers, gep removal, de-type-ification) to the point that type information from the front-ends is intentionally lost in the translation? (Oh, wait, is it for faster compilation times? Yeah, I know, it's that hard question of how fast we can compile vs. how good the code produced by the compiler is…)

@nikic is speaking about a future of LLVM IR that looks to me more and more like a compiler back-end IR (similar to GCC's RTL representation, only in SSA form).

Why? Does everybody agree that this is the right direction for LLVM IR?

Am I the only one still under the impression that LLVM IR is intended to be a middle-end kind of IR, a bit like the Simple IR, or GCC's adaptation of it, Gimple, where scalar, memory, and loop optimizations can be implemented independently of programming-language idiosyncrasies, shielded behind lowering passes? Today (and from day one) LLVM IR does not (and did not) preserve enough info from the front-ends to disambiguate at compile time properties that GCC can just read off Gimple, like array dimensions and array access functions, or to infer properties based on type info, such as estimates of loop trip counts. What are the alternatives for enabling the LLVM optimizations that currently lack type info? Are side channels (assumes, annotations, attributes) the right way to send info from the front-ends to the optimizers working on LLVM IR? Or are we supposed to use a higher-level MLIR to implement the middle-end compiler optimizations that lack type information in LLVM IR?

Here's a practical problem: in LLVM's delinearization analysis pass, we try to recognize the high-level array type information that was intentionally lost during the linearization of the memory access functions. We need the front-end view of the multiple dimensions of array types to reason about memory dependences faster than by solving the generic problem, which leads to multivariate Diophantine equations (harder to solve than pattern-matching the ZIV, SIV, or MIV (zero/single/multi-induction-variable) dependence tests). On the news of the GEP removal from LLVM IR, I started adapting the delinearization pass not to rely on the GEP info, and for that I can either rely on today's type info in LLVM IR for allocas and global variable declarations, or otherwise attach the type info from the front-ends to a side channel such as an assume. Question: are we going to see the type info disappear from allocas and global variables as well in the future, or should we use side channels to send the array info from the front-ends to the middle-end LLVM IR?
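To make the contrast concrete, here is a small sketch (the array name and dimensions are made up) of the same access A[i][j] into a 100x10 array of i32, first as today's typed GEP and then flattened into a raw byte offset of the kind we would be left with once the GEP no longer carries the array type:

@A = global [100 x [10 x i32]] zeroinitializer

define i32 @typed(i64 %i, i64 %j) {
  ; the 100x10 shape is visible in the GEP source type
  %p = getelementptr inbounds [100 x [10 x i32]], ptr @A, i64 0, i64 %i, i64 %j
  %v = load i32, ptr %p
  ret i32 %v
}

define i32 @flat(i64 %i, i64 %j) {
  ; the same address computed as the byte offset 4*(10*i + j);
  ; the dimensions survive only as multipliers to be recovered
  %row = mul nsw i64 %i, 10
  %idx = add nsw i64 %row, %j
  %off = mul nsw i64 %idx, 4
  %p = getelementptr inbounds i8, ptr @A, i64 %off
  %v = load i32, ptr %p
  ret i32 %v
}

Delinearization has to recover the 100x10 shape from @flat, whereas in @typed it can read it off the GEP, or, today, off the type of the global itself.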

(Edit: adding a bit more context)

I speak three languages, each rooted in personal and cultural heritage. Romanian—limba lui Eminescu—is my father’s tongue and the language of my childhood. French—la langue de Molière—is my mother’s voice, which I embraced in my teens. English—the language of Shakespeare—became my third, learned in my twenties.

I grew up poor and hungry under the food and freedom rationing imposed by the Romanian Communist Party. Family members were sent to forced labor camps for speaking out against the tyrannical regime. I became shy and withdrawn, having been instructed at the age of four never to repeat what I heard at home. Our only hope came from the soft-spoken voice of Radio Europa Libera. In December 1989, my parents were out in the streets, while we children stayed with grandma, horrified by the sound of gunfire echoing outside.

After the Romanian Revolution, my parents sought refuge in France, where I grew up and discovered compilers and loop optimizations. At the University of Strasbourg, I began implementing a polyhedral temporal and spatial locality transform based on work and help from Frederic Wagner, Benoit Meister @bmeister, Vincent Loechner, Philippe Clauss, and Catherine Mongenet. In July 2001, we decided to target the polyhedral optimizations to GCC. In 2001, RTL was the only IR available, and I started building around the few loop passes that existed—mostly just an unroller.

Later that year, I came across Laurie Hendren's work on the Simple representation and adapted it to GCC. I sent the patch to Diego Novillo, who maintained the "ast-optimizer" branch. That effort grew under Red Hat's leadership: Diego Novillo (SSA), Richard Henderson (C/C++), and Jason Merrill (C++). In 2002, I designed an analysis phase to extract array subscripts from Gimple's 3-address SSA form, and I coined the term "Scalar Evolution" and the abbreviation SCEV. (To explain the name: under the law of least effort, natural languages decay over time; P becomes B becomes V becomes F becomes nothing. For example, vulgar Latin "caballus" became, over thousands of years of language evolution, "cheval" in French and "cal" in Romanian. By design, the abbreviation SCEV is an evolution of SEB.)

In 2003, Chris Lattner @clattner asked for a write-up on SCEV in GCC. I sent him a draft, and within a week, I reviewed the first SCEV bits in LLVM. Between 2003 and 2004, I wrote GCC’s dependence analysis (DA), later integrated with IBM’s loop vectorizer in GCC. As an IBM intern in 2004, with Daniel Berlin, I implemented the DA-MIV Banerjee test and loop interchange for the SPEC swim benchmark. In 2005, I adapted the DA-Omega test; in 2006–07, I added loop distribution and parallelization to GCC.

In 2008, I started Graphite for polyhedral loop transforms in GCC. Tobias Grosser helped shape SCoP detection during his Google Summer of Code and AMD internship. In 2010, Tobias launched LLVM’s Polly, which I joined in 2011. In 2012, Preston Briggs integrated DA in LLVM, and convinced me that delinearization was key, pointing to Maslov’s earlier work. In 2013, I authored LLVM’s delinearization pass, used today by both Polly and Preston’s DA.

In 2025, I enabled -floop-interchange by default at -O2 in Flang, after consulting with all the Flang maintainers. The patch was in the main branch of LLVM for three months until it was reverted before the LLVM 21 release.

8 Likes

Can somebody explain for what good the LLVM IR gets lower and lower every day (opaque pointers, gep removal, de-type-ification) to the point that type information from the front-ends is intentionally lost in the translation?

I would argue that none of these changes make LLVM IR lower level than before: the semantics of LLVM have always been that memory is untyped and that type information on pointers is completely meaningless. The new representation is simply more honest about that fact by attaching types only where they are meaningful: when loading and storing values.

You can also think of it as the expressibility being completely unchanged: what is now expressed as loads of different types could previously be achieved through bitcasts. "Lying" about the type of an alloca or global was also always allowed and had the same execution semantics attached. The type of these is used purely to determine the size of the allocation, meaning there is no difference in semantics between an alloca i32 and an alloca [4 x i8].
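For example, a minimal sketch (not taken from any particular frontend) where both forms are stored to and loaded from in exactly the same way:

define i32 @four_bytes() {
  ; both allocations reserve 4 bytes; the element type adds no semantics
  %a = alloca i32
  %b = alloca [4 x i8]
  store i32 42, ptr %a
  ; storing and loading an i32 through the [4 x i8] alloca is equally well defined
  store i32 42, ptr %b
  %v = load i32, ptr %b
  ret i32 %v
}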

The question is therefore: why represent these differently when they do the exact same thing? This not only makes things conceptually more complicated, but can hinder optimizations, as it is too easy to match against only one representation and not the other. Removing these differing representations therefore only aids canonicalization.

I started adapting the delinearization pass not to rely on the GEP info, and for that I can either rely on today's type info in LLVM IR for allocas and global variable declarations, or otherwise attach the type info from the front-ends to a side channel such as an assume.

Relying on the alloca or global type is meaningless, as it doesn't give you any guarantees besides the number of bytes allocated. If your frontend language gives stronger guarantees than LLVM IR regarding memory accesses, then these need to be threaded through differently somehow.
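For example (just a sketch, not a concrete proposal; the names are made up), a frontend that knows %j is a valid index into a row of 10 elements could thread that guarantee through with an explicit assume:

declare void @llvm.assume(i1)

define void @store_elem(ptr %A, i64 %i, i64 %j) {
  ; the frontend knows 0 <= %j < 10 for a 10-column array; state it directly
  %j.in.bounds = icmp ult i64 %j, 10
  call void @llvm.assume(i1 %j.in.bounds)
  %row = mul nsw i64 %i, 10
  %idx = add nsw i64 %row, %j
  %p = getelementptr inbounds i32, ptr %A, i64 %idx
  store i32 0, ptr %p
  ret void
}

Whether an analysis like DependenceAnalysis should consume such assumes is of course part of the question being asked in this thread.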

I would be very curious to learn what information you need from the frontend and what guarantees your input language gives.

3 Likes

Maybe for a little bit of context, in case this helps: we are trying to enable loop optimisations in LLVM, and this has proven quite challenging. The title is a bit of a teaser, but we will be talking about a couple of challenges at the next LLVM dev conference in our talk Loop optimisations in LLVM: mission impossible?, and this de-type-ification is one of them. So I guess what I am trying to say is that it is good to have a discussion about this: please find us at the conference, and a discussion here is good too. In addition, maybe the loop optimisation working group meeting is a good forum to discuss this on a regular basis. I unfortunately couldn't attend the last one, but I noticed that it was cancelled, and the one before that too, so maybe we need to revive this a bit and add this topic to the agenda. Or is there a better meeting for this?

Maybe the discussion about what a low-level IR is, and whether the simplifications make it lower than before, isn't really addressing the issue (I see good arguments for both sides). In my simple view the issue is this: there is some type info in the current IR; the claim is that it has no semantics, so the proposal is to remove it. But it turns out this type info is useful for some analyses (DependenceAnalysis). So, if the proposal is to simplify the IR to a point where we can't do DependenceAnalysis anymore, we need to come up with a proposal to restore this information somehow. I therefore think it is a valid question to ask whether we are really achieving what we want if we push the simplifications further; that's the tension going on here, I think. And I understand that there is some nuance around type-based analysis, and that not everything can be reliably done on LLVM IR, but DependenceAnalysis should hopefully be feasible, and I think it is, or could be. That's just my 2 cents.

1 Like

Looking forward to your talk! This work does seem incredibly challenging; I do not envy anyone performing these optimizations in LLVM.

there is some type info in the current IR; the claim is that it has no semantics, so the proposal is to remove it. But it turns out this type info is useful for some analyses (DependenceAnalysis). So, if the proposal is to simplify the IR to a point where we can't do DependenceAnalysis anymore, we need to come up with a proposal to restore this information somehow.

This, to me, is the main confusing part. If the type info were useful in an analysis, this would imply that it has different semantics than the type-less representation, and that you'd be depending on some sort of guarantee given by the type. But since no such guarantees exist, I would think that either the analysis/transformation is wrong for certain kinds of LLVM IR, or it relies heavily on the IR being in a certain shape coming from a specific frontend, with no optimizations affecting that structure.

So, if the proposal is to simplify the IR to a point where we can't do DependenceAnalysis anymore, we need to come up with a proposal to restore this information somehow.

This makes sense to me, and I'd love to know what your requirements are.

1 Like

There was a discussion a while back about introducing a new gep-like instruction, which would have enough information to recover the original array indexing. I wonder what happened to that idea.

LLVM has a number of important loop optimizations that as far as I know haven’t been harmed at all by these changes, so that talk title does seem a little inflammatory.

If the optimizations you're pursuing are trying to use type information to prove non-interference of array accesses in adjacent/nested loops, I'd be very interested in understanding the ways that LLVM's aliasing metadata (for either TBAA or restrict) falls short of your needs. And those sorts of optimizations do need metadata along those lines, because they are not valid in non-strict-aliasing C.
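For reference, and only as a sketch of what is available today, this is roughly the shape of the !tbaa metadata clang attaches under strict aliasing (the node names are illustrative):

define void @f(ptr %p, ptr %q) {
  ; with these tags the optimizer may assume the int store and the float store do not alias
  store i32 1, ptr %p, align 4, !tbaa !1
  store float 2.0, ptr %q, align 4, !tbaa !4
  ret void
}

!0 = !{!"Simple C/C++ TBAA"}
!1 = !{!2, !2, i64 0}
!2 = !{!"int", !3, i64 0}
!3 = !{!"omnipotent char", !0, i64 0}
!4 = !{!5, !5, i64 0}
!5 = !{!"float", !3, i64 0}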

2 Likes

The key here is dependence analysis, not alias analysis: e.g., the iteration distance between A[i][j] and A[k][m] in the same loop nest, and whether a dependence exists in the first place.

That’s interesting, because I’m surprised that GEP type data is especially helpful there. It’s not like striding over nested arrays of different sizes means that those arrays don’t overlap. It seems to me that in either representation you’re reasoning about multiplications; what am I missing?

This overly generalizes (with no foundations) a new trend in LLVM IR.

In the early days of LLVM, @clattner had LLVM IR typed enough for pointer analysis (also one of the subjects of his thesis), and he designed LLVM IR with optimizations in mind. Please read through Macroscopic Data Structure Analysis and Optimization, in particular pages 16-17 (page 29 in the PDF version):

2.2.2 Language-Independent Type Information, Cast, and GetElementPtr

Our results show that despite allowing values to be arbitrarily cast to other types, reliable type information is available for a large fraction of memory accesses in C programs compiled to LLVM.

A critical difficulty in preserving type information for low-level code is implementing address arithmetic. The getelementptr instruction is used by the LLVM system to perform pointer arithmetic in a way that both preserves type information and has machine-independent semantics.

It is only recently, under the pressure of the "compilation time" drums, that LLVM IR has relaxed its type info to the point that it becomes hard to implement the loop optimizations and pointer analyses that the original LLVM IR allowed.

Chris wrote quite a few things in his early papers that did not prove out in a production compiler. The widespread understanding that the LLVM type system could not be used semantically the way you’re describing is much, much older than the recent push to simplify the IR. I remember having conversations about it in 2009.

If you have an analysis that is looking at GEP types to prove that two memory accesses do not conflict, you are relying on type-based aliasing rules. LLVM must be able to support compiling languages that make different or no type-based aliasing assumptions, including C under -fno-strict-aliasing. I suppose we could have said that those modes had to use i8 GEPs, but that is not what we did.

5 Likes

Speaking as a front-end author, the mapping of this information into LLVM types prior to opaque pointers, gep removal, de-type-ification, etc. was often a horrible mess in our front end. Lots of backend passes would assume the information was somehow semantically meaningful, so we had to do a lot of work to pretend it was consistent, but in the front end none of it could be expressed correctly everywhere. It was a huge improvement in correctness and optimization reliability (particularly of loops) when all of the hard work was done to get rid of that baggage. As a frontend author, I do hope that the same can eventually happen for other operations too, such as alloca (e.g. so that iN becomes just an array of N bits), so that optimizations are handled more uniformly for those as well, but it doesn't seem urgent for anything.

1 Like

Suppose that we always allocate and declare globals as a number of bytes, and then suppose we have a way to attach the size and type of the elements as a side channel. That is, instead of the current global declaration @array_100x10 = global [100 x [10 x i32]], we would have @array_4000 = global [4000 x i8], and on a side channel the front-end would also transmit the array type info: assume(true) array_info(array_4000, 2 /*dimensions*/, 100, 10). How is the second representation better than the current way of declaring the global/alloca and attaching the info to it directly? To me both forms are the same. And if they are the same, why change the LLVM IR from where it is today?

DependenceAnalysis uses the type information to guide a heuristic that "decomposes" the offset expression. For example, given two expressions 35*i + 2*j and 35*i + 2*j + 1, it is easier for DependenceAnalysis to recognize them as accesses to a 2D array, A[i][2*j] and A[i][2*j + 1] with dimensions [][35], rather than to a 1D array. In the former representation it is trivial to see that these accesses never overlap, since 2*j is always even and 2*j + 1 is always odd. Hence DependenceAnalysis prefers the address calculation to be expressed in a form like getelementptr [35 x i8], ptr %A, i64 %i, i64 %j_x_2 rather than a flattened one, and to justify the correctness of such a decomposition we need to ensure that %j_x_2 is less than 35 (and greater than or equal to 0). That's why GEP type information is useful in some cases.
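In IR terms (a small sketch; the value names are made up), the two forms look like this:

define void @accesses(ptr %A, i64 %i, i64 %j) {
  ; typed form: A[i][2*j] and A[i][2*j + 1], with an inner dimension of 35
  %j2 = mul nsw i64 %j, 2
  %j2p1 = add nsw i64 %j2, 1
  %p0 = getelementptr inbounds [35 x i8], ptr %A, i64 %i, i64 %j2
  %p1 = getelementptr inbounds [35 x i8], ptr %A, i64 %i, i64 %j2p1
  store i8 0, ptr %p0
  store i8 1, ptr %p1
  ; flattened form of the same addresses: 35*i + 2*j and 35*i + 2*j + 1
  %row = mul nsw i64 %i, 35
  %o0 = add nsw i64 %row, %j2
  %o1 = add nsw i64 %row, %j2p1
  %q0 = getelementptr inbounds i8, ptr %A, i64 %o0
  %q1 = getelementptr inbounds i8, ptr %A, i64 %o1
  store i8 2, ptr %q0
  store i8 3, ptr %q1
  ret void
}

As noted above, treating these as accesses A[%i][...] is only justified if the last subscript is known to stay within [0, 35).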

That said, I think such heuristics are likely to cause a lot of miscompilations, so personally I believe it would be better not to do it.

1 Like

To clarify, the type would only be used to decide how to break up the expression, but you would prove separately that 0 <= %j_x_2 < 35?

The type is used only to prove that 0 <= %j_x_2 < 35. The decomposition itself merely retrieves the list of GEP offsets.

I can't speak for the entire history of LLVM, as I only started working on it around LLVM 3.1 or so.

As I understand it, LLVM was originally intended to be a pretty direct implementation of the C abstract machine, with the various concepts in C mapping pretty much 1-1 with LLVM features. It was never complete (bitfields and unions never had first-class support in LLVM so far as I am aware), but it was an early design goal to enable the kind of object-sensitive pointer analyses that are the tour de force in languages like Java.

That vision did not hold up to reality. Whatever the C and C++ standards may say with regard to strict aliasing, people in practice expect to be able to play fast and loose with the type rules. (This is to say nothing of the kind of record layout you see from something like Rust enums.) Worse still, the ABIs you have to support sometimes mean that even for well-behaved C code, you have to create bag-of-bytes representations just to ensure that you have the right ABI. And this means all of your analyses and all of your optimizations have to support effectively type-erased information in the first place.

Around 2012 or 2013, I was working in a research group that was relying on a pointer analysis built on the structural type information of LLVM, bailing out if anything moderately spicy was happening with type punning. And we sat down one day and measured how much benefit this fancy analysis was giving us: absolutely nothing. In the cases where it worked, more basic alias analyses were also working, but Clang was just chucking too much of the type information away before it reached the optimization passes, even before SCEV or somebody else gave up trying to figure out what type to reconstruct. (And @rjmccall places his recollection of discovering the futility of this information even several years before that point.)

The biggest flaw of the loss of type information has been for arrays. LLVM has always had a bias towards C in its structure, and consequently its representation of multidimensional arrays (a feature not really found in C/C++) has been less than ideal. In a purely typeless representation of pointer information, you could probably recover parts of the array index access via a fairly simple analysis, but given the history of oh-just-pattern-match-this-sequence approaches to representing fundamental operations breaking down because optimizations pull those patterns apart (e.g., rotates), it is probably worth giving array access its own special intrinsic, since it's so common and important.

4 Likes

It sounds like we could really just use a utility function that analyzes a ptradd / gep i8 offset value like i*M*N*S + j*N*S + k*S and tries to decompose it to a canonical form (((i * M) + j) * N + k) * S. If that succeeds, the dependency analysis should just have to do a range analysis on N and k.

Yes, that’s called delinearization: llvm-project/llvm/include/llvm/Analysis/Delinearization.h at main · llvm/llvm-project · GitHub

2 Likes

The type is used only to prove that 0 <= %j_x_2 < 35. The decomposition itself merely retrieves the list of GEP offsets.

But does the type provide that guarantee? Couldn’t %j_x_2 be 35, in which case it would just access the next row?

To make things a bit more concrete, IIUC the type would be used to assume that %c is true below?

https://alive2.llvm.org/ce/z/zvy7N2

define i1 @src(i16 %i, i16 %j) {
  %j.pos = icmp sge i16 %j, 0
  call void @llvm.assume(i1 %j.pos)
  %a = alloca [35 x [ 35 x i8 ]]
  %gep = getelementptr inbounds [35 x [ 35 x i8 ]], ptr %a, i64 0, i16 %i, i16 %j
  store i8 0, ptr %gep
  %c = icmp ult i16 %j, 35
  ret i1 %c
}

define i1 @tgt(i16 %i, i16 %j) {
  ret i1 true
}

My understanding is the same: the type shouldn't guarantee 0 <= %j_x_2 < 35. What I meant is that there is code that explicitly checks the condition. The type information (in this case, 35) is taken from the GEP to verify that it is safe to treat %gep as an access to a multidimensional array such as %a[%i][%j_x_2].