Can somebody explain what we gain from LLVM IR getting lower and lower every day (opaque pointers, GEP removal, de-type-ification), to the point that type information from the front ends is intentionally lost in translation? (Oh wait, is it for faster compilation times? Yeah, I know, it’s that hard trade-off between how fast we can compile and how good the code produced by the compiler is…)
@nikic is describing a future for LLVM IR that looks to me more and more like a compiler back-end IR (similar to GCC’s RTL representation, only in SSA form).
Why? Does everybody agree that this is the right direction for LLVM IR?
—
Am I the only one still under the impression that LLVM IR is intended to be a middle-end kind of IR, a bit like the Simple IR, or GCC’s adaptation of it, the Gimple IR, where scalar, memory, and loop optimizations can be implemented independently of programming-language idiosyncrasies, shielded behind lowering passes? Today (and from day one) LLVM IR does not (and did not) preserve enough information from the front ends to disambiguate at compile time properties that GCC can just read from Gimple, like array dimensions and array access functions, or to infer properties from type info, such as estimates of loop trip counts. What are the alternatives for enabling the LLVM optimizations that currently lack type info? Are side channels (assumes, annotations, attributes) the right way to send info from the front ends to the optimizers working on LLVM IR? Or are we supposed to use a higher-level MLIR to implement middle-end compiler optimizations that lack type information in LLVM IR?
Here’s a practical problem: in LLVM’s delinearization analysis pass, we try to recognize high-level array type information that was intentionally lost when the memory access functions were linearized. We need the front-end view of the multi-dimensional array types to reason about memory dependences faster than solving the generic problem, which leads to multivariate Diophantine equations (harder to solve than just pattern-matching the ZIV, SIV, or MIV dependence tests, i.e., the zero-, single-, or multiple-induction-variable tests).

On the news of GEP removal from LLVM IR, I started adapting the delinearization pass not to rely on GEP info. For that, I can either rely on today’s type info in LLVM IR for allocas and global variable declarations, or else attach the front ends’ type info through a side channel such as an assume. Question: will the type info disappear from allocas and global variables as well in the future, or should we use side channels to send the array info from the front ends to the middle-end LLVM IR?
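To make the delinearization problem concrete, here is a minimal Python sketch (the function name `delinearize` and its interface are my invention for illustration, not LLVM’s actual API): assuming a row-major layout and statically known inner dimension sizes, the subscripts of a flat offset can be recovered by repeated division and remainder, which is essentially what the pass pattern-matches on SCEV expressions when it can guess the array shape from the strides.

```python
def delinearize(offset, inner_dims):
    """Recover multi-dimensional subscripts from a flat row-major offset.

    inner_dims: the sizes of every dimension except the outermost; e.g.,
    for a conceptual A[m][n], pass inner_dims=[n]. This is a sketch of the
    recovery problem, not LLVM's delinearization implementation.
    """
    subscripts = []
    # Peel off subscripts from the innermost dimension outward.
    for size in reversed(inner_dims):
        subscripts.append(offset % size)
        offset //= size
    # Whatever remains is the outermost subscript.
    subscripts.append(offset)
    return list(reversed(subscripts))

# A[i][j] with shape (5, 7): the linearized offset is i*7 + j.
print(delinearize(3 * 7 + 4, [7]))          # recovers [3, 4]
# A[i][j][k] with shape (2, 3, 4): offset is i*12 + j*4 + k.
print(delinearize(1 * 12 + 2 * 4 + 3, [3, 4]))  # recovers [1, 2, 3]
```

The hard part in the real pass is, of course, the inverse direction of the assumption above: the dimension sizes are exactly the information that linearization discards, so they must be guessed from strides or read from whatever type info survives in the IR.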
—
(Edit: adding a bit more context)
I speak three languages, each rooted in personal and cultural heritage. Romanian—limba lui Eminescu—is my father’s tongue and the language of my childhood. French—la langue de Molière—is my mother’s voice, which I embraced in my teens. English—the language of Shakespeare—became my third, learned in my twenties.
I grew up poor and hungry under the food and freedom rationing imposed by the Romanian Communist Party. Family members were sent to forced labor camps for speaking out against the tyrannical regime. I became shy and withdrawn, having been instructed at the age of four never to repeat what I heard at home. Our only hope came from the soft-spoken voice of Radio Europa Libera. In December 1989, my parents were out in the streets, while we children stayed with grandma, horrified by the sound of gunfire echoing outside.
After the Romanian Revolution, my parents sought refuge in France, where I grew up and discovered compilers and loop optimizations. At the University of Strasbourg, I began implementing a polyhedral temporal and spatial locality transform, building on work by, and with help from, Frederic Wagner, Benoit Meister @bmeister, Vincent Loechner, Philippe Clauss, and Catherine Mongenet. In July 2001, we decided to target the polyhedral optimizations at GCC. At the time, RTL was the only IR available, and I started building around the few loop passes that existed (mostly just an unroller).
Later that year, I came across Laurie Hendren’s work on the Simple representation and adapted it to GCC. I sent the patch to Diego Novillo, who maintained the “ast-optimizer” branch. That effort grew under Red Hat’s leadership: Diego Novillo (SSA), Richard Henderson (C/C++), and Jason Merrill (C++). In 2002, I designed an analysis phase to extract array subscripts from Gimple’s 3-address SSA form, and I coined the term “Scalar Evolution” and its abbreviation SCEV. (To make the abbreviation clear: under the law of least effort, natural languages decay over time; P becomes B becomes V becomes F becomes nothing. For example, vulgar Latin “caballus” becomes, over thousands of years of language evolution, “cheval” in French and “cal” in Romanian. By design, the abbreviation SCEV is an evolution from SEB.)
In 2003, Chris Lattner @clattner asked for a write-up on SCEV in GCC. I sent him a draft, and within a week, I reviewed the first SCEV bits in LLVM. Between 2003 and 2004, I wrote GCC’s dependence analysis (DA), later integrated with IBM’s loop vectorizer in GCC. As an IBM intern in 2004, with Daniel Berlin, I implemented the DA-MIV Banerjee test and loop interchange for the SPEC swim benchmark. In 2005, I adapted the DA-Omega test; in 2006–07, I added loop distribution and parallelization to GCC.
In 2008, I started Graphite for polyhedral loop transforms in GCC. Tobias Grosser helped shape SCoP detection during his Google Summer of Code and AMD internship. In 2010, Tobias launched LLVM’s Polly, which I joined in 2011. In 2012, Preston Briggs integrated DA into LLVM and convinced me that delinearization was key, pointing to Maslov’s earlier work. In 2013, I authored LLVM’s delinearization pass, used today by both Polly and Preston’s DA.
In 2025, I enabled -floop-interchange by default at -O2 in Flang, after consulting with all the Flang maintainers. The patch stayed in LLVM’s main branch for three months until it was reverted shortly before the LLVM 21 release.