I’m preparing a new compiler course which will start in April. The course uses LLVM as the reference compiler and RISC-V as the backend. The goal of the course is to bring up new compiler developers for LLVM and other toolchains. I propose to introduce ten or eleven algorithms that are commonly used in modern compiler systems, and walk through the source code of these algorithms in the LLVM codebase. Although I have several textbooks in hand, I am not sure which algorithms a new compiler developer is expected to be familiar with. Currently TableGen, GlobalISel, GVN, DCE, and inlining are in the plan, and there are still a few algorithms to be filled in.
I’d appreciate it if you could provide some suggestions.
GVN is super interesting, yes. Well worth the time.
I would add an SSA-construction algorithm (SROA in LLVM, but you can go with something simpler). Understanding SSA construction requires other concepts, such as dominators, and is an eye-opener for some of the tradeoffs made in IR design (e.g., caching reaching-definitions information).
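Since dominators underpin SSA construction, a quick sketch may help: the classic iterative dataflow formulation computes Dom(n) = {n} ∪ ⋂ Dom(p) over predecessors p until a fixpoint. This is a hand-rolled illustration on a made-up CFG, not LLVM's actual (Semi-NCA based) implementation:

```python
# Iterative dominator-set computation over a toy CFG
# (dict mapping node -> list of successor nodes).
def dominators(cfg, entry):
    nodes = set(cfg)
    preds = {n: set() for n in nodes}
    for n, succs in cfg.items():
        for s in succs:
            preds[s].add(n)
    dom = {n: set(nodes) for n in nodes}   # start from "everything dominates"
    dom[entry] = {entry}
    changed = True
    while changed:
        changed = False
        for n in nodes - {entry}:
            meet = (set.intersection(*(dom[p] for p in preds[n]))
                    if preds[n] else set())
            new = {n} | meet
            if new != dom[n]:
                dom[n] = new
                changed = True
    return dom

# A diamond CFG: entry -> {a, b} -> exit.
cfg = {"entry": ["a", "b"], "a": ["exit"], "b": ["exit"], "exit": []}
doms = dominators(cfg, "entry")
# "entry" dominates everything; neither "a" nor "b" dominates "exit".
```

Production compilers use faster algorithms, but this fixpoint version makes the definition concrete before students see the optimized data structure.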
I would also try a simple static analysis, like range analysis, so students understand the concepts of abstract interpretation, abstraction, widening, etc. You can introduce SSI here if you have time.
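The widening idea can be demonstrated in a few lines. Below is a toy interval domain analyzing the loop `i = 0; while i < 10: i = i + 1` — a hand-rolled illustration of abstract interpretation with widening, not LLVM's actual range analysis:

```python
import math

def join(a, b):            # least upper bound of two intervals
    return (min(a[0], b[0]), max(a[1], b[1]))

def widen(old, new):       # widening: jump unstable bounds to infinity
    lo = old[0] if new[0] >= old[0] else -math.inf
    hi = old[1] if new[1] <= old[1] else math.inf
    return (lo, hi)

def analyze():
    i = (0, 0)                       # interval for i at loop entry
    while True:
        body = (i[0] + 1, i[1] + 1)  # abstract effect of i = i + 1
        new = join((0, 0), body)     # merge entry edge and back edge
        if new == i:
            return i
        i = widen(i, new)            # guarantees termination

print(analyze())  # converges to (0, inf) after one widening step
```

Without `widen`, the fixpoint loop would take one iteration per loop trip count; a follow-up narrowing step using the `i < 10` condition would recover the bound (0, 10), which is a nice class exercise.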
Alias analysis is important, but I wouldn’t show LLVM’s; I think GCC’s is probably a better example. Combine this with some simple optimizations it enables, like store forwarding.
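Store forwarding over straight-line code is small enough to sketch. The tuple-based "IR" below is invented for illustration, and the alias check is the trivial "same address expression" one — the commented invalidation step is exactly where a real alias analysis would plug in:

```python
# Minimal store-to-load forwarding over a straight-line block.
def forward_stores(block):
    known = {}        # address -> last value stored there
    out = []
    for op in block:
        if op[0] == "store":           # ("store", addr, value)
            _, addr, val = op
            known[addr] = val
            out.append(op)
        elif op[0] == "load":          # ("load", dest, addr)
            _, dest, addr = op
            if addr in known:
                out.append(("copy", dest, known[addr]))  # forwarded
            else:
                out.append(op)
        else:
            out.append(op)
            # A real pass would ask alias analysis which entries of
            # `known` this instruction may clobber, and drop them.
    return out

block = [("store", "p", "x"), ("load", "y", "p")]
print(forward_stores(block))  # the load from p becomes a copy of x
```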
Linear-scan register allocation is simple, and I think it’s important to see the last mile, as well as a simple instruction selection algorithm. The BURG family of algorithms is not complicated.
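Linear scan really does fit on a slide. Here is a sketch in the spirit of Poletto & Sarkar's algorithm over precomputed live intervals; the intervals, register count, and spill-last-ending heuristic are illustrative choices, not LLVM's allocator:

```python
def linear_scan(intervals, num_regs):
    """intervals: list of (name, start, end), sorted by start point."""
    free = list(range(num_regs))
    active = []                        # (end, name, reg), kept sorted by end
    assignment = {}
    for name, start, end in intervals:
        # Expire intervals that ended before this one starts.
        still_live = []
        for aend, aname, areg in active:
            if aend < start:
                free.append(areg)      # register becomes available again
            else:
                still_live.append((aend, aname, areg))
        active = still_live
        if free:
            reg = free.pop()
            assignment[name] = reg
            active.append((end, name, reg))
            active.sort()
        else:
            # No free register: spill whichever interval ends last.
            active.sort()
            spill_end, spill_name, spill_reg = active[-1]
            if spill_end > end:
                assignment[name] = spill_reg
                assignment[spill_name] = "spill"
                active[-1] = (end, name, spill_reg)
            else:
                assignment[name] = "spill"
    return assignment

# Three overlapping intervals, two registers: one must spill.
print(linear_scan([("a", 0, 4), ("b", 1, 3), ("c", 2, 6)], 2))
```

Comparing this against a graph-coloring allocator on the same example is a good way to motivate the compile-time vs. code-quality tradeoff discussed below.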
In the end, more than learning many algorithms, I think the point is that they should realize how large programs are and how quick the compiler must be to handle those. So it’s all about doing the right tradeoffs of compilation speed vs benefit in performance/code-size/etc. Spend some time looking at the output of the compiler, both IR & assembly, and discuss in class what could be improved. It’s fascinating.
If your students learn all of this, please send them in for internships!
Teaching dominators (and relatedly, the terminology related to natural loops) is critical for compiler developers, but I don’t think the construction algorithms themselves are particularly worth going through, except maybe as part of the compile-time tradeoff Nuno mentions.
So far, I haven’t seen any mention of loop algorithms, but I’m not sure which are best to include here.
As concepts, covering scalar evolution, dependence analysis, and MemorySSA could be useful, giving students a sense of the tools they have in their toolbox. Further, covering what InstCombine does at a high level (rather than walking through all of its bajillion patterns) can be important.
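To convey the flavor of InstCombine without drowning in its patterns, a toy peephole over a tuple-based expression tree works well. This is an invented mini-IR, not real LLVM code; it folds two classic identities (x + 0 → x and x * 2 → x << 1):

```python
# A toy InstCombine-style peephole: simplify bottom-up, then match
# a couple of patterns at the current node.
def simplify(expr):
    if not isinstance(expr, tuple):
        return expr                      # leaf: variable or constant
    op, lhs, rhs = expr
    lhs, rhs = simplify(lhs), simplify(rhs)
    if op == "add" and rhs == 0:
        return lhs                       # x + 0 -> x
    if op == "mul" and rhs == 2:
        return ("shl", lhs, 1)           # x * 2 -> x << 1
    return (op, lhs, rhs)

print(simplify(("mul", ("add", "x", 0), 2)))  # ('shl', 'x', 1)
```

Even this tiny version raises the real design questions: canonical form (which direction do you rewrite?), pattern ordering, and when a "simplification" is actually profitable on the target.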
Echoing Nuno, I worry that a focus on the algorithms may be too narrow for the goal of bringing up new developers. You’ll especially want to cover the compile-time versus runtime-performance tradeoff, and how passes are designed to keep compile time under control (e.g., a very strong preference for maintaining the dominator tree/loop info, because those are so commonly used that recalculating them constantly would measurably increase compile time). Another topic that would be useful is cost models: LLVM IR is not a perfect analogue for machine hardware, and some transformations that seem beneficial in IR land may be harmful in actual machine code. You could try collecting some examples where an optimization helps on one architecture but hurts on another.
Dominator tree and SSA construction will be included. For the loop optimizations, I plan to give a quick overview rather than dive into the source code too much. The coarse-grained trade-offs are fairly obvious (e.g., linear scan vs. graph coloring), while the fine-grained design trade-offs are more subtle, so I plan to demo one or two design heuristics in the course.