Hi all,

Many of us find ourselves spending a great chunk of time comparing LLVM IR dumps at various stages of the compilation pipeline or after a given optimization pass. This process can be extremely laborious, especially when comparing shaders or compute modules. Important semantic differences are often difficult to spot because of the irregular naming and ordering of instructions.

Looking for a solution, we came across Puyan Lotfi's talk on mir-canon at EuroLLVM 2018 (https://www.youtube.com/watch?v=RHT-bh_xo6U). His project inspired us to invest some time into developing a tool called llvm-canon, aiming at a very similar goal - a clean and obvious diff, but for LLVM IR dumps.

Currently the tool is used internally. We have decided to post here to receive feedback from the community, spark a discussion on various canonicalization techniques, and ask for a general code review with the aim of eventually landing the tool in the public trunk.

The aim of this project is to develop a tool which transforms LLVM IR into a canonical form by reducing non-semantic differences. In ideal conditions, such a canonical form leads to a clean diff, based on two expectations:
1) Two semantically identical modules should show no differences when diffed after canonicalization.
2) When modules are not identical the semantic differences should stand out.

The tool processes just one file/module at a time - it is not intended to be a diff tool. The canonicalization techniques are universal for all modules. Additionally, we want the tool to be as modular as possible, allowing for easy changes in the future.

All of the canonicalization techniques focus only on reordering and naming. Having experimented with many options for removing redundant code, we came to the conclusion that this topic is definitely out of scope for this project and should be left to other passes.

Scope of this project:
* Reordering instructions
* Naming values (naming initial instructions, naming regular instructions)
* Folding instruction names
* Naming function arguments
* Naming basic blocks

One of the biggest challenges in the development of this project was finding a point of reference for canonicalization - naming and ordering must be based on something. It would be unacceptable for a trivial change (such as a different opcode, a single operand, or the name of the called function) in a single instruction to affect the canonicalization of other instructions in the same basic block, or even function. Therefore, the point of reference should be fairly "constant". Currently, the canonicalization is based primarily on the order of side-effecting instructions (also referred to as outputs) and how they relate to the very first instructions in each basic block (non-redundant instructions with only immediate operands; we call them initial instructions). We have found this reference to be relatively effective in our use cases, yet it is also the biggest weakness of this canonicalizer: modules with a different layout of side-effecting instructions may require some tinkering by reordering those instructions by hand. We are still looking for ways to automate this process.

Below I will briefly describe each component of the canonicalizer.

1. Reordering instructions
Instruction reordering starts by iterating through the vector of output instructions (side-effecting instructions plus ReturnInst) collected top-down in a given function. On each of these instructions a recursive reordering method is called. This method walks up the def-use tree (starting from the first operand of an output instruction) and reduces the distance between each definition and its user. Each instruction may be moved just once. In case the user is in a different basic block, the definition is placed at the end of the current basic block.

The method works surprisingly well, yet it does not reorder side-effecting instructions; this is one of the subjects for future discussion and improvement of this tool. Another thing to note is that this algorithm only works on a vector of outputs collected top-down. Moving every instruction just once, while making sure that it is as close as possible to its first use, guarantees that the def-use chain remains valid upon completion.
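To make the walk concrete, here is a minimal Python model of the reordering described above. The `Instr` class and the single-basic-block setup are illustrative assumptions, not the tool's actual data structures; the real pass operates on LLVM instruction lists and additionally handles users in other basic blocks.

```python
# Toy model of the recursive reordering walk. `Instr`, its fields, and the
# single-basic-block setup are assumptions for illustration only.

class Instr:
    def __init__(self, name, operands=(), side_effects=False):
        self.name = name
        self.operands = list(operands)     # defs this instruction uses
        self.side_effects = side_effects   # outputs are never moved

def canonical_order(outputs):
    """Rebuild a block by walking up the def-use tree from each output
    (collected top-down). Each instruction is placed at most once, right
    before its first user, so the def-use chain stays valid."""
    new_block, seen = [], set()

    def place(instr):
        if instr in seen or instr.side_effects:
            return
        seen.add(instr)
        for op in instr.operands:          # walk up from the first operand
            place(op)
        new_block.append(instr)

    for out in outputs:                    # relative output order is kept
        for op in out.operands:
            place(op)
        new_block.append(out)
    return new_block
```

For outputs `store(c)` and `ret`, with `c` computed from `a` and `b`, the walk emits `a, b, c, store, ret` regardless of how `a`, `b`, and `c` were originally laid out.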

2. Naming instructions
Naming also starts with a vector of output instructions collected top-down. On each of these instructions a recursive naming method is called. This method identifies the type of an instruction and names it accordingly.

Currently, we distinguish two types of instructions when it comes to naming:
a) Initial instruction (non-redundant instructions with only immediate operands)
b) Regular instruction (all other non-redundant instructions)

In case of initial instructions, the naming follows this general model:
vl <hash> <callee name> ( <operands> )

A sample name is given below:
vl12345Foo(2, 4)

The hash is calculated from the instruction's opcode and its output footprint. The output footprint is a vector of unique indices (distances from the beginning of the basic block) of the output instructions reached from a given initial instruction. In other words, an initial instruction with an output footprint of 5, 7, 9, 11 is used by each of those outputs at least once.
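A toy stand-in for this hash might combine the opcode with the sorted unique footprint indices. The actual hash function in llvm-canon may be entirely different; this only illustrates which inputs feed it:

```python
import hashlib

def initial_hash(opcode, output_footprint, digits=5):
    """Toy hash for an initial instruction: opcode plus the sorted set of
    output indices, truncated to a short decimal number. Illustrative only,
    not the hash llvm-canon actually uses."""
    key = opcode + ":" + ",".join(str(i) for i in sorted(set(output_footprint)))
    return int(hashlib.sha1(key.encode()).hexdigest(), 16) % 10 ** digits
```

Note that the footprint is treated as a set here, so the hash is stable under the order in which the outputs were discovered.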

The callee name is only included when the named instruction is a CallInst.

The very last part of the initial instruction name consists of a list of operands in parentheses. The list is sorted when the instruction is commutative.
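Putting the pieces together, the name assembly could look like the following sketch. The exact formatting is an assumption based on the sample above:

```python
def initial_name(hash_value, operands, commutative=False, callee=""):
    """Assemble an initial-instruction name: 'vl' + hash + optional callee
    name + a parenthesized operand list, sorted when the instruction is
    commutative. A sketch of the scheme, not the tool's exact formatting."""
    ops = sorted(operands) if commutative else list(operands)
    return "vl%d%s(%s)" % (hash_value, callee, ", ".join(str(o) for o in ops))
```

For example, `initial_name(12345, [4, 2], commutative=True, callee="Foo")` yields `vl12345Foo(2, 4)`.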

In case of regular instructions, the naming follows this general model:
op <hash> <callee name> ( <operands call chain> )

With a sample name given below:
op11111Foo (op99999Bar(op55555(...), op22222(...)), op77777(...), vl44444(7))

The hash is calculated considering instruction's opcode and opcodes of all operands.

The callee name is only included when the named instruction is a CallInst.

The last part of the regular instruction name is a recursively generated list of all operands, their operands, and so on, until we reach initial instructions (depth-first). This resembles a call chain more than an operand list. This naming approach makes differences stand out on all affected instructions. That is a desired effect on output instructions, as we can instantly see that the semantics have changed; and when an output instruction is void, this type of naming is still useful on pre-output instructions (direct operands of an output instruction). Beyond those cases, however, this technique pollutes the diff, as a single change resonates down the block. This is why names are "folded" in the next phase of canonicalization.
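The depth-first expansion can be modeled with a small recursive function. The tuple encoding of instructions here is purely an illustration device:

```python
def chain_name(instr):
    """Recursively expand a name depth-first until initial instructions
    are reached. In this toy model an instruction is a (prefix, hash,
    callee, operands) tuple, where initial ('vl') operands are immediates."""
    prefix, h, callee, operands = instr
    if prefix == "vl":                            # initial instruction: leaf
        args = ", ".join(str(o) for o in operands)
    else:                                         # regular: recurse into operands
        args = ", ".join(chain_name(op) for op in operands)
    return "%s%d%s(%s)" % (prefix, h, callee, args)
```

Notice how a change in any leaf propagates into the names of all instructions above it - exactly the behavior that makes output names informative but "pollutes" intermediate ones.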

3. Folding instruction names
As mentioned before, recursively generated long operand lists are quite useful when comparing output instructions, or their proximate operands in case of a void output. In all other cases a simple operand list results in a much more readable diff. This is why, after naming all the instructions, we iterate through them one more time to "fold" (shorten) their names. In this phase the operand call chain is substituted with a plain operand list, and the callee name is removed.

Folding initial instructions:
vl00000Callee(1, 2) -> vl00000(1, 2)

Folding regular instructions:
op00000Callee(op00001Callee(...), vl00000Callee(1, 2), ...) -> op00000(op00001, vl00000, ...)

Therefore naming and folding give us this kind of effect:
Folded (no changes): vl00001(1, 2) = ...
Folded (removed callee name): vl00002(6, 5) = call ...
Folded (kept only hash): op10001(vl00001, vl00002) = ...
Unfolded (output): op10002(op10001(vl00001(1, 2), vl00002Foo(6, 5)), vl00001(1, 2)) = ...
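The folding step can be sketched as a text rewrite over the toy names: keep only the instruction's own prefix-plus-hash, and shorten each top-level operand chain likewise. This sketch assumes the exact textual scheme shown above and is not the tool's real implementation (which works on the IR, not on strings):

```python
import re

def fold_name(name):
    """Fold a toy canonical name: drop the callee name and replace each
    top-level operand chain with its prefix-plus-hash. Immediate operands
    (plain numbers) are kept as-is."""
    head = re.match(r"(?:op|vl)\d+", name).group(0)
    inner = name[name.index("(") + 1:-1]
    operands, depth, start = [], 0, 0
    for i, ch in enumerate(inner):             # split on top-level commas only
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
        elif ch == "," and depth == 0:
            operands.append(inner[start:i].strip())
            start = i + 1
    if inner:
        operands.append(inner[start:].strip())
    short = [re.match(r"(?:op|vl)\d+", op).group(0)
             if op.startswith(("op", "vl")) else op
             for op in operands]
    return "%s(%s)" % (head, ", ".join(short))
```

Folding an initial name leaves the immediates in place, while folding a regular name collapses each nested chain to a single hash token.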

4. Other
We have been looking for various ways to canonicalize the names of function arguments, but because of their marginal impact on the overall diff we have decided to stick to simply numbering them.

We also name basic blocks following the scheme bb <hash>, where the hash value is calculated from the opcodes and order of the output instructions in a given basic block.
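Again as a toy illustration of what feeds the hash (the real function may differ): unlike the sorted output footprint of initial instructions, the order of outputs matters here.

```python
import hashlib

def block_name(output_opcodes, digits=5):
    """Toy basic-block name: 'bb' + a hash over the block's output opcodes
    in order, so the same outputs in a different order yield a different
    name. Illustrative only."""
    key = "|".join(output_opcodes)
    return "bb%d" % (int(hashlib.sha1(key.encode()).hexdigest(), 16) % 10 ** digits)
```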

5. Areas of further research
- As mentioned before, we are constantly looking for different points of reference for canonicalization.
- Approaches allowing for automatic reordering of side-effecting instructions.
- How sorting the operand list of commutative instructions could affect instruction reordering.

The canonicalizer is highly modular, allowing for easy modifications and future expansion.

We hope the community will find this tool as useful as we have.

I would like to thank Radoslaw Drabinski who really helped me a lot by suggesting numerous improvements and testing the tool.

I have uploaded the tool to the Phabricator and would be very grateful for your review!

Thank you,
Michal Paszkowski

Hi, Michal,

This sounds like a very nice utility, and it will be useful to have it as part of LLVM.

I’ve made some suggestions on Phabricator, but I’ll repeat the high-level one here: I’d like to see this as a utility transformation (e.g., under lib/Transforms/Utils/) instead of a separate utility. That way it can be run with opt just like other transformations, and it can be naturally added to the end of pass pipelines.


Just out of curiosity, how does this compare to using llvm-diff in your
workflow? I typically find that does a really good job of eliminating
trivial renames and such.

To be clear, no objection to the tool being checked in. Having multiple
ways of tackling this annoyingly routine problem is fine.


Nice work guys! Thanks for the shoutout.

High level comments:

  1. I think it could be useful to move your Canonicalizer ModulePass from llvm/tools/llvm-canon to somewhere in llvm/lib/Transforms so that it could also be used by llc or clang. It could be really useful to do something like clang -o - -S -emit-llvm -femit-canonical-ir foo.c and get the canonicalized output directly. Keep in mind, anything that is accessible through clang or llc can also be really easily accessed through the webapp godbolt.org :-). I still think the llvm-canon.cpp executable tool is fine, but it would be cool if the ModulePass part were reusable in other parts of LLVM.
  2. clang-format
  3. Add lit tests.

I am all for this being added to llvm. I’ll post additional comments in Phabricator.


Hi Michal,

Thank you for sending such a detailed write up. I have some related thoughts that I wanted to share in case they are useful to you.

I’m working on a project that has a similar need for canonicalizing program representations. In our case, one of the problems we were trying to address was that functionally equivalent programs could be represented differently based on the ordering of various internal data structures, like the order of the uses list. While the order is deterministic for a given input source and a specific version of our compiler, there are trivial changes to the inputs, and simple changes in our transformation algorithms, that could cause the same functional program to have a different ordering, which would then impact our debug print layout.

In our case, we start with the assumption that the relative ordering of some inputs and outputs can’t change without changing the meaning of the program (like function parameters). We also assume that the topological ordering of our canonicalized program must be a valid topological ordering of the source program: if instruction B depends on the output of instruction A, B must come after A. Given these constraints, the problem we tried to solve is how to generate a consistent ordering for all variations of a program that are functionally equivalent. This is basically a graph isomorphism problem, but over specially constrained graphs.

The solution we came up with to address this problem takes advantage of a property of SSA-style program representations that might be applicable to your tool. By definition, SSA IR values are assigned once but may be used many times. When doing a top-down traversal, you need to decide which use to traverse to from a given value. That decision might be dictated by the order of the uses list, or it could be heuristic-driven, but in either case we found it to be fraught with problems. In our solution we traverse bottom-up: we start at the immutable outputs, which have a predetermined and unalterable ordering, and then traverse up from each use to its value. Since each value is assigned only once, we don’t have any tiebreaker input-ordering issues.
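A minimal sketch of this bottom-up walk (the dict-based value representation is just an illustration device, not our actual IR):

```python
# Start from the immutable outputs (whose relative order is fixed) and walk
# each use -> def edge. Because an SSA value has exactly one definition,
# the walk needs no tiebreaking over uses lists.

def bottom_up_order(outputs):
    order, seen = [], set()

    def visit(value):
        if id(value) in seen:
            return
        seen.add(id(value))
        for operand in value.get("operands", ()):   # use -> def edge
            visit(operand)
        order.append(value["name"])

    for out in outputs:                             # predetermined order
        visit(out)
    return order
```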

I’ve thought about, but haven’t had time, to apply this technique to LLVM-IR, but in our program representation it works really well.

Hope this is useful,

Hi Chris,

Thank you for your thoughts! Your approach is very similar to the one we have implemented. The canonicalization is also based on the ‘output’ instructions and we also walk up the def-use tree. Those immutable outputs are also the only instructions we don’t move.

Thank you,

Michal Paszkowski

We needed a tool that would help us spot any differences which have an impact on a particular side-effecting instruction. This is why we were looking for a way to reorder and name instructions canonically. I find it especially useful that minor changes somewhere at the beginning of the def-use tree impact the canonical names of the output instructions using the changed instruction, while not 'polluting' the rest of the diff.

Thank you,
Michal Paszkowski