Speeding up compilation of MLIR-based projects

(note: this is about speeding up compilation of the compiler itself, not the compilation speed of user code that is compiled by the compiler – e.g. we are talking about time spent parsing MyDialect.ops.h.inc)

I recently did a refactoring in Torch-MLIR that split a large ~4500-line file (which was taking ~20 seconds to compile) into ~10 files (torch-mlir pull request). Unfortunately, this only sped up the wall clock time to 13 seconds (and increased the CPU time to about 90 seconds).

I did a bit of profiling (with the amazing clang -ftime-trace option), and it seems like almost all of the time is spent in the dialect headers parsing the ODS-generated ops, or instantiating templates that are not “my code”. In Linear.cpp (a typical file with some lowering patterns), I can only find about 200ms out of ~12s (or <2% of the time) spent compiling “my code”, and TorchToLinalg.cpp, the pass itself (which just calls a few external functions to populate patterns), is similar. (To be fair, the number is a bit bigger if you include things like “instantiating SmallVector<Value>”, but it doesn’t appear to change the fundamental picture.)

After a cursory investigation, a few observations:

  • It seems like ~70% of the time spent on the ODS-generated code goes to instantiating the template metaprogramming of the mlir::Op<MyOp, ...long list of traits...> constructs. The rest is spent on the op body itself.
  • OpDefinition.h by itself takes ~500ms (transitively) to compile.
  • BuiltinTypes.h by itself takes ~750ms (transitively) to compile.

Here are two traces which I was able to drag and drop into chrome://tracing in my browser:


Is the time spent just linear in the number of ops in the dialect headers you’ve included? Splitting the culprit dialect (the one with too many ops) would allow you to have smaller includes (only the ops you need) in each of your files.

I think it is linear in the number of ops. In this case though, splitting the dialect (torch) isn’t very desirable as it is basically one huge thing that maps to the torch op registry – there’s not enough intrinsic internal structure there to split along to really move the needle on this. The other dialect that seems to take up a lot of compilation time is linalg, but we need the named ops from linalg here, and I’m not sure that we want to attempt to make a factorized dialect structure from what is fundamentally a flat list of named ops.

Thanks Sean for the analysis. In my mind, I had assumed that the compilation of the corresponding cc files was the main bottleneck – and the traditional wisdom of splitting the impl in some way is an approach to that. But since this is mostly about the templates in the headers, I think we just need to do better.

Scaling to “random access” over a few thousand named ops is not uncommon for this space.

Since this is all generated code, it seems like we should be able to trade more explicitness in the code generator for less metaprogramming, but I don’t have concrete suggestions for how to do so.


Is this because these ops have a lot of traits? E.g., is this due to C++ template instantiation costs, and these ops just happen to have a lot of traits? Are these traits injected automatically or added by the user?

Almost all the traits are added by the ODS generator automatically. A typical example would be:

mlir::Op<mlir::torch::Torch::Aten__Is__Op,
         ::mlir::OpTrait::ZeroRegion,
         ::mlir::OpTrait::OneResult,
         ::mlir::OpTrait::OneTypedResult< ::mlir::torch::Torch::BoolType>::Impl,
         ::mlir::OpTrait::ZeroSuccessor,
         ::mlir::OpTrait::NOperands<2>::Impl,
         ::mlir::torch::Torch::OpTrait::AllowsTypeRefinement,
         ::mlir::torch::Torch::OpTrait::HasValueSemantics,
         ::mlir::torch::Torch::OpTrait::ReadOnly,
         ::mlir::InferTypeOpInterface::Trait>

Only the 3 in the torch namespace are added manually.

Could you see how the time scales with the number of traits? (I think 4 of these could be dropped without changing the signature/breaking compilation [it could result in test failures, but you can still see the compilation results], so that could be used to get some data points.)

With what config are you building these?

This is a release + asserts build. I don’t know how to easily test how the time scales with the number of traits, but it looks like it is mostly taken up by

mlir::detail::FilterTypes<op_definition_impl::detect_has_any_fold_trait
and
mlir::detail::FilterTypes<op_definition_impl::detect_has_verify_trait

That kind of makes sense looking at the profile. Can you post a repro setup? The filtering is integral for code size, but the way it’s done right now was the easiest thing to write. I can take a look at doing something more intelligent there, but I would need your setup to test.

To give a more concrete explanation of why it’s bad, it is effectively generating a permutation of std::tuple specialization instantiations for each op trait list. If we remove the std::tuple from the metaprogramming, the effect would likely be significant.
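Roughly, the pattern being discussed looks like the following simplified, self-contained sketch (illustrative only – this is not the actual mlir::detail::FilterTypes implementation):

#include <tuple>
#include <type_traits>
#include <utility>

// Keep only the Ts for which Pred<T>::value is true. Each recursion step
// names a fresh std::tuple specialization, so every distinct trait list
// spawns its own chain of instantiations, once per predicate
// (fold hook, verify hook, ...).
template <template <typename> class Pred, typename... Ts>
struct FilterTypes {
  using type = std::tuple<>;
};

template <template <typename> class Pred, typename T, typename... Rest>
struct FilterTypes<Pred, T, Rest...> {
  using Tail = typename FilterTypes<Pred, Rest...>::type;
  using type = std::conditional_t<
      Pred<T>::value,
      decltype(std::tuple_cat(std::declval<std::tuple<T>>(),
                              std::declval<Tail>())),
      Tail>;
};

// Toy usage: with N types this instantiates O(N) structs and tuples per
// predicate, and the op headers do the equivalent for every op's trait list.
static_assert(
    std::is_same_v<FilterTypes<std::is_integral, int, float, long>::type,
                   std::tuple<int, long>>,
    "only the integral types survive the filter");

The cost is paid again for each predicate and for each op’s distinct trait list, which is why it adds up quickly across a dialect with many ops.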

– River

Easiest is just having ODS emit duplicate traits that should be NOPs (“print out” each one 2-5 times rather than once when generating the C++). But given that you and River have an idea here, this is just OOC then.

Silly question, but wouldn’t it be trivial for the code generator to know whether either of those hooks was involved and just emit directly?

(Not a strong opinion - just generally consider less metaprogramming to be better)

I think the primary reason I could see is that we never required ODS and allow pure C++ too. That being said, we could recommend ODS and make it faster & simpler while still allowing C++. Some work is needed on the ODS side too, as we allow native traits and don’t have them tagged.

The question isn’t limited to specific hooks, given that the filtering applies to other things as well. It is generally “trivial” for the cpp generator to do this pre-filtering… when we have all of the information there, which today we don’t.

There are various things here that I want to look into, some on the C++ side and some on the ODS side.

– River

Sounds good - consider my feedback a perfunctory “do you really need all of that metaprogramming” level comment.

The answer is always yes.


Given these are generally added by ODS, one way to reduce the number of traits is to have an ODSTrait<a,b,c,d,e> which handles the common things ODS wants to enforce: # regions, # operands, # results, # successors, etc. That would collapse 5 or 6 of these things away into one class.
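A rough sketch of what that could look like (purely illustrative – the name ODSTrait, the parameter order, and the error message are hypothetical, and variadic operand/result counts are ignored here):

#include "mlir/IR/OpDefinition.h"

// Hypothetical combined trait: folds the region/operand/result/successor
// count checks that ODS adds as separate traits today into one class, so
// each generated op contributes a single extra entry to the mlir::Op<...>
// trait list.
template <unsigned NumOperands, unsigned NumResults, unsigned NumRegions,
          unsigned NumSuccessors>
class ODSTrait {
public:
  template <typename ConcreteType>
  class Impl : public ::mlir::OpTrait::TraitBase<
                   ConcreteType, ODSTrait<NumOperands, NumResults, NumRegions,
                                          NumSuccessors>::Impl> {
  public:
    static ::mlir::LogicalResult verifyTrait(::mlir::Operation *op) {
      // One verifier covering what is currently spread across NOperands,
      // OneResult, ZeroRegion, ZeroSuccessor, etc.
      if (op->getNumOperands() != NumOperands ||
          op->getNumResults() != NumResults ||
          op->getNumRegions() != NumRegions ||
          op->getNumSuccessors() != NumSuccessors)
        return op->emitOpError(
            "operand/result/region/successor count does not match ODS");
      return ::mlir::success();
    }
  };
};

For the op quoted above, ODS could then emit a single ODSTrait<2, 1, 0, 0>::Impl instead of NOperands<2>::Impl, OneResult, ZeroRegion, and ZeroSuccessor separately.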

The problem is that it would be less orthogonal and beautiful perhaps, but if it leads to a significant build time speedup, then shrug.

-Chris

Looking at the profile, it seems like just building any .cpp file including the arith dialect ops would be a sufficient test case for looking at this. I am doing a release+asserts build, if it matters.

If you want to repro this in Torch-MLIR, the build instructions are available in the llvm/torch-mlir repository on GitHub.
You can repro by building e.g. lib/Conversion/TorchToLinalg/TorchToLinalg.cpp.

OOC I looked at this by injecting NOP traits:

diff --git a/mlir/include/mlir/IR/OpDefinition.h b/mlir/include/mlir/IR/OpDefinition.h
--- a/mlir/include/mlir/IR/OpDefinition.h
+++ b/mlir/include/mlir/IR/OpDefinition.h
@@ -438,6 +438,15 @@ public:
   };
 };
 
+template <unsigned N> class Fluff {
+public:
+  template <typename ConcreteType>
+  class Impl : public TraitBase<ConcreteType, Fluff<N>::Impl> {
+  public:
+    static LogicalResult verifyTrait(Operation *op) { return success(); }
+  };
+};
+
 /// This class provides the API for ops that are known to have a at least a
 /// specified number of operands.  This is used as a trait like this:
 ///
diff --git a/mlir/tools/mlir-tblgen/OpDefinitionsGen.cpp b/mlir/tools/mlir-tblgen/OpDefinitionsGen.cpp
--- a/mlir/tools/mlir-tblgen/OpDefinitionsGen.cpp
+++ b/mlir/tools/mlir-tblgen/OpDefinitionsGen.cpp
@@ -2521,6 +2521,11 @@ void OpEmitter::genTraits() {
   int numOperands = op.getNumOperands();
   int numVariadicOperands = op.getNumVariableLengthOperands();
 
+  for (int i = 0; i < NUM_FLUFF_EXTRA; ++i)
+    opClass.addTrait("::mlir::OpTrait::Fluff<" + Twine(i) + ">::Impl");
+
   // Add operand size trait.
   if (numVariadicOperands != 0) {
     if (numOperands == numVariadicOperands)

And got

extra traits per op    total compile time
                  0    10.44
                 50    10.92
                100    13.32
                200    20.58
                300    30.41

This is fairly linear in the number of traits.

If it is spending so much of its time “instantiating the traits”, then surely the answer is to look into how to optimize MLIR to do this faster, rather than change the user source code just to try and work around the problem. I have not looked at the compiler source code, but taking that long to compile 4500 lines of code seems excessive to me. Maybe it is caused by something simple, where a small change to the compiler’s source code would speed this up a lot. Have you tried profiling the compiler? Which method/function is it spending all its time in? Maybe lots of malloc() or something like that?

I hope you realize that “the compiler” here isn’t MLIR but clang++? Unfortunately C++ allows you to write templates complicated enough to explode compile time. The answer in general is to look into the C++ code patterns…