Strong +1 on my end: seamless integration into a language/runtime/library ecosystem with easy-to-use libraries is a must for fast iteration, reproducibility, and for demonstrating the usefulness of MLIR in general.
We had been operating under the assumption that this type of infra is a bridge too far for the MLIR / LLVM repo, so we have been building our own out of tree for a few months. This is also where the sparse compiler benchmarks started before forking off to their own place, which also brings in TACO.
Recently, the winds have shifted, motivated in large part by the realization that this is possible, necessary for reuse, and overdue. There are a few orthogonal aspects here that I’ll happily unpack once there is reasonable confidence that this is welcome in-tree.
The MVP that would demonstrate this for me is a base reusable harness that we can all contribute to, plus per-project/dialect entry points.
Our harness is built around the concept of a “ProblemDefinition” that makes it easy to create a new benchmark that plays nicely with it. It specifies:
- how to create and initialize numpy and MLIR tensors (incl. alignment),
- the compute/bandwidth characteristics of the benchmark,
- a check function that compares the MLIR JIT’ed code against a known-correct implementation (e.g. in numpy),
- an entry point to metaprogram MLIR from Python under a given context manager.
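A minimal sketch of what such a ProblemDefinition could look like for matmul; the class and method names here are hypothetical (not the actual out-of-tree API), and MLIR-side tensor creation, alignment handling, and the metaprogramming entry point are elided:

```python
import numpy as np

class MatmulProblem:
    """Hypothetical ProblemDefinition sketch for C = A @ B."""

    def __init__(self, m, n, k, dtype=np.float32):
        self.m, self.n, self.k = m, n, k
        self.dtype = dtype

    def init_tensors(self):
        # Create and initialize the numpy inputs (alignment handling elided).
        a = np.random.rand(self.m, self.k).astype(self.dtype)
        b = np.random.rand(self.k, self.n).astype(self.dtype)
        c = np.zeros((self.m, self.n), dtype=self.dtype)
        return a, b, c

    def flops(self):
        # Compute characteristics: 2*m*n*k flops for matmul.
        return 2 * self.m * self.n * self.k

    def bytes_moved(self):
        # Bandwidth characteristics: read A and B, write C.
        itemsize = np.dtype(self.dtype).itemsize
        return itemsize * (self.m * self.k + self.k * self.n + self.m * self.n)

    def check(self, a, b, c):
        # Compare the JIT'ed result against a known-correct numpy impl.
        np.testing.assert_allclose(c, a @ b, rtol=1e-5)
```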
The ProblemDefinition itself is not ideal: it requires too much human intervention and should be automated; the current state is as far as we’ve been willing to push it for our use cases.
IMO, a “test for success” for a basic in-tree setup would be that we can simply delete our existing code and reuse what lives in-tree.
The flow I describe above is currently tailored to a “custom op” programming model: it has a minimal runtime/compiler contract to specify what “bufferization” should do, as well as various compiler strategies and basic “transformation-wrapping infra” that are not appropriate for other projects. So if you go with a model similar to the one I described above, let’s be sure to iterate closely so we can surface what is generally reusable and portable vs. what is too specific.
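To make the flow concrete, here is a hypothetical driver loop in the spirit of the harness; every name is illustrative, and a plain Python callable stands in for the MLIR JIT-compiled code and for the compilation experts:

```python
import time
import numpy as np

def benchmark(make_args, experts, check, n_iters=10):
    """Illustrative driver loop (not the actual harness API).

    make_args: creates fresh input tensors for each iteration.
    experts:   name -> "compiler" returning a callable; stands in for
               the MLIR compilation strategies.
    check:     compares the result against a known-correct impl.
    """
    results = {}
    for name, compile_fn in experts.items():
        t0 = time.monotonic()
        run = compile_fn()  # stands in for MLIR JIT compilation
        print(f"compilation in {time.monotonic() - t0:.4f}s")
        times = []
        for _ in range(n_iters):
            args = make_args()
            t0 = time.monotonic()
            result = run(*args)
            times.append(time.monotonic() - t0)
        check(*args, result)  # verify against the reference impl
        results[name] = times
    return results
```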
Below is a snapshot of what you can expect from the basic stuff that exists today:
> python -m python.examples.matmul.bench
###############################################################
Compile-time problem size {'m': 2040, 'n': 2041, 'k': 2042}
Runtime problem size {'m': 2040, 'n': 2041, 'k': 2042}
Problem types [<class 'numpy.float32'>, <class 'numpy.float32'>, <class 'numpy.float32'>]
Compilation expert <python.examples.core.transform.TransformationList object at 0x7fd3b8e4e5e0>
compilation in 0.2015s
xxxxxxxxxx : 10 iters time on 1 threads
------------------------------------------------------------------------------------------------------------------------
slowest p1 p10 p25 p50 p75 p90 p99 fastest unit
------------------------------------------------------------------------------------------------------------------------
2.3e-01 2.3e-01 2.2e-01 2.2e-01 2.2e-01 2.2e-01 2.1e-01 2.1e-01 2.1e-01 seconds
73.58 73.58 76.24 76.49 78.15 78.76 82.07 82.07 82.07 GFlops/s
0.29 0.29 0.30 0.30 0.31 0.31 0.32 0.32 0.32 GBs/s
Compilation expert <python.examples.core.transform.TransformationList object at 0x7fd3b8e4e580>
compilation in 0.322s
xxxxxxxxxx : 10 iters time on 1 threads
------------------------------------------------------------------------------------------------------------------------
slowest p1 p10 p25 p50 p75 p90 p99 fastest unit
------------------------------------------------------------------------------------------------------------------------
1.6e-01 1.6e-01 1.6e-01 1.6e-01 1.6e-01 1.6e-01 1.5e-01 1.5e-01 1.5e-01 seconds
103.70 103.70 103.74 103.78 105.14 106.28 112.44 112.44 112.44 GFlops/s
0.41 0.41 0.41 0.41 0.41 0.42 0.44 0.44 0.44 GBs/s
###############################################################
Compile-time problem size {'m': -1, 'n': 2041, 'k': -1}
Runtime problem size {'m': 2040, 'n': 2041, 'k': 2042}
Problem types [<class 'numpy.float32'>, <class 'numpy.float32'>, <class 'numpy.float32'>]
Compilation expert <python.examples.core.transform.TransformationList object at 0x7fd3b8e4e5e0>
compilation in 0.2068s
xxxxxxxxxx : 10 iters time on 1 threads
------------------------------------------------------------------------------------------------------------------------
slowest p1 p10 p25 p50 p75 p90 p99 fastest unit
------------------------------------------------------------------------------------------------------------------------
2.1e-01 2.1e-01 2.0e-01 2.0e-01 1.9e-01 1.9e-01 1.9e-01 1.9e-01 1.9e-01 seconds
82.80 82.80 86.25 86.40 88.16 88.68 89.35 89.35 89.35 GFlops/s
0.32 0.32 0.34 0.34 0.35 0.35 0.35 0.35 0.35 GBs/s
Compilation expert <python.examples.core.transform.TransformationList object at 0x7fd3b8e4e580>
compilation in 0.3505s
xxxxxxxxxx : 10 iters time on 1 threads
------------------------------------------------------------------------------------------------------------------------
slowest p1 p10 p25 p50 p75 p90 p99 fastest unit
------------------------------------------------------------------------------------------------------------------------
1.7e-01 1.7e-01 1.7e-01 1.7e-01 1.6e-01 1.5e-01 1.5e-01 1.5e-01 1.5e-01 seconds
101.42 101.42 101.94 102.05 108.79 111.31 113.19 113.19 113.19 GFlops/s
0.40 0.40 0.40 0.40 0.43 0.44 0.44 0.44 0.44 GBs/s
###############################################################
Compile-time problem size {'m': -1, 'n': -1, 'k': -1}
Runtime problem size {'m': 2040, 'n': 2041, 'k': 2042}
Problem types [<class 'numpy.float32'>, <class 'numpy.float32'>, <class 'numpy.float32'>]
Compilation expert <python.examples.core.transform.TransformationList object at 0x7fd3b8e4e5e0>
compilation in 0.2182s
xxxxxxxxxx : 10 iters time on 1 threads
------------------------------------------------------------------------------------------------------------------------
slowest p1 p10 p25 p50 p75 p90 p99 fastest unit
------------------------------------------------------------------------------------------------------------------------
2.1e-01 2.1e-01 2.0e-01 2.0e-01 2.0e-01 2.0e-01 1.9e-01 1.9e-01 1.9e-01 seconds
80.91 80.91 83.69 83.76 85.70 87.13 90.24 90.24 90.24 GFlops/s
0.32 0.32 0.33 0.33 0.34 0.34 0.35 0.35 0.35 GBs/s
Compilation expert <python.examples.core.transform.TransformationList object at 0x7fd3b8e4e580>
compilation in 0.344s
xxxxxxxxxx : 10 iters time on 1 threads
------------------------------------------------------------------------------------------------------------------------
slowest p1 p10 p25 p50 p75 p90 p99 fastest unit
------------------------------------------------------------------------------------------------------------------------
1.7e-01 1.7e-01 1.7e-01 1.7e-01 1.6e-01 1.6e-01 1.6e-01 1.6e-01 1.6e-01 seconds
100.42 100.42 100.65 100.71 105.42 106.16 108.50 108.50 108.50 GFlops/s
0.39 0.39 0.39 0.39 0.41 0.42 0.43 0.43 0.43 GBs/s