MLIR benchmarks

Hey Aart and everyone,

I have started a revision that creates the bare minimal build necessary to get started writing actual benchmarks for sparse kernels. The revision is here.

The sequence of steps we need to follow to implement benchmarks for sparse kernels is outlined by Aart here, which I am copying inline:

(1) investigate the typical way in which LLVM at large integrates benchmarks
(2) find interesting sparse kernels to implement and measure
(3) integrate such tests into a continuous build (or at least a frequently run system)

The patch linked above strives to achieve the first objective. Once that is done, we can start talking about what benchmarks to implement and so on.

Thank you,
Saurabh


Thanks for starting the discussion! Quick first reply: since you are really introducing a benchmarking framework (not just sparse), may I suggest you rename the subject to e.g. just “MLIR Benchmarks”. Then, sparse benchmarks can be a subfolder within this newly proposed framework.

Sure thing, done!

Hey everyone,

I am working on introducing benchmarks for MLIR in general and for SparseKernel in particular. So far, I have explored two ways of doing it:

  1. Using the Google Benchmark library.
  2. Using Python’s timeit library.

Here’s a summary of the discussion I had with @mehdi_amini and @aartbik on the patch:

  1. To use the Google Benchmark library, we can write C++ execution tests, but we would need to improve the runtime to make this approach viable.
  2. We can set up and run MLIR IR from Python programs, and thus use Python libraries to do the execution and report the timings (a rough sketch of this route follows below).
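
To make the second option concrete, here is a minimal, hypothetical sketch of the Python route. It assumes the upstream MLIR Python bindings (mlir.ir, mlir.passmanager, mlir.execution_engine) are built and importable; the kernel, the lowering pipeline, and the use of timeit are illustrative placeholders, not a proposal for what the harness should look like.

```python
# Minimal sketch (assumption: upstream MLIR Python bindings are available).
import ctypes
import timeit

from mlir import ir
from mlir.execution_engine import ExecutionEngine
from mlir.passmanager import PassManager

KERNEL = """
func.func @add_one(%arg0: i32) -> i32 attributes { llvm.emit_c_interface } {
  %c1 = arith.constant 1 : i32
  %0 = arith.addi %arg0, %c1 : i32
  return %0 : i32
}
"""

with ir.Context():
    module = ir.Module.parse(KERNEL)
    # Lower to the LLVM dialect so the ExecutionEngine can JIT it; the exact
    # pipeline depends on what the benchmarked IR contains.
    PassManager.parse(
        "builtin.module(convert-arith-to-llvm,convert-func-to-llvm,"
        "reconcile-unrealized-casts)"
    ).run(module.operation)
    engine = ExecutionEngine(module, opt_level=2)

    # Arguments (and the result slot) are passed as ctypes pointers.
    arg = ctypes.pointer(ctypes.c_int32(41))
    res = ctypes.pointer(ctypes.c_int32(0))
    # Time repeated invocations with plain timeit; percentiles, mean, etc. can
    # then be computed on the collected samples in Python.
    samples = timeit.repeat(
        lambda: engine.invoke("add_one", arg, res), repeat=5, number=100)
    print("best per-call time:", min(samples) / 100, "s")
```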

I believe we might go down the python route. What do people think about this?

I am also not sure whether these benchmarks should be included in llvm-lit’s discovery path.


Strong +1 on my end: seamless integration into a language/runtime/library ecosystem that provides easy-to-use libraries is a must for fast iteration, reproducibility, and just showing the usefulness of MLIR in general.

We have been operating under the assumption that this type of infra is a bridge too far for the MLIR / LLVM repo, so we have been building our own out of tree for a few months. This is also the place where the sparse compiler benchmarks started before forking off to their separate place that also brings in TACO.

Recently, the winds have shifted, motivated in large part by realizing that this is possible, necessary for reuse and overdue. There are a few orthogonal aspects here that I’ll happily unpack once there is reasonable confidence that this is welcome in-tree.

The MVP for me, the one that would show this is possible, is a base reusable harness that we can all contribute to, plus per-project/dialect entry points.

Our harness is built around the concept of a “ProblemDefinition” that makes it easy to create a new benchmark that plays nicely with it. It specifies (a rough sketch follows the list):

  1. how to create and initialize numpy and MLIR tensors (incl. alignment),
  2. the compute/bandwidth characteristics of the benchmark,
  3. a check function that compares the MLIR JIT’ed code with a known correct implementation (e.g. in numpy),
  4. an entry point to metaprogram MLIR from Python under a given context manager.
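
To make the shape of this concrete, here is a rough, hypothetical sketch of what such a ProblemDefinition interface could look like. The method names are illustrative only, not the actual iree-llvm-sandbox API.

```python
# Illustrative sketch of a ProblemDefinition-style interface (hypothetical names).
import abc

import numpy as np


class ProblemDefinition(abc.ABC):
    """One benchmark problem: data setup, cost model, check, and IR builder."""

    @abc.abstractmethod
    def tensors_np_builder(self, sizes, dtype=np.float32):
        """Create and initialize the numpy/MLIR tensors, incl. alignment."""

    @abc.abstractmethod
    def gflop_count(self, sizes):
        """Compute characteristics: floating-point operations for the problem."""

    @abc.abstractmethod
    def gbyte_count(self, sizes, dtype=np.float32):
        """Bandwidth characteristics: bytes moved by the kernel."""

    @abc.abstractmethod
    def check_np(self, *tensors):
        """Compare the MLIR JIT'ed result against a known-correct numpy impl."""

    @abc.abstractmethod
    def build_problem_under_context_manager(self, name, types):
        """Metaprogram the MLIR payload under a given ir.Context/Location."""
```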

The ProblemDefinition itself is not ideal: it requires too much human intervention and should be automated; the current state is as far as we have been willing to push it for our use cases so far.

IMO, a “test for success” of a basic setup in-tree would be that we can just delete the existing code and reuse what would be in-tree.

The flow I describe above is currently tailored for a “custom op” programming model and has a minimal runtime/compiler contract to specify what “bufferization” should do, as well as various compiler strategies and basic “transformation wrapping infra” that are not appropriate for other projects. So if you go with a model similar to the one described above, let’s be sure to iterate closely so we can surface what is generally reusable and portable vs. what is too specific.

Below is a snapshot of what you can expect from the basic stuff that exists today:

> python -m python.examples.matmul.bench

###############################################################
Compile-time problem size {'m': 2040, 'n': 2041, 'k': 2042}
Runtime problem size {'m': 2040, 'n': 2041, 'k': 2042}
Problem types [<class 'numpy.float32'>, <class 'numpy.float32'>, <class 'numpy.float32'>]

Compilation expert <python.examples.core.transform.TransformationList object at 0x7fd3b8e4e5e0>
compilation in 0.2015s
xxxxxxxxxx : 10 iters time on 1 threads
------------------------------------------------------------------------------------------------------------------------
     slowest          p1         p10         p25         p50         p75         p90         p99     fastest        unit
------------------------------------------------------------------------------------------------------------------------
     2.3e-01     2.3e-01     2.2e-01     2.2e-01     2.2e-01     2.2e-01     2.1e-01     2.1e-01     2.1e-01     seconds
       73.58       73.58       76.24       76.49       78.15       78.76       82.07       82.07       82.07    GFlops/s
        0.29        0.29        0.30        0.30        0.31        0.31        0.32        0.32        0.32       GBs/s

Compilation expert <python.examples.core.transform.TransformationList object at 0x7fd3b8e4e580>
compilation in 0.322s
xxxxxxxxxx : 10 iters time on 1 threads
------------------------------------------------------------------------------------------------------------------------
     slowest          p1         p10         p25         p50         p75         p90         p99     fastest        unit
------------------------------------------------------------------------------------------------------------------------
     1.6e-01     1.6e-01     1.6e-01     1.6e-01     1.6e-01     1.6e-01     1.5e-01     1.5e-01     1.5e-01     seconds
      103.70      103.70      103.74      103.78      105.14      106.28      112.44      112.44      112.44    GFlops/s
        0.41        0.41        0.41        0.41        0.41        0.42        0.44        0.44        0.44       GBs/s

###############################################################
Compile-time problem size {'m': -1, 'n': 2041, 'k': -1}
Runtime problem size {'m': 2040, 'n': 2041, 'k': 2042}
Problem types [<class 'numpy.float32'>, <class 'numpy.float32'>, <class 'numpy.float32'>]

Compilation expert <python.examples.core.transform.TransformationList object at 0x7fd3b8e4e5e0>
compilation in 0.2068s
xxxxxxxxxx : 10 iters time on 1 threads
------------------------------------------------------------------------------------------------------------------------
     slowest          p1         p10         p25         p50         p75         p90         p99     fastest        unit
------------------------------------------------------------------------------------------------------------------------
     2.1e-01     2.1e-01     2.0e-01     2.0e-01     1.9e-01     1.9e-01     1.9e-01     1.9e-01     1.9e-01     seconds
       82.80       82.80       86.25       86.40       88.16       88.68       89.35       89.35       89.35    GFlops/s
        0.32        0.32        0.34        0.34        0.35        0.35        0.35        0.35        0.35       GBs/s

Compilation expert <python.examples.core.transform.TransformationList object at 0x7fd3b8e4e580>
compilation in 0.3505s
xxxxxxxxxx : 10 iters time on 1 threads
------------------------------------------------------------------------------------------------------------------------
     slowest          p1         p10         p25         p50         p75         p90         p99     fastest        unit
------------------------------------------------------------------------------------------------------------------------
     1.7e-01     1.7e-01     1.7e-01     1.7e-01     1.6e-01     1.5e-01     1.5e-01     1.5e-01     1.5e-01     seconds
      101.42      101.42      101.94      102.05      108.79      111.31      113.19      113.19      113.19    GFlops/s
        0.40        0.40        0.40        0.40        0.43        0.44        0.44        0.44        0.44       GBs/s

###############################################################
Compile-time problem size {'m': -1, 'n': -1, 'k': -1}
Runtime problem size {'m': 2040, 'n': 2041, 'k': 2042}
Problem types [<class 'numpy.float32'>, <class 'numpy.float32'>, <class 'numpy.float32'>]

Compilation expert <python.examples.core.transform.TransformationList object at 0x7fd3b8e4e5e0>
compilation in 0.2182s
xxxxxxxxxx : 10 iters time on 1 threads
------------------------------------------------------------------------------------------------------------------------
     slowest          p1         p10         p25         p50         p75         p90         p99     fastest        unit
------------------------------------------------------------------------------------------------------------------------
     2.1e-01     2.1e-01     2.0e-01     2.0e-01     2.0e-01     2.0e-01     1.9e-01     1.9e-01     1.9e-01     seconds
       80.91       80.91       83.69       83.76       85.70       87.13       90.24       90.24       90.24    GFlops/s
        0.32        0.32        0.33        0.33        0.34        0.34        0.35        0.35        0.35       GBs/s

Compilation expert <python.examples.core.transform.TransformationList object at 0x7fd3b8e4e580>
compilation in 0.344s
xxxxxxxxxx : 10 iters time on 1 threads
------------------------------------------------------------------------------------------------------------------------
     slowest          p1         p10         p25         p50         p75         p90         p99     fastest        unit
------------------------------------------------------------------------------------------------------------------------
     1.7e-01     1.7e-01     1.7e-01     1.7e-01     1.6e-01     1.6e-01     1.6e-01     1.6e-01     1.6e-01     seconds
      100.42      100.42      100.65      100.71      105.42      106.16      108.50      108.50      108.50    GFlops/s
        0.39        0.39        0.39        0.39        0.41        0.42        0.43        0.43        0.43       GBs/s

On the specific use of Python timing infra: this can easily get noisy and hard to control, especially at low latencies.
So we built our own, where the harness synthesizes a timing loop that calls a lower-level C timing library and fills an array of timing values.

When done, we run basic stats on those higher-confidence numbers in Python.
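
As an illustration of that last step, here is a small, hypothetical sketch of how a percentile table like the one above could be produced from an array of collected timings (names and layout are illustrative):

```python
import numpy as np


def report(times_s, gflop, gbyte, n_iters):
    """Print a slowest-to-fastest percentile table for an array of timings.

    `times_s` is the array filled by the timing loop (seconds per iteration);
    `gflop`/`gbyte` are the problem's compute/bandwidth counts.
    """
    cols = ["slowest", "p1", "p10", "p25", "p50", "p75", "p90", "p99", "fastest"]
    t = np.percentile(times_s, [100, 99, 90, 75, 50, 25, 10, 1, 0])
    print(f"{n_iters} iters")
    print("".join(f"{c:>12}" for c in cols) + f"{'unit':>12}")
    print("".join(f"{v:12.1e}" for v in t) + f"{'seconds':>12}")
    print("".join(f"{gflop / v:12.2f}" for v in t) + f"{'GFlops/s':>12}")
    print("".join(f"{gbyte / v:12.2f}" for v in t) + f"{'GBs/s':>12}")
```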

There would be a lot of interest on our end in help plugging perf-counter harvesting in there and starting to generate some roofline plots: we have a few cases where we suspect we are frontend-bound, and I was made aware recently that llvm-mca does not model that.

Also: Benchmarking tips — LLVM 16.0.0git documentation

For benchmarking purposes, we have added this wrapper https://github.com/llvm/llvm-project/blob/e90630e5a501fc1ef3b8e56e9d35bb6588c4eea1/mlir/lib/ExecutionEngine/RunnerUtils.cpp#L79 around the C++ high-resolution clock. It is used in several downstream projects as follows:

  1. take the MLIR function you want to benchmark,
  2. emit a “benchmark” function that takes the same arguments as the original function, calls the timer, the original function, and the timer again before computing the difference,
  3. return the results of the original function + the measured time.

This sounds a bit complicated at a glance, but it is really a dozen lines. It is also flexible enough to integrate with whatever code will be calling this function. I have used it from Python.
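
For reference, here is a minimal sketch of such an emitted wrapper as textual IR built from Python. It assumes the kernel to measure is @kernel (defined elsewhere in the same module) and that the timer is exposed to the JIT as nanoTime returning an i64 nanosecond timestamp; adjust the names and types to whatever timing symbol is actually linked in from RunnerUtils.

```python
# Hypothetical wrapper emission: the string below is illustrative textual MLIR.
BENCH_WRAPPER = """
func.func private @nanoTime() -> i64 attributes { llvm.emit_c_interface }

func.func @bench_kernel(%arg0: memref<?xf32>) -> i64
    attributes { llvm.emit_c_interface } {
  %t0 = func.call @nanoTime() : () -> i64
  func.call @kernel(%arg0) : (memref<?xf32>) -> ()
  %t1 = func.call @nanoTime() : () -> i64
  // Elapsed nanoseconds for one invocation of @kernel. A full version would
  // also return the kernel's results alongside the measured time.
  %dt = arith.subi %t1, %t0 : i64
  return %dt : i64
}
"""
```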


Regarding running Python tests from lit: it is mostly straightforward and is already in use for the Python binding tests. lit itself is driven in Python, and test discovery is literally a glob by filename extension. The only important thing is to have a local configuration in a directory that marks the tests as unsupported for lit if the user has not requested benchmarking; this will need some CMake/lit communication similar to what is done here https://github.com/llvm/llvm-project/blob/main/mlir/test/python/lit.local.cfg and here https://github.com/llvm/llvm-project/blob/main/mlir/test/lit.site.cfg.py.in#L50.
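
For reference, a minimal sketch of such a directory-local configuration. The mlir_run_benchmarks flag is hypothetical; it would need to be wired through CMake into the generated lit.site.cfg.py, along the lines of the files linked above.

```python
# lit.local.cfg in the benchmarks directory (sketch, hypothetical flag name).
if not config.mlir_run_benchmarks:
    # Skip everything under this directory unless benchmarking was requested.
    config.unsupported = True
```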


Could you explain more about what you mean here? I take it that it is currently targeted at smallish programs that can be metaprogrammed at the single-op level. What you have is really nice for that level of work.

I’ve struggled with the continuity of tooling between that and the larger scale, and we have a lot of gaps there. Currently, we end up reducing such problems down to benchmarks in this kind of harness when we have a problem or need to do focused development, but that leaves a lot to be desired.

At the larger scale, in my experience, the primary differences are:

  • You are rarely meta-programming: you are working from programs that something higher level has produced.
  • You may not have a convenient place to just insert a timing loop. Instead, you want some kind of interface to surface and aggregate metrics (which may be coming from different devices).
  • You tend to be more interested in metrics that indicate total system performance, beyond single invocations (i.e. how much overhead between repeat invocations, compute-unit idle time, peak memory usage over time, etc.).
  • Amortization of startup costs.
  • You want to drill down into these lower level metrics when there is a problem or opportunity.

In my experience, these are rarely served by the same tooling and development process, but they do benefit from a continuity of tooling and approach – and right now are more disconnected than I think we ultimately want. But I struggle with concrete steps to grow one up and grow the other down towards a common center.


Yes, this is an instance of alternating between top-down and bottom-up investigations.
I am starting to see some glimmers on the horizon / have a few ideas for bridges to get us closer.
The steps outlined are indeed for O(1) ops, but they already help reduce ambiguity and surprise.
They are by no means sufficient but IMO they are necessary for solving the bigger problem progressively, without surprises.

I did take the post as looking at benchmarking at the O(1) ops level since it was mentioning SparseKernel, but I agree the larger space is much more ambiguous: the tooling I personally like in this space is nvprof/nsight but that is definitely a different level.

So many good comments :slightly_smiling_face:. Here’s what I understand from the above discussion:

  1. We should call these benchmarks from Python.
  2. We should include them in llvm-lit, with CMake flags to include/exclude them.
  3. The directory structure I am thinking about is a top-level directory, something like mlir/benchmark/python, containing all the code for a test harness, plus subdirectories for the benchmarks of the different dialects.

What we need to decide on is what approach we are taking for the test harness. Should we go with the iree-llvm-sandbox approach suggested by @nicolasvasilache or the RunnerUtils approach suggested by @ftynse? We can probably get started with something lightweight on a toy benchmark and see if it works.

There were a lot of things I couldn’t understand in the discussion so far, like “metaprogrammed at the op level” or “bufferization”. It will hopefully make more sense to me as I spend more time working on this project.

Note that the Python level reuses RunnerUtils, so this is not a simple binary choice (sorry :slight_smile: ).
Rather, there are multiple ways to slice the problem depending on what you want to achieve and when.

On this topic, what are you trying to achieve with lit to drive these?
In general, lit is nice for driving test execution and reporting a pass/fail result (and coarse-grained timing for the test invocation). However, in the context of benchmarks I would expect a very different set of features from the test harness:

  • stabilize / warm up the machine,
  • re-run a benchmark until you get confidence in the result (see the sketch after this list),
  • collect timings and aggregate them in a format that can then be post-processed for analysis (and fed to a tracing tool/database).
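
For the second bullet, here is a small, harness-agnostic sketch of what “re-run until confident” could look like; the stability criterion and names are illustrative only.

```python
import numpy as np


def measure_until_stable(run_once, rel_tol=0.05, min_samples=10, max_samples=200):
    """Collect timing samples until the IQR/median ratio drops below rel_tol."""
    samples = []
    while len(samples) < max_samples:
        samples.append(run_once())  # one timed invocation, in seconds
        if len(samples) >= min_samples:
            p25, p50, p75 = np.percentile(samples, [25, 50, 75])
            if (p75 - p25) / p50 < rel_tol:
                break
    return np.asarray(samples)
```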

Lit won’t provide any of this. Traditionally, LLVM uses LNT for most of these, but it works best for tracking full-process execution time. Also, it seems that we’re going the route of integrating some of these features directly into a custom Python harness.

I would maybe look into how Google Benchmark is used and driven by other LLVM subprojects.


I was mainly looking for something that we can run as part of the build process. You are right that the characteristics of the benchmarks we are going to have are very different from what lit provides, so I am not sure now whether a lit driver would be appropriate here.

Using Google Benchmark would require a C++ interface, and I thought there are runtime limitations on generating MLIR IR from C++ like it’s done here. Did you mean using Google Benchmark’s Python bindings? There are examples of benchmarks in libc and libcxx that use Google Benchmark’s C++ interface that I can use if we want to explore this path.

My comment was purely from a “driver” point of view: how will we invoke the benchmark suite and get performance numbers out of this?
I would expect that driving a suite of Google Benchmark tests has a similar “driver” requirement? Unless it is all linked into a single binary that you just execute once and it takes care of everything internally?

Here is the doc FYI: Test Producers — LNT 0.4.2.dev0 documentation; I don’t know whether it would provide any “driver” feature for what we need here, though.

+1 on this. I also made a similar comment on that in the revision, but what I would like to see in a benchmark framework is, e.g., a report on the JIT time required for a kernel and the actual time required to run the kernel (with cold- and warm-cache behavior). For sparse kernels, we probably also want information on reading a large sparse tensor, packing it, assembling it for a kernel, and running the actual computation.

For example, borrowing from our very good friends of TACO, their runtime tools have a built-in mechanism for timing sparse kernels, with an output sample shown below.

b file read: 281.989 ms
b pack:      277.087 ms
b size: (16614 x 16614), 13162808 bytes

Compile:  127.5 ms
Assemble: 2.41907 ms
Compute time (ms)
  mean:   0.000354375
  stdev:  0.000503503
  median: 0.000152

Thank you both for the inputs. I am still not sure if I got everything so let me just post my current understanding here.

Driver

We need a “driver” to drive our benchmark suite. What Mehdi suggested, which I misunderstood earlier, was to look at the ways other LLVM subprojects’ benchmarks are driven and take inspiration from that for the new framework. The driver is definitely not going to be lit.

Benchmarking tool

I am not sure yet what to use here. Let me take some time to understand this comment thread, update the current revision, and then we can discuss more once we have something concrete.


Hey everyone,

I have put a Python script together to implement and run benchmarks. I borrowed heavily from iree-llvm-sandbox. Here’s the output from my local run:

********************
Benchmarking "sparse_kernel"
Compilation time: 0.32732105255126953
  Number of runs: 100
  Mean:           742.2206599799997 ms
  Median:         733.643208 ms
  Fastest:        701.91025 ms
  p1:             705.70454875 ms
  p10:            715.0420289 ms
  p25:            723.45889525 ms
  p50:            733.643208 ms
  p75:            752.60710375 ms
  p90:            769.5289202 ms
  p99:            837.5624051800003 ms
  Slowest:        890.5387089999999 ms
********************

The patch I have contains both the Python script and an example benchmark kernel. Maybe we should split it up.

I am also looking for a way to run them. As far as I understand, these benchmarks are going to be run manually and their output is going to be interpreted by a human. I am not sure, though, so please let me know if anyone has thoughts on this.

Although it is true that the output will often be interpreted by a human to find opportunities for further optimizations, I also think there should be a shared output component across all benchmarks that can be parsed by machines, for example by a future performance-tracking robot (which shows performance over time on a status page and sends out an email when performance is badly broken by a new revision).
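
For instance, in addition to the human-readable report, each benchmark could emit a small machine-parsable record. This is only a sketch of a possible shape; the field names are illustrative and the values are taken loosely from the output above, not a decided schema.

```python
import json

record = {
    "benchmark": "sparse_kernel",
    "compilation_time_s": 0.327,
    "runs": 100,
    "wall_time_ms": {
        "mean": 742.22, "median": 733.64, "p90": 769.53,
        "fastest": 701.91, "slowest": 890.54,
    },
}
# One JSON object per line makes it trivial for a tracking bot to ingest.
print(json.dumps(record))
```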
