MLIR News, 21st edition (11/28/2020)

See the previous published edition.

Welcome to the twenty-first issue of the MLIR (bi)Weekly, a newsletter (published on Friday) covering developments in MLIR and related projects in the ecosystem. MLIR (bi)Weekly is brought to you by a collective effort of contributors; we welcome your contributions!


MLIR Core

  • Function.h and Module.h are in the process of being removed in favor of BuiltinOps.h
  • Side effect instances can now specify an Attribute containing additional effect parameters.
  • Side effect instances can now provide a SymbolRefAttr as the value being affected.

Optimizations and Code Generation

  • An RFC is open for discussion about adding dialects for modeling the ARM Neon and SVE instruction sets.
  • The prototype sparse compiler has been committed:
    • Some sanity-check benchmarking shows “on par” performance for a couple of sparse kernels and matrices compared to the Eigen library.
  • A parallelization strategy was also added:
    • Provides control over which loops should be expressed with “scf.parallel” (inner/outer loops, dense/sparse loops); see the sketch after this list.
    • Some sanity-check benchmarking using the current in-tree async lowering for parallel loops exhibits reasonable speedups over sequential sparse code.
  • Planned next: vectorization strategy, invariant code hoisting, storage type control
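
To illustrate the distinction the strategy controls, here is a generic sketch (not code from the sparse compiler; the kernel is made up): the outer loop is expressed with scf.parallel so a later lowering may run its iterations in parallel, while the inner loop stays sequential.

```mlir
// Generic sketch (not from the sparse compiler): the outer loop is expressed
// with scf.parallel, so a later lowering may run its iterations in parallel;
// the inner loop stays sequential as scf.for.
func @scale(%A: memref<?x?xf32>, %s: f32) {
  %c0 = constant 0 : index
  %c1 = constant 1 : index
  %rows = dim %A, %c0 : memref<?x?xf32>
  %cols = dim %A, %c1 : memref<?x?xf32>
  scf.parallel (%i) = (%c0) to (%rows) step (%c1) {
    scf.for %j = %c0 to %cols step %c1 {
      %v = load %A[%i, %j] : memref<?x?xf32>
      %r = mulf %v, %s : f32
      store %r, %A[%i, %j] : memref<?x?xf32>
    }
  }
  return
}
```

Loops expressed with scf.parallel are the ones the async lowering discussed later in this thread can turn into parallel tasks.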

SPIR-V

  • Various cleanups were introduced to improve consistency in the SPIR-V dialect. spv._* ops were renamed to spv.mlir.* ops to follow the general convention in MLIR; see the sketch after this list.
  • The module combiner can now unique global variables, specialization constants, and functions.
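
As a concrete example of the renaming (a hedged sketch; the surrounding module is illustrative and only the op spelling matters), spv._address_of is now spelled spv.mlir.addressof:

```mlir
// Hedged sketch; the module is illustrative, only the op spelling matters.
spv.module Logical GLSL450 {
  spv.globalVariable @gv : !spv.ptr<f32, Private>
  spv.func @use() "None" {
    // Previously spelled: %ptr = spv._address_of @gv : !spv.ptr<f32, Private>
    %ptr = spv.mlir.addressof @gv : !spv.ptr<f32, Private>
    spv.Return
  }
}
```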

Other

  • OpenMP: Added an operation for the OpenMP worksharing-loop construct, omp.wsloop. A conversion pass from SCF parallel loops to OpenMP parallel + worksharing loops was also added; see the sketch below. Patches to pretty-print/parse the op and to lower it to LLVM IR are in progress.
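
A rough sketch of the IR shape this conversion targets; note that the pretty-printed omp.wsloop form below was still in review at the time, so the exact syntax may differ:

```mlir
// Hypothetical result of converting an scf.parallel loop. omp.parallel forks
// a team of threads, and omp.wsloop divides the loop iterations among them.
omp.parallel {
  omp.wsloop (%iv) : index = (%lb) to (%ub) step (%step) {
    // ... loop body ...
    omp.yield
  }
  omp.terminator
}
```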

In the Ecosystem

CIRCT: Circuit IR Compilers and Tools, aka ‘MLIR for hardware’

  • A pass was added to flatten FIRRTL bundle types, making it simpler for other CIRCT components to interface with FIRRTL; see the sketch after this list.
    • Notably, this opens up a path for some Handshake modules to be emitted as System Verilog that previously failed.
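
Conceptually, the pass rewrites each bundle-typed port into one port per field. The sketch below is illustrative only: the module and types are made up, and the FIRRTL syntax is approximate.

```mlir
// Illustrative only: module and types are made up, FIRRTL syntax approximate.
// Before flattening, a single bundle-typed port:
firrtl.module @Adder(%io: !firrtl.bundle<a: uint<8>, b: uint<8>>) {
  // ...
}
// After flattening, one scalar port per bundle field:
firrtl.module @Adder(%io_a: !firrtl.uint<8>, %io_b: !firrtl.uint<8>) {
  // ...
}
```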

mlir-npcomp: Prototype for compiling numpy programs

TensorFlow / MLIR-HLO

Recent Talks


You mentioned “Some sanity-check benchmarking using the current in-tree async lowering for parallel loops exhibits reasonable speedups over sequential sparse code” in the Optimizations and Code Generation section. Could you please explain what “in-tree async lowering” is, and how to do it?

Thank you so much!

You can see an example here:

And here is a test with parallel loop lowering to the async primitives:
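
In case the links don’t survive here, the following is a rough sketch of the shape of such a test. The pass names are the in-tree ones from that time, but the exact pipeline, RUN lines, and shared-library paths are approximations, not the actual test:

```mlir
// "with-async": rewrite scf.parallel into async tasks, then lower to LLVM.
// RUN: mlir-opt %s -async-parallel-for -convert-async-to-llvm \
// RUN:   -convert-scf-to-std -convert-std-to-llvm \
// RUN: | mlir-cpu-runner -e entry -entry-point-result=void \
// RUN:   -shared-libs=libmlir_async_runtime.so,libmlir_runner_utils.so

// "no-async": lower the same scf.parallel to plain sequential loops instead.
// RUN: mlir-opt %s -convert-scf-to-std -convert-std-to-llvm \
// RUN: | mlir-cpu-runner -e entry -entry-point-result=void \
// RUN:   -shared-libs=libmlir_runner_utils.so

func @entry() {
  // ... allocate matrices, run an scf.parallel kernel, print the timings ...
  return
}
```

Comparing the execution time printed by the two RUN configurations is what gives the “with-async” vs. “no-async” numbers discussed below.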

Hi Mehdi,

I looked at the example you sent me and tried it. I ran the “with-async” pipeline (i.e. the 1st mlir-opt command in the MLIR code) and the “no-async” pipeline (i.e. the 2nd mlir-opt command) separately. The “with-async” execution time is 0.245081, while the “no-async” execution time is 0.238621. So “with-async” gives no performance improvement in this case, and actually a tiny performance reduction. I also tried the same MLIR code with all matrix sizes changed from 1024x1024 to 83334x83334: the “with-async” execution time is 4362.45, while the “no-async” execution time is 4940.47. So in this case async gives a small performance improvement. I am running on an Intel Xeon machine with 2 threads per core, 12 cores per socket, and 2 sockets in total.

Is it possible for you to tell me which example your group used to get the “reasonable speedups” with the async lowering? Is it in the GitHub repo? I really want to learn this!

Thank you so much!

The runtime implementation that is currently in-tree isn’t intended to showcase any performance at the moment: it is a fairly naive thread pool right now.
I don’t know if @ezhulenev has an example that would still fit and show some speedup there.

@rqtian I disabled threading in the async runtime because of the problems with dynamic library unloading. With a thread pool, that test shows about a ~3x speedup for 1024x1024 with 4 threads.

One option to fix it is to build mlir-cpu-runner with a statically linked runtime and bind the Async API symbols at runtime (example from TFRT).

Or figure out how to do proper dynamic library unloading that waits for all threads to stop before shutdown.

@rqtian I submitted ⚙ D94346 ([mlir] AsyncRuntime: use LLVM ThreadPool to run async tasks), which brings back parallel execution to async. On my desktop I see execution time 0.318219 vs 0.126553 in the microbench-linalg-async-parallel-for.mlir test.

Hi @ezhulenev,

Thank you so much for letting me know! I just tried microbench-linalg-async-parallel-for.mlir, and I get a performance improvement when I change scf.for into scf.parallel (no-async: 0.234836, with-async: 0.0692391). (P.S. If using scf.for, there’s no performance improvement.)
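
That is expected: the async lowering only rewrites scf.parallel loops, whose iterations are independent by construction, and leaves a plain (sequential) scf.for untouched. A minimal sketch of the distinction, with made-up bounds %c0, %n, %c1:

```mlir
// Left alone by the async lowering: scf.for is inherently sequential.
scf.for %i = %c0 to %n step %c1 {
  // ... body ...
}

// Rewritten into async tasks: scf.parallel iterations are independent
// by construction, so they are safe to distribute across threads.
scf.parallel (%i) = (%c0) to (%n) step (%c1) {
  // ... body ...
}
```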