MLIR News, 40th edition (8/7 - 8/20/2021)

See the previous published edition
Welcome to the fortieth issue of the MLIR (bi)Weekly, a newsletter covering developments in MLIR and related projects in the ecosystem. MLIR (bi)Weekly is brought to you by a collective effort of contributors; we welcome your contributions!

  • linalg.tiled_loop peeling pattern: out for review (⚙ D108270 [mlir][linalg] linalg.tiled_loop peeling)
  • scf.for loop peeling pattern: ~75% landed (see “stack” of revisions of D108270)
  • FlatAffineConstraints / FlatAffineValueConstraints discussion and refactorings have started landing.
  • Sparse compiler progress
    • Added an exhaustive sparse testing example through the Python API to OSS.
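For readers unfamiliar with loop peeling (the pattern in D108270): it splits a loop whose trip count is not a multiple of the step into a main loop that executes only full steps and a single peeled iteration for the remainder, so the main body can be compiled without per-iteration bounds checks. A minimal illustrative sketch (plain Python, not the MLIR implementation):

```python
def peeled_loop(lb, ub, step, body):
    """Run body(i, n) over [lb, ub) in chunks of `step`, peeling the tail."""
    # Largest bound reachable with full steps from lb.
    split = lb + ((ub - lb) // step) * step
    # Main loop: every iteration processes exactly `step` elements.
    for i in range(lb, split, step):
        body(i, step)
    # Peeled epilogue: at most one partial iteration.
    if split < ub:
        body(split, ub - split)

# Example: cover a length-10 range with step 4.
chunks = []
peeled_loop(0, 10, 4, lambda i, n: chunks.append((i, n)))
# chunks is [(0, 4), (4, 4), (8, 2)]: two full steps plus one peeled tail.
```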


  • The spv.PtrAccessChain and spv.InBoundsPtrAccessChain ops are defined.
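For context, spv.PtrAccessChain differs from spv.AccessChain in that its first index advances over whole elements of the pointee type, like C pointer arithmetic; the InBounds variant additionally asserts that the result stays within the accessed object. A rough sketch of the address arithmetic (the function, sizes, and offsets below are illustrative, not part of the SPIR-V dialect API):

```python
def ptr_access_chain(base_addr, elem_size, element_index, member_offsets):
    """Sketch of spv.PtrAccessChain-style address computation."""
    # The leading index steps over whole elements of the pointee type,
    # like `ptr + i` in C; spv.AccessChain has no such leading index.
    addr = base_addr + element_index * elem_size
    # Remaining indices select into the composite (struct/array members),
    # modeled here as precomputed byte offsets.
    for off in member_offsets:
        addr += off
    return addr

# Example: array of 16-byte structs; element 3, member at byte offset 8.
result = ptr_access_chain(0x1000, 16, 3, [8])
```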

In the Ecosystem

IREE: An Experimental MLIR Execution Environment

  • The SPIR-V backend moved to dynamic pass pipelines (similar to the CPU and CUDA backends). This will allow the SPIR-V backend to also support code generation for the scatter and FFT operations that are currently supported on other backends.
  • The initial path to lower matmuls using MMT4d operations landed. Further study of the mmt4d path on the ARM CPU backend shows that the cost of packing into the tiled layout is currently too high. Fixing this requires lowering the linalg.generic op that performs the packing through the vector dialect and ensuring the generated code is efficient (the generic-op lowering currently relies on LLVM auto-vectorization).
  • CUDA backend:
    • Basic configuration improvements bring MobileBert training time down to 13 ms (the goal is to match TF, which runs in 9 ms)
    • Added a workaround for a bug (48771 – [NVPTX] Miscompilation in trivial fixed-stride loop) in the CUDA 11.2 driver
    • Enabled full-model function tests for MobileNet and MobileBert
    • Looking at solutions for a performance regression caused by fusion improvements that duplicate expensive ops.
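As background on the mmt4d path mentioned above: the packing step rearranges a plain 2-D matrix into a 4-D tiled layout so the inner kernel multiplies small contiguous tiles. The sketch below (tile sizes and helper names are made up for illustration; this is not IREE's implementation) shows packing plus a tiled matmul that reproduces a plain matmul:

```python
import numpy as np

def pack(a, tm, tn):
    """Pack an (M, N) matrix into (M//tm, N//tn, tm, tn) tiles."""
    m, n = a.shape
    assert m % tm == 0 and n % tn == 0  # assume already padded to tiles
    return a.reshape(m // tm, tm, n // tn, tn).transpose(0, 2, 1, 3)

def mmt4d(lhs4, rhs4):
    # lhs4: (M1, K1, m0, k0); rhs4: (N1, K1, n0, k0), RHS packed from the
    # transposed matrix so each innermost product reads contiguous tiles.
    return np.einsum('mkab,nkcb->mnac', lhs4, rhs4)

# Check against a plain matmul (2x2 tiles, purely illustrative).
A = np.arange(16.0).reshape(4, 4)
B = np.arange(16.0).reshape(4, 4)
lhs = pack(A, 2, 2)                   # (2, 2, 2, 2)
rhs = pack(B.T, 2, 2)                 # pack the transposed RHS
C4 = mmt4d(lhs, rhs)                  # tiled (M1, N1, m0, n0) result
C = C4.transpose(0, 2, 1, 3).reshape(4, 4)   # unpack back to (M, N)
```

The packing cost discussed in the bullet above is exactly the `pack` step: a pure data movement that must itself be vectorized well or it eats the gains of the tiled inner kernel.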

TensorFlow / MLIR-HLO

Kernel Generator:

  • JIT mode has been implemented and final performance tuning is under way. Each jitted kernel adds only a few (<6) KB of binary size, which will allow us to support more data types on GPU.


  • TFRT supports symbolic shapes and will recompile kernels for unique shape constraints.
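As an illustration of recompiling per shape constraint (the class and encoding below are hypothetical, not TFRT's API): concrete dynamic dimensions are collapsed to a symbol, so all shapes satisfying the same symbolic constraint reuse one compiled kernel:

```python
def symbolic_shape(shape):
    # Hypothetical encoding: non-negative dims are static, -1 marks a
    # dynamic dim; dynamic dims collapse to a wildcard symbol so every
    # shape under the same constraint maps to the same cache key.
    return tuple(d if d >= 0 else '?' for d in shape)

class KernelCache:
    """Compile once per symbolic shape constraint, then reuse."""
    def __init__(self, compile_fn):
        self.compile_fn = compile_fn
        self.cache = {}

    def get(self, shape):
        key = symbolic_shape(shape)
        if key not in self.cache:
            self.cache[key] = self.compile_fn(key)  # recompile per constraint
        return self.cache[key]

cache = KernelCache(lambda key: f"kernel{key}")
cache.get((8, -1))   # compiles a kernel specialized for (8, '?')
cache.get((8, -1))   # same constraint: cache hit, no recompilation
```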