MLIR News, 28th edition (2/20 - 3/5/2021)

See the previous published edition.
Welcome to the twenty-eighth issue of the MLIR (bi)Weekly, a newsletter covering developments in MLIR and related projects in the ecosystem. MLIR (bi)Weekly is brought to you by a collective effort of contributors; we welcome your contributions!




Table-driven Infrastructure


  • The Vector dialect now has vector.load and ops. These operations model contiguous vector loads and stores from/to memory and allow representing vector loads and stores that couldn’t be represented using std.load and ops. They will facilitate the progressive lowering of both Affine vector loads/stores and Vector transfer reads/writes. D96185 [mlir][Vector] Introduce ‘vector.load’ and ‘’ ops
  • Generalized vector.scatter/gather operations to higher dimensions
    • Unifies their form with masked load/store and compress/expand, simplifies vectorization from scalar memref to SIMD memref
  • Rewrite all-true/all-false vector masked-load/store/expand/compress directly into loads/stores, so it is no longer necessary to go through transfer operations first
  • Various improvements and bug fixes to linalg on tensors in support of the sandbox that has landed in IREE, including:
    • Fix order of dimensions in hoistPaddingOnTensors.
    • Canonicalize scf.for last tensor iteration result.
    • Tighten the rules around folding TensorLoadOp.
    • Add folding of vector transfers from/into tensor producing ops.
    • Add folding of linalg.copy that are in fact identities.
  • TOSA gained more lowerings: to std/scf for control flow, and to Linalg (transpose, reshape, and identity)
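
To illustrate the all-true/all-false mask rewrite mentioned in the list above, here is a minimal Python sketch of the idea. It is not MLIR code and not the actual pattern implementation: the function names (masked_load, contiguous_load, fold_masked_load) are hypothetical, memory is modeled as a plain Python list, and the mask is assumed to be a known constant.

```python
def masked_load(memory, base, mask, pass_thru):
    """Reference semantics (assumed model): lane i reads memory[base + i]
    when mask[i] is set, otherwise it keeps pass_thru[i]."""
    return [
        memory[base + i] if m else p
        for i, (m, p) in enumerate(zip(mask, pass_thru))
    ]


def contiguous_load(memory, base, width):
    """Plain contiguous load of `width` lanes starting at `base`."""
    return memory[base:base + width]


def fold_masked_load(memory, base, mask, pass_thru):
    """Hypothetical rewrite: pick a cheaper form when the mask is constant."""
    if all(mask):
        # Every lane is enabled: the mask is redundant, use a plain load.
        return contiguous_load(memory, base, len(mask))
    if not any(mask):
        # No lane is enabled: memory is never touched, the result is
        # just the pass-through value.
        return list(pass_thru)
    # Mixed mask: keep the masked form.
    return masked_load(memory, base, mask, pass_thru)


memory = list(range(10, 20))
all_true = fold_masked_load(memory, 2, [True] * 4, [0] * 4)
all_false = fold_masked_load(memory, 2, [False] * 4, [7] * 4)
mixed = fold_masked_load(memory, 2, [True, False, True, False], [-1] * 4)
```

In each case the folded result matches what the masked form would have produced; the fold only removes the mask handling when it cannot matter.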


  • A few patches landed that rename existing SPIR-V ops to improve naming consistency; more are in review.


In the Ecosystem

Flang, the LLVM Fortran Compiler

IREE : An Experimental MLIR Execution Environment

  • Landed HAL and compiler support for CUDA and enabled the first CUDA E2E tests for elementwise ops.
  • A functional linalg-on-tensor sandbox has landed in iree/experimental. It will serve as a staging area for bridging the gap between abstractions needed by IREE and core.
    • Uses a custom single-shot bufferization with in-place semantics to try to alleviate some of the difficulties raised in the post about bufferization.

mlir-npcomp: Prototype for compiling numpy programs

TensorFlow / MLIR-HLO

XLA GPU CodeGen:

  • Implementing whole graph LMHLO lowering (all WIP):
    • Disable multi-stream support
    • Either migrate or turn off --xla_hlo_profile
    • Migrate Conditional and While ops

GPU Kernel CodeGen

  • We have now launched all unary kernels from our initial launch set. Further kernel launches will require better rank specialization logic, which is planned but not currently being worked on.
  • We have further optimized our broadcasting code, introducing the new chlo.minimum_broadcast_shapes operation. It is used as a pre-step for binary operations to reduce the number of dimensions to a minimum and hence the iteration space for the compute loop. This is particularly useful on GPU, where we have to recompute the iteration space from a 1d index.
  • For binary operations, we have started launching operations but have encountered some obstacles in testing. Tests commonly pattern-match exact error messages, and the ones produced by the kernel generator are slightly different. We will need to update the tests first, which is under way.
  • The team is now looking into better fusion in the presence of broadcasts and performance improvements on CPU. Another focus is moving vectorization to the MLIR infrastructure instead of using LLVM’s vectorizer.
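
The dimension-reduction idea behind chlo.minimum_broadcast_shapes, described in the list above, can be sketched in Python. This is only an illustration under stated assumptions, not the op's actual algorithm or exact result convention: the function names are hypothetical, shapes are static tuples of positive extents, and adjacent dimensions are merged whenever every operand broadcasts (or does not broadcast) in the same way across them.

```python
def minimum_broadcast_shapes(shapes):
    """Collapse broadcast-compatible shapes to a lower common rank.

    Sketch only: drops dimensions that are 1 in every operand and merges
    neighbouring dimensions that share the same broadcast pattern.
    """
    rank = max(len(s) for s in shapes)
    # Right-align the shapes, padding with leading 1s (NumPy-style).
    padded = [(1,) * (rank - len(s)) + tuple(s) for s in shapes]
    reduced = [[] for _ in shapes]
    prev_pattern = None
    for dims in zip(*padded):
        extent = max(dims)
        if any(d not in (1, extent) for d in dims):
            raise ValueError(f"shapes are not broadcast-compatible: {shapes}")
        if extent == 1:
            continue  # size 1 everywhere: contributes nothing
        # Which operands really span this dimension (True) vs. broadcast (False).
        pattern = tuple(d != 1 for d in dims)
        if pattern == prev_pattern:
            # Same pattern as the previous kept dimension: fold into it.
            for out, d in zip(reduced, dims):
                out[-1] *= d
        else:
            for out, d in zip(reduced, dims):
                out.append(d)
            prev_pattern = pattern
    return [tuple(r) for r in reduced]


def delinearize(index, shape):
    """Recover multi-dimensional coordinates from a flat 1-D index."""
    coords = []
    for extent in reversed(shape):
        coords.append(index % extent)
        index //= extent
    return tuple(reversed(coords))


same = minimum_broadcast_shapes([(2, 3, 4), (2, 3, 4)])
bias = minimum_broadcast_shapes([(4, 5, 6), (6,)])
crossed = minimum_broadcast_shapes([(3, 1, 5), (1, 4, 5)])
coords = delinearize(7, bias[0])
```

With fewer dimensions, delinearize performs fewer divisions and remainders per element, which is the cost the bullet refers to when the iteration space has to be recomputed from a 1-D index on GPU.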

TFRT: A New TensorFlow Runtime

Auto-Fusion for TFRT

CIRCT : Circuit IR Compilers and Tools aka ‘MLIR for hardware’

Recent Talks

Open MLIR Meeting

Fifth LLVM Performance Workshop at CGO

Slides are online for most talks; here are the MLIR ones:

Recent Publications

This paper introduces HIR, an MLIR-based intermediate representation (IR) to describe hardware accelerator designs. HIR combines high level language features, such as loops and multi-dimensional tensors, with programmer defined explicit scheduling, to provide a high-level IR suitable for DSL compiler pipelines without compromising control over the micro-architecture of the accelerator. HIR’s explicit schedules allow it to express fine-grained, synchronization-free parallelism and optimizations such as retiming and pipelining. Built as a dialect in MLIR, it draws from best IR practices learnt from communities like those of LLVM. While offering rich optimization opportunities and a high level abstraction, HIR enables sharing of optimizations, utilities and passes with software compiler infrastructure.
Our implementation shows that the code generation time of the HIR code generator is on average 1112x lower than that of Xilinx Vivado HLS on a range of kernels without a compromise on the quality of the generated hardware. We believe that these are significant steps forward in the design of IRs for hardware synthesis and in equipping domain-specific languages with a productive and performing compilation path to custom hardware acceleration.