MLIR News, 42nd edition (9/4 - 9/20/2021)

Work in progress: this is a wiki post, everyone is welcome to modify it directly

Please update with work done between 9/3 and 9/20; you can update it along the way (don’t wait until the end date to add entries here: add them as the work lands).

See the previous published edition
Welcome to the forty-second issue of the MLIR (bi)Weekly, a newsletter covering developments in MLIR and related projects in the ecosystem. MLIR (bi)Weekly is brought to you by a collective effort of contributors; we welcome your contributions!

MLIR Core

Infrastructure

DRR

Codegen

  • Sparse compiler progress:
    • Added support for general affine subscripts (dense tensors only at the moment)
    • Implemented cast operations (int/fp, int/int, fp/fp) within sparse linalg ops
    • Improved sparse tensor convert with folding
    • Started “sparse kernel” collection: matmul, convolution, quantized matmul, etc.
  • Conversion pipelines targeting the LLVM dialect must now run the -reconcile-unrealized-casts pass at the end, instead of (or in addition to) -convert-std-to-llvm, to remove leftover casts and surface incomplete partial conversions (see the sketch after this list).
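
The practical effect on downstream pipelines is sketched below. This is a minimal illustration only, assuming the in-tree C++ factory functions of this period (createLowerToLLVMPass for -convert-std-to-llvm and createReconcileUnrealizedCastsPass for -reconcile-unrealized-casts); header paths may differ in your checkout.

```cpp
// Minimal sketch of a conversion pipeline targeting the LLVM dialect.
// Header paths and factory names follow the in-tree conventions of this
// period and may need adjusting.
#include "mlir/Conversion/ReconcileUnrealizedCasts/ReconcileUnrealizedCasts.h"
#include "mlir/Conversion/StandardToLLVM/ConvertStandardToLLVMPass.h"
#include "mlir/IR/BuiltinOps.h"
#include "mlir/Pass/PassManager.h"

using namespace mlir;

static LogicalResult lowerToLLVMDialect(ModuleOp module) {
  PassManager pm(module.getContext());
  // Partial conversion: may leave unrealized_conversion_cast ops behind.
  pm.addPass(createLowerToLLVMPass());              // -convert-std-to-llvm
  // Fold away cast pairs that cancel out and report the remaining ones,
  // which indicate an incomplete partial conversion.
  pm.addPass(createReconcileUnrealizedCastsPass()); // -reconcile-unrealized-casts
  return pm.run(module);
}
```

The equivalent mlir-opt invocation simply appends the pass: -convert-std-to-llvm -reconcile-unrealized-casts.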

In the Ecosystem

IREE: An Experimental MLIR Execution Environment

  • The CPU backend has a new pipeline that runs a Tensors → Vectors pass and bufferizes late. Matmul- and batch-matmul-based codegen now goes through this path.
  • FFT is now tiled and distributed by default. This helps remove FFTs as a bottleneck on the SPIR-V/CUDA backends, since they are no longer completely serialized.

Recent Talks

Recent Publications

Most compilers have a single core intermediate representation (IR) (e.g., LLVM) sometimes complemented with vaguely defined IR-like data structures. This IR is commonly low-level and close to machine instructions. As a result, optimizations relying on domain-specific information are either not possible or require complex analysis to recover the missing information. In contrast, multi-level rewriting instantiates a hierarchy of dialects (IRs), lowers programs level-by-level, and performs code transformations at the most suitable level. We demonstrate the effectiveness of this approach for the weather and climate domain. In particular, we develop a prototype compiler and design stencil- and GPU-specific dialects based on a set of newly introduced design principles. We find that two domain-specific optimizations (500 lines of code) realized on top of LLVM’s extensible MLIR compiler infrastructure suffice to outperform state-of-the-art solutions. In essence, multi-level rewriting promises to herald the age of specialized compilers composed from domain- and target-specific dialects implemented on top of a shared infrastructure.

To meet the extreme compute demands for deep learning across commercial and scientific applications, dataflow accelerators are becoming increasingly popular. While these “domain-specific” accelerators are not fully programmable like CPUs and GPUs, they retain varying levels of flexibility with respect to data orchestration, i.e., dataflow and tiling optimizations to enhance efficiency. There are several challenges when designing new algorithms and mapping approaches to execute the algorithms for a target problem on new hardware. Previous works have addressed these challenges individually. To address these challenges as a whole, in this work, we present a HW-SW co-design ecosystem for spatial accelerators called Union within the popular MLIR compiler infrastructure. Our framework allows exploring different algorithms and their mappings on several accelerator cost models. Union also includes a plug-and-play library of accelerator cost models and mappers which can easily be extended. The algorithms and accelerator cost models are connected via a novel mapping abstraction that captures the map space of spatial accelerators which can be systematically pruned based on constraints from the hardware, workload, and mapper. We demonstrate the value of Union for the community with several case studies which examine offloading different tensor operations (CONV/GEMM/Tensor Contraction) on diverse accelerator architectures using different mapping schemes.