See the previous published edition.
Welcome to the twenty-eighth issue of the MLIR (bi)Weekly, a newsletter covering developments in MLIR and related projects in the ecosystem. MLIR (bi)Weekly is brought to you by a collective effort of contributors; we welcome your contributions!
Highlights
- An evolution of the `tensor` MLIR type is under discussion in the RFC on sparse tensor type as first-class citizen.
- The Python bindings now make it possible to JIT and invoke MLIR code; the interface is still very primitive but will evolve! See the sketch after this list.
- A prototype DSL for authoring Linalg operations is in development in-tree; see the RFC: Linalg OpDSL.
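As a rough illustration of the kind of IR that can now be JIT-compiled and invoked through the bindings (a minimal sketch: the function is purely illustrative, and the still-evolving Python-side API is not shown):

```mlir
// A trivial function of the sort one could compile and invoke through
// the new JIT support in the Python bindings (illustrative only; a real
// entry point would be lowered to the LLVM dialect before execution).
func @add(%a: f32, %b: f32) -> f32 {
  %sum = addf %a, %b : f32
  return %sum : f32
}
```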
MLIR Core
Infrastructure
- The proposal discussed in [RFC] Debug Actions in MLIR: Debug Counters for the Modern World has landed!
Table-driven Infrastructure
- Attributes can now be generated using ODS (using similar mechanisms to Types).
Codegen
- The Vector dialect now has `vector.load` and `vector.store` ops. These operations model contiguous vector loads and stores from/to memory and allow representing vector loads and stores that couldn’t be represented using `std.load` and `std.store` ops. They will facilitate the progressive lowering of both Affine vector loads/stores and Vector transfer reads/writes; see the sketch after this list. D96185 [mlir][Vector] Introduce ‘vector.load’ and ‘vector.store’ ops (llvm.org)
- Generalized vector.scatter/gather operations to higher dimensions
- Unifies their form with masked load/store and compress/expand, and simplifies vectorization from a scalar memref to a SIMD memref
- Rewrites all-true/all-false vector masked load/store/expand/compress ops directly into plain loads/stores, so it is no longer necessary to go through transfer operations first
- Various improvements and bug fixes to linalg on tensors in support of the sandbox that has landed in IREE, including:
- Fix order of dimensions in hoistPaddingOnTensors.
- Canonicalize scf.for last tensor iteration result.
- Tighten the rules around folding TensorLoadOp.
- Add folding of vector transfers from/into tensor producing ops.
- Add folding of linalg.copy that are in fact identities.
- TOSA gained more lowerings: to std/scf for control flow, and to Linalg for transpose, reshape, and identity
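For reference, a minimal sketch of the new vector load/store ops (the function, shapes, and values are illustrative, not taken from the patch):

```mlir
// Contiguous vector load/store on a 2-D memref (illustrative).
func @vector_copy(%base: memref<100x100xf32>, %i: index, %j: index) {
  // Load 8 contiguous f32 elements starting at %base[%i, %j].
  %v = vector.load %base[%i, %j] : memref<100x100xf32>, vector<8xf32>
  // Store the same 8 elements back to the same location.
  vector.store %v, %base[%i, %j] : memref<100x100xf32>, vector<8xf32>
  return
}
```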
SPIR-V
- A few patches landed renaming some existing SPIR-V ops to improve naming consistency; more are in review.
Other
- The LLVM dialect now has support for memory access group metadata and LoopOptions metadata
In the Ecosystem
Flang, the LLVM Fortran Compiler
IREE : An Experimental MLIR Execution Environment
- Landed HAL and compiler support for CUDA and enabled the first CUDA E2E tests for element-wise ops.
- A functional linalg-on-tensor sandbox has landed in iree/experimental. It will serve as a staging area for bridging the gap between abstractions needed by IREE and core.
- It uses a custom single-shot bufferization with in-place semantics to try to alleviate some of the difficulties raised in the post about bufferization.
mlir-npcomp: Prototype for compiling numpy programs
- GlobalizeObjectGraph transformation to convert object graphs to a more easily analyzable form
- Added the ability to annotate TorchScript classes: currently used for public/private annotations, but eventually this will be important for providing other information to the compiler.
- Properly model “derefinement”, an impedance mismatch between TorchScript and MLIR.
- BERT imports and lowers; ResNet is in progress.
TensorFlow / MLIR-HLO
XLA GPU CodeGen:
- Implementing whole-graph LMHLO lowering (all WIP):
- Disable multi-stream support
- Either migrate or turn off `--xla_hlo_profile`
- Migrate Conditional and While ops
GPU Kernel CodeGen
- We have now launched all unary kernels from our initial launch set. Further kernel launches will require better rank specialization logic, which is planned but not currently being worked on.
- We have further optimized our broadcasting code, introducing the new `chlo.minimum_broadcast_shapes` operation. It is used as a pre-step for binary operations to reduce the number of dimensions to a minimum, and hence the iteration space for the compute loop. This is particularly useful on GPU, where we have to recompute the iteration space from a 1-D index; see the sketch after this list.
- For binary operations, we have started launching them but have encountered some obstacles in testing: tests commonly pattern-match exact error messages, and the ones produced by the kernel generator are slightly different. We will need to update the tests first, which is under way.
- The team is now looking into better fusion in the presence of broadcasts and performance improvements on CPU. Another focus is moving vectorization to the MLIR infrastructure instead of using LLVM’s vectorizer.
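As a hypothetical sketch of how the new operation is used (the op name comes from the item above, but the exact assembly syntax and types shown here are assumptions): given the extent tensors of two broadcast-compatible operands, it returns equivalent shapes of minimal rank, collapsing adjacent dimensions that broadcast the same way:

```mlir
// Hypothetical usage sketch: reduce two broadcast-compatible shapes,
// e.g. [5, 6, 7] and [6, 7], to minimal-rank equivalents such as
// [5, 42] and [42], so the binary op iterates over a smaller space.
// The exact assembly format of the op may differ from this sketch.
%lhs_min, %rhs_min = chlo.minimum_broadcast_shapes %lhs, %rhs
    : tensor<?xindex>, tensor<?xindex> -> tensor<?xindex>, tensor<?xindex>
```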
TFRT: A New TensorFlow Runtime
Auto-Fusion for TFRT
- Math function approximations (which also work for vectors) landed in MLIR.
- The JIT runtime supports functions returning memrefs without wrapping them in `async.value`; see the sketch below.
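Illustratively, the difference shows up in the compiled function’s signature (a sketch under assumed types; these are not the actual runtime interfaces):

```mlir
// Before: results had to be wrapped in an async value (illustrative).
func private @compute_wrapped(memref<?xf32>) -> !async.value<memref<?xf32>>

// Now: a JIT-compiled function can return the memref directly.
func private @compute(memref<?xf32>) -> memref<?xf32>
```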
CIRCT : Circuit IR Compilers and Tools aka ‘MLIR for hardware’
- ESI’s co-simulation feature is feature-complete for v1. [Demonstration PR]
Recent Talks
Open MLIR Meeting
Fifth LLVM Performance Workshop at CGO
Slides are online for most talks, here are the MLIR ones:
- Moving LLVM’s code generator to MLIR framework
- Classical Loop Nest Transformation Framework on MLIR
- LTO and Data Layout Optimisations in MLIR
- COMET: Domain Specific Compilation for Heterogeneous Targets
Recent Publications
This paper introduces HIR, an MLIR-based intermediate representation (IR) to describe hardware accelerator designs. HIR combines high level language features, such as loops and multi-dimensional tensors, with programmer defined explicit scheduling, to provide a high-level IR suitable for DSL compiler pipelines without compromising control over the micro-architecture of the accelerator. HIR’s explicit schedules allow it to express fine-grained, synchronization-free parallelism and optimizations such as retiming and pipelining. Built as a dialect in MLIR, it draws from best IR practices learnt from communities like those of LLVM. While offering rich optimization opportunities and a high level abstraction, HIR enables sharing of optimizations, utilities and passes with software compiler infrastructure.
Our implementation shows that the code generation time of the HIR code generator is on average 1112x lower than that of Xilinx Vivado HLS on a range of kernels without a compromise on the quality of the generated hardware. We believe that these are significant steps forward in the design of IRs for hardware synthesis and in equipping domain-specific languages with a productive and performing compilation path to custom hardware acceleration.