MLIR News, 39th edition (7/24 - 8/7/2021)

See the previous published edition.
Welcome to the thirty-ninth issue of the MLIR (bi)Weekly, a newsletter covering developments in MLIR and related projects in the ecosystem. MLIR (bi)Weekly is brought to you by a collective effort of contributors; we welcome your contributions!

Highlights

MLIR Core

Infrastructure

Codegen

  • Documentation improvements:
    • Added introduction sections to new dialects such as AMX and SparseTensor
  • Sparse compiler improvements:
    • Introduced a general “sparse_tensor.conversion” operation that concisely expresses arbitrary sparse tensor type conversions (dense to sparse, sparse to dense, and sparse to sparse) in the IR. Started the actual “lowering” of some of these conversions into runnable code.
  • scf.for loop peeling pattern: splits a loop into “full” iterations and, when needed, a partial iteration.
    • The peeling pattern has landed; the affine.min canonicalization pattern (including a more general replacement for AffineMinSCFCanonicalizationPattern) is out for review.
    • Depends on various changes to FlatAffineConstraints, which is now split into two classes (the base class has no “Value” association); also out for review.
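
As an illustration of the loop peeling idea above, here is a minimal Python sketch, not the MLIR pattern itself. The helper names `peel_loop` and `tiled_sum` are hypothetical. It assumes a positive step and computes a split point so that the main loop only ever sees full tiles, while a single trailing iteration handles whatever is left.

```python
def peel_loop(lb, ub, step):
    """Split the range [lb, ub) with stride `step` into a main range made of
    full iterations and a (possibly empty) partial tail range."""
    if step <= 0:
        raise ValueError("step must be positive")
    trip = max(ub - lb, 0)
    # Largest bound such that (split - lb) is a multiple of step.
    split = lb + (trip // step) * step
    return (lb, split), (split, ub)


def tiled_sum(data, tile):
    """Sum `data` tile by tile, using the peeled bounds."""
    (lb, split), (tail_lb, tail_ub) = peel_loop(0, len(data), tile)
    total = 0
    # Main loop: every iteration covers exactly `tile` elements,
    # so no min(tile, ub - i) computation is needed here.
    for i in range(lb, split, tile):
        total += sum(data[i:i + tile])
    # Partial iteration: runs at most once, over fewer than `tile` elements.
    if tail_lb < tail_ub:
        total += sum(data[tail_lb:tail_ub])
    return total
```

In the unpeeled form, each iteration would have to compute its tile size as the minimum of the step and the remaining distance to the upper bound. After peeling, that minimum is known to equal the step inside the main loop, which is the kind of simplification an affine.min canonicalization can then exploit.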

SPIR-V

  • Two boolean loading/storing issues were fixed in SPIR-V conversion.
  • A few issues in the SPIR-V module combiner were fixed.
  • MemRef/Math to SPIR-V conversions are split into their own directories and files.

In the Ecosystem

IREE : An Experimental MLIR Execution Environment

  • CUDA backend:
    • More progress on BERT training performance enhancements for CUDA backends. Current execution time stands at 18 ms per step (TensorFlow is at 10 ms for reference).
    • Einsums lowered to batched matmul
    • Enabling multi-level tile and distribute of scatter operations, using the TiledOpInterface in IREE; upstream in progress as TilingInterface (D106406, “[mlir] Add an interface to allow operations to specify how they can be tiled.”)
    • Adding support for efficient distribution of workgroup memory copies using Vector distribution framework from MLIR core
    • Software pipelining to overlap loads and computations in matmul kernels
    • Some simple barrier insertion elision
    • Rework the order of vector transformations to take advantage of the latest upstream improvements
  • Also some good progress on lowering FFT through the IREE compilation stack on ARM CPUs. A basic implementation of the Cooley-Tukey FFT algorithm with pre-computed twiddle factors reduces the execution time from 33 ms to 5 ms. More can be gained through vectorization.
  • Exploration of using linalg.mmt4d in the IREE compilation flow for ARM CPUs is making progress. This approach brings learnings from Ruy into IREE compilation. Stand-alone matmul achieves a 20% improvement over linalg.matmul operations, as expected, due to better cache utilization.
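
For readers unfamiliar with the twiddle-factor optimization mentioned in the FFT item above, the following is a minimal Python sketch of a radix-2 Cooley-Tukey FFT that computes the twiddle table once and reuses it at every recursion level. It is not IREE's implementation. The names `make_twiddles` and `fft` are hypothetical, and the input length is assumed to be a power of two.

```python
import cmath


def make_twiddles(n):
    """Precompute W_n^k = exp(-2*pi*i*k/n) for k in [0, n/2)."""
    return [cmath.exp(-2j * cmath.pi * k / n) for k in range(n // 2)]


def _fft(x, twiddles, stride):
    n = len(x)
    if n == 1:
        return [complex(x[0])]
    # Recurse on even- and odd-indexed samples. The sub-problem of size n/2
    # needs every second twiddle of this level, hence the doubled stride.
    even = _fft(x[0::2], twiddles, stride * 2)
    odd = _fft(x[1::2], twiddles, stride * 2)
    half = n // 2
    out = [0j] * n
    for k in range(half):
        # Table lookup instead of recomputing exp() in the butterfly.
        t = twiddles[k * stride] * odd[k]
        out[k] = even[k] + t
        out[k + half] = even[k] - t
    return out


def fft(x, twiddles=None):
    """Radix-2 decimation-in-time FFT; len(x) must be a power of two."""
    n = len(x)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a power of two")
    if twiddles is None:
        twiddles = make_twiddles(n)
    return _fft(x, twiddles, 1)
```

When many transforms of the same size are run, the table returned by `make_twiddles` can be built once and passed to each `fft` call, so the trigonometric evaluation is paid only a single time.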

mlir-npcomp: Prototype for compiling numerical python programs

  • Massive build rework by Stella.
  • MILESTONE: ResNet runs end-to-end for the first time.
    • RefBackend works (even with dynamic shapes).
    • IREE is currently limited to static shapes, but this comes down to a known issue that is actively being fixed (IREE #6629).
  • Added a full machine translation model to the curriculum, paving the way for other tests with heavy dependencies (PR).

TensorFlow / MLIR-HLO

A new “TensorFlow Graph IR” dialect has been implemented. It uses a graph region to implement the dataflow model of TensorFlow and aims to offer perfect compatibility with TensorFlow GraphDef, addressing some impedance mismatch issues between GraphDef and the TensorFlow dialect.

TFRT JIT

  • Prototype of code generation for reduction kernels, e.g. tf.Sum and tf.Prod. Once the infrastructure has landed, we will need to invest more time in performance tuning.
  • We are working on enhanced shape analyses to enable more fusions in the presence of reshape and broadcasting patterns.

mlir-hs

  • Early stage work on Haskell bindings building on top of C API

Recent Publications

  • ScaleHLS: Scalable High-Level Synthesis through MLIR

While existing HLS tools are built using compiler infrastructures largely based on a single-level abstraction (e.g., LLVM), we propose ScaleHLS, a next-generation HLS compilation flow, on top of a multi-level compiler infrastructure called MLIR, for the first time. By using an intermediate representation (IR) that can be better tuned to particular algorithms at different representation levels, we are able to build this new HLS tool that is more scalable and customizable towards various applications coming with intrinsic structural or functional hierarchies. ScaleHLS is able to represent and optimize HLS designs at multiple levels of abstraction and provides an HLS-dedicated transform and analysis library to solve the optimization problems at the suitable representation levels. On top of the library, we also build an automated DSE engine to explore the multi-dimensional design space efficiently. In addition, we develop an HLS C front-end and a C/C++ emission back-end to translate HLS designs into/from MLIR for enabling the end-to-end ScaleHLS flow. Experimental results show that, comparing to the baseline designs only optimized by Xilinx Vivado HLS, ScaleHLS improves the performances with amazing quality-of-results – up to 768.1x better on computation kernel level programs and up to 3825.0x better on neural network models.
