MLIR News, 37th edition (6/26 - 7/9/2021)

See the previous published edition.
Welcome to the thirty-seventh issue of the MLIR (bi)Weekly, a newsletter covering developments in MLIR and related projects in the ecosystem. MLIR (bi)Weekly is brought to you by a collective effort of contributors; we welcome your contributions!

MLIR Core

Infrastructure

  • The MLIRContext now allows users to inject a thread pool. This is useful for sharing a thread pool between different contexts, or for reusing one while destroying and re-creating a context, so that the threads do not have to be torn down and recreated each time.
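
A minimal sketch of how the injection hook can be used (API names as introduced by this change; treat them as assumptions and check your MLIR revision):

```cpp
// Sketch: keep one llvm::ThreadPool alive across several short-lived
// MLIRContexts instead of letting each context spawn its own threads.
#include "llvm/Support/ThreadPool.h"
#include "mlir/IR/MLIRContext.h"

void runManyCompilations() {
  llvm::ThreadPool pool;  // created once; its threads are reused below
  for (int i = 0; i < 100; ++i) {
    // Create the context without its own thread pool...
    mlir::MLIRContext context(mlir::MLIRContext::Threading::DISABLED);
    // ...and hand it the shared one instead.
    context.setThreadPool(pool);
    // compileSomething(context);  // hypothetical work using this context
  }
}
```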

Codegen

  • Sparse compiler progress:
    • Started support for sparse tensors as output (for now restricted to annotated all-dense outputs, or to cases where the nonzero structure does not change, e.g. a_ij = a_ij * 2; see the illustration after this list)
    • Generalized operations to include various other operators (unary abs/ceil/floor/neg; binary subtraction, bitwise ops, and restricted divisions)
    • Refactored merger into utility directory, added unit tests (work by Gus, our summer intern)
    • Presented progress of sparse support in MLIR to TACO team
  • Misc: released a fun “white paper” on using Prolog to find the best assembly sequence
  • Conversion to LLVM has been refactored:
    • The generic utilities are now exposed publicly (previously they were private to the standard-to-llvm conversion);
    • Separate passes were introduced for the memref-to-llvm and math-to-llvm conversions, which makes the standard-to-llvm conversion faster and more maintainable.
  • OpenMP loops now support reductions.
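
The restriction on sparse outputs mentioned above can be illustrated with plain CSR storage (generic C++, not the MLIR sparse compiler): an update like a_ij = a_ij * 2 rewrites only the stored values, so the sparsity pattern stays valid.

```cpp
// Illustration only: in-place scaling of a CSR matrix touches the values
// array but leaves the structure (row pointers, column indices) unchanged,
// which is why such outputs can be supported before general
// structure-changing updates.
#include <vector>

struct CSRMatrix {
  std::vector<int> rowPtr;     // size = numRows + 1
  std::vector<int> colInd;     // column index of each stored nonzero
  std::vector<double> values;  // stored nonzero values
};

void scaleInPlace(CSRMatrix &a, double factor) {
  for (double &v : a.values)
    v *= factor;  // a_ij = a_ij * factor; rowPtr and colInd are untouched
}
```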

SPIR-V

Other

  • Addition of warp (subgroup)-synchronous MMA ops in the GPU dialect. Lowerings to the NVVM dialect were also added.

In the Ecosystem

IREE: An Experimental MLIR Execution Environment

  • CUDA backend:
    • Enabled integration tests for matmul_f16 using wmma ops in iree-llvm-sandbox

mlir-npcomp: Prototype for compiling numerical Python programs

  • Generalize support for elementwise ops PR
  • Add support for IREE in TorchScript end-to-end tests PR

TensorFlow / MLIR-HLO

TFRT JIT

  • Further improvements in op coverage, including TensorFlow’s special reshape operation with support for auto-expanding dimensions of size -1.
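
For readers unfamiliar with the semantics: TensorFlow’s reshape allows at most one target dimension to be -1, meaning “infer this size from the element count and the other dimensions”. A small illustration (generic C++, not TFRT code):

```cpp
// Illustration only: resolve a reshape target shape containing a single -1.
#include <cassert>
#include <cstdint>
#include <vector>

std::vector<int64_t> inferReshape(int64_t numElements,
                                  std::vector<int64_t> shape) {
  int64_t known = 1;
  int inferredIdx = -1;
  for (int i = 0; i < (int)shape.size(); ++i) {
    if (shape[i] == -1)
      inferredIdx = i;        // at most one -1 is allowed
    else
      known *= shape[i];
  }
  if (inferredIdx >= 0) {
    assert(numElements % known == 0 && "known dims must divide element count");
    shape[inferredIdx] = numElements / known;
  }
  return shape;  // e.g. 12 elements with {3, -1} resolves to {3, 4}
}
```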

Kernel Generator

  • Support for unsigned integers is complete, and we have landed further MLIR-based kernels that use it.
  • We also ported more functionality to the complex dialect, enabling more kernels on complex numbers.
  • A first prototype of a kernel-gen JIT has been implemented. It generates MLIR-based kernels on demand, allowing broader dtype coverage without the binary-size implications of AOT-compiled kernels.
  • A use-range analysis for optimizing buffer allocations has landed in TensorFlow. A corresponding buffer reuse analysis is under review.

CIRCT: Circuit IR Compilers and Tools, aka ‘MLIR for hardware’

Recent Talks

Recent Publications

MAPS 2021: Predictive Data Locality Optimization for Higher-Order Tensor Computations

Automating locality optimization is still an open problem for compiler writers. Compiler-based approaches, guided by analytical cost models, have achieved some success in matching high performance libraries on a restricted set of computations such as general matrix multiply (GEMM). On the other hand, library-based approaches may present some open scalability concerns. Recent developments in convolutional neural networks have seen an explosion of models, each with differing combinations of parameters. Manually tuning each of these configurations can take many development months. Further, these operations are called multiple times during machine learning training, which necessitates highly optimized implementations. 2D convolutional operators are unique in that they consist of 7-deep loop nests with different loops carrying reuse for different tensors, making the problem of identifying an optimal loop ordering hard. We devise a machine learning-based compiler which learns a regression model, correlating performance with the loop order. We integrate this model with other traditional compiler analyses for transformations such as loop unrolling and vectorization, relying on the Multi Level Intermediate Representation (MLIR) compiler framework. We achieve an average speedup of 1.67× and 1.41× for 2D convolution forward and weight update kernels respectively. We are also at 0.88× and 0.96× the performance of oneDNN’s best performing implementation, which applies additional data layout transformations.
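
For reference, the 7-deep loop nest the abstract refers to looks like the following (an illustration, not code from the paper). Every permutation of the seven loops computes the same result, and each ordering favors reuse of a different tensor, which is the search space the learned model ranks.

```cpp
// Forward 2D convolution as a 7-deep loop nest: batch n, output channel k,
// input channel c, output position (h, w), filter position (r, s).
// Stride 1, no padding; assumes `output` is zero-initialized by the caller.
void conv2dForward(const float *input,   // [N][C][H + R - 1][W + S - 1]
                   const float *filter,  // [K][C][R][S]
                   float *output,        // [N][K][H][W]
                   int N, int K, int C, int H, int W, int R, int S) {
  const int IH = H + R - 1, IW = W + S - 1;
  for (int n = 0; n < N; ++n)
    for (int k = 0; k < K; ++k)
      for (int c = 0; c < C; ++c)
        for (int h = 0; h < H; ++h)
          for (int w = 0; w < W; ++w)
            for (int r = 0; r < R; ++r)
              for (int s = 0; s < S; ++s)
                output[((n * K + k) * H + h) * W + w] +=
                    input[((n * C + c) * IH + (h + r)) * IW + (w + s)] *
                    filter[((k * C + c) * R + r) * S + s];
}
```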