MLIR News, 13th edition (8/7/2020)

See the previous published edition.

Welcome to the thirteenth issue of the MLIR (bi)Weekly, a newsletter (published on Friday) covering developments in MLIR and related projects in the ecosystem. MLIR (bi)Weekly is brought to you by a collective effort of contributors; we welcome your contributions!

Highlights

  • CIRCT and MLIR-NPComp were both moved to the LLVM organization as LLVM incubator projects: contributions are welcome!
  • MLIR-based code generation (using the structured representation / Linalg) is starting to be used in production on CPU and GPU in TensorFlow (see the TensorFlow section below).

MLIR Core

Infrastructure

Table-driven Infrastructure

Shape Dialect

  • The dialect is split into safe and unsafe versions based on operand and result types; a sketch contrasting the two appears after this list. This is now in use in the TensorFlow Kernel Generator project (see the TensorFlow update below). As next steps, we will explore missing canonicalizations and holes in code generation.
  • Beyond code generation, we plan to focus more on exploratory work for shape inference, using npcomp as a testing ground. This effort is just starting.
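
As a rough illustration of the split, here is a hypothetical sketch contrasting the error-carrying “safe” types with the plain “unsafe” extent tensors. The ops are real shape dialect ops, but the snippet itself is illustrative, not code from the project:

```mlir
// "Safe": !shape.shape can also represent an error value at runtime,
// so a failed broadcast is still a well-defined value.
%s0 = shape.shape_of %a : tensor<?x?xf32> -> !shape.shape
%s1 = shape.shape_of %b : tensor<?x?xf32> -> !shape.shape
%s2 = shape.broadcast %s0, %s1 : !shape.shape, !shape.shape -> !shape.shape

// "Unsafe": a plain extent tensor, suitable for code generation once the
// computation is known to be error-free.
%e0 = shape.shape_of %a : tensor<?x?xf32> -> tensor<?xindex>
```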

Optimizations and Code Generation

  • LLVM dialect type redesign has been implemented and landed. Tooling is available to migrate tests to the new syntax.
  • The LLVM dialect no longer depends on LLVMContext and requires no explicit locking, making parallel MLIR-to-LLVM-IR translation and parallel compilation of the resulting modules a reality.
  • Work is ongoing to support convolution operations in Linalg; currently, convolutions can be lowered through SCF loops (a sketch of such a loop nest appears after this list).
  • We have started modernizing the implementation of the GPU dialect to catch up with MLIR infrastructure improvements. This is a preparation step for extending the GPU dialect to model the asynchronous behavior of some of its operations.
  • The async dialect proposal has seen more discussion and we will do a final round of consolidation before starting a first prototype.
  • We are discussing a better model for mapping computations to processors (like GPUs). A first prototype in the context of Linalg was implemented.
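
To make “lowered through SCF loops” concrete, here is a minimal hand-written sketch of a 1-D convolution as an SCF loop nest. It is written in present-day upstream syntax (the arith/memref/func op prefixes differ from the 2020-era standard dialect), and the function name and shapes are assumptions chosen for illustration:

```mlir
// Minimal 1-D convolution expressed directly as SCF loops (illustrative).
func.func @conv1d(%in: memref<?xf32>, %filter: memref<?xf32>,
                  %out: memref<?xf32>) {
  %c0 = arith.constant 0 : index
  %c1 = arith.constant 1 : index
  %n = memref.dim %out, %c0 : memref<?xf32>
  %k = memref.dim %filter, %c0 : memref<?xf32>
  // For each output element, accumulate over the filter window.
  scf.for %i = %c0 to %n step %c1 {
    scf.for %j = %c0 to %k step %c1 {
      %idx = arith.addi %i, %j : index
      %a = memref.load %in[%idx] : memref<?xf32>
      %w = memref.load %filter[%j] : memref<?xf32>
      %acc = memref.load %out[%i] : memref<?xf32>
      %p = arith.mulf %a, %w : f32
      %s = arith.addf %acc, %p : f32
      memref.store %s, %out[%i] : memref<?xf32>
    }
  }
  return
}
```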

CPU codegen

  • The CPU performance bugs identified by a previous case study (AVX512 codegen for the vector dialect) continue to be addressed. For example, by slightly changing the way vector.extract_strided_slice is lowered (see the first sketch after this list), the operation runs 3x faster for longer vectors. This also directly improved vector.shape_cast and vector.extract_slices.
    • Note that this nicely illustrates the advantage of progressive lowering: improvements to a core operation benefit all other operations that lower into it.
  • Vector.transfer operations can now be split between a fast unmasked path and a slower path with padding. This results in significant speedups on code with boundary conditions.
  • Improved lowering of matmul-like and matvec-like vector.contract to vector.reduce (via vector.transpose) has landed.
  • More useful operations are being identified and added to the vector dialect. Besides gather/scatter, expand-load/compress-store are now also supported, to be used in sparse computations; masked load/store are supported as well (sketches of these ops appear after this list). These make a potential progressive lowering of transfer-read/write a bit simpler in the future (currently the latter operations lower directly into LLVM IR). All new operations have accompanying integration tests and performance microbenchmarks on CPU.
  • See also XLA uses of vector codegen in the TensorFlow update below.
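
For reference, the first sketch shows the operation whose lowering was improved; the op syntax is the standard vector dialect form, with sizes chosen purely for illustration:

```mlir
// Extract elements 4..11 of a length-16 vector. The improved lowering makes
// this kind of extraction about 3x faster for longer vectors.
%slice = vector.extract_strided_slice %v
    {offsets = [4], sizes = [8], strides = [1]}
    : vector<16xf32> to vector<8xf32>
```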
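
The second sketch shows the newly added memory operations, written in current upstream syntax (the early forms of these ops differed slightly); the shapes and value names are illustrative assumptions:

```mlir
// Masked load: disabled lanes take the pass-through value.
%l = vector.maskedload %base[%i], %mask, %pass_thru
    : memref<?xf32>, vector<16xi1>, vector<16xf32> into vector<16xf32>

// Expand-load: reads enabled lanes from consecutive memory locations,
// useful for sparse computations.
%e = vector.expandload %base[%i], %mask, %pass_thru
    : memref<?xf32>, vector<16xi1>, vector<16xf32> into vector<16xf32>

// Compress-store: writes enabled lanes to consecutive memory locations.
vector.compressstore %base[%i], %mask, %val
    : memref<?xf32>, vector<16xi1>, vector<16xf32>
```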

SPIR-V

  • An initial commit to generate OpenCL-compliant SPIR-V code landed, allowing the “Kernel” capability in a SPIR-V module.
  • SPIR-V to LLVM conversion status: conversion patterns from spv.loop and spv.selection to the LLVM dialect, and from GLSL ops to LLVM intrinsics, have been added.
  • spv.loop and spv.selection gained support for loopControl and selectionControl attributes.
  • Support for lowering memrefs with vector element type to SPIR-V dialect landed. This allows translation of loads/stores and allocations of memrefs with vector element types.
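
As an illustration of the last item, the kind of function that can now be lowered performs loads/stores on a memref whose element type is a vector; this is a hypothetical example in current upstream syntax (the 2020 code used the standard dialect's load/store):

```mlir
// Hypothetical input now convertible to the SPIR-V dialect: loading a
// vector element out of a memref of vectors.
func.func @load_vec(%m: memref<4xvector<4xf32>>, %i: index) -> vector<4xf32> {
  %v = memref.load %m[%i] : memref<4xvector<4xf32>>
  return %v : vector<4xf32>
}
```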

Other

In the Ecosystem

Flang, the LLVM Fortran Compiler

IREE : An Experimental MLIR Execution Environment

  • Set up environments for testing IREE on Android mobile GPUs - a first step towards model benchmarking.
  • NPCOMP: the first runtime engine has been submitted; its binary size is under 6 KB. It can be a good option for inference on microcontrollers.

mlir-npcomp: Prototype for compiling numpy programs

TensorFlow

  • XLA:CPU now emits small matrix-matrix multiplies (k <= 128) through the Linalg and Vector dialects. This is enabled by default and running in production workloads. Performance-wise the change was neutral, but we now have much more room for improvement than with the previous custom matmul emitter. Work is underway to expand this strategy to more ops, such as matrix-vector multiplies and sum/max reductions.
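
As a point of reference, the op this path is built around looks as follows; this is a hand-written sketch in current Linalg syntax (the 2020 form differed), with shapes that are assumptions chosen to fit the k <= 128 regime:

```mlir
// A small static matmul (k = 128); XLA:CPU now lowers such multiplies
// through the Linalg and Vector dialects instead of a custom emitter.
linalg.matmul ins(%lhs, %rhs : memref<64x128xf32>, memref<128x32xf32>)
              outs(%acc : memref<64x32xf32>)
```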

  • XLA:GPU codegen migration to MLIR LHLO is progressing, with support for SortOp and many smaller fixes (propagation of debug names, fusion serialization, non-identity layouts).

  • The Kernel Generator project intends to generate TensorFlow kernels ahead of time (at TensorFlow build time), starting with NVIDIA GPUs. A talk is scheduled for next month at the public MLIR meeting.

    • The device code for Tanh and Abs is in production. We are working on setting up benchmarking for regression testing before enabling more kernels. This work leverages the MLIR-based TF/XLA bridge, as it relies on dynamic shapes.
    • Host-side code generation for unary kernels is working end-to-end, from a tf dialect input to execution on CPU with a mock TensorFlow runtime. Next steps are hooking up the GPU pieces and adding the glue code to integrate properly with the TensorFlow runtime.

CIRCT : Circuit IR Compilers and Tools aka ‘MLIR for hardware’

Recent Talks