See the previous published edition.
Welcome to the twelfth issue of the MLIR (bi)Weekly, a newsletter (published on Friday) covering developments in MLIR and related projects in the ecosystem. MLIR (bi)Weekly is brought to you by a collective effort of contributors; we welcome your contributions!
Highlights
The last open meeting gave the opportunity for a nice overview and discussion of the work on asynchronous execution modeling:
[RFC] New dialect for modelling asynchronous execution at a higher-level and [RFC] Async/Await dialect targeting LLVM coroutines (slides and recording).
MLIR Core
Infrastructure
- RegionKindInterface and support for Graph Regions without SSA-dominance. Along the way many concepts in the Language Reference were also clarified.
- The first commit in a large refactoring of DialectConversion type conversions has landed. This refactoring will ensure that patterns produce legal types, and that operands to conversion patterns have properly converted types.
- Types and attributes now support a "mutable" component. This is a precondition for supporting recursive types that require delayed initialization, in particular LLVM structure types.
- An RFC for "Debug Actions" was sent out. This framework would allow external entities to control various aspects of the compiler, and is intended to enable debug counters, optimization fuel, interactive compiler debugging, and more.
- DialectConversion now emits an error if an operation or value was marked as "erased", but has live users after the conversion process (it previously asserted).
Shape Dialect
- Restructured the shape dialect so that its operations can operate on shape-specific values or on standard values. The shape-specific types shape.shape and shape.size correspond to the standard types tensor<?xindex> and index, with the difference that they can hold error values. The shape operations can now operate on either form and ensure error propagation where it occurs.
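As a rough illustration of what this enables, here is a minimal sketch. It is hedged: it uses the shape dialect's current upstream assembly syntax, which may differ in detail from the form at the time, and the function and value names are made up for illustration.

```mlir
// Minimal sketch: the same operation can yield either the error-carrying
// shape-specific type or a plain standard type.
func.func @shapes(%arg : tensor<?x?xf32>) {
  // Shape-specific result: !shape.shape can also represent an error value.
  %s = shape.shape_of %arg : tensor<?x?xf32> -> !shape.shape
  // Standard result: an extent tensor holding the two dimension sizes
  // as index values.
  %e = shape.shape_of %arg : tensor<?x?xf32> -> tensor<2xindex>
  return
}
```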
Optimizations and Code Generation
- An RFC to clean up casting support in MLIR core was shared, with the goal of upstreaming the HLO cast operations. It has been tested with the lowering of the dynamic reshape cast operation to LLVM.
CPU codegen
- Started to explore what operations are missing in the Vector dialect to support sparse computations:
- The gather/scatter operations were added to the Vector dialect (see the sketch after this list)
- A reference lowering to LLVM intrinsics was added to run on CPU
- Subsequent improvements made sure this lowers to e.g. effective AVX512
- Currently benchmarking and debugging matrix-times-vector cases:
- Sparse SDOT, Sparse SAXPY, Sparse SAXPY (jagged diagonal)
- A few other useful primitives were identified, still TBD
- Continued the matmul study on AArch64
- Near-peak matmul performance for all matrices whose size is a multiple of tile_size, including small matrices.
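For a rough idea of what the new gather primitive looks like, here is a hedged sketch. It uses the Vector dialect's present upstream syntax, which differs slightly from the initial form; the function name, shapes, and values are made up for illustration.

```mlir
// Gather 16 f32 elements from %base at the offsets given in %indices;
// lanes whose mask bit is false take the value from %pass_thru instead.
func.func @sparse_gather(%base: memref<?xf32>, %indices: vector<16xi32>,
                         %mask: vector<16xi1>, %pass_thru: vector<16xf32>)
    -> vector<16xf32> {
  %c0 = arith.constant 0 : index
  %g = vector.gather %base[%c0][%indices], %mask, %pass_thru
      : memref<?xf32>, vector<16xi32>, vector<16xi1>, vector<16xf32>
        into vector<16xf32>
  return %g : vector<16xf32>
}
```

When lowered to the LLVM dialect, such a gather maps onto the llvm.masked.gather intrinsic, which the x86 backend can turn into AVX-512 gather instructions.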
SPIR-V
- Continued progress on the SPIR-V to LLVM conversion. A conversion manual has been set up to show how the different IR constructs are mapped and to track progress. New patterns were added for simple SPIR-V branches; more patches are coming to cover spv.selection, spv.loop, and the load/store memory operations.
- The standard signed remainder to SPIR-V conversion was fixed to emulate it with spv.UMod, because Vulkan does not support spv.SMod/spv.SRem with negative values (the signed remainder can be recovered from the unsigned remainder of the absolute values by reapplying the dividend's sign).
- A rationale section was added to the docs explaining why the SPIR-V conversion does not use a MemRefDescriptor the way the CPU codegen does.
- Continued work on supporting shared memory for matmul
Other
In the Ecosystem
IREE: Compiler/runtime for ML models focused on resource-constrained deployments
- Ops: mhlo.iota on SPIR-V, fixes to out-of-bounds access for pad
- Debugging enhancements (tensor trace op, additional tips)
- New CI pipelines: Cross-compile and test on Android aarch64-v8a, Run tests on Nvidia GPUs, Scheduled build of manylinux Python wheels (generic/mhlo compiler and runtime only; TensorFlow compiler wheel building is not yet running in Buildkite).
- Significant productionalization work to implement pipelines, fix issues and get them to green on Nvidia GPU, Android Mali GPU, and Android aarch64 CPU
- Ad-hoc tested (and fixed bugs) on Samsung S10
- Detailed codegen pipeline documentation. Includes roadmap notes with specific rationale and next steps with respect to convergence/upstreaming.
TensorFlow
- XLA-CPU: Added support for generating matrix-vector multiplies through Linalg. Performance results look positive; correctness issues with transposed inputs still need to be fixed.
- XLA-GPU: moving on with refactoring the backend to operate on the LMHLO dialect:
- Further cleanups on the XLA/GPU backend
- Not using XLA-based llvm_ir::AliasAnalysis anymore.
- Nested computations don't depend on BufferAssignment anymore.
- LHLO::SortOp ported and passing unit tests:
- LHLO ops should carry names for debugging.
- Nested FusionOp (de)serialization support for the current MHLO.
- Added tests with non-identity layouts.
CIRCT: Circuit IR Compilers and Tools, aka "MLIR for hardware"
- Conversion from standard to handshake was updated to generate a single-block region, to be compatible with the upstreamed version of Graph Regions
- An initial lowering from the handshake dialect into FIRRTL now exists
- The handshake dialect gained a simple buffer insertion pass to break potential combinational loops after RTL generation
- FIRRTL gained better support for bit extract operations
- The FIRRTL to RTL transformation is now a Pass rather than a legalization with patterns, because the type conversions were too messy
- LLHD gained a bunch of basic folding optimizations, particularly for variadic operations