See the previous published edition.
Welcome to the twelfth issue of the MLIR (bi)Weekly, a newsletter (published on Friday) covering developments in MLIR and related projects in the ecosystem. MLIR (bi)Weekly is brought to you by a collective effort of contributors; we welcome your contributions!
Highlights
The last open meeting gave the opportunity for a nice overview and discussion of the work on asynchronous execution modeling:
[RFC] New dialect for modelling asynchronous execution at a higher-level and [RFC] Async/Await dialect targeting LLVM coroutines (slides and recording).
MLIR Core
Infrastructure
- RegionKindInterface and support for Graph Regions without SSA-dominance. Along the way many concepts in the Language Reference were also clarified.
- The first commit in a large refactoring of DialectConversion type conversions has landed. This refactoring will ensure that patterns produce legal types, and that operands to conversion patterns have properly converted types.
- Types and attributes now support a "mutable" component. This is a precondition for supporting recursive types that require delayed initialization, in particular LLVM structure types.
- An RFC for "Debug Actions" was sent out. This framework would allow external entities to control various aspects of the compiler, and is intended to enable debug counters, optimization fuel, interactive compiler debugging, and more.
- DialectConversion now emits an error if an operation or value was marked as "erased", but has live users after the conversion process (it previously asserted).
Shape Dialect
- Restructured the shape dialect so that its operations can operate on shape-specific values or on standard values. The shape-specific types shape.shape and shape.size correspond to the standard types tensor<?xindex> and index, with the difference that they can hold error values. The shape operations can now operate on either form and ensure error propagation where it occurs.
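As a rough illustration of what this enables, here is a minimal sketch. It is hedged: it uses the shape dialect's current upstream assembly syntax, which may differ in detail from the form at the time, and the function and value names are made up for illustration.

```mlir
// Minimal sketch: the same operation can yield either the error-carrying
// shape-specific type or a plain standard type.
func.func @shapes(%arg : tensor<?x?xf32>) {
  // Shape-specific result: !shape.shape can also represent an error value.
  %s = shape.shape_of %arg : tensor<?x?xf32> -> !shape.shape
  // Standard result: an extent tensor holding the two dimension sizes
  // as index values.
  %e = shape.shape_of %arg : tensor<?x?xf32> -> tensor<2xindex>
  return
}
```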
Optimizations and Code Generation
- An RFC to clean up casting support in MLIR core was shared, with the goal of upstreaming the HLO cast operations. It has been tested with the lowering of the dynamic reshape cast operation to LLVM.
CPU codegen
- Started to explore what operations are missing in the Vector dialect to support sparse computations:
- The gather/scatter operations were added to the Vector dialect (see the sketch after this list)
- A reference lowering to LLVM intrinsics was added to run on CPU
- Subsequent improvements made sure this lowers to e.g. effective AVX512
- Currently benchmarking and debugging matrix-times-vector cases:
- Sparse SDOT, Sparse SAXPY, Sparse SAXPY (jagged diagonal)
- A few other useful primitives were identified, still TBD
- Continued the matmul study on AArch64
- Near-peak matmul performance for all matrices whose size is a multiple of tile_size, including small matrices.
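For a rough idea of what the new gather primitive looks like, here is a hedged sketch. It uses the Vector dialect's present upstream syntax, which differs slightly from the initial form; the function name, shapes, and values are made up for illustration.

```mlir
// Gather 16 f32 elements from %base at the offsets given in %indices;
// lanes whose mask bit is false take the value from %pass_thru instead.
func.func @sparse_gather(%base: memref<?xf32>, %indices: vector<16xi32>,
                         %mask: vector<16xi1>, %pass_thru: vector<16xf32>)
    -> vector<16xf32> {
  %c0 = arith.constant 0 : index
  %g = vector.gather %base[%c0][%indices], %mask, %pass_thru
      : memref<?xf32>, vector<16xi32>, vector<16xi1>, vector<16xf32>
        into vector<16xf32>
  return %g : vector<16xf32>
}
```

When lowered to the LLVM dialect, such a gather maps onto the llvm.masked.gather intrinsic, which the x86 backend can turn into AVX-512 gather instructions.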
SPIR-V
- Continued progress on the SPIR-V to LLVM conversion. A conversion manual has been set up to show how the different IR constructs are mapped and to track progress. New patterns were added for simple SPIR-V branches; more patches are coming to cover spv.selection, spv.loop, and the load/store memory operations.
- The standard signed remainder to SPIR-V conversion was fixed to emulate it with spv.UMod, because Vulkan does not support spv.SMod/spv.SRem with negative values (the signed remainder can be recovered from the unsigned remainder of the absolute values by reapplying the dividend's sign).
- A rationale section was added to the docs explaining why the SPIR-V conversion does not use a MemRefDescriptor the way the CPU codegen does.
- Continued work on supporting shared memory for matmul
Other
In the Ecosystem
IREE: Compiler/runtime for ML models focused on resource-constrained deployments
- Ops: mhlo.iota on SPIR-V, fixes to out-of-bounds access for pad
- Debugging enhancements (tensor trace op, additional tips)
- New CI pipelines: Cross-compile and test on Android aarch64-v8a, Run tests on Nvidia GPUs, Scheduled build of manylinux Python wheels (generic/mhlo compiler and runtime only; TensorFlow compiler wheel building is not yet running in Buildkite).
- Significant productionalization work to implement pipelines, fix issues and get them to green on Nvidia GPU, Android Mali GPU, and Android aarch64 CPU
- Ad-hoc tested (and fixed bugs) on Samsung S10
- Detailed codegen pipeline documentation. Includes roadmap notes with specific rationale and next steps with respect to convergence/upstreaming.
TensorFlow
- XLA-CPU: Added support for generating matrix-vector multiplies through Linalg. Performance results look positive; correctness issues with transposed inputs still need to be fixed.
- XLA-GPU: moving on with refactoring the backend to operate on the LMHLO dialect:
- Further cleanups on the XLA/GPU backend
- Not using XLA-based llvm_ir::AliasAnalysis anymore.
- Nested computations don't depend on BufferAssignment anymore.
- LHLO::SortOp ported and passing unit tests:
- LHLO ops should carry names for debugging.
- Nested FusionOp (de)serialization support for the current MHLO.
- Added tests with non-identity layouts.
CIRCT: Circuit IR Compilers and Tools, aka "MLIR for hardware"
- Conversion from standard to handshake was updated to generate a single-block region, to be compatible with the upstreamed version of Graph Regions
- An initial lowering from the handshake dialect into FIRRTL now exists
- The handshake dialect gained a simple buffer insertion pass to break potential combinational loops after RTL generation
- FIRRTL gained better support for bit extract operations
- The FIRRTL to RTL transformation is now a Pass rather than a legalization with patterns, because the type conversions were too messy
- LLHD gained a bunch of basic folding optimizations, particularly for variadic operations