See the previous published edition.
Welcome to the twenty-eighth issue of the MLIR (bi)Weekly, a newsletter covering developments in MLIR and related projects in the ecosystem. MLIR (bi)Weekly is brought to you by a collective effort of contributors; we welcome your contributions!
Highlights
- An evolution of the `tensor` MLIR type is under discussion in the RFC on sparse tensor type as first-class citizen.
- The Python bindings now make it possible to JIT and invoke MLIR code; the interface is still very primitive but will evolve! See the sketch after this list.
- A prototype DSL for authoring Linalg operations is in development in-tree; see the RFC: Linalg OpDSL.
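As a rough illustration of the kind of IR that can now be JIT-compiled and invoked through the bindings (a minimal sketch: the function is purely illustrative, and the still-evolving Python-side API is not shown):

```mlir
// A trivial function of the sort one could compile and invoke through
// the new JIT support in the Python bindings (illustrative only; a real
// entry point would be lowered to the LLVM dialect before execution).
func @add(%a: f32, %b: f32) -> f32 {
  %sum = addf %a, %b : f32
  return %sum : f32
}
```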
MLIR Core
Infrastructure
- The proposal discussed in [RFC] Debug Actions in MLIR: Debug Counters for the Modern World has landed!
Table-driven Infrastructure
- Attributes can now be generated using ODS (using similar mechanisms to Types).
Codegen
- The Vector dialect now has `vector.load` and `vector.store` ops. These operations model contiguous vector loads and stores from/to memory and allow representing vector loads and stores that couldn’t be represented using `std.load` and `std.store` ops. They will facilitate the progressive lowering of both Affine vector loads/stores and Vector transfer reads/writes; see the sketch after this list. D96185 [mlir][Vector] Introduce ‘vector.load’ and ‘vector.store’ ops (llvm.org)
- Generalized vector.scatter/gather operations to higher dimensions
- Unifies their form with masked load/store and compress/expand, and simplifies vectorization from a scalar memref to a SIMD memref
- Rewrites all-true/all-false vector masked load/store/expand/compress ops directly into plain loads/stores, so it is no longer necessary to go through transfer operations first
- Various improvements and bug fixes to linalg on tensors in support of the sandbox that has landed in IREE, including:
- Fix order of dimensions in hoistPaddingOnTensors.
- Canonicalize scf.for last tensor iteration result.
- Tighten the rules around folding TensorLoadOp.
- Add folding of vector transfers from/into tensor producing ops.
- Add folding of linalg.copy that are in fact identities.
- TOSA gained more lowerings: to std/scf for control flow, and to Linalg for transpose, reshape, and identity
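For reference, a minimal sketch of the new vector load/store ops (the function, shapes, and values are illustrative, not taken from the patch):

```mlir
// Contiguous vector load/store on a 2-D memref (illustrative).
func @vector_copy(%base: memref<100x100xf32>, %i: index, %j: index) {
  // Load 8 contiguous f32 elements starting at %base[%i, %j].
  %v = vector.load %base[%i, %j] : memref<100x100xf32>, vector<8xf32>
  // Store the same 8 elements back to the same location.
  vector.store %v, %base[%i, %j] : memref<100x100xf32>, vector<8xf32>
  return
}
```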
SPIR-V
- A few patches landed renaming some existing SPIR-V ops to improve naming consistency; more are in review.
Other
- The LLVM dialect now has support for memory access group metadata and LoopOptions metadata
In the Ecosystem
Flang, the LLVM Fortran Compiler
IREE : An Experimental MLIR Execution Environment
- Landed HAL and compiler support for CUDA and enabled the first CUDA E2E tests for element-wise ops.
- A functional linalg-on-tensor sandbox has landed in iree/experimental. It will serve as a staging area for bridging the gap between abstractions needed by IREE and core.
- It uses a custom single-shot bufferization with in-place semantics to try to alleviate some of the difficulties raised in the post about bufferization.
mlir-npcomp: Prototype for compiling numpy programs
- GlobalizeObjectGraph transformation to convert object graphs to a more easily analyzable form
- Added the ability to annotate TorchScript classes: currently used for public/private annotations, but eventually this will be important for providing other information to the compiler.
- Properly model “derefinement”, an impedance mismatch between TorchScript and MLIR.
- BERT imports and lowers; ResNet is in progress.
TensorFlow / MLIR-HLO
XLA GPU CodeGen:
- Implementing whole-graph LMHLO lowering (all WIP):
- Disable multi-stream support
- Either migrate or turn off `--xla_hlo_profile`
- Migrate Conditional and While ops
GPU Kernel CodeGen
- We have now launched all unary kernels from our initial launch set. Further kernel launches will require better rank specialization logic, which is planned but not currently being worked on.
- We have further optimized our broadcasting code, introducing the new `chlo.minimum_broadcast_shapes` operation. It is used as a pre-step for binary operations to reduce the number of dimensions to a minimum, and hence the iteration space for the compute loop. This is particularly useful on GPU, where we have to recompute the iteration space from a 1-D index; see the sketch after this list.
- For binary operations, we have started launching them but have encountered some obstacles in testing: tests commonly pattern-match exact error messages, and the ones produced by the kernel generator are slightly different. We will need to update the tests first, which is under way.
- The team is now looking into better fusion in the presence of broadcasts and performance improvements on CPU. Another focus is moving vectorization to the MLIR infrastructure instead of using LLVM’s vectorizer.
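As a hypothetical sketch of how the new operation is used (the op name comes from the item above, but the exact assembly syntax and types shown here are assumptions): given the extent tensors of two broadcast-compatible operands, it returns equivalent shapes of minimal rank, collapsing adjacent dimensions that broadcast the same way:

```mlir
// Hypothetical usage sketch: reduce two broadcast-compatible shapes,
// e.g. [5, 6, 7] and [6, 7], to minimal-rank equivalents such as
// [5, 42] and [42], so the binary op iterates over a smaller space.
// The exact assembly format of the op may differ from this sketch.
%lhs_min, %rhs_min = chlo.minimum_broadcast_shapes %lhs, %rhs
    : tensor<?xindex>, tensor<?xindex> -> tensor<?xindex>, tensor<?xindex>
```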
TFRT: A New TensorFlow Runtime
Auto-Fusion for TFRT
- Math function approximations (which also work for vectors) landed in MLIR.
- The JIT runtime supports functions returning memrefs without wrapping them in `async.value`; see the sketch below.
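Illustratively, the difference shows up in the compiled function’s signature (a sketch under assumed types; these are not the actual runtime interfaces):

```mlir
// Before: results had to be wrapped in an async value (illustrative).
func private @compute_wrapped(memref<?xf32>) -> !async.value<memref<?xf32>>

// Now: a JIT-compiled function can return the memref directly.
func private @compute(memref<?xf32>) -> memref<?xf32>
```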
CIRCT : Circuit IR Compilers and Tools aka ‘MLIR for hardware’
- ESI’s co-simulation feature is feature-complete for v1. [Demonstration PR]
Recent Talks
Open MLIR Meeting
Fifth LLVM Performance Workshop at CGO
Slides are online for most talks, here are the MLIR ones:
- Moving LLVM’s code generator to MLIR framework
- Classical Loop Nest Transformation Framework on MLIR
- LTO and Data Layout Optimisations in MLIR
- COMET: Domain Specific Compilation for Heterogeneous Targets
Recent Publications
This paper introduces HIR, an MLIR-based intermediate representation (IR) to describe hardware accelerator designs. HIR combines high level language features, such as loops and multi-dimensional tensors, with programmer defined explicit scheduling, to provide a high-level IR suitable for DSL compiler pipelines without compromising control over the micro-architecture of the accelerator. HIR’s explicit schedules allow it to express fine-grained, synchronization-free parallelism and optimizations such as retiming and pipelining. Built as a dialect in MLIR, it draws from best IR practices learnt from communities like those of LLVM. While offering rich optimization opportunities and a high level abstraction, HIR enables sharing of optimizations, utilities and passes with software compiler infrastructure.
Our implementation shows that the code generation time of the HIR code generator is on average 1112x lower than that of Xilinx Vivado HLS on a range of kernels without a compromise on the quality of the generated hardware. We believe that these are significant steps forward in the design of IRs for hardware synthesis and in equipping domain-specific languages with a productive and performing compilation path to custom hardware acceleration.