Schedule
9:00 – 9:05 Opening
GPU Compilation
9:05 – 9:30 CUDA Tile IR (Matthias Springer, Lorenzo Chelini)
9:30 – 10:00 ASTER: MLIR-Based Assembly Tooling (Nicolas Vasilache, Fabian Mora Corder, Kunwar Grover)
10:00 – 10:30 Auto-tuning MLIR schedules for Intel GPUs (Tuomas Karna, Rolf Morel)
10:30 – 11:00 Break
Synthesis & Optimization
11:00 – 11:30 Progressive Arithmetic Lowering to Synthesizable Datapaths (Louis Ledoux, Pierre Cochard et al.)
11:30 – 12:00 Multi Stage Sequential RL for MLIR Meta-Optimization (Prakanth Thilakaraj)
12:00 – 13:00 Lunch
Compiler Abstractions
13:00 – 13:30 From Graphs to Warps: Semantic Interoperability (Nachiketa Gargi)
13:30 – 14:00 Beyond Constants: Mojo’s Attribute-Based Expression System (Billy Zhu)
14:00 – 14:30 Tamagoyaki: MLIR-Native Equality Saturation (Sasha Lopoukhine, Jules Merckx et al.)
14:30 – 15:00 Break
AI & HPC Compilation
15:00 – 15:30 MLIR-RAJA: Bridging AI Models and HPC (Tai-Hsiang Peng, Hung-Ming Lai et al.)
15:30 – 15:55 Training-Aware Compilation for Custom AI Accelerators (Mriganka Bezbaruah, Akshay K et al.)
15:55 – 16:00 Closing
Abstracts
CUDA Tile IR: Lessons from a Tile-Centric CUDA Dialect for MLIR
Matthias Springer (NVIDIA Switzerland), Lorenzo Chelini (NVIDIA Switzerland)
This talk presents CUDA Tile IR, a tile-based CUDA dialect for MLIR, focusing on the design trade-offs that differentiate it from upstream dialects such as arith, tensor, memref, linalg, and async. Using concrete examples (e.g., matrix multiplication, TMA-friendly load/store patterns, and token-based ordering), the talk contrasts CUDA Tile IR’s type system, operations, and overall dialect design with existing MLIR abstractions, and highlights practical lessons for developers designing their own vendor-specific dialects.
ASTER: MLIR-Based Assembly Tooling and Representations
Nicolas Vasilache (AMD), Fabian Mora Corder (AMD), Kunwar Grover (AMD)
Today, achieving peak performance on modern AI accelerators often requires control over low-level hardware features. This trend is expected to intensify as more asynchronicity and dynamism become first-class in the hardware. As Dark Silicon trends continue, hardware is expected to expose coarser-grain primitives, and coarser-grain programming models must be used (e.g., with warp/wave specialization, the low-level programming model increasingly resembles MPI/MIMD-style parallelism, but complicated by low-level hardware constraints such as instruction issue ports or warp/wave scheduling and specialization).
AMD’s open approach to hardware ISA documentation creates a unique opportunity to build world-class assembly tooling in the open, making AMDGPU ASM accessible to a broader community as well as higher-level tools, while maintaining expert-level control.
To reap the benefits of modern and future hardware, we believe that low-level tooling needs to become an order of magnitude better.
ASTER builds the foundations for highly controllable assembly production and pushes the boundaries of what’s possible in low-level performance tooling.
Auto-tuning MLIR schedules: a case study targeting Intel GPUs
Tuomas Karna (Intel), Rolf Morel (Intel)
We present an end-to-end MLIR schedule for ML workloads and the auto-tuning thereof. The schedule targets Intel Battlemage GPUs and takes kernels all the way from Linalg-dialect ingress to LLVM IR. Alongside scheduling ops for, e.g., tiling, the schedule encodes its auto-tuning problem through tuneable “knob” ops and also encodes the many hardware-imposed constraints on parameters as ops. We present a harness that performs automatic optimization of such auto-tuneable schedules and show how our approach, using upstream MLIR and without user-specified tiling, can achieve performance comparable to state-of-the-art frameworks.
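As a rough illustration of the knob-and-constraint idea, the following Python sketch enumerates a small tuning space, discards configurations that violate constraints, and keeps the cheapest one. It is not the harness from the talk: the knob names, the constraint values, and `toy_cost` are all hypothetical stand-ins, and a real harness would compile and time each candidate instead of using a closed-form cost.

```python
from itertools import product

# Hypothetical problem size for a matmul-like kernel (assumption).
M = N = K = 256

# Hypothetical "knobs": each maps a tunable parameter to its candidate values.
KNOBS = {
    "tile_m": [8, 16, 32, 64],
    "tile_n": [16, 32, 64],
    "tile_k": [8, 16, 32],
}


def satisfies_constraints(cfg, max_tile_elems=2048):
    """Made-up stand-ins for hardware-imposed constraints on the knobs."""
    divides = (
        M % cfg["tile_m"] == 0
        and N % cfg["tile_n"] == 0
        and K % cfg["tile_k"] == 0
    )
    # Pretend the output tile must fit in a fixed-size register budget.
    fits = cfg["tile_m"] * cfg["tile_n"] <= max_tile_elems
    return divides and fits


def toy_cost(cfg):
    """Placeholder cost: number of tile iterations (lower is better)."""
    return (M // cfg["tile_m"]) * (N // cfg["tile_n"]) * (K // cfg["tile_k"])


def tune(knobs, cost):
    """Exhaustively search the knob space; return (best_cost, best_cfg)."""
    names = list(knobs)
    best = None
    for values in product(*(knobs[name] for name in names)):
        cfg = dict(zip(names, values))
        if not satisfies_constraints(cfg):
            continue  # infeasible configuration, skip without evaluating
        c = cost(cfg)
        if best is None or c < best[0]:
            best = (c, cfg)
    return best


best_cost, best_cfg = tune(KNOBS, toy_cost)
print(best_cost, best_cfg)
```

Exhaustive search is used here only because the toy space is tiny; larger spaces would typically call for a smarter search strategy.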
Progressive Arithmetic Lowering from Tensor Kernels to Synthesizable Datapaths
Louis Ledoux (INSA / INRIA Lyon), Pierre Cochard (INSA / INRIA Lyon), Florent de Dinechin (INSA / INRIA Lyon)
This work presents an end-to-end MLIR-based compilation flow that lowers high-level machine-learning and DSP kernels to explicit combinational and sequential datapaths suitable for CIRCT and RTL export. The flow treats arithmetic as a multi-level concern, progressively exposing numeric intent, structured control, and hardware structure from tensor programs down to circuit-level IR. It integrates real-number expression recovery, configurable floating- and fixed-point lowering, and direct construction of circuit-level datapaths using CIRCT-compatible representations. The approach enables floating-point and specialized arithmetic to remain first-class up to the circuit boundary, supporting fine-grained trade-offs in precision, performance, and area. The flow has been validated on real silicon, including a taped-out DSP design using the GF180MCU open-source PDK, and on an end-to-end compilation of a PyTorch LLaMA layer into a design synthesized for the Sky130 process node.
Multi Stage Sequential Reinforcement Learning Environment for MLIR Meta-Optimization
Prakanth Thilakaraj (University of Warwick)
Machine learning–guided compiler optimization has recently attracted significant attention, with applications such as optimization pass prediction in LLVM IR and heuristic tuning in domain-specific language (DSL) compilers. In this work, we extend this paradigm to the Multi-Level Intermediate Representation (MLIR), a compiler infrastructure for DSL development that is widely used in AI compiler stacks. We present MLIRCompilerEnv, a reinforcement learning (RL) environment for predicting optimization passes and pass options in MLIR-based compiler pipelines, and evaluate it in the context of AI compilation. Our design targets the linalg, affine, and scf dialect levels, while remaining extensible to other MLIR-based compiler stacks for meta-optimization. We formulate the selection of compiler optimization passes as a multi-stage RL problem, where each dialect stage is associated with a dedicated agent responsible for predicting optimization passes and their corresponding options. A message-passing graph neural network serves as a shared backbone across agents, extracting structural features from the intermediate representations (IR). Agents operate sequentially to mirror the ordering of stages in the MLIR lowering pipeline, and end-to-end execution runtime is used as the reward signal to guide learning. To support structured learning over compiler IRs, we introduce a graph construction framework that fuses control-flow and data-flow information into a unified graph representation, covering MLIR dialects including linalg, scf, affine, arith, and math. We illustrate how this design naturally generalizes beyond AI compilers and facilitates extension to other MLIR-based compiler stacks such as stencil compilers. Initial results demonstrate the feasibility of the approach, and we discuss performance trade-offs and the compilation overhead introduced by the learning-based framework.
The framework is currently being extended with optimization passes matching those of the MLIR-based IREE compiler, to enable a comprehensive like-for-like comparison and analysis.
From Graphs to Warps: Semantic Interoperability Across MLIR Abstraction Levels
Nachiketa Gargi (NVIDIA)
Modern ML compiler stacks span multiple semantic abstraction levels, from graph-level program representations to tile-based computation and SIMT kernels. Composing these layers reliably remains challenging in practice, particularly in GPU compilers, where execution semantics, memory hierarchy, and parallelism are explicit and tightly coupled to performance. This talk characterizes recurring interoperability failures that arise when crossing abstraction boundaries, such as loss of semantic information during lowering, conflicting ownership of layout and scheduling decisions, and non-composable cost models. Using examples drawn from the MLIR ecosystem, we illustrate why these interoperability failures are fundamental rather than incidental. The goal of this talk is to frame semantic interoperability as a first-class problem in MLIR-based compiler design and to outline open research questions.
Beyond Constants: Mojo’s Attribute-Based Expression System
Billy Zhu (Modular)
The Mojo programming language leverages MLIR’s attribute system to represent compile-time expressions as attribute trees, enabling parametric IR where operations are “staged” with expression attributes that evaluate to constants during the compilation pipeline. Our expression language is built on a typed lambda calculus using de Bruijn indices for bound variables, with dialects contributing their own types and operators to create a rich, extensible system. To evaluate these expressions, we developed a custom AttrTypeReplacer with depth-aware caching (handling de Bruijn index-based references) and context-aware replacement (providing access to outer symbol tables). This talk presents our representation and our evaluator design as a case study for programming language developers using MLIR for high-level IR, demonstrating how MLIR’s attribute system can be used for much more than just “constants”.
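For readers unfamiliar with de Bruijn indices, the following Python sketch shows the general idea on a tiny untyped lambda calculus: a variable is a number counting how many binders lie between it and the binder it refers to, so no variable names are needed. This is only an illustration of the technique; it is not Mojo's representation or evaluator, and all class and function names here are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Const:
    value: int


@dataclass(frozen=True)
class Var:
    index: int  # de Bruijn index: 0 refers to the innermost enclosing binder


@dataclass(frozen=True)
class Lam:
    body: object  # the binder is anonymous; the body refers to it by index


@dataclass(frozen=True)
class App:
    fn: object
    arg: object


@dataclass(frozen=True)
class Add:
    lhs: object
    rhs: object


@dataclass(frozen=True)
class Closure:
    body: object
    env: tuple  # values of enclosing binders, innermost first


def evaluate(expr, env=()):
    """Evaluate an expression; env[i] is the value bound at de Bruijn index i."""
    if isinstance(expr, Const):
        return expr.value
    if isinstance(expr, Var):
        return env[expr.index]
    if isinstance(expr, Lam):
        return Closure(expr.body, env)
    if isinstance(expr, App):
        fn = evaluate(expr.fn, env)
        arg = evaluate(expr.arg, env)
        # Entering the lambda body pushes the argument at index 0.
        return evaluate(fn.body, (arg,) + fn.env)
    if isinstance(expr, Add):
        return evaluate(expr.lhs, env) + evaluate(expr.rhs, env)
    raise TypeError(f"unknown expression: {expr!r}")


# (\x. \y. x + y) 2 3, where x is index 1 and y is index 0 inside the body.
add_fn = Lam(Lam(Add(Var(1), Var(0))))
result = evaluate(App(App(add_fn, Const(2)), Const(3)))
print(result)  # 5
```

Because the meaning of an index depends on how many binders enclose it, the same subexpression can denote different things at different depths, which is one reason a cache keyed on the expression alone is not sufficient.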
Tamagoyaki: MLIR-Native Equality Saturation
Sasha Lopoukhine (University of Cambridge), Jules Merckx (Ghent University), Sam Coward (University College London), Jianyi Cheng (University of Edinburgh), Bjorn De Sutter (Ghent University), Tobias Grosser (University of Cambridge)
Equality saturation (EqSat) is an expression rewriting technique based on efficiently representing equivalent expressions. In recent years, it has been applied successfully in many different domains. Applying EqSat to MLIR code, however, requires additional effort for converting rewrite patterns and IR to formats that are understood by external tools. We present Tamagoyaki, an MLIR (and xDSL) framework to represent equivalences directly in your IR. Building on pdl, we apply rewrite patterns on this representation. By encoding equivalences explicitly in IR, we open the door to including more complex compiler passes in EqSat. As a case study, we replicate the core procedure of Herbie, a floating-point accuracy optimizer, directly in MLIR. Additionally, we show how Tamagoyaki can easily be applied to other MLIR projects such as CIRCT.
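To make the underlying idea concrete, here is a minimal Python sketch of a classic standalone e-graph: equivalent expressions share an equivalence class, a rewrite adds an alternative form instead of replacing the original, and a cost function picks the cheapest representative at the end. It is not Tamagoyaki's in-IR encoding, it applies a single hypothetical rule (`x * 2` to `x << 1`) in one pass rather than saturating a rule set, and the cost table is made up.

```python
class EGraph:
    def __init__(self):
        self.parent = []  # union-find over e-class ids
        self.nodes = {}   # hashcons: (op, child class ids) -> e-class id

    def find(self, a):
        while self.parent[a] != a:
            self.parent[a] = self.parent[self.parent[a]]  # path halving
            a = self.parent[a]
        return a

    def add(self, op, *children):
        node = (op, tuple(self.find(c) for c in children))
        if node in self.nodes:
            return self.find(self.nodes[node])
        cid = len(self.parent)
        self.parent.append(cid)
        self.nodes[node] = cid
        return cid

    def union(self, a, b):
        a, b = self.find(a), self.find(b)
        if a != b:
            self.parent[b] = a
        return a

    def rebuild(self):
        """Re-canonicalize e-nodes and merge classes of congruent nodes."""
        changed = True
        while changed:
            changed = False
            fresh = {}
            for (op, kids), cid in self.nodes.items():
                node = (op, tuple(self.find(k) for k in kids))
                cid = self.find(cid)
                if node in fresh and self.find(fresh[node]) != cid:
                    self.union(fresh[node], cid)
                    changed = True
                fresh.setdefault(node, cid)
            self.nodes = fresh


def rewrite_mul2_to_shl(eg):
    """Hypothetical rule: mul(x, 2) is equivalent to shl(x, 1)."""
    two, one = eg.add("2"), eg.add("1")
    matches = [(cid, kids[0]) for (op, kids), cid in list(eg.nodes.items())
               if op == "mul" and eg.find(kids[1]) == eg.find(two)]
    for cid, x in matches:
        eg.union(cid, eg.add("shl", x, one))  # keep both forms in one class
    eg.rebuild()


COST = {"mul": 4, "shl": 1}  # made-up costs; every other op costs 1


def extract(eg, root):
    """Return (cost, term) of the cheapest expression in root's e-class."""
    best = {}
    changed = True
    while changed:
        changed = False
        for (op, kids), cid in eg.nodes.items():
            cid, kids = eg.find(cid), [eg.find(k) for k in kids]
            if any(k not in best for k in kids):
                continue
            cost = COST.get(op, 1) + sum(best[k][0] for k in kids)
            args = " ".join(best[k][1] for k in kids)
            term = f"({op} {args})" if kids else op
            if cid not in best or cost < best[cid][0]:
                best[cid] = (cost, term)
                changed = True
    return best[eg.find(root)]


eg = EGraph()
root = eg.add("mul", eg.add("a"), eg.add("2"))
rewrite_mul2_to_shl(eg)
print(extract(eg, root))  # (3, '(shl a 1)')
```

The original `mul` form is still present in the e-graph after the rewrite; extraction merely prefers the cheaper `shl` form under this cost table.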
MLIR-RAJA: Bridging AI Models and HPC Performance Portability
Tai-Hsiang Peng (National Tsing Hua University, Taiwan), Hung-Ming Lai (National Tsing Hua University, Taiwan), Chun-Lin Huang (National Tsing Hua University, Taiwan), Wei-Shen Huang (National Tsing Hua University, Taiwan), Jenq-Kuen Lee (National Tsing Hua University, Taiwan)
RAJA, originally developed by Lawrence Livermore National Laboratory, is a C++ template library widely used in High-Performance Computing (HPC). It ensures that code can run efficiently on different hardware, such as CPUs and GPUs, without needing to be rewritten. At the same time, AI models are becoming essential tools for modern scientific discovery. To connect these two worlds, we introduce MLIR-RAJA, an MLIR dialect that links high-level AI frameworks with the performance benefits of RAJA. Our work defines a specific RAJA dialect in MLIR to build an automated end-to-end flow capable of translating AI models directly into optimized RAJA C++. This solution eliminates the barrier between AI development and HPC execution, enabling the automatic generation of portable code for AI models. Experimental results demonstrate that our structure-aware optimizations achieve significant improvement in sequential execution over the baseline. Furthermore, these performance gains extend to OpenMP-enabled parallel execution. Finally, we lower the MLIR-RAJA dialect to LLVM so that MLIR-RAJA computations can use a variety of backends.
Training-Aware Compilation with MLIR for Custom AI Accelerators
Mriganka Bezbaruah (CDAC), Akshay K (CDAC), Prachi Pandey (CDAC)
This session presents the design of a training-aware compilation flow built using MLIR-based infrastructure for lowering PyTorch models toward accelerator-oriented intermediate representations. While many compilation pipelines focus primarily on inference, enabling full training requires handling both forward execution and backward gradient computation within the same framework. The talk explains how models are traced to obtain computation graphs, how forward and backward passes are compiled within a unified pipeline, and how parameter updates and memory dependencies are handled across training iterations. The session focuses on operator mapping strategies and practical considerations when compiling forward and backward graphs consistently in an MLIR-based flow.