[RFC] Tile Dialect and-or Dialect Reshuffle

At Intel Labs, we’re working on a hybrid compiler/library approach to high-performance code generation, as presented in the TPP paper.

We use a Tile dialect to decompose ML “operations” into tile operations, then re-compose them in a hardware-efficient way (blocking/tiling/fusing), finally calling hand-written micro-kernel libraries.

This scales well on CPUs because writing small tile kernels is easier than writing one large kernel for every combination of high-dimensional shapes, and calling micro-kernel libraries is efficient. But on GPUs, this is not so trivial.

Essentially, for GPUs (and similar devices) you need to re-compose the micro-kernels into larger kernels to offload whole computations, taking advantage of the hardware threads and units inside the larger kernel, and that needs cooperation between compiler and micro-kernel library writers, ABI agreements for register reuse, etc.

After discussing with different teams inside and outside of Intel, on forums and at the MLIR Hackathon in Edinburgh, we have a proposal for a tile dialect that could work for both types of devices (and we hope many more). However, the proposal isn’t precise, because there is no “right way” of doing it, and either way, we’ll need buy-in from the community.

This RFC’s intent is to gauge which direction the other groups working on the same problems feel is most productive. So we’d welcome feedback from everyone here.


The core idea is that high-level ML and HPC problems are described in “problem space” while efficient hardware implementation is written in “architecture space”. These are already different for a single problem and a single architecture, let alone multiple problems on different architectures.

The Tile dialect’s main objective is to allow seamless decomposition and re-composition of these problems as operations on tiles. A tile could be different for CPUs and GPUs and even across different CPUs, but it’s usually a 2D structure because you usually have N registers with M lanes each to represent your inner computation. Broadcast and reduction operations change that a bit but that’s irrelevant for the current discussion.

Most importantly, tile operations can be executed in parallel by the hardware. For example, a reducing GEMM can be considered a tile operation if it reduces the K dimension, which can then be fused with a bias add later on.

In a nutshell:

  • Each Tile operation represents a computation on a 2D “tile”
  • Element-wise, matmul, broadcast, reduction, specialized (e.g. ReLU), etc.
  • Allow decomposition from “problem space” into parallel tile ops
  • Allow re-composition to “architecture space” and into larger kernels (CPU/GPU)
  • Choosing the tile sizes, loop order, fusion opportunities is a job for the compiler
  • Implementing the most efficient micro-kernel for a specific architecture is a job for the library/vectorizer
  • Compiler and libraries need to talk to each other to converge to a global optimal solution
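To make the bullet points above concrete, here is a minimal Python sketch of the decomposition/re-composition idea. The tile sizes, function names, and plain-list “tensors” are all made up for illustration; the real dialect operates on MLIR tensors/memrefs. A GEMM is decomposed into 2D tile ops that reduce over K, and a bias add is re-composed (fused) into the same tile loop:

```python
# Sketch: decompose C = A @ B + bias into 2D tile ops, then fuse the
# bias add at the tile level. TM/TN/TK are hypothetical tile sizes.
TM, TN, TK = 2, 2, 2

def tile_matmul_acc(a_tile, b_tile, c_tile):
    """One 'tile op': a small GEMM accumulating into c_tile (reduces K)."""
    for i in range(len(a_tile)):
        for j in range(len(b_tile[0])):
            for p in range(len(b_tile)):
                c_tile[i][j] += a_tile[i][p] * b_tile[p][j]

def tile_bias_add(c_tile, bias_tile):
    """Element-wise tile op, runnable right after the last K reduction."""
    for i in range(len(c_tile)):
        for j in range(len(c_tile[0])):
            c_tile[i][j] += bias_tile[j]

def fused_gemm_bias(A, B, bias):
    M, K, N = len(A), len(B), len(B[0])
    C = [[0.0] * N for _ in range(M)]
    for mi in range(0, M, TM):          # parallel loop over M tiles
        for ni in range(0, N, TN):      # parallel loop over N tiles
            c = [row[ni:ni + TN] for row in C[mi:mi + TM]]
            for ki in range(0, K, TK):  # reduction loop over K
                a = [row[ki:ki + TK] for row in A[mi:mi + TM]]
                b = [row[ni:ni + TN] for row in B[ki:ki + TK]]
                tile_matmul_acc(a, b, c)
            # re-composition: bias add fused into the same tile loop,
            # after the last reduction over K
            tile_bias_add(c, bias[ni:ni + TN])
            for i, row in enumerate(c):
                C[mi + i][ni:ni + TN] = row
    return C
```

The point of the sketch is the structure, not the arithmetic: the outer `mi`/`ni` loops are parallel over tiles (the part a compiler re-composes into larger GPU kernels), while the tile ops themselves are the part a micro-kernel library implements efficiently.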

Current State

We have created a TPP tile dialect and an XSMM micro-kernel dialect to represent these concepts.

We don’t claim they are the optimal way to represent either problem, but they do allow us to work on these problems and get good performance out of them. We seek consensus on what the best representation is, so that we can continue pushing the community towards the best way forward, not just one that “works for us”.

Our current prototype has the following path:

  • Lower StableHLO/Torch/TCP into [linalg+arith/math]
  • Tile into [SCF and linalg+arith/math] (decomposition)
  • Lift to [SCF and TPP] and fuse (re-composition)
  • Lower to XSMM/OpenMP and LLVM calls

On GPUs, the last step would instead need to use GPU threads and bundle the whole parallel loop into a kernel to offload. We’re working on a prototype for that now.

The main problems with this approach are:

  1. linalg is great for tiling and fusing, but it mandates strongly nested loops, which hinders fusion at different nest levels (for example, an element-wise op after the last reduction of a GEMM tile).
  2. linalg+arith/math is not helpful for pattern-matching (re-composition), so we need to lift from linalg+arith/math-on-scalars to named tile ops (our TPP dialect).
  3. The end result of a tiled linalg is a set of SCF loops (for, forall, parallel) and more linalg inside, so we have to go to SCF anyway.
  4. Lowering to linalg too early forces a particular implementation detail that may not be optimal on the target architecture, because linalg uses scalar arith/math inside its regions. For example, a ReLU can be lowered to generic { maxf(x, 0) } or generic { cmp; select }, while the target may have a better implementation for one or the other, or neither.
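Points 2 and 4 interact: every extra scalar lowering choice becomes another pattern a lifting pass must recognize. A toy sketch of such a lifter (the tuple encoding of a generic body and the `tpp.relu` target name are both made up here, not real MLIR):

```python
# Toy "lifting" pass: recognize either scalar lowering of ReLU inside a
# generic body and lift it to a named tile op. Each additional lowering
# form requires its own pattern; unmatched bodies stay as generics.

def lift_to_named_op(body):
    # pattern 1: generic { maxf(%x, 0.0) }
    if body == ("maxf", "%x", 0.0):
        return "tpp.relu"
    # pattern 2: generic { %c = cmpf_gt(%x, 0.0); select(%c, %x, 0.0) }
    if body == ("select", ("cmpf_gt", "%x", 0.0), "%x", 0.0):
        return "tpp.relu"
    return None  # unknown pattern: remains a linalg.generic

# Both scalar forms compute the same function, but the lifter has to
# know about each one explicitly:
assert lift_to_named_op(("maxf", "%x", 0.0)) == "tpp.relu"
assert lift_to_named_op(("select", ("cmpf_gt", "%x", 0.0), "%x", 0.0)) == "tpp.relu"
```

This is why the thread argues for carrying named tensor/tile ops down instead of lowering to scalars and lifting back: the set of scalar patterns “in the wild” keeps growing.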

To avoid lowering and lifting, and to be able to use SCF loops from the beginning, we could have an extension of a tensor dialect with named ops (like TCP) that gets tiled directly to tile ops when possible, and to linalg generics when not.

This uses the “best of both worlds”, but needs invasive changes to many upstream dialects.

This alternative process (with tensor ops dialect) would be:

  • Lower StableHLO/Torch/TCP into [tensor ops on ND-tensors and linalg+arith/math]
  • Tile into [SCF and tensor ops on tiles (linalg tiles separately)] (decomposition)
  • Optional lifting from tiled linalg if needed
  • Reorg to [SCF and tensor ops on tiles] and fuse (re-composition)
  • Lower to kernel / micro-kernel calls, vectorize, group for GPUs, etc.

Reshuffle of Dialects

For a while there have been discussions about removing tensor and memref support from arith and math, and concerns that linalg is gaining too many named ops, which was not its original design.

Recently we agreed on a TCP dialect that converges StableHLO, Torch and other ingress dialects. On its own it can serve as a common ingress, but it may prove hard to justify yet another partial-conversion route if it does not provide value beyond convergence.

TOSA has a lot of the required functionality (ND shapes, named ops, strict semantics) but it’s not an upstream dialect in the sense that its development is controlled by a separate group that needs to certify every change. In essence, it serves as a very stable exchange format, not so much as an upstream development and performance dialect.

So, to avoid lowering and raising IR, we’d need a dialect that:

  • Can represent named operations on N-dimensional tensors/memrefs
  • Implements TileInterface, implying specific affine maps and parallel/reduction iteration
  • Has passes that can easily match specific dimensions as tiles (0~2D)
  • Has the same semantics as arith and math, but on tensors
  • Has additional operations that are common to ML/HPC problems
  • Has a direct mapping between the ingress dialects and tile ops
  • For everything else, we fall back to linalg generics
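The third bullet (matching specific dimensions as 0~2D tiles) is the kind of shape bookkeeping such passes would need everywhere. A hypothetical helper, in Python for illustration only (the heuristic of taking the innermost dimensions as the tile is an assumption, not the proposal’s policy):

```python
# Hypothetical helper: split an N-D shape into outer (parallel loop)
# dimensions and an inner 0-2D tile. Real passes would also consult
# affine maps and target info; this only shows the shape split.

def split_tile(shape, max_tile_rank=2):
    """Return (loop_dims, tile_dims): innermost <=2 dims become the tile."""
    rank = min(max_tile_rank, len(shape))
    cut = len(shape) - rank
    return list(shape[:cut]), list(shape[cut:])

# A 4D tensor: two outer loop dims, one 2D tile.
assert split_tile([8, 16, 32, 64]) == ([8, 16], [32, 64])
# A 2D tensor is already a single tile; 1D and scalar degenerate.
assert split_tile([32, 64]) == ([], [32, 64])
assert split_tile([]) == ([], [])
```

The “0~2D” range in the bullet covers exactly these degenerate cases: scalars (0D) and vectors (1D) are still tiles, just with fewer dimensions.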

Potential Solutions


TCP

The most appealing option to me is to use TCP. It already has most of those features, but not yet the whole intent, so it would need some re-thinking if we want to go that way. And creating another tensor dialect would conflict with the existing tensor dialect and lower the appeal of TCP.

This is what I understood initially, but it seems the group implementing it had different ideas. If this post helps convince them to extend the semantics, great. If not, we can make do with some other dialect.

TensorArith / TensorMath

Basically, arith and math for tensor and memref. I don’t like this idea.

Or bundle both into a single “TCP but for tiling” dialect, if the TCP folks don’t think using TCP for tiling is a good idea. That also reduces the appeal of TCP.


Extend Tensor

We could also extend tensor to have all arith and math ops and more (ML/HPC), bloating it a lot, which would also mean we need the same for memref (vector too?), so this to me is the least appealing option.

It would create a mismatch between arith+math on scalars and vectors versus tensor and memref carrying shape ops plus a whole set of arithmetic, math, ML and HPC ops. To me, tensor, memref and vector are type dialects.

Arith / Math / ML / HPC

Formalise arith and math, and expand into new ML and HPC dialects on scalars and tensors. This may not be possible.

This is a long road and has been rejected a few times in the past, because the semantics aren’t consistent across architectures in the way “higher-level ML operations” are.

We could get away with just ML and HPC here, same as TensorMath above.


Tile Dialect

Continue on our current route to have a “tile only” dialect, and continue to lift from linalg+scalar into tile operations. This is working so far, but we need an increasing number of pattern matchers to find an increasing number of patterns in the wild.

Better Ideas?

As I said at the start, these are just some ideas (and not even great ones). I may be missing something completely obvious, so I’m seeking honest feedback.

From the people I talk to and the code I read elsewhere, most groups are more or less struggling with the same problems we are, so if someone has a better solution, a lot of people will be very happy! :smile:

@mehdi_amini @sanjoyd @nicolasvasilache @jpienaar @stellaraccident @TobiasGrosser


Broadly makes sense to me. As was discussed on the IREE mailing list, the conversation got a little sideways because IREE picks up what a pipeline like the one you are describing produces (and has some in-tree microkernels that take a speedier path to lowering).

Further building out the machinery to scale such things makes sense to me.

I just wish someone would create a repo and put all of the bits for such a thing in it. The situation we have, where some of it is in a branch on torch-mlir, some is in tpp, some bits upstream… could use improvement. Seems like a mkdir problem.

I’m mostly commenting on this side topic. Is the above understanding still true, given OpenXLA and StableHLO? i.e., when you have OpenXLA and StableHLO, do you even need TCP? When TensorFlow, PyTorch, and JAX have (or will have) stable community-driven paths to StableHLO, “TCP” is effectively StableHLO. I don’t see what goal of the planned TCP isn’t met by StableHLO. I think it’s a great accomplishment that OpenXLA StableHLO happened, and I hope, like many, to see continued efforts to maintain the paths into and out of it.

I might have misunderstood what “Recently” above is – so if there is a post-OpenXLA document/discussion elsewhere, please point me to it to avoid repetition.

With the obvious bias I have on the stablehlo side, I’ll stop short of making a recommendation, but I think this is the right question to be asking over the coming months. This gap in the ecosystem has existed for a long time, and stablehlo looks to be on a path to fix it. There has also been measurable progress made in extending it, which starts to answer the main anxiety I had about it earlier this year (i.e., is it too stable?).

Replaying conversations from the past, I think there are three axes of difference among some:

  1. Governance (LLVM vs Google/OpenXLA evolving in the future)
  2. Placement (MLIR upstream vs other)
  3. Technical (do we need a second level “transformation oriented” dialect at the same abstraction level?)

On the first, OpenXLA is behind but has aspirations. Within OpenXLA, StableHLO is the furthest ahead in pushing the envelope here.

On the second, it has been pretty clear to me that the community is split as to how much upstream MLIR should itself be the place for such development. For myself, I’ve decided that it is not worth the ROI to continue this discussion, and an out of tree ecosystem for domain specific use of MLIR is the way to go. I’d far rather have some more repositories in the world and shuffling over time vs continuing development with the hand brake on trying to define this relationship within MLIR itself.

On the third, there is still debate. Part of the community believes that stablehlo is a serialization dialect suitable only for interchange, and part see it as fine to do transformations on. I personally have evolved on this viewpoint and am in the second category now. The reasons I’ve evolved are because (a) significant investment has shown this to be a practical reality to live in, and (b) if you look beyond the names at the dialects and layering, you find that the stablehlo dialect is very similar to LLVM-IR in terms of evolution contract (the “stable” part of it actually comes from its serialization layer, which interops with the actually “unstable” stablehlo dialect).

I’ve got my own perspective on this, but I think that is a faithful rendering of the debate points.

I think this is all a minor point related to the overall Tile work. That seems like a thing that should exist, but I’d have the same question about just putting it in upstream MLIR as I have with anything.

That’s how I see it too, but honestly I’m not involved in the ingress dialect discussions, so it might just be my own bias. That is why I originally thought of TCP as a middle ground between StableHLO and Linalg for optimization purposes, but that doesn’t seem to be the current approach.

You got it right, I meant pre-OpenXLA.

Exactly, I don’t intend to start that either. To me, as long as it’s an independent source (i.e. I don’t have to import a whole project to get a dialect), I’m ok with it. The governance discussion is more important, but as long as the dialect remains “open”, it should also be “good enough”.

That’s what I really want to talk about.

The “Tile” part of our work is crucial for the kind of transformations we do, but on its own, it overlaps with too many different dialects (from StableHLO, through Linalg, to Arith/Math) and this is what I’m trying to understand.

If there is a desire to fix this middle-ground, we’ll continue to work on our tile dialect downstream while working with the community (MLIR and OpenXLA) to fix the dialect landscape.

If not, we’ll propose a more self-contained Tile dialect that is lifted from Linalg generics for pattern matching, and restrict its semantics to facilitate pattern matching for decomposition, re-composition and fusing.

This matches my original expectation to use TCP as this tensor operation dialect, as I understood StableHLO to be only a serialization dialect.

Now, if StableHLO becomes a transform dialect, it will have to carry information in a way that does not tie itself too early to the target hardware, allows tiling and fusing at the tensor level, and connects to a micro-kernel dispatch dialect that allows architecture-aware fusion. This is a tall order for a dialect that is also a serialization dialect.

What I mean by “tie itself too early” is the following:

  • All high/mid/low level transforms need to understand the hardware features to pick the right transforms (multi-level cost model).
  • But that cannot be a different pass for each level for each hardware, or we end up with a combinatorial number of different compilers.
  • So we need a generic dialect and transforms that take into consideration target features at each level and the possible lowering in the next stage.

This is close to what LLVM does for code generation, but here we go from graphs, to large ops on basic blocks, to long lists of deeply nested loops, to fusion at different nest levels, to a long stream of parallel loops and kernel calls, to potentially large kernel fusions and outlining, etc.

The shape of the IR changes radically with each layer, and we’ll need not only different cost models and types of target information, but also different interfaces and simplified pattern matching at different levels.

Having all that in one dialect is obviously possible, but I personally wouldn’t vouch for it as a first approach.

I agree, but we may have to do both.

If we want to separate concerns, avoid redundancy and clean up the upstream dialects (of which I think there are too many), we need to keep the minimal support in MLIR proper and move the rest to individual repositories, either under the LLVM umbrella or not.

(We’d also need good infrastructure to easily import the right dialects and dependencies, which is not trivial.)

For now, following your point above, we ignore the upstream MLIR mess, extract our dialects (tpp, xsmm) into a separate repository that isn’t tied to tpp-mlir (potentially renaming them to tile and ukernel or something), and then iterate on it and see how it connects to StableHLO from above.

If it becomes clear that StableHLO is getting bloated with too many interfaces, we can propose a tensor operation dialect to cover the gap (on yet-another separate repo). If later on we find a way to either raise Tile or lower StableHLO, we fuse the middle dialect with the right one, etc.

IIUC this is what you were suggesting, right?

Yeah, neither would I. I think that I would build the transformation dialect that you need to support a lowering path vs trying to acquire a global design lock. I just think that “suitable for transformations” is a misnomer and we shouldn’t be shying away from intermediate, utility dialects that aid a particular compilation pathway (like what you are proposing with Tile).

If you’ll let me get wonky for a minute, I think that when a project like MLIR experiences its “big bang” moment and the temperature finally settles enough to produce the initial matter, that first burst of creation is of a different type than what follows. I’m treating the upstream dialects like that first burst: they condensed out of the microwave background and established the early structure of the universe, but what follows is much more incremental and “clumpy”, with change and exchange over time.

I believe this was where my metaphor was headed, yes, but I think I’m advising to not shy away from making an island-project that has everything you need to go from a good point A to a good point B vs focusing on where each internal piece lives. Get customers and integrations, and then look at breaking apart. Along the way, of course, do your best to make the general structure something that we pattern match as “production compiler”. If you build things with good internal separation of concerns, it is easier to identify common pieces later and move them to the right clump.

This is basically the evolution process that StableHLO itself took: mlir-hlo included a whole bunch of stuff, but the mhlo dialect itself clearly had a high value to a significant portion of the ecosystem. So it was forked out, stripped down to the bolts, and had features added to meet the ecosystem requirements that the original project charter was not prepared to take on. And now we have a “clump” that seems positioned to provide some nice value.

Trust me, I know it is hard to think about heading out on such a journey, but we’ve found it valuable.

I absolutely support various sub-communities developing their systems out-of-tree (like OpenXLA), for ROI reasons or other anchor points.
I am very concerned, however, when these efforts are used as reasons to prevent similar efforts that other community members are interested in driving within MLIR.

Yup, this is what we’re doing in tpp-mlir.

Awesome! I think we’re on the same page here.

Right, this thread’s intent was to get feedback from the community if that was a good idea, or if there was another effort that is happening already elsewhere that we needed to be aware of.

Yup. As you know, we’re already working with IREE. We’re also working with GPU teams to extend the idea, and we’re getting a minimal GPU pipeline on our side that later on we want to glue onto something like IREE.

The IREE integration is our way to keep this in check. We found some minor teething issues (that we were already expecting), nothing substantial.

We’ll keep “raising” tile to the point where it borders StableHLO. Once we do, the contact surface will be much clearer.

This thread is exactly the entry point of that community discussion. We can continue on our own until it matures, but we want to participate in upstream MLIR discussions about the impact of our work on the general MLIR ecosystem.

We want to organically change the existing MLIR dialects if the evidence is clear that it’s the right time / direction. We have already added features to linalg and tensor in that respect and we’ll continue working with the community to improve more.

But now that we’re thinking of having potentially higher dims than just tiles (to do tile-and-fuse), we’re reaching the arith/math-on-tensors discussion, which may have matured upstream enough to find a better path.

If it does, we want to be part of that discussion. If not, we can continue downstream until there’s more evidence / will, or until we just merge a new dialect that everyone wants anyway.


If we substitute “LLVM” for “MLIR” then I basically agree with you. I just don’t see MLIR itself as an island that should be expanded indefinitely in isolation as a first order principle.

I do think it is reasonable for the whole community to make such decisions when it comes to scope/charter expansion:

  • There is an unmet need that is sufficiently concrete and supportable as a direct development effort under LLVM.
  • There is an existing part of the ecosystem that would be valuable to bring into LLVM and continue development.
  • There is an existing part of the ecosystem that we think would be better developed within LLVM and we should set out to do so.

People will always disagree on those things in any given scenario, and I think it is important that when there is a conflict of interest by a party to the discussion, it is well acknowledged so that consensus of the community can balance that into its evaluation.

No party should be able to block something because it exists elsewhere. But it is also reasonable for the community as a whole to decide that on balance, duplication may not be warranted at a given point. In such cases, there will always be conflicts of interest, and I think it is important that those are taken into account vs silenced. In fact, some of the most valuable perspectives on such things probably come from people with a conflict (because users of an alternative are in the best position to offer first hand perspectives).

I’m being explicit not to argue but because, aside from that very first point, I think we agree and I’m cross checking vs debating.


Evolving upstream abstractions with downstream experience is great. I don’t think there should be a high bar on such evolution. I think it is the main way that existing things move forward.
