The two main topics discussed at the round table were cost models / target descriptors and named operations in Linalg. We ran out of time (1h), so we had to stop there.
The examples below are not meant to be accurate; they just give an idea of what the design is. The actual implementation may end up very different and will be discussed in the respective PRs.
Cost model / Target Descriptor
There was reasonable consensus that we need a generic way to represent multiple targets in MLIR, and that cost models need to be the interface between those target descriptions and the information that the passes need. Our PR goes in that direction, so it's not a big departure.
The main delta, discussed after the round table, was how to represent that in MLIR. The PR itself uses the MLIRContext, which is global state, but after speaking with @mehdi_amini and @jpienaar, there is a much better way to do it: some TargetQueryAttribute in the module. This would not be a dictionary (a map of key/value pairs) on the IR itself, but an identifier describing how to acquire the information.
Composition can be made via lists of attributes, which can be used to override target descriptions (ex. TTI, JSON file, cmd-line options), to hold multiple targets (ex. CPU, GPU, XPU), etc. All of that can be encoded into different query attributes and query attribute lists.
Example:
module {
  // This creates an LLVM TTI using the X86 target and the rest as arguments
  target = #target.tti['x86_64-linux-gnueabi','sapphirerapids','amx']
}
module {
  // This parses a file and creates a run-time target (API TBD)
  target = #target.json['my-device.json']
  ...
}
module {
  // This allows two targets in the same IR, indexed by position
  target = #target.list[
    #target.tti['x86_64-linux-gnueabi','sapphirerapids','amx'],
    #target.json['my-device.json']
  ]
  // This is a CPU function
  func.func cpu_func(...) #target.id['0'] {
    ...
  }
  // This is an XPU function
  func.func xpu_func(...) #target.id['1'] {
    ...
  }
}
module {
  // This allows two targets in the same IR, indexed by string
  target = #target.dict[
    #target.id['CPU'],
    #target.tti['x86_64-linux-gnueabi','sapphirerapids','amx'],
    #target.id['XPU'],
    #target.json['my-device.json']
  ]
  // This is a CPU function
  func.func cpu_func(...) #target.id['CPU'] {
    ...
  }
  // This is an XPU function
  func.func xpu_func(...) #target.id['XPU'] {
    ...
  }
}
module {
  // This overrides the TTI info with JSON info
  target = #target.override[
    // This is the baseline
    #target.tti['x86_64-linux-gnueabi','sapphirerapids','amx'],
    // This overrides TTI on intersection, adds the rest
    #target.json['spr-special.json']
  ]
  ...
}
None of this is implemented yet, and the PR won't have all of it (just JSON for now), but this is the idea.
Named Operations in Linalg
The main discussions revolved around semantics. The consensus seems to be that we want strong, documented semantics, and to encode the lowering/generalization to match that semantics, not the other way around.
Currently, the named ops have forced generalizations, and those are what we use as semantics. Some of them will change; all of them will be documented on the website.
This is important so that front-ends have strong expectations and lower their implicit behaviour into explicit Linalg. Given that not all of them have the same expectations, we need to common them up in the same language.
The main semantic agreements are:
- Named ops will not have implicit casts (type, shape)
- Element-wise ops will require same types for input/output
- Matmul/conv will have existing appropriate type restrictions
- Broadcast will have to use linalg.broadcast on the appropriate operand
- Type cast / quantization should use appropriate quantization strategies
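As a rough sketch of what "no implicit casts" means in practice (syntax approximate, shapes and SSA names hypothetical), a broadcasted add would materialize the broadcast explicitly instead of relying on the elementwise op to broadcast an operand:

```mlir
// Hypothetical shapes; not a proposed spelling, just the idea.
// The broadcast is made explicit on the bias operand first:
%bcast = linalg.broadcast
    ins(%bias : tensor<16xf32>)
    outs(%init : tensor<8x16xf32>)
    dimensions = [0]
// The elementwise op then sees identical input/output types:
%sum = linalg.add
    ins(%x, %bcast : tensor<8x16xf32>, tensor<8x16xf32>)
    outs(%out : tensor<8x16xf32>) -> tensor<8x16xf32>
```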
None of this changes the linalg.generic operation, which continues to represent all of those casts as affine maps and/or arithmetic casts inside the region block.
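For contrast (again, syntax approximate and shapes hypothetical), the same broadcast-plus-cast stays implicit in a linalg.generic: the indexing map drops a dimension for the broadcast operand, and the type cast is an arith op inside the region:

```mlir
// Broadcast via the indexing map, cast via arith inside the region.
#map_bcast = affine_map<(d0, d1) -> (d1)>
#map_id    = affine_map<(d0, d1) -> (d0, d1)>
%r = linalg.generic
    {indexing_maps = [#map_bcast, #map_id],
     iterator_types = ["parallel", "parallel"]}
    ins(%bias : tensor<16xf16>)
    outs(%out : tensor<8x16xf32>) {
  ^bb0(%b: f16, %o: f32):
    %c = arith.extf %b : f16 to f32  // arithmetic cast in the region
    linalg.yield %c : f32
} -> tensor<8x16xf32>
```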
Another important discussion was around a grouping op. Currently we have scf.execute_region, which can already group based on the idea of a "target" or "thread block", and it could be used with the target descriptors discussed above (for some operations in a region, not all).
But that doesn't translate into tiling and fusion opportunities. When using named ops, being able to fuse ops at multiple nest levels, then tile them (as a group), then fuse again is very powerful. We'll need guarantees about what can be put into those regions (thus, we can't use scf.execute_region), for example, only ops that implement the TilingInterface, or something similar.
A good example in the discussion was linalg.softmax. How we lower it will define how we tile & fuse. For example, we can teach reduction/broadcast to be fusable, or we can split the lowering into three groups: pre-reduction, reduction/broadcast, and post-broadcast, so that we tile and fuse the first group with the producer and the last group with the consumer, and do special lowering for the middle one.
Other topics
There were other topics that were not discussed in the round table but are also important; we should work on them soon. Most of those I have discussed separately throughout the conference and can update on later.
- ML-guided optimization in sync with cost models, target descriptors, etc. (this was a bigger topic at CGO than at EuroLLVM).
- Packing for CPU extensions (we want to upstream our code in tpp-mlir)
- Linalg to GPU lowering upstream (we want to upstream our code in tpp-mlir)
- Pipeline composition, deps, canon, ordering, multiple downstreams (we want to expose compilers like IREE to various upstream/downstream passes)
- Temporary buffers, memory address space, shared memory, arena allocators, stack scope and other memref/vector allocation techniques to expose software pipelining across multiple threads, where not all of them do the same thing (@matthias-springer ?)
- Vector layout for GPUs and CPU extensions (this was addressed by @Groverkss at EuroLLVM)
- Transform schedules, multi-versioning (being addressed by @aniragil [PR] @martin.luecke Rolf Morel)
@ftynse @nicolasvasilache @stellaraccident
[Edited to make clear we don't want to use the current quant dialect, but some explicit quantization semantics, be it its own dialect or inside Linalg]
[Edited to add multiple types of target composition]