Summary
We propose a new MLIR dialect for distributed heterogeneous systems. The dialect introduces a schedule operation that groups task operations, each annotated with a target (e.g., cpu, gpu). It enables explicit orchestration, static analysis, and lowering to the MPI dialect.
Motivation
MLIR lacks a unified abstraction for coordinating computation across nodes and devices. Existing dialects (gpu, openmp) do not provide distributed task orchestration or communication.
Key Concepts
dhir.schedule {
  dhir.task target = "gpu" {
    // GPU work
    dhir.yield
  }
  dhir.task target = "cpu" {
    // CPU work
    dhir.yield
  }
}
task:
Encapsulates a unit of computation and binds it to a target
%task1 = dhir.task %cpu, %a, %b, %c
{
  scf.parallel %i = 0 to 20
  {
    %x = memref.load %a[%i]
    %y = memref.load %b[%i]
    %z = arith.addi %x, %y
    memref.store %z, %c[%i]
  }
}
schedule:
Orchestrates a group of tasks, declaring the targets and the data they operate on
dhir.schedule %a, %b, %c, %d
{
  %cpu = dhir.target{arch = "x86_64"}
  %gpu = dhir.target{arch = "sm_90"}
  %task1 = dhir.task %cpu, %a, %b, %c
  {
    scf.parallel %i = 0 to 20
    {
      %x = memref.load %a[%i]
      %y = memref.load %b[%i]
      %z = arith.addi %x, %y
      memref.store %z, %c[%i]
    }
  }
  %task2 = dhir.task %cpu, %a, %b, %d
  {
    scf.parallel %i = 0 to 20
    {
      ...
      memref.store %z, %d[%i]
    }
  }
  %task3 = dhir.task %gpu, %c, %d, %d
  {
    gpu.launch
    {
      %tid = gpu.thread_id x
      ...
      memref.store %z, %c[%tid]
    }
  }
}
Communication and barriers are modeled explicitly.
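As a rough sketch only (dhir.barrier and dhir.copy below are hypothetical placeholder ops, not part of the current proposal, and the surface syntax is illustrative), synchronization and data movement inside a schedule could appear as explicit operations that take task handles as operands:

dhir.schedule %a, %c, %d
{
  %cpu = dhir.target{arch = "x86_64"}
  %gpu = dhir.target{arch = "sm_90"}
  %task1 = dhir.task %cpu, %a, %c
  {
    // produce %c on the CPU
    ...
  }
  // hypothetical: block until %task1 has completed
  dhir.barrier %task1
  // hypothetical: explicit transfer of %c to the GPU target;
  // a lowering pass could turn this into MPI dialect send/recv
  // or a device copy, depending on where the target lives
  dhir.copy %c, %gpu
  %task2 = dhir.task %gpu, %c, %d
  {
    // consume %c on the GPU
    ...
  }
}

The intent is that making these edges explicit in the IR is what allows static analysis of the schedule and a mechanical lowering to the MPI dialect.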
Feedback Requested
- Is this abstraction useful and minimal?
- Any overlap with ongoing work in IREE, HPVM, or CIR?
Happy to share an early implementation. Looking forward to feedback!