[RFC] An MLIR Dialect for Distributed Heterogeneous Computing

Summary

We propose a new MLIR dialect for distributed heterogeneous systems. The dialect introduces a schedule operation that groups task operations, each annotated with a target (e.g., cpu, gpu). It enables explicit orchestration, static analysis, and lowering to the MPI dialect.


Motivation

MLIR lacks a unified abstraction for coordinating computation across nodes and devices. Existing dialects (gpu, openmp) do not model distributed task orchestration or inter-node communication.


Key Concepts

A schedule groups tasks, each pinned to a target:

dhir.schedule {
  dhir.task target = "gpu" {
    // GPU work
    dhir.yield
  }
  dhir.task target = "cpu" {
    // CPU work
    dhir.yield
  }
}

task: Encapsulates computations with a target

 %task1 = dhir.task %cpu, %a, %b, %c
 {
     scf.parallel %i = 0 to 20 
     { 
       %x = memref.load %a[%i]
       %y = memref.load %b[%i]
       %z = arith.addi %x, %y
       memref.store %z, %c[%i]
     }
 }

schedule: Orchestrates tasks

dhir.schedule %a, %b, %c, %d 
{
 %cpu = dhir.target {arch = "x86_64"}
 %gpu = dhir.target {arch = "sm_90"}
 %task1 = dhir.task %cpu, %a, %b, %c
 {
     scf.parallel %i = 0 to 20 
     { 
       %x = memref.load %a[%i]
       %y = memref.load %b[%i]
       %z = arith.addi %x, %y
       memref.store %z, %c[%i]
     }
 }

 %task2 = dhir.task %cpu, %a, %b, %d
 {
     scf.parallel %i = 0 to 20 
     {
       ...
        memref.store %z, %d[%i]
     }
  }

 %task3 = dhir.task %gpu, %c, %d, %d
 {
     gpu.launch  
     {
       %tid = gpu.thread_id x
       ...
        memref.store %z, %d[%tid]
     }
  }

}

Communication and barriers are modeled explicitly.
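For illustration, explicit movement and synchronization inside a schedule might look like the sketch below; the dhir.send and dhir.barrier spellings are placeholders, not finalized ops:

 dhir.schedule %a, %b
 {
  %cpu = dhir.target {arch = "x86_64"}
  %gpu = dhir.target {arch = "sm_90"}
  %task1 = dhir.task %cpu, %a { ... }
  dhir.send %a, %gpu   // explicit data movement, no hidden copies
  dhir.barrier         // explicit synchronization point
  %task2 = dhir.task %gpu, %a, %b { ... }
 }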


Feedback Requested

  • Is this abstraction useful and minimal?
  • Any overlap with ongoing work in IREE, HPVM, or CIR?

Happy to share an early implementation. Looking forward to feedback!



This is a really cool idea, thanks!

I’ll let @fschlimb and @teekarna dissect the distribution part, and @rolfmorel help with the DLTI and cost model side, but I have some questions:

How do you lower this to different hardware? Split into submodules? Create separate modules? Or would this need changes to the lowering to have multiple targets?

Can you call external functions from inside the tasks? If two tasks run on the same target, can one call the other? Or functions inside each other?

The arch part really should be done in DLTI, have you thought about that? Would be nice to combine (perhaps later) the DLTI structure and the existing GPU attributes like the ones you use.

Is it assumed that the CPU “task” will be the orchestrator? What if we had more than one CPU architecture? Say, a RISC-V orchestrator, an XeonPhi accelerator and an NVidia GPU, all in one system?

Finally, this has a big opportunity to establish cross-target cost models. Not just compute speed, but latency constraints, frequency matching, thermal concerns and a lot more. Have you looked into that kind of optimization strategy in your work?


Thanks so much for the thoughtful feedback!

We lower dhir to the MPI dialect, which acts as an intermediate coordination layer. From there, we lower to LLVM IR. While the current MPI dialect is still evolving, we found that with minimal enhancements, such as adding lowering support for mpi.comm_size and mpi.barrier, we were already able to generate the following intermediate representation and successfully lower it to LLVM IR (simplified slightly for brevity and clarity):

func @genmpi(%a, %b, %c, %d) {
  %comm = mpi.comm_world
  %rnk = mpi.comm_rank
  %sz = mpi.comm_size

  // task1 executes on rank 0
  %p0 = arith.remsi 0, %sz
  %cnd1 = arith.cmpi eq, %rnk, %p0
  scf.if %cnd1 {
    scf.parallel %i = 0 to 20
    {
      %x = memref.load %a[%i]
      %y = memref.load %b[%i]
      %z = arith.addi %x, %y
      memref.store %z, %c[%i]
    }
  }

  // task2 executes on rank 1 and sends its result to rank 0
  %p1 = arith.remsi 1, %sz
  %cnd2 = arith.cmpi eq, %rnk, %p1
  scf.if %cnd2 {
    // loop body elided
    mpi.send %d 0
  }
  mpi.barrier(%comm)

  // task3 executes on rank 0 after receiving %d from rank 1
  %p3 = arith.remsi 0, %sz
  %cnd3 = arith.cmpi eq, %rnk, %p3
  scf.if %cnd3 {
    mpi.recv %d 1
    gpu.launch {
      // kernel body elided
      gpu.terminator
    }
  }
}

This approach lets us cleanly decouple distributed orchestration from device-level code generation, and we believe the MPI dialect can serve as a solid foundation for backend-independent lowering of distributed programs.

Yes, tasks can call external functions within their body, just like any other region-based operation. However, tasks cannot call other tasks or recursively invoke themselves. Instead, data dependencies between tasks are expressed by passing outputs from one task as inputs to another.

This design enforces explicit scheduling and isolation, which simplifies analysis and enables more predictable execution and distribution strategies.
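As a sketch (illustrative syntax; @saxpy_ext is a made-up external function), an external call inside a task and an inter-task dependency would look like:

  func.func private @saxpy_ext(memref<20xi32>)

  %task1 = dhir.task %cpu, %a, %c
  {
    // calling an external function from the task body is allowed
    func.call @saxpy_ext(%a) : (memref<20xi32>) -> ()
  }
  // no direct task-to-task calls: %task2 depends on %task1 only through
  // the buffer %c that appears in both operand lists
  %task2 = dhir.task %gpu, %c, %d { ... }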

That’s a great point. Thanks for highlighting it. We agree that target architecture information should be handled through DLTI for better consistency and extensibility.
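For instance, the target descriptions could move onto the module via DLTI along these lines (syntax approximate; the exact attribute spelling depends on the MLIR revision):

  module attributes {
    dlti.target_system_spec = #dlti.target_system_spec<
      "CPU" = #dlti.target_device_spec<"arch" = "x86_64">,
      "GPU" = #dlti.target_device_spec<"arch" = "sm_90">
    >} {
    // dhir.target ops would then be resolved against these entries
  }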

Since we lower dhir.schedule to the MPI dialect, orchestration is not hardcoded to a specific CPU architecture. Instead, the orchestrator can be any participating node in the MPI program, be it a RISC-V CPU, a XeonPhi, or any other architecture, depending on how the MPI processes are launched and mapped.

Yes, that’s exactly what we’re actively exploring. We’re currently working on integrating cross-target cost models that go beyond compute performance to include factors like communication latency, memory bandwidth, frequency mismatches, and thermal constraints. The goal is to use these models to drive more intelligent task placement and scheduling decisions across heterogeneous devices.
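As one possible direction (the attribute names below are hypothetical), the cost model could consume richer target descriptions:

  // hypothetical attributes a scheduler could weigh when placing tasks:
  // memory bandwidth, interconnect latency, and a thermal budget
  %gpu = dhir.target {arch = "sm_90", mem_bw_gbps = 3350 : i32,
                      link_latency_us = 5 : i32, tdp_watts = 700 : i32}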

Thank you so much for the thoughtful response. Now that we’ve shared more context and details about the project (which I realize I missed earlier), we’d really appreciate any further feedback or suggestions you might have!


Very nice, thanks @johnmaxrin!

We should make sure this dialect composes with the mesh dialect. It should expose the right operations to be able to map the high-level sharding/partitioning “directives” from the mesh dialect to this kind of dynamic scheduling.

At this point the MeshToMPI pass does similar things. It does not use “task” as an abstraction, though. Instead it directly translates operations on shards to SPMD code. It seems like a good idea to put a layer in between to also allow mapping mesh to scheduling runtimes. With the right semantics it should be possible to define the right mappings between sharding, tasks, MPI and dynamic schedulers.

Do you have any thoughts on the relation to the mesh dialect (and its accompanying passes)?

Just a minor comment: there is a rudimentary lowering available from MPI to LLVM (see MPIToLLVM.cpp), including comm_rank and comm_world.


Looks like this could also go through the regular lowering, creating a separate module for each target, and then use a runtime with proper synchronization for the scheduling code. I think, if this is combined with a nice-and-easy communication interface, it could save some headache when generating code for multiple platforms at once (and especially when adding new ones later on), even without MPI. Have you considered a declarative schedule for this, by any chance?

Would be cool to combine this with some memory placement primitives + optimizations someday (as in “here’s my task and here’s my HW/cost models; I want the task scheduled to the best compute available using XYZ memory constraints”).

Thank you for this proposal! I have some high-level questions first following the guidelines on contributing new dialects:

  • What is the overall goal of the dialect? Note that we don’t look for minimality, in my experience quite the opposite, but we do prefer small incremental contributions.
  • What is the community that this dialect serves?
  • Who is going to contribute to and maintain this dialect?

This is not in the guidelines, though I’m thinking of proposing a change to put it there, but I’d also ask:

  • What is the problem this dialect is solving / how does MLIR or a broader community benefit from this dialect existing?

Regarding your question, this does look sort of similar to the abstractions available in the Flow and Stream dialects in IREE. Those are both significantly larger and more oriented towards tensor compilers. You should also look at the async dialect, which conceptually can be seen as operating on tasks. A specific design point should be to see whether a new dialect contributes something that could not easily be achieved by extending async and DLTI.
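For reference, a task in the async dialect looks roughly like this (real async ops; note there is no natural place to attach a target or architecture):

  %token, %result = async.execute -> !async.value<f32> {
    %cst = arith.constant 1.0 : f32
    async.yield %cst : f32
  }
  async.await %token : !async.token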

This RFC was discussed at the Area Team meeting today. The Area Team identified a lack of consensus on this RFC so far. @johnmaxrin is encouraged to respond to the questions from the community and, in particular, to actively engage with @fschlimb on the potential intersection and reuse between the proposed dialect and the mesh dialect. We would also like to get an estimate of the amount of engagement and support that is expected for this dialect. The Area Team will revisit this RFC in the next meeting.

Thanks so much @fschlimb for the kind words and helpful pointers!

You’re absolutely right that making dhir interoperable with the mesh dialect would be very valuable. We’ve been thinking along similar lines, and your comment helped us see it more clearly. We could add a MeshToDHIR pass.

The idea would be to lift mesh collectives and sharded operations into explicit dhir.tasks, using the mesh topology to guide task creation.
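A rough illustration of the lifting we have in mind, assuming a hypothetical MeshToDHIR pass (the mesh input is simplified, and the dhir output is sketch syntax, not settled):

  // input: a mesh collective
  %sum = mesh.all_reduce %local on @mesh0 mesh_axes = [0]
           : tensor<4xf32> -> tensor<4xf32>

  // output sketch: one dhir.task per shard; the reduction becomes
  // explicit communication between tasks, guided by @mesh0's topology
  %t0 = dhir.task %dev0, %local0, %partial0 { ... }
  %t1 = dhir.task %dev1, %local1, %partial1 { ... }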

Would love to discuss further how best to define this pass between mesh and dhir. Your experience with the mesh abstraction will be invaluable in shaping this.

Thanks @kurapov-peter so much for the thoughtful comment!

We hadn’t seriously considered a runtime-agnostic communication API before, but that direction is starting to look very promising. It could give dhir a lot more flexibility, especially on platforms where MPI isn’t ideal.

We also hadn’t explored a declarative schedule model yet, but now that you mention it, we see the value, particularly for enabling better optimization, cost-model-driven decisions, and potentially even user-guided scheduling. Definitely worth spending time on!

Right now, we’re focusing on building a cost-model-based scheduler, but your suggestion to eventually combine that with memory placement primitives and constraints is really helpful. That kind of integration could make dhir a much more powerful coordination layer across diverse hardware.

Thanks again. This gives us some solid ideas to explore further!

Goal of the Dialect

The core goal of dhir is to orchestrate distributed, heterogeneous execution in MLIR across CPUs, GPUs, and potentially other accelerators. We aim to offer a mid-level coordination layer that complements lower-level dialects and allows users or tools to define and analyze multi-device execution in a unified way.

Community Served

We believe this dialect would benefit:

  • HPC and systems researchers exploring compiler-guided task scheduling across nodes/devices,
  • Teams building domain-specific compilers for scientific workloads, where cross-node coordination matters,
  • And even MLIR compiler builders who want a more explicit orchestration layer above existing compute dialects.

We’re also seeing interest from those working on ML training compilers for multi-node clusters (e.g., TPU pods, GPU superclusters), where orchestration and communication need to be first-class.

Who Will Maintain It

At the moment, we (myself and collaborators from PACE LAB, IIT Madras) are actively working on the design and implementation. We’re committed to maintaining it, improving integration with existing MLIR infrastructure, and aligning with upstream feedback.

What Problem It Solves

dhir addresses the lack of an explicit orchestration layer in MLIR for distributed heterogeneous systems.

While dialects like async, flow, and stream handle execution order, streaming, and partitioning well (especially in tensor-heavy settings like IREE), they don’t currently model:

  • Target-specific task encapsulation
  • Architecture-aware scheduling
  • Cross-device coordination that isn’t tightly tied to a specific runtime.

Thanks for the update and for taking the time to review the RFC during the Area Team meeting. I’ve responded to all open comments so far.

Regarding expected engagement and support: this RFC is primarily exploratory. Our goal in putting it out was exactly to understand whether the community sees value in having a dialect like dhir.

At this stage, we don’t have confirmed adopters or downstream users who’ve committed to using the dialect, but we believe it could serve needs in distributed heterogeneous systems, particularly in HPC and research-oriented compiler stacks. We’re targeting communities that work with task graphs, heterogeneous devices, and want to reason about cross-device orchestration more explicitly.

That said, we do have active developers (including myself and collaborators) who are committed to building and maintaining this dialect going forward.

Looking forward to the next review, and happy to iterate further based on the team’s feedback.

Thank you for the revision and for your offer to contribute and help maintain this work.

Please note that the mesh dialect does include communication primitives that are runtime-agnostic; it’s just that currently only a single lowering (to MPI) exists. There has been some discussion about separating these primitives from the mesh dialect. If another dialect starts using them, we may want to revisit alternative approaches.

I see that the proposed dialect introduces new capabilities, but it would be helpful to clarify these more precisely. There appears to be significant overlap with both mesh and async. It may be worth considering whether these new capabilities could instead be added to mesh and/or async. A clear separation of concerns between the three would be highly desirable—and perhaps even a rethinking of how these responsibilities are divided might be appropriate.

One of the major extensions introduced by dhir, as I understand it, is the ability to define task groups that can perform potentially blocking collective communication. This seems like a natural extension to async.

Similarly, device assignment could also potentially fit within async.
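For example (purely hypothetical syntax, not an existing async feature), a device attribute on async.execute might be all that task-level device assignment needs:

  // hypothetical: async.execute carrying a device annotation that a
  // runtime or lowering could use for placement
  %token = async.execute attributes {device = "sm_90"} {
    // ... task body ...
    async.yield
  }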

What are your thoughts on this?