Lean sharing of MLIR dialects

Last year, when I wanted to share a prototype dialect I was creating, I had two choices: share my entire project or share the entire LLVM monorepo + my dialect. I’ve done both, really, and neither was good.

My project was banking on the idea that MLIR is highly composable, taking advantage of the existing dialects. However, either sharing strategy meant I could only reuse the standard dialects already upstream. If I wanted to reuse two other dialects, I’d have to somehow clone both their repositories and pull the files into some place to build them “correctly”, probably writing my own CMake kludge on the way.

Now, writing our own front-end, I’d like to maybe reuse things from FIR or CIL, and well, I’d love it if they could reuse from each other, too! FIR would be slightly simpler, as it’s in (will be in?) the monorepo (and we already clone that), but CIL isn’t. Sure, “once CIL gets into the monorepo”, but if that’s the only way to easily reuse, then this model seems broken.

Looking at most dialects, we put all logic (TD, headers, code, CMake) into a single directory and then include that from the parent’s CMake. This seems like a perfect encapsulation strategy for letting dialects live in their own projects (and repos), which other projects (like front-ends, optimisers) can use and reuse by just cloning the repo or adding it as a submodule on their own.

I’m not very proficient with CMake (by choice!), but wouldn’t some strategy like that be a more sensible “default dialect sharing” strategy? This could be prominent on the documentation, so that all developers out there know how to simply share their dialects, and “facilitate synergy between projects” (spoken like a true marketeer:).

Any thoughts on this? Am I missing something obvious?

Thanks!
Renato

We’ve had success building multiple dialects out of the LLVM tree (e.g. LLVM + npcomp + CIRCT + local code that depends on npcomp + CIRCT). With the right CMake (which isn’t necessarily obvious), this can be made to work. The biggest challenge, as I see it, is more of a non-technical one: MLIR is still changing relatively quickly, which makes change management extremely challenging. I think any solution to composability that doesn’t start with “there are certain parts of MLIR that are stable” is unlikely to be successful.

Right, I’m more concerned with dialects that aren’t in the monorepo. So, say I want to develop a dialect that borrows from another project without having to import the whole other project or requiring the other project to work directly off the monorepo.

If there’s a standard way of pulling a dialect into my project without any other dependency, then I can even do that with my own dialect (e.g. keep it in a different repo) in my own project, and make it easier to keep the dialect clean of my own dependencies, so it’s easier for other people to use my dialect.

This is pretty much the same stance as most of LLVM, so it’s a fair assumption.

Perhaps I’m reading it wrong, but is the proposal to move to something like

mlir/Dialect/foo/{include/foo, lib/foo}

And so instead of having dialects nested inside include/mlir/Dialect, have them in their own standalone directory (“dialect package”) so that inclusion via submodules is easier? (An alternative is something like GitHub - tensorflow/mlir-hlo, which is close, except you have one extra Dialect in the path, but it is easier to have a utils directory adjacent to the dialect.) And then the dialects’ CMake includes all subdirectories, so adding an upstream “dialect package” can be done without modifying core (well, excluding registering with the opt tool and the like, although folks may have their own ones there already).
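To make the “dialect package” idea a bit more concrete, here is a rough Python sketch of the discovery step that “include all subdirectories” implies. It is only an illustration: the function name `find_dialect_packages` is made up, and it assumes the `<name>/{include/<name>, lib/<name>}` layout described above. A real setup would do this in CMake rather than Python.

```python
from pathlib import Path
from typing import List


def find_dialect_packages(root: Path) -> List[str]:
    """Return names of subdirectories laid out as <name>/{include/<name>, lib/<name>}.

    Hypothetical helper: it only checks the directory layout, nothing else.
    """
    found = []
    for child in sorted(root.iterdir()):
        if not child.is_dir():
            continue
        name = child.name
        has_include = (child / "include" / name).is_dir()
        has_lib = (child / "lib" / name).is_dir()
        if has_include and has_lib:
            found.append(name)
    return found
```

Anything dropped into the dialects directory with that shape (for example as a submodule) would then be picked up without touching the parent’s build files.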

Building all at a revision where all builds would be important. But I see this as a transitive property: “leaf” dialects have to work at same revision where ones they depend on work down to core. So you may be limited by the slowest dialect dependency to update (I mean it is open source so one could send a revision :slight_smile: ).
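As a toy illustration of that transitive property, here is a small Python sketch. Everything in it is made up for the example: the dialect names, the dependency graph, and the use of plain integers as stand-ins for LLVM revisions (larger means newer).

```python
from typing import Dict, List, Set

# Hypothetical data: revisions each dialect is known to build against.
BUILDS_AT: Dict[str, Set[int]] = {
    "core": {100, 101, 102, 103},
    "foo": {100, 101, 102},
    "bar": {101, 102, 103},
    "leaf": {100, 101, 102, 103},
}

# Hypothetical dependency graph between dialects (must be acyclic).
DEPENDS_ON: Dict[str, List[str]] = {
    "core": [],
    "foo": ["core"],
    "bar": ["core"],
    "leaf": ["foo", "bar"],
}


def usable_revisions(dialect: str) -> Set[int]:
    """Revisions at which `dialect` and all its transitive dependencies build."""
    revisions = set(BUILDS_AT[dialect])
    for dep in DEPENDS_ON[dialect]:
        # A revision is only usable if every dependency also works there.
        revisions &= usable_revisions(dep)
    return revisions


print(sorted(usable_revisions("leaf")))  # [101, 102]
```

Here "leaf" itself builds everywhere, but "foo" has not caught up to 103 and "bar" never supported 100, so the leaf is limited to what its slowest dependencies allow.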

One thing is how we package the upstream dialects, and what you propose might be simpler if we just want to have them, but I don’t see how we can make a dialect work without having the monorepo, too. Some dialects have their own passes, which probably wouldn’t be moved inside mlir/dialect?

My idea is to decouple other projects (non-core) dialects from their own projects without needing to couple them to the monorepo. My first prototype had to be in the monorepo because of how the CMake files worked, then someone added a “standalone” dialect example, which I used to move the dialect to my own project, using the monorepo as a dependency. My current project does the same and that’s good enough for what we need (our CMake is very confusing, though).

But if I want to reuse someone else’s dialect (say RISE, CIL, etc) which are not in the monorepo (and may never be), then I don’t know how to merge the repositories. Some use the monorepo as a base, which means I need to check out multiple versions of LLVM, or I try a really horrible merge commit from multiple repos to get to a common base.

Having to get those dialects (whatever they are) into the monorepo means bureaucracy and potentially stale state (you get what was merged sometime in the past). It also means all other projects have to follow the same guidelines on how to use the monorepo (for example, CIL doesn’t, there’s a single commit). So, if the CIL repo could just be a single directory dialect/CIL and that gets merged into other users’ implementation, that’d be much nicer. The CIL team, then, would have another repo with their usage of their dialect.

Honestly, I’m not sure this is a good final state (having to force split repos on other projects), which is why I’m asking the larger community. I don’t really like submodules, but I like having to clone the monorepo multiple times even less. :slight_smile:

We can work on CMake-fu to make it easy to write an out-of-tree dialect that is using MLIR.
This is the setup that is done with GitHub - tensorflow/mlir-hlo, which is reused in mlir-npcomp and in IREE for example, and the monorepo is cloned/built once. This is also done with the standalone example in-tree. I agree it isn’t there yet in terms of “standardizing” this and making it easy to mix and match multiple “standalone” examples.

But ultimately you can’t really “decouple from the monorepo” because you need the tools in the monorepo to build your dialect and passes, and you need the core data structures.
Now you can build your dialect against a libMLIR.so built in the monorepo, but there’ll always be the problem that the project you want to reuse has to use the same version of LLVM that you’re using (assuming we continue to be heavily C++ based and don’t change the way we manage the project).

Some of these concerns have invaded my dreams quite a bit as I try to think through the practicalities. To jump straight to my conclusions:

  • Unless something changes regarding development process or implementation language choices, life at head outside of the monorepo will always incur a tax, since at best, downstreams are eventually consistent, making cross connections between them a fraught prospect.
  • It seems like we are still in the phase of the project where we are paying a lot of startup transients for scaling things in a new solution space. I think that some investment in infra and norms might help, but there is probably also just a wall-clock aspect to having a higher degree of stability.
  • Eventually, I feel that MLIR related components as part of stable/numbered LLVM releases can help synchronize the community for projects not at head, but it seems like we are ~1-2 major releases away from having something worth synchronizing to.
  • Actually making more of the infra installable/distributable to OS and language package management outlets can be a good forcing function for some of the regularity needed to reduce the taxes. This is why most of my effort on this stuff recently is in the context of MLIR python development packages and low level work on LLVM components with the goal of making the shared linkage story more robust (i.e. you can’t have multiple projects, even in a version locked way, effectively without having good library boundaries at install/deployment time).

One thing that I would like to see more of is for projects that expose “interface” dialects intended to be either a source or sink of some import or transformation pipeline (as opposed to things that are designed to be implementation details of a particular setup) to have such dialects isolated either in their own project or in a dedicated part of the build system that can be used independently. As it stands now, we usually just put such dialects with the rest of whatever project is hosting them, and that usually brings with it a large set of dependencies, complicated build setups, etc. Further, while such dialect definitions may be simple and rarely require updating as part of LLVM version bumps, the entire hosting project uses a lot more of the infra and often needs a lot of patches to adapt to upstream API changes.

Some of the dialects that I am aware of that may qualify as such a thing:

  • NPComp torch and aten dialects
  • IREE iree and vm dialects
  • Circt handshake dialect (I think – not super up to speed there)
  • EmitC dialect (already broken out in such a way)
  • TensorFlow’s mhlo dialects (already somewhat broken out in such a way but also currently co-mingled with all of the transformation pipelines which are much less stable)
  • TensorFlow’s tf_executor, tf and tfl dialects: Although I am not smart enough to understand how to untangle them from the TensorFlow infra that has in-grown into them.

Ideally for each of these, they could live in a separate CMake project inside of their host project and I could build them independently into a standalone .so/.dll, given an installed LLVM toolchain and set of dev tools.

Right, that’s an unnecessary goal as everyone that uses a dialect also uses MLIR and LLVM, so we all have a clone somewhere. The goal I was going for is to not depend on other people’s repos either, nor having to upstream my dialects to share with others, especially when it’s still under heavy development.

For more stable dialects we could focus on specific releases, but that wasn’t part of my goal either. I’m still relying on the fact that everyone has a clone of LLVM, but not everyone has the same commit checked out. We need some flexibility, but not total disconnection from MLIR.

@stellaraccident, you seem to have spent way more time than I did thinking about this. :slight_smile:

Your proposal for the ecosystem seems like a nice long term solution. But it still doesn’t solve the collaboration between projects before their dialects are stable enough to (for example) be upstreamed into the monorepo. My concern is at the stage where dialects are not ready yet, when research in different areas needs collaboration. Let me try a hypothetical example to explain what I mean.

Imagine a few different ongoing research projects on PDL, rewrite rules (like RISE), meta-MLIR (MLIR producing MLIR), language lowering (like CIL, FIR). Each of those has its own dialect, possibly in multiple versions and alternatives, on its own repos/branches. (For the sake of argument, let’s consider all of them to be in separate repos, not the monorepo.)

Now, I want to write a dialect (or a pass) that will make use of all of those dialects that are not in the monorepo. How do I go about importing them all into a single repo? Sure, I need the monorepo anyway, so I get all upstream dialects “for free” as well as the tools I need. But what about the others?

If another dialect is off the monorepo, I can perhaps add their repo as a git remote and work off branches (I’ve done that with RISE). If the repo has a single commit (like CIL) I need to clone the whole thing somewhere else. If their repos are not monorepo-based, I need to clone them, too. All of that for a few TD/H/CPP files.

Of course, I need to make sure they all build against a compatible LLVM version (which probably will be a nightmare), but if they’re isolated, I can have my own branches from their dialect repos with local patches to fix the build issues.

So, the two main problems with this scenario are:

  1. I’m cloning a lot of redundant stuff for a very small set of files. This not only wastes disk space, but slows down CI and makes the build system more complex.
  2. I have to handle each external dialect in completely different ways because there isn’t a standard way of sharing dialects. This also makes the build system more complex.

I think @mehdi_amini’s MHLO example is more or less what I’m talking about. If something like that would be the “encouraged” default for off-tree dialects, then it would be easier to share and reuse at an early stage. Perhaps we don’t even need to contribute them to the monorepo, if there’s a list of dialects (and a way to fetch them correctly at build time, for example).

For the dialects that you mention, the ones that are source/sink of transformation, I think it would be nice to have them in the monorepo eventually. But for niche ones, like HLO or RISE, it may be perfectly fine to forever be in another repo.

Makes sense?

Yes, absolutely. I’m very supportive of getting there, it is just a “non-trivial amount of work” that needs to happen…

I’m just happy to have someone else to talk to :slight_smile:

Yes, for this early phase experimentation, the MHLO example is a good one – we did follow that same basic pattern with a couple of out of tree projects. I think that if starting a new experimental assembly of things, the first thing that you need to do is create a repo where you pin the monorepo + the leaf dependent dialect projects, building them all as sibling out of tree projects. There’s not a good way out of submodule revision hell. In practice, in the past, if synchronizing with at least one Google repo, you could take advantage of the fact that most MLIR dependent Google repos are bumped up to a couple of times a day to consistent LLVM hashes – and picking one of those can ease the situation.
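To sketch what such a pinning repo might track, here is a minimal Python example. It is purely illustrative: the `Pin` record, the project names and the commit hashes are all placeholders I made up, and the check only flags projects whose expected LLVM commit differs from the one pinned for the monorepo.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Pin:
    name: str         # directory name of the sibling out-of-tree project
    commit: str       # commit of that project pinned in the sandbox repo
    llvm_commit: str  # LLVM commit that project is expected to build against


def skewed(llvm_pin: str, pins: List[Pin]) -> List[str]:
    """Names of projects whose expected LLVM commit differs from the pinned one."""
    return [p.name for p in pins if p.llvm_commit != llvm_pin]


# Placeholder names and hashes, not real projects or commits.
PINS = [
    Pin("dialect-a", "aaaa111", "f00dcafe"),
    Pin("dialect-b", "bbbb222", "f00dcafe"),
    Pin("dialect-c", "cccc333", "deadbeef"),
]

print(skewed("f00dcafe", PINS))  # ['dialect-c']
```

The projects reported here are the ones that would need a version bump or local patches before everything builds against the one pinned monorepo checkout.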

For this kind of experimental sandbox situation, git subtree also advertises that it may have a better solution, one that keeps you from needing to have N different upstream forks in order to apply local patches to some of them (which you will inevitably need to do in order to deal with version skew). I’ve never actually gotten over the activation energy to try this option.

This sounds like an improvement over submodules, but it sounds equally complicated to work with. Though, this may be the same as having a separate repo for your dialect: the slight increase in cost for one project might be a big reduction in cost for the ecosystem.