As a part of a larger research project, we want to build a custom compiler based on MLIR. We will most likely not change the MLIR source code, but implement our own dialect(s), types, operations, transformations etc. in our own codebase using MLIR, as outlined in the tutorials. I’m still relatively new to MLIR and LLVM, and I’m wondering which version of MLIR we should use.
Question 1) It might make sense to use an official release of the LLVM monorepo. However, since the release notes do not mention MLIR, I would like to ask what is the relation between LLVM releases and MLIR. Does a release of the monorepo contain an especially stable, consistently documented, etc. version of MLIR?
Question 2) Given that MLIR still seems to be a young and quickly evolving project, would you rather encourage using the latest commit on the main branch of the LLVM monorepo and upgrade to the new latest commit from time to time, e.g. when there are helpful new features or important bug fixes?
I see that this is generally a trade-off between (a) version stability and frequency of changes required to our code to reconcile it with breaking changes in MLIR, and (b) access to the latest improvements in MLIR. Thus, I do not expect a solution for my concrete case. However, if there are any best practices (e.g. to reduce the amount of work to adapt to breaking changes), it would be great if you could share them!
The releases do not affect stability or documentation. We try to keep MLIR building and green inside the monorepo continuously, but we do not run any more exhaustive testing for releases than we do otherwise. The same goes for documentation, although we do have the occasional doc sprint, which improves things more than general development does.
I’d recommend staying close to head and upgrading frequently, given the age of the project. I have done both kinds of syncing (syncing to head roughly every month, or doing it daily) and honestly found doing it frequently to be less painful. That way I knew which change had broken my build, rather than having to update call sites for multiple separate commits all in one go (I found this a lot more error-prone and unstructured). Besides, it also makes it easier to commit changes/improvements/analysis/… back.
Hi @pdamme. I expect you could learn a lot by following the development process of the mlir-npcomp or circt projects. These are LLVM projects, but exist outside of the monorepo. Both projects have an informal system for tracking updates to the LLVM source tree, which typically happen on a scale of days or weeks. The current ‘known good’ version of LLVM is stored in the project source tree as a git submodule reference. Generally speaking, MLIR is still changing frequently, so it would probably be better to update LLVM more often than once a release, but this can be tuned to your organization’s own tradeoff between stability and update cost vs. regular investment. I will say that in my personal experience, delaying updates like this beyond one release tends to increase the deferred cost of this maintenance to the point where it is hard to manage. Small regular updates (at least once a month?) can be a reasonable tradeoff. With some automation it’s possible to check for changes in a nightly build process, but this can run into overhead from normal daily churn.
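To make the ‘known good’ submodule idea a bit more concrete, here is a minimal Python sketch of how a nightly job might read the pinned LLVM commit and decide whether a bump is due. The submodule path `third_party/llvm-project`, the helper names, and the 30-day threshold are all assumptions for illustration; neither mlir-npcomp nor circt is claimed to do it this way. The only git behaviour relied on is that `git ls-tree` prints a submodule entry as `160000 commit <sha>`, a tab, then the path.

```python
import subprocess
from datetime import date


def parse_gitlink(ls_tree_line: str) -> str:
    """Extract the pinned commit SHA from one line of `git ls-tree` output.

    A submodule (gitlink) entry looks like: "160000 commit <sha>\t<path>".
    """
    meta, _path = ls_tree_line.rstrip("\n").split("\t", 1)
    mode, obj_type, sha = meta.split()
    if mode != "160000" or obj_type != "commit":
        raise ValueError("not a submodule (gitlink) entry")
    return sha


def pinned_llvm_commit(submodule_path: str = "third_party/llvm-project") -> str:
    """Return the LLVM commit pinned by the superproject's HEAD.

    The default path is hypothetical; use wherever your submodule lives.
    """
    out = subprocess.run(
        ["git", "ls-tree", "HEAD", submodule_path],
        check=True, capture_output=True, text=True,
    ).stdout
    return parse_gitlink(out)


def bump_is_due(last_bump: date, today: date, max_age_days: int = 30) -> bool:
    """True once the pin is at least `max_age_days` old (30 is an assumed cadence)."""
    return (today - last_bump).days >= max_age_days
```

A nightly job could call `pinned_llvm_commit()` from the superproject root, compare it against upstream, and only open a bump when `bump_is_due(...)` says the chosen cadence has elapsed, which keeps the daily churn out of the main branch.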
I would say that this is true for now, but as MLIR matures I hope that the releases will gain more stability and backports of bug fixes, in the same way that LLVM does.
I’d also add that if you ultimately rely on plugging in a JIT, then even without any backports for MLIR you still get all the testing done on the LLVM optimizer and backends in the release branches.
We’ve been doing what you want to do for a year now. Based on my experience, my suggestion in answer to Question 2 would be to update regularly, but not all the time, to the latest version of MLIR. We’re now doing it once a month on average. Expect to spend some time, every now and then, getting your code to compile again or re-learning how some piece of C++ interface or tool works. Technical questions here are answered quite rapidly, but you are expected to do your due diligence in finding answers first (again, this is my experience).
The dialects that are part of the MLIR distribution are well documented, and it was quite nice to develop on top of them. On the other hand, the second you want to touch things like TensorFlow, the level of support and documentation is quite low.
Thanks everyone for the quick and insightful responses! I will include the LLVM monorepo as a git submodule and try to update regularly. Thanks also for sharing your experience regarding the update frequency; I will try to find a good balance here.
Which is funny, as many of the same people who wrote the previous documentation, processes and support are involved. This is a bug that needs to be fixed. So far the focus there has been very much on very specific execution needs, and custom entry points have unfortunately been left up to the user. Those are evolving, but a couple of API designs were canned after review, which made presenting best practices there more difficult, as we wanted more stability first. The current ongoing effort seems more promising than the previous ~3.
I did not mean to criticize. Indeed, the difficulty of dealing with TF is not so much related to MLIR, but to the whole ecosystem of tools and transformations around it. The various dialects inherit the lack of documentation and sometimes clarity of this ecosystem.
Hi @stephenneuendorffer. From your experience, what is the best way to maintain the MLIR version, supposing that the LLVM version is fixed to a certain release, e.g. LLVM 12.0? MLIR evolves quickly and new features are under way, and we would not like to miss them. As you mention, the mlir-npcomp and circt projects are both good examples of using LLVM, but they keep updating along with LLVM. Thanks in advance.
Some changes in MLIR are due to changes in LLVM. If you want to keep LLVM fixed, you’ll need to be very selective in what you use (e.g., avoid the LLVM dialect and anything that uses OrcJIT) and/or be willing to update the LLVM version when such a change happens. I think it would be difficult to keep the LLVM version completely fixed (conceptually, one could have an MLIR LLVM 12 branch and just integrate MLIR changes that don’t require bumping the LLVM version, plus some local patches to smooth over LLVM changes, but that may be more work than updating the LLVM version, and you may have bulk changes when LLVM releases a new version).
Just an FYI from the npcomp side: we bump ad hoc as needed. I haven’t checked the history, but I would guess on a period of weekly to monthly. For a project being actively developed, weekly may be the sweet spot ime: slow enough to not have a ton of build/CI churn, and at about the limit of how long it is wise to carry local patches that impact the project. I’ve done a lot of these bumps and most are nearly trivial. Those that aren’t, you don’t want to defer anyway, because they tend to be large, mechanical changes that make it hard to bisect around (obscuring deeper problems). We usually keep an eye on the forum and the phab queue to keep a sixth sense of what these are and when they land. For disruptive changes to components in LLVM that we are also actively developing, I try to reach out to known early adopters and include “notes for downstream integrators” in patch descriptions (something that maybe shouldn’t be a policy but, as a courtesy, makes lives easier, imo, when changes can be foreseen).
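As an illustration of why a short range of upstream commits is easier to bisect, here is a minimal Python sketch of the binary search that a bisect performs. The function name, the `builds` callback, and the commit labels are hypothetical; the sketch also assumes that once the downstream build breaks it stays broken for every later commit in the range, which is exactly the assumption that large mechanical changes layered on top of other breakages tend to violate.

```python
from typing import Callable, Optional, Sequence


def first_breaking_commit(
    commits: Sequence[str], builds: Callable[[str], bool]
) -> Optional[str]:
    """Return the earliest commit for which `builds` is False, or None.

    `commits` is ordered oldest to newest and starts just after the last
    known-good pin. Assumes the build stays broken once it breaks.
    """
    lo, hi = 0, len(commits)
    # Invariant: every commit before index `lo` builds, and every commit
    # at index `hi` or later (if any) is broken.
    while lo < hi:
        mid = (lo + hi) // 2
        if builds(commits[mid]):
            lo = mid + 1
        else:
            hi = mid
    return commits[lo] if lo < len(commits) else None
```

The number of trial builds grows with the logarithm of the range, so the search itself stays cheap; the real cost of a long deferred bump is that several unrelated breakages overlap and the single-breakage assumption no longer holds.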
On the IREE project, we update on a special branch continuously as part of Google’s rolling update process. Then we down integrate patches to the main branch up to daily (but usually just a few times per week). Daily updates on an OSS project are too frequent, imo. The continual churn will kill you.
Finally, on the npcomp side, where we also track PyTorch head, we bump that to current nightly in the ci and fix things when it turns red. It probably makes sense to do that for llvm changes as well as things mature.
As Mehdi says, I expect the major llvm release cycle to be more load bearing for MLIR in the future, but if just guessing, I think we are still a couple of major versions away from that level of maturity.
To date, I think all of the MLIR plumbing in TensorFlow (with the exception of MHLO) has been at the level of internal implementation vs real API – and depending on it is quite painful and fragile. Further, since the entire build system and development methodology is bespoke, none of the work that gets done in the rest of the community (i.e. dialect sharing, APIs, etc.) is easily accessible. With the rest of the ecosystem, it is forks, real APIs, patches, etc. – I can rebase around things and glue them together. But the moment I have to touch TensorFlow, I just have to lock on to what they are doing and try to buffer around the instability. This is sadly also true of things that are not part of the monolithic core (i.e. TFLite, XLA, etc.) and should be independent components, but are just along for the ride of the development process that governs the whole.
It would be nice if TF defined things useful for integration and made them independently usable by integrators. It’s a real loss to have the most sophisticated user of the tech hamstrung by development process and code organization issues. Without movement in that direction, it is a hostile dependency to take.
No idea. We avoid Bazel as much as possible because it is incompatible with everything else in the ecosystem and integrates poorly with normal workflows – which are non-negotiable points for software intended to be used as a library. The alternatives are not paragons of technical virtue but have at least evolved to not create tech islands by using them.
(I’m not just saying inflammatory things for my health: these are all decisions that the TF team has made and could make differently, so I have some hope that users complaining enough could improve things)
I don’t think the cache invalidation is super Bazel-specific in this case. If you’re updating LLVM twice a day you’re going to invalidate a good chunk of the cache twice a day whether you’re using a Bazel mechanism like --disk_cache or a more standard one like ccache. So invalidating caches is certainly something to consider when deciding on an update frequency.
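To illustrate the point, here is a deliberately simplified Python model of a content-addressed compile cache. It is not how ccache or Bazel actually compute their keys; the function names and the key layout are assumptions. It only captures the shared idea that a cached result is keyed on the content of the inputs, so any translation unit that depends on a changed LLVM header gets a new key and misses the cache.

```python
import hashlib
from typing import Sequence


def digest(data: bytes) -> str:
    """Content hash of a single input file."""
    return hashlib.sha256(data).hexdigest()


def cache_key(source: bytes, header_digests: Sequence[str], flags: str) -> str:
    """Toy cache key for one translation unit (illustrative model only).

    The key covers the source, every header it depends on, and the flags,
    so changing any one of them yields a different key.
    """
    h = hashlib.sha256()
    h.update(digest(source).encode())
    for d in sorted(header_digests):  # order-independent over the header set
        h.update(d.encode())
    h.update(flags.encode())
    return h.hexdigest()


# Hypothetical inputs: one LLVM header before/after a bump, one local header.
llvm_old = digest(b"llvm header, before bump")
llvm_new = digest(b"llvm header, after bump")
local = digest(b"project-local header")

uses_llvm_before = cache_key(b"pass.cpp", [llvm_old, local], "-O2")
uses_llvm_after = cache_key(b"pass.cpp", [llvm_new, local], "-O2")
no_llvm_before = cache_key(b"util.cpp", [local], "-O2")
no_llvm_after = cache_key(b"util.cpp", [local], "-O2")
```

In this model `uses_llvm_before` and `uses_llvm_after` differ while the two `no_llvm_*` keys are identical, so only the parts of the project that actually include LLVM/MLIR headers are rebuilt after a bump. For an MLIR-based compiler that is typically most of the code, whichever caching tool is used.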
Yes, I really think that this is a more general issue than a Bazel-specific one.
How far the modularity of the (TF) targets can reduce the impact of these frequent updates is still unknown, as the thread there has no reply yet.