Thank you - this could graduate into a minimal out-of-tree Python sample once we smash the bugs.
I had some time to do some more triage/work on this and did make some progress. I believe we have two areas that need attention:
- General dynamic linking woes
- Facilities for embedding/distributing standalone MLIR Python based projects
## General dynamic linking woes
For now, I think we just need to be up front that dynamic linking in MLIR is known to be unreliable. One can usually wedge it into working on some platforms, but the mechanisms and organization are just not robust to all modes of use in the wild. I think that coming up with a more linker-friendly/robust way of handling `TypeID`s is 90% of the problem, but I am still annoyed because what is being done (while not the best idea) should work, yet breaks in a myriad of ways in practice. I have identified some more clues on this quest:
- The `--exclude-libs` flag used to build some shared libraries seems to be suppressing emission of some of the vaguely linked `TypeID` symbols. If removed, the situation improves but is still not great (i.e. binary size gets ~20% larger, a lot of symbols leak, and this only works on Linux, not macOS/Windows). I think this issue accounts for the majority of problems people have had in-tree with `TypeID` mismatches. For out-of-tree projects, whether it happens is entirely down to luck, based on what ends up in a header vs. in a static library, etc. When I repro'd, there was one specific `TypeID` that was not getting exported based purely on innocent source organization.
- Improper cross-module loading with `RTLD_LOCAL`. This is the default way that the Python interpreter loads its extensions. Our regular hack for this is to use `RTLD_GLOBAL`, which isn't great. This is distinct from the above, in that the libraries have the right SONAMEs, should be upgrading themselves to the global namespace when shared, and the vague `TypeID` symbols should resolve. Except they don't. No idea why.
- My theory that there was some path-canonicalization issue going wrong based on the RPATH has been invalidated: I simplified a broken case so there was no possibility of ambiguity, and it still failed with `RTLD_LOCAL`.
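For reference, the `RTLD_GLOBAL` hack mentioned above typically looks like the sketch below (Unix-only; the commented-out extension import is a hypothetical placeholder, not a real module name). CPython exposes the interpreter's `dlopen` flags via `sys.setdlopenflags`, so the workaround widens the flags around the import and then restores them:

```python
import ctypes
import sys

# Sketch of the usual RTLD_GLOBAL workaround (not great, as noted above):
# promote symbols of subsequently loaded extension modules into the global
# namespace so that vague-linkage symbols (like TypeID keys) resolve to a
# single definition across modules.
saved_flags = sys.getdlopenflags()
sys.setdlopenflags(saved_flags | ctypes.RTLD_GLOBAL)
try:
    # import _mlir  # hypothetical extension import would go here
    pass
finally:
    # Restore the defaults so unrelated imports keep RTLD_LOCAL semantics.
    sys.setdlopenflags(saved_flags)
```

This is exactly the kind of global, process-wide intervention we would rather not require of every downstream.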
At this point, I think we should just scrap the current linker-resolved `TypeID` mechanism and use something that is better rooted in a single shared library. That won't fix everything, but it will make everything fixable with local interventions to how things get linked.
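To illustrate the distinction (as a Python analogy, not actual MLIR code): identity derived from per-module emitted definitions breaks as soon as the "same" definition gets duplicated across shared objects, whereas an ID allocated by a single authoritative registry is stable by construction:

```python
# Conceptual sketch only. In the linker-resolved scheme, each shared object
# that inlines the TypeID definition can end up with its own copy, so
# identity comparison fails across modules. Modeled here with two classes
# standing in for two shared objects:
class _ModuleA:
    token = object()

class _ModuleB:
    token = object()  # duplicate emission: a *different* object

assert _ModuleA.token is not _ModuleB.token  # the mismatch failure mode

# A registry rooted in one place hands out each ID exactly once, so every
# "module" that asks gets the same identity:
_registry = {}

def get_type_id(name):
    # Allocate the ID on first request, in a single authoritative table.
    return _registry.setdefault(name, object())

assert get_type_id("some.dialect.op") is get_type_id("some.dialect.op")
```

The real fix is in C++ linkage, of course, but the shape is the same: one owner for the ID table rather than relying on vague linkage to deduplicate.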
## Facilities for embedding/distributing standalone MLIR Python based projects
The fact that we are having so many dynamic linking issues raises the question of why we are doing so much dynamic linking in the first place. While good for development and some narrow deployment cases, a lot of folks who are building MLIR-based projects really should be statically linking and making hermetic distributions that don't have C++/runtime dependencies between them. In other words, we should be aiming for a world where we can statically link all C++ dependencies across a few projects and export them as a single shared library for API/binding use. If we have that, then advanced use cases (e.g. deploying as part of an OS, Anaconda, etc.) can upgrade to a fully dynamically linked version if it suits them (and if we fix the above bugs), but presumably those cases have more control of their deployment environment and will use that in setting things up.
I think this doesn’t just apply to Python, but likely implicates other bindings.
To this end, here is a WIP of a multi-project package that I'm trying to assemble as part of IREE. This one pulls together Python bindings for MLIR core, MHLO, NPComp (TBD), and IREE's public dialects (TBD), because those are all useful for compiling and running with IREE and we need to keep them reasonably version-synced.
Here is the main commit that:
- Adds MHLO CAPI and Python bindings/build machinery
- Reworks the way that MLIR CAPI libraries are built so they are more usable out of tree
- Makes it configurable how to link the Python extension modules, so we can dynamically link them all against a mondo/hermetic dylib that exports the various C APIs.
The result works today without `RTLD_GLOBAL` hacks or `TypeID` errors (and didn't before the rework):
```python
from mlir.ir import *
import mlir_hlo

with Context() as context:
    mlir_hlo.register_dialects(context)
    print("It works")
```
This includes a bunch of changes across LLVM/mlir-hlo as part of my overall monorepo. I need to export those and upstream them individually (and clean them up). I'll also need to rework some downstreams (npcomp, circt) so that this mode of working is reliable as standalone projects, at build time, and when embedded. But I think this is the general shape of the solution.
There are also some more breakdowns we can do on the Python and MLIR CAPI side so that downstreams that don't need everything only pay for what they need (right now, you get it all: all dialects, the execution engine, etc.). The current size, with the execution engine for x86 (which comprises the largest part), is ~45MB on disk / 17MB zipped.
Anyway, it will be some weeks before I can finish this but I think this is the right direction. A previous iteration of this approach worked out of the box on Linux/MacOS/Windows, and I’m pretty sure this one will too when done.
Tagging a few folks specifically who have been on this journey: @mehdi_amini @jdd @mikeurbach @ftynse @GeorgeL @clattner