Checking types between different python extensions

In an out-of-tree project, we have a custom type that implements MemRefElementTypeInterface, which should allow it to be used inside a MemRefType. We also have our own Python extension for accessing our custom types, separate from the MLIR Python extension.

When we call MemRefType.get() with our custom type, it fails, saying that the type can’t be used with MemRef, even though the same thing works in C++. The failing check on the Python side can be found here; specifically, the call to type.isa<MemRefElementTypeInterface>().

It’s worth noting that this check happens in the MLIR Python extension, while the type comes from our Python extension, and it seems that the MemRefElementTypeInterface from the two extensions is considered to be different.

Since Python extensions are meant to be self-contained and to avoid dynamic linking, two Python extensions would normally each have their own MemRefElementTypeInterface, yet we should still be able to perform this check across extensions.

We believe the problem is that both Python extensions incorporate the MLIR libraries through static linking, leading to multiple implementations of the functions used to retrieve the unique instances for types. Could we do the linking dynamically instead? Or is there some other workaround?

How does your extension link to MLIR? The expected way is for both the MLIR extension and all custom extensions to dynamically link against libMLIR.so. Linking for Python is not battle-tested, and it’s a hard problem; @mehdi_amini and @stellaraccident have looked into it.

The isa infrastructure is based on MLIR’s TypeID implementation, which is essentially the address of a static variable defined once per class. If your process ends up with two copies of MemRefElementTypeInterface, they will have different TypeIDs, which makes isa succeed or fail depending on which copy gets used in a particular context.
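
To make this concrete, here is a minimal self-contained sketch of the scheme (a simplified stand-in, not MLIR’s actual TypeID implementation): the identity of a class is the address of a per-class static, so two statically linked copies of the same class yield two different identities.

#include <cstdio>

// Simplified stand-in for MLIR's TypeID: identity is the address of a static.
struct FakeTypeID {
  const void *storage;
  bool operator==(FakeTypeID other) const { return storage == other.storage; }
};

template <typename T>
FakeTypeID getFakeTypeID() {
  static char instance; // one per T, per copy of this instantiation in the process
  return FakeTypeID{&instance};
}

struct SomeInterface {}; // placeholder for e.g. MemRefElementTypeInterface

int main() {
  // Within a single image this always agrees...
  bool same = getFakeTypeID<SomeInterface>() == getFakeTypeID<SomeInterface>();
  std::printf("same = %d\n", same); // prints same = 1
  // ...but if a second extension statically links its own copy of this code,
  // its `instance` lives at a different address, so the comparison (and hence
  // isa<>) fails whenever the two copies meet across the extension boundary.
  return 0;
}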

@clattner, @GeorgeL and I have an offline thread where we have been discussing a more robust approach to TypeID management, which, among other things, should make some kinds of additional linking flexibility safer (an offshoot of Storing a pointer in MLIRContext - #7 by clattner). If implemented, this would help with a number of bugs we currently face when dynamically linking against the core MLIR libraries.

However, there is currently no planned/known way for two unrelated entities that statically link the core MLIR libraries to interoperate: they will each create their own type universes (and fork access to other statics). If you manage to smuggle C++ pointers between them, I don’t expect it to be a great outcome.

As Alex says, there aren’t a lot of good answers here, and more mature projects also struggle with this when trying to fit within the confines of PyPI-deployable wheels (conda has better support for these kinds of complex deployments). We don’t have any immediate plans to publish the MLIR Python bindings this way, because of this uncertainty and their positioning as a building block, not an end in themselves.

Leaving aside deployment philosophy, making it possible to embed/private-label the MLIR bindings into an external project is something I want to support (and will need for my own projects in ~months). This will require some reorganization, but it was planned for, and it would let you have a myproject.mlir.ir, etc. All of the end-user-facing projects I know of that use these bindings will need that (versus just assuming that the single top-level mlir package was built to their specifications).

There is a path for these bindings to be an independently deployable asset, but it would take someone with a real motivation to do that, and so far that person has not emerged :slight_smile:


ldd shows me that it’s dynamically linked to libMLIRPublicAPI.so.13git

Even for me, deployment isn’t the priority here, as long as I can get something working in my own environment.

Maybe I can share a minimal project showing the issue I’m talking about, and we can discuss what should be fixed for it, so that I can help with it?

Then this is likely the fragility issue I referred to with the current TypeID implementation. There are some subtle ways that these libraries can be loaded multiple times, resulting in mismatched types. Why that happens isn’t yet fully understood and seems related to ELF linking arcana, some of the specific ways that LLVM sets this up, the way the rpaths are configured, and how Python loads modules with RTLD_LOCAL.

One of these workarounds usually gets folks going (I’d love a solid root cause on this but don’t have one):

I’d love a root cause on this. It affects quite a number of folks.

Please. I’m on vacation for a couple of weeks and likely can’t respond, but there are other folks here who have wrestled with it and can at least point at what they’ve done to get it working.

I have set up a minimal project that replicates the issue. It contains a custom dialect with Python bindings; the dialect has a type that can be used in a MemRef, and that’s the heart of the issue.

Thank you - this could graduate into a minimal out-of-tree python sample once we smash the bugs.

I had some time to do some more triage/work on this and made some progress. I believe we have two areas that need attention:

  1. General dynamic linking woes
  2. Facilities for embedding/distributing standalone MLIR Python based projects

General dynamic linking woes

For now, I think we just need to be up front that dynamic linking in MLIR is known not to be reliable. One can usually wedge it into working on some platforms, but the mechanisms and organization are just not robust to all modes of use in the wild. I think that coming up with a more linker-friendly/robust way of handling TypeIDs is 90% of the problem, but I am still annoyed because what is being done (while not the best idea) should work, yet breaks in a myriad of ways in practice. I have identified some more clues on this quest:

  1. The --exclude-libs flag used to build some shared libraries seems to be suppressing emission of some of the vague-linkage TypeID symbols. If removed, the situation improves but is still not great (i.e. binary size gets ~20% larger, a lot of symbols leak, and this only works on Linux, not macOS/Windows). I think this issue accounts for the majority of problems people have had in-tree with TypeID mismatches. For out-of-tree use, whether it happens comes down entirely to luck based on what ends up in a header versus in a static library, etc. When I repro’d, there was one specific TypeID that was not getting exported just based on innocent source organization.
  2. Improper cross-module loading with RTLD_LOCAL, which is the default way that the Python interpreter loads its extensions (see the dlopen sketch after this list). Our regular hack for this is to use RTLD_GLOBAL, which isn’t great. This is distinct from the above, in that the libraries have the right SONAMEs, should be upgrading themselves to the global namespace when shared, and the vague TypeID symbols should resolve. Except they don’t. No idea why.
  3. My theory that there was some path canonicalization going wrong based on the RPATH has been invalidated. I simplified a broken case so there was no possibility of ambiguity, and it still failed with RTLD_LOCAL.
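
To make item 2 concrete, here is a rough sketch (with hypothetical file names, not the bindings’ actual loader code) of what happens at the dlopen level: CPython loads each extension module with RTLD_LOCAL by default, and the usual workaround is to switch to RTLD_GLOBAL so later loads bind against the already-loaded copies of the shared C++ symbols.

#include <dlfcn.h>
#include <cstdio>

int main() {
  // Approximately the default CPython behavior for extension modules:
  // RTLD_LOCAL keeps each module's dependency tree in a private symbol
  // namespace, so _mlir and a downstream _foo extension can end up with
  // separate copies of the same C++ statics (TypeIDs included).
  void *mlirExt = dlopen("./_mlir.so", RTLD_NOW | RTLD_LOCAL); // hypothetical names
  void *fooExt = dlopen("./_foo.so", RTLD_NOW | RTLD_LOCAL);
  // The usual (unsatisfying) workaround is to load with RTLD_GLOBAL instead,
  // so symbols from the first load are visible to every later load:
  //   dlopen("./_mlir.so", RTLD_NOW | RTLD_GLOBAL);
  if (!mlirExt || !fooExt)
    std::fprintf(stderr, "dlopen failed: %s\n", dlerror());
  return 0;
}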

At this point, I think we should just scrap the current linker-resolved TypeID mechanism and use something that is better rooted in a single shared library. That won’t fix everything, but it will make everything fixable with local interventions to how things get linked.

Facilities for embedding/distributing standalone MLIR Python based projects

The fact that we are having so many dynamic linking issues begs the question of why we are doing so much dynamic linking in the first place :slight_smile: While good for development and some narrow deployment cases, a lot of folks building MLIR-based projects really should be statically linking and making hermetic distributions that don’t have C++/runtime dependencies between them. In other words, we should be aiming for a world where we can statically link all C++ dependencies across a few projects and export them as a single shared library for API/binding use. If we have that, then advanced use cases (i.e. deploying as part of an OS, anaconda, etc.) can upgrade to a fully dynamically linked version if it suits them (and if we fix the above bugs), but presumably those cases have more control over their deployment environment and will use that when setting things up.
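
As a sketch of the mechanism (hypothetical names, not the actual IREE/MLIR build rules): the single distribution shared library statically links all of the C++ code with hidden visibility and exports only the C API that the Python extension modules call, so the process sees exactly one copy of every C++ static.

// mondo_capi.cpp -- hypothetical sketch of the single exported shared library.
// Build as a shared object with -fPIC -fvisibility=hidden, statically linking
// all of the MLIR/C++ dependencies into this one .so; only the extern "C"
// symbols marked "default" below remain visible to the dynamically linked
// Python extension modules.
#include <cstdio>

extern "C" __attribute__((visibility("default")))
void myprojRegisterEverything() { // hypothetical exported C API entry point
  std::printf("registered through the single C export library\n");
}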

I think this doesn’t just apply to Python, but likely implicates other bindings.

To this end, here is a WIP of a multi-project package that I’m trying to assemble as part of IREE. This one pulls together Python bindings for MLIR core, MHLO, NPComp (TBD), and IREE’s public dialects (TBD) because those are all useful for compiling and running with IREE and we need to keep them reasonably version synced.

Here is the main commit that:

  • Adds MHLO CAPI and Python bindings/build machinery
  • Reworks the way that the MLIR CAPI libraries are built so that they are more usable out of tree
  • Makes it configurable how the Python extension modules are linked, so we can dynamically link them all against a mondo/hermetic dylib that exports the various C APIs.

The result works without RTLD_GLOBAL hacks or TypeID errors today (and didn’t before the rework):

from mlir.ir import *
import mlir_hlo

with Context() as context:
  mlir_hlo.register_dialects(context)
print("It works")

This includes a bunch of changes across LLVM/mlir-hlo as part of my overall monorepo. I need to export those and upstream them individually (and clean them up). I’ll also need to rework some downstreams (npcomp, circt) so that this mode of working is reliable as standalone projects, at build time, and when embedded. But I think this is the general shape of the solution.

There are also some more breakdowns we can do on the Python and MLIR CAPI side so that downstreams that don’t need everything only pay for what they need (right now, you get it all: all dialects, the execution engine, etc.). The current size, with the execution engine for X86 (which comprises the largest part), is ~45MB on disk / 17MB zipped.

Anyway, it will be some weeks before I can finish this but I think this is the right direction. A previous iteration of this approach worked out of the box on Linux/MacOS/Windows, and I’m pretty sure this one will too when done.

A few folks specifically who have been on this journey: @mehdi_amini @jdd @mikeurbach @ftynse @GeorgeL @clattner


Completely agreed!

What did you do with TypeID?

-Chris

Well, in this case, I cheated (or at least slid through a crack): for distribution in this context, we want one mondo C export library that the Python extensions depend on, and that dylib should statically link everything, hide symbols, etc. No C++ dynamic linking == no possibility of TypeID issues.

That happens to be the right way to build things for this specific case (limited pollution of the namespace is important in these plug-in scenarios, and it gives a ~20% binary size saving right off the top), but we definitely want dynamic linking to be robust for a lot of other cases.

To roll this out for development (and other deployment scenarios), where we don’t have some outer build setup saying to do a mondo static build of the API dylib, we still want dynamic linking, and that will trigger the TypeID issue. As a first step, just to get things going, I was going to have the default library loader use RTLD_GLOBAL to unblock things (the packaged library loader does its own thing anyway). It seems like everyone goes through this progression: just enable global loading until you’ve fixed enough bugs and can turn it off. I think there are still bugs in the current setup, even with that, so we’ll see. I feel better about pragmatic things like that if we’ve got an idea of how to fix it the right way over time.

For reference, in addition to TypeID problems, I’ve also seen static symbol duplication for the pass registry. Two separate registries were created, one for the main MLIR+bindings and another for the out-of-tree project with bindings, and they were not merged by the linker for some reason. RTLD_GLOBAL wasn’t sufficient in that case; I had to LD_PRELOAD libOutOfTreeProject.so.

This gets me back to my “this should work” comment, and I really wish I had a theory as to why it doesn’t. In my reading of the glibc sources, a SONAME match should be sufficient to cause only one copy of the .so to be loaded (but the code is hard to follow and there are a lot of contradictory statements about it). This kind of namespace splitting is what dlmopen is for. I’ve observed that a direct dependency (with a SONAME) of an RTLD_LOCAL dependency is properly handled, but indirect dependencies are not. It might be an interaction with lazy loading, and we should try using RTLD_NOW, which some have reported is important (but I haven’t tracked down if/why).

In your setup, are you able to confirm whether the library containing the pass registry has a SONAME (readelf -d)? Also, does setting the flags to include RTLD_NOW have any effect?

Note that we should never be deploying things with this structure (i.e. deep, non-unique SONAME deps), because multiple out-of-tree projects will conflict with each other. So I still maintain that, when packaging, my static linking approach is the way to go.

I don’t have a similar approach implemented for development yet, and perhaps the way out is to just do that properly so that each project can be isolated, like at deployment time. In order for that to work, we would need to:

  • Finish making the mlir packages relocatable (it is about 70% there now) and give each sub-project its own (i.e. myproj.mlir.ir).
  • Rework the extensions into an object/dylib split so that each project can link their own core extensions in the way of their choosing.
  • Split the CAPI into finer grained dependencies so that you don’t need to have everything for every project (ie. Get the ~40mb minimum charge down to a few mb).

Between all of my partial attempts, I think I am not far from the first two (and the third is just an optimization). Do you think this is the right approach, and should I just do it? (It is not trivial, and I would like to know this is what we want to do before setting out)

I don’t know which one contains the pass manager, so I’ll tell you about both: my Python extension doesn’t have a SONAME, and it depends on libMLIRPublicAPI.so.13git, which does have a SONAME.

AFAIK, RTLD_NOW is enforced on Linux systems (with Python), and that’s what I have been using since the beginning of this project.

Dropping a link to the most authoritative answers I’ve found so far: 19884 – Document libdl API in glibc to describe loading process in detail.

And just confirming, you are building with -DBUILD_SHARED_LIBS=ON, right?

I’ve confirmed this locally: all LLVM libraries (built with llvm_add) have a SONAME set. We’re back to: this shouldn’t happen, but it clearly does. I still can’t reason about what I have observed:

Files:

  • _mlir*.so (no SONAME, loaded with RTLD_LOCAL)
  • libMLIRPublicAPI.so (has a unique SONAME, DT_NEEDED from _mlir*.so)
  • _foo*.so (no SONAME, loaded with RTLD_LOCAL)
  • libFooPublicAPI.so (has a unique SONAME, DT_NEEDED from _foo*.so, has a DT_NEEDED on libMLIRPublicAPI.so)
  • libMLIRIR.so (has a unique SONAME, DT_NEEDED either from lib*PublicAPI.so or one of its transitive deps, all of which have unique SONAMEs)

What I am seeing is that the mechanism works as I expect for all _mlir*.so extension loads (i.e. they properly resolve to one copy of the tree of deps rooted at libMLIRPublicAPI.so). However, loading _foo*.so extensions appears to pull in an entirely duplicated tree of dependencies, resulting in various issues with statics. In a fully dynamic build like this, I do not understand how this is possible. I’ve tried various things, like giving _foo*.so a direct DT_NEEDED on libMLIRPublicAPI.so, etc. The only thing that makes it work in my cases is to load both _mlir*.so and _foo*.so with RTLD_GLOBAL. I don’t have a theory as to what else is going wrong in the case that @ftynse reported and why only an LD_PRELOAD works.

Within a namespace, there should only be one library with a given SONAME, and the first one wins. If that can’t be depended on, the whole house of cards falls down in a lot more spectacular ways than some project like ours getting duplicate definitions of a couple of its statics.

We need a theory as to why the rule holds for all _mlir*.so dlopen()s and not for _foo*.so (external project) dlopen()s. I have not found any difference there that should be load-bearing, and I don’t feel confident investing in workarounds without a root cause, or at least a theory, as to what is going on.

I thought that the issue we face is a duplication of the same symbol across different .so files, isn’t it?

For example, errors about duplicate registration generally don’t point to two copies of the registry itself, but to two copies of the registered entity (for example, the same pass in two different .so files, each with a global constructor doing the registration).
We actually allow duplicate registration, but everything is guarded through TypeID, so it is still a problem if the resolution of the static instance in TypeID::get is local to the .so instead of global (will it be a direct load or go through an indirection?), which may also be affected by the various visibility modes, etc.
Am I misunderstanding what you’re chasing?

Do you have a linux repro that shows this issue?

I tried to look into alternative mechanisms; it is actually quite difficult to achieve while staying performant. In general, I’d be concerned about adding a penalty to the static case because, as you mentioned, “a lot of folks who are building MLIR based projects really should be statically linking and making hermetic distributions that don’t have C++/runtime dependencies between them”.

I thought of a more intrusive solution that would work only for IR constructs (Type, Operation, Attribute): we could require each of these to explicitly define a static member to act as the TypeID instance. Because it wouldn’t be defined in a header and it wouldn’t be a template, it would be a strong definition instead of a weak one, and it should be easier to have a single instance in the program.
This would still not be a guarantee though: you could still link a dialect implementation into two different .so files and get duplicate definitions (which you could hide by using hidden visibility or loading with RTLD_LOCAL), but I think it would still be an improvement over what we have today.
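
A minimal sketch of that shape, with illustrative names (the real change would presumably flow through ODS/TableGen): the TypeID storage becomes an ordinary static data member declared in the header and defined in exactly one .cpp file, yielding a strong symbol rather than a weak, template-instantiated one.

// MyOp.h (illustrative)
struct MyOp {
  static char typeIDStorage; // declaration only; no definition in the header
};

// MyOp.cpp (illustrative)
char MyOp::typeIDStorage; // the single, strong definition

// isa-style checks would then compare &MyOp::typeIDStorage, which resolves to
// one address process-wide as long as the definition isn't duplicated into
// multiple .so files with hidden visibility (the caveat noted above).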

No

These do have a SONAME in my setup

And they still have a SONAME when I build with -DBUILD_SHARED_LIBS=ON

Here is how it could look (for Operations; I haven’t done Type/Attribute yet): ⚙ D105903 Emit strong definition for TypeID storage in Op definition (WIP)

Ah, this today will produce legitimate duplicate objects for core libraries. Prior to my PoC approach above, we could only link (a) fully dynamically with BUILD_SHARED_LIBS=ON, or (b) statically, but only for the core library (you cannot create a combination with another project). My PoC above allows (c): statically link a monolithic dep for a cross-project Python setup (this will need some work to productionize, but I think we should).

I expect that in the end, folks will use (a) for dev/early-phase stuff and (c) for anything real. (b) is just an artifact of static linking never having been finished for non-core Python projects.

So today, you will find everyone who has this working using (a). And for a reason I can’t discern, that currently requires RTLD_GLOBAL (and has other spooky reports, like what @ftynse describes above).

Aside from the full approach I propose above, you can get a little further with the current static linking setup if you remove the various --exclude-libs flags, as I reported above, but this will only work on Linux and may have other issues.

I have re-engineered the MLIR Python build. I believe it should meet the requirements that everyone has, both in-tree and out-of-tree: ⚙ D106520 Re-engineer MLIR python build support. As a side effect, once we roll this out, the various MLIR Python projects should be trivially uploadable to PyPI with just a bit more tweaking. This also fixes the out-of-tree Python Windows builds.

@youben Do you mind if I pull your sample project in-tree as an example?

This is going to take some coordination with incubator projects in order to land. Hoping we can get organized to do that ~early next week, assuming there is no controversy.