Checking types between different Python extensions

The least I can do, no problem!

This looks really great, Mehdi. Regardless of what happens with the Python dylib thing, this seems like a clearly better approach for reducing pressure on the static/dynamic linker.

+1 - anchoring these in one translation unit is the right thing.

FWIW - I finally figured out what is going on with the "Python" side. It is an issue with loading more than one plugin library with RTLD_LOCAL. If the dependent libraries (which have SONAMEs) have vague or weakly linked symbols in common, then depending on link order you can end up with vague/weak symbols that are duplicated in the process. This seems to just be a design flaw of ELF linking, and there is no way to avoid it. It comes up naturally from the rule that "the first library with a SONAME wins": the first RTLD_LOCAL plugin that encounters a given SONAME .so (libA.so) will resolve all of its symbols, which are then cached in that namespace and used verbatim the next time the SONAME is called for. If another SONAME .so (libB.so) is loaded later under its own RTLD_LOCAL scope, it will also resolve its vague/weak symbols to its own unique addresses. Then, if libA.so is referenced again later (where "later" can just mean further along in the DAG dependency order), its cached/linked symbols are used as-is, and now you have duplicates.
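
For anyone who wants to see the mechanic in miniature, here is a contrived sketch (all file and symbol names are hypothetical, and it collapses the SONAME dependency chain into the plugins themselves). A template's static member gets vague (weak) linkage, so every DSO that instantiates it carries a copy, and RTLD_LOCAL prevents the dynamic linker from unifying those copies:

// common.h -- included by both plugins. The static member below has
// vague (weak) linkage, so each DSO that instantiates it emits a copy.
template <typename T>
struct TypeIDFor {
  static char id; // the *address* of this byte acts as the type's ID
};
template <typename T>
char TypeIDFor<T>::id;

// plugin_a.cpp and plugin_b.cpp both contain exactly:
//   #include "common.h"
//   extern "C" void *getIntTypeID() { return &TypeIDFor<int>::id; }
// Build: g++ -shared -fPIC plugin_a.cpp -o plugin_a.so (same for b)

// host.cpp -- load both plugins the way an interpreter loads extension
// modules, then compare the supposedly process-unique IDs.
#include <dlfcn.h>
#include <cstdio>

int main() {
  void *a = dlopen("./plugin_a.so", RTLD_NOW | RTLD_LOCAL);
  void *b = dlopen("./plugin_b.so", RTLD_NOW | RTLD_LOCAL);
  auto idA = reinterpret_cast<void *(*)()>(dlsym(a, "getIntTypeID"));
  auto idB = reinterpret_cast<void *(*)()>(dlsym(b, "getIntTypeID"));
  // RTLD_LOCAL keeps each plugin's weak symbols in its own namespace,
  // so this typically prints "duplicated".
  printf("%s\n", idA() == idB() ? "unified" : "duplicated");
}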

It is frighteningly easy to end up with dependency DAGs that exhibit this pattern, and mixing in other linker behaviors, like re-exporting dependent static libraries (the default on Linux, not on macOS) and hidden symbol visibility, doesn't change the bug mechanic, but it does add a lot more ways that this can fail.

To be maximally usable as a library, we should avoid weak/vague symbols of this form and always anchor them in a specific translation unit (TU). On the Python side, layering things cleanly, as I've now done, is also the right approach for other reasons (e.g. creating static/hermetic distributions).
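
Concretely, "anchoring" is just the classic declare-in-header / define-in-one-.cpp pattern, so the symbol is emitted exactly once and is strong rather than weak. A generic sketch with hypothetical names:

// MyThing.h -- declaration only: no inline or template definition means
// no vague-linkage symbol in every translation unit that includes this.
struct MyThing {
  static void *getTypeID();
};

// MyThing.cpp -- the one TU that owns the definition. The function (and
// its local static) are strong symbols, so exactly one copy exists.
void *MyThing::getTypeID() {
  static char id;
  return &id;
}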


Ok, D106520 (Re-engineer MLIR python build support) has landed, we have pre-flighted integrations in CIRCT/npcomp, and this has been used in some experimental repos I am aware of.

Things left to do:

  • Further refinement of the breakdown of libraries. Right now the execution engine is included in the main IR group, which is the primary thing bloating sizes. Splitting this out should be easy; it just wasn't obvious before. I also took steps to allow a downstream to avoid depending on all core dialects/passes, but there is probably some additional work needed to refine things.
  • Landing an example project upstream (@youben - yours is a good start - do you want to do that, or should I?)
  • Finishing/testing the support for renaming the mlir. namespace (i.e. making it a private part of your project). I built the support for it, but it is likely missing things and has not been tested. This is expected to be the main way that fully out-of-tree projects use the API (i.e. if they are distributing something, they should not all be distributing the mlir. namespace).

Great progress, Stella!

And just to confirm that this fix is indeed working: NPCOMP with PyTorch integration (which is a non-trivial build/link setup, to say the least) now works on macOS. This had been broken by TypeID issues on macOS for a very long time.

I am really glad to have this all working and on to better things now.


That's great!

If it's not too much work for you, that would be great (since I'm not familiar with the contribution guidelines); otherwise, I can try to do it myself.

Thanks @stellaraccident for the fix and the new way of building Python modules. I will be waiting for the CIRCT project to adopt it, since it was the one I based my project on (as documentation).

FYI @youben, if you'd like to follow CIRCT, here's the patch where we switch to the new Python build support: [Python] Re-work Python bindings using upstream improvements. (#1484) · llvm/circt@8a792f8 · GitHub.

If you'd like to contribute your example I'm sure it would be appreciated! The guidelines are here: How to Contribute - MLIR. Basically, you can copy your demonstration project under examples/, submit a patch through Phabricator, and request a review. If you're interested, I'd be happy to help answer any questions and review the submission.


Great that CIRCT has already applied the changes; I will follow that commit to rework the bindings on my end as well.

Thanks for offering your help; I will follow the guidelines and submit an example as soon as I get it working with the new bindings.

FYI - I recently had to start a simple boilerplate project in IREE. It may be of value in pulling together a sample, especially since it is a single commit: https://github.com/google/iree/commit/aa8431fb5b8b20abeeb36e50e3b495abcfa603e3


FYI - Recent builds for me are showing that we've gotten some of the vague symbols but not all. Here is a sample of what is left: https://gist.github.com/stellaraccident/1c2e35219cf23f543410e98663f79d1c

Just glancing through, it looks like various interfaces, op traits, and analysis-related things. Not shown here, but ops, attributes, types, and dialects are being emitted as strong symbols, as expected.

So there is still some more work to be done, for anyone who wishes to take it up.

As it turns out, we still have really fragile TypeID problems in -DBUILD_SHARED_LIBS=ON builds. I just repro'd this on an internal Google project with mlir-hlo-opt. The logs aren't public, but the issue seems to be mismatched OpTrait and Interface TypeIDs. So I do think we need to finish this work.

It took me quite some time to move to HEAD and start using the new mechanism for building the Python bindings. The memref issue that I previously had is gone, but it seems that I'm now facing another TypeID issue.

An assert is failing at llvm-project/mlir/include/mlir/Support/InterfaceSupport.h:104 because the function couldn't get the interface (it returns nullptr). Further down, it seems that the lookup for the interface by a certain TypeID couldn't find it. The same function works when called from a C++ compiled binary, but not from a Python binding.

Here is the part of the backtrace that I can publicly disclose, in case it is helpful:

#0  mlir::Dialect::getRegisteredInterfaceForOp (this=0xc0c660, interfaceID=..., opName=...) at /home/youben/code/llvm-project/mlir/include/mlir/IR/Dialect.h:174
#1  0x00007ffff6a4176d in mlir::Dialect::getRegisteredInterfaceForOp<mlir::SymbolOpInterface> (this=0xc0c660, opName=...) at /home/youben/code/llvm-project/mlir/include/mlir/IR/Dialect.h:180
#2  0x00007ffff6a3e503 in mlir::OpInterface<mlir::SymbolOpInterface, mlir::detail::SymbolOpInterfaceInterfaceTraits>::getInterfaceFor (op=0xc71fc0)
    at /home/youben/code/llvm-project/mlir/include/mlir/IR/OpDefinition.h:1846
#3  0x00007ffff6a3ea1a in mlir::detail::Interface<mlir::SymbolOpInterface, mlir::Operation*, mlir::detail::SymbolOpInterfaceInterfaceTraits, mlir::Op<mlir::SymbolOpInterface>, mlir::OpTrait::TraitBase>::Interface<mlir::FuncOp, (void*)0> (this=0x7fffffffc170, t=...) at /home/youben/code/llvm-project/mlir/include/mlir/Support/InterfaceSupport.h:102
#4  0x00007ffff6a3c03b in mlir::OpInterface<mlir::SymbolOpInterface, mlir::detail::SymbolOpInterfaceInterfaceTraits>::Interface<mlir::FuncOp, (void*)0> (this=0x7fffffffc170)
    at /home/youben/code/llvm-project/mlir/include/mlir/IR/OpDefinition.h:1834
#5  0x00007ffff6a3c065 in mlir::SymbolOpInterface::Interface<mlir::FuncOp, (void*)0> (this=0x7fffffffc170) at tools/mlir/include/mlir/IR/SymbolInterfaces.h.inc:103

I'm not really aware of how complex or simple this problem is, so it would be great to know whether we can try to solve it somehow, and how much effort it would take to eradicate it from the Python bindings, because right now it makes them unusable for advanced purposes.

Here are my notes on the remaining TypeID weak linkage instances. For robust C++ dynamic linking, they all need to be fixed: More work needed for TypeID duplication (help wanted)
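
To connect this to the assert above: the interface lookup is keyed on the TypeID, i.e. on an address, so two live copies of the "same" ID silently miss each other. A toy model (hypothetical names, not the actual MLIR code):

#include <map>

// Toy model of a dialect's interface registry, keyed by TypeID address.
using TypeID = const void *;
static std::map<TypeID, void *> registeredInterfaces;

void *getRegisteredInterface(TypeID id) {
  // If the DSO that registered the interface and the DSO querying for it
  // each carry their own weak copy of the ID, `id` here is a different
  // address than the one used at registration time, and we get nullptr.
  auto it = registeredInterfaces.find(id);
  return it == registeredInterfaces.end() ? nullptr : it->second;
}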

We are using the Python bindings in some quite advanced, cross-project ways, but we never do C++ dynamic linking. Instead, each project embeds what it needs of the API, and the CMake machinery creates a combined CAPI .so/.dll that everything links to. It would be nice to have better support for general C++ dynamic linking in MLIR, but this setup (hermetic, CAPI-based Python extensions) is the preferred one for a number of deployment scenarios, and it carefully threads the needle across all of the platforms' link-time/DSO designs so that it stays sound (i.e. IREE, CIRCT, and Torch-MLIR are deploying Python extensions that include the MLIR API plus other projects like MHLO, for Linux/Windows/macOS).

I would like to see the rest of the issues I noted above fixed, but I don't have the time right now to track them all down.

Given what I know, it sounds like you may be holding something wrong. Could you share more about your CMake flags and setup? Specifically, it sounds like you are doing a DYLIB/shared build, and AFAICT that should be considered an unsupported configuration for MLIR until more of the dynamic linkage issues are resolved (my personal opinion/experience).


I added some explicit "homing" of the TypeIDs for many entities, but haven't gotten to all of them. In particular, interfaces aren't done yet, and neither are traits.
You can look at https://reviews.llvm.org/D105903 and see if you get inspired to send a patch for the interface generation :wink:
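
For reference, the pattern introduced there looks roughly like the following (I'm spelling the macros from memory, so check mlir/Support/TypeID.h for the current form, and mlir::MyInterface is just a placeholder):

#include "mlir/Support/TypeID.h"

// In the header: opt the class out of the fallback (weak-symbol) TypeID.
MLIR_DECLARE_EXPLICIT_TYPE_ID(mlir::MyInterface)

// In exactly one .cpp file: emit the strong definition.
MLIR_DEFINE_EXPLICIT_TYPE_ID(mlir::MyInterface)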

Edit: My message crossed with Stella's, which pointed to the summary I forgot about: More work needed for TypeID duplication (help wanted)! This has all the info indeed.


I'm using a unified build, with my project as an external project, which is what is currently supported for the Python bindings AFAIK. BUILD_SHARED_LIBS is off; the Python extension that I'm building is linked dynamically against a shared lib specific to my project, and against another one emitted by the build via add_mlir_python_common_capi_library.

Ok, so this is a bit different from what we are doing, but it should work (the best thing here is to fix the TypeID issues as I walked through in the above post).

I'd need to understand more about what is in the "shared lib specific to my project", what it links to, and the directory tree. Maybe an ls -lR of the python_package directory? And ldd on the key libraries? Feel free to PM me if this is sensitive or better handled off-thread.

Thanks to @stellaraccident, we were able to find out what was causing the issue and how to fix it. The TypeID issue is the root of the problem, and this workaround applies for as long as that issue remains. There was a cross-DSO C++ dependency: some parts of the MLIR C++ implementation were present both in the Python extension (e.g. _custom.so) and in the generated library for the CAPI (e.g. libCustomBindingsPythonCAPI.so). The way to get around this is to use the CAPI in your Python extension instead of C++ (which is what I had been doing). Anything you need to call from Python should have a CAPI.
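
As a sketch of what that layering looks like inside the extension (hypothetical binding; MlirContext and mlirContextCreate come from the C API header mlir-c/IR.h):

#include "mlir-c/IR.h"
#include <pybind11/pybind11.h>
#include <cstdint>

namespace py = pybind11;

// The extension speaks only the C API, so no mlir:: C++ symbols have to
// resolve across the boundary to libCustomBindingsPythonCAPI.so.
PYBIND11_MODULE(_custom, m) {
  m.def("create_context", []() {
    MlirContext ctx = mlirContextCreate();
    // Return an opaque handle; a real binding would wrap this properly.
    return reinterpret_cast<std::uintptr_t>(ctx.ptr);
  });
}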


I'm glad we were able to figure it out; I owe some doc updates on out-of-tree projects and packaging. Over the last couple of months this went from something we were struggling with to something we can use, and it would be good to describe where we landed.

This just emphasizes that MLIR really is not compatible with C++ linking across DSO boundaries. Python is in a bit of a privileged position because, if you can make it work, you do have the option of using a C++ API directly. But a lot of other language bindings really work best (or work at all) only with pure C anyway, so even if it means some extra typing, keeping that layering is the right thing for the core project. I do wish it were easier for out-of-tree projects to bend the rules, though, in cases where the layering doesn't matter to them. The way it is put together now sets a high bar: I think it produces the best end result, but it is a lot of boilerplate to get going.

I think if I had advised you to build with project-wide hidden visibility, the layering would have been enforced at link time instead of failing randomly at runtime, and that may have been better for debugging. Example:

# Make hidden visibility the project-wide default, so only symbols
# explicitly marked for export are visible across DSO boundaries.
set(CMAKE_VISIBILITY_INLINES_HIDDEN ON)
set(CMAKE_C_VISIBILITY_PRESET "hidden")
set(CMAKE_CXX_VISIBILITY_PRESET "hidden")

With default visibility, the linker will let you accidentally link against C++ internals. With hidden visibility, only explicitly exported parts of the API are visible across DSO boundaries, and that will, in practice, restrict you to the C API (which has macros that export it).
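
For illustration, the mechanics of such an export macro look roughly like this (a generic sketch; the real MLIR C API has its own macro for this, and the names below are made up):

// With the hidden-visibility defaults above, everything is invisible to
// other DSOs unless explicitly annotated for export.
#if defined(_WIN32)
#define MY_CAPI_EXPORTED __declspec(dllexport)
#else
#define MY_CAPI_EXPORTED __attribute__((visibility("default")))
#endif

// Exported: callable across the DSO boundary (e.g. from the extension).
MY_CAPI_EXPORTED void myProjectRegisterDialects(void);

// Not exported: an internal C++ detail. A dependent DSO that tries to
// use it now fails at link time instead of failing randomly at runtime.
void internalHelper();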
