MLIR and the C++ ABI

In CIRCT, we encountered a weird bug related to ⚙ D124791 [mlir] Refactoring dialect and test code to use parseCommaSeparatedList.

In a CI test, we build MLIR with Clang and link it to CIRCT compiled with GCC to test for GCC-specific compiler errors. The builds started failing after rebasing LLVM onto the patch. We are aware of the fact that this is not supposed to work and we will discontinue cross-compiler linking.

The reason for the failure is the lambda function in parseTypeList. On the MLIR library side, the function associated with the lambda is compiled with Clang, expecting the structure of captured items to contain the this pointer as its first field. On the CIRCT side, GCC decides to inline the creation site of the lambda, setting up the payload according to its own rules, placing the captured this pointer as the last field of the structure. Once the lambda function starts executing, the lambda crashes as the this pointer and the SmallVectorImpl reference are reversed.

While there are obvious incompatibilities between the lowering of lambdas, issues can be hidden by the fact that lambda setup code and lambda function implementations are often in the same compilation unit and the payload is not inspected while traversing other modules, especially if llvm::function_ref is used. Unfortunately, if inlining is involved or lambdas are copy- or move-constructed, problems arise. To the best of my knowledge, C++ has no ABI to standardise this (a C++ ABI draft goes only as far as specifying name mangling rules).

This is going to be a problem if LLVM/MLIR will be packaged in any way, as functions that construct lambdas will not work when inlined. While our use case was incorrect, linking a project compiled with clang to a library built by GCC and distributed by a package manager is a legitimate scenario. If packaging is a need, we should consider some style guidelines to avoid such issues. Otherwise, I hope this post serves as a warning of the perils of mixing compilers.

7 Likes

Wow, that is crazy. This seems like it is a pretty significant ABI incompatibility issue - is this specified by the de-facto C++ ABIs, or just defined by implementations?

I think it’s left up to the implementation, as there is no reference in the ABI draft and the standard leaves the field order unspecified.

Maybe relevant: The cxx-abi-dev January 2013 Archive by thread.

Sounds relevant, although the issue is more involved here since the incompatibility arises not from the definition of the lambda, but from the inlining transformation performed by the compiler, which I believe was not considered in that discussion. Did the discussion in that thread lead to any tangible changes?

I haven’t looked at your problem in detail, but I assume that both compilers see the lambda and generate code for both construction and operator(). Then one of these is perhaps inlined and the other is discarded by the linker in favor of the copy from the other compiler, resulting in inconsistencies. In standardese these are ODR issues. The linker can discard duplicate weak symbols because the one-definition rule states they must be the same, otherwise it’s undefined behavior.

Maybe @zygoloid or @rjmccall know that.

That’s precisely the problem here. But more generally, even if the compiler matched the implementation with the correct construction code, passing std::function through a module lowered by the different compiler with different copy or move constructors is still incorrect.

Hmm, I don’t see how that could be an issue, since std::function should essentially type-erase the lambda implementation. I haven’t thought about how it handles copy and move construction, but I can only assume that they are virtualized if not trivial, since std::function erases all information about the contained members.

The general idea is that you can mix compilers if you don’t rely on undefined behavior. If that’s not the case and you have a reproducer for the inconsistency, it might be worth filing an issue at GitHub - itanium-cxx-abi/cxx-abi: C++ ABI Summary, if only to clarify whether the ABI should specify that or whether it’s undefined behavior. I’m not sure about the specifics of your case, but [basic.def.odr] states that lambdas in a function body should be fine, but as default parameters they’re not.

You’re absolutely right, I missed the fact that std::function carries no information about the type of the captures.

Reproduced the error with a minimal example and filed an issue here: Lambda ABI mismatch · Issue #141 · itanium-cxx-abi/cxx-abi · GitHub

1 Like