[ORC JIT][MLIR] GDBRegistrationListener "second attempt to perform debug registration" assert

Hi all,

Attention: Lang Hames

I am developing the nGraph MLIR implementation and hitting the following assert while running nGraph unit tests:

assert(ObjectBufferMap.find(K) == ObjectBufferMap.end() &&

“Second attempt to perform debug registration.”);

Here is a permalink for that line of code on GitHub. The failure only occurs when running multiple unit tests back-to-back. Inevitably, if I rerun the failing unit test, it passes. The failure also tends to move around with a different unit tests failing on successive runs.

I am able to hit the failure in GDB. Here is a partial backtrace:

#3 0x00007fffec4cf412 in __GI___assert_fail (assertion=assertion@entry=0x7ffff1d145e0 “ObjectBufferMap.find(K) == ObjectBufferMap.end() && "Second attempt to perform debug registration."”,

file=file@entry=0x7ffff1d144d8 “/localdisk/adstraw/ngraph/build/mlir_project/llvm-project/llvm/lib/ExecutionEngine/GDBRegistrationListener.cpp”, line=line@entry=164,

function=function@entry=0x7ffff1d15520 <(anonymous namespace)::GDBJITRegistrationListener::notifyObjectLoaded(unsigned long, llvm::object::ObjectFile const&, llvm::RuntimeDyld::LoadedObjectInfo const&)::PRETTY_FUNCTION> “virtual void {anonymous}::GDBJITRegistrationListener::notifyObjectLoaded(llvm::JITEventListener::ObjectKey, const llvm::object::ObjectFile&, const llvm::RuntimeDyld::LoadedObjectInfo&)”)

at assert.c:101

#4 0x00007ffff01ed4d3 in (anonymous namespace)::GDBJITRegistrationListener::notifyObjectLoaded (this=0x5555591a9270, K=93825060893584, Obj=…, L=…)

at /localdisk/adstraw/ngraph/build/mlir_project/llvm-project/llvm/lib/ExecutionEngine/GDBRegistrationListener.cpp:163

#5 0x00007ffff01b0eca in std::function<void (unsigned long, llvm::object::ObjectFile const&, llvm::RuntimeDyld::LoadedObjectInfo const&)>::operator()(unsigned long, llvm::object::ObjectFile const&, llvm::RuntimeDyld::LoadedObjectInfo const&) const (__args#2=…, __args#1=…, __args#0=, this=0x5555594e09b8) at /usr/include/c++/7/bits/std_function.h:706

#6 llvm::orc::RTDyldObjectLinkingLayer::onObjLoad (this=, K=3, R=…, Obj=…, MemMgr=0x5555591c4160, LoadedObjInfo=std::unique_ptrllvm::RuntimeDyld::LoadedObjectInfo = {…},

Resolved=std::map with 8 elements = {…}, InternalSymbols=std::set with 0 elements)

at /localdisk/adstraw/ngraph/build/mlir_project/llvm-project/llvm/lib/ExecutionEngine/Orc/RTDyldObjectLinkingLayer.cpp:267

A little debug context:

  • Frame #6 with K=3 we are about to execute the NotifyLoaded callback
  • Frame #5 we execute the NotifyLoaded callback which apparently transforms K from 3 to 93825060893584 (0x5555596CF390) which looks like a pointer. I don’t know the code well enough to grok what’s going on here. I’m not even sure where this callback resides, but it could be here in RTDyldObjectLinkingLater.cpp where we are using the address of the MemMgr as K.
  • Frame #4 now in notifyObjectLoaded with K=93825060893584 which will go on to assert in frame #3 due to the fact that that key (K) already exists in the map

All of this leads me to believe there is some sort of race. Theory: with two unit tests A and B running back to back…

+Lang

Hi Adam,

Calls to the listeners should be protected by the RTDyldLayerMutex. Could you apply the attached patch and share the debugging output from one of the failing runs?

Regards,
Lang.

rtdyldlayer-debug-out.patch (1.4 KB)

Sorry for the delay. Hope everyone had a good long weekend. See attached log captured from GDB session. Here is the punchline:

Adding MemMgr 0x55555959f440

unit-test: /localdisk/adstraw/ngraph/build/mlir_project/llvm-project/llvm/lib/ExecutionEngine/GDBRegistrationListener.cpp:164: virtual void {anonymous}::GDBJITRegistrationListener::notifyObjectLoaded(llvm::JITEventListener::ObjectKey, const llvm::object::ObjectFile&, const llvm::RuntimeDyld::LoadedObjectInfo&): Assertion `ObjectBufferMap.find(K) == ObjectBufferMap.end() && “Second attempt to perform debug registration.”’ failed.

(gdb) print/x K

$2 = 0x555559605c50

The key (K) causing the assert a) does not match the addresses of the most recently added MemMgr and b) does not match the address of ANY MemMgr in the entire log file. The plot thickens.

Wondering if there is some other object (besides a MemMgr) that registers its address-as-key into ObjectBufferMap.

Adam

log2 (86.5 KB)

Referring to the log messages from my previous mail… I confused myself (and probably others) by reading the “Adding MemMgr 0x55555959f440“ message as “Registering MemMgr 0x55555959f440”. Thus the address mismatch made no sense. How could we be registering a MemMgr address/key that does not match once we arrive in notifyObjectLoaded method?

Answer: Because the registrations is NOT coming from RTDyldObjectLinkingLayer.cpp. I found this first by adding an additional debug statement (highlighted) thinking the reinterpret_cast may be at fault for the key mismatch. I never saw this print and I then realized my “Adding” vs. “Registering” mistake.

// Run EventListener notifyLoaded callbacks.

{

std::lock_guardstd::mutex Lock(RTDyldLayerMutex);

dbgs() << "Registering MemMgr " << MemMgr << “\n”;

auto LOIItr = LoadedObjInfos.find(MemMgr);

assert(LOIItr != LoadedObjInfos.end() && “LoadedObjInfo missing”);

for (auto *L : EventListeners) {

dbgs() << " key = " << reinterpret_cast<uintptr_t>(MemMgr) << “\n”;

L->notifyObjectLoaded(

static_cast<uint64_t>(reinterpret_cast<uintptr_t>(MemMgr)), *Obj,

*LOIItr->second);

}

LoadedObjInfos.erase(MemMgr);

}

So, where is the registration coming from? Remember that I am on a slightly older version of MLIR. My mlir/lib/ExecutionEngine/ExecutionEngine.cpp has the following code with highlighted debug statement added by me:

objectLayer->setNotifyLoaded(

[engine = engine.get()](

llvm::orc::VModuleKey, const llvm::object::ObjectFile &object,

const llvm::RuntimeDyld::LoadedObjectInfo &objectInfo) {

if (engine->gdbListener) {

uint64_t key = static_cast<uint64_t>(

reinterpret_cast<uintptr_t>(object.getData().data()));

dbgs() << "ExecutionEngine key = " << key << “\n”;

engine->gdbListener->notifyObjectLoaded(key, object, objectInfo);

}

});

The key printed here matches the key printed in notifyObjectLoaded. This code is NOT present at the tip of MLIR master. Hence the solution here may be to move to latest MLIR master which I probably should have tried before sending this mail. Will do that now.

Adam

Lang,

It looks like the change I need is here, correct?

https://github.com/llvm/llvm-project/commit/3c5dd5863c34ecd51e9d2a49929877d8151dea39

I would like to move the nGraph MLIR pointer all the way to the tip of master but it is expensive (lots of compile and link issues) so hoping to take a baby step.

Adam

Closing the gap on this. Moving nGraph MLIR pointer to the 3c5ddd58 git hash did indeed fix the problem. Thanks so much for all the help.

Hi Adam,

Sorry for the late response! I’m very glad you were able to identify and solve the problem, and thank you for sharing your analysis: It will help anyone else who gets tripped up by this.

Regards,
Lang.