Race condition crashes during launching LLDB

Hi,

I am converting our lldb automation tests to async mode. However, I found that they crash randomly, depending on timing. The crash happens mostly when launching lldb twice in a row. I have narrowed the code down to the simple repro below. Did I make any wrong assumptions about the LLDB API here?

The crash stack does not seem to be consistent. In the small repro below, the crash stack is:

Crashed Thread: 0 Dispatch queue: com.apple.main-thread

Exception Type: EXC_BAD_ACCESS (SIGSEGV)
Exception Codes: EXC_I386_GPFLT

Thread 0 Crashed:: Dispatch queue: com.apple.main-thread
0 _lldb.so 0x00000001088c7179 EventMatcher::operator()(std::__1::shared_ptr<lldb_private::Event> const&) const + 21
1 _lldb.so 0x00000001088c65d2 lldb_private::Listener::FindNextEventInternal(lldb_private::Broadcaster*, lldb_private::ConstString const*, unsigned int, unsigned int, std::__1::shared_ptr<lldb_private::Event>&, bool) + 176
2 _lldb.so 0x00000001088c6952 lldb_private::Listener::WaitForEventsInternal(lldb_private::TimeValue const*, lldb_private::Broadcaster*, lldb_private::ConstString const*, unsigned int, unsigned int, std::__1::shared_ptr<lldb_private::Event>&) + 134
3 _lldb.so 0x00000001088c6ae9 lldb_private::Listener::WaitForEventForBroadcasterWithType(lldb_private::TimeValue const*, lldb_private::Broadcaster*, unsigned int, std::__1::shared_ptr<lldb_private::Event>&) + 27
4 _lldb.so 0x0000000108abce6c lldb_private::Process::WaitForStateChangedEvents(lldb_private::TimeValue const*, std::__1::shared_ptr<lldb_private::Event>&, lldb_private::Listener*) + 112
5 _lldb.so 0x0000000108abcc95 lldb_private::Process::WaitForProcessToStop(lldb_private::TimeValue const*, std::__1::shared_ptr<lldb_private::Event>, bool, lldb_private::Listener, lldb_private::Stream*) + 377
6 _lldb.so 0x0000000108ac516a lldb_private::Process::HaltForDestroyOrDetach(std::__1::shared_ptr<lldb_private::Event>&) + 216
7 _lldb.so 0x0000000108abc8b0 lldb_private::Process::Destroy(bool) + 146
8 _lldb.so 0x0000000108abc56d lldb_private::Process::Finalize() + 91
9 _lldb.so 0x00000001088b63c4 lldb_private::Debugger::Clear() + 148
10 _lldb.so 0x00000001088b61fd lldb_private::Debugger::Destroy(std::__1::shared_ptr<lldb_private::Debugger>&) + 37
11 _lldb.so 0x0000000106bdb144 lldb::SBDebugger::Destroy(lldb::SBDebugger&) + 116
12 _lldb.so 0x0000000106c23daf _wrap_SBDebugger_Destroy(_object*, _object*) + 120
13 org.python.python 0x00000001058dd75f PyEval_EvalFrameEx + 12761

while in the real unit test it is crashing at:

Thread 12 Crashed:
0 libsystem_kernel.dylib 0x00007fff8635a286 __pthread_kill + 10
1 libsystem_c.dylib 0x00007fff919409b3 abort + 129
2 libc++abi.dylib 0x00007fff8a94ea21 abort_message + 257
3 libc++abi.dylib 0x00007fff8a9769d1 default_terminate_handler() + 267
4 libobjc.A.dylib 0x00007fff935e77eb _objc_terminate() + 124
5 libc++abi.dylib 0x00007fff8a9740a1 std::__terminate(void ()()) + 8
6 libc++abi.dylib 0x00007fff8a973b30 __cxa_throw + 121
7 com.apple.LLDB.framework 0x000000010b994c6b std::__1::shared_ptr<lldb_private::Process>::shared_ptr<lldb_private::Process>(std::__1::weak_ptr<lldb_private::Process> const&, std::__1::enable_if<is_convertible<lldb_private::Process, lldb_private::Process*>::value, std::__1::shared_ptr<lldb_private::Process>::__nat>::type) + 99
8 com.apple.LLDB.framework 0x000000010b8ac762 lldb_private::Process::AppendSTDOUT(char const*, unsigned long) + 86
9 com.apple.LLDB.framework 0x000000010b6951d7 lldb_private::Communication::ReadThread(void*) + 287
10 libsystem_pthread.dylib 0x00007fff8d92c05a _pthread_body + 131
11 libsystem_pthread.dylib 0x00007fff8d92bfd7 _pthread_start + 176

================Repro Code====================

import time

import lldb

# executable_path and LLDBListenerThread are defined elsewhere in the test
# and are not shown in this repro.

def wait_for_process_stop(process):
    # Poll until the process reports that it is stopped.
    while not process.is_stopped:
        time.sleep(0.1)

def launch_debugging(debugger, stop_at_entry):
    error = lldb.SBError()
    listener = lldb.SBListener('Chrome Dev Tools Listener')
    target = debugger.GetSelectedTarget()
    process = target.Launch(listener,
                            None,           # argv
                            None,           # envp
                            None,           # stdin_path
                            None,           # stdout_path
                            None,           # stderr_path
                            None,           # working directory
                            0,              # launch flags
                            stop_at_entry,  # stop at entry
                            error)          # error
    print('Launch result: %s' % str(error))
    event_thread = LLDBListenerThread(debugger)
    event_thread.start()
    return process

def do_test():
    debugger = lldb.SBDebugger.Create()
    debugger.SetAsync(True)
    target = debugger.CreateTargetWithFileAndArch(executable_path, lldb.LLDB_ARCH_DEFAULT)

    process = launch_debugging(debugger, stop_at_entry=True)

    wait_for_process_stop(process)  # wait for entry breakpoint.
    target.BreakpointCreateByName('main')
    process.Continue()
    wait_for_process_stop(process)  # wait for main breakpoint.
    lldb.SBDebugger.Destroy(debugger)

def main():
    do_test()
    do_test()

I don't know what:

    event_thread = LLDBListenerThread(debugger)

does, but from your little sketch it looks like you are starting up a thread listening on this debugger, and so far as I can see you destroy the debugger out from under it without ever closing down that thread. That doesn't seem like a good idea.

Jim
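
A minimal sketch of the shutdown order described above: ask the listener thread to quit, join it, and only then tear down whatever it was listening on. It is not the original LLDBListenerThread, whose code is not shown in this thread. The names StoppableListenerThread, make_queue_poller, poll_event and handle_event are hypothetical, and a queue.Queue stands in for the LLDB listener so the sketch runs without lldb installed.

```python
import queue
import threading


class StoppableListenerThread(threading.Thread):
    """Polls an event source until asked to quit (hypothetical helper)."""

    def __init__(self, poll_event, handle_event, poll_timeout_s=1):
        super().__init__()
        # poll_event(timeout) returns the next event, or None on timeout.
        self._poll_event = poll_event
        self._handle_event = handle_event
        self._poll_timeout_s = poll_timeout_s
        self._should_quit = threading.Event()

    def run(self):
        # Polling with a timeout keeps the loop responsive to the quit flag.
        while not self._should_quit.is_set():
            event = self._poll_event(self._poll_timeout_s)
            if event is not None:
                self._handle_event(event)

    def stop(self):
        """Ask the loop to exit and block until the thread has finished."""
        self._should_quit.set()
        self.join()


def make_queue_poller(events):
    """Stand-in event source backed by a queue.Queue (assumption)."""
    def poll(timeout_s):
        try:
            return events.get(timeout=timeout_s)
        except queue.Empty:
            return None
    return poll
```

In a real test, poll_event would presumably wrap SBListener.WaitForEvent with a short timeout, and the main thread would call stop() before lldb.SBDebugger.Destroy(debugger), so that nothing is still waiting on the debugger's listener when it goes away.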

After adding some logging, I figured out that the race condition occurs because, in async mode, process.Continue() does not guarantee that the process has actually resumed by the time it returns. The second wait_for_process_stop() therefore returns immediately, and the test goes straight on to kill the listener thread and destroy the debugger. I know I have a race condition bug here because I poll for the process state, but why does lldb crash when the listener thread has exited and SBDebugger.Destroy() is called? Under what conditions can SBDebugger.Destroy() be called safely?

==================do_test==================
Launch result: success
Listening Thread ID: 4660334592
WaitForEvent…
Target event: ModulesLoaded
WaitForEvent…
Process event: StateChanged, Stopped
Stop reason: 5
WaitForEvent…
Breakpoint event: [Added] SBBreakpoint: id = 1, name = ‘main’, locations = 1
WaitForEvent…
[main] killing listener thread
Process event: StateChanged, Running
Stop reason: 0
Exiting listener thread
[main] destroy debugger
Segmentation fault: 11

My guess on the sequence of events here is this:
- call process.Continue(), which returns immediately
- you check process.is_stopped, which is still true
- set self.should_quit = true
- listener thread exits and you join it
- you call SBDebugger.Destroy()

all of this happens _before_ the process has had a chance to really
start. So now Destroy starts destroying the process while it is just
being started up and things go south. It could be argued that this is
a bug in LLDB (this is the reason our TestEvents is disabled). I've
been investigating this a bit, but it did not look easy.

In any case, what you can do now is to make sure you wait for the
eStateRunning event before you try to do anything to the process
(including killing it). These paths are better tested and I believe
they should be stable.

pl
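
The advice above (see the running event before touching the process again) can be sketched without any LLDB dependency. In this sketch, next_state is a hypothetical callable that returns the next process state, or None on timeout. With LLDB it would presumably wrap SBListener.WaitForEvent and SBProcess.GetStateFromEvent. The strings "running" and "stopped" are stand-ins for lldb.eStateRunning and lldb.eStateStopped, and both helper names are assumptions.

```python
import time

RUNNING = "running"   # stand-in for lldb.eStateRunning
STOPPED = "stopped"   # stand-in for lldb.eStateStopped


def wait_for_state(next_state, wanted, timeout_s=10.0):
    """Consume state changes until `wanted` is seen; return False on timeout."""
    deadline = time.monotonic() + timeout_s
    while True:
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            return False
        # next_state returns None if nothing arrived within `remaining` seconds.
        state = next_state(remaining)
        if state == wanted:
            return True


def resume_and_wait_for_stop(resume, next_state, timeout_s=10.0):
    """Resume, confirm the process really ran, then wait for the next stop."""
    resume()
    if not wait_for_state(next_state, RUNNING, timeout_s):
        return False
    return wait_for_state(next_state, STOPPED, timeout_s)
```

Waiting for the running state first is what closes the window in which the process still looks stopped from the previous stop.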

Thanks Pavel, yeah, that’s what I figured out yesterday.
In “So now Destroy starts destroying the process while it is just being started up and things go south”, by “started up” I assume you mean that the inferior has not yet continued/resumed from the first entry point breakpoint, right? The inferior has definitely started and hit the first entry point breakpoint, because of the first “wait_for_process_stop()” call. Just want to confirm we are on the same page.

I have revised my sample code to use two threading.Event objects to synchronize running/stopping between the main thread and the listener thread. The race condition seems to go away: it no longer reproduces after running the test 10 times in a loop. I will integrate this logic into our unit test to see if it fixes the crash.
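
The revised sample itself was not posted, so the following is only a guess at what such a two-event scheme could look like. ProcessStateSignals, on_state and resume_and_wait are hypothetical names, and the state strings stand in for the LLDB state constants. The listener thread would call on_state() for every process state change it receives.

```python
import threading


class ProcessStateSignals:
    """Two events the listener thread sets and the main thread waits on."""

    def __init__(self):
        self.running = threading.Event()
        self.stopped = threading.Event()

    def on_state(self, state):
        """Called from the listener thread for each process state change."""
        if state == "running":
            self.running.set()
        elif state == "stopped":
            self.stopped.set()

    def resume_and_wait(self, resume, timeout_s=10.0):
        """Resume the process and block until it has run and stopped again."""
        # Clear both flags before resuming, so that a stop left over from
        # earlier cannot be mistaken for the one we are about to wait for.
        self.running.clear()
        self.stopped.clear()
        resume()
        if not self.running.wait(timeout_s):
            return False
        return self.stopped.wait(timeout_s)
```

The listener only ever sets the flags and the main thread only clears them, which avoids losing a fast running/stopped pair that arrives before the main thread starts waiting.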

If you are going to set async to true, then you must consume the events by waiting for events. See the following example:

svn cat http://llvm.org/svn/llvm-project/lldb/trunk/examples/python/process_events.py

So you can rewrite your wait_for_process_stop to use the debugger to fetch the events and do things correctly:

def wait_for_process_stop(debugger):
  event_timeout_in_seconds = 10
  listener = debugger.GetListener()
  event = lldb.SBEvent()
  stopped = False
  while not stopped:
    if listener.WaitForEvent(event_timeout_in_seconds, event):
      if lldb.SBProcess.EventIsProcessEvent(event):
        state = lldb.SBProcess.GetStateFromEvent(event)
        if state == lldb.eStateStopped:
          # Watch for something that stopped and restarted automatically
          if not lldb.SBProcess.GetRestartedFromEvent(event):
            stopped = True
    
The other option is to set Async to False, and then your target.Launch won't return until the process is stopped. Also, if you do "process.Continue()" it won't return until the process is stopped. But you lose your ability to stop the process if things go wrong...

Greg

The main problem you are running into is that in async mode, when you say "launch" or "continue", those calls return immediately and you must consume the events to make sure it is safe to do things. If your process is stopped, then until you actually resume it with process.Continue() or any thread.StepXXX() call, it will stay stopped. But when you ask to make it run, you must consume events to know what the process is actually doing...