Reliability of the lldb-dap tests

I want to raise awareness regarding the current state of the lldb-dap tests. Recently, we’ve seen an increase in the flakiness of these tests in CI ([LLDB] Flakey DAP tests on builders · Issue #137660 · llvm/llvm-project · GitHub), which has led to the tests for a bunch of major functionality (attaching, breakpoints, disassembling, running commands, …) being disabled.

Keeping the CI reliable is extremely important, and disabling unreliable tests is the right thing to do as long as we’re working towards re-enabling them. However, I’m skeptical that the flakiness we’re observing (at least currently) is specific to these particular tests rather than a structural issue, and I expect we’ll need to skip more tests in the coming days. We also can’t continue developing lldb-dap while these tests are disabled and we’re left with reduced or no test coverage. To summarize, the current state of lldb-dap is not sustainable and I think we urgently need to prioritize getting the tests re-enabled.

I’ve been investigating one potential source of unreliability ([lldb-dap] Fix raciness in launch and attach tests by JDevlieghere · Pull Request #137920 · llvm/llvm-project · GitHub), but I’m not confident this is the source of all the flakiness. I’d like to ask everyone for their help to investigate and improve the reliability of the tests.

CC the maintainers: @clayborg @wallace
CC active contributors: @ashgti @avogelsgesang @Da-Viper @eronnen @kusmour @ZequanWu


One observation from investigating this over the past week is that at least one source of the unreliability is the event thread. Support for the module event means we’re doing more work there, which changes the timing of operations and has uncovered the race condition I mentioned earlier. This theory is supported by @Felipe, who observed an increase in flakiness on our downstream branch of the Swift fork where I had backported the module change. Although we could temporarily revert that change as a stopgap, the pull request reveals existing race conditions—it would only obscure the issue and reduce its frequency, not eliminate it.

@clayborg, do you know if anyone on your team who is actively maintaining lldb-dap could lend a hand?

I have been debugging a deterministic failure with TestDAP_exception_cpp.py that we see internally. Not sure if it’s failing on the bots sometimes as well.

I wrote up the details in some debugging notes. This failure is caused by __cxa_current_exception_type() not returning consistent values across debugging sessions. I don’t think it’s DAP-specific, but it is triggered by the DAP test.

Any pointers to help fix the problem there would be appreciated.


Not sure there is a ton of useful info here, but I’ve seen this test be flaky on our bots in the past: [lldb-dap] TestDAP_exception_cpp test flakey · Issue #116040 · llvm/llvm-project · GitHub

Thanks for pointing out that issue. Looks like the same problem we see internally. I commented with my findings there.


The flakes in the attach test seem to be due to races between the DAP::Loop thread and the event handling thread.

To clarify the DAP flow a little, see ‘Launch Sequencing’ in the Overview.

Once the DAP server sends the response to the initialize request, the following happens in parallel:

  • The initialized event triggers the client to send setBreakpoints, etc., with the configurationDone request at the end of the chain.
  • The launch or attach request is sent and we send back a process event.

The flow from VSCode at least often ends up with:

  • initialize seq=1
  • launch seq=2
  • setBreakpoints seq=3
  • configurationDone seq=4

But these do not strictly have to be in that order.

The DAP server can respond to these requests asynchronously, but today lldb-dap handles each of them on a single thread.

Looking at the failing test on Buildbot, we had the following sequence:

{"command":"initialize", ...}
{"command":"attach", ...}
{"command":"setBreakpoints", ...}
{"command":"configurationDone", ...}
{"event":"stopped", ...}

But that stopped event is wrong. This attach did not set "stopOnEntry": true, so the process should have been continued at that point.

I think we need to establish more clear expectations about the state of the process at each point in the attach request flow. Right now, I don’t think we’re waiting for the process to be ‘stopped’ when we make a connection, around here llvm-project/lldb/tools/lldb-dap/Handler/AttachRequestHandler.cpp at main · llvm/llvm-project · GitHub. We do that if there are attach commands but not if we’re directly making the connection.

I’ll do some testing to see what happens if we add a dap.WaitForProcessToStop after we make the connection. We’re setting the debugger to sync mode, but I’m not sure if the SBTarget.ConnectRemote call is checking that or not.
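
For illustration, here’s a rough sketch of that idea using the Python SB API (the real lldb-dap code is C++ and has its own helper for this, so treat this as an analogue rather than the actual patch; the polling approach and the timeout value are assumptions):

import time
import lldb

def wait_for_process_to_stop(process, timeout_sec=30.0):
    """Poll the process state until it reports eStateStopped.

    Rough analogue of a WaitForProcessToStop-style helper; the real
    lldb-dap helper is C++ and may wait on events instead of polling.
    """
    deadline = time.monotonic() + timeout_sec
    while time.monotonic() < deadline:
        state = process.GetState()
        if state == lldb.eStateStopped:
            return True
        if state in (lldb.eStateExited, lldb.eStateDetached, lldb.eStateInvalid):
            return False
        time.sleep(0.1)
    return False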


I think our current implementation of the launch and attach requests is wrong.

The response is only meant to acknowledge whether it is a launch or an attach; the actual start of the debuggee should come after the configurationDone request from the client.

This should eliminate the race between setting breakpoints and stopOnEntry (in our current implementation of launch, this is checked when we reach configurationDone). If for some reason there is a large delay between the launch request and the configurationDone request, we would miss it entirely. The same goes for setting breakpoints (occasionally I have witnessed this happen).

Relevant info from the specification link

Launch Sequencing
There are three main building blocks we’ve discussed to create a debugging session:

  • Initialization, via the initialize request,
  • Configuration of breakpoints and exceptions during the configuration stage, and
  • Starting the debugger, via a launch or attach request.
Initialization happens first, and the debug adapter must respond to the initialize request with any capabilities before any further communication can take place. At any point after the client receives the capabilities, it sends a launch or attach request.

Once the debug adapter is ready to receive configuration from the client, it sends an initialized event to the client. As described above, the client sends zero or more configuration-related requests before sending a configurationDone request.

After the response to configurationDone is sent, the debug adapter may respond to the launch or attach request, and then the debug session has started.

This could solve the problem because we know exactly when the debuggee is started. The parallel part of initializing and launching is handled before the debuggee is started, so it does not send any event that the client may not be ready to handle.

To double-check, I confirmed this against some other implementations of the protocol:

  • vscode builtin node debugger link
  • delve go debugger link
  • gdb-dap link
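
To make the proposed ordering concrete, here’s a minimal, hedged sketch of the idea in Python (the real lldb-dap handlers are C++, and the class and method names below are made up for illustration, not actual lldb-dap symbols):

class AdapterSketch:
    """Illustration only: acknowledge launch immediately, start the
    debuggee only once configurationDone arrives."""

    def __init__(self):
        self.pending_launch = None  # launch/attach arguments to replay later

    def send_response(self, request, body=None):
        print("response to", request["command"], body or {})

    def start_debuggee(self, arguments):
        print("actually launching with", arguments)

    def on_launch(self, request):
        # Acknowledge only; do not start the debuggee yet.
        self.pending_launch = request["arguments"]
        self.send_response(request)

    def on_configuration_done(self, request):
        self.send_response(request)
        # Breakpoints and stopOnEntry from the configuration phase are
        # guaranteed to be in place before the debuggee starts.
        self.start_debuggee(self.pending_launch)

adapter = AdapterSketch()
adapter.on_launch({"command": "launch", "arguments": {"program": "/tmp/a.out"}})
adapter.on_configuration_done({"command": "configurationDone", "arguments": {}})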

I think you’re right. @kusmour came to the same conclusion: [lldb-dap] Fix raciness in launch and attach tests by JDevlieghere · Pull Request #137920 · llvm/llvm-project · GitHub

To summarize my comment in the PR, I think there are two things we could/should change:

  1. Delay handling the launch and attach requests until the configuration is done. The tricky part is that we will need to acknowledge the request with a response, without actually doing the launch or attach. Right now we’re using the response to report any errors during that process.
  2. Do all the launching and attaching in asynchronous mode and delay sending the response to the request until we’ve seen the process stop event. This would be the point where we decide to resume or stop, depending on whether stop-on-entry is true or false. The ConfigurationDone request would become a NO-OP besides signaling that we can handle the attach and launch request?

Does that make sense to folks? Can anyone think of a reason that couldn’t work?
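
To make option 2 a little more concrete, here’s a hedged sketch in Python SB API terms (the real handler is C++; the attach-by-pid path, error handling, and timeout are simplifying assumptions):

import lldb

def attach_async_and_wait(target, pid, stop_on_entry, timeout_sec=30):
    """Sketch of option 2: attach in async mode, wait for the initial
    stop event, then decide what to do based on stop-on-entry."""
    debugger = target.GetDebugger()
    debugger.SetAsync(True)  # do the attach in asynchronous mode

    listener = lldb.SBListener("lldb-dap.attach.sketch")
    error = lldb.SBError()
    process = target.AttachToProcessWithID(listener, pid, error)
    if error.Fail():
        return error  # this would become the error in the DAP response

    # Wait for the initial stop event before answering the request.
    event = lldb.SBEvent()
    while listener.WaitForEvent(timeout_sec, event):
        if not lldb.SBProcess.EventIsProcessEvent(event):
            continue
        if lldb.SBProcess.GetStateFromEvent(event) == lldb.eStateStopped:
            break

    # This is the point where stop-on-entry decides what happens next.
    if not stop_on_entry:
        process.Continue()
    return error  # success; only now send the attach response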


The flow from VSCode at least often ends up with:

  • initialize seq=1
  • launch seq=2
  • setBreakpoints seq=3
  • configurationDone seq=4

But these do not strictly have to be in that order.

And ++ to this:

I think we need to establish more clear expectations about the state of the process at each point

Our current implementation will make sure it’s in this order, at least for the VSCode case. And it’s not compliant with the DAP specification (see this comment). But then we also have dap_server.py, which is not completely the same as a real client.

If we also add the responses and events, it will be:

--> initialize req | <-- initialize res
                     --> launch req | <-- launch res
                                      <== process & initialized (in any order)
                                      --> setBreakpoints req ... | <-- setBreakpoints res ...
                                                                   --> configurationDone req | <-- configurationDone res
                                                                                               <== stopped (if stopOnEntry)

This is different from the specification already. And for tests, dap_server.py might do something different from the above.
For example, we rely on a continue to trigger the configurationDone. See the DAP log from TestDap_attach.test_terminate_commands: there will be no configurationDone in the log.

I think it’s also very important to make sure our tests behave the same as a real client, so that they are reliable.


This all makes sense to me and, tbh, this is how we should have done it from the beginning.
It’s important to mention that this might play better with launchCommands and attachCommands, which can contain arbitrary commands.


We’ll need to re-organize some things to support this; today we send the initialized event in the attach/launch handlers.

We can send the initialized event after the initialize response is sent, and delay the launch/attach request until configurationDone has been responded to.

I also wonder how we should handle setBreakpoints and friends. Should we collect them and delay responding until we’ve handled configurationDone?

I’m working on a PR/patch set that does exactly that. I’m still debugging some issues and I expect it’ll take some work to update all the tests.

I think that shouldn’t be an issue. It’s totally fine to set a breakpoint before a target has been created. Once they resolve, we’ll send a breakpoint event.


To summarize, I think the flow we’re expecting is:

  1. request initialize
  2. response initialize
  3. event initialized [1]
  4. request attach or launch [2]
  5. request setBreakpoints, setFunctionBreakpoints, etc.
  6. response setBreakpoints, … etc.
  7. request configurationDone
  8. response configurationDone [3]
  9. response attach or launch
  10. event process and an optional stopped event if "stopOnEntry": true

  1. triggers setBreakpoints, … configurationDone flow ↩︎

  2. delayed until configurationDone ↩︎

  3. After this is when we’d unblock handling launch or attach ↩︎
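
For illustration, here’s the flow above as a minimal client-side sketch in Python, driving lldb-dap over stdio (the program path, breakpoint location, and the lack of response/event parsing are simplifications):

import json
import subprocess

proc = subprocess.Popen(["lldb-dap"], stdin=subprocess.PIPE,
                        stdout=subprocess.PIPE)
seq = 0

def send(command, arguments=None):
    """Send one DAP request with Content-Length framing."""
    global seq
    seq += 1
    body = json.dumps({"type": "request", "seq": seq, "command": command,
                       "arguments": arguments or {}}).encode()
    proc.stdin.write(b"Content-Length: %d\r\n\r\n" % len(body) + body)
    proc.stdin.flush()

send("initialize", {"adapterID": "lldb-dap"})
# ... read the initialize response, then the initialized event ...
send("launch", {"program": "/tmp/a.out", "stopOnEntry": True})
send("setBreakpoints", {"source": {"path": "/tmp/main.c"},
                        "breakpoints": [{"line": 1}]})
send("configurationDone")
# With the flow above, the launch response should only arrive after the
# configurationDone response, followed by the process event and an
# optional stopped event.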

I 100% agree with this

I’ve created a draft PR: [lldb-dap] Change the launch sequence by JDevlieghere · Pull Request #138219 · llvm/llvm-project · GitHub.

The PR delays handling the launch and attach requests, which means we’re not sending a response until after the configuration is done. That’s causing all the tests to time out because our test harness isn’t expecting that behavior. Visual Studio Code doesn’t mind. I’m curious if other clients are expecting that and I’d need to check if the spec says anything about that.

The spec says that the response to the launch and attach request “is just an acknowledgement, so no body field is required”, which means that we could send the response before handling the request. The PR doesn’t do that because:

  1. The current implementation uses the response for errors, which we wouldn’t be able to do if we want to send an acknowledgement before actually doing the launch or attach.
  2. The current design would require work to support sending a response before handling the request.

If folks have an opinion please chime in here or comment on the PR.

The potential downside of this is that breakpoints will appear grayed out (unresolved) at the beginning. This already happens today if a breakpoint is in a module that hasn’t been loaded yet. However, we may want to make sure they have a chance to update before the auto-continue, in case a breakpoint is very early, like an entry-point breakpoint.

Yeah that’s a fair observation. I don’t think it can be avoided though. If you look at the launch sequence diagram, there’s no way you could ever resolve a breakpoint before you know what your target is:

[launch sequence diagram: init-launch]

In practice, the breakpoints shouldn’t remain unresolved for long. As soon as we create the target, the breakpoints will resolve and we’ll send breakpoint events to the client.
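
For reference, what the client receives once a location resolves is a DAP breakpoint event, roughly like this (shown as a Python literal; the id value is illustrative):

# Field names follow the DAP "breakpoint" event; the client uses
# "verified" to un-gray the breakpoint in the UI.
breakpoint_event = {
    "type": "event",
    "event": "breakpoint",
    "body": {
        "reason": "changed",
        "breakpoint": {"id": 1, "verified": True},
    },
}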

I reworked the launch sequence to be compliant with the protocol. I changed the tests to follow the protocol specification rather than mimic what Visual Studio Code does, though I agree with everyone else here that it’s important to have coverage for the way we’re called from Visual Studio Code, as that’s undoubtedly the most common client.

The improved launch sequence eliminates one source of unreliability, but even with my changes the tests are still unreliable, at least at my desk. Some tests consistently pass when run individually, but start failing when run in parallel.

Have others been able to reproduce the failures we were seeing on the bots or observe flakiness locally?

Yes, I have seen some failures both locally and in CI jobs, like these flakes with attaching flows. I’ve also seen some issues with breakpoint resolution during tests, where breakpoints start out invalid and it takes a small amount of time for them to become verified, which I don’t think we always handle correctly in the tests.