[RFC] lldb-dap refactoring to support async operations and cancellation

Abstract

Consider using lldb-dap today on a large binary. For example, I have been debugging an iOS application (a debug build) that is ~650MB plus 1.3GB of dSYMs.

If you hit a breakpoint and try to perform operations like stepping or inspecting a variable, it is not uncommon for lldb-dap to block and appear to hang for a bit.

This is primarily due to the sequential nature of how lldb-dap handles requests.

For example, when you hit a breakpoint the DAP flow is:

event 'stopped'
request 'threads'
request 'stackTrace' { levels: 1, threadId: '<stopped-thread>' }
request 'breakpointLocations' { '<stackTrace[0]>' }
request 'stackTrace' { levels: 19, startFrame: 1, threadId: '<stopped-thread>' }
request 'scopes' { frameId: '<stopped-thread-frame>' }
request 'variables' { variablesReference: 1 }

This chain of events alone took ~3s on the app I am debugging, during which you cannot step or continue the process. If you then step, there is another delay of roughly 3s as the same data is fetched for the new stopped event.

Additionally, if you accidentally hovered over a very large or complex variable (say, a UIViewController with many properties) and then tried to step, your step request would be blocked by the hover request.

To get a sense of how this affects lldb-dap, I did some profiling while stopping and stepping through the large app (which includes Swift, Obj-C, and C++ code). Here is a breakdown of the lldb-dap time spent:

  • request_variables (35% of my trace)
  • request_stackTrace (27%)
  • request_threads (16%)
  • request_breakpointLocations (8%)
  • (other…)

The Debug Adapter Protocol has a feature to help address this: the cancel request. For example, if you hover over a variable and then move the cursor off the variable before the response has been sent, a cancel request will be made.

Debug Adapter Protocol Cancel Spec

Proposal

I think that, to address the responsiveness issues and improve the overall debugging experience, we can refactor lldb-dap to support cancelling requests.

To support the ability to cancel a request, we will need to refactor our IO handling and protocol support to break requests into smaller operations that we can process as they arrive, and to add a mechanism for queuing operations.

A very similar problem, and an implementation of a similar protocol, already exist in the LLVM project: clangd’s handling of the Language Server Protocol. In clangd, there is a task dispatcher that queues operations on a per-document basis. We don’t have a precise equivalent of a document in the debugger world.

I started a prototype here ([lldb-dap] Creating a new mechanism for registering async request han… · ashgti/llvm-project@5248b79 · GitHub) that splits lldb_dap::DAP::Loop into a reader thread and a queue. The queue does not yet support cancellation, but this is a first step toward creating an async flow.

Additionally, for handling requests, I updated the request handler registration to take a callback, for example:

void request_foo(
    DAP &dap,
    const FooArguments &Args,
    Callback<FooResponseBody> Reply) {
  // ... do some blocking operation, then schedule an async reply ...
  dap.PerformAsyncWork([Reply = std::move(Reply)](auto result) mutable {
    // ... Reply off the request evaluation thread ...
    Reply(std::move(result));
  });
}

With this change, we’re more easily able to break up calls into async operations that then call Reply(resultOrError).

The basic flow with this change would be to read requests as they come in, appending them to a queue, and then have a task management mechanism dequeue and dispatch them. For the first iteration of this change, I think we can have only a single worker evaluating requests. SBDebugger does have a mechanism for requesting an interrupt, SBDebugger::RequestInterrupt(), which makes an attempt at interrupting a running operation. We may be able to use the RequestInterrupt call to stop some inflight operations, but I think that would be something to consider on a case-by-case basis (e.g. we may not want to cancel an inflight evaluate request, but an inflight variables request may be okay to cancel).
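
To make the shape of that concrete, here is a rough sketch of such a queue (illustrative only; RequestQueue and its methods are invented names, not something that exists in the prototype):

#include <condition_variable>
#include <deque>
#include <mutex>
#include <optional>

#include "llvm/Support/JSON.h"

// Hypothetical queue of parsed requests. The reader thread calls Push() as
// packets arrive; a single worker thread calls Pop() and dispatches each
// request in order.
class RequestQueue {
public:
  void Push(llvm::json::Object request) {
    {
      std::lock_guard<std::mutex> lock(m_mutex);
      m_requests.push_back(std::move(request));
    }
    m_cv.notify_one();
  }

  // Blocks until a request is available, or returns std::nullopt on shutdown.
  std::optional<llvm::json::Object> Pop() {
    std::unique_lock<std::mutex> lock(m_mutex);
    m_cv.wait(lock, [&] { return m_done || !m_requests.empty(); });
    if (m_requests.empty())
      return std::nullopt;
    llvm::json::Object request = std::move(m_requests.front());
    m_requests.pop_front();
    return request;
  }

  void Shutdown() {
    {
      std::lock_guard<std::mutex> lock(m_mutex);
      m_done = true;
    }
    m_cv.notify_all();
  }

private:
  std::mutex m_mutex;
  std::condition_variable m_cv;
  std::deque<llvm::json::Object> m_requests;
  bool m_done = false;
};

With something like this, cancelling a request that has not started yet amounts to removing it from m_requests before the worker pops it, while cancelling one that is already running would fall back to SBDebugger::RequestInterrupt().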

One major challenge I’ve come across while prototyping is needing some way to either cancel a read or perform a select-style call on the input stream. The existing SelectHelper only supports selecting on sockets at the moment, not pipes or file handles. This may be something to investigate improving, but I’m not as familiar with the Windows APIs.


I think supporting cancellation would be very useful, though I would implement it slightly differently:

  • Have a reader thread that enqueues all incoming requests, except cancel requests. When a cancel request comes in, it flushes the queue and calls SBDebugger::RequestInterrupt(), which will attempt to interrupt the ongoing operation. The interrupt mechanism is fully cooperative and I think it’s always safe to call this.
  • Have a worker thread that pops requests from the queue one by one. If a cancel request came in, and it wasn’t handled by the ongoing operation, then this thread would need to call CancelInterruptRequest.

That achieves the same goal but without the need for the callback. I also think that will be easier to reason about, especially since LLDB is not completely thread safe. If in the future we decide that some work can or should happen in parallel (e.g. dealing with events such as progress events) we can create another dedicated worker thread.

Clangd already implements cancellation, as it’s also part of the LSP. In my experience, it runs quite smoothly, even under fairly heavy load, so it might provide some inspiration as well.

The request comes in here: llvm-project/clang-tools-extra/clangd/ClangdLSPServer.cpp at 57bac14f4bcc8b43edc3e27be3d93609f7f4037b · llvm/llvm-project · GitHub

The cancel request is supposed to cancel a specific operation, not the entire queue. I’m not sure what would happen if we flushed everything.

However, cancelling requests that we haven’t started processing yet is easy (just pluck them out of the queue), so I think that a slightly modified version of your approach would work (and I’m highly supportive of that – particularly because we’ve been discussing something very similar as well :P):

If a cancel request comes in and the request is currently being processed, then set the interrupt flag. The thread handling the request clears the flag once it regains control (returns from an lldb call and reaches a safe point).
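
A minimal sketch of what that could look like on the worker side, assuming lldb-dap’s existing DAP type plus two hypothetical helpers (DispatchRequest and MarkCancelled); RequestInterrupt/CancelInterruptRequest/InterruptRequested are, as far as I know, the relevant SBDebugger calls:

// Worker thread: run one request, then clear any interrupt raised by a
// cancel that arrived while this request was in flight.
void HandleOneRequest(DAP &dap, const llvm::json::Object &request) {
  // Hypothetical helper that looks up and invokes the right request handler.
  llvm::json::Object response = DispatchRequest(dap, request);

  // If a cancel came in mid-request, the lldb call we were in may have bailed
  // out early. We are back at a safe point, so clear the flag and report the
  // request as cancelled instead of successful.
  if (dap.debugger.InterruptRequested()) {
    dap.debugger.CancelInterruptRequest();
    MarkCancelled(response); // hypothetical helper
  }

  dap.SendJSON(llvm::json::Value(std::move(response)));
}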

Fair enough. The scenario I had in mind is where you have a few operations queued like showing variables and then you do a step, at which point you don’t really care about everything that came before it, but it sounds like that’s a slightly different use case.

Yea, cancel takes a specific requestId to cancel, but if you hit a breakpoint and then step quickly, the DAP client should cancel the inflight requests for threads/stacks/frames/variables.

In terms of the flow of events we might see something like:

event 'stopped'
request 'threads' {seq=x}
request 'stackTrace' {seq=x+1, args=<stopped-thread>}
request 'cancel' {requestId=x}
request 'cancel' {requestId=x+1}
request 'step'

In my prototype I was able to get both forms of cancelling working: removing a pending request that isn’t running yet, and interrupting a running request.

Here is the updated prototype with a few unit tests to cover both forms of cancel requests (pending and active): [lldb-dap] Creating a new mechanism for registering async request han… · llvm/llvm-project@3001a47 · GitHub

I think the updated signature for requests helps us achieve this by giving us a common call site that every response goes through. Specifically, today in lldb-dap the request handlers tend to create ad hoc llvm::json::Object’s that are not actually structured. Because of this, it’s hard to tell from lldb_dap::DAP::SendJSON whether a response should have been marked cancelled or not. In my prototype I used a helper type called ReplyOnce that ensures the handler sends a reply exactly once ([lldb-dap] Creating a new mechanism for registering async request han… · llvm/llvm-project@3001a47 · GitHub). This helper also gives me a spot to convert the response into a cancelled response if we detect the debugger was interrupted during the request.
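
The gist of the helper, heavily simplified (this sketch only shows the exactly-once bookkeeping; the real version in the linked commit also handles the cancellation rewrite):

#include <cassert>
#include <utility>

#include "llvm/ADT/FunctionExtras.h"
#include "llvm/Support/Error.h"

// Callback type used by the request handlers in the earlier examples.
template <typename T>
using Callback = llvm::unique_function<void(llvm::Expected<T>)>;

// Move-only wrapper that funnels every response through one call site and
// asserts the handler replies exactly once.
template <typename T> class ReplyOnce {
public:
  explicit ReplyOnce(Callback<T> reply) : m_reply(std::move(reply)) {}

  ReplyOnce(ReplyOnce &&other)
      : m_reply(std::move(other.m_reply)), m_replied(other.m_replied) {
    other.m_replied = true; // the moved-from object no longer owes a reply
  }

  ~ReplyOnce() { assert(m_replied && "handler never sent a reply"); }

  void operator()(llvm::Expected<T> result) {
    assert(!m_replied && "handler replied more than once");
    m_replied = true;
    // The real helper would check here whether the debugger was interrupted
    // and rewrite the response into a cancelled response before forwarding.
    m_reply(std::move(result));
  }

private:
  Callback<T> m_reply;
  bool m_replied = false;
};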

I did heavily borrow from the clangd implementation in my prototype. The LSP and DAP protocols are similar (but not identical).


Right, so it basically does what Xcode is doing (at least, what I’ve heard it’s doing), but the protocol is finer-grained in that it allows you to cancel individual requests if you need/want to. You don’t have to look this up if you don’t know, but I’m wondering if VSCode uses the cancel request in situations other than stepping (where it really does cancel just one request) – I don’t know, quickly expanding and un-expanding a variable or something…

This is a natural, though somewhat unfortunate (for us) order of requests as it means that if we are “too fast” at cancelling request x, we might begin to process x+1 only to find out that it’s being cancelled as well. Hopefully, in practice, we will manage to pop it off the queue before the worker thread starts to process it.

I can certainly understand this sentiment. I’m more frequently working with the gdb remote protocol, which uses a very similar setup and I’ve often wished it had a more explicit method for returning responses. I do have a question though.

What’s the advantage of passing a callback versus just returning the packet from the response handler? Is it the case that (some? which ones?) message handlers need to do work after sending the response packet? I didn’t look at all packets, but in e.g. request_evaluate from your prototype, the call to the callback is always immediately followed by a return, which means that it could just return the response and let the caller do whatever processing it needs. (The reason I’m trying to do this is because the packet handlers are themselves callbacks, so passing another callback to them is starting to feel spaghetti-like. Maybe that can’t be avoided, but if it can, I’d like to try that…)

The main advantage is that if we did have an async operation, it would be easier to move that work off the thread that is handling requests.

Today, the requests tend to follow a pattern like:

void request_foo(DAP &dap, const llvm::json::Object &request) {
  llvm::json::Object response;
  FillResponse(request, response);
  const auto *arguments = request.getObject("arguments");
  //  ... do some work ...
  dap.SendJSON(llvm::json::Value(std::move(response)));
}

This pattern could be tweaked to a form like:

llvm::Expected<ResponseBody> request_foo(DAP &dap, const RequestArguments &args) {
  ResponseBody response;
  // ... do some work ...
  if (/* some condition */)
    return llvm::make_error<DAPError>(...);
  return std::move(response);
}

However, if we did need to move the work to another thread and send an async response then we’d have a hard time because of the return type.

By taking a callback, if we need to move some expensive work to another thread, or we want to send an async reply after waiting for some debugger action to take place, we can do so by moving the Reply argument. That might look like:

template <typename T>
using Callback = llvm::unique_function<void(llvm::Expected<T>)>;

void request_foo(DAP &dap, const RequestArguments &args, Callback<ResponseBody> Reply) {
  dap.queueBackgroundTask([Reply = std::move(Reply)]() mutable {
    ResponseBody response;
    // ... do some work ...
    if (/* some condition */)
      return Reply(llvm::make_error<DAPError>(...));
    Reply(std::move(response));
  });
}

It may be a bit premature to add callbacks to each response handler, but the way they’re structured today they’re not actually required to be fully synchronous.

Thanks for the additional context John. I definitely see the benefit of having a centralized place that deals with the replies and I really like having the requests return their response rather than sending the reply directly. I think those are nice improvements that can stand on their own.

I’m still not convinced about the callback though, but maybe that’s because we’re trying to achieve different things or because I’m missing something. My suggestion above tried to solve two things, so let me detangle those.

  1. Do you imagine all packets being handled asynchronously? Is that how clangd works? As I said earlier, I’m skeptical that’s going to work, at least not until we make more (all?) of the SB API thread safe. Pavel already picked up that I was talking about Xcode, but the reason they have a dedicated thread that handles LLDB requests sequentially is part of that.
  2. Assuming we want to go with the asynchronous approach, I still don’t see the need for the callback. In your example, instead of putting queueBackgroundTask inside request_foo, why not wrap request_foo in queueBackgroundTask, something like this:
dap.queueBackgroundTask([...](...){
  llvm::Expected<ResponseBody> maybe_response = request_foo(...);
  if (!maybe_response) {
    // Handle/report the error, then bail out.
    return;
  }
  Reply(*maybe_response);
});

clangd creates a queue for each open document. LSP requests are mostly oriented around documents, and to help improve overall responsiveness clangd handles each document in its own queue.

In the DAP protocol, I don’t think we have a direct analog to the document at the moment and I suppose I’m not sure if we have any sort of lower priority work we would want to handle asynchronously.

I think it might make sense to focus on the cancellation support for now and if we have a use case for async operations we can add support in the future.

I’ll take a little bit of time to see if I can simplify my prototype if we make the requests follow the form of:

llvm::Expected<ResponseBody> request_foo(DAP &dap, const RequestArguments &args) {
  ...
}

Okay, I have an updated prototype of this working: [lldb-dap] Support cancel requests and add explicit DAP structures. · ashgti/llvm-project@f939ad4 · GitHub

The evaluate request was the main request handler I used for tests and prototyping.

The updated structure of request handlers now looks like:

llvm::Expected<EvaluateResponseBody>
request_evaluate(DAP &dap, const EvaluateArguments &arguments) {
  EvaluateResponseBody body;
  // ... do some work ...
  if (/* check for errors... */)
    return llvm::make_error<DAPError>("...");
  return std::move(body);
}

I think I have the flow of the logic figured out so that we no longer need a select helper, and it shouldn’t need any magic sleeps in the unit tests (previously, I had a sleep between some requests to ensure the function was enqueued by the time the cancel request arrived).

On that note, it might be nice to also follow up with breaking up lldb-dap components in a way they can more easily be unit tested directly using gtest. But this is already a big enough change as is…

I took a look at your prototype and I like where it’s going.

I have some ideas in this space that I hope would work well with what you’re doing. It has always bothered me that all the request callbacks are listed in the main file. I would like them to be classes/objects that are instantiated with a DAP instance and provide a call operator (or, with your patch, an invoke or something that would return an Expected). Every request can go in its own file, with its documentation. Having a class will also help with the templates.
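
Roughly along these lines (an illustrative sketch only; the names here are invented and not taken from an actual patch):

// Each request handler becomes an object constructed with the DAP instance it
// operates on, so every handler can live in its own file with its docs.
class RequestHandler {
public:
  explicit RequestHandler(DAP &dap) : dap(dap) {}
  virtual ~RequestHandler() = default;

  // The DAP command this handler responds to, e.g. "evaluate".
  virtual llvm::StringRef GetCommand() const = 0;

  // Invoked by the dispatch loop for matching requests.
  virtual void operator()(const llvm::json::Object &request) const = 0;

protected:
  DAP &dap;
};

class EvaluateRequestHandler : public RequestHandler {
public:
  using RequestHandler::RequestHandler;
  llvm::StringRef GetCommand() const override { return "evaluate"; }
  void operator()(const llvm::json::Object &request) const override {
    // ... build the response and send it via `dap` ...
  }
};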

Let me know what you think. I’m volunteering to do that work, but at the same time I don’t want to step on your toes as it will create conflicts.

Edit: here’s what I have in mind: [lldb-dap] Move requests into their own object/file by JDevlieghere · Pull Request #128262 · llvm/llvm-project · GitHub.

That sounds reasonable to me.

I’d like to break up the prototype into smaller more reviewable chunks, but I don’t mind rebasing that work as needed.

I think it would be helpful to start with some of the helper types like Protocol.h/cpp (llvm-project/lldb/tools/lldb-dap/Protocol.h at lldb-dap-async-refactor · ashgti/llvm-project · GitHub) because that helps streamline/clean up a lot of serialization logic.
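
For a sense of what that buys us, here is the kind of toJSON/fromJSON pattern such protocol types enable, in the same style clangd uses for its protocol structs (a sketch with a made-up struct, not the actual contents of Protocol.h):

#include <cstdint>
#include <optional>
#include <string>

#include "llvm/Support/JSON.h"

// Hypothetical protocol struct, shown only to illustrate the pattern.
struct EvaluateArguments {
  std::string expression;
  std::optional<int64_t> frameId;
};

// llvm::json finds these overloads by ADL, so request handlers can parse and
// emit typed structs instead of hand-rolling llvm::json::Object code.
bool fromJSON(const llvm::json::Value &value, EvaluateArguments &args,
              llvm::json::Path path) {
  llvm::json::ObjectMapper o(value, path);
  return o && o.map("expression", args.expression) &&
         o.map("frameId", args.frameId);
}

llvm::json::Value toJSON(const EvaluateArguments &args) {
  llvm::json::Object result{{"expression", args.expression}};
  if (args.frameId)
    result["frameId"] = *args.frameId;
  return result;
}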

I like the direction of this. Thanks for explaining the use case; I now understand what you wanted to achieve with the callback. I think postponing that decision is the right thing to do, since I don’t think that is the only way to achieve something like this (like Jonas, my first instinct was also to wrap the entire request in an async function). I can also imagine a setup where ‘simple’ requests return the value directly, but more complicated ones (those that actually want to do some work in the background) get some sort of a callback or a future object so they can respond later.

Sorry for chiming in a few weeks late. I was transitioning to a new job ~~

In any case, this is amazing, and interestingly it is a feature that Meta has wanted for a long time. They have binaries for which some requests take many seconds or even minutes to serve, so I imagine this will become very useful to them.

If we are able to do a good job at removing from the queue requests that won’t be needed anymore, we’ll deliver very noticeable speedups.

Some requests fortunately have fields that indicate somehow if the action was triggered by the user or not, but I’m sure you guys know all the details.

Happy to help in anything needed.