LSP Cancellations - making it universal and measurable?

Hey Ilya and Kadir,

Was trying to understand how much we (can) win from cancellation, and what’s involved in instrumenting a LSP method for cancellation.

Have a couple of proposals, wanted to discuss first them rather than sending a patch because they’re separate but interact.

  1. task should record the time of requested cancellation. For analysis, there are 3 interesting timestamps - task start, (optional) cancellation, and end. The creator of the task currently can get start and end (via context cleanup), but not cancellation time. This allows us to measure how much we can improve cancellation (by bailing out “lame duck” tasks earlier).
    Implementation is easy, just change the atomic to atomic<steady_clock::rep>

  2. we should support cancellation of any method, even if early bailout isn’t yet implemented.
    Main benefit: we can then measure the lame duck time for all methods (as above), and know where to implement early bailout. Side benefit: more uniform/less API churn.
    Implementation (I have a prototype):

  • LSP $/cancelRequest would move to JSONRPCDispatcher so it can cut across all methods. Cleanup of TaskHandles would be handled with a context destructor.
  • other users of ClangdServer would set up cancellation in the same way: by creating a task handle and calling setCurrentTask() before invoking a request. Or they can not do so if they don’t support cancellation (isCancelled() returns false if there’s no task).
    This is very similar to golang context cancellation.
  1. TUScheduler should be cancellation-aware
    This seems like an easy, cross-cutting win, but we should measure.

  2. We should hide the Task object - it adds API noise and it offers too many facilities to the wrong actors.
    (maybe this is controversial, and somewhat less related).

  • Task::isCancelled() appears to be redundant, you never want to check another task’s status and checking for cancellation in a tight loop is an antipattern that doesn’t seem worth optimizing for

  • cancel() is only for the creator of a task (current API enforces this)

  • TaskHandle/ConstTaskHandle/getCurrentTask just exist to support exposing these details

  • most actions have obvious extensions to cases where there is no active task, but the current API makes this awkward
    So this would leave us with something like (modulo names):
    bool isTaskCancelled();
    using Canceler = std::function<void()>;
    pair<Context, Canceler> startCancelableTask();
    (again, this borrows heavily from go

  1. Random thought: we could support deadlines (automatic cancellation after some time), this is useful in hosted scenarios though probably not for standalone workstation use.

WDYT?

Hi Sam,

A few things that come to mind:

  • other users of ClangdServer would set up cancellation in the same way: by creating a task handle and calling setCurrentTask() before invoking a request. Or they can not do so if they don’t support cancellation (isCancelled() returns false if there’s no task).

My stance is that explicit APIs for cancellation are better approach.
They clearly state the contract of the API, it’s impossible to accidentally reuse the existing task handle (ClangdServer is creating a new task every time) multiple times for different tasks from the client code and they are (arguably) easier to use from Cider.
The “task in the context” approach, OTOH, means users will have to read through the comments to even discover that the cancellation is there.

I may be missing the reasons on why the proposed approach is better. Any suggestions I’m missing?

  1. task should record the time of requested cancellation

+1

  1. TUScheduler should be cancellation-aware
    +1
  1. We should hide the Task object.
    (again, this borrows heavily from go)
    Go was definitely designed with cancellation (I assume of both RPCs and general computations?) in mind, but my view on this is that in C++ it’s more idiomatic to make the contracts like this explicit in the API. TaskHandle exists to make sure the callers of async APIs get an explicit object they can poke with, see the first comment about explicit vs implicit APIs.
  1. Random thought: we could support deadlines (automatic cancellation after some time), this is useful in hosted scenarios though probably not for standalone workstation use.
    +1, unless it’s trivial to do it on Cider side. In which case, maybe we could only support this in Cider to avoid adding the code to OS version that we won’t use outside our environment.

Guess I was right about which piece would be controversial.
(For those missing context, Cider is a non-public editor that embeds ClangdServer)

Hi Sam,

A few things that come to mind:

  • other users of ClangdServer would set up cancellation in the same way: by creating a task handle and calling setCurrentTask() before invoking a request. Or they can not do so if they don’t support cancellation (isCancelled() returns false if there’s no task).

My stance is that explicit APIs for cancellation are better approach.

I’d like to clarify what you mean by “explicit”, because the cancellation API is explicit in both cases, and the coordination between the cancellation and the cancel-check is (to a first approximation) magic in both cases.

I think the biggest difference is whether the cancelability of an operation is represented in its static type (e.g. function return type). Is this what you mean?

They clearly state the contract of the API, it’s impossible to accidentally reuse the existing task handle (ClangdServer is creating a new task every time) multiple times for different tasks from the client code and they are (arguably) easier to use from Cider.
The “task in the context” approach, OTOH, means users will have to read through the comments to even discover that the cancellation is there.

I guess I don’t see that as a bad thing? Cancellation is a cross-cutting concern that you mostly want to ignore, except at layers where: a) you can meaningfully abort early to save some work, or b) you want to start a chunk of work and maybe cancel it later.
Examples of why this seems right:

  • there’s no reason that a cancelable span should be coupled 1:1 with an LSP method
  • whether a function will respond to cancellation is always a dynamic property. This is a QoI issue, not a contract.

I may be missing the reasons on why the proposed approach is better. Any suggestions I’m missing?

  • It allows us to support cancellation of all methods without adding a bunch of bookkeeping to each one (or equivalent in templates)
  • it allows embedders to support cancelling methods at the layer it makes sense without plumbing cancellation objects around
  • it avoids divergences in APIs based on whether we’ve implemented early-exit for this operation or not
  • putting values in the return type that are not the result of the computation surprises the reader
  1. task should record the time of requested cancellation

+1

  1. TUScheduler should be cancellation-aware
    +1
  1. We should hide the Task object.
    (again, this borrows heavily from go)
    Go was definitely designed with cancellation (I assume of both RPCs and general computations?) in mind,

For what it’s worth, it wasn’t - Context and cancellation was very much bolted on in the same way it was in clangd.
The only real difference is they decided not to use TLS for Context. But they decided not to signal cancelability explicitly in the API for the usual reasons (intermediate stack frames don’t care).

It’s a good thing we kept from doing intrusive changes to all of the APIs, prototyping only for code completion. Hopefully the refactoring won’t be too painful.
It’s up to you to decide which approach to take at the end. Here’s my take on trying to convince that the current API make sense

I can see the benefits of the proposed API if we were designing cancellation for distributed web services where everything is an RPC and there is a clear need to propagate the cancellation information through RPC call chains, possibly in different processes or even on different machines.
It feels clangd is not like that, it’s a standalone tool (or a library) which at the end that manages its own async tasks. Propagating cancellation handles does not seem like a big problem there, everything is in a single process and data-flow is mostly simple.

E.g. if what we want is to move the cancellation handling to JSONRPCDispatcher, we could easily do it by making the dispatched methods return the TaskHandles that we need to store.
It’s a simple refactoring, spanning only one layer.
I doubt the other use-case that we have (aforementioned internal editor that embeds clangd) will be much work either, though I may be wrong.

I think the biggest difference is whether the cancelability of an operation is represented in its static type (e.g. function return type). Is this what you mean?

Yep, I meant encoding what we want in the type systems.

I guess I don’t see that as a bad thing?
My preference for encoding in type systems is that I can easily trace the created task back to whoever performs cancellations.
It feels like this is gonna be slightly trickier in the new approach, since not only the consumers, but also the producers of cancellable tasks can now be anywhere up the context call chain.

Cancellation is a cross-cutting concern that you mostly want to ignore, except at layers where: a) you can meaningfully abort early to save some work, or b) you want to start a chunk of work and maybe cancel it later.
Agree that the cancellation checking cuts though all the layers and requiring passing objects through all of the layers is a pain.
Not so sure about the “producers” of tasks. We can’t possibly keep functions that start cancellable work unchanged when adding cancellation.
In both the new proposal and the current implementation, we’ll have to:

  • Allow the functions that can be cancelled to report they were cancelled to the callers. I guess both approaches would use the CancelledError for this.
  • Mark the functions that support cancellation somehow. Either with a comment (the new proposal) or with an explicit return type (the current implementation).
    The cancellation tasks would be allowed to be created further up the call chain, but that adds some “magic” to the code and makes the clients responsible for creating the tasks properly and writing all the boilerplate.

Examples of why this seems right:

  • there’s no reason that a cancelable span should be coupled 1:1 with an LSP method
    Totally agree, why do you feel the cancellable span is coupled to an LSP method in the current API?
  • whether a function will respond to cancellation is always a dynamic property. This is a QoI issue, not a contract.
    Totally agree.
  • It allows us to support cancellation of all methods without adding a bunch of bookkeeping to each one (or equivalent in templates)
    I don’t think bookkeeping is much of a pain, most methods will just return TaskHandles and at the level where it’s appropriate to cancel, where one will need to store them.
    I guess the arguments are about the same as with other things that could potentially be passed through Context.
    I would prefer to use Context as little as possible and in that case I would prefer explicit data flow, but can see your point if you don’t think that way.
  • it allows embedders to support cancelling methods at the layer it makes sense without plumbing cancellation objects around
    Point taken, I view the fact that we can trace the data flow back to the cancellation sites as a benefit.
  • it avoids divergences in APIs based on whether we’ve implemented early-exit for this operation or not
    In the current implementation, the APIs that we believe should be cancellable should just always indicate they can be cancelled (i.e. create a new task), even if they won’t do early exit in practice.
  • putting values in the return type that are not the result of the computation surprises the reader
    Providing results in a callback also makes the code slightly tricky. Having cancellation boilerplate does not make the client code any better either. (At any layer really).
    Granted that it’s probably easier to just support cancelling everything somewhere up the call chain with the proposed implementation, but I would argue it’s not much harder in the current approach, the only thing that’s required is plumbing the TaskHandle up to that layer. Does not feel like a lot of work (but, again, we seem to disagree here).

Overall, if you feel strongly we should prefer the new approach over the current implementation, happy to submit to your point of view.
I feel it’s not much work to support the current implementation in the callers that we have, but if you don’t see the benefit in the current APIs and strongly feel the new ones are better, let’s just do the new ones.

After thinking more about it, I actually find my own arguments unconvincing and fully support the suggested API changes.
It occurred to me I was driven by the design that couples async task lifetimes to their cancellation mechanics.
But looking at it more closely, it does not seem that’s the right design. The scope of cancellable pieces of work can be arbitrary large and tying them to the lifetimes of actual async computations does not sound useful.
The “implicit dataflow” concerns that I raised are also mostly circumvented by the fact that we can simply track were cancellation functions are created and stored.

I take my arguments back, sorry for the overly long and defensive discussion, I guess I was overly defensive of the solution that I proposed.
Happy to help with the proposed changes too.

Trying just to reply where I can add something new.

It’s a good thing we kept from doing intrusive changes to all of the APIs, prototyping only for code completion. Hopefully the refactoring won’t be too painful.
It’s up to you to decide which approach to take at the end. Here’s my take on trying to convince that the current API make sense

I can see the benefits of the proposed API if we were designing cancellation for distributed web services where everything is an RPC and there is a clear need to propagate the cancellation information through RPC call chains, possibly in different processes or even on different machines.
It feels clangd is not like that, it’s a standalone tool (or a library) which at the end that manages its own async tasks. Propagating cancellation handles does not seem like a big problem there, everything is in a single process and data-flow is mostly simple.

E.g. if what we want is to move the cancellation handling to JSONRPCDispatcher, we could easily do it by making the dispatched methods return the TaskHandles that we need to store.
It’s a simple refactoring, spanning only one layer.

That doesn’t sound easy. The practical problem I’m trying to solve is: I want to make all LSP actions cancelable, to determine whether we’re missing optimizations. As far as I can see, this requires adding somewhat-subtle boilerplate to ~20 ClangdServer methods, and updating all their out-of-tree callers.

I doubt the other use-case that we have (aforementioned internal editor that embeds clangd) will be much work either, though I may be wrong.

I think the biggest difference is whether the cancelability of an operation is represented in its static type (e.g. function return type). Is this what you mean?

Yep, I meant encoding what we want in the type systems.

I guess I don’t see that as a bad thing?
My preference for encoding in type systems is that I can easily trace the created task back to whoever performs cancellations.
It feels like this is gonna be slightly trickier in the new approach, since not only the consumers, but also the producers of cancellable tasks can now be anywhere up the context call chain.

Cancellation is a cross-cutting concern that you mostly want to ignore, except at layers where: a) you can meaningfully abort early to save some work, or b) you want to start a chunk of work and maybe cancel it later.
Agree that the cancellation checking cuts though all the layers and requiring passing objects through all of the layers is a pain.
Not so sure about the “producers” of tasks. We can’t possibly keep functions that start cancellable work unchanged when adding cancellation.

Yeah this is more subtle.
The scope of a cancellation is a concern of the caller, not the callee (the callee just cares that it is in scope).
Suppose I have a function handleUserRequest() that calls runCompletion() which calls {addDocument(); codeComplete()}. Then the user closes the file.
It’s the “handleUserRequest” we logically want to cancel. “runCompletion” is the one I think should be cut across.

Corollary: I think only the entity that actually wants to cancel things should create cancelable scopes.

In both the new proposal and the current implementation, we’ll have to:

  • Allow the functions that can be cancelled to report they were cancelled to the callers. I guess both approaches would use the CancelledError for this.
  • Mark the functions that support cancellation somehow. Either with a comment (the new proposal) or with an explicit return type (the current implementation).

In the proposal, this is a class/codebase comment, and the is no distinction between functions. (“ClangdServer functions/TUScheduler tasks support cancellation on a best-effort basis”).

The cancellation tasks would be allowed to be created further up the call chain, but that adds some “magic” to the code and makes the clients responsible for creating the tasks properly and writing all the boilerplate.

Examples of why this seems right:

  • there’s no reason that a cancelable span should be coupled 1:1 with an LSP method
    Totally agree, why do you feel the cancellable span is coupled to an LSP method in the current API?

(Sorry, I meant to say ClangdServer method, which is not quite the same)
Because ClangdServer creates the spans.

  • whether a function will respond to cancellation is always a dynamic property. This is a QoI issue, not a contract.
    Totally agree.
  • It allows us to support cancellation of all methods without adding a bunch of bookkeeping to each one (or equivalent in templates)
    I don’t think bookkeeping is much of a pain, most methods will just return TaskHandles and at the level where it’s appropriate to cancel, where one will need to store them.

Sorry, here I was referring to the ClangdServer/ClangdLSPServer implementation. The additions required to each cancelable methods are IMO moderately intrusive and error-prone.

I guess the arguments are about the same as with other things that could potentially be passed through Context.
I would prefer to use Context as little as possible and in that case I would prefer explicit data flow, but can see your point if you don’t think that way.

  • it allows embedders to support cancelling methods at the layer it makes sense without plumbing cancellation objects around
    Point taken, I view the fact that we can trace the data flow back to the cancellation sites as a benefit.
  • it avoids divergences in APIs based on whether we’ve implemented early-exit for this operation or not
    In the current implementation, the APIs that we believe should be cancellable should just always indicate they can be cancelled (i.e. create a new task), even if they won’t do early exit in practice.

Not opposed to this in principle (as long as “we believe should be cancelable” is “every async function”), but if you mean only ClangdServer methods, why?

  • putting values in the return type that are not the result of the computation surprises the reader
    Providing results in a callback also makes the code slightly tricky.

Having

cancellation boilerplate does not make the client code any better either. (At any layer really).
Granted that it’s probably easier to just support cancelling everything somewhere up the call chain with the proposed implementation, but I would argue it’s not much harder in the current approach, the only thing that’s required is plumbing the TaskHandle up to that layer. Does not feel like a lot of work (but, again, we seem to disagree here).

Overall, if you feel strongly we should prefer the new approach over the current implementation, happy to submit to your point of view.
I feel it’s not much work to support the current implementation in the callers that we have, but if you don’t see the benefit in the current APIs and strongly feel the new ones are better, let’s just do the new ones.

So I think we’re talking mostly about the “implementation” part of point 2 in the original email.
I think this is the right thing to do, but from a practical point of view, what I’m trying to do is instrument all APIs for cancellation.
Do you think this is should be done in a different way, or that it’s undesirable, or it’s not so important?
The cancellation code that’s in Clangd[LSP]Server code complete is intrusive enough that I don’t feel adding it is justified without an important performance win.

Sorry for the last long reply, didn’t see your message…

After thinking more about it, I actually find my own arguments unconvincing and fully support the suggested API changes.
It occurred to me I was driven by the design that couples async task lifetimes to their cancellation mechanics.
But looking at it more closely, it does not seem that’s the right design. The scope of cancellable pieces of work can be arbitrary large and tying them to the lifetimes of actual async computations does not sound useful.
The “implicit dataflow” concerns that I raised are also mostly circumvented by the fact that we can simply track were cancellation functions are created and stored.

Ack, I think that was the bit I didn’t explain well previously.

I take my arguments back, sorry for the overly long and defensive discussion, I guess I was overly defensive of the solution that I proposed.

Not at all, this has helped me (us both?) clarify thinking. And I don’t think these changes are so dramatic - the same mechanisms exposed in different ways.
(FWIW I didn’t have any idea about how cancellation APIs should work until we had a concrete design/implementation to examine, so I hope I didn’t come across too… stompy)

Sorry for the late reply,

Seems like most of my concerns has already been argued. Just to state my view:

  • I also think current implementation is quite intrusive on the caller site and it was going to be even more intrusive when we want to propagate cancellation error from lower layers of call. The new implementation seems to be a lot more friendly in that manner.
  • I am also with the idea of making cancellable tasks standout on their own, preferably with type info rather than comments, but if we make every task cancellable then there is no need to worry about marking them.
  • Apart from those I totally agree with 1) and 3) in the starting mail.
  • For 3) actually pair<Context, Canceler> startCancelableTask(); seems kind of scary that was one of the reasons we wrapped them within Task. But apparently current implementation also introduces a similar amount of error proneness so I am OK with the new api.
  • For 5) I think implementing it in clangd side should also be pretty much easy(like checking for elapsed time in addition to cancellation token), but it doesn’t seem to be very useful for LSP since it should be clients responsibility to decide that response for a request is no longer necessary and cancel that one.