DraftStore::updateDraft compares lsp's rangeLength in UTF16 code units against the computed rangeLength in bytes

Hi,

I’ve started using clangd with VSCode and discovered an issue when accidentally typing a unicode character and immediately removing it afterwards. When removing the unicode character clangd errors with: “Change’s rangeLength (1) doesn’t match the computed range length (3)”. Afterwards, clangd doesn’t work for the current file until it is restarted.

Since I had to build LLVM anyways to add support for passing relative paths to -compile-commands-dir, I tried to debug this issue as well. What I’ve discovered is that when the range length’s are compared in DraftStore::updateDraft (DraftStore.cpp:80) , the LSP’s range length seems to be specified in UTF16 code units while clangd’s computed range length is specified in bytes (see positionToOffset in SourceCode.hpp:83). This causes the check to fail which causes the error in VSCode.

Two possible fixes I can see are:

  1. Remove the check (from a quick test clangd still functions perfectly after commenting out the check).
  2. Use the StartIndex and EndIndex to get a substring from the Contents variable from which we can then calculate the computed range length in UTF16 code units using the utf16Len function (SourceCode.cpp:72). This would require lifting out the utf16Len function into SourceCode.h or elsewhere since at the moment it’s a static function in SourceCode.cpp.

I also checked the LSP docs and they don’t explicitly say the rangeLength is specified in UTF16 code units. However, I’m assuming it is since a LSP Range consists out of two LSP Positions of which the LSP Character field is specified in UTF16 code units.

I’m not sure which solution would be preferred so I’m posting here first before implementing a fix.

Regards,

Daan De Meyer

Hi Daan,
Glad to meet you, thanks for helping out!

I’ve started using clangd with VSCode and discovered an issue when accidentally typing a unicode character and immediately removing it afterwards. When removing the unicode character clangd errors with: “Change’s rangeLength (1) doesn’t match the computed range length (3)”. Afterwards, clangd doesn’t work for the current file until it is restarted.

Since I had to build LLVM anyways to add support for passing relative paths to -compile-commands-dir, I tried to debug this issue as well. What I’ve discovered is that when the range length’s are compared in DraftStore::updateDraft (DraftStore.cpp:80) , the LSP’s range length seems to be specified in UTF16 code units while clangd’s computed range length is specified in bytes (see positionToOffset in SourceCode.hpp:83). This causes the check to fail which causes the error in VSCode.

That sounds right to me. The use of UTF-16 in the protocol is (IMO) unfortunate, but it is reality.
We added handling for this in the SourceCode helpers but it’s not surprising there are a few places we’re assuming bytes by mistake.
Unfortunately most clangd developers and our tests tend to code in ASCII only, so we haven’t shaken out those bugs.

Thanks for digging!

Two possible fixes I can see are:

  1. Remove the check (from a quick test clangd still functions perfectly after commenting out the check).
  2. Use the StartIndex and EndIndex to get a substring from the Contents variable from which we can then calculate the computed range length in UTF16 code units using the utf16Len function (SourceCode.cpp:72). This would require lifting out the utf16Len function into SourceCode.h or elsewhere since at the moment it’s a static function in SourceCode.cpp.

I also checked the LSP docs and they don’t explicitly say the rangeLength is specified in UTF16 code units. However, I’m assuming it is since a LSP Range consists out of two LSP Positions of which the LSP Character field is specified in UTF16 code units.

Yeah, I think this is a safe assumption.

I’m not sure which solution would be preferred so I’m posting here first before implementing a fix.

Keeping the sanity check and making it correct seems best to me (i.e. option 2). Either is better than the status quo, though.

My only concern is that we’ll break clients that make the equal and opposite mistake (sending byte lengths instead of UTF-16 lengths).
Maybe we should also downgrade this error to a log message, since we can presumably trust the length from the string. What do you think?

Cheers, Sam

My only concern is that we’ll break clients that make the equal and opposite mistake (sending byte lengths instead of UTF-16 lengths).
Maybe we should also downgrade this error to a log message, since we can presumably trust the length from the string. What do you think?

I think the check is useful just to verify if the client and server are on the same page so I’m partial to option 2 as well. Downgrading to a log message avoids outright breaking existing clients so that seems like the way to go. As soon as Github is back up I’ll make an issue asking to clarify the unit of rangeLength in the lsp repository as well.

Assuming we go with option 2 (including downgrading to an error message), I have two more questions:

  1. Which header should I put the utf16Len function in (from SourceCode.cpp)? I tried searching for an existing header with UTF functions but I didn’t immediately find one.
  2. How do I go about downgrading to a log message? Do I just write to errs()?

Regards,

Daan

My only concern is that we’ll break clients that make the equal and opposite mistake (sending byte lengths instead of UTF-16 lengths).
Maybe we should also downgrade this error to a log message, since we can presumably trust the length from the string. What do you think?

I think the check is useful just to verify if the client and server are on the same page so I’m partial to option 2 as well. Downgrading to a log message avoids outright breaking existing clients so that seems like the way to go. As soon as Github is back up I’ll make an issue asking to clarify the unit of rangeLength in the lsp repository as well.

Thanks!

Assuming we go with option 2 (including downgrading to an error message), I have two more questions:

  1. Which header should I put the utf16Len function in (from SourceCode.cpp)? I tried searching for an existing header with UTF functions but I didn’t immediately find one.

I think SourceCode.h is a reasonable fit within clangd. (There are some UTF libraries in llvm/support, but it’s not clear this belongs there and it’d be a bunch of work).

Consider renaming it something like lspLength() or so and keeping the bits about UTF-16 in the function comment.
That’s kind of a low-level detail.

  1. How do I go about downgrading to a log message? Do I just write to errs()?

We use the log() functions in clangd/Logger.h. (Abstracts the destination, handles locking, does formatting).

This one is scary+rare enough it should probably be an error log.
The functions use formatv syntax, so something like:
elog(“bad TextDocumentContentChangeEvent: rangeLength={0} but computed length {1} for {2}”,
*change.rangeLength, Len, change.text);

Cheers, Sam

After implementing option 2, I’ve changed my mind and I think we should remove the check.

Right now, DraftStore::UpdateDraft takes the start position and end position of the LSP message (specified in UTF-16 code units) and calculates the length of the range in bytes. What we’d be doing with option 2 is converting that length back to UTF-16 code units and checking whether it’s equal to the rangeLength of the LSP message. This is nothing more than a really complicated check to see whether ‘LSP.end - LSP.start == LSP.rangeLength’. The check could just as easily done without the conversion to bytes, but then we’re really only checking whether the client calculated the rangeLength correctly. Since the rangeLength isn’t used by the rest of clangd I propose to just ignore it. rangeLength seems to be redundant since we just calculate it ourselves. I also found an issue about this in the LSP repository: https://github.com/Microsoft/language-server-protocol/issues/9.

Regards,

Daan

After implementing option 2, I’ve changed my mind and I think we should remove the check.

Right now, DraftStore::UpdateDraft takes the start position and end position of the LSP message (specified in UTF-16 code units) and calculates the length of the range in bytes. What we’d be doing with option 2 is converting that length back to UTF-16 code units and checking whether it’s equal to the rangeLength of the LSP message. This is nothing more than a really complicated check to see whether ‘LSP.end - LSP.start == LSP.rangeLength’.

Yes, but start and end aren’t just UTF-16 code unit offsets, they’re (line, col) pairs where col is in UTF-16.

So it’s an indirect assertion about the state of the remote buffer (you can’t convert between offsets and line/col without the buffer).
I think there’s some value in validating that assertion, particularly in light of the fact that the protocol talks about it in two different encodings (the actual buffer content is sent as JSON-escaped strings over UTF-8) and we store as a third (UTF-8).

But it’s not critical. Up to you.

Cheers, Sam