libclang: failing assert

Hi,

When using libclang's clang_annotateTokens, I have a consistently failing assert in CIndex.cpp, line 5429. This is for both 3.2 and trunk. When I comment the assert out, everything looks to be fine. Is this assert valid, and if so, what is it checking?

Regards,
Erik.

Hi,

When using libclang's clang_annotateTokens, I have a consistently failing assert in CIndex.cpp, line 5429. This is for both 3.2 and trunk. When I comment the assert out, everything looks to be fine. Is this assert valid, and if so, what is it checking?

It makes sure that when lexing for preprocessor directives tokens, that any cursors produced will refer to the correct clang_tokenize'd token.
Are you able to provide a preprocessed file for testing ?

Hi,

When using libclang's clang_annotateTokens, I have a consistently failing assert in CIndex.cpp, line 5429. This is for both 3.2 and trunk. When I comment the assert out, everything looks to be fine. Is this assert valid, and if so, what is it checking?

It makes sure that when lexing for preprocessor directives tokens, that any cursors produced will refer to the correct clang_tokenize'd token.
Are you able to provide a preprocessed file for testing ?

You can use "-E -frewrite-includes" to preserve preprocessor directives.

I can’t reproduce using:

$ c-index-test -test-annotate-tokens=main.pp:1:1:100000:1 -x c++ main.pp > /dev/null

Does it occur with a particular source range ? What is the file whose tokens you are annotating ? Does using c-index-test on it reproduce the assertion hit ?
If yes, could you separate the includes of that file, for example:

#include “a.h”
#include “b.h”
#include “c.h”

====>

#include “preprocessed-headers.h”

I cannot reproduce it either with c-index-test, so I modified it to do it in the same way as I use the annotation. Patch is attached. With that patch, I get:

erik@Road-Warrior:untitled6$ ~/dev/builds/llvm-release-build/bin/c-index-test -test-annotate-tokens=main.pp:62024:1:1000000:1 -std=gnu++98 -x c++ main.pp
Assertion failed: (Tok.getLocation() == SourceLocation::getFromRawEncoding(Tokens[TokIdx].int_data[1])), function annotatePreprocessorTokens, file /Users/erik/dev/clang-llvm/llvm-git/tools/clang/tools/libclang/CIndex.cpp, line 5429.
libclang: crash detected while annotating tokens
Identifier: “std” [62025:17 - 62025:20]
Identifier: “main” [62027:5 - 62027:9]
Identifier: “cout” [62029:5 - 62029:9]
Identifier: “endl” [62029:31 - 62029:35]

When I move everything up to (and including) line 62022 into the a header file (like you mentioned above), and include that file, I get exactly the same crash.

my-c-indexing.patch (1.85 KB)

Maybe a bit of explanation might be needed here to explain the “why” part…

What is done in that patch is to
a) iterate over all tokens, and select (copy) all those that are of the CXToken_Identifier kind into another array, and then
b) call clang_annotateTokens with that specific array

So not all tokens are queried, but only the identifiers, which somehow seems to trigger the assert.

The reason is that when doing semantic highlighting for C/C++/ObjC/ObjC++, only identifiers can get a different meaning/use when the preprocessor is involved, or be marked with their specific kind of declaration. The reason for selecting only the identifiers is that querying only for these tokens already flagged up as a significant cost when doing time profiling (back in the (I think) 3.0 days). I could change the code to ask for all tokens and skip the less interesting ones, but I was wondering if that would be band-aid to work around a bug in clang_annotateTokens or not. Hence my question on what that assert actually checks.

That said, I do have to come back to my remark about removing the assert and that showing no impact. After using it like that for some real code-editing, I noticed that the first couple (10? 20? 50?) identifiers do not get annotated correctly. Well, not at all. Which would actually be kinda fine with me, but then I’m wondering why the clang_annotateTokens function takes a list of tokens, instead of, say, a TU.

I’m kind of curious about that assert, and might find some time to look into it, but I’d appreciate some help/explanation on what’s being checked there before I do that.

– Erik.

Maybe a bit of explanation might be needed here to explain the “why” part…

What is done in that patch is to
a) iterate over all tokens, and select (copy) all those that are of the CXToken_Identifier kind into another array, and then
b) call clang_annotateTokens with that specific array

So not all tokens are queried, but only the identifiers, which somehow seems to trigger the assert.

Ok I see.

The reason is that when doing semantic highlighting for C/C++/ObjC/ObjC++, only identifiers can get a different meaning/use when the preprocessor is involved, or be marked with their specific kind of declaration. The reason for selecting only the identifiers is that querying only for these tokens already flagged up as a significant cost when doing time profiling (back in the (I think) 3.0 days).

Hmm, there were improvements and I’d be surprised if it makes much difference now.

I could change the code to ask for all tokens and skip the less interesting ones, but I was wondering if that would be band-aid to work around a bug in clang_annotateTokens or not. Hence my question on what that assert actually checks.

That said, I do have to come back to my remark about removing the assert and that showing no impact. After using it like that for some real code-editing, I noticed that the first couple (10? 20? 50?) identifiers do not get annotated correctly. Well, not at all. Which would actually be kinda fine with me, but then I’m wondering why the clang_annotateTokens function takes a list of tokens, instead of, say, a TU.

I’m kind of curious about that assert, and might find some time to look into it, but I’d appreciate some help/explanation on what’s being checked there before I do that.

The function annotatePreprocessorTokens in CIndex.cpp assumes that the array given to it came from clang_tokenize so it doesn’t try to skip tokens and it asserts to make sure its token index points to the right token in the given array.
It should be modified to only assume that the given tokens are in source order.