Understanding AST Cursors, Tokens, Includes

I’m experimenting with syntax highlighting using the AST. What I want to end up with is a stream of tokens for a file, each annotated with a scope: the AST cursors it belongs to (broadest to narrowest), followed by its token kind.

The idea is then to apply scoping rules to these, similar to how TextMate’s parser does it. That way I can colour an IDENTIFIER inside a PARM_DECL differently from an IDENTIFIER on its own.
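A rough sketch of what such rule matching could look like. Everything here is an assumption made for illustration: the `ScopedToken` record, the `matches` and `pick_style` helpers, and the selector semantics (space-separated parts must appear in order in the scope, and the last part must equal the token kind). It is loosely modelled on TextMate scope selectors and is not something libclang provides.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass
class ScopedToken:
    """Hypothetical record: token text plus its scope,
    broadest cursor kind first, token kind last."""
    spelling: str
    scope: Tuple[str, ...]


def matches(selector: str, scope: Tuple[str, ...]) -> bool:
    """True if the selector's last part equals the token kind and the
    earlier parts appear, in order, among the enclosing cursor kinds."""
    parts = selector.split()
    if not parts or not scope or parts[-1] != scope[-1]:
        return False
    ancestors = scope[:-1]
    pos = 0
    for part in parts[:-1]:
        try:
            # Search only to the right of the previous match.
            pos = ancestors.index(part, pos) + 1
        except ValueError:
            return False
    return True


def pick_style(rules, token, default="plain"):
    """First matching rule wins, so list the most specific rules first."""
    for selector, style in rules:
        if matches(selector, token.scope):
            return style
    return default


# Hypothetical rule table and tokens.
RULES = [
    ("PARM_DECL IDENTIFIER", "parameter"),
    ("IDENTIFIER", "identifier"),
    ("COMMENT", "comment"),
]

param = ScopedToken(
    "argc", ("TRANSLATION_UNIT", "FUNCTION_DECL", "PARM_DECL", "IDENTIFIER")
)
bare = ScopedToken("foo", ("TRANSLATION_UNIT", "IDENTIFIER"))

print(pick_style(RULES, param))
print(pick_style(RULES, bare))
```

With rules ordered most specific first, the identifier inside the PARM_DECL picks up the parameter style while the bare identifier falls through to the generic one.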

The problems I’ve got are:

  1. I can’t just walk the cursor tree, as cursors don’t cover all tokens, so things like comments are missed
  2. I can’t just walk the token stream, as Tokens don’t include the AST context and you can only get the most detailed cursor from clang_annotateTokens
  3. I can’t walk back up from the detailed cursor, as cursors have no “parent” (and lexical and semantic are often both None). Even if I solve this, includes cause me problems:
  4. Although the AST walk includes everything imported from the include, I can’t find a way to get the include’s tokens, so I don’t even “see” the comments in it
  5. You get an INCLUSION_DIRECTIVE where the include starts, but there’s no way to find what the include path actually resolved to (without relying on there being an AST symbol inside it)
  6. You can’t get the scope of an include (from the AST) as the included file isn’t a TranslationUnit

Now, I’ve got a solution to all these - but it’s not pretty and I can’t help feeling I’m missing something obvious.

My solution works by:

  • for each ccmd in compile_commands.json
    • create a translation unit with PARSE_DETAILED_PROCESSING_RECORD (a root_tu)
    • for each token in root_tu
      • save it in a lookup table by its location
    • for each include in root_tu (from clang_getInclusions)
      • tweak the root_tu’s compiler arguments, replacing the compiled filename with the include path
      • create an inc_tu with PARSE_INCOMPLETE
      • for each token in inc_tu
        • save it in a lookup table by its location
    • walk the AST from root_tu (which goes into the included header)
      • for each token in the AST cursor
        • look up the token by location and append the cursor to its scope
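The bookkeeping in the last steps can be sketched without libclang at all. The `Tok` and `Cur` classes below are hypothetical stand-ins for libclang tokens and cursors, the `(file, offset)` lookup key is an assumption, and the sample data is made up. The point is only to show how a location-keyed table lets a pre-order AST walk build each scope from broadest to narrowest, while tokens that no cursor covers (such as comments) still survive in the stream.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

Location = Tuple[str, int]  # assumed key: (file name, byte offset)


@dataclass
class Tok:
    """Stand-in for a libclang token."""
    location: Location
    kind: str
    spelling: str


@dataclass
class Cur:
    """Stand-in for a libclang cursor: its kind, the tokens inside
    its extent, and its child cursors."""
    kind: str
    tokens: List[Tok]
    children: List["Cur"] = field(default_factory=list)


def build_table(token_streams):
    """Register every token from every translation unit (root and
    includes) under its location, with an initially empty scope."""
    table = {}
    for stream in token_streams:
        for tok in stream:
            table.setdefault(tok.location, (tok, []))
    return table


def annotate(cursor, table):
    """Pre-order walk: a parent is visited before its children, so
    each scope list grows from broadest to narrowest."""
    for tok in cursor.tokens:
        entry = table.get(tok.location)
        if entry is not None:
            entry[1].append(cursor.kind)
    for child in cursor.children:
        annotate(child, table)


def scoped_stream(table):
    """Flatten the table into (spelling, scope) pairs, token kind last."""
    return [(tok.spelling, tuple(scope) + (tok.kind,))
            for tok, scope in table.values()]


# Made-up data: main.c includes a.h, which holds "// note" and "int x;".
hash_tok = Tok(("main.c", 0), "PUNCTUATION", "#")
comment = Tok(("a.h", 0), "COMMENT", "// note")
kw = Tok(("a.h", 8), "KEYWORD", "int")
name = Tok(("a.h", 12), "IDENTIFIER", "x")
semi = Tok(("a.h", 13), "PUNCTUATION", ";")

root_tokens = [hash_tok]              # tokens seen via the root_tu
inc_tokens = [comment, kw, name, semi]  # tokens seen via the inc_tu

ast = Cur("TRANSLATION_UNIT", [hash_tok], [
    Cur("INCLUSION_DIRECTIVE", [hash_tok]),
    Cur("VAR_DECL", [kw, name]),
])

table = build_table([root_tokens, inc_tokens])
annotate(ast, table)
for spelling, scope in scoped_stream(table):
    print(spelling, scope)
```

In this toy run the comment from the header keeps a scope containing only its token kind, because no cursor covers it, whereas the identifier picks up the VAR_DECL that encloses it.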

As I said, it appears to work … but it’s not pretty.

Am I missing some easy way to do this, particularly when it comes to handling includes?