Incorrect AST column numbers?

Please note that something appears to be wrong with this forum: the stdexcept header is missing from the #include in my code below, and one of the AST lines gets concatenated onto the previous AST line, even though everything looks fine while I am editing the post.

I have generated an AST for the following test code. I’ve added line numbers for reference:

  1. #include <stdexcept>
  2. int main()
  3. {
  4. throw std::runtime_error("ÎÄ¼þ");
  5. throw std::runtime_error("1234");
  6. }

I have a question about the four AST lines below: the first two are for the first throw statement above, and the last two are for the second. The AST indicates that the statement on line 4 extends to column 36, whereas it indicates column 32 for the statement on line 5. However, the code for both statements clearly extends only to column 32. This is a significant problem when trying to parse items in the source code based on the AST column information. Is this a bug?

| |-ExprWithCleanups 0x217cd583800 <line:4:1, col:36> 'void'
| | `-StringLiteral 0x217cd5836e8 <col:26> 'const char [9]' lvalue "\303\216\303\204\302\274\303\276"

| |-ExprWithCleanups 0x217cd5839f0 <line:5:1, col:32> 'void'
| | `-StringLiteral 0x217cd5838d8 <col:26> 'const char [5]' lvalue "1234"

You could try delimiting the code blocks with triple-backtick lines, which should turn off the default formatting. For example:

#include <stdexcept>

The column numbers being off looks to be a result of counting bytes in the line. Your line 4 contains 4 non-ASCII UTF-8 characters that take up 2 bytes each, so it is 4 bytes longer than a line with the same number of pure ASCII characters.
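
To make the arithmetic concrete, here is a minimal sketch (not clang's actual implementation) that uses the exact bytes from the StringLiteral dump; the countCodePoints helper is just a hypothetical continuation-byte counter:

```cpp
#include <cstddef>
#include <cstdio>
#include <string>

// Count UTF-8 code points by skipping continuation bytes (10xxxxxx).
static std::size_t countCodePoints(const std::string &s) {
    std::size_t n = 0;
    for (unsigned char c : s)
        if ((c & 0xC0) != 0x80)
            ++n;
    return n;
}

int main() {
    // The bytes clang printed for line 4's literal, and line 5's ASCII literal.
    const std::string utf8  = "\xC3\x8E\xC3\x84\xC2\xBC\xC3\xBE";
    const std::string ascii = "1234";

    // The opening quote is at column 26 in both lines, so a byte-based column
    // for the closing ')' is 26 + bytes-in-literal + 2.
    std::printf("line 4: %zu code points, %zu bytes -> end col %zu\n",
                countCodePoints(utf8), utf8.size(), 26 + utf8.size() + 2);
    std::printf("line 5: %zu code points, %zu bytes -> end col %zu\n",
                countCodePoints(ascii), ascii.size(), 26 + ascii.size() + 2);
}
```

Counting bytes gives end columns 36 and 32, which matches the dump; counting characters would give 32 for both lines.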

As to whether or not this is a bug: properly determining length when it comes to Unicode characters is a challenging proposition, since there are multiple potential definitions which all have various downsides:

  • Counting the number of bytes the text takes up. This is algorithmically trivial, but tends to produce noticeably wrong results for anything non-ASCII.
  • Counting the number of Unicode codepoints. This is easy, if not trivial, but can also produce weird results (e.g., a decomposed e + combining accent versus a precomposed é, which look identical but consist of 2 and 1 Unicode codepoints, respectively; see the sketch after this list).
  • Counting the number of Unicode grapheme clusters. At this point, you need to start embedding some noticeably hefty tables to compute this properly. And it still has issues, most notably that it doesn't account for the different widths that characters take up in a fixed-width context (e.g., the unified CJK ideographs usually take up the same space as 2 ASCII characters in a fixed-width font).
  • There is a Unicode computed property for the amount of fixed-width text a codepoint will take up… but, IIRC, it actually isn’t properly filled in for all characters.
  • And don’t forget the issue of tab stops, wherein a character has a variable width!
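
Here is a tiny illustration of the first two options, using two spellings of the same visible é (strings chosen purely for illustration):

```cpp
#include <cstdio>
#include <string>

int main() {
    // Both spellings render the same glyph, but differ in bytes and code points.
    const std::string precomposed = "\xC3\xA9";  // U+00E9: 1 code point, 2 UTF-8 bytes
    const std::string decomposed  = "e\xCC\x81"; // U+0065 U+0301: 2 code points, 3 bytes

    std::printf("precomposed: %zu bytes\n", precomposed.size()); // prints 2
    std::printf("decomposed:  %zu bytes\n", decomposed.size());  // prints 3
}
```

A byte-based column and a codepoint-based column already disagree here, before grapheme clusters or display width even enter the picture.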

I'll note that my usual text editor reports the column number as both "number of fixed-width character spaces from the left gutter" and "number of bytes following the beginning of line" when those numbers diverge (which happens with non-ASCII text or tabs), which suggests that taking the easy route and relying on bytes for column widths isn't an incorrect option.

LLVM does have code to try to guess the rendered width of a string in a terminal; see llvm-project/Unicode.h at ddb85f34f534ed74312ef91e4c1f8792ad8f08f0 · llvm/llvm-project · GitHub . We don’t use it to compute column numbers; as the comment notes, it’s not based on any standard.