RFC: Flexible Lexer Buffering for Handling Incomplete Input in Interactive C/C++

vvassilev · July 31, 2022, 6:23am

tl;dr; Q: How can we support incremental user input where code available so far might be incomplete? A: By teaching the lexer infrastructure in Clang to optionally work with additional bytes when string literal, #ifdef directive, or brace block is cut-off just before reaching EOF.

Motivation

As per RFC of Moving (parts of) the Cling REPL in Clang we outlined several areas where Interactive C++ is useful and has seen a growing community. One of the key challenges in usability of a C++ interpreter such as Cling (and clang-repl) is understanding when the user intended to type in more input before compilation. For example,

[clang-repl] void f() { // #1
[clang-repl cont ] } // #2

Here line #1 is incomplete input and the user intent is to be completed in the next couple of input lines. This example can be relatively well handled by counting the imbalance in the punctuators (implementation experience in Cling can be seen here). Often people cut and paste in code which has a different coding style and the punctuator imbalance cannot be used as a reliable anchor. In addition, there are more subtle cases such as preprocessor directives (#ifdef) which require full lexing steps. (this has been raised in cling community before) For instance,

[clang-repl] void f() { 
[clang-repl] #ifdef __UNDEFINED_SYMBOL
[clang-repl] {
[clang-repl] #endif
[clang-repl] }

requires an ability to expand macro directive in order to properly decide if a function is indeed complete. A simple brace counting fails in this case.

This RFC intends to explore a more advanced input continuation mechanism based on the extension of Lexer and Preprocessor.

Implementation

The overall idea is to add a callback in the clang::Lexer which might provide extra characters to lex. The intent of this callback is to continue on the lexing and the preprocessing process when it reaches the end of the currently available buffer.

Alternatives considered

We have considered punctuator counting but it lacks reliability and generality. We have considered lexing the entire buffer every time a line is inserted (D129265) The biggest main issue of this approach is that it’s O(N^2) time complexity where N is the number of input lines. Another issue of this approach is that the compiler has taken the error path there, and the token stream might be augmented with recovery tokens which makes it fragile. Additionally, it can’t properly support string literal cut-off cases.

Proposed Implementation

The only feasible solution we see natural for the Clang pipeline is implementing a hook early in the clang::Lexer. There are three challenges:

Buffer is assumed to be fixed – each part of code handles eof in their own ways and doesn’t have a mechanism to request to grow the buffer.
Every part of code uses direct pointer (CurPtr, BufferPtr) to bytes not offset to buffer – once buffer is swapped every pointer needs to be updated including all trivial local variables.
Extension of issue 1: token points to the same fixed allocated buffer. Once we change the buffer, almost all tokens generated before get invalidated but they are cached in the upper level of the lexer.
Code has been optimized very heavily, we should introduce as little overhead as possible.

Here is a detailed proposed implementation that handles all of these issues while affecting the non-incremental clang world minimally.

We change the direct buffer pointers with buffer offset (CurPtr → CurOffset), and code will access the memory by BufferStart[CurOffset]. (Part 1)
1. The changes are concentrated in Lex.h and Lex.cpp, so it’s transparent to the outside world.
2. Code is very clear. Since we will not ever allow the change in consumed bytes (we only append the bytes) offset will be constant after the buffer is swapped.
3. Introduces a minor indirection to BufferStart. (BufferStart is very likely to be within L1 cache though)
To solve the token literal data invalidation issue, we will simply keep separate copies of literal data in incremental mode. (Part 2)
Finally, we will call the TryExpandBuffer function when Lexer reaches the end of the buffer. In non-incremental mode, TryExpandBuffer will simply return false which will trigger the traditional “eof token detected” diagnostics. In an incremental world, this will return true if it received additional bytes by the user callback function and continue on the Lexing. (Part 3)

Another ways were considered as well which we dropped eventually because of suboptimal code readability/amount of code changes:

Maintain BufferPtrsToUpdate in order to solve the buffer pointer invalidation issues, which required a smallest code change but made code a bit awkward.
Make Token, Lexer, Preprocessor, Parser generic and receive “LexerPtr” type parameter which will be simply const char * in a non-incremental world but will access the buffer by offset in incremental mode.

Implementation Plan

We will separate off our implementation into three patches.

The first patch will simply change Lexer to use offset to buffer instead of direct buffer pointer. We will make sure no breakage or performance regression is discovered here. This is the only part where our additions affect the non-incremental clang world.

The second patch will keep separate copies of the token literal data in incremental mode, which will finally make swapping buffers in the middle perfectly safe.

The third patch will add the Buffer Grow callback and change Lexer to request for additional bytes when it reaches the end of the buffer. We are also considering extending or deriving the MemoryBuffer interface for greater flexibility if necessary. This patch will not affect the non-incremental mode, but is the main changes required to support incomplete inputs efficiently in interactive clang.

Risk

The Lexer is highly optimized and each change introduces a risk to break the optimizations. To control the risk, we are performing multiple benchmarks and inspecting the produced assembly code to detect regressions early. We will conduct a further performance analysis on bigger codebases such as LLVM itself.

What do you think? Are we missing something? We would love to hear your feedback!

Suhno and Vassil

davrec · July 31, 2022, 1:57pm

I like this approach. Whatever performance regressions might be encountered by calculating BufferStart+CurOffset in place of CurPtr, might be offset by changing getSourceLocation(const char *Loc) (which begins by calculating unsigned CharNo = Loc-BufferStart) to getSourceLocation(unsigned CharNo).

It may be advisable to move BufferStart, BufferEnd, TryExpandBuffer etc to a base class which only handles managing the buffer memory, to restrict access to BufferStart in future patches.

AaronBallman · August 1, 2022, 6:23pm

I think this sounds like a very reasonable thing to research. Is your plan is to attempt this research with an out-of-tree clang, iterate until you’ve got functionality and performance characteristics you’re happy with, and then propose appropriate patches in community?

This is my largest concern, to be honest. The lexer is incredibly performance sensitive and adding a callback to perhaps get extra bytes not in the source buffer sounds like it could be quite expensive. But also, even changing from pointers that are always pointing to the current location in the buffer to instead be pointer arithmetic is worrying. I would also expect caches to help out here, but I don’t count on it until I can measure something concretely.

I would strongly encourage you to make use of http://llvm-compile-time-tracker.com/ to measure compile time performance impact as part of this work, because performance numbers will almost certainly be asked for during code review unless they’re provided up front.

Note, one possible mitigation if you find that compile time performance is impacted in ways that can’t be helped would be to see what other vectorization opportunities exist, like what’s used here: https://github.com/llvm/llvm-project/blob/main/clang/lib/Lex/Lexer.cpp#L2713 and see if you can “make up” the difference in performance that way.

Xazax-hun · August 1, 2022, 7:13pm

Did you consider using indices/ranges for token literals as well?

I think sharing a list of benchmarks you want to run might be useful so people can jump in and suggest some more.

sunho · August 2, 2022, 1:04am

Thanks for feedback! It totally makes sense to make it a base class. We’ll have to think about how we are going to fit that into clang::SourceManager as well, since things like diagnostics will still rely on SourceLocation.

sunho · August 2, 2022, 1:27am

Thanks for feedbacks!

Note that the buffer grow callback will only be called at the eof of buffer – in non-incremental clang case, the callback should only be called once per each lexer. So, I expect it to be not that expensive, but we’ll have to see how things turn out as we profile and measure on.

The one that worries me the worst is actually pointer arithematics part. This introduces quite a large number of indirections. We are planning to measure the performance implications of pointer arithematic patch extensively.

http://llvm-compile-time-tracker.com/ looks really useful! I think I will have to ask for a permission in order to use if for changes outside upstream.

It makes sense to introduce another set of optimizations to make up the performace regression. We might consider something similar to the vectorization done in block comment lexing part.

sunho · August 2, 2022, 1:47am

We have considered it before. But, it turned out things goes hairly when previous buffer is freed. We will have to provide tokens a mean to fetch the latest buffer which is not really worth at all for non-incremental clang. We tried to solve this by introducing generic parameter “LexerPtr” that overrides dereferencing operators to Token and Lexer, but this required the code changes in almost every part of clang.

We were mostly planning to rely on chrome compilation benchmark and ROOT data analysis framework benchmark (ROOT uses clang in incremental fashion) http://llvm-compile-time-tracker.com/ suggested by AaronBallman is very enthralling and I think we’ll use it extensively. Please suggest more benchmark methods if there is any.

cor3ntin · August 2, 2022, 1:44pm

Profiling with large files (like preprocessed C++ sources for example) during the effort might also be worthwhile. It’s likely the perf issues are not where we think they are.

I would go so far as to suggest that hiding that behind an iterator interface is likely not to have impact in release mode. But again, that sort of question can only be answered by profiling.

Overall, i really like the outlined plan

sunho · February 9, 2023, 12:34am

I have published the patches that implement this RFC:
⚙ D143142 [clang][lex] Enable Lexer to grow its buffer (the main one)
⚙ D143144 [clang][lex] Add TryGrowLexerBuffer/SourceFileGrower
⚙ D143148 [clang-repl] Add basic multiline input support

Here’s also the link to my talk that provides more detailed information about the context: Lazy Lexer Presentation - YouTube

Please take a look.

Best,

Topic		Replies	Views
Incremental parsing/compilation question Clang Frontend	1	74	December 2, 2011
[RFC] A C++ pseudo parser for tooling Clang Frontend	50	4660	March 15, 2022
libelling - tokens/AST and the preprocessor Clang Frontend	2	118	March 3, 2015
[clang-repl] Implement autocompletion in clang-repl GSoC clang , gsoc2022 , gsoc2023	13	1401	March 27, 2023
RFC: Handle Execution Results in clang-repl Clang Frontend clang	12	1156	May 14, 2023