tl;dr; Q: How can we support incremental user input where code available so far might be incomplete? A: By teaching the lexer infrastructure in Clang to optionally work with additional bytes when string literal, #ifdef directive, or brace block is cut-off just before reaching EOF.
Motivation
As per RFC of Moving (parts of) the Cling REPL in Clang we outlined several areas where Interactive C++ is useful and has seen a growing community. One of the key challenges in usability of a C++ interpreter such as Cling (and clang-repl) is understanding when the user intended to type in more input before compilation. For example,
[clang-repl] void f() { // #1
[clang-repl cont ] } // #2
Here line #1 is incomplete input and the user intent is to be completed in the next couple of input lines. This example can be relatively well handled by counting the imbalance in the punctuators (implementation experience in Cling can be seen here). Often people cut and paste in code which has a different coding style and the punctuator imbalance cannot be used as a reliable anchor. In addition, there are more subtle cases such as preprocessor directives (#ifdef
) which require full lexing steps. (this has been raised in cling community before) For instance,
[clang-repl] void f() {
[clang-repl] #ifdef __UNDEFINED_SYMBOL
[clang-repl] {
[clang-repl] #endif
[clang-repl] }
requires an ability to expand macro directive in order to properly decide if a function is indeed complete. A simple brace counting fails in this case.
This RFC intends to explore a more advanced input continuation mechanism based on the extension of Lexer and Preprocessor.
Implementation
The overall idea is to add a callback in the clang::Lexer
which might provide extra characters to lex. The intent of this callback is to continue on the lexing and the preprocessing process when it reaches the end of the currently available buffer.
Alternatives considered
We have considered punctuator counting but it lacks reliability and generality. We have considered lexing the entire buffer every time a line is inserted (D129265) The biggest main issue of this approach is that it’s O(N^2) time complexity where N is the number of input lines. Another issue of this approach is that the compiler has taken the error path there, and the token stream might be augmented with recovery tokens which makes it fragile. Additionally, it can’t properly support string literal cut-off cases.
Proposed Implementation
The only feasible solution we see natural for the Clang pipeline is implementing a hook early in the clang::Lexer. There are three challenges:
- Buffer is assumed to be fixed – each part of code handles eof in their own ways and doesn’t have a mechanism to request to grow the buffer.
- Every part of code uses direct pointer (
CurPtr
,BufferPtr
) to bytes not offset to buffer – once buffer is swapped every pointer needs to be updated including all trivial local variables. - Extension of issue 1: token points to the same fixed allocated buffer. Once we change the buffer, almost all tokens generated before get invalidated but they are cached in the upper level of the lexer.
- Code has been optimized very heavily, we should introduce as little overhead as possible.
Here is a detailed proposed implementation that handles all of these issues while affecting the non-incremental clang world minimally.
- We change the direct buffer pointers with buffer offset (
CurPtr
→CurOffset
), and code will access the memory byBufferStart[CurOffset]
. (Part 1)- The changes are concentrated in
Lex.h
andLex.cpp
, so it’s transparent to the outside world. - Code is very clear. Since we will not ever allow the change in consumed bytes (we only append the bytes) offset will be constant after the buffer is swapped.
- Introduces a minor indirection to
BufferStart
. (BufferStart
is very likely to be within L1 cache though)
- The changes are concentrated in
- To solve the token literal data invalidation issue, we will simply keep separate copies of literal data in incremental mode. (Part 2)
- Finally, we will call the
TryExpandBuffer
function when Lexer reaches the end of the buffer. In non-incremental mode,TryExpandBuffer
will simply return false which will trigger the traditional “eof token detected” diagnostics. In an incremental world, this will return true if it received additional bytes by the user callback function and continue on the Lexing. (Part 3)
Another ways were considered as well which we dropped eventually because of suboptimal code readability/amount of code changes:
- Maintain
BufferPtrsToUpdate
in order to solve the buffer pointer invalidation issues, which required a smallest code change but made code a bit awkward. - Make
Token
,Lexer
,Preprocessor
,Parser
generic and receive “LexerPtr
” type parameter which will be simplyconst char *
in a non-incremental world but will access the buffer by offset in incremental mode.
Implementation Plan
We will separate off our implementation into three patches.
The first patch will simply change Lexer to use offset to buffer instead of direct buffer pointer. We will make sure no breakage or performance regression is discovered here. This is the only part where our additions affect the non-incremental clang world.
The second patch will keep separate copies of the token literal data in incremental mode, which will finally make swapping buffers in the middle perfectly safe.
The third patch will add the Buffer Grow callback and change Lexer to request for additional bytes when it reaches the end of the buffer. We are also considering extending or deriving the MemoryBuffer
interface for greater flexibility if necessary. This patch will not affect the non-incremental mode, but is the main changes required to support incomplete inputs efficiently in interactive clang.
Risk
The Lexer is highly optimized and each change introduces a risk to break the optimizations. To control the risk, we are performing multiple benchmarks and inspecting the produced assembly code to detect regressions early. We will conduct a further performance analysis on bigger codebases such as LLVM itself.
What do you think? Are we missing something? We would love to hear your feedback!
Suhno and Vassil