TL;DR
I propose to refactor existing TableGen code to explicitly include the TableGen files it needs. This makes it much easier to use `tablegen-lsp-server`, TableGen's Language Server Protocol (LSP) server, on TableGen code outside MLIR, especially in LLVM backends.
Having a useful LSP server for (generic) TableGen code improves productivity and fosters a friendlier environment for beginners learning, say, LLVM backend development. In this RFC I present my motivation, problem statements, proposed solution, and preliminary refactoring results, as well as the current limitations.
Motivation
Language Server Protocol (LSP) is an open standard for building modular syntax highlighting, semantic analysis, and code completion solutions for code editors and IDEs. Typically, an IDE acts as an LSP client, sending requests like "go to the definition of this symbol" to an LSP server, which analyzes the code and returns the result (e.g. "the definition is at line 98, column 7").
TableGen (TG) is a Domain Specific Language (DSL) used across LLVM and its subprojects. LLVM backends and MLIR are two of its biggest users, with substantial and growing bodies of TG code: over 300 KLOC and 44 KLOC, respectively. For both LLVM backends and MLIR, TG code is the backbone of development, so having an LSP server for TG is critical to productivity.
Last year (2022), @River707 upstreamed an LSP server implementation for TG, `tablegen-lsp-server`, into the MLIR codebase. It has served TG code in MLIR well since then.
But here is the catch: `tablegen-lsp-server` essentially only works for TG code in MLIR. For TG code in LLVM backends, it fails to analyze the majority of the files. @MzFlrx also raised a concern about this earlier this year. The culprit is that `tablegen-lsp-server` only analyzes the TG file the user requested, plus whatever files it includes. However, TG files in LLVM backends have an exotic, if not weird, include strategy/hierarchy (which I'll cover in the Problem Statements below), so it's almost certain the LSP server will fail to even parse the file.
I tried to approach this issue by modifying `tablegen-lsp-server`, but it turned out to be non-trivial. More importantly, the way LLVM backends' TG files include each other just feels wrong, and I don't think we should create a special case (in `tablegen-lsp-server`) for a pattern we don't want to encourage going forward. Thus, I would like to solve this problem by changing how existing TG code includes the files it needs, such that `tablegen-lsp-server` can analyze individual TG files in a standalone fashion.
Problem Statements
As I mentioned before, the root cause of this problem is how many of the existing TG files include their dependencies. TableGen has the same file inclusion model as C/C++: the `include` statement is effectively replaced by the content of the included file. Normally, when writing C/C++, we structure the include hierarchy like this:
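For illustration, here is what that conventional pattern might look like in TableGen (file, class, and record names are hypothetical):

```tablegen
// Util.td
class UtilClass {
  string Name = "util";
}

// Feature.td: references UtilClass, so it includes Util.td itself.
include "Util.td"
def Feature : UtilClass;

// Top.td: only needs Feature.td; Util.td arrives transitively as its ancestor.
include "Feature.td"
def Top : UtilClass;
```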
Here, each file references symbols only from the files it includes (and their ancestors).
However, in LLVM backends, TG files are usually organized like this:
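To make this concrete, a hypothetical sketch: only the root file carries `include` statements, while the leaf files silently rely on whatever the earlier includes pulled in:

```tablegen
// File1, the root: the only file with include statements.
include "File2"
include "File3"
include "File4"  // defines class Foo
include "File5"  // references Foo without including File4 itself

// Inside File5 one might find:
//   def Bar : Foo;
// This parses only because File1 happens to include File4 first;
// File5 on its own cannot be parsed at all.
```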
In this case, individual files don't reference symbols by including the required files; instead, they reference them "through" the root file, `File1`. This pattern, IMHO, is error-prone: the `include` statements in the root file (`File1`) are now sensitive to their order (e.g. if we swap `include "File4"` and `include "File5"` in `File1`, it will fail to parse). It is also hard for developers to find the definition of a symbol, as there is no `include` statement in `File3`, `File4`, or `File5`.
Note that `tablegen-lsp-server` uses a file, `tablegen_compile_commands.yml`, to specify file-level dependencies. That information mostly coincides with the include-directory flags (i.e. `-I`) supplied to the `llvm-tblgen` command. However, it still doesn't fully solve our problem, which is primarily caused by the lack of proper `include` statements.
Solution
To insert proper `include` statements and header guards (i.e. `#ifndef TEST_TD` / `#define TEST_TD`), I created a script to refactor TG files in LLVM backends at scale.
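For example, a refactored file might look like the following sketch (the file, guard, and feature names are hypothetical; `SubtargetFeature` is the class defined in `llvm/Target/Target.td`):

```tablegen
#ifndef MYTARGET_FEATURES_TD
#define MYTARGET_FEATURES_TD

// Explicitly include the file that defines SubtargetFeature.
include "llvm/Target/Target.td"

def FeatureFast : SubtargetFeature<"fast", "HasFast", "true",
                                   "Enable the fast unit">;

#endif
```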
Regarding which files to include, I tapped into the TableGen parser to print the origin files of the symbols used in the current TG file. This dependency information comes from `llvm::Record::getReferencesLocs`.
As the title suggests, I follow the include-what-you-use (IWYU) principle. I don't try to "coalesce" transitive includes (i.e. if both `A -> B -> C` and `A -> C` exist, the latter include statement could be removed), as that is actually a difficult problem (at scale) and somewhat discouraged by the IWYU principle, since it might catch developers off guard when the upstream include edge (i.e. `A -> B`) is removed.
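In other words, given hypothetical files where `A.td` uses a symbol defined in `C.td`, the script keeps a direct include in `A.td` even though `B.td` already provides it transitively (header guards make the duplicate inclusion harmless):

```tablegen
// C.td
class Base {
  bit IsBase = true;
}

// B.td
include "C.td"
def FromB : Base;

// A.td: references Base directly, so it includes C.td itself
// rather than relying on B.td's transitive include.
include "B.td"
include "C.td"
def FromA : Base;
```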
I’ve put the prototype here. The refactoring script is here, and the aforementioned dependency analysis logic is here.
Preliminary Refactoring Results
I’ve refactored 4 mainstream targets + a smaller experimental target. Here are the links to their refactoring changes:
All of these targets’ codegen, MC, and disassembler tests passed after the refactoring. More importantly, `tablegen-lsp-server` can now correctly analyze every TG file in these (refactored) backends.
Note that a fix for a TableGen lexer bug, ⚙ D159236 [TableGen] Fix incorrect handling of nested `#ifndef` directives, is required (it would be really helpful if anyone could review the patch!).
Limitations
The biggest problem so far is parsing speed: regressions of up to 7.64% were observed on the five targets listed above.
Here is the parsing time before (i.e. baseline):
| Target | Mean [s] | Min [s] | Max [s] |
|---|---|---|---|
| X86 | 1.008 ± 0.010 | 0.984 | 1.018 |
| AArch64 | 0.563 ± 0.002 | 0.559 | 0.566 |
| ARM | 0.431 ± 0.002 | 0.428 | 0.435 |
| RISCV | 0.928 ± 0.006 | 0.921 | 0.939 |
| M68k | 0.237 ± 0.001 | 0.236 | 0.238 |
Here is the parsing time after the refactoring:
| Target | Mean [s] | Min [s] | Max [s] | Relative to Baseline |
|---|---|---|---|---|
| X86 | 1.061 ± 0.016 | 1.039 | 1.091 | +5.28% |
| AArch64 | 0.606 ± 0.012 | 0.595 | 0.624 | +7.64% |
| ARM | 0.439 ± 0.004 | 0.435 | 0.448 | +1.86% |
| RISCV | 0.973 ± 0.005 | 0.967 | 0.981 | +4.85% |
| M68k | 0.239 ± 0.001 | 0.238 | 0.241 | +0.84% |
Further investigation showed that the regressions are caused by the extra time spent on… skipping code. Header guards prevent a file from being included twice by skipping the entire file on its second occurrence in the include stack, but this skipping is not zero-cost: the preprocessor still scans through the entire file looking for other preprocessor directives like `#else` and `#endif`. In our case, the time spent scanning skipped lines from duplicated include files adds up and drags down overall parsing performance.
Take AArch64 as an example: one of the outstanding files is `AArch64InstrInfo.td`, which has over 8000 lines and gets skipped 32 times. A potential solution is to split `AArch64InstrInfo.td` into smaller parts for some of its users.
Any comments are appreciated!
Also cc @mehdi_amini, who might be interested in this.