[RFC] TableGen: Include What You Use (IWYU)


I propose to refactor existing TableGen code so that each file explicitly includes the TableGen files it depends on. This makes using tablegen-lsp-server, TableGen’s Language Server Protocol (LSP) server, on TableGen code outside MLIR, and especially in LLVM backends, much easier.

Having a useful LSP for (generic) TableGen code improves productivity and fosters a friendlier environment for beginners learning, say, LLVM backend development. In the following RFC I will present my motivation, problem statements, my solution, and preliminary refactoring results, as well as the current limitations.

Background

The Language Server Protocol (LSP) is an open standard for building modular syntax highlighting, semantic analysis, and code completion for code editors and IDEs. Typically an IDE acts as an LSP client, sending requests like “go to the definition of this symbol” to an LSP server, which analyzes the code and returns the result (e.g. “the definition is at line 98, column 7”).

TableGen (TG) is a Domain Specific Language (DSL) used by LLVM and its subprojects. LLVM backends and MLIR are two of its biggest users: both have a substantial amount of TG code – over 300 and 44 KLOC, respectively, and growing. For both LLVM backends and MLIR, TG code is the backbone of development, so having an LSP for TG is essential to productivity.
Last year (2022), @River707 pushed an LSP server implementation for TG, tablegen-lsp-server, into the upstream MLIR codebase. It has served TG code in MLIR pretty well since then.

But here is the catch: tablegen-lsp-server basically only works for TG in MLIR. For TG code in LLVM backends, it cannot analyze the majority of the files. @MzFlrx also raised a concern about this earlier this year. The culprit is that tablegen-lsp-server only analyzes the TG file the user requested, plus whatever files it includes. However, TG files in LLVM backends have an exotic, if not weird, include strategy/hierarchy (which I’ll cover in the Problem Statements below), so it’s almost certain the LSP server will fail to even parse the file.

I tried to approach this issue by modifying tablegen-lsp-server, but it turned out to be non-trivial. More importantly, the include pattern of LLVM backends’ TG files just feels…wrong, and I don’t think we should create a special case (in tablegen-lsp-server) for something we don’t really want to encourage going forward. Thus, I would like to solve this problem by changing how existing TG code includes the files it needs, such that tablegen-lsp-server can analyze individual TG files in a standalone fashion.

Problem Statements

As I mentioned before, the root cause of this problem is how many of the existing TG files include their dependencies. TableGen has the same file inclusion model as C/C++: the included file’s content is effectively copied in place of the include statement. Normally, when writing C/C++, we structure the include hierarchy like this:
*(figure: TableGen Include - Ideal)*
In which we only reference symbols from one of the included files (and their ancestors).
However, in LLVM backends, TG files are usually organized like this:
*(figure: TableGen Include - Reality)*
In this case, individual files don’t reference symbols by including the required files; instead, they reference them “through” the root file, File1. This pattern is, IMHO, error-prone, as the include statements in the root file (File1) become sensitive to their order (e.g. if we swap include "File4" and include "File5" in File1, it will fail to parse); it’s also hard for developers to find the definition of a symbol, as there are no include statements in File3, 4, and 5.
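To make this concrete, here is a minimal sketch of the “reality” pattern, with made-up file names standing in for File1, File3, and File5:

```
// Foo.td - the root file (File1). The include order below is
// load-bearing: FooInstrInfo.td uses the class FooInst from
// FooInstrFormats.td without including it, so swapping these
// two lines makes the parse fail.
include "FooInstrFormats.td"
include "FooInstrInfo.td"

// FooInstrFormats.td (File5)
class FooInst<string mnemonic> {
  string Mnemonic = mnemonic;
}

// FooInstrInfo.td (File3) - contains no include statements of its
// own; it silently relies on the root file having already pulled
// in FooInstrFormats.td.
def ADD : FooInst<"add">;
```

Opening FooInstrInfo.td on its own, the LSP server has no way to discover where FooInst comes from.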

Note that tablegen-lsp-server uses a file, tablegen_compile_commands.yml, to specify file-level dependencies. That information mostly coincides with the include-directory flags (i.e. -I) supplied to the llvm-tblgen command. However, it still doesn’t fully solve our problem, which is primarily caused by the lack of proper include statements.
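For reference, each entry in that file maps a .td file to its include directories. A sketch of the format, with illustrative paths (this mirrors what MLIR’s CMake emits; treat the exact layout as an assumption):

```yaml
--- !FileInfo:
  filepath: "/src/llvm-project/mlir/include/mlir/IR/OpBase.td"
  includes: "/src/llvm-project/mlir/include;/build/tools/mlir/include"
```

Note that this only tells the server where to search for included files; it does not say which files a given .td actually depends on.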

Solution

To insert proper include statements and header guards (i.e. #ifndef TEST_TD / #define TEST_TD … #endif), I created a script to refactor TG files in LLVM backends at scale.
Regarding what files to include, I tapped into the TableGen parser to print out the origin files of the symbols used in the current TG file. This dependency information comes from llvm::Record::getReferencesLocs.
As the title suggests, I follow the include-what-you-use (IWYU) principle. I don’t try to “coalesce” transitive includes (i.e. if both A -> B -> C and A -> C exist, the latter include statement could be removed), as that is actually a difficult problem (at scale) and somewhat discouraged by the IWYU principles, since it might catch developers off guard when the upstream include edge (i.e. B -> C) is removed.
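Applied to a hypothetical backend file (names are illustrative; the guard style follows the convention already used by MLIR’s .td files), the refactored output looks like:

```
// FooInstrInfo.td after refactoring: guarded against double
// inclusion, and explicitly including what it uses.
#ifndef FOO_INSTR_INFO_TD
#define FOO_INSTR_INFO_TD

// IWYU: the class FooInst is defined in FooInstrFormats.td.
include "FooInstrFormats.td"

def ADD : FooInst<"add">;

#endif // FOO_INSTR_INFO_TD
```

With the guard derived from the file name, includes become idempotent, so the file can be included both by the root file and by any sibling that needs it, in any order.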

I’ve put the prototype here. The refactoring script is here, and the aforementioned dependency analysis logic is here.

Preliminary Refactoring Results

I’ve refactored 4 mainstream targets + a smaller experimental target. Here are the links to their refactoring changes:

All of these targets’ codegen, MC, and disassembler tests passed after refactoring. More importantly, tablegen-lsp-server can now correctly analyze every TG file in these (refactored) backends.

Note that a fix for a TableGen lexer bug, ⚙ D159236 [TableGen] Fix incorrect handling of nested `#ifndef` directives, is required (it would be really helpful if anyone could review the patch!).

Current Limitations

The biggest problem so far is parsing speed: regressions of up to 7% in parsing time were observed on the five targets listed above.

Here is the parsing time before (i.e. baseline):

| Target | Mean [s] | Min [s] | Max [s] |
|:---|---:|---:|---:|
| X86 | 1.008 ± 0.010 | 0.984 | 1.018 |
| AArch64 | 0.563 ± 0.002 | 0.559 | 0.566 |
| ARM | 0.431 ± 0.002 | 0.428 | 0.435 |
| RISCV | 0.928 ± 0.006 | 0.921 | 0.939 |
| M68k | 0.237 ± 0.001 | 0.236 | 0.238 |

Here is the parsing time after the refactoring:

| Target | Mean [s] | Min [s] | Max [s] | Relative to Baseline |
|:---|---:|---:|---:|---:|
| X86 | 1.061 ± 0.016 | 1.039 | 1.091 | +5.28% |
| AArch64 | 0.606 ± 0.012 | 0.595 | 0.624 | +7.64% |
| ARM | 0.439 ± 0.004 | 0.435 | 0.448 | +1.86% |
| RISCV | 0.973 ± 0.005 | 0.967 | 0.981 | +4.85% |
| M68k | 0.239 ± 0.001 | 0.238 | 0.241 | +0.84% |

Further investigation showed that the regressions are caused by the extra time spent on…skipping code. Header guards prevent a file from being included twice by skipping its entire content on any later occurrence in the include stack – but this skipping is not zero-cost: the preprocessor still scans through the entire file looking for other preprocessor directives like #else and #endif. In our case, the time spent scanning skipped lines from duplicate includes adds up and hinders the overall parsing performance.

Take AArch64 as an example: one of the most prominent files is AArch64InstrInfo.td, which has over 8000 lines and gets skipped 32 times. A potential solution would be splitting AArch64InstrInfo.td into smaller parts for some of its users.

Any comments are appreciated!
Also cc @mehdi_amini, who might be interested in this.


You could add the equivalent of #pragma once, so a file is only included once when it contains that directive.


I’d thought about that before, and IMO it’s a pretty neat solution, but I would like to gauge everyone’s feelings about the cost of using header guards (i.e. the parsing speed regressions) first, before adding such a specific feature.

One other option is to add special handling for header guards: detect a file as header-guarded when its entire content is wrapped in one big #ifndef. Then, the next time the file is included, check whether that symbol is already defined; if so, you know the include does nothing, so you can skip parsing the included file again.


This seems like a lot of boilerplate for something that isn’t hugely useful outside of your one specific use case.

I share the concerns as well.

I created ccls, contributed to some language servers, used to maintain emacs-lsp, and investigated many language clients.

For C/C++, I do like self-contained header files, as we have one specific main file to parse to get code completion and diagnostics. Projects with self-contained header files are much easier to deal with, and our language server has to be generic.

For TableGen files, our use pattern is very specific and I think we can overfit our tool.
For example, if we are reading X86SchedAtom.td, our tool can figure out that the main file is X86.td and parse X86.td instead. For large C++ projects, this gives us code completion from other files, which may be undesired. However, for TableGen, I assume that such behavior (getting completion results from X86SchedPredicates.td when reading X86SchedAtom.td) is fine.

The refactoring script is just a starting point; ideally, people can manually simplify the list of files to include. Another way would be teaching the refactoring script to optimize the number of files to include.

MLIR’s TableGen files are self-contained, which is quite different from LLVM backends’ usage pattern. Thus I would argue that TableGen’s LSP should be generic as well. Personally, I don’t think supporting a specific pattern (i.e. LLVM backends’ current pattern) that is by itself less than ideal is a good idea for the LSP server.


C and C++ IDEs have to deal with this issue as well, since C and C++ headers are not necessarily self-contained. They do this by having global information about the project, knowing which files a header is included in, and then picking a .cpp file that includes it to use as context.

MLIR’s CMake already appends compile commands to tablegen_compile_commands.yml for its files. LLVM could do the same, and then the LSP server could index the commands there and pick a context in which these files are actually being compiled.

I’d argue that’d be the proper, but much, much more involved, solution to this problem, so I’m not advocating for it to be implemented. I just wanted to mention a more definitive solution, with precedent, that would not require any changes to the TD files.

I would not qualify this as a “proper” solution, but as a crutch for a badly structured project. This makes sense for IDEs that aim to target any kind of project, but I don’t think it justifies itself in a more constrained environment (i.e. TableGen’s LSP and infra), where “making files self-contained” seems like more of a “proper solution” to me.

This is also a very “C/C++” mentality; if you step back and think about how some other languages work: in many cases there aren’t preprocessor-style “textual includes” but proper imports!

Pretty impressive work!

That’s not ideal; this is actually a problem with IWYU, and in my experience, to make IWYU useful at scale you need to add annotations to control this behavior: for example, being able to annotate in B that C is “exported” by B, so that A can include only B and not C (which may be a private header of the B library).

That seems ideal to me! Easier to put in place, more explicit about the intent, and eliminates the parsing overhead!


The TableGen file dependency patterns are indeed a mess, and it would be better to fix them to follow IWYU. However, since TableGen currently supports textual include statements, this is arguably a bug in tablegen-lsp-server. As others have pointed out, C++ language servers deal with this just fine using global project information.

There is no guarantee people will use these optional include guards. TableGen files could still break tablegen-lsp-server again in the future. Instead, why not make “don’t include the same file twice” the normal behaviour of TableGen?

I would prefer a solution that made tablegen-lsp-server work with all valid tablegen, i.e. either:

  • Fix tablegen-lsp-server to be able to handle the existing include statements.
  • Change the behaviour of tablegen’s include, for example so that the current file can only reference included symbols (and ignore duplicate re-definitions). Make IWYU a requirement of the language, not a suggestion.

The second approach is better because it both improves the file structure and solves the tablegen-lsp-server problem. Both would avoid the need for header guards, which are an unpleasant necessity in C/C++ that we don’t need to copy.