RFC: prototype of clang-scan-deps, faster dependency scanning tool for explicit modules and clangd

hyp · October 17, 2018, 1:53am

Hi,

Bruno (CCed), Duncan (CCed) and I have been exploring whether we can migrate some of our clients to explicit modules. As part of this work, Duncan and I developed a new prototype dependency scanning service tool (clang-scan-deps) that computes the set of file dependencies for a particular compiler invocation, using some optimizations that are outlined below. This tool makes non-modular dependency scanning up to 10 times faster for particular workloads (e.g. the llc target, 1542 C++ files) on one of our machines, compared to parallel invocations of clang with -Eonly. We are still in the early stages of proper modules support, but our initial crude prototype can get up to a 4x speedup when run on the first 1000 files from clang’s compilation database for a build of LLVM with modules turned on.

We still run the full Clang preprocessor. Here’s what we do to reduce its workload:

Minimize sources by stripping away unused tokens. We keep only the interesting PP directives (#define, #if, #include, etc.), i.e. those that might impact the set of dependencies.
Assume the filesystem is immutable for one run of the service, and cache the files and their minimized contents in memory in a global cache.
Skip over excluded preprocessor ranges by bumping up the buffer pointer in the lexer instead of lexing the skipped tokens.
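
As a rough illustration of the first two points, here is a minimal Python sketch of line-based source minimization plus an in-memory cache. It is not the prototype's implementation: the names `minimize_source`, `MinimizedFileCache` and `INTERESTING` are hypothetical, the directive set is an assumption, and the sketch ignores comments, raw string literals and other cases a real lexer would handle.

```python
import re

# Directive names assumed to matter for dependency discovery.
# This set is illustrative, not the prototype's actual list.
INTERESTING = {
    "define", "undef", "if", "ifdef", "ifndef", "elif", "else", "endif",
    "include", "include_next", "import", "pragma",
}

DIRECTIVE_RE = re.compile(r"^\s*#\s*([A-Za-z_]\w*)")


def minimize_source(text):
    """Keep only the preprocessor directive lines named in INTERESTING."""
    kept = []
    lines = iter(text.splitlines())
    for line in lines:
        match = DIRECTIVE_RE.match(line)
        if not match or match.group(1) not in INTERESTING:
            continue
        # Join backslash-continued lines so a multi-line #define stays whole.
        while line.endswith("\\"):
            line = line[:-1] + next(lines, "")
        kept.append(line.strip())
    return "\n".join(kept)


class MinimizedFileCache:
    """Reads and minimizes each path at most once.

    Assumes the filesystem does not change during one run of the service.
    """

    def __init__(self, read_file):
        self._read_file = read_file  # callable: path -> file contents
        self._cache = {}
        self.reads = 0               # number of real reads, for inspection

    def get(self, path):
        if path not in self._cache:
            self.reads += 1
            self._cache[path] = minimize_source(self._read_file(path))
        return self._cache[path]
```

A header included by many translation units is read and minimized once, and every later `get` for the same path returns the cached minimized text.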

We intend to upstream this service in the upcoming months. We also would like to integrate this service into Clangd as part of our migration to Clangd to help us determine a good compilation command for a header file from a set of known compilation invocations.

I posted a very rough WIP patch on Phabricator (https://reviews.llvm.org/D53354). It’s based on LLVM checkout r343343. Please take a look if you’re interested.

Duncan, Bruno and I will be at the LLVM dev meeting. We are interested in discussing this prototype and collecting feedback from anyone who might be interested in this work.

Thanks,

Alex

dblaikie · November 6, 2018, 6:22pm

Thanks for sending this out!

Yeah, I’m super interested in how (future standard) C++ modules will interact with build systems, as it’s unlikely to be feasible to use an implicit compilation model (in part because of the code generation/linkage requirements - you could put everything from a C++ Modules definition in comdats, etc as is done for headers today (rather than in separate object files), but not really how it’s meant to work).

All models boil down to something like this: the build system has some explicit knowledge (through library dependencies within a project) and has to do some discovery before executing any compilation steps. The discovery serves two purposes. One is to find external dependencies: the standard library (if/once modularized and used as such) and other external libraries written using modules. The other is to reduce internal dependencies: not all code in one library depends on all the libraries that library depends on, so by discovering the specific modular imports used in a given module, that module may be able to be built sooner (when only some of its library’s dependencies have been built, because it only needs that subset). Ideally, the build system then passes around the compilation inputs/outputs rather than relying on the compiler to discover them itself in a cache directory or the like.
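
To make the "built sooner" point concrete, here is a small Python sketch that turns discovered per-module imports into build waves. The function name `build_waves` and the module names in the usage note are hypothetical; the sketch only assumes that a scanner has already produced a mapping from each module to the modules it imports.

```python
from graphlib import TopologicalSorter


def build_waves(imports):
    """Group modules into waves; all modules in one wave can build in parallel.

    `imports` maps a module name to the names of the modules it imports.
    """
    sorter = TopologicalSorter(imports)
    sorter.prepare()  # raises graphlib.CycleError on cyclic imports
    waves = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # everything whose imports are built
        waves.append(ready)
        sorter.done(*ready)
    return waves
```

With `{"core.log": [], "net.http": ["core.log"], "app.cli": ["core.log"], "app.main": ["core.log", "net.http"]}`, `app.cli` lands in the same wave as `net.http`, even though a library-level dependency from `app` on `net` would have made it wait.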

You mentioned a few performance metrics:
Up to 10x speedup in non-modular dependency scanning - what do you mean by non-modular dependency scanning? (what’s the non-modular part - in contrast to?)
4x when run on the first 1000 files in Clang’s compilation database, compared to clang -Eonly - so this is running the whole tool, including generating the trimmed preprocessed files, and then reading those to discover the header module dependencies, compared to running -Eonly, then scanning those files? & the output is currently in what form? .d-like files?

You mention relying on the compilation database for discovering the files to run over - is this the long term goal/design, or a current stepping stone? I was about to say that seems circular (thinking that the compiler/compilation phase generates the compilation database) but then realized/remembered that it’s the build system that generates that, not the compiler, so you can have/use/run over the compilation database before compilation has begun. Sounds good. So the build system would have to have a phase that runs after generating the compilation database that runs this tool, then adds the module compilations produced by this tool to the list of commands it will execute (& probably also adds them back into the compilation database, too, really).

So, as you mentioned (maybe in the phab review), the format of the output of this tool is still unknown, but the input is currently a compilation database (the classic JSON one, I assume, but if the tool uses the compilation database access APIs, other sources implemented behind that API could be used) - cool cool.
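
For readers unfamiliar with the classic JSON form, here is a minimal Python sketch of reading one. It relies only on the documented entry fields (`directory`, `file`, and either `arguments` or `command`); the function name `load_compile_commands` is hypothetical, and the sketch assumes POSIX-style paths and shell quoting.

```python
import json
import posixpath
import shlex


def load_compile_commands(text):
    """Parse a JSON compilation database into normalized entries."""
    entries = []
    for entry in json.loads(text):
        # An entry carries either an argv list or a single shell command.
        if "arguments" in entry:
            argv = list(entry["arguments"])
        else:
            argv = shlex.split(entry["command"])
        # "file" may be relative to "directory"; make it absolute.
        path = posixpath.normpath(
            posixpath.join(entry["directory"], entry["file"]))
        entries.append(
            {"directory": entry["directory"], "file": path, "argv": argv})
    return entries
```

A build system can write this file before any compilation starts, which is why scanning it is not circular.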

Thanks again!

Dave

Eric_Liu · November 8, 2018, 10:13am

+clangd-dev@lists.llvm.org

Whisperity · November 9, 2018, 9:38am

I’m definitely interested. Currently, I’m working on admittedly small tools that help discover module-tight coupling… sort of like a bad code metric, but also a highlight of code quality and organisation… issues. Although of course, for now I’m working at the include level and am trying to assign (or refine an assignment of) implementation to modules. (I think it’s not a big revelation that directory-based organisations are rarely like Java packages…)

“Minimize sources by stripping away unused tokens.”
You mentioned unused “>>tokens<<”. My one question is: what is meant by this? What constitutes an unused >token<?

;; Whisperity

Alex L via cfe-dev <cfe-dev@lists.llvm.org> wrote (on Wed, Oct 17, 2018, 3:53):

Sam_McCall · November 13, 2018, 12:41pm

Hi Alex,

Sorry for the late reply here.
This seems like a very useful tool, that (at least in the current implementation) comes at the cost of adding complexity to various layers of clang.
Is the 10x speedup vs invoking the preprocessor programmatically? That indeed seems like a lot. What is the performance target (e.g. is that speedup just nice-to-have, is it sufficient, are further wins needed)? Did you measure what running the preprocessor directly buys you? (Your prototype runs -Eonly by invoking clang in a subshell through the driver).

The design seems to leave headroom in a couple of dimensions:

performance: avoiding reprocessing files (in any capacity) could in principle be a pretty big win, as the average number of transitive includers scales somewhat with codebase size
complexity: the full power of the lexer/preprocessor system is not needed for the vast majority of include-scanning cases. (This also impacts performance).

How important is it that scanning is precise w.r.t preprocessor state (e.g. #ifdef’d out headers are not considered dependencies), and accurate in edge cases (#include SOME_MACRO)? Do you have any measurements on how often these scenarios occur?
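
To show where a scanner without preprocessor state goes wrong, here is a minimal Python sketch of a purely textual include scan. It is only an illustration of the trade-off being asked about, not a proposal; `naive_scan` is a hypothetical name.

```python
import re

# Matches #include "x", #include <x>, or #include SOME_MACRO at line start.
INCLUDE_RE = re.compile(
    r'^\s*#\s*include\b\s*(?:"([^"]+)"|<([^>]+)>|(\w+))', re.MULTILINE)


def naive_scan(text):
    """Return (literal includes, macro names used as include targets).

    Conditionals are ignored, so headers inside inactive #if/#ifdef blocks
    are reported too, and computed includes cannot be resolved to a file.
    """
    literal, computed = [], []
    for quoted, angled, macro in INCLUDE_RE.findall(text):
        if macro:
            computed.append(macro)
        else:
            literal.append(quoted or angled)
    return literal, computed
```

On a file that wraps `#include "win_only.h"` in `#ifdef _WIN32`, the scan reports `win_only.h` on every platform, and for `#include CONFIG_HEADER` it can only report the macro name, not the header it expands to.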

> We also would like to integrate this service into Clangd as part of our migration to Clangd to help us determine a good compilation command for a header file from a set of known compilation invocations.
We introduced a heuristic approach to this (simply examining filenames to find a decent match) into Tooling as a CompilationDatabase. It would be great if it’s possible to do something similar with dep scanning, so that it’s reusable by other tools and layering is preserved.
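
A filename heuristic of this general kind can be sketched in a few lines of Python. This is not the Tooling implementation; `pick_command_for_header` and its scoring rule (same file stem first, then number of matching trailing directory components) are assumptions chosen for illustration, and entries are assumed to be dicts with a POSIX-style `file` key.

```python
import posixpath


def pick_command_for_header(header, commands):
    """Return the entry whose source file best matches `header` by name."""
    header_dir, header_name = posixpath.split(header)
    header_stem = posixpath.splitext(header_name)[0]

    def score(entry):
        src_dir, src_name = posixpath.split(entry["file"])
        same_stem = posixpath.splitext(src_name)[0] == header_stem
        # Count matching directory components, starting from the innermost.
        shared = 0
        for a, b in zip(reversed(header_dir.split("/")),
                        reversed(src_dir.split("/"))):
            if a != b:
                break
            shared += 1
        return (same_stem, shared)

    # max() keeps the first best entry; None when there are no commands.
    return max(commands, key=score, default=None)
```

A dependency scanner could replace the guess with evidence: pick a command whose translation unit is known to include the header.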

hyp · December 8, 2018, 1:10am

Sorry for the late replies; I only recently got back to working on this. I posted a patch for the source minimization (https://reviews.llvm.org/D55463) as part of starting our upstreaming work.

+Michael (who will be working on the explicit modules support).

> You mentioned a few performance metrics
> Up to 10x speedup in non-modular dependency scanning - what do you mean by non-modular dependency scanning? (what’s the non-modular part - in contrast to?)

By non-modular dependency scanning I mean getting the dependency list of a regular compilation, so figuring out all of the headers included for a compilation that doesn’t use -fmodules.

> 4x when run on the first 1000 files in Clang’s compilation database, compared to clang -Eonly - so this is running the whole tool, including generating the trimmed preprocessed files, and then reading those to discover the header module dependencies, compared to running -Eonly, then scanning those files? & the output is currently in what form? .d-like files?

Yes. The 4x number compares two times: (a) the time the tool takes to preprocess all the files with source minimization, build all of the implicit modules, and then create the list of dependencies for all compilations in a compilation database; and (b) the time it takes to run parallel clang -Eonly invocations that use the regular implicit modules path for dependency discovery.

The output is either printed out by the tool or saved to .d files. Ultimately, an explicit module builder will probably consume it in a different way.
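
For reference, a .d file is a Make-style rule that lists a target followed by its dependencies. Here is a minimal Python sketch of producing one; `format_depfile` is a hypothetical helper, and it escapes only spaces, which is a simplification of what real tools do.

```python
def format_depfile(target, dependencies):
    """Format a Make-style dependency rule, one dependency per line."""

    def escape(path):
        # Simplification: only spaces are escaped here.
        return path.replace(" ", "\\ ")

    lines = [escape(target) + ":"]
    lines += ["  " + escape(dep) for dep in dependencies]
    # A trailing backslash continues the rule onto the next line.
    return " \\\n".join(lines) + "\n"
```

Calling `format_depfile("main.o", ["main.cpp", "a.h"])` yields a three-line rule with `main.o:` on the first line and each dependency on its own continuation line.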

> You mention relying on the compilation database for discovering the files to run over - is this the long term goal/design, or a current stepping stone? I was about to say that seems circular (thinking that the compiler/compilation phase generates the compilation database) but then realized/remembered that it’s the build system that generates that, not the compiler, so you can have/use/run over the compilation database before compilation has begun. Sounds good. So the build system would have to have a phase that runs after generating the compilation database that runs this tool, then adds the module compilations produced by this tool to the list of commands it will execute (& probably also adds them back into the compilation database, too, really).

We rely on the CDB solely for the dependency discovery to simplify testing and integration with existing project builds. It’s definitely not the final goal of what an integration with a build system should look like, but it can be a way for the build system to feed in the compilations to the tool if it desires to do so.
