Include what you use / include cleanup

Oh, and as a driver for whether a particular change is worth a patch, I gave up on measuring build time: it’s too noisy and not reproducible enough for small- to medium-sized changes. Instead I’m using this per-library approach:

clang++ -E  -Iinclude -I../llvm/include ../llvm/lib/Support/*.cpp -std=c++14 -fno-rtti -fno-exceptions | wc -l
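Wrapped in a small script, the same metric is easy to apply to several libraries at once. This is only a sketch of the one-liner above; the flags mirror that command, and the paths are assumptions about the checkout layout:

```python
import subprocess

# Same flags as the clang++ -E one-liner; adjust include paths to your tree.
CLANG_ARGS = ["-E", "-Iinclude", "-I../llvm/include",
              "-std=c++14", "-fno-rtti", "-fno-exceptions"]

def count_lines(text):
    """Equivalent of `wc -l`: count newline characters."""
    return text.count("\n")

def preprocessed_size(sources):
    """Sum of preprocessor output lines over a list of .cpp files."""
    total = 0
    for src in sources:
        out = subprocess.run(["clang++", *CLANG_ARGS, src],
                             capture_output=True, text=True,
                             check=True).stdout
        total += count_lines(out)
    return total

# Usage: pass it e.g. all files matching ../llvm/lib/Support/*.cpp
```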

It’d be good if anyone’s got the time/resources to invest in figuring out some solution to reduce the chance of backsliding; a one-time cleanup is of unfortunately limited value if the project starts organically backsliding again immediately afterwards.

I second that opinion. I’ve started thinking about that aspect while gathering experience on the whole ‘include dependency’ topic by doing the actual changes. More to come soon (?)

We’ve been thinking about stopping backsliding in the context of Chromium. I’ve put what we’ve done below, but I’m also keen on hearing if others have ideas in this area.

One thing we did a few years ago is the Clang -Wmax-tokens flag and corresponding pragma. For example, if foo.h is a key header file, we’d put this in the corresponding .cc file:

// in foo.cc
#include "foo.h"
#pragma clang max_tokens_here 123456

This will cause Clang to warn (we treat the warning as an error) if the header grows too much, going over the preprocessor token limit. (See -Wmax-tokens.)

Chromium has a big advantage here though, in that we tightly control the build environment so all developers and buildbots use the same toolchain, libraries, etc., and so the preprocessed header sizes should be the same for everyone. But even then, we only turn on the warning in certain blessed build configs.

That does prevent some backsliding, but it’s annoying to have to put in and tweak these pragmas in too many places, and it’s also annoying for the developer who happens to add the “last straw” which brings a header size over some limit.

What we want to do instead is to use the Include Analysis tool mentioned above to surface the total include size in our code review tool whenever someone uploads a patch, and flag any large increases. We’ve had great success doing that to catch binary size regressions, so we think that will be a good tool. We don’t have this implemented yet, though.
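A minimal sketch of what such a review-time check could compute, assuming we can measure the total preprocessed line count at the base and patched revisions. The threshold is made up for illustration; this is not Chromium’s actual tooling:

```python
def flag_size_regression(base_lines, patch_lines, threshold=100_000):
    """Return a reviewer-facing message when the patch adds more than
    `threshold` preprocessed lines, else None.

    base_lines/patch_lines: total preprocessed line counts measured at
    the base revision and with the patch applied (threshold is made up).
    """
    delta = patch_lines - base_lines
    if delta > threshold:
        return "preprocessed size grew by %d lines (threshold %d)" % (
            delta, threshold)
    return None
```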

That kind of thing is also a complete pain for downstreams that make significant changes. I would be quite opposed to that unless there’s some global opt-out we can just define somewhere for our fork to completely kill it downstream.

I assume you mean the “max tokens” thing. I agree, I don’t think that would work for LLVM.

I think having the code review system warn on patches that increase build size significantly would be useful though, but probably hard with LLVM’s infra today.

Yes, max tokens.

Some process update: in order to limit the amount of build breakage, I’m going to systematically go through Phabricator for these refactoring patches, so that I benefit from the pre-commit validation, especially the Windows one. As some reviews may be a bit boring, I’ll probably merge some of them if I don’t get reviews, but I’ll try to avoid that. Is anyone interested in being automatically added as a reviewer?

Yes please. My Phab username is lenary too.

Sure, happy to help

Another update: I gathered some data on the ‘expanded line count’ implied by the last ~2k commits when building LLVM, then did some number crunching to identify interesting commits:

diff > 10000:
+74689 +10065 +85526 +160148 +127034 +25032 +51257 +63493 +27279 +51336 +16659 +16054 +49424 +16659 +12308 +96141 +13193

diff < -10000:
-73084 -11340474 -74689 -10350 -31555 -14666 -11630 -48533 -16659 -12308 -16190 -53859 -28975 -412876
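The number crunching can be sketched like this: given a series of per-commit total preprocessed line counts, report the commits whose delta crosses the 10000-line mark. The function name and data shape are mine, not from the actual scripts:

```python
def interesting_commits(counts, threshold=10000):
    """counts: list of (sha, total_preprocessed_lines), oldest first.
    Returns (sha, delta) for every commit whose delta relative to its
    parent exceeds `threshold` in absolute value, sorted ascending."""
    hits = []
    for (_prev_sha, prev), (sha, cur) in zip(counts, counts[1:]):
        diff = cur - prev
        if abs(diff) > threshold:
            hits.append((sha, diff))
    return sorted(hits, key=lambda h: h[1])
```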

This is really cool, thanks for working on this. I’ll +1 the thing Hans said earlier about displaying some sort of metric for expanded line or token count in phabricator. Easier said than done, but we can’t expect developers to optimize a metric if they can’t see it.

@rnk I can write that script; I already have the basic building blocks ready. Do you know who I should contact, or is there a documentation page on adding pre-commit hooks?

In the longer term, I’d like to write a small clang-tidy check that warns on “obviously unused” includes, but a report of the impact on preprocessed size is already nice.
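Until such a clang-tidy check exists, one crude way to flag “obviously unused” includes is trial removal: comment out each #include in turn and check whether the file still compiles. This is a slow, approximate sketch (the compiler invocation and flags are assumptions), and a removal that still compiles is only a hint, since an include may be needed semantically or on other platforms:

```python
import os
import re
import subprocess
import tempfile

INCLUDE_RE = re.compile(r'^\s*#\s*include\s+["<][^">]+[">]')

def include_lines(source):
    """Indices of lines that are #include directives."""
    return [i for i, line in enumerate(source.splitlines())
            if INCLUDE_RE.match(line)]

def without_line(source, idx):
    """Return `source` with line `idx` commented out."""
    lines = source.splitlines(keepends=True)
    lines[idx] = "// " + lines[idx]
    return "".join(lines)

def compiles(source, flags=("-std=c++14", "-fsyntax-only")):
    """True if clang++ accepts `source` (flags are an assumption)."""
    with tempfile.NamedTemporaryFile(suffix=".cpp", mode="w",
                                     delete=False) as f:
        f.write(source)
        path = f.name
    try:
        return subprocess.run(["clang++", *flags, path],
                              capture_output=True).returncode == 0
    finally:
        os.unlink(path)

def maybe_unused_includes(source):
    """Line indices of includes whose removal still compiles."""
    return [i for i in include_lines(source)
            if compiles(without_line(source, i))]
```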

I think Christian Kühnel (sorry, I don’t know his Discourse handle) and @goncharov have the most experience with Phabricator pre-commit hooks.

It’s all here: llvm-premerge-checks/scripts at main · google/llvm-premerge-checks · GitHub

I suppose there are two ways of checking things before merging to main:

  1. Adding it to arc lint as we have it for clang-format. This only works for tools that run relatively fast on the modified files/lines.
  2. Adding it to pre-merge testing, where we have more time since it runs in the background. I created an issue to track this effort. Please comment on the issue with what needs to be added to pre-merge testing to run these checks and what the output of the script should look like.

However, all of these checks only cover changes going through Phabricator. Anything that is pushed directly to the main branch is not covered.


Does anyone know what the status of clangd’s IncludeCleaner is? Is it able to usefully analyse the llvm/clang source code, and how does it compare to IWYU?

CC’ing @kirillbobyrev who might be able to comment.

@RKSimon Compared to IWYU:

  • it currently only suggests removals, not additions.
  • it produces individual warnings (conservatively) rather than canonicalizing a whole file.
  • it runs incrementally in the IDE rather than as a batch job. (Should be possible to put a batch job around it)
  • it’s less mature in general, and still off by default in clangd. Specifically:
    • it lacks support for “private” IWYU pragmas, so umbrellas like <gtest.h> yield false positives. (Blocker to turn on by default)
    • only limited support for stdlib headers so far. (So we never suggest removal of stdlib headers)

I use it regularly when working on clangd, and recently cleaned up a lot of unused includes across the project. (It worked well, but the workflow of opening every file and hitting “apply fix” on every diagnostic is slow.)

I think its role is:

  • small-scale opportunistic cleanups (a file or two)
  • reducing natural backsliding during editing
  • making users aware that a larger cleanup is needed

(@kirillbobyrev is indeed driving this effort, but is less available right now for geopolitical reasons :-()


Hey folks, some updates:

First, big thanks to @MaskRay @RKSimon and @lenary for reviewing most of the commits, that’s greatly appreciated.

The following graph shows the number of lines output by the preprocessor when building LLVM:

[graph omitted]

On the x axis, the number of preprocessor output lines; on the y axis, commits starting with 07486395d2d05c9c567994456774cafdcc1611d0.

As big drops and big increases are the most interesting ones, here is a dump of the major changes, sorted in ascending order:

Nothing too surprising there: header cleanups remove lines and new features add lines.

It’s also interesting to see the scale of the changes: I’ve been investing tons of hours in the process, leading to more than 6M lines of preprocessed code being removed, but libLLVM requires ~238M lines of preprocessed code to build, so that’s roughly 2.5% and largely negligible in terms of impact on the whole code base.

I still think this is useful, even just considering the (assumed) lower coupling that results from the cleanup.

So what’s next? I still need to finish the cleanup for libLLVM, and I’m probably not going further down that slope.

I also would love to improve the IncludeCleaner step from clangd to output fewer false positives on the LLVM codebase (note to self: track those bugs somewhere…), so that we could use it in the CI. I’ve submitted a first step towards using that engine as a stand-alone tool here: ⚙ D121593 [clangd][WIP] Provide clang-include-cleaner.

Alternatively, I wrote a Python script that runs IWYU twice, once before and once after a patch, and diffs the outputs to detect added dependencies. Would someone be interested in taking that script and plugging it into the CI?
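The core of that diffing step could look like the function below (the real script lives elsewhere; this reimplementation is mine, and the function name is made up). The two report strings would come from running include-what-you-use over the tree at the base and patched revisions:

```python
def added_dependencies(before, after):
    """Given two IWYU report texts (base revision and patched revision),
    return the lines present only in `after`, i.e. include dependencies
    the patch introduced."""
    return sorted(set(after.splitlines()) - set(before.splitlines()))
```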

That’s probably too long for a summary, but hey, that’s how it is.

PS: I’ve been diligently breaking the CI with the cleanup commits. I do test all projects on my setup, but not in a multi-platform way, and neither with EXPENSIVE_CHECKS nor with NDEBUG. My future self is working on checking pre-commit CI more diligently.


I think if we want clangd’s include-cleaner as a standalone tool or clang-tidy check (and I agree using clangd interactively for cleanups is painful), we need to extract it out of clangd.
(We’d also really like to use this for cleaning up other codebases).

Proposed it here: [RFC] Lifting "include cleaner" (missing/unused include detection) out of clangd - #2 by sam-mccall


Some closing notes on my side: I’ve finished (with ⚙ D122576 Cleanup includes: final pass) cleaning up the headers required to build libLLVM, and I’m not planning to go any further on the codebase.

I’ve also put online a few scripts that proved useful during that journey: GitHub - serge-sans-paille/preprocessor-utils. I’m still going to run iwyu-diff every now and then to track regressions, leading to small PRs like ⚙ D123036 [iwyu] Fix some header include regression.

I don’t know if someone feels confident enough to write the glue to add that as a Phabricator pre-commit hook? If so, I’d be happy to help.