Include what you use / include cleanup

serge-sans-paille · January 19, 2022, 6:59pm

Hi Folks,
Inspired by the work on faster Linux kernel build [0], I’ve started to cleanup the LLVM headers, helped by iwyu [1] (but not in an automatic way, the tool has some limitations wrt. LLVM codebase).

Why such a cleanup?

Speedup builds by having smaller compilation units. In current state, on my laptop, I’ve been able to rebuild libLLVMDebugInfoDWARF.so in 7m01s using 3 cores, compared to a base run of 7m11s. Not a 10% speedup, but still a decent step at this stage
Please incremental builds: less dependencies imply less rebuilds when one touches a header
More informative includes: reading the header gives a decent approximation of the requirements of the CU

What kind of cleanup?

Remove unused forward declaration of types
Add forward declaration of types instead of header inclusion
Remove headers that don’t bring any used identifier

Can I see the patch?

I’ve not submitted it to phabricator yet, but I’m unsure it will be a pleasant read… I’m still balancing the best way to have it reviewed (maybe it’s not even meant to be reviewed…)

I’ve done roughly 33% of the header cleanup, before I go one, I’ll happily take feedback about the approach.

[0] [PATCH 0000/2297] [ANNOUNCE, RFC] "Fast Kernel Headers" Tree -v1: Eliminate the Linux kernel's "Dependency Hell"
[1] https://include-what-you-use.org/

nikic · January 19, 2022, 9:09pm

Reducing header dependencies is always welcome Touching headers in IR is always a pain because you have to wait for 1.5k files to rebuild.

I’ve not submitted it to phabricator yet, but I’m unsure it will be a pleasant read… I’m still balancing the best way to have it reviewed (maybe it’s not even meant to be reviewed…)

Depends on the kind of cleanup you’re doing. Just removing unnecessary includes or converting them to forward declarations doesn’t need review. Of course, sometimes more substantive changes are needed to decouple things.

Something to watch out for is that removing an include in a core header will usually require adding it in some other headers that were previously relying on transitive includes – which may be in code one doesn’t usually build (say all the way over in MLIR or LLDB).

You might find LLVM #include Analysis interesting, which continuously tracks impact of headers on compilation unit size.

pogo59 · January 19, 2022, 9:37pm

Thanks for taking this on! I’ve seen other people do a few small changes along these lines to help out, but it has generally been just a few big offenders.

LLVM has always favored incremental development, so I would say you do not want to post one mega-patch. Or, if you take the “obvious cleanup doesn’t need review” route, you do not want to commit one mega-patch either.

I’m sure you are doing the work step by step, so as @nikic suggested it might be best to commit simpler changes as you go. If there is some more substantive change then of course a review may be appropriate, especially if you had choices to make or questions about what’s preferable in a particular patch.

snehasish · January 19, 2022, 10:55pm

This is a good idea with incremental cleanup across the codebase.

For the proposed changes, while removing headers and unused forward declarations are great, adding forward declarations to remove headers may cause trouble in the future (or sooner for out of tree LLVM API users). The google style guide does a good job of summarizing some of the potential issues here [1]. The LLVM style guide doesn’t seem to have a particular preference for forward declarations apart from requiring forward declarations for raw_ostream [2] in public headers.

[1] Google C++ Style Guide
[2] https://llvm.org/docs/CodingStandards.html#use-raw-ostream

mehdi_amini · January 19, 2022, 11:18pm

I’m surprised we don’t have a guideline for adding forward declarations to eliminate header includes, I thought we were heavily favoring doing this (opposite to the Google style basically)?

mehdi_amini · January 19, 2022, 11:24pm

@serge-sans-paille : I don’t know if you’ll process the entire repo, but I’m interested in applying the recipe to MLIR ; so if you can share the script / process it can be useful!

mehdi_amini · January 19, 2022, 11:42pm

Actually we do have a guideline, it just does not have the words “forward declaration” but it is pretty explicit : LLVM Coding Standards: include-as-little-as-possible

snehasish · January 20, 2022, 7:03pm

Thanks for clarifying Mehdi. I’ll admit my comment was based on a search for the term “forward” in the LLVM Coding Standards doc

dblaikie · January 24, 2022, 10:55pm

It’d be good if anyone’s got the time/resources to invest in figuring out some solution to reduce the chance of backsliding - doing a one time cleanup is of unfortunately limited value if the project starts organically backsliding again immediately after.

Maybe there’s some clang-tidy based IWYU that could be integrated into Phabricator precommit checks? Or something run on a buildbot? (I doubt there’s anything quite fast enough to run as part of the regular build - but if there is, that’d be cool too/instead) Not sure - but hopefully something like that.

RKSimon · January 25, 2022, 1:42am

Its not fantastic, but the LLVM #include Analysis page (already mentioned above) keeps a running tally on the total build size for clang - you can already see a fall in the past week which I think we can attribute to Serge’s great work. I’d love to see something that can help us log changes in build (+check) time as well.

https://commondatastorage.googleapis.com/chromium-browser-clang/llvm_includes-index.html

mehdi_amini · January 25, 2022, 1:43am

Ping @serge-sans-paille ?

clattner · January 25, 2022, 4:12am

Yes, this is my understanding of our approach. We prefer forward declarations wherever reasonable, there are some things that cannot be forward declared (e.g. in STL headers). The rationale is that we can refactor things to improve build time and dependencies, and we don’t have a source compatibility guarantee for out of tree users.

-Chris

serge-sans-paille · January 25, 2022, 6:38am

A small update on the methodology I’m using:

For each LLVM library, I’m first running iwyu to get rid of explicit false dependencies, and potential forward declarations). Beware though that an iwyu run is bound to that particular run in terms of macros (and thus of system, thanks to windows/apple/linux specific defines), including debug macros. So once the log is generated, I do the change manually.

These changes most of the time exhibit previous false dependencies, so I’m recompiling the whole LLVM ecosystem in order not to break the build bots. Again, that check is bound to my system, but that greatly limits the number of patches I need to send once the first one lands.

Once this first cleanup round is done, based on the preprocessed line tracker quoted by @nikic and @RKSimon I manually scan each header of the given library for uses of <chrono>, <algorithm>, <vector>, <iterator>… that could be avoided. This can mean moving that include and its usage to the cpp file, changing the original algorithm etc, the answer varies. To help me in doing so, I currently use that small script, which is just a beautifier around the little-known -H flag of the preprocessor.

#!/bin/sh
tmp=`mktemp`
clang++ -H -E -Iinclude -I../llvm/include $1 -std=c++14 -fno-rtti -fno-exceptions 1>/dev/null 2> $tmp
grep -E '^\.{1,2} ' $tmp | sed -r -e 's,/usr/lib/gcc/x86_64-redhat-linux/11/../../../../include/c\+\+/11/(.*),<\1>,' -e 's,/usr/include/(.*),<\1>,' -e 's,../llvm/include/(.*),"\1",' -e 's,^\. ,+--,' -e 's,^\.\. , \\____,'
rm $tmp

I’ve not developed that yet, but a tool that map each first-level include to the set of identifiers from that include that’s actually used in the file would help

serge-sans-paille · January 25, 2022, 6:40am

Oh, and as a driver for whether a particular change is worth a patch, I gave up with measuring build-time: it’s too noisy and not reproducible enough for small - to medium changes. Instead I’m using this per-library approach:

clang++ -E  -Iinclude -I../llvm/include ../llvm/lib/Support/*.cpp -std=c++14 -fno-rtti -fno-exceptions | wc -l

serge-sans-paille · January 25, 2022, 7:03am

It’d be good if anyone’s got the time/resources to invest in figuring out some solution to reduce the chance of backsliding - doing a one time cleanup is of unfortunately limited value if the project starts organically backsliding again immediately after.

I second that opinion. I’ve been starting to think of that aspect, while gathering experience on the whole ‘include dependency’ topic when doing the actual changes. More to come soon (?)

hansw2000 · January 25, 2022, 1:28pm

We’ve been thinking about stopping backsliding in the context of Chromium. I’ve put what we’ve done below, but I’m also keen on hearing if others have ideas in this area.

One thing we did a few years ago is the Clang -Wmax-tokens flag and corresponding pragma. For example, if foo.h is a key header file, we’d put this in the corresponding .cc file:

// in foo.cc:
#include "foo.h"
#pragma clang max_tokens_here 123456

Which will cause Clang to warn (we treat it as an error) if the header grows too much, going over preprocessor token limit. (See Chromium Docs - -Wmax-tokens)

Chromium has a big advantage here though, in that we tightly control the build environment so all developers and buildbots use the same toolchain and libraries, etc, and so the preprocessed header sizes should be the same for everyone. But even then, we only turn on the warning in certain blessed build configs.

That does prevent some backsliding, but it’s annoying to have to put in and tweak these pragmas in too many places, and it’s also annoying for the developer who happens to add the “last straw” which brings a header size over some limit.

What we want to do instead is to use the Include Analysis tool mentioned above to surface the total include size in our code review tool whenever someone uploads a patch, and flag any large increases. We’ve had great success doing that to catch binary size regressions, so we think that will be a good tool. We don’t have this implemented yet, though.

jrtc27 · January 25, 2022, 3:13pm

That kind of thing is also a complete pain for downstreams that make significant changes. I would be quite opposed to that unless there’s some global opt-out we can just define somewhere for our fork to completely kill it downstream.

hansw2000 · January 25, 2022, 3:33pm

I assume you mean the “max tokens” thing. I agree, I don’t think that would work for LLVM.

I think having the code review system warn on patches that increase build size significantly would be useful though, but probably hard with LLVM’s infra today.

jrtc27 · January 25, 2022, 3:43pm

Yes, max tokens.

serge-sans-paille · January 27, 2022, 9:07am

Some process update: In order to limit the number of ‘build breakage’, I’m going to systematically go through phabricator for these refactoring patches, so that I benefit from the pre-commit validation, esp. the Windows one. as some reviews may be a bit boring, I’ll probably merge some of them if I don’t get reviews, but I’ll try to avoid that. Is anyone interested in being automatically added as reviewer?

Topic		Replies	Views
Who wants faster LLVM/Clang builds? LLVM Dev List Archives	37	398	December 15, 2017
Widescale clang-tidy (or similar) based cleanup LLVM Dev List Archives	9	132	April 10, 2017
[RFC] Slight tweak to header inclusion policy to promote IWYU LLVM Project llvm	8	496	October 2, 2024
[RFC] Add include-what-you-use tool to clang-tools-extra Clang Frontend	17	399	September 14, 2017
Announcing include-what-you-use 0.1 Clang Frontend	12	181	February 8, 2011

Include what you use / include cleanup

Related topics