Clang Preprocessor Speed Up

Hello, folks!

Currently a colleague and I are experimenting with clang. The proposal may seem naive, and excuse me if it was already discussed; we just want to try to implement something useful that seems to be absent for now.

Ok, the idea. It seems interesting to try to make the lexer a bit more efficient at macro expansion by applying partial expansion of macros. The point is that some libraries have rather deeply nested macro definitions, and each time the lexer sees one in code, it re-expands the definition fully. This sometimes seems like overkill, since macros are quite often not redefined in the code, so the expansion could be reused.

Of course, the typical nesting is rather low, but BOOST_PP_REPEAT, for example, can cause such situations.

So, the question is: what do you think about the possible utility of such research, and why?

First, surely the right place for this discussion is the cfe-dev mailing list?

Second, have you determined that this is a noticeable amount of time when compiling? I have no idea - in my Pascal compiler, parsing the code is ~0.1%, codegen to IR ~1.9% and LLVM 98%. But I’m sure Clang is more complex in many ways, so the proportion is probably a bit different - a measurement of the time spent expanding macros would probably help determine if it’s worth doing or not.

Hi Mats,

Thanks for the reply. Yep, you are right, the time should be measured. I imagine the typical workflow:

  • implement prototype
  • take bunch of big real projects
  • compare preprocessing time for initial and changed clang
  • make conclusion whether the idea is sane or not

About usage: probably some IDEs could benefit, since they need to iteratively re-lex the source for correct autocompletion.

What I'm also curious about is whether somebody has already done something on this, or thought about it.
If the idea has already been considered (which I guess is quite possible), it would be interesting to know: did somebody already prove it useless?

I once measured times for one of the Boost library examples: preprocessing (-E) was about 20% of total compilation time. This is not typical in general, but it is quite common with Boost libraries, as hundreds to thousands of files may be included, with tons of macros and nested macros.

But even then, how much of the total time is expanding macros, and how much is “reading and finding the actual files” (and writing the output)?

I'm not saying this is not worth doing; I'm just trying to avoid someone spending time on something that doesn't provide benefit. I speak from experience: I've "optimized" code and then found that it didn't make any improvement at all, and I've also done work which gives 3-30x speedups with some simple steps... So: measure, make the improvement, measure.

Or use perf on some typical use case and figure out where the time goes...

So, I have implemented a prototype. For anyone interested, I attach the patch (I used the clang release_38 branch from GitHub as starting code).

The prototype spends 40% less time preprocessing all Boost headers than the original clang (780 seconds vs. 1330). On real-world source code, the patched clang seems to be usually faster than the original.

The big problem is that it currently doesn't generate proper source locations for expanded tokens. This doesn't affect the final code, but of course it makes diagnostics harder when errors occur. On the other hand, the situation is similar to debugging inlined functions: without a special flag, debugging will not be that easy.

I have also measured timings for patched clang against clean clang with the expansion-location information removed (since all the profit could have come from switching off this info). It turned out that the patched version is still 28% faster (780 seconds vs. 1100).

I think it may be useful to have a flag in clang that enables fast preprocessing, since the profit can sometimes reach 4x!

I'm pretty sure there are some bugs I haven't yet recognized, so any feedback is highly appreciated.

LLVM Developers mailing list

llvm-dev@lists.llvm.org

--
Regards,
Andrei Serebro
tel. +79111758381
macros_patch.diff (47.4 KB)

Adding Reid and Nico who have been struggling with lexer / preprocessor compile times recently.

Hi,

I came across your post rummaging through the mailing list, and I am most interested in it, not least because, with minimal effort and an admittedly fairly pathological macro expansion, I can make clang crash with the message "Ran out of source locations!" (an assert inside SourceLocations.cpp). So anything that helps there would be great.

However, I am unable to get your patch to work at all. It seems to totally garble the macro expansion, even on simple expansions. And the test-suite produces over 200 unexpected fails. Other problems I see are that it gives an error on the following define (claiming an unterminated function-like macro invocation):

#define M1(...) M2(__VA_ARGS__

and I see lots of errors relating to supposed redefinitions of macros, when they aren't.

Is it possible that you attached an incomplete patch? Do you have a working test-case that I can start with? I would be delighted to test your work further.

Cheers,
Andy

Hi,

Yep, I noticed several problems with the patch after I sent it, my bad, really: the first was with variadic macros, the second was erroneous ## processing, and now I'm also having some problems compiling Boost 1.61.
For now I have worked around the first problem, so variadic macros are simply not cached and are expanded the standard way.
The second I believe is fixed; now I'm struggling with the third.

However, I am unable to get your patch to work at all. It seems to totally
garble the macro expansion, even on simple expansions. And the test-suite
produces over 200 unexpected fails.

That's interesting. Simple samples should definitely work fine (as long as there are no variadics). Please share the pieces that cause bad behavior; I will try to understand what's wrong with them (probably it's those ## errors).
Unexpected failures are, I guess, to be expected, since LLVM uses lit to check test results, and that checking consists of pattern-matching the output against the expected one. Since I have removed the SourceLocation information, the output definitely differs from the expected output that uses it. But I will also look at whether something unconnected with locations breaks the tests.

and I see lots of errors relating to supposed redefinitions of macros, when
they aren’t.

Can you please attach some code that reproduces the redefinition problems? I hoped I had found all the cases where it happens, but I could have missed something.
Thanks a lot for the feedback.

Hi,

After applying your patch, you can run all clang regression tests with "ninja check-clang".
Passing these would be a good start.

Yaron

Ok, that would be a miracle.

As I understand it, unfortunately, getting rid of proper source locations causes big problems, yet it's currently hard for me to find small samples on which everything crashes. I mean, it would be great to reduce the crashing code dramatically; for now, the examples of misbehavior are huge.

I think a good understanding of the interactions between SourceManager, Lexer, and Parser is needed now, because it's somewhere in their communication, so to speak, that things turn out to be broken after the patch.
I see a possibility in trying to preserve proper SourceLocations, yet I'm not sure it will be an easy feature.

I wonder if I can find some person who deeply understands the concept of SourceLocations so that it’s possible to discuss the underlying problems?
Guys, I need a piece of advice :expressionless:

My suggestion, but others should jump in here too, is that you split the patch
up into two separate aims.

The first aim would be the reduction of SourceLocation information.

The second aim would be caching of previous macro expansions (from what I
understand of your patch, I think this is what you are attempting?).

I think this would be a benefit since it may be easier to get the patches
working if you work on the two parts independently. Plus you indicated that
the first still has a sizeable benefit. And I think the second will give you
problems (detailed below).

I'm afraid I am not able to help you much on the interactions of the
SourceManager, Lexer and Parser. There are others on this mailing list who
would be able to help you much better in this regard.

Regarding reduction of SourceLocation information, these are my thoughts.
Please note that I haven't assessed the practicality of this, and I'm writing
as I think so it may not even be coherent! Caveats aside, you talked
previously of having this as a compiler switch. I think this is a good idea
since the SourceLocation information is needed for debugging macro expansions.
I would suggest implementing this switch as a depth counter, so that once a macro
expansion reaches a certain depth, the SourceLocation information is elided. In
a sense, the important information of a macro expansion is the SourceLocations
of the original tokens going into the macro and the SourceLocation of the final
scratch buffer. At this point, I'm wondering whether it is useful to have
SourceLocation information for concatenated tokens... Maybe someone else can
give their thoughts here.

The other thing you can think about here is that of actually releasing a
scratch buffer no longer in use. I think clang doesn't do this currently
(probably because of keeping all the SourceLocation information), but if you
looked at having a reference count on a scratch buffer then maybe it will be
possible to release it once its last use is complete. This has the benefit of
also freeing memory... and may also improve the speed of the compiler. Again,
others should give their thoughts here. Personally, I would be very
interested to see if this was a possible line of improvement since with
extensive macro expansions, the memory consumption of clang is huge.

In fact, thinking about this, this may be the first place to start -- it is
sort of linked to eliding SourceLocation information, but it has the advantage
that you are freeing the whole scratch buffer since it has been incorporated
into another higher-level scratch buffer, and so is no longer referenced
anywhere and its disappearance shouldn't (hopefully!) cause so many problems
in clang diagnostics, for example, which even so would need to be adapted to
skip these elided macro expansion steps.

Finally, regarding the caching of previous macro expansions, you need to be
very careful because macro definitions can change, yes even inside a macro
expansion too (think pragma pop_macro). For that matter, you need to think
about how to handle any _Pragma expansion since these can cause side effects.
This is why I would leave this side of your patch until last. It will be the
hardest to test and guarantee correctness.

Keep up the good work! :o)

Cheers,
Andy

Well, you know, this splitting sounds like a good idea. I guess I'll try to follow your advice, since it really is somewhat strange to solve two tasks at a time.

I weighed the advantages and disadvantages, and I think that preserving source locations would drain any sense out of speeding up macro expansion, because we would need to do all the same work as we did without caching: to create proper locations, we would need to walk down from the very top macro for every token. So the idea from my previous post was not that sound; preserving proper source locations means just giving up the idea altogether, as it would barely reduce time consumption at all.

You are right about the debugging purpose of SourceLocations, and I think the idea of a depth limit, below which the info is kept, is solid. It looks more flexible than turning it off completely.

About the scratch space: to be honest, I haven't thought about it yet. I hadn't even looked in that direction, so it's good you pointed it out. First of all, I need to understand what this scratch space is all about. =)

Things like pragmas inside macro definitions, which spoil caching, can be overcome by simply forbidding caching in certain situations, though of course all such situations need to be identified carefully.

An interesting point regarding Andy's comment: I found out that GCC already has (if I understood correctly) the feature of limiting macro-expansion tracking depth.

The option is -ftrack-macro-expansion[=level], and its description can be found here:
https://gcc.gnu.org/onlinedocs/gcc/Preprocessor-Options.html

As I understand, clang doesn’t bother with this yet.

So, taking your advice I got rid of caching for now and concentrated on turning off source locations.

I have introduced a new flag, -no-macro-exploc-tracking. Passing it to the compiler does exactly what it is intended to do: we don't generate expansion locations for macro tokens. To evaluate __LINE__ correctly, we just memorize the topmost macro currently being expanded (this is OK since all macro expansions produce one-liners... though there are some cases with multiline arguments to function-like macros, and I haven't yet worked out whether we should do something about them).

I have tested it thoroughly on Boost 1.61. Among the first 11,000 files, at least, compilation success is the same for the patched and clean release_38.

The average preprocessing speedup is 1.2x. Cases in which clang runs out of source locations should also be less frequent, I hope. At least, here is a good example where the compiler doesn't fall over with the flag and does without it:

clang -c -ftemplate-depth-32 -O3 -finline-functions -Wno-inline -Wall -pedantic -Wno-variadic-macros -pedantic-errors -std=c++11 -I $BOOST_1_61_HOME -no-macro-exploc-tracking $BOOST_1_61_HOME/libs/vmd/test/test_doc_modifiers_return_type.cpp

If I don't pass the flag, the process hangs (at least on my machine).

I attach the patch, hope for some feedback.

P.S. Also here are preprocessing benchmarks for boost.

patch_slocs.diff (20.7 KB)

Ouch, forgot to attach stats. Here they are.

boost_times_slocs.ods (1.7 MB)

Hi, the speedups look great. Does this patch pass a clean `ninja check-clang`?

Yep, of course it passes, since without the new flag it's pretty much the same as release_38.
I don't know; probably either some new tests should be written and clang run on them with the new flag, or we could just reuse the old ones but somehow make clang run with the new flag...

I didn't have a lot of time to dig into the patch, but an off-by-default flag that speeds things up at the cost of a reduced diagnostic experience doesn't seem that valuable.

I thought there was a macro expansion note backtrace limit. Is there a way that we can leverage that to generate unique expansion source locations when the expansion level is low but cut them off when the expansion level is deep? I’m not a preprocessor expert, so can you elaborate on exactly what you’re caching and reusing here?

Yep, sure

You are right, currently there is just a possibility to turn off source-location tracking. It would be nice, I guess, to add a depth limit as we discussed earlier, and I think it is possible (at least, there is already a depth counter, but it is used for the slightly different purpose of setting the root macro while an expansion is being performed). And this is the simplest part of what could be done.

I will try to add a depth parameter soon, and we will see where it leads us. I understand that the following is rather easy to do: for each token at depth greater than D relative to the top-level expanding macro, we simply don't evaluate an expansion location. I am not sure whether you mean a different problem, like "for each token, keep information about the D closest parents"; I guess solving that wouldn't improve anything, even though it looks useful. I will try to understand whether the second problem can be solved efficiently, but for now I assume we are talking about the first one.

Example for the first problem statement:

#define M1(X) X + M2(X)
#define M2(X) X
M1(1) // expansion of M1; M1 is a top-level expanding Macro

Suppose we have set a depth limit of D=1. The result will then contain the tokens '1' and '+' with correct expansion locations, while the second '1', which comes from the expansion of M2, will have no expansion location.

About caching: currently this is postponed for a separate patch. The idea was rather simple, but as Andy Gibbs said, it’s uncooked and it’s better to split the tasks into two separate parts. So, now no caching is performed, at all.