clang performance when building Linux

Hi,

I profiled clang while building the Linux kernel.
The difference compared to gcc is small: clang with asserts off is a
bit faster than gcc, and with asserts on it is a bit slower.

Here is the 'perf report' for clang with asserts off:
https://gist.github.com/918497

Does anything there ring a bell for anyone? Is there anything that could be improved?

Most of it looks like preprocessor work to me. I thought clang
was supposed to be faster than gcc there?

Also interesting is that getFileIDSlow shows up, which it probably shouldn't.

Best regards,
--Edwin

    3.65% clang clang [.] llvm::StringMapImpl::LookupBucketFor(llvm::StringRef)

I'm a bit surprised that StringMap is the most expensive entry here. Micro-optimizing
the hash function (currently a byte-wise djb hash) might help a bit. If someone is
really bored, it would also be useful to test whether other string hash functions, such as MurmurHash or Google's
new CityHash, give better performance.
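
For readers unfamiliar with the term, a byte-wise djb (Bernstein) style hash can be sketched as below. This is only an illustration of the technique, not a copy of LLVM's implementation: the names djbHash and bucketFor, and the seed 5381, are assumptions made for the example.

```cpp
#include <cassert>
#include <cstdint>
#include <string_view>

// Byte-wise Bernstein (djb) style hash: h = h * 33 + byte.
// Illustrative sketch; the seed 5381 is the classic choice and is an
// assumption here, not something taken from LLVM's sources.
inline std::uint32_t djbHash(std::string_view s, std::uint32_t h = 5381) {
  for (unsigned char c : s)
    h = h * 33u + c; // one multiply-add per input byte
  return h;
}

// Map a hash value to a bucket of a power-of-two sized table.
inline std::uint32_t bucketFor(std::uint32_t hash, std::uint32_t numBuckets) {
  assert(numBuckets != 0 && (numBuckets & (numBuckets - 1)) == 0);
  return hash & (numBuckets - 1);
}
```

The loop touches one byte per iteration, which is why word-at-a-time hashes such as MurmurHash or CityHash are natural candidates to compare against.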

    2.87% clang clang [.] clang::Lexer::LexTokenInternal(clang::Token&)

This is expected: the lexer usually dominates any C compiler's run time (well, except for crazy stuff like C++ ;).
clang's lexer is heavily optimized, so I doubt we can improve much here.

    2.74% clang [kernel.kallsyms] [k] clear_page_c

This is probably the kernel zero-filling pages for malloc.

    2.07% genksyms genksyms [.] yylex
    1.88% clang clang [.] clang::SourceManager::getFileIDSlow(unsigned int) const

Despite having the word "slow" in its name, this function is well optimized and known to be hot.

    1.66% clang libc-2.13.so [.] _int_malloc

Reducing mallocs is an ongoing process. clang already pool-allocates a lot of its data;
LLVM's types, constants and other objects aren't allocated in pools at the moment, but there are plans to fix that.
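
As a rough illustration of what pool allocation buys, here is a minimal bump-pointer arena. It is a hypothetical sketch (the class BumpArena and its members are invented for this example and are not clang's or LLVM's allocator): many small objects are carved out of a few large slabs, so the general-purpose malloc path is hit once per slab rather than once per object.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <vector>

// Hypothetical bump-pointer arena, for illustration only.
class BumpArena {
public:
  explicit BumpArena(std::size_t slabSize = 4096) : slabSize_(slabSize) {}

  // Returns `size` bytes aligned to `align` (a power of two no larger
  // than alignof(std::max_align_t)). Memory is released only when the
  // arena itself is destroyed.
  void *allocate(std::size_t size,
                 std::size_t align = alignof(std::max_align_t)) {
    assert(align != 0 && (align & (align - 1)) == 0);
    assert(align <= alignof(std::max_align_t));
    // Round the current offset up to the requested alignment.
    std::size_t offset = (used_ + align - 1) & ~(align - 1);
    if (slabs_.empty() || offset + size > capacity_) {
      // Current slab is exhausted: grab a new one from the heap.
      capacity_ = size > slabSize_ ? size : slabSize_;
      slabs_.push_back(std::make_unique<unsigned char[]>(capacity_));
      offset = 0;
    }
    used_ = offset + size;
    return slabs_.back().get() + offset;
  }

  // Number of heap allocations made so far.
  std::size_t slabCount() const { return slabs_.size(); }

private:
  std::size_t slabSize_;
  std::size_t capacity_ = 0; // size of the current slab
  std::size_t used_ = 0;     // bytes consumed in the current slab
  std::vector<std::unique_ptr<unsigned char[]>> slabs_;
};
```

The trade-off is that individual objects cannot be freed, which suits data that lives as long as the compilation unit being processed.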

If I interpret this right, there are many collisions in LookupBucketFor, and most of the time
is spent comparing the key against colliding entries, so yes, changing the hash
function might help:
     0.38 : 15fb0d9: 48 89 de    mov %rbx,%rsi
    16.46 : 15fb0dc: f3 a6       repz cmpsb %es:(%rdi),%ds:(%rsi)
    15.37 : 15fb0de: 75 b8       jne 15fb098 <llvm::StringMapImpl::LookupBucketFor(llvm::StringRef)+0x88>
     0.85 : 15fb0e0: 48 83 c4 20 add $0x20,%rsp

Too bad I don't have call graphs here; it would be interesting to see what
calls StringMap this often.
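
To make the link between hash quality and key comparisons concrete, here is a small probing table that counts how often a lookup has to compare key bytes. It is a hypothetical sketch, not LLVM's StringMap: the class ProbeTable, its linear probing scheme and the keyCompares counter are all assumptions chosen for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <functional>
#include <string>
#include <string_view>
#include <utility>
#include <vector>

// Hypothetical open-addressing table that counts full key comparisons.
class ProbeTable {
public:
  using HashFn = std::function<std::uint32_t(std::string_view)>;

  ProbeTable(std::size_t pow2Buckets, HashFn hash)
      : slots_(pow2Buckets), hash_(std::move(hash)) {
    assert(pow2Buckets != 0 && (pow2Buckets & (pow2Buckets - 1)) == 0);
  }

  // Returns the bucket holding `key`, or the empty bucket where it would
  // be inserted. Uses linear probing; the table must never become full.
  std::size_t lookupBucketFor(std::string_view key) {
    const std::uint32_t h = hash_(key);
    std::size_t i = h & (slots_.size() - 1);
    while (slots_[i].used) {
      // Key bytes are compared only when the stored full hash matches,
      // so a weak hash function directly inflates this counter.
      if (slots_[i].hash == h) {
        ++keyCompares;
        if (slots_[i].key == key)
          return i;
      }
      i = (i + 1) & (slots_.size() - 1);
    }
    return i;
  }

  void insert(std::string_view key) {
    const std::size_t i = lookupBucketFor(key);
    if (!slots_[i].used)
      slots_[i] = Slot{true, hash_(key), std::string(key)};
  }

  std::size_t keyCompares = 0;

private:
  struct Slot {
    bool used = false;
    std::uint32_t hash = 0;
    std::string key;
  };
  std::vector<Slot> slots_;
  HashFn hash_;
};
```

Plugging in a deliberately weak hash (for example, the string length) makes every same-length key collide, and each lookup then pays for a byte-wise comparison per colliding entry, which is the kind of cost the repz cmpsb line above points at.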

Interesting. I’m familiar with MurmurHash, and I watched the development of CityHash, so I know it quite well. I’ll take a look at what it would take to use CityHash here. Was anything special done to produce these numbers, or is it just a build of the kernel?

If you could paste how you collected the perf data, that would be useful as well… I’ve not used the ‘perf’ tool extensively before.

Here is what I used:
$ make allmodconfig
$ perf record make CC=clang -j6
(this creates a file perf.data; let it run for at least 2 to 5 minutes,
then interrupt it, or wait for it to finish)
$ perf report
(ncurses-like interface to browse perf.data)

If you get error messages from clang, you should probably use the kernel
from here:
https://github.com/lll-project/kernel

I used the LLVM and clang from here, but performance profiling could be
done equally well with llvm.org trunk versions:
https://github.com/lll-project/llvm
https://github.com/lll-project/clang

Best regards,
--Edwin

Cool, thanks!

I was never able to get the lookup to take as much of my CPU time as you did, but the benchmarks were very noisy. When I used my own stress test benchmarks (massive C++ file and the single-source GCC file) I would see roughly 1.5% of the CPU cycles in this function.

I got CityHash into the codebase and taught StringMap to use it. This saved roughly 50% of the time in the function, taking it under the 1% line. I haven’t looked in detail to see what is taking the time now.

On another benchmark where this function was a bit hotter (roughly 2.4%, similar to the numbers I got by profiling the kernel build) I saw as much as a 1% overall speedup. Nothing stellar, but not terrible either.

If folks are interested, I’ll look at getting CityHash checked in, and investigate using it in a few other places as well where collisions and/or hashing cost us some.

Definitely interested!

So, even using a “torture test” for this part of Clang (-Eonly on gcc.c, a 0.75 MLOC file), I can only make it about 0.5% to 1.5% faster overall. We stop getting collisions, and the CPU time spent in LookupBucketFor drops from 7% to 4%, with memcmp time dropping as well, but the time just goes elsewhere, at least for the test cases I have and the CPU I’m measuring on.

If anyone has a good test case to reproduce the performance impact and measure significant benefit from this, let me know and I’ll send you my patch.

So there is no overall improvement in build time?
Would using PTH help here? If so, could clang cache the parsed headers
and only recreate the PTH when they change?

I'd be happy to try your patch when I have time.
Please send it.

Best regards,
--Edwin

19.04.2011, 05:34, "Chandler Carruth" <chandlerc@google.com>:

> I got CityHash into the codebase and taught StringMap to use it. This saved roughly 50% of the time in the function, taking it under the 1% line. I haven't looked in detail to see what is taking the time now.
>
> On another benchmark where this function was a bit hotter (2.4% roughly, similar numbers to those I got by profiling the kernel build) I saw as much as 1% over-all speed up. Nothing stellar, but not terrible either.
>
> If folks are interested, I'll look at getting City Hash checked in, and investigate using it in a few other places as well where collisions and/or hashing cost us some.

CityHash is optimized for x86_64. Won't it lead to performance degradation on 32-bit machines?
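
One reason a hash built around 64-bit arithmetic can cost more on a 32-bit target is that each 64-bit multiply has to be assembled from 32-bit pieces. The sketch below spells out that decomposition; the function name mul64Via32 is invented for this example, it is not taken from CityHash, and whether the extra work matters in practice would have to be measured.

```cpp
#include <cassert>
#include <cstdint>

// Low 64 bits of a * b, computed using only 32-bit operands, roughly as a
// compiler must do when the target has no native 64-bit multiply.
inline std::uint64_t mul64Via32(std::uint64_t a, std::uint64_t b) {
  const std::uint32_t aLo = static_cast<std::uint32_t>(a);
  const std::uint32_t aHi = static_cast<std::uint32_t>(a >> 32);
  const std::uint32_t bLo = static_cast<std::uint32_t>(b);
  const std::uint32_t bHi = static_cast<std::uint32_t>(b >> 32);

  // One widening 32x32->64 multiply for the low halves.
  const std::uint64_t lo = static_cast<std::uint64_t>(aLo) * bLo;
  // Two more multiplies for the cross terms; only their low 32 bits
  // survive in a 64-bit result, so 32-bit wraparound is intended here.
  const std::uint32_t cross = aLo * bHi + aHi * bLo;

  return lo + (static_cast<std::uint64_t>(cross) << 32);
}
```

That is three multiplies plus additions for what is a single instruction on x86_64.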

Also, there is a warning in the CityHash sources:

"WARNING: This code has not been tested on big-endian platforms!"