gnu vs sysv hash performance

I got curious how the lld-produced gnu hash tables compared to gold's. To
test that, I timed "perf record ninja check-llvm" (just the lit run) in a
BUILD_SHARED_LIBS build.

The performance was almost identical, so I decided to try sysv versus
gnu (both produced by lld). The results are interesting:

% grep -v '^#' perf-gnu/perf.report-by-dso-sym | head
    38.77% ld-2.24.so [.] do_lookup_x
     8.08% ld-2.24.so [.] strcmp
     2.66% ld-2.24.so [.] _dl_relocate_object
     2.58% ld-2.24.so [.] _dl_lookup_symbol_x
     1.85% ld-2.24.so [.] _dl_name_match_p
     1.46% [kernel.kallsyms] [k] copy_page
     1.38% ld-2.24.so [.] _dl_map_object
     1.30% [kernel.kallsyms] [k] unmap_page_range
     1.28% [kernel.kallsyms] [k] filemap_map_pages
     1.26% libLLVMSupport.so.6.0.0svn [.] sstep
% grep -v '^#' perf-sysv/perf.report-by-dso-sym | head
    42.18% ld-2.24.so [.] do_lookup_x
    17.73% ld-2.24.so [.] check_match
    14.41% ld-2.24.so [.] strcmp
     1.22% ld-2.24.so [.] _dl_relocate_object
     1.13% ld-2.24.so [.] _dl_lookup_symbol_x
     0.91% ld-2.24.so [.] _dl_name_match_p
     0.67% ld-2.24.so [.] _dl_map_object
     0.65% [kernel.kallsyms] [k] unmap_page_range
     0.63% [kernel.kallsyms] [k] copy_page
     0.59% libLLVMSupport.so.6.0.0svn [.] sstep

So the gnu hash table helps a lot, but BUILD_SHARED_LIBS is still crazy
inefficient.
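
For context on why the numbers move this way: the two formats use different
hash functions, and .gnu.hash also stores the hash values next to the chains
and keeps a Bloom filter, so most misses never reach strcmp. A minimal sketch
of the two hash functions (illustrative only; not lld's or glibc's code):

#include <stdint.h>
#include <stdio.h>

/* Classic SysV ELF hash, used to pick a bucket in .hash. */
uint32_t sysv_hash(const char *name) {
  uint32_t h = 0;
  for (; *name; name++) {
    h = (h << 4) + (unsigned char)*name;
    uint32_t g = h & 0xf0000000;
    h ^= g >> 24;
    h &= ~g;
  }
  return h;
}

/* GNU hash (DJB-style), used for .gnu.hash.  The hash values are also
   stored in the section (low bit reused as an end-of-chain marker), so a
   cheap integer compare filters candidates before any string comparison. */
uint32_t gnu_hash(const char *name) {
  uint32_t h = 5381;
  for (; *name; name++)
    h = h * 33 + (unsigned char)*name;
  return h;
}

int main(void) {
  const char *sym = "_dl_lookup_symbol_x";  /* arbitrary example name */
  printf("sysv=%08x gnu=%08x\n", sysv_hash(sym), gnu_hash(sym));
  return 0;
}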

Cheers,
Rafael

What is "100%" in these numbers? If 100% means all execution time,
ld-2.24.so takes more than 70% of execution time. Is this real?

perf usually measures cycles ("CPU_CLK_UNHALTED" for core/xeon, e.g.). So
it's not time but cycles. This is a critical distinction when the thing
being measured has delays/synchronization/disk/network I/O.

Also it looks like this report might be decomposed by some other attribute
(DSO-at-a-time?) that would affect what "100%" means.

Doing perf on "ninja check-llvm" seems like it would measure cycles
contributed by lots of non-lld things; in fact, it's worth ruling out
whether the profile is dominated by them. Doesn't the testing itself
perhaps spend more cycles than the linking being done here?

He is measuring the performance of the dynamic linker/loader to see if
lld-generated dynamic symbol tables and their corresponding .hash or
.gnu.hash tables are efficient. So that is a correct way of testing it.

Oh, I see "just the lit run" -- I misunderstood. So we are trying to
measure the difference between lld-generated outputs, and therefore
we're only interested in differences in what the loader does for each? So
they're all in ld-2.24.so, and the other non-ld.so samples are either
interrupt/system-management activity that couldn't be excluded, or perhaps
they come from system calls made on behalf of ld.so?
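
To make the connection to the profile explicit: do_lookup_x walks the search
scope one DSO at a time, and with .hash every candidate in a bucket costs a
check_match/strcmp, while .gnu.hash can reject most DSOs with a Bloom filter
and most candidates with a 32-bit hash compare. A toy model of the two paths
(heavily simplified for illustration; not glibc's actual code):

#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Toy stand-in for one loaded DSO's dynamic symbol table. */
struct toy_dso {
  const char **names;      /* dynamic symbol names */
  const uint32_t *hashes;  /* gnu_hash of each name (.gnu.hash keeps these) */
  uint32_t nsyms;
  uint32_t bloom;          /* 32-bit stand-in for the real Bloom filter words */
};

static uint32_t gnu_hash(const char *s) {
  uint32_t h = 5381;
  while (*s) h = h * 33 + (unsigned char)*s++;
  return h;
}

/* SysV-style search: every candidate costs a string compare
   (the check_match + strcmp time in the sysv profile). */
static int lookup_sysv(const struct toy_dso *d, const char *name) {
  for (uint32_t i = 0; i < d->nsyms; i++)   /* stand-in for one bucket chain */
    if (strcmp(d->names[i], name) == 0)
      return (int)i;
  return -1;
}

/* GNU-style search: a Bloom-filter probe rejects most DSOs outright, and a
   32-bit hash compare filters the rest, so strcmp runs almost only on true
   matches. */
static int lookup_gnu(const struct toy_dso *d, const char *name) {
  uint32_t h = gnu_hash(name);
  if (!(d->bloom & (1u << (h % 32))))       /* cheap per-DSO early-out */
    return -1;
  for (uint32_t i = 0; i < d->nsyms; i++)
    if (d->hashes[i] == h && strcmp(d->names[i], name) == 0)
      return (int)i;
  return -1;
}

int main(void) {
  const char *names[] = { "foo", "bar", "baz" };
  uint32_t hashes[3], bloom = 0;
  for (int i = 0; i < 3; i++) {
    hashes[i] = gnu_hash(names[i]);
    bloom |= 1u << (hashes[i] % 32);
  }
  struct toy_dso d = { names, hashes, 3, bloom };
  printf("%d %d\n", lookup_sysv(&d, "baz"), lookup_gnu(&d, "not_there"));
  return 0;
}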

Rui Ueyama <ruiu@google.com> writes:

> What is "100%" in these numbers? If 100% means all execution time,
> ld-2.24.so takes more than 70% of execution time. Is this real?

I think so; BUILD_SHARED_LIBS is very slow.

On another machine this time (an Amazon c5.9x instance), I just checked the
time that lit reports for "ninja check-llvm":

regular build: Testing Time: 23.69s
BUILD_SHARED_LIBS: Testing Time: 57.60s

There are a lot of libraries, and almost all of their symbols have
default visibility.

Cheers,
Rafael

Brian Cain <brian.cain@gmail.com> writes:

> perf usually measures cycles ("CPU_CLK_UNHALTED" for core/xeon, e.g.). So
> it's not time but cycles. This is a critical distinction when the thing
> being measured has delays/synchronization/disk/network I/O.

I forgot to mention it this time, but I always use a tmpfs. In this case
the system tools (python for example) would be on ssd, but I did a
warm-up run and discarded it before measuring, so I don't expect any
interference from that.

> Also it looks like this report might be decomposed by some other attribute
> (DSO-at-a-time?) that would affect what "100%" means.

I sorted by dso,sym.

> Doing perf on "ninja check-llvm" seems like it would measure cycles
> contributed by lots of non-lld things; in fact, it's worth ruling out
> whether the profile is dominated by them. Doesn't the testing itself
> perhaps spend more cycles than the linking being done here?

No, the idea was to measure the quality of the lld-produced hash tables
that are used by the dynamic linker.
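
For reference, the part the linker controls is the sizing of this structure;
the on-disk header of a .gnu.hash section looks like this (field names are
descriptive, not taken from any particular header file):

#include <stdint.h>

/* Header at the start of a .gnu.hash section.  It is followed in the file
   by bloom_size Bloom-filter words (ELFCLASS-sized), nbuckets bucket
   indices, and the chain array of 32-bit hash values whose low bit marks
   the end of each bucket's chain.  How the linker chooses nbuckets and
   bloom_size determines how often the loader falls through to string
   comparisons. */
struct gnu_hash_header {
  uint32_t nbuckets;     /* number of hash buckets */
  uint32_t symoffset;    /* index of the first dynsym entry covered */
  uint32_t bloom_size;   /* number of Bloom-filter words */
  uint32_t bloom_shift;  /* shift used to derive the second Bloom bit */
};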

Cheers,
Rafael

Rui Ueyama <ruiu@google.com> writes:

Aah, I knew Unix DSOs are not efficient at resolving symbol names, but
this is too slow. I really don't like the Unix semantics of dynamic
linking; Windows is much better.

I also dislike the fact that ELF/Unix/C try to make DSOs usable
transparently. On Windows, you have to explicitly mark imported and
exported functions as dllimport/dllexport, and that is IMO much better
than trying to hide it.
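
For what it's worth, ELF can get fairly close to that model with symbol
visibility: compile the library with -fvisibility=hidden so only explicitly
marked symbols land in .dynsym. A minimal sketch (the macro and function
names here are made up for illustration):

/* Build as, e.g.:
     cc -fPIC -fvisibility=hidden -shared example.c -o libexample.so
   With -fvisibility=hidden nothing is exported by default; EXAMPLE_API
   (a made-up macro name) marks the real interface, which is the moral
   equivalent of dllexport on Windows. */
#if defined(_WIN32)
#  define EXAMPLE_API __declspec(dllexport)
#else
#  define EXAMPLE_API __attribute__((visibility("default")))
#endif

int internal_helper(void);           /* hidden: no .dynsym entry, so the
                                        loader never has to look it up */
EXAMPLE_API int public_entry(void);  /* exported: visible to ld.so */

int internal_helper(void) { return 42; }
EXAMPLE_API int public_entry(void) { return internal_helper(); }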

There are a lot of libraries, and almost all of their symbols have
default visibility.