Help needed for triaging code generation

Hey,

I don’t really know how to proceed:

I have created the issue “GCC 15 - LLVM Clang 20 Compiler Performance comparison” (Issue #127069, llvm/llvm-project on GitHub).

Unfortunately, no one has been able to help me with these problems so far.
Now I have tried the following for a sub issue, the Quicksilver OpenMP HPC benchmark:

  1. compiled Quicksilver with gcc and executed with perf
  2. compiled Quicksilver with clang and executed with perf
  3. tried to compare the top 3 functions with the highest runtime:

clang-trunk:

32% MCT_Nearest_Facet
22% NuclearData::getReactionCrossSection
15% macroscopicCrossSection

gcc-14:

28% NuclearData::getReactionCrossSection
23% macroscopicCrossSection
18% (anonymous namespace)::MC_Nearest_Facet MCT_Nearest_Facet_3D_G

I know I should have used clang 19 as the baseline, but I wanted to see whether clang trunk had maybe already fixed the issues. It is still slower than gcc 14.

I am confused, as perf on my test PC (Core i3-6100) shows the following ASM sequence:

vpgatherqd
vpxor
vgatherqpd

which shouldn’t be generated because the CPU only has AVX2 while some of these instructions are AVX512?!

  4. tried to display these functions in Godbolt with different compilers and versions:

MCT_Nearest_Facet

NuclearData::getReactionCrossSection

MacroscopicCrossSection

What am I doing wrong that I can’t see clear differences in the generated code?
How can I compare the ASM with LTO enabled in Godbolt?
Any thoughts on how I can efficiently find the root cause?

Performance is not my strong suit but…

Are you sure about this? I have basically no Intel experience so take this for what it’s worth, but AVX2 Instructions - x86 Assembly Language Reference Manual lists the ones you have shown as AVX2 (I know that’s not from Intel but idk what the official reference is).

It could be that they are also available in AVX512. If your CPU is running them, they must be allowed by something.
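For what it’s worth, the VEX-encoded gathers (vpgatherqd, vgatherqpd on xmm/ymm registers) are part of AVX2; AVX-512 adds EVEX-encoded forms of the same mnemonics that use zmm registers and k mask registers, which is probably where the confusion comes from. If you want to double-check what the CPU itself advertises, here is a minimal sketch, assuming a Linux host with /proc/cpuinfo. The helper names `cpu_flags` and `summarize` are made up for illustration.

```python
def cpu_flags(cpuinfo_text):
    """Return the set of feature flags from the first 'flags' line."""
    for line in cpuinfo_text.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip() == "flags":
            return set(value.split())
    return set()


def summarize(flags):
    # "avx", "avx2" and "avx512f" are the names the Linux kernel
    # uses for these features in /proc/cpuinfo.
    return {name: name in flags for name in ("avx", "avx2", "avx512f")}


if __name__ == "__main__":
    try:
        with open("/proc/cpuinfo") as f:
            print(summarize(cpu_flags(f.read())))
    except OSError:
        print("no /proc/cpuinfo here (not Linux?)")
```

If `avx2` is reported and `avx512f` is not, gathers on xmm/ymm registers are expected to run fine, while anything touching zmm or k registers would not.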

I think comparing performance against clang-19 will be more informative, as the distribution of time spent in the gcc-compiled version may be so different that it’s hard to tackle everything at once.

The top 3 functions are the same and together take the same share of the time (69% in both profiles), but if the overall runtime is different, maybe that breaks the correspondence. Maybe not, but comparing to clang-19 seems ideal given that’s where we regressed from.

A clang 19 to 20 perf comparison is your next step I think.

If you add -flto and enable “Link to binary”, this does…something. I think you will need to add a main function that at least calls the function you’re interested in.

In other words, if the top 3 functions are the same, this tells you that the majority of the work happens there, no matter what compiler. The regression could still be elsewhere in the benchmark.

That top 3 would be the first candidates if you wanted to improve performance in general though.

I’m not sure if you posted the wrong links: you’re asking for a comparison between GCC and Clang, but the Godbolt links show a comparison between two different versions of Clang. Also, they’re using -march=znver5 while you’re targeting a Core i3.

You can start with the most obvious thing, like comparing the assembly code. I don’t know how to do that on Godbolt, but you can always use good ol’ vimdiff or something. Though in this particular case, the fact that MCT_Nearest_Facet only appears in the Clang profile suggests that GCC might inline it into NuclearData::getReactionCrossSection, so the assembly diff might be noisy.
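A raw diff of two disassemblies is mostly noise from differing addresses and encodings, so it can help to normalize first. The following is a rough sketch, assuming both inputs are text produced by `objdump -d`; the function names `normalize` and `asm_diff` are hypothetical, and displacements such as `0x2ed3(%rip)` are deliberately left alone.

```python
import difflib
import re

# objdump -d instruction line, e.g.
#   "  401136:\t48 89 e5             \tmov    %rsp,%rbp"
INSN = re.compile(r"^\s*[0-9a-f]+:\t[0-9a-f ]+\t(.+)$")
# Absolute address printed in front of a symbol, e.g. "401030 <puts@plt>"
TARGET = re.compile(r"\b[0-9a-f]+ (<[^>]+>)")


def normalize(disasm_text):
    """Keep only the instruction text and drop absolute target addresses."""
    out = []
    for line in disasm_text.splitlines():
        m = INSN.match(line)
        if m:
            insn = re.sub(r"\s+", " ", m.group(1).strip())
            out.append(TARGET.sub(r"\1", insn))
    return out


def asm_diff(a_text, b_text, a_name="gcc", b_name="clang"):
    """Unified diff of two normalized disassemblies, as a list of lines."""
    return list(difflib.unified_diff(normalize(a_text), normalize(b_text),
                                     a_name, b_name, lineterm=""))
```

Register allocation and instruction order will still differ between compilers, so this only removes the most mechanical noise.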

Another low-hanging fruit might be using source lines to correlate assembly code generated by different compilers: for instance, line X maps to 10 instructions in one binary but maps to 100 instructions in the other. I believe Godbolt has some feature like that, or you can just use objdump -S (assuming you build your binary with debug info).
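To make the “line X maps to 10 instructions here and 100 there” idea concrete, here is a small sketch that counts instructions per source line. It assumes the input is the text output of `objdump -d -l` on a binary built with debug info, where each group of instructions is preceded by a `path:line` marker; the function name is made up for illustration.

```python
import re
from collections import Counter

# Source position marker emitted by objdump -l, e.g. "/tmp/x.c:4 (discriminator 1)"
LINE_MARK = re.compile(r"^(\S.*):(\d+)(?: \(discriminator \d+\))?$")
# Instruction line: address, tab, raw bytes, tab, mnemonic
INSN = re.compile(r"^\s*[0-9a-f]+:\t[0-9a-f ]+\t\S")


def insns_per_source_line(disasm_text):
    """Map (file, line) -> number of instructions attributed to it."""
    counts = Counter()
    current = None
    for line in disasm_text.splitlines():
        m = LINE_MARK.match(line)
        if m:
            current = (m.group(1), int(m.group(2)))
        elif current is not None and INSN.match(line):
            counts[current] += 1
    return counts
```

Running this on the gcc and the clang binary and comparing the two counters line by line should point at the source lines where the compilers disagree most, with the usual caveat that debug info for optimized code is approximate.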

The general rule of thumb of performance engineering is that, for better or for worse, there is no fixed procedure. You can build a checklist but at the end of the day, you probably have to act adaptively.

You mean that MCT_Nearest_Facet suggests a standalone function, but GCC’s (anonymous namespace)::MC_Nearest_Facet MCT_Nearest_Facet_3D_G suggests one function inlined into another?

I’m not sure how to interpret the name of the GCC function.


Thanks for the help so far.
Sadly, I can’t reproduce the Phoronix results (Zen 5) for Quicksilver on my PC (Skylake-S).
The results for Clang 19, trunk and gcc are very close here.
So I think the differences probably only occur on newer architectures.

But I can at least reproduce a noticeable performance difference for coremark:
gcc-14 scores 69K points, compared to 60K points for Clang 19 and 63K points for Clang trunk.

perf gcc14:

Samples: 293K of event ‘cpu-clock’, Event count (approx.): 73430750000
Overhead Command Shared Object Symbol
42,00% coremark.exe gc coremark.exe gcc14 [.] calc_func
38,16% coremark.exe gc coremark.exe gcc14 [.] core_bench_list
19,74% coremark.exe gc coremark.exe gcc14 [.] core_state_transition
0,09% coremark.exe gc coremark.exe gcc14 [.] main
0,00% coremark.exe gc [kernel.kallsyms] [k] update_blocked_averages
0,00% coremark.exe gc [kernel.kallsyms] [k] handle_softirqs
0,00% coremark.exe gc [kernel.kallsyms] [k] rebalance_domains
0,00% coremark.exe gc [kernel.kallsyms] [k] ___slab_alloc
0,00% coremark.exe gc [kernel.kallsyms] [k] __memcg_slab_free_hook
0,00% coremark.exe gc [kernel.kallsyms] [k] __mod_timer
0,00% coremark.exe gc [kernel.kallsyms] [k] _raw_spin_unlock_irq
0,00% coremark.exe gc [kernel.kallsyms] [k] _raw_spin_unlock_irqrestore
0,00% coremark.exe gc [kernel.kallsyms] [k] down_read_trylock
0,00% coremark.exe gc [kernel.kallsyms] [k] finish_task_switch.isra.0
0,00% coremark.exe gc [kernel.kallsyms] [k] load_balance
0,00% coremark.exe gc [kernel.kallsyms] [k] next_uptodate_folio
0,00% coremark.exe gc [kernel.kallsyms] [k] perf_event_mmap_output
0,00% coremark.exe gc [kernel.kallsyms] [k] update_sd_lb_stats.constprop.0
0,00% coremark.exe gc ld-linux-x86-64.so.2 [.] _dl_relocate_object
0,00% coremark.exe gc libc.so.6 [.] shmat

perf clang trunk:

Samples: 333K of event ‘cpu-clock’, Event count (approx.): 83330500000
Overhead Command Shared Object Symbol
31,19% coremark.exe cl coremark.exe clang21 [.] core_state_transition
30,97% coremark.exe cl coremark.exe clang21 [.] core_bench_list
23,28% coremark.exe cl coremark.exe clang21 [.] core_bench_matrix
8,29% coremark.exe cl coremark.exe clang21 [.] crcu16
6,25% coremark.exe cl coremark.exe clang21 [.] core_bench_state
0,00% coremark.exe cl coremark.exe clang21 [.] iterate
0,00% coremark.exe cl [kernel.kallsyms] [k] irqentry_exit_to_user_mode
0,00% coremark.exe cl [kernel.kallsyms] [k] process_csb
0,00% coremark.exe cl [kernel.kallsyms] [k] update_blocked_averages
0,00% coremark.exe cl [kernel.kallsyms] [k] __get_user_8
0,00% coremark.exe cl [kernel.kallsyms] [k] _raw_spin_unlock_irq
0,00% coremark.exe cl [kernel.kallsyms] [k] __pte_offset_map
0,00% coremark.exe cl [kernel.kallsyms] [k] do_user_addr_fault
0,00% coremark.exe cl [kernel.kallsyms] [k] handle_softirqs
0,00% coremark.exe cl [kernel.kallsyms] [k] load_balance
0,00% coremark.exe cl [kernel.kallsyms] [k] mas_leaf_max_gap
0,00% coremark.exe cl [kernel.kallsyms] [k] neigh_timer_handler
0,00% coremark.exe cl [kernel.kallsyms] [k] queue_work_on
0,00% coremark.exe cl [kernel.kallsyms] [k] run_rebalance_domains
0,00% coremark.exe cl [kernel.kallsyms] [k] tcp_orphan_update
0,00% coremark.exe cl [kernel.kallsyms] [k] up_write
0,00% coremark.exe cl [kernel.kallsyms] [k] update_sg_lb_stats
0,00% coremark.exe cl ld-linux-x86-64.so.2 [.] check_match
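Since the two runs collected different numbers of samples (293K vs 333K), the percentages are not directly comparable. A quick sketch that converts them to approximate absolute sample counts follows; it uses only the rounded totals and the overhead values from the two perf reports above, and it assumes both runs did the same amount of work. The helper name is made up.

```python
def absolute_samples(total_samples, overhead_percent):
    """Convert perf 'Overhead' percentages into approximate sample counts."""
    return {sym: total_samples * pct / 100.0
            for sym, pct in overhead_percent.items()}


# Totals are the rounded "Samples:" values from the perf headers.
gcc14 = absolute_samples(293_000, {
    "calc_func": 42.00,
    "core_bench_list": 38.16,
    "core_state_transition": 19.74,
})
clang = absolute_samples(333_000, {
    "core_state_transition": 31.19,
    "core_bench_list": 30.97,
    "core_bench_matrix": 23.28,
    "crcu16": 8.29,
    "core_bench_state": 6.25,
})

# Only symbols present in both profiles can be compared directly.
for sym in sorted(set(gcc14) & set(clang)):
    print(f"{sym:24s} gcc14 {gcc14[sym]:9.0f}   clang {clang[sym]:9.0f}")
```

By this estimate core_bench_list costs about the same in both builds, while core_state_transition goes from roughly 58K samples with gcc to roughly 104K with clang. The symbol boundaries differ because of inlining (calc_func only exists in the gcc profile, crcu16 only in the clang one), so treat this as a hint about where to look first, not as proof.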

I would like to help if I knew how to :slight_smile:

I tried to get the code running with -flto, but when I use “Link to binary” I still get errors (gcc pane):

The analysis is still very hard for me, and I have not found the underlying problem yet.
Without -flto, the generated code is the same for clang trunk and clang 19.

By analyzing the core_bench_list function I found the following call chain: crc16 → 2 × crcu8

When i compare the crcu8 function i get the following differences in the binary:

clang:
mov 22
xor 15
shr 15
tes 7
cmo 8
and 1
cmp 1
ret 1

sum 70 instructions

gcc:
and 16
mov 12
shr 9
or 8
sal 6
add 4
rol 3
xor 2
ret 1

sum 61 instructions

Unlike clang, gcc doesn’t call functions in crcu32 and crc16.
So the resulting binary is much larger for gcc.
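In case someone wants to reproduce the instruction counts above, this is roughly how such a histogram can be built. It is a sketch that assumes the input is the text output of `objdump -d` and that the function has a visible symbol; `mnemonic_histogram` is a name made up for illustration. Unlike the tables above, it keeps the full mnemonic instead of the first three characters.

```python
import re
from collections import Counter

# Function header, e.g. "0000000000001000 <crcu8>:"
FUNC_HEADER = re.compile(r"^[0-9a-f]+ <(.+)>:$")
# Instruction line: address, tab, raw bytes, tab, mnemonic
INSN = re.compile(r"^\s*[0-9a-f]+:\t[0-9a-f ]+\t(\S+)")


def mnemonic_histogram(disasm_text, function):
    """Count mnemonics inside one function of an objdump -d listing."""
    counts = Counter()
    inside = False
    for line in disasm_text.splitlines():
        header = FUNC_HEADER.match(line)
        if header:
            inside = header.group(1) == function
            continue
        if inside:
            m = INSN.match(line)
            if m:
                # Note: prefixes such as "lock" or "rep" are counted
                # as the mnemonic of their instruction.
                counts[m.group(1)] += 1
    return counts
```

`mnemonic_histogram(text, "crcu8").most_common()` gives a sorted list like the ones above. Keep in mind that a static instruction count says little about runtime on its own, since it ignores how often each instruction executes.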