Help needed for triaging code generation

Grenzschutzeinheit · June 25, 2025, 3:18pm

Hey,

I don’t really know how to proceed:

I have created the Issue: GCC 15 - LLVM Clang 20 Compiler Performance comparison · Issue #127069 · llvm/llvm-project · GitHub

Unfortunately, no one has been able to help me with these problems so far.
So for one sub-issue, the Quicksilver OpenMP HPC benchmark, I have now tried the following:

compiled Quicksilver with gcc and executed with perf
compiled Quicksilver with clang and executed with perf
tried to compare the top 3 functions with the highest runtime:

clang-trunk:

32% MCT_Nearest_Facet
22% NuclearData::getReactionCrossSection
15% macroscopicCrossSection

gcc-14:

28% NuclearData::getReactionCrossSection
23% macroscopicCrossSection
18% (anonymous namespace)::MC_Nearest_Facet MCT_Nearest_Facet_3D_G

I know I should have used clang 19 as the baseline, but I wanted to see whether clang trunk had already fixed the issues. It is still slower than gcc 14.

I am confused, because perf on my test PC (Core i3-6100) shows the following instruction sequence:

vpgatherqd
vpxor
vgatherqpd

which shouldn’t be generated because the CPU only has AVX2 while some of these instructions are AVX512?!

I tried to display these functions in Godbolt with different compilers and versions:

MCT_Nearest_Facet

NuclearData::getReactionCrossSection

MacroscopicCrossSection

What am I doing wrong that I can’t see clear differences in the generated code?
How can I compare the assembly with LTO enabled in Godbolt?
Any thoughts on how I can efficiently find the root cause?

DavidSpickett · June 25, 2025, 3:57pm

Performance is not my strong suit but…

Are you sure about this? I have basically no Intel experience so take this for what it’s worth, but AVX2 Instructions - x86 Assembly Language Reference Manual lists the ones you have shown as AVX2 (I know that’s not from Intel but idk what the official reference is).

It could be that they are also available in AVX512. If your CPU is running them, they must be allowed by something.
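One way to see what the CPU itself advertises is to read the feature flags that the Linux kernel exposes in /proc/cpuinfo. A minimal Python sketch, assuming a Linux machine; the helper names `parse_cpu_flags` and `simd_support` are made up for illustration:

```python
from pathlib import Path


def parse_cpu_flags(cpuinfo_text):
    """Return the feature flags of the first CPU found in /proc/cpuinfo text."""
    for line in cpuinfo_text.splitlines():
        key, _, value = line.partition(":")
        if key.strip() == "flags":
            return set(value.split())
    return set()


def simd_support(flags):
    """Report whether AVX2 and the AVX-512 foundation subset are advertised."""
    return {"avx2": "avx2" in flags, "avx512f": "avx512f" in flags}


if __name__ == "__main__":
    # Assumption: Linux. On other systems the file does not exist.
    cpuinfo = Path("/proc/cpuinfo")
    if cpuinfo.exists():
        print(simd_support(parse_cpu_flags(cpuinfo.read_text())))
```

If `avx512f` comes back false while the gather instructions still execute, that would be consistent with them belonging to AVX2.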

DavidSpickett · June 25, 2025, 4:08pm

I think comparing performance against clang-19 will be more informative, as the distribution of time spent in the gcc-compiled version may be so different that it’s hard to tackle all at once.

Though the top 3 functions are the same and together take the same proportion of time (69% in both profiles), if the overall runtime is different, maybe that breaks the correspondence there. Maybe not, but comparing to clang-19 seems ideal given that’s where we regressed from.

A clang 19 to 20 perf comparison is your next step I think.

If you add -flto and enable “Link to binary”, this does…something. I think you will need to add a main function that at least calls the function you’re interested in.

DavidSpickett · June 25, 2025, 4:13pm

In other words, if the top 3 functions are the same this tells you that the majority of the work no matter what compiler, happens there. The regression could still be elsewhere in the benchmark.

That top 3 would be the first candidates if you wanted to improve performance in general though.

mshockwave · June 25, 2025, 4:28pm

I’m not sure if you posted the wrong links, because while you’re asking for a comparison between GCC and Clang, the Godbolt links show a comparison between two different versions of Clang. Also, they’re using -march=znver5 while you’re targeting a Core i3.

You can start with the most obvious thing, like comparing the assembly code. I don’t know how to do that on Godbolt, but you can always use good ol’ vimdiff or something. Though in this particular case, the fact that MCT_Nearest_Facet only appears in Clang suggests that GCC might inline it into NuclearData::getReactionCrossSection, so the assembly diff might be noisy.
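A plain diff of two disassemblies tends to be dominated by differing addresses and instruction encodings. A small Python sketch that strips those columns before diffing, assuming `objdump -d` style lines of the form `address:<TAB>bytes<TAB>instruction`; the helper names and the "gcc"/"clang" labels are placeholders:

```python
import difflib
import re

# Matches "  401126:<TAB>55 48 89 e5<TAB>push   %rbp" as printed by objdump -d.
INSN_LINE = re.compile(r"^\s*[0-9a-f]+:\t[0-9a-f ]+\t(?P<insn>.+)$")
# Matches a function header such as "0000000000401126 <crcu8>:".
FUNC_LINE = re.compile(r"^[0-9a-f]+ <(?P<name>[^>]+)>:$")


def normalize(disassembly):
    """Keep only function names and instruction text, dropping addresses and bytes."""
    out = []
    for line in disassembly.splitlines():
        func = FUNC_LINE.match(line)
        if func:
            out.append("<" + func.group("name") + ">:")
            continue
        insn = INSN_LINE.match(line)
        if insn:
            # Collapse the column padding between mnemonic and operands.
            out.append(" ".join(insn.group("insn").split()))
    return out


def diff_disassemblies(a, b):
    """Return a unified diff of two normalized disassemblies as a list of lines."""
    return list(difflib.unified_diff(normalize(a), normalize(b),
                                     "gcc", "clang", lineterm=""))
```

Jump and call operands still contain absolute addresses, so some noise remains; the sketch only removes the leading address and byte columns.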

Another low-hanging fruit might be using source lines to correlate assembly code generated by different compilers: for instance, line X maps to 10 instructions in one binary but maps to 100 instructions in the other. I believe Godbolt has some feature like that, or you can just use objdump -S (assuming you build your binary with debug info).
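The source-line idea can be roughed out in Python. This sketch assumes the listings come from `objdump -d -l` (used here instead of `-S` because `-l` prints plain `file:line` markers that are easier to parse) on binaries built with debug info; the function names are hypothetical:

```python
import re
from collections import Counter

# With "objdump -d -l", a marker such as "/src/file.c:42" (optionally followed
# by " (discriminator N)") precedes the instructions attributed to that line.
LOCATION = re.compile(r"^(?P<file>\S+):(?P<line>\d+)(?: \(discriminator \d+\))?$")
# An instruction line: "  401126:<TAB>55<TAB>push   %rbp".
INSN_LINE = re.compile(r"^\s*[0-9a-f]+:\t[0-9a-f ]+\t\S")


def instructions_per_source_line(listing):
    """Count how many instructions follow each file:line marker."""
    counts = Counter()
    current = None
    for line in listing.splitlines():
        loc = LOCATION.match(line)
        if loc:
            current = (loc.group("file"), int(loc.group("line")))
        elif current is not None and INSN_LINE.match(line):
            counts[current] += 1
    return counts


def compare(gcc_listing, clang_listing, top=10):
    """Return (location, gcc_count, clang_count) rows, largest difference first."""
    gcc = instructions_per_source_line(gcc_listing)
    clang = instructions_per_source_line(clang_listing)
    rows = [(loc, gcc[loc], clang[loc]) for loc in set(gcc) | set(clang)]
    rows.sort(key=lambda row: abs(row[1] - row[2]), reverse=True)
    return rows[:top]
```

Instruction count is only a proxy for cost, so the rows with the largest differences are places to start reading, not a verdict.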

The general rule of thumb of performance engineering is that, for better or for worse, there is no fixed procedure. You can build a checklist but at the end of the day, you probably have to act adaptively.

DavidSpickett · June 25, 2025, 4:33pm

You mean that MCT_Nearest_Facet suggests a standalone function, but GCC’s (anonymous namespace)::MC_Nearest_Facet MCT_Nearest_Facet_3D_G suggests one function inlined into another?

I’m not sure how to interpret the name of the GCC function.

Grenzschutzeinheit · July 4, 2025, 3:18pm

Thanks for the help so far.
Sadly, I can’t reproduce the Phoronix results (Zen 5) for Quicksilver on my PC (Skylake-S).
The results for Clang 19, Clang trunk and gcc are very close here.
So I think the differences probably only occur on newer architectures.

But I can at least reproduce a notable performance difference for CoreMark:
gcc-14 scores 69K points, compared with 60K points for Clang 19 and 63K points for Clang trunk.

perf gcc14:

Samples: 293K of event ‘cpu-clock’, Event count (approx.): 73430750000
Overhead Command Shared Object Symbol
42,00% coremark.exe gc coremark.exe gcc14 [.] calc_func
38,16% coremark.exe gc coremark.exe gcc14 [.] core_bench_list
19,74% coremark.exe gc coremark.exe gcc14 [.] core_state_transition
0,09% coremark.exe gc coremark.exe gcc14 [.] main
0,00% coremark.exe gc [kernel.kallsyms] [k] update_blocked_averages
0,00% coremark.exe gc [kernel.kallsyms] [k] handle_softirqs
0,00% coremark.exe gc [kernel.kallsyms] [k] rebalance_domains
0,00% coremark.exe gc [kernel.kallsyms] [k] ___slab_alloc
0,00% coremark.exe gc [kernel.kallsyms] [k] __memcg_slab_free_hook
0,00% coremark.exe gc [kernel.kallsyms] [k] __mod_timer
0,00% coremark.exe gc [kernel.kallsyms] [k] _raw_spin_unlock_irq
0,00% coremark.exe gc [kernel.kallsyms] [k] _raw_spin_unlock_irqrestore
0,00% coremark.exe gc [kernel.kallsyms] [k] down_read_trylock
0,00% coremark.exe gc [kernel.kallsyms] [k] finish_task_switch.isra.0
0,00% coremark.exe gc [kernel.kallsyms] [k] load_balance
0,00% coremark.exe gc [kernel.kallsyms] [k] next_uptodate_folio
0,00% coremark.exe gc [kernel.kallsyms] [k] perf_event_mmap_output
0,00% coremark.exe gc [kernel.kallsyms] [k] update_sd_lb_stats.constprop.0
0,00% coremark.exe gc ld-linux-x86-64.so.2 [.] _dl_relocate_object
0,00% coremark.exe gc libc.so.6 [.] shmat

perf clang trunk:

Samples: 333K of event ‘cpu-clock’, Event count (approx.): 83330500000
Overhead Command Shared Object Symbol
31,19% coremark.exe cl coremark.exe clang21 [.] core_state_transition
30,97% coremark.exe cl coremark.exe clang21 [.] core_bench_list
23,28% coremark.exe cl coremark.exe clang21 [.] core_bench_matrix
8,29% coremark.exe cl coremark.exe clang21 [.] crcu16
6,25% coremark.exe cl coremark.exe clang21 [.] core_bench_state
0,00% coremark.exe cl coremark.exe clang21 [.] iterate
0,00% coremark.exe cl [kernel.kallsyms] [k] irqentry_exit_to_user_mode
0,00% coremark.exe cl [kernel.kallsyms] [k] process_csb
0,00% coremark.exe cl [kernel.kallsyms] [k] update_blocked_averages
0,00% coremark.exe cl [kernel.kallsyms] [k] __get_user_8
0,00% coremark.exe cl [kernel.kallsyms] [k] _raw_spin_unlock_irq
0,00% coremark.exe cl [kernel.kallsyms] [k] __pte_offset_map
0,00% coremark.exe cl [kernel.kallsyms] [k] do_user_addr_fault
0,00% coremark.exe cl [kernel.kallsyms] [k] handle_softirqs
0,00% coremark.exe cl [kernel.kallsyms] [k] load_balance
0,00% coremark.exe cl [kernel.kallsyms] [k] mas_leaf_max_gap
0,00% coremark.exe cl [kernel.kallsyms] [k] neigh_timer_handler
0,00% coremark.exe cl [kernel.kallsyms] [k] queue_work_on
0,00% coremark.exe cl [kernel.kallsyms] [k] run_rebalance_domains
0,00% coremark.exe cl [kernel.kallsyms] [k] tcp_orphan_update
0,00% coremark.exe cl [kernel.kallsyms] [k] up_write
0,00% coremark.exe cl [kernel.kallsyms] [k] update_sg_lb_stats
0,00% coremark.exe cl ld-linux-x86-64.so.2 [.] check_match

I would like to help if I knew how.
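Because the two runs collected different numbers of samples (293K versus 333K), the percentages in the two reports are shares of different totals. A rough Python sketch that turns a share into an estimated sample count, so that the same function can be compared across the two binaries. The helper name is hypothetical, the numbers are taken from the two reports above, the totals are only as precise as perf's rounded "K" figures, and the comparison assumes both runs did the same amount of work:

```python
def estimated_samples(overhead_percent, total_samples):
    """Convert a perf 'Overhead' percentage into an approximate sample count."""
    return overhead_percent / 100.0 * total_samples


# Totals as printed in the perf headers (rounded to thousands by perf).
GCC14_SAMPLES = 293_000
CLANG_TRUNK_SAMPLES = 333_000

# Overhead (gcc-14, clang trunk) of the functions named in both reports.
shared = {
    "core_bench_list": (38.16, 30.97),
    "core_state_transition": (19.74, 31.19),
}

for symbol, (gcc_pct, clang_pct) in shared.items():
    gcc = estimated_samples(gcc_pct, GCC14_SAMPLES)
    clang = estimated_samples(clang_pct, CLANG_TRUNK_SAMPLES)
    print(f"{symbol}: gcc-14 ~{gcc:.0f} samples, clang trunk ~{clang:.0f} samples")
```

Symbols that show up in only one report (calc_func for gcc; core_bench_matrix, crcu16 and core_bench_state for clang) cannot be lined up this way, which may itself hint at different inlining decisions.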

Grenzschutzeinheit · July 8, 2025, 9:20am

I tried to get the code running with -flto, but when I use “Link to binary” I still get errors (gcc pane):

Analyzing this to find the underlying problem is still very hard for me.
Without -flto, the code is the same for clang trunk and clang 19.

Grenzschutzeinheit · July 8, 2025, 10:49am

By analyzing the core_bench_list function I found the following call chain: crc16 → 2 × crcu8

When I compare the crcu8 function, I get the following instruction counts in the binary:

clang:
mov 22
xor 15
shr 15
tes 7
cmo 8
and 1
cmp 1
ret 1

sum 70 instructions

gcc:
and 16
mov 12
shr 9
or 8
sal 6
add 4
rol 3
xor 2
ret 1

sum 61 instructions

Unlike clang, gcc doesn’t call any functions in crcu32 and crc16.
So the resulting binary is much larger for gcc.
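Per-mnemonic counts like the ones above can be produced mechanically. A Python sketch, assuming the input is the `objdump -d` text for a single function and the same `address:<TAB>bytes<TAB>instruction` layout; `mnemonic_histogram` is a made-up name. Counting whole mnemonics instead of three-letter prefixes such as `tes` and `cmo` avoids merging different instructions that share a prefix:

```python
import re
from collections import Counter

# Captures the first token of the instruction column in objdump -d output.
INSN_LINE = re.compile(r"^\s*[0-9a-f]+:\t[0-9a-f ]+\t(?P<mnemonic>\S+)")


def mnemonic_histogram(disassembly):
    """Count full instruction mnemonics in objdump -d output."""
    matches = map(INSN_LINE.match, disassembly.splitlines())
    return Counter(m.group("mnemonic") for m in matches if m)


def print_histogram(disassembly):
    """Print mnemonics from most to least frequent, followed by the total."""
    histogram = mnemonic_histogram(disassembly)
    for mnemonic, count in histogram.most_common():
        print(mnemonic, count)
    print("sum", sum(histogram.values()), "instructions")
```

Instruction prefixes such as `lock` or `rep` are printed as the first token by objdump, so they would be counted as the mnemonic here.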
