I’m on an x86_64 machine with a Xeon processor, and I’m trying to generate assembly for a 3000×3000 integer matrix multiplication with clang and gcc, with AVX-512 enabled (512-bit vectors). Unfortunately, clang does not use ymm or zmm registers anywhere in the assembly, whereas the gcc-generated assembly uses more than three zmm and ymm registers, which I assume is why it performs better. I’m using clang 16.0.0 and gcc 12.2.0.
clang -O3 -mavx512f changes.c
gcc -O3 -mavx512f changes.c
Please correct me if I’m wrong.
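For reference, here is a minimal sketch of the kind of kernel being discussed. The actual changes.c is not shown in this thread, so the function name `matmul_int`, the row-major layout, the i-k-j loop order, and the use of `restrict` are all assumptions made for illustration. Whether a given compiler vectorizes it with zmm registers still depends on the version and flags in use.

```c
#include <stddef.h>

/* Hypothetical kernel (not the original changes.c):
   computes c = a * b for n x n row-major int matrices.
   restrict tells the compiler the three arrays do not alias. */
void matmul_int(size_t n, const int *restrict a,
                const int *restrict b, int *restrict c)
{
    for (size_t i = 0; i < n; ++i) {
        /* Clear row i of the result before accumulating into it. */
        for (size_t j = 0; j < n; ++j)
            c[i * n + j] = 0;

        for (size_t k = 0; k < n; ++k) {
            const int aik = a[i * n + k];
            /* Innermost loop walks b and c contiguously (unit stride),
               which is the access pattern auto-vectorizers handle best. */
            for (size_t j = 0; j < n; ++j)
                c[i * n + j] += aik * b[k * n + j];
        }
    }
}
```

A file containing only this function can be compiled with the two command lines above plus `-S` to get the assembly for comparison.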
Could you please include a Compiler Explorer link with an example?
Actually, I’m working on a local Intel cluster, and there clang does not generate zmm registers. On Compiler Explorer (godbolt), however, with the same architecture, compiler, and flags, clang uses multiple zmm registers (more than 15).
What could be the reasons for this difference?