Build is terribly slow on Arm

My OpenMP offloading app builds within a minute or so on
x86, but the same app takes longer than 5 minutes on Arm.
How do I investigate this further? Should I pass -v and post
the log here?

Both were Debug builds, but it took 20-25 minutes to finish on Arm.

[+David]

You are talking about compile time, right?

And the regression is caused by OpenMP (hence not present without -fopenmp), correct?

Yes, that's right.

I'm building my app on an HPE Apollo with CUDA Toolkit 11, which was
released just the other day.

You should not use Debug build to compare the performance of compilers.
As you're targeting different architectures, you might run into
different checks in the two backends. I believe it's possible this is
triggered by OpenMP code generation which might hit some corner cases.
This could also happen in Release builds, though it is probably less likely.

Also keep in mind that AArch64 (which you're probably talking about in
the context of HPC) now uses GlobalISel in some configurations. I don't
remember if it's with or without optimizations (or maybe even both
nowadays?), but you can try to deactivate this on the command line to
rule out this area.
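One way to rule areas in or out is to time the same compile under a few flag variants. The following is only a minimal sketch: it assumes a clang recent enough to accept -fno-global-isel (I believe older releases spelled it -fno-experimental-isel), and the script name, source file, and helper names are hypothetical.

```python
import subprocess
import sys
import time


def time_command(cmd):
    """Run cmd once and return (wall-clock seconds, exit code)."""
    start = time.perf_counter()
    result = subprocess.run(cmd, stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL)
    return time.perf_counter() - start, result.returncode


def compare_variants(base_cmd, variants):
    """Time base_cmd plus each extra-flag list; returns {label: (seconds, exit code)}."""
    return {label: time_command(base_cmd + extra) for label, extra in variants.items()}


if __name__ == "__main__" and len(sys.argv) > 1:
    # Hypothetical usage: python3 timeflags.py clang++ -O2 -c app.cpp -o /dev/null
    variants = {
        "no OpenMP": [],
        "OpenMP": ["-fopenmp"],
        # The flag spelling below is an assumption; check `clang --help` first.
        "OpenMP, GlobalISel off": ["-fopenmp", "-fno-global-isel"],
    }
    for label, (seconds, code) in compare_variants(sys.argv[1:], variants).items():
        print(f"{label:26s} {seconds:8.2f}s  exit={code}")
```

A single run per variant is noisy, so repeat it a few times before drawing conclusions, and add your offloading flags to the variants if the plain -fopenmp case does not reproduce the slowdown.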

Regards
Jonas

Why do you think I build in Debug? I try to avoid that at all times.

You wrote "They were both Debug build". Maybe I misunderstand?

Sorry, I take that back then. An x86 build finishes in just a few minutes, while on AArch64 it takes half an hour, hence my question.

The AArch64 LLVM build is much faster when no offloading options are explicitly set at CMake configuration time.

Also, since I’m using a local SSD, for now I suspect the offloading options rather than disk I/O.

Itaru could mean compiling his application with debug symbols (-O0 -g)
using a release build of clang (it's always good to mention which
version of clang you are using) or an optimized version of his
application using a CMAKE_BUILD_TYPE=Debug build of clang. Maybe he
could clarify?

If you are running Linux, a performance trace (flame graph / Linux
perf record) would be useful, especially if you are using a debug
build of clang.

Michael

Michael, David,

Sorry I wasn't clear; I am building LLVM on Arm with CUDA Toolkit 11 RC.

How do you compile LLVM (your cmake ... command line)?

What flags do you use to compile your application?

Michael


cmake -DCMAKE_BUILD_TYPE=Release -DLLVM_ENABLE_LLD=ON \
  -DCMAKE_INSTALL_PREFIX=$HOME/opt/clang/${now} -DCMAKE_C_COMPILER=clang \
  -DCMAKE_CXX_COMPILER=clang++ -DLLVM_TARGETS_TO_BUILD=all \
  -DLLVM_ENABLE_PROJECTS="openmp;clang;lld;libcxx;libcxxabi;libunwind" \
  -DCLANG_OPENMP_NVPTX_DEFAULT_ARCH=sm_70 \
  -DLIBOMPTARGET_NVPTX_COMPUTE_CAPABILITIES=70 /tmp/llvm-project/llvm

Ok, you are using a non-assert/release build of clang, therefore
Jonas' remark does not apply.

Since your original question was how to investigate further, we have
to find out where the time is spent.

1. Run clang -ftime-trace:
https://aras-p.info/blog/2019/01/16/time-trace-timeline-flame-chart-profiler-for-Clang/

2. Create a performance trace, e.g. with FlameGraph:
https://github.com/brendangregg/FlameGraph
Note that for this, a non-optimized build with debug symbols of clang
would be better, but Linux perf may still get useful information
without.
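For step 1, clang writes a JSON file in Chrome trace format next to the output object. As a rough sketch of how one might compare the x86 and Arm traces without opening a viewer, the script below sums the durations per event name; the function names are mine, and the field layout ("traceEvents", "ph", "dur" in microseconds) is what I understand the format to be.

```python
import json
from collections import defaultdict


def summarize_trace(trace, top=10):
    """Return the `top` event names by summed duration, as (name, milliseconds)."""
    totals = defaultdict(int)
    for event in trace.get("traceEvents", []):
        # "X" marks a complete event; its "dur" field is in microseconds.
        if event.get("ph") == "X" and "dur" in event:
            totals[event["name"]] += event["dur"]
    ranked = sorted(totals.items(), key=lambda item: item[1], reverse=True)
    return [(name, dur / 1000.0) for name, dur in ranked[:top]]


def summarize_file(path, top=10):
    """Load a -ftime-trace JSON file from `path` and summarize it."""
    with open(path) as handle:
        return summarize_trace(json.load(handle), top)
```

Nested events overlap in time, so the sums are a ranking of where to look rather than an exact accounting. Running summarize_file on the trace from each machine and comparing the top entries should show which phase grows on Arm.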

Michael

Michael, all,

The LLVM build together with clang, openmp, and the other projects
on Arm is now back to normal, i.e.
about 15 minutes or so (in my last couple of attempts). So I will
move on to evaluating app build performance.

Wait, what was your original test about? Aren't you interested in
getting the problem fixed?

Michael

I’ve been trying to build LLVM on Arm with CUDA Toolkit within a reasonable amount of time, that’s all.