Offloading application build with a Trunk Debug Clang is so slow

Hi,
I’ve been using a Trunk Debug Clang that can print out libomptarget debugging
information, but I noticed that in the last couple of days the build time has increased
significantly, to an hour or so, while a non-Debug Clang builds my app in a few
minutes.
Has anyone committed intrusive changes to the library recently?

Jon, do you know if something happened that could explain this?

Hey,

The patches that landed in the last few days are all moving existing functions behind a single call. The only vaguely interesting one for clang is D71580 in that it uses templates that dispatch to cuda’s min function, but even that seems unlikely to change compile time.

It’s 7k of pretty trivial C++, so taking over an hour to build doesn’t sound right. Have you tried rolling the deviceRTL back by a few days and building with your current debug toolchain? I’m wondering if trunk clang’s debug build has regressed in performance.

Thanks

Jon

The increase in app build time can be reproduced with a simple int main() {} program.

I thought the problem was with the compile time of deviceRTL.

If you’re compiling int main() {}, then there’s no OpenMP involved, so that seems out of scope for this list.

Sorry, of course it should include this:

#pragma omp target parallel for
for(;;) {}

My local Debug Clang is updated daily, so the toolchain is pretty much up to date.
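For readers who want something concrete to compile, a bounded variant of the reproducer described above might look like the sketch below. It is not the poster's actual mini.cpp: the function name sum_of_squares, the reduction, and the loop bounds are assumptions made for illustration, and a bounded loop is used because OpenMP worksharing loops are expected to be in canonical form.

```cpp
// Hypothetical stand-in for mini.cpp; names and loop body are illustrative only.
#include <cassert>

// Sums i*i for i in [0, n) inside an offloaded parallel loop.
// When compiled without -fopenmp the pragma is ignored and the loop
// simply runs serially on the host, producing the same result.
long sum_of_squares(int n) {
  assert(n >= 0);  // precondition: non-negative trip count
  long sum = 0;
#pragma omp target parallel for reduction(+ : sum) map(tofrom : sum)
  for (int i = 0; i < n; ++i) {
    sum += static_cast<long>(i) * i;
  }
  return sum;
}
```

Calling this function from main and compiling with the flags used elsewhere in this thread (-g -fopenmp -fopenmp-targets=nvptx64) should be enough to exercise the device compilation path whose build time is being discussed.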

Ah, OK. In that case I still suspect a regression in clang, as opposed to a change in the library. Does the compile time remain very slow if you roll deviceRTL back by a few days?

The Clang at this commit:
commit 1c49553c19a7044fbbf4528b732926f19f210e54
Author: Bjorn Pettersson <bjorn.a.pettersson@ericsson.com>

[kitayama1@juron1-adm pcp0151]$ time clang++ -g -fopenmp -fopenmp-targets=nvptx64 mini.cpp

real 0m2.344s
user 0m2.231s
sys 0m0.049s
Release Clang 9:
[kitayama1@juron1-adm pcp0151]$ time clang++ -g -fopenmp -fopenmp-targets=nvptx64 mini.cpp

real 0m0.693s
user 0m0.533s
sys 0m0.047s

CMake was run as follows:

cmake -GNinja -DCMAKE_BUILD_TYPE=Debug -DLIBOMPTARGET_ENABLE_DEBUG=True -DCMAKE_INSTALL_PREFIX=$SCRATCH/pcp0151/opt/clang/${today} -DCMAKE_C_COMPILER=clang -DCMAKE_CXX_COMPILER=clang++ -DGCC_INSTALL_PREFIX=/gpfs/software/opt/gcc/7.2.0/ -DLLVM_ENABLE_PROJECTS="clang;openmp;libcxx;libcxxabi;lld" -DCLANG_OPENMP_NVPTX_DEFAULT_ARCH=sm_60 -DLIBOMPTARGET_NVPTX_COMPUTE_CAPABILITIES=60 -DLLVM_TARGETS_TO_BUILD="PowerPC;NVPTX" $SCRATCH/pcp0151/projects/llvm-project/llvm

A trunk Clang built with -DCMAKE_BUILD_TYPE=Release and -DLIBOMPTARGET_ENABLE_DEBUG=False
builds my application in a minute or so.

On POWER8, my app takes 40 minutes to build with the Debug Clang (about 40 times longer than
with the Release Clang), but on an x86 host it takes only about 10 minutes, which is bearable.

Could this be PowerPC-specific, and is that why most people aren’t aware of the slowness?

That could most certainly be. Alexey should build on Power regularly though.

Try using the gold linker instead of the default ld.

Best regards,
Alexey Bataev

The result isn’t very exciting.
$ time clang++ -fuse-ld=gold -g -fopenmp -fopenmp-targets=nvptx64 test.cpp
ptxas /tmp/test-e95684.s, line 2715; warning : Instruction 'vote' without '.sync' is deprecated since PTX ISA version 6.0 and will be discontinued in a future PTX ISA version

real 0m19.138s
user 0m18.377s
sys 0m0.375s
[kitayama1@juron1-adm pcp0151]$ time clang++ -fuse-ld=ld -g -fopenmp -fopenmp-targets=nvptx64 test.cpp
ptxas /tmp/test-498b51.s, line 2715; warning : Instruction 'vote' without '.sync' is deprecated since PTX ISA version 6.0 and will be discontinued in a future PTX ISA version

real 0m19.373s
user 0m18.677s
sys 0m0.477s

Using a trunk Clang built with -DCMAKE_BUILD_TYPE=Release, and omitting
-DLIBOMPTARGET_ENABLE_DEBUG=True entirely, a build of my app goes back to taking only a minute or two.

Experts, should I start a git bisect on the trunk tree of llvm-project during the holidays?

On x86 I see the same significant slowdown when building an app with Debug Clang.

Does make check-openmp exercise any part of the offloading features?

check-libomptarget and check-libomptarget-nvptx do.

Best regards,
Alexey Bataev

I’ve collected Clang build performance data with -ftime-trace, as shown below:

clang++ -ftime-trace -g -fopenmp -fopenmp-targets=nvptx64 mini.cpp

I’m uploading the corresponding .json files so you can review why the latest Clang is slow.

mini-b098da.json (2.22 KB)

mini-30d1d8.json (12.2 KB)

a-2c5875.json (1.89 KB)

mini-0f6a0b.json (4.22 KB)

Hi all,
From discussions I had on the llvm-dev mailing list, I realized that I really should not have used a Debug Clang at all for my
app build. Although trunk is currently broken for my application of interest, I can go back to mid-October trunk and
build my app with a Clang that is a Release build but is also configured to print informative messages from libomptarget.

Alexey, I was saying that initializing the offloading runtime at run time was failing with NEST.