I am using Clang on macOS / arm64. I encountered a situation where the same code runs three times slower when compiled for arm64 and run natively than when compiled for x86_64 and run under Rosetta. Furthermore, I have a second version of the same code which should in principle be slightly slower than the original version; however, it is in fact 2.4x faster, but only on arm64. On x86_64, as well as on 32-bit Arm (Raspberry Pi), it is slower, as expected.
No special flags, and definitely no platform-specific flags such as -march. Regarding optimization flags, the timing of the affected function is exactly the same with -O1, -O2 or -O3, or with no optimization flag at all. It's also exactly the same with or without LTO. This is not true for the timing of the alternative function, measured within the same benchmark program. That one is very slow with no optimization flag, gets much faster with -O1, faster again with -O2, and a bit faster still with LTO. So I know that the optimization flags are taking effect.
I wanted to point out that for Apple Clang 14, -fglobal-isel is the default for arm64. This changed between Apple Clang 13 and 14. With -fno-global-isel on Apple Clang 14 you can select the traditional backend for arm64. x86 always uses -fno-global-isel.
Thanks for the hint. Specifying either -fno-global-isel or -fglobal-isel for Apple Clang makes no difference. -fglobal-isel warns that “-fglobal-isel support is incomplete for this architecture at the current optimization level”.
I would not have expected a difference since, as I said above, performance is equally bad (the same as with no optimization at all) with all of the following Clang versions, and you mentioned that some of these don't support this feature:
clang version 10.0.1
clang version 11.1.0
clang version 15.0.1
Apple clang version 14.0.0 (clang-1400.0.29.102)
Depending on what exactly the issue is we do not necessarily need a minimal example. It would already help if you could isolate the slowdown (e.g. using Instruments for profiling) to a function/loop and share the source code to start with. Given that the slowdown is huge, hopefully that shouldn’t be too difficult.
If you are able to gather this information, a bug report would be very much appreciated.
All the code is open, part of the igraph library, available on GitHub. I'll give a summary below.
The slow function is igraph_degree(), which retrieves the degrees of some vertices of a graph. There is also an igraph_degree_1() function which retrieves the degree of a single vertex only. Using the former to compute the degrees of all vertices is about three times slower than using the latter.
0.445 is the timing of igraph_degree() and 0.188 is the timing of igraph_degree_1() called in a loop to compute the same thing.
This is the relevant part of the source code of igraph_degree_1():
Note that we only care about the loops=true input (so the rest of the function is not run), and mode=IGRAPH_ALL, which means that both the mode & IGRAPH_OUT and mode & IGRAPH_IN branches are taken. The benchmark simply runs this function in a loop for vid values from 0 to n-1, where n is the number of graph vertices:
Now let us look at igraph_degree():
Notice that igraph_degree() is basically identical to calling igraph_degree_1() in a loop, except that some checks which don't need to be repeated are moved outside the loop. However, this time the for loop makes use of igraph's "vertex iterators", which may or may not be the source of the difference. Under the hood, the vertex iterator still just counts up to n. You can see the definitions of macros like IGRAPH_VIT_NEXT() here: igraph/igraph_iterators.h at master · igraph/igraph · GitHub
Instructions for building and running the benchmark:
Run tests/benchmark_igraph_degree. Only the first two timings are relevant.
If you want to run the same benchmark under Rosetta, you need to configure igraph so that it has no external dependencies. It's easiest to run ccmake . in the build directory, toggle IGRAPH_GRAPH_SUPPORT to OFF, and toggle all IGRAPH_USE_INTERNAL_... flags to ON. Press c and then g to re-configure the project. Then run cmake .. -DCMAKE_OSX_ARCHITECTURES=x86_64, and re-build the benchmarks.
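Assuming a build directory inside the source tree, the steps above amount to something like this (the directory layout and the build target name are assumptions; adjust to your setup):

```shell
# In the igraph build directory: disable external dependencies so that
# the same source configures cleanly for both architectures.
ccmake .   # toggle IGRAPH_GRAPH_SUPPORT to OFF and all
           # IGRAPH_USE_INTERNAL_... flags to ON, then press c, then g

# Re-configure for x86_64 so the binaries run under Rosetta:
cmake .. -DCMAKE_OSX_ARCHITECTURES=x86_64

# Re-build the benchmark:
cmake --build . --target benchmark_igraph_degree
```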
With Rosetta the timings look like this, on the same machine.
I looked at the program with the Instruments profiler, but I did not learn anything interesting. The performance-critical part of the code does not contain any function calls, so Instruments can’t give me a breakdown.
Thanks for checking. So, just to make sure I understand correctly: all the time is spent in igraph_degree and igraph_degree_1?
If that's the case, would it be possible to share the LLVM IR before optimizations for the file containing those functions (just add -S -emit-llvm -mllvm -disable-llvm-optzns to the compiler invocation for that file), and also the generated assembly (-S option)?
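Concretely, that would look something like the following, assuming the relevant functions live in a single source file (the file name here is just a placeholder):

```shell
# Emit the unoptimized LLVM IR for the file containing igraph_degree():
clang -S -emit-llvm -mllvm -disable-llvm-optzns -o degree.ll path/to/degree_source.c

# Emit the generated assembly at the optimization level you benchmark with:
clang -O2 -S -o degree.s path/to/degree_source.c
```

Include any -I include paths and defines from the normal build so the file compiles stand-alone.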
Thanks for checking; it looks like this only gives very modest improvements and is not the main issue.