I am using Clang on macOS / arm64. I encountered a situation where the same code runs three times slower when compiled for arm64 and run natively than when compiled for x86_64 and run under Rosetta. Furthermore, I have a second version of the same code which should in principle be slightly slower than the original version; however, it is in fact 2.4x faster, but only on arm64. On x86_64, as well as on 32-bit Arm (Raspberry Pi), it is slower, as expected.
No special flags, and definitely no platform-specific flags such as -march. Regarding optimization flags, the timing of the affected function is exactly the same with -O1, -O2 or -O3, or with no optimization flag at all. It's also exactly the same with or without LTO. This is not true for the timing of the alternative function, measured within the same benchmark program. That one is very slow with no optimization flag, gets much faster with -O1, faster again with -O2, and a bit faster still with LTO. So I know that the optimization flags are taking effect.
I wanted to point out that for Apple Clang 14, -fglobal-isel is the default for arm64. This changed between Apple Clang 13 and 14. With -fno-global-isel on Apple Clang 14 you can select the traditional backend for arm64. x86 always uses -fno-global-isel.
Thanks for the hint. Specifying either -fno-global-isel or -fglobal-isel for Apple Clang makes no difference. -fglobal-isel warns that “-fglobal-isel support is incomplete for this architecture at the current optimization level”.
I would not have expected a difference since, as I said above, performance is equally bad (the same as with no optimization at all) with all of the following Clang versions, and you mentioned that some of these don't support this feature:
clang version 10.0.1
clang version 11.1.0
clang version 15.0.1
Apple clang version 14.0.0 (clang-1400.0.29.102)
Depending on what exactly the issue is we do not necessarily need a minimal example. It would already help if you could isolate the slowdown (e.g. using Instruments for profiling) to a function/loop and share the source code to start with. Given that the slowdown is huge, hopefully that shouldn’t be too difficult.
If you are able to gather this information, a bug report would be very much appreciated.
All the code is open, part of the igraph library, available on GitHub. I'll give a summary below.
The slow function is igraph_degree(), which retrieves the degrees of some vertices of a graph. There is also an igraph_degree_1() function which retrieves the degree of a single vertex only. Using the former to compute the degrees of all vertices is about three times slower than using the latter.
0.445 is the timing of igraph_degree() and 0.188 is the timing of igraph_degree_1() called in a loop to compute the same thing.
This is the relevant part of the source code of igraph_degree_1():
Note that we only care about the loops=true input (so the rest of the function is not run), and mode=IGRAPH_ALL, which means that both the mode & IGRAPH_OUT and mode & IGRAPH_IN branches are taken. The benchmark simply runs this function in a loop for vid values from 0 to n-1, where n is the number of graph vertices:
Now let us look at igraph_degree():
Notice that igraph_degree() is basically identical to calling igraph_degree_1() in a loop, except that some checks which don't need to be repeated are moved outside the loop. However, this time the for loop makes use of igraph's "vertex iterators", which may or may not be the source of the difference. Under the hood, the vertex iterator still just counts up to n. You can see the definitions of macros like IGRAPH_VIT_NEXT() here: igraph/igraph_iterators.h at master · igraph/igraph · GitHub
Instructions for building and running the benchmark:
Run tests/benchmark_igraph_degree. Only the first two timings are relevant.
If you want to run the same benchmark under Rosetta, you need to configure igraph so that it has no external dependencies. It's easiest to run ccmake . in the build directory, toggle IGRAPH_GRAPH_SUPPORT to OFF, and toggle all IGRAPH_USE_INTERNAL_... flags to ON. Press c and then g to re-configure the project. Then run cmake .. -DCMAKE_OSX_ARCHITECTURES=x86_64, and re-build the benchmarks.
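Assuming a build directory inside the source tree, the steps above amount to something like this (the directory layout and the build target name are assumptions; adjust to your setup):

```shell
# In the igraph build directory: disable external dependencies so that
# the same source configures cleanly for both architectures.
ccmake .   # toggle IGRAPH_GRAPH_SUPPORT to OFF and all
           # IGRAPH_USE_INTERNAL_... flags to ON, then press c, then g

# Re-configure for x86_64 so the binaries run under Rosetta:
cmake .. -DCMAKE_OSX_ARCHITECTURES=x86_64

# Re-build the benchmark:
cmake --build . --target benchmark_igraph_degree
```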
With Rosetta the timings look like this, on the same machine.
I looked at the program with the Instruments profiler, but I did not learn anything interesting. The performance-critical part of the code does not contain any function calls, so Instruments can’t give me a breakdown.
Thanks for checking. So, just to make sure I understand correctly: all the time is spent in igraph_degree and igraph_degree_1?
If that's the case, would it be possible to share the LLVM IR before optimizations for the file containing those functions (just add -S -emit-llvm -mllvm -disable-llvm-optzns to the compiler invocation for that file), and also the generated assembly (-S option)?
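Concretely, that would look something like the following, assuming the relevant functions live in a single source file (the file name here is just a placeholder):

```shell
# Emit the unoptimized LLVM IR for the file containing igraph_degree():
clang -S -emit-llvm -mllvm -disable-llvm-optzns -o degree.ll path/to/degree_source.c

# Emit the generated assembly at the optimization level you benchmark with:
clang -O2 -S -o degree.s path/to/degree_source.c
```

Include any -I include paths and defines from the normal build so the file compiles stand-alone.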
Thanks for checking; it looks like this only gives very modest improvements and is not the main issue.