I’m currently debugging a performance issue in GraalVM that seems to be caused by a function being compiled to bitcode that’s far bigger and more complicated than I would expect.
The function I’m looking at is yaml_parser_fetch_more_tokens in scanner.c in the oracle/truffleruby repository on GitHub.
To reproduce the issue, you only need yaml_private.h from that directory in addition to scanner.c. Compile with clang -S -emit-llvm -I. -O1 scanner.c, then search the output for define .*yaml_parser_fetch_more_tokens (the -O level doesn’t really matter, as long as it’s not -O0).
The resulting function has almost 3500 bitcode instructions. The first basic block contains 60 allocas, followed by a mix of about 700 getelementptr instructions. Many of them are used later, all over the rest of the function. As far as I can tell, the bitcode is correct, just a lot bigger than expected.
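In case it helps anyone reproduce the counts, this is roughly how I tally the instructions inside one function of an .ll file. The tiny module below is just a stand-in so the snippet is self-contained; point the same awk at the real scanner.ll to get the numbers above:

```shell
# Stand-in .ll module (the real one is the clang output for scanner.c).
cat > demo.ll <<'EOF'
define i32 @yaml_parser_fetch_more_tokens(i32 %x) {
entry:
  %a = alloca i32
  store i32 %x, ptr %a
  %v = load i32, ptr %a
  ret i32 %v
}
EOF
# Count instruction lines (indented by two spaces) between the function's
# "define" line and its closing "}".
awk '/^define .*@yaml_parser_fetch_more_tokens/ {f=1; next}
     f && /^}/  {f=0}
     f && /^  / {c++}
     END {print c}' demo.ll
```

On the demo module this prints 4; on the real output it's where the "almost 3500" figure comes from.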
The same function compiled with -flegacy-pass-manager -O1 looks totally fine, with just ~40 bitcode instructions. Even with -flegacy-pass-manager -O3 it’s only slightly bigger, at ~70 instructions, so it can’t just be that the new pass manager does some inlining at -O1.
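For comparing the two pass-manager outputs, I also find a per-opcode tally useful (e.g. to spot the 60 alloca / 700 getelementptr skew). Again a stand-in module with a hypothetical function @f, just to show the pipeline; run the same sed/grep over the real .ll files from both configurations:

```shell
# Stand-in module; in practice this would be the .ll from either
# pass-manager configuration.
cat > pm_demo.ll <<'EOF'
define i32 @f(ptr %p) {
  %a = alloca i32
  %b = alloca i32
  %g1 = getelementptr i32, ptr %p, i64 1
  %g2 = getelementptr i32, ptr %p, i64 2
  %v = load i32, ptr %g1
  ret i32 %v
}
EOF
# Extract one function's body, then count occurrences of the opcodes
# of interest.
sed -n '/^define .*@f(/,/^}/p' pm_demo.ll \
  | grep -oE ' (alloca|getelementptr) ' \
  | sort | uniq -c
```

On the demo this reports 2 of each; on the real new-pass-manager output the imbalance between the two configurations shows up immediately.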
I’m not sure if I’m missing something, but keeping hundreds of SSA values alive across the whole function doesn’t seem right. Is this the intended behavior? If so, how does the code generator deal with this? Is there maybe some other transformation still happening in the backend?
Or is this function hitting some corner case, and there is maybe some optimization running wild?