Scaling to many basic blocks

How well does LLVM scale to many basic blocks? Say you have a single function consisting of a million basic blocks, each with a few tens of instructions, and assume the whole thing isn’t trivially repetitive, so the number of simultaneously live variables and so on is large. If you feed that through the optimisers into the backend code generator, will it work okay, will it take a very long time, or will it run into a wall of “look, all compilers are designed on the assumption that nobody writes a million lines of code in a single function”? (Not that I’m planning to write such a thing by hand; I’m thinking about certain scenarios in automatic code generation.)
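(For concreteness, the kind of generator I have in mind looks roughly like this - a minimal sketch against the LLVM C++ API, where the function name, block count and instruction mix are all placeholders:)

```cpp
#include "llvm/IR/IRBuilder.h"
#include "llvm/IR/LLVMContext.h"
#include "llvm/IR/Module.h"
#include "llvm/IR/Verifier.h"
#include "llvm/Support/raw_ostream.h"

using namespace llvm;

int main() {
  LLVMContext Ctx;
  Module M("huge", Ctx);
  IRBuilder<> B(Ctx);

  auto *I32 = Type::getInt32Ty(Ctx);
  auto *FTy = FunctionType::get(I32, {I32}, /*isVarArg=*/false);
  Function *F = Function::Create(FTy, Function::ExternalLinkage, "huge_fn", M);

  B.SetInsertPoint(BasicBlock::Create(Ctx, "entry", F));
  Value *Acc = F->getArg(0);

  const unsigned NumBlocks = 1000000; // the scenario in question
  for (unsigned i = 0; i != NumBlocks; ++i) {
    // A few instructions per block, each feeding the next block so the
    // code isn't trivially repetitive and values stay live across blocks.
    Value *V = B.CreateAdd(Acc, B.getInt32(i * 2654435761u), "v");
    Acc = B.CreateXor(V, B.CreateMul(V, B.getInt32(i | 1)), "acc");
    BasicBlock *Next = BasicBlock::Create(Ctx, "bb", F);
    B.CreateBr(Next);
    B.SetInsertPoint(Next);
  }
  B.CreateRet(Acc);

  // Feeding M through the optimizers and codegen from here is exactly
  // where the scaling question starts.
  verifyModule(M, &errs());
  return 0;
}
```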

Hi,

Several passes would have trouble with such code (namely GVN and JumpThreading). We recently exposed similar issues when a new, more aggressive unrolling heuristic was implemented: we unrolled big loops expecting that the resulting linear code would then be optimized, and it was, but compile time regressed significantly. Some of these issues have probably been fixed since then - I haven’t checked.
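This failure mode is easy to reproduce from source, too: forcing full unrolling of a big loop turns a small function into exactly this kind of huge straight-line region. A hedged example - whether the unroller actually honors the pragma at this trip count depends on its thresholds:

```cpp
// A small source loop that, when fully unrolled, becomes one large
// straight-line region for the scalar optimizers to chew on.
int sum_table(const int *t) {
  int s = 0;
#pragma clang loop unroll(full)
  for (int i = 0; i < 65536; ++i)
    s += t[i] * (i | 1);
  return s;
}
```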

Thanks,
Michael

A somewhat more general question:

What characteristics of generated IR contribute to poor optimization/code generation times?

I recently found a pathological case in my own generated code - large data tables - that caused poor performance in the IR verifier.

In general, I am not happy with the performance of LLVM on my code, so I know I need to do better. But how? I have a few ideas, but trying any of them could be time-consuming, with no guaranteed return at the end of it.

Currently, I don’t emit TBAA metadata. I have a hunch that emitting it might speed up compilation, since alias analysis would have less work to do, but I don’t know that for sure, nor how large the benefit would be. I need to do this in any case, but if my users complain of long compile times, I’d like to know what to work on first.
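(For reference, my understanding is that the minimal shape of TBAA emission from a frontend looks roughly like this. MDBuilder is the real helper; the type hierarchy, names and the surrounding function are invented for illustration:)

```cpp
#include "llvm/IR/IRBuilder.h"
#include "llvm/IR/MDBuilder.h"

using namespace llvm;

// Sketch: attach a TBAA tag to a load. A real frontend would build the
// type hierarchy once per module, not per load.
LoadInst *emitIntLoadWithTBAA(IRBuilder<> &B, Value *Ptr) {
  LLVMContext &Ctx = B.getContext();
  MDBuilder MDB(Ctx);

  // Tiny invented TBAA hierarchy: root -> char -> int.
  MDNode *Root = MDB.createTBAARoot("my-frontend TBAA");
  MDNode *Char = MDB.createTBAAScalarTypeNode("omnipotent char", Root);
  MDNode *Int  = MDB.createTBAAScalarTypeNode("int", Char);
  MDNode *Tag  = MDB.createTBAAStructTagNode(Int, Int, /*Offset=*/0);

  // Tag the load as "accesses an int" so AA can disambiguate cheaply.
  LoadInst *LI = B.CreateLoad(B.getInt32Ty(), Ptr, "x");
  LI->setMetadata(LLVMContext::MD_tbaa, Tag);
  return LI;
}
```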

Can you just choose not to run those particular passes? I suppose the big problem would be if code generation and related stages such as instruction scheduling and register allocation also have trouble - is that likely to be the case?
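(What I have in mind is building a custom pipeline that simply leaves the offenders out. A minimal sketch with the current pass manager - the pass list is illustrative, not a recommendation:)

```cpp
#include "llvm/IR/Module.h"
#include "llvm/Passes/PassBuilder.h"
#include "llvm/Support/Error.h"
#include "llvm/Support/raw_ostream.h"

using namespace llvm;

// Run a hand-picked pipeline that omits gvn and jump-threading.
void runReducedPipeline(Module &M) {
  LoopAnalysisManager LAM;
  FunctionAnalysisManager FAM;
  CGSCCAnalysisManager CGAM;
  ModuleAnalysisManager MAM;

  PassBuilder PB;
  PB.registerModuleAnalyses(MAM);
  PB.registerCGSCCAnalyses(CGAM);
  PB.registerFunctionAnalyses(FAM);
  PB.registerLoopAnalyses(LAM);
  PB.crossRegisterProxies(LAM, FAM, CGAM, MAM);

  ModulePassManager MPM;
  // Equivalent to: opt -passes='function(sroa,early-cse,instcombine,simplifycfg)'
  if (auto Err = PB.parsePassPipeline(
          MPM, "function(sroa,early-cse,instcombine,simplifycfg)")) {
    errs() << toString(std::move(Err)) << "\n";
    return;
  }
  MPM.run(M, MAM);
}
```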

I seem to recall the sanitizers hit some scaling issues with lots of small
basic blocks (with lots of variables) in the register allocator.

- David

You can expect the register allocator to be a hotspot when compiling such functions, where I’m assuming your basic blocks also end up holding millions of variables. I haven’t tried this in LLVM, but in other compilers I’ve used it has certainly been an issue.

Since you mentioned automatic code generation, which is also how I ended up[1] in a similar situation, I can tell you that the 10,000-local-variables pattern (or whatever you’re attempting) can usually be avoided by making those individual variables a single array instead; see the sketch after the footnote.

[1] It was a protocol-buffers-like code generation scenario: a schema with thousands of fields (unrelated to each other), and I had to more or less invent this arraying strategy to keep compile times from skyrocketing. That said, the compiler at the time used an O(n²) register allocation algorithm.
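To make that concrete, here is a minimal sketch of the rewrite; kNumFields, decode() and every name here are invented stand-ins for the generated per-field code:

```cpp
// Minimal sketch of the "arraying" rewrite.
constexpr int kNumFields = 10000;
int decode(int i); // placeholder for whatever computes each field

// Before: the generator declared one local per schema field,
//   int field0 = decode(0); int field1 = decode(1); /* ...thousands more */
// so the register allocator had to track thousands of live values.

// After: one array. Accesses become indexed loads/stores, only a few
// values are live at any point, and allocation time stays sane.
long sum_all_fields() {
  int fields[kNumFields];
  for (int i = 0; i < kNumFields; ++i)
    fields[i] = decode(i);
  long s = 0;
  for (int i = 0; i < kNumFields; ++i)
    s += fields[i];
  return s;
}
```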

GVN and some other optimizers have really bad behavior at this number of basic blocks. Most algorithms are linear, but some are not and can't be made linear. Those mostly have cutoffs, but sometimes cutoffs get missed, etc. :)

Otherwise, most algorithms are either linear in the number of instructions/BBs, or can be made linear if they aren't.

I don’t believe we have any documentation on this topic at this point. The performance tips for frontend authors page has some information, but not much. In general, the farther something is from reasonable C/C++ code (including machine-generated code from various common frameworks), the less likely all the issues are to have been worked out. If you wanted to start some documentation, that would be a very constructive way to gather information that will help you and others.

This was a performance bug. They exist. Please file them so they can get fixed. In particular, you may find various passes which need cutoffs added (due to the super-linear scaling inherent in them). Feel free to file bugs - preferably with test cases! - for these. Patches are even better, of course. :)

Honestly, you’re the only person who can do these experiments. You can try talking to others in the community, but if you’re not willing to run experiments, your results will be poor. Having data to share greatly increases the likelihood that someone else will be interested enough to help. Can you share your current numbers? i.e. why are you “not happy”?

I really doubt the missing TBAA metadata is your problem. In particular, my general experience has been that improving compile time happens mostly by making the optimizer more effective (so that other passes see less code), not by crippling it.
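On gathering those numbers: clang’s -ftime-report and opt’s -time-passes print per-pass breakdowns, and if you drive LLVM as a library you can wrap the pipeline in an llvm::Timer. A minimal sketch - the function name is invented, and MPM, M and MAM are assumed to come from your existing pass-manager setup:

```cpp
#include "llvm/IR/Module.h"
#include "llvm/IR/PassManager.h"
#include "llvm/Support/Timer.h"
#include "llvm/Support/raw_ostream.h"

using namespace llvm;

// Wrap the whole optimization pipeline in a timer so a bug report can
// carry concrete numbers.
void runPipelineTimed(ModulePassManager &MPM, Module &M,
                      ModuleAnalysisManager &MAM) {
  TimerGroup TG("frontend", "my frontend timers");
  Timer T("opt-pipeline", "optimization pipeline", TG);
  T.startTimer();
  MPM.run(M, MAM);
  T.stopTimer();
  TG.print(errs()); // prints wall/user time for the run
}
```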

One problematic pass right now is correlated value propagation. The sad part is that, at least from clang, it can't be disabled short of -O0 :( I have a patch in my tree to at least be able to disable it with -mllvm magic, but Chandler didn't like it for in-tree, for the obvious reasons.

It would be really nice if someone could (finally) add proper memory and compute-time bounding to that pass.
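For reference, the "-mllvm magic" usually amounts to a cl::opt threshold checked inside the pass. A minimal sketch - the flag name and limit are invented here, not an existing CVP option:

```cpp
#include "llvm/IR/Function.h"
#include "llvm/Support/CommandLine.h"

using namespace llvm;

// Hypothetical -mllvm flag (-cvp-block-limit=N). Passes typically declare
// a threshold like this and bail out early instead of running to
// completion on pathological input.
static cl::opt<unsigned> CVPBlockLimit(
    "cvp-block-limit", cl::init(10000), cl::Hidden,
    cl::desc("Skip correlated value propagation on functions with more "
             "than this many basic blocks"));

static bool shouldSkipFunction(const Function &F) {
  return F.size() > CVPBlockLimit; // F.size() == number of basic blocks
}
```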

Joerg

Is there already a PR open on https://llvm.org/bugs/ ? Ideally with a pre-processed source to feed to clang to reproduce.

Thanks,

Mehdi

Sure, https://llvm.org/bugs/show_bug.cgi?id=10584

Joerg

When my compiler generated LOAD/STORE instead of MEMCPY for large objects, it spent a large amount of time deciding how to schedule the massive number of resulting instructions. It doesn’t take a huge number of instructions, either: a couple of functions of 16,000 instructions each take tens of seconds to compile, whereas similarly sized code with “more dependencies” compiles much faster in comparison.
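The implied fix is to emit a single memcpy intrinsic above some size threshold rather than expanding the copy up front. A minimal sketch - emitCopy is a hypothetical helper, and the threshold and alignments are invented:

```cpp
#include "llvm/IR/IRBuilder.h"

using namespace llvm;

// Copy a large object. Emitting one llvm.memcpy leaves the expansion
// decision to the backend; expanding to thousands of scalar load/store
// pairs up front makes the scheduler do super-linear work.
void emitCopy(IRBuilder<> &B, Value *Dst, Value *Src, uint64_t Bytes) {
  const uint64_t kExpandLimit = 64; // invented threshold
  if (Bytes > kExpandLimit) {
    B.CreateMemCpy(Dst, MaybeAlign(8), Src, MaybeAlign(8), Bytes);
    return;
  }
  // Small copies: element-wise load/store is fine.
  Type *I8 = B.getInt8Ty();
  for (uint64_t i = 0; i != Bytes; ++i) {
    Value *SrcPtr = B.CreateConstInBoundsGEP1_64(I8, Src, i);
    Value *DstPtr = B.CreateConstInBoundsGEP1_64(I8, Dst, i);
    B.CreateStore(B.CreateLoad(I8, SrcPtr), DstPtr);
  }
}
```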