[GSOC] The 1001 Thresholds in LLVM — midterm update

Motivation: LLVM has many thresholds and flags that control the compiler's behaviour. However, it is unclear whether all of these thresholds are useful, whether their values are reasonable, and what impact they really have. This work explores these thresholds: when they are hit, what it means when they are hit, how we should select their values, and whether we need different profiles ("profiles" here refers to metrics like compile time, size of the generated program, or any statistic that LLVM emits, such as "Number of stack-move optimizations performed").

Initial Results: We have conducted a large study on 10,000 bitcode files* and a small study on 100 bitcode files*. Both show that changing these knob values can positively affect various statistics emitted by LLVM. We measure compile time and bitcode size along with the emitted stats and present them as interactive graphs here. For instance, here are the graphs for jump-threading-threshold:

Here we can observe that modifying the knob value leads to a considerable increase in several stat values, with varying changes in compile time and code size. Beyond that, the small study also shows some very interesting patterns: for some knobs there are values that not only increase the emitted stat values but also give a performance boost. For example, here are the graphs for inline-memaccess-cost:

We observe that as we increase the knob value, not only do most of the stats increase (one by 720%), but the performance parameters, namely compile time and bitcode size, also show a general decreasing trend.

Approach: We first collected all the knobs we could find in the LLVM codebase using a custom Clang matcher. We looked for the following patterns (illustrated in the sketch after the list):

  1. `const knob_name = knob_val`
  2. `cl::init(knob_val)`
  3. `enum { knob_name = knob_val }`
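
To make these patterns concrete, here is a rough C++ sketch of how each typically appears in the LLVM sources; the knob names and values below are hypothetical, not actual LLVM knobs:

```cpp
#include "llvm/Support/CommandLine.h"

using namespace llvm;

// Pattern 1: a plain constant acting as a threshold (hypothetical knob).
static const unsigned MaxWidgetDepth = 16;

// Pattern 2: a command-line option whose default is supplied via cl::init.
static cl::opt<unsigned> WidgetThreshold(
    "widget-threshold", cl::Hidden, cl::init(50),
    cl::desc("Hypothetical cost threshold for widget simplification"));

// Pattern 3: an enumerator encoding a magic value.
enum { DefaultWidgetBudget = 250 };
```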

This left us with nearly 500 knobs. The small study was meant to filter out the uninteresting knobs among these. Though this approach has its flaws, since a particular knob can emit stats on different bitcode files*, it seemed the best alternative given the limited resources available. We cherry-picked the 93 knobs that showed a change in at least one stat value. We then use a custom Python tool (optimised to deal with I/O and cache bottlenecks) to collect the different stat values in parallel and store them in a JSON file, which is later used to render the graphs. For compile time, we measure the instruction count using perf, and for bitcode size, we use llvm-size to get the text-segment size in bytes.
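
For context on what "stats emitted" means here: LLVM passes maintain their counters via the STATISTIC macro, and knobs typically gate the transformations that bump those counters, which is why changing a knob moves the stats. A minimal sketch of that interplay, again with hypothetical names:

```cpp
#include "llvm/ADT/Statistic.h"
#include "llvm/Support/CommandLine.h"

using namespace llvm;

#define DEBUG_TYPE "widget-simplify"

// Hypothetical counter, reported under -stats as
// "Number of widgets simplified".
STATISTIC(NumWidgetsSimplified, "Number of widgets simplified");

// Hypothetical knob controlling how expensive a widget may be.
static cl::opt<unsigned> WidgetCostThreshold(
    "widget-cost-threshold", cl::Hidden, cl::init(50),
    cl::desc("Maximum cost of a widget we are willing to simplify"));

// Raising the knob lets more widgets pass the check, so the corresponding
// statistic (and potentially compile time) moves with it.
static bool trySimplifyWidget(unsigned Cost) {
  if (Cost > WidgetCostThreshold)
    return false; // threshold hit: transformation skipped
  ++NumWidgetsSimplified;
  return true;
}
```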

Feedback and Next Steps: We would like feedback on the current results and suggestions for next steps. Our current plan is to first determine whether multiple magic values exist for a particular knob. This involves studying the knobs in isolation and determining for what portion of the programs* a certain knob value performs well and for what portion some other magic value performs well. Note that a knob need not have multiple such values; we would first like to determine whether this happens for some knobs, because then we can work towards a framework that LLVM could use to change its knob values according to the program it compiles.

cc: @jdoerfert @wsmoses @jhueckelheim


*The first K files in the Compile dataset


Some of these results are kind of obvious… e.g. if you increase the loop unswitch threshold, you unswitch more loops.

Some of the more obscure knobs have interesting effects… for example, capture-tracking-max-uses-to-explore apparently has a significant impact on analysis in certain cases.

It’s not clear what we actually do with this information, though; it might point to areas to explore, but it doesn’t really indicate how to fix them. So someone would need to look into it… but nobody really has time to look into the codesize of a random dataset.

The effect of the inline-memaccess-cost option is just to increase inlining costs by a lot – that this results in lower code size and compile-time is an expected outcome, but I don’t think it really tells us anything useful.

Looking at your 10000 bitcode data set, the force-target-instruction-cost option is an internal testing option, which forces the vectorization cost model to return fixed values. While this of course impacts vectorization statistics a lot, it won’t do so in any useful manner…

The other tested options are more interesting, though I find it hard to really draw actionable conclusions from the data. The effect that a change on a certain statistic has on the final binary performance is non-obvious. If we increase the loop unswitch threshold, we’ll unswitch more loops – but is that actually beneficial? We would need runtime performance data to determine that. Does the used data set allow gathering runtime performance data as well?

I think the most directly useful result here is for capture-tracking-max-uses-to-explore, which indicates that we do lose optimizations due to the capture tracking limit – I don’t think we’d want to simply raise the threshold, but we could explore visiting larger graphs in cases where the results can be cached.

We are working on that. Update coming soon (@ivanradanov).

Agreed. We’ll first explore in more detail which values (potentially more than one) are worth investigating in real settings for different objectives (size, compile time, and soon runtime). We’ll start with one knob, and the process can be replicated for others as we go.

Are these bitcode files all generated by clang or does the dataset include IR generated by other frontends? I see that the full dataset includes a handful of languages, but it’s not clear to me how “the first K files” are determined – if it’s not a random sample, then there may be bias in the file selection?