How to statically compare IR/object files to predict performance for machine-learning-based cost models

Hi, is there an existing tool I can use to get a close estimate of performance metrics for a machine-learning-based cost model? My aim is to avoid running the application/benchmark, because that would be very time-consuming over a huge number of training epochs. For example, I have some cl::opts through which I control some inline heuristics; these would be part of my feature set. I want to be able to train a model without having to run any application, so I'd like a comparator-style tool that gives me a close estimate of which of two given object files/IRs is the better code. Thanks for any help! If nothing exists, the most naive thing I can think of is to add up instruction costs using something like the TTI table, but I'm not sure how good a comparator that would be.

People are working on this.
(Tag @boomanaiden154-1 @mtrofin @ondrasej)


This RFC is related - it’s about introducing a driver tool where we can plug in various such estimators, the (likely first) being Gematria-based.

(tag @JestrTulip)


We don't have anything super production-ready currently, but we're definitely working on building up better performance models (especially beyond straight-line code, for problems like inlining, though other techniques might perform better there for modeling). Just getting instruction latencies/throughputs for each BB (using something like TTI, as you mentioned) and then weighting by PGO data is fairly naive, but it isn't completely terrible for estimating total cycle counts. However, it leaves a lot on the table in terms of accuracy (and speed, depending on which cycle predictor you use). We hope to have better models for this sort of thing soon, but we'll have to see how everything turns out.
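For illustration, the naive scheme described above (per-block latency estimate weighted by profile counts) can be sketched in a few lines of Python. The function names and input shapes here are made up; real per-block latencies would come from a TTI-style table or a cycle predictor, and counts from PGO data:

```python
# Naive whole-program cost estimate: sum over basic blocks of
# (estimated cycles per execution * execution count from the profile).

def estimate_cycles(block_latencies, block_counts):
    """block_latencies: dict block_id -> estimated cycles per execution.
    block_counts: dict block_id -> execution count from the profile.
    Blocks missing from the profile are treated as never executed."""
    return sum(lat * block_counts.get(bb, 0)
               for bb, lat in block_latencies.items())

def better_build(estimate_a, estimate_b):
    """Compare two builds of the same application; lower estimate wins."""
    return "A" if estimate_a <= estimate_b else "B"

# Hot loops dominate the estimate, as you'd expect:
lat = {"entry": 4.0, "loop": 12.5, "exit": 3.0}
cnt = {"entry": 1, "loop": 1_000_000, "exit": 1}
total = estimate_cycles(lat, cnt)
```

As noted above, this ignores everything a real pipeline model captures (dependencies, port pressure, memory), which is where the accuracy is left on the table.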

I've heard of llvm-mca; would that tool be useful here?

As a follow-up, how does MLGO do this currently? Do you run the app, or is there an estimator? I'd like to know. Thanks for your inputs, btw!

For regalloc we use the estimator in CodeGen/RegAllocScore.cpp, but it wasn't sufficient, so we ended up guiding the RL algorithm with it for a few policies, then selecting among them with real benchmarks. Not ideal, hence the interest in better latency models.

Yeah, makes sense. I think something like this should have been done much earlier. Anyway, do you have any WIP code I can take a look at?

@ondrasej for the ongoing port of Granite.

@pawan-nirpal-031 I'm not sure I understand correctly what you're planning to do. My best understanding is that you have an algorithm with a number of options, and you'd like to estimate the running time of the algorithm for a given set of options (or determine which of two combinations of options is better) without running the code.
Did I get this right?

For the former, Gematria or llvm-mca can help with estimating the cost of a sequence of instructions (e.g. an execution trace), and they should be very easy to use for this purpose and more precise than just counting instructions (which might still be a good first approximation).

The planned llvm-cm tool will allow combining those models with profiling data (which contains the number of executions of each instruction/basic block in the binary).

But all this assumes that you run the application and collect the profile data to know how many times each basic block (and thus instruction) is executed. This profile may change with the options you use, so you’d still need to collect a new one for a new combination of options. Doing this purely statically from code is a very hard, if at all feasible, problem of its own.

Hi, let me put it this way: I have some heuristics in InlineCost.cpp that I control through cl::opt; these are meant to be my feature vector, i.e. the learnable parameters for an ML model. My intention is to find the set of values that gives the best performance on a benchmark suite (something like specv6), but running the benchmark for each training iteration is terrible, of course. So what I need is: given two object files of the same application, built by the compiler with different sets of those learnable parameters, statically compare which of the two would be more performant on some given march (say znver3 for now), or, if including march makes it too hard, then in a machine-independent manner. My naive idea was to add up the instruction costs; that is of course too naive, but not a terrible starting point. I need a better estimator, though. I hope this is clearer. Cheers

Yes, this makes sense. Thanks for the explanation!

I'm not sure the llvm-cm tool will be very useful in this case, because it will probably need profiling data from each run to give you a reasonably precise result. If your compilation flags change the output significantly (change the shape of the CFG, add/remove memory accesses, ...), then the profiles won't be transferable from one run to the other, and you might as well re-run the benchmark.

For this kind of parameter tuning, a black-box optimization algorithm might be a better choice. There are some optimizers designed to find a good solution with as few trials as possible. The one I know is Google’s Vizier, but there might be others.
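As a sketch of what such a black-box tuning loop looks like, here is plain random search over a discrete flag space. Tools like Vizier replace the random draw with smarter trial selection, but the outer loop is the same. The flag names and toy objective below are invented for illustration:

```python
import random

def random_search(flag_space, objective, trials=20, seed=0):
    """Black-box search over a discrete flag space.
    flag_space: dict flag-name -> list of candidate values.
    objective: callable(config) -> cost; lower is better (e.g. benchmark runtime)."""
    rng = random.Random(seed)
    best_cfg, best_cost = None, float("inf")
    for _ in range(trials):
        cfg = {flag: rng.choice(vals) for flag, vals in flag_space.items()}
        cost = objective(cfg)  # in reality: rebuild with these cl::opts and run once
        if cost < best_cost:
            best_cfg, best_cost = cfg, cost
    return best_cfg, best_cost

# Toy stand-in for "build with these flags and measure": pretend 225 is optimal.
space = {"inline-threshold": [100, 225, 500], "inlinehint-threshold": [250, 325]}
best, cost = random_search(space, lambda c: abs(c["inline-threshold"] - 225))
```

The point of the smarter optimizers is precisely that each objective evaluation is an expensive benchmark run, so they build a model of past trials to pick the next configuration instead of drawing uniformly.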

I'm doing exactly the same thing as you. Currently, my solution is to get the latency and total execution count of each basic block and then sum them up. However, it doesn't work well; it's still far from the real data.

Oh okay, I wonder if any of the tools mentioned here could be tailored for that purpose. Are you going to look into the tools mentioned above?

I tried, but none of them meets my requirements. The accuracy of Gematria is still too low, and since I'm developing on GCC, llvm-mca is useless to me.

Btw, did you benchmark a quickly-running application (real runtime data) against the instruction-latency cost model? How good is the naive approach? By that I mean: what % of the actual runtime does it get? Even if it's around 80% of the real data, it should be a good starting point.

It's hard to evaluate the performance of Gematria outside of the results published in the GRANITE paper, as there is currently no publicly available model. But it is definitely not perfect, despite being quite a bit better than llvm-mca, Ithemal, and others on the commonly used datasets (BHive, Ithemal), although some of those tools prioritize metrics other than absolute accuracy.

llvm-mca does use the LLVM scheduling models (and depends on LLVM otherwise), but it is essentially standalone: it is supposed to model the underlying hardware rather than anything specific to the compiler. Unless I'm missing something, there is no issue using llvm-mca with code generated by GCC.

I did some experiments with uiCA (which is pretty much the current state of the art in cost modeling straight-line code) combined with basic-block weights from profile information, for cost modeling register allocation. That technique was able to correctly distinguish code snippets as faster or slower about 70-75% of the time, and had a tau coefficient of about 0.5-0.55 when ordering a bunch of different register allocations.
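For anyone unfamiliar, the tau coefficient here is Kendall's rank correlation between the predicted and the measured orderings. A minimal pairwise sketch (tau-a, no tie correction; O(n²), which is fine for small comparison sets):

```python
from itertools import combinations

def kendall_tau(xs, ys):
    """Kendall's tau-a between two equal-length score lists:
    (concordant - discordant pairs) / total pairs.
    +1 means identical ordering, -1 means reversed, 0 means no association."""
    assert len(xs) == len(ys) >= 2
    concordant = discordant = 0
    for i, j in combinations(range(len(xs)), 2):
        s = (xs[i] - xs[j]) * (ys[i] - ys[j])
        if s > 0:
            concordant += 1
        elif s < 0:
            discordant += 1
    n_pairs = len(xs) * (len(xs) - 1) // 2
    return (concordant - discordant) / n_pairs
```

So a tau of ~0.5 means the predictor gets noticeably more pairs in the right order than the wrong one, but is still far from a perfect ranking.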

Thanks for the information! The reason I say that the accuracy of Gematria is too low is that, even based on the results presented in their paper, its accuracy is much lower than uiCA's. In my experimental setting, combining uiCA with basic-block execution counts only achieves an accuracy of 20-40% at predicting whether an optimization makes things better or worse; I tested this on four optimizations and 5000 synthesized C programs.

Yes, I tried with my synthesized dataset, but the accuracy is too low.

May I confirm how you set up your uiCA? I'm wondering whether there is anything wrong with my uiCA setup.