[RFC] Enhancing MLGO Inlining with IR2Vec Embeddings

We plan to upstream support for generating IR2Vec embeddings into Machine Learning Guided Optimization (MLGO) for inlining. Initial training of the code size model on internal binaries, combining the existing features with IR2Vec embeddings, demonstrates additional code size reductions of up to 4.2% over -Os alone and up to 3.8% over -Os with the current MLGO inliner.

Design of IR2Vec

IR2Vec [1] is a program embedding approach designed specifically for LLVM IR. IR2Vec embeddings capture syntactic, semantic, and structural properties of the IR through learned representations. The embeddings are trained offline, in an unsupervised manner, to capture the statistical correlations between the entities of the instructions (opcodes, types, and operands) in the IR. Broadly, the training involves learning to correctly “predict” a missing entity given its context. This process results in a learned vocabulary: a dictionary mapping each entity of the IR to an n-dimensional floating-point vector (its embedding).

Once the vocabulary is obtained through offline learning, generating representations in LLVM involves a simple look-up into the learned vocabulary, followed by aggregation to compute embeddings for instructions, basic blocks, and functions.
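As an illustration, here is a minimal sketch of this look-up-and-aggregate step. The helper names and the per-entity weights are placeholders chosen in the spirit of the IR2Vec paper, not the PR’s actual implementation:

```cpp
#include <cstddef>
#include <map>
#include <string>
#include <vector>

using Embedding = std::vector<double>;
using Vocabulary = std::map<std::string, Embedding>;

// Hypothetical per-entity weights; the actual values used by the pass may differ.
constexpr double OpcWeight = 1.0;
constexpr double TypeWeight = 0.5;
constexpr double ArgWeight = 0.2;

// Dst += W * Src (element-wise).
static void scaleAndAdd(Embedding &Dst, const Embedding &Src, double W) {
  for (size_t I = 0; I < Dst.size(); ++I)
    Dst[I] += W * Src[I];
}

// Conceptually, an instruction's embedding is a weighted sum of the
// vocabulary entries for its opcode, type, and operand kinds.
Embedding embedInstruction(const Vocabulary &Vocab, const std::string &Opcode,
                           const std::string &Type,
                           const std::vector<std::string> &OperandKinds,
                           size_t Dim) {
  Embedding E(Dim, 0.0);
  scaleAndAdd(E, Vocab.at(Opcode), OpcWeight);
  scaleAndAdd(E, Vocab.at(Type), TypeWeight);
  for (const std::string &Op : OperandKinds)
    scaleAndAdd(E, Vocab.at(Op), ArgWeight);
  return E;
}
```

Basic-block and function embeddings then follow by summing the embeddings of the instructions and blocks they contain.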

The computation of IR2Vec embeddings introduces no external dependencies and requires no model inference at compile time. The necessary model weights are extracted as a standalone vocabulary of about 64 floating-point vectors (corresponding to the different opcodes, types, and operands in LLVM IR) in JSON format. Currently, this JSON vocabulary is read into memory once and used for the computation. Going forward, this file read can be avoided by generating the vocabulary as static maps at build time.
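For illustration only, here is a sketch of how such a JSON vocabulary could be read using LLVM’s own JSON support (llvm/Support/JSON.h). The key/value layout in the comment is an assumption about the format, and the actual reader in the PR (IR2VecVocabAnalysis::readVocabulary) may differ:

```cpp
#include "llvm/ADT/StringMap.h"
#include "llvm/ADT/StringRef.h"
#include "llvm/Support/Error.h"
#include "llvm/Support/JSON.h"
#include <vector>

using Embedding = std::vector<double>;

// Assumed file layout (illustrative): one key per IR entity, e.g.
//   { "add": [0.12, -0.03, ...], "integerTy": [...], "pointer": [...], ... }
llvm::Expected<llvm::StringMap<Embedding>>
parseVocabulary(llvm::StringRef Buffer) {
  llvm::Expected<llvm::json::Value> Parsed = llvm::json::parse(Buffer);
  if (!Parsed)
    return Parsed.takeError();
  const llvm::json::Object *Obj = Parsed->getAsObject();
  if (!Obj)
    return llvm::createStringError(llvm::inconvertibleErrorCode(),
                                   "vocabulary is not a JSON object");
  llvm::StringMap<Embedding> Vocab;
  for (const auto &KV : *Obj) {
    const llvm::json::Array *Arr = KV.second.getAsArray();
    if (!Arr)
      return llvm::createStringError(llvm::inconvertibleErrorCode(),
                                     "vocabulary entry is not an array");
    Embedding E;
    E.reserve(Arr->size());
    // Sketch-level leniency: non-numeric elements become 0.0.
    for (const llvm::json::Value &V : *Arr)
      E.push_back(V.getAsNumber().value_or(0.0));
    llvm::StringRef Name = KV.first;
    Vocab[Name] = std::move(E);
  }
  return Vocab;
}
```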

Plan

Broadly, we intend to upstream a function analysis pass that computes the embeddings, following different strategies, with no additional dependencies. The corresponding patch is available at https://github.com/llvm/llvm-project/pull/134004. In the PR we have identified a number of FIXMEs and TODOs for performance improvements that we will address in incremental patches. Subsequently, we plan to patch the MLInlineAdvisor to make inlining decisions using the embeddings. Going forward, our goal is to replace a subset of the existing features that are “costly” to compute with the embeddings.
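To give a feel for the intended shape, here is a hedged sketch of how a consumer (such as an inline advisor) might query a function-level embedding analysis under the new pass manager, assuming the analysis has been registered with the FunctionAnalysisManager. The analysis name, result type, and dimension below are hypothetical stand-ins, not the interface in the PR:

```cpp
#include "llvm/IR/Function.h"
#include "llvm/IR/PassManager.h"
#include "llvm/Support/raw_ostream.h"
#include <vector>

// Hypothetical stand-in for the analysis in the PR, declared here only so
// the consumer below is self-contained.
class IR2VecAnalysis : public llvm::AnalysisInfoMixin<IR2VecAnalysis> {
  friend llvm::AnalysisInfoMixin<IR2VecAnalysis>;
  static llvm::AnalysisKey Key;

public:
  using Result = std::vector<double>; // the function's embedding
  Result run(llvm::Function &F, llvm::FunctionAnalysisManager &) {
    // Placeholder: the real pass aggregates vocabulary look-ups over F.
    return Result(/*Dim=*/75, 0.0); // 75 is an arbitrary illustrative dimension
  }
};
llvm::AnalysisKey IR2VecAnalysis::Key;

// A consumer queries the embedding through the FunctionAnalysisManager like
// any other function analysis.
struct PrintEmbeddingPass : llvm::PassInfoMixin<PrintEmbeddingPass> {
  llvm::PreservedAnalyses run(llvm::Function &F,
                              llvm::FunctionAnalysisManager &FAM) {
    const IR2VecAnalysis::Result &Emb = FAM.getResult<IR2VecAnalysis>(F);
    llvm::errs() << F.getName() << ": " << Emb.size() << "-dim embedding\n";
    return llvm::PreservedAnalyses::all();
  }
};
```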

We plan to maintain the source code for training the vocabulary outside LLVM; it is currently available at https://github.com/IITH-Compilers/IR2Vec. We can subsequently explore whether it makes sense to include it under (e.g.) llvm/utils/mlgo-utils, once we accrue more experience with real-world use cases.

Experiments

Currently, LLVM supports MLGO for inlining (to reduce code size) and for eviction decisions in register allocation. These ML models are trained using hand-engineered features tailored to the specific optimizations. IR2Vec, by contrast, learns representations directly from the IR (as described above), capturing its syntactic, semantic, and structural properties without hand-engineered features.

IR2Vec has demonstrated its effectiveness on different ML-driven optimizations like phase ordering [2], loop distribution [3], and register allocation [4] on standard benchmarks like SPEC CPU, TSVC, Polybench, etc. Before proposing this RFC, we wanted to validate its effectiveness and scalability in real-world scenarios.

To that end, following the existing approach, we trained two Reinforcement Learning models (using PPO): one with only the existing MLGO features, and the other with the existing MLGO features concatenated with IR2Vec embeddings. Training was done until convergence on about 50K modules from our internal datacenter binaries. Initial evaluation of the resulting policy on various size-sensitive binaries internal to Google, as well as on clang and opt (all with -Os), yields the following improvements.

Improvements in Text Section Size

| Binary | -Os | -Os with Current MLGO (Feature-based) | -Os with MLGO features + IR2Vec embeddings | % Additional Improvement over -Os | % Additional Improvement over -Os + MLGO |
| --- | --- | --- | --- | --- | --- |
| internal_1 | 121.3M | 117.1M | 116.1M | 4.29% | 0.85% |
| internal_2 | 721.7M | 714.5M | 698.5M | 3.21% | 2.24% |
| clang | 116M | 113.5M | 111.6M | 3.79% | 1.67% |
| opt | 101.9M | 103.04M | 99.1M | 2.75% | 3.82% |

Improvements in Total Binary Size (Stripped)

| Binary | -Os | -Os with Current MLGO (Feature-based) | -Os with MLGO features + IR2Vec embeddings | % Additional Improvement over -Os | % Additional Improvement over -Os + MLGO |
| --- | --- | --- | --- | --- | --- |
| internal_1 | 139.9M | 135.6M | 135.6M | 3.07% | – |
| internal_2 | 802.6M | 794.2M | 779.5M | 2.88% | 1.85% |
| clang | 123.4M | 121.3M | 119.2M | 3.40% | 1.73% |
| opt | 105.8M | 107.9M | 103.7M | 2.00% | 3.89% |

We used a vocabulary trained on the SPEC CPU benchmarks and the Boost library, following the steps described in the IR2Vec repository. We did not fine-tune this vocabulary, as our goal was to establish that (i) the embeddings add value, and (ii) the approach is scalable (in terms of compile time and memory utilization). Note that fine-tuning the vocabulary, together with model improvements, might further improve these results.

Acknowledgements
All the contributors of IR2Vec - https://github.com/IITH-Compilers/IR2Vec/graphs/contributors

Thanks,
Venkat

References
[1] S. VenkataKeerthy, R. Aggarwal, S. Jain, M. Desarkar, R. Upadrasta, and Y. N. Srikant. “IR2Vec: LLVM IR based Scalable Program Embeddings.” ACM Transactions on Architecture and Code Optimization (TACO) 17(4), 2020. https://arxiv.org/abs/1909.06228.
[2] S. Jain, Y. Andaluri, S. VenkataKeerthy, and R. Upadrasta. “POSET-RL: Phase ordering for Optimizing Size and Execution Time using Reinforcement Learning.” IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS), 2022. https://arxiv.org/abs/2208.04238.
[3] S. Jain, S. VenkataKeerthy, R. Aggarwal, T. K. Dangeti, D. Das, and R. Upadrasta. “Reinforcement Learning assisted Loop Distribution for Locality and Vectorization.” IEEE/ACM Eighth Workshop on the LLVM Compiler Infrastructure in HPC (LLVM-HPC), 2022. https://ieeexplore.ieee.org/document/10026979.
[4] S. VenkataKeerthy, S. Jain, A. Kundu, R. Aggarwal, A. Cohen, and R. Upadrasta. “RL4ReAl: Reinforcement Learning for Register Allocation.” Proceedings of the 32nd ACM SIGPLAN International Conference on Compiler Construction (CC 2023). https://doi.org/10.1145/3578360.3580273.


Do you have a plan to document how to actually use optimizations with IR2Vec models? There’s zero in-tree documentation for any of the MLGO work, and your initial patch doesn’t add anything.

I’ll fix that. A section under docs/Reference, and then the ir2vec docs can go there?


Sure, will add a doc under llvm/docs as per @mtrofin’s suggestion. Would it be more appropriate to introduce this as part of the patch that would integrate the embeddings with MLInlineAdvisor?

If you want to document the file formats for inputs/outputs of the IR2Vec pass itself, that probably belongs in the first patch, since that’s where it’s implemented. Other stuff can wait for the relevant patch, sure.

Hi @svkeerthy, can you confirm my understanding that this would be the only piece of https://github.com/IITH-Compilers/ml-llvm-project that is currently/soon to be merged upstream? I.e., there is currently nothing for LLVM-gRPC, the ONNX runner, etc.?

Happy to contribute to the upstreaming effort. Maybe we could start by considering how the ONNX runner would fit into the current MLGO scheme (without the whole ML-Compiler-Bridge surrounding it), and how to build and run ONNX models.

Also, to introduce myself: I’m Ryan Mitchell, I work on the AMD GPU shader compiler team, and I’m going to sign up for the MLGO meetings!

Exciting development! A couple of thoughts/questions:

  • How often (if ever) do the seed embeddings need to be retrained, and is that something your team will do in perpetuity?
  • It would be very useful if there were a way for target backends to adapt the approach to MIR. Is that something that’s been thought about or discussed?

Hi @RyanRio,

Currently, our objective is to upstream IR2Vec and the relevant components available at https://github.com/IITH-Compilers/IR2Vec.

If there is sufficient interest in having components of ML-Compiler-Bridge, like the ONNX runner, as part of LLVM’s model runner infrastructure (llvm::MLModelRunner), we would be open to exploring upstreaming them. We can discuss this in one of the upcoming MLGO meetings and see how to take it forward.

Theoretically, using an older vocabulary should be acceptable unless we have concerns about the embedding quality for the IR generated by current LLVM/clang. To best capture the newest characteristics of the IR, it seems ideal to retrain the seed embeddings with each LLVM release (which is what we have been doing so far). Alternatively, we could consider retraining whenever significant changes occur in the IR structure or syntax, such as the addition or removal of opcodes or types. That said, we plan to upstream a set of default seed embeddings of different dimensions and to make the necessary tooling for generating the vocabulary available.

Yes, we have extended the ideas of IR2Vec to MIR for our experiments on register allocation. Specifically, we have seed embeddings for MIR targeting the x86 and AArch64 backends (see llvm/lib/CodeGen/MLRegAlloc on the mlbridge-lib branch of https://github.com/IITH-Compilers/ml-llvm-project). We are open to looking into ways to upstream these extensions incrementally.

Drive-by comment: with more and more interactions with ML originating from inside LLVM, I feel we should standardize the API, or at least start having conversations about it, before we end up with a million customized ways to do it. I hope we can have some bridging/boilerplate in LLVM so we can talk to PyTorch/TF effortlessly.

For example, this PR uses using Embedding = std::vector<double>;, which may not scale to low precision (if we ever need it), and reads the vocabulary from JSON (Error IR2VecVocabAnalysis::readVocabulary()). On the other hand, MLGO defines TensorSpec (in TensorSpec.h), which can be translated into raw memory buffers (void*) and, again, defines its own way of talking to JSON.
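To make the suggestion concrete (just a sketch under my own assumptions, not a worked-out design), the embedding’s shape and element type could be declared through the existing TensorSpec machinery, so it shares the same handshake as other model inputs; the feature name here is made up:

```cpp
#include "llvm/Analysis/TensorSpec.h"
#include <cstdint>

// Describe an n-dimensional embedding as a TensorSpec so it could be handed
// to an MLModelRunner like any other feature. "ir2vec_embedding" is a
// hypothetical feature name used only for illustration.
llvm::TensorSpec makeEmbeddingSpec(size_t Dim) {
  return llvm::TensorSpec::createSpec<double>(
      "ir2vec_embedding", {static_cast<int64_t>(Dim)});
}
```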


Right, but the whole point of this RFC is that the technique is lightweight and independent of ML frameworks.

Thanks! Yeah, I wouldn’t want to derail your current objective. I’m experimenting with your fork in the meantime. Hopefully we can pin down a good scope for the use cases and make sure it’s standardized enough, given the concerns the most recent comments have expressed.

I understand. I am not suggesting including ML frameworks in LLVM; that would be too much. But should we have a unified representation, class Tensor or something alike, so we don’t have different code doing similar things (e.g., dumping to JSON)?

So far, a tensor value is really just std::vector<some_scalar>. TensorSpec is rather a type: it describes an operand (input or output) and its type, and is used in the handshake with the MLModelRunner implementation. JSON serialization is interesting for TensorSpec when we log or perform IPC for training. Tensor values, at that point, are bit-dumped for efficiency.

For the IR2Vec scenario, there’s none of that, really - we’re just talking about vectors of floating-point values loaded once. “Everything just works” - including JSON serialization/deserialization - out of the box. I can’t really see what meaningful reuse there’d be here. Of course, if that changes, we can revisit, but right now we don’t seem to have concrete use cases. Wdyt?

Sounds good.