RFC: Upstreaming elements of the MLGO tooling

As part of the effort to enable machine learning guided optimization (MLGO) within LLVM (see "[RFC] MLGO Regalloc: learned eviction policy for regalloc" and "RFC: a practical mechanism for applying Machine Learning for optimization policies in LLVM"), a variety of utilities have been upstreamed into LLVM. However, alongside the MLGO machinery available in upstream LLVM, a number of other tools have been developed out of tree in the google/ml-compiler-opt repository on GitHub (Infrastructure for Machine Learning Guided Optimization (MLGO) in LLVM). One of the key elements is the corpus extraction tooling, which supports generating corpora of LLVM IR from software projects that use various build systems. There are a few reasons we’re interested in upstreaming this tooling:

  • It’s non-prescriptive. The corpus extraction tooling doesn’t bind a user to any specific ML framework or other paradigm. It leaves a large amount of flexibility and enables other projects rather than prescribing how they should be carried out.
  • It’s tightly coupled to compilation flags and facilities provided within the compiler. Having these corpus extraction utilities available in upstream LLVM allows the flags and the tooling to be updated at the same time.
  • It puts more of the tooling needed to take full advantage of the upstream MLGO infrastructure in a common place that anyone working on the compiler will already have, making collaboration easier.

As part of this transition, we would move the corpus extraction tooling from the google/ml-compiler-opt GitHub repository into the LLVM monorepo, specifically under llvm/py/mlgo. We would use namespace packaging (see "Packaging namespace packages" in the Python Packaging User Guide) to create an llvm.mlgo package with our tooling. The llvm.* namespace would be open for others to extend with their own utilities. However, we’re open to other ideas on where exactly things should be placed. The exact files, tests, and associated utilities can be seen in pull request #72319 on llvm/llvm-project, "[MLGO] Upstream the corpus extraction tooling" by boomanaiden154. We would make a few minor changes to the existing code to make it consistent with the rest of the LLVM project:

  • Decouple the current libraries/scripts from the google/ml-compiler-opt repository so they are in a suitable state for being introduced upstream.
  • Remove extra dependencies from the scripts as necessary, and keep the prescriptive parts of the tooling that rely on those dependencies, which enable certain (internal) use cases, in the existing google/ml-compiler-opt repository.
  • Reformat the code with black so that its formatting is consistent with the rest of the LLVM monorepo.
  • Add CI to publish the library as a PyPI package alongside the rest of the LLVM release process, and as a nightly build.
  • Change out the license headers.
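
To illustrate the namespace packaging approach proposed above for llvm.mlgo, here is a minimal sketch using Python's native (PEP 420) namespace packages. It is not the layout from the pull request: the temporary directories, the `other_tool` subpackage, and the `NAME` attribute are hypothetical and exist only for the demonstration. The point it shows is that two independently distributed directories can each contribute a subpackage to a shared `llvm` namespace, provided neither ships an `llvm/__init__.py`.

```python
import importlib
import sys
import tempfile
from pathlib import Path


def make_portion(root: Path, subpackage: str, body: str) -> None:
    """Create root/llvm/<subpackage>/__init__.py without any llvm/__init__.py.

    Omitting llvm/__init__.py is what makes `llvm` a namespace package
    that several distributions can extend.
    """
    pkg_dir = root / "llvm" / subpackage
    pkg_dir.mkdir(parents=True)
    (pkg_dir / "__init__.py").write_text(body)


def load_demo_namespace():
    """Build two hypothetical distributions and import from both."""
    base = Path(tempfile.mkdtemp())
    dist_mlgo = base / "dist_mlgo"    # stands in for the proposed llvm.mlgo
    dist_other = base / "dist_other"  # stands in for someone else's tooling
    make_portion(dist_mlgo, "mlgo", "NAME = 'mlgo tooling'\n")
    make_portion(dist_other, "other_tool", "NAME = 'third-party tooling'\n")

    # Each distribution lives in its own sys.path entry, as if the two had
    # been installed separately.
    sys.path[:0] = [str(dist_mlgo), str(dist_other)]
    importlib.invalidate_caches()

    mlgo = importlib.import_module("llvm.mlgo")
    other = importlib.import_module("llvm.other_tool")
    return mlgo, other


if __name__ == "__main__":
    mlgo, other = load_demo_namespace()
    print(mlgo.NAME)
    print(other.NAME)
```

Both imports resolve under the same `llvm` prefix even though the subpackages live in separate directories, which is the property that would let others extend llvm.* with their own utilities. Note that this assumes no regular package named `llvm` is already installed, since a regular package would take precedence over the namespace.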

We believe this change should be relatively unobtrusive, as it doesn’t change any defaults and is a small addition, but we want to obtain community feedback on this proposal before taking the next steps.