New Large-Scale Dataset of LLVM-IR for Machine Learning and Beyond

We are excited to announce the release of ComPile, the first large-scale dataset of LLVM IR suitable for machine learning tasks, together with statistical analyses of LLVM usage across languages whose value extends beyond machine learning.

The license-filtered dataset consists of

  • 16GB of C bitcode
  • 109GB of C++ bitcode
  • 200GB of Julia bitcode
  • 656GB of Rust bitcode
  • 8GB of Swift bitcode

and is publicly hosted on HuggingFace. The utilities used to build the dataset are publicly available on GitHub and are actively being expanded.
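For readers who want to start exploring, a minimal sketch of streaming the dataset with the HuggingFace `datasets` library follows. The dataset id `llvm-ml/ComPile` and the record fields (`language`, `content`) are assumptions here; check the dataset card on HuggingFace for the actual schema. Streaming mode avoids downloading the full corpus (close to 1 TB) up front.

```python
# Hedged sketch: the dataset id and field names below are assumptions;
# consult the HuggingFace dataset card for the real schema.
#
#   from datasets import load_dataset
#   ds = load_dataset("llvm-ml/ComPile", split="train", streaming=True)

def take_language(dataset, language, n):
    """Yield up to n records whose 'language' field matches.

    Works on any iterable of dicts, including a streamed
    HuggingFace dataset, without loading it fully into memory.
    """
    count = 0
    for record in dataset:
        if record.get("language") == language:
            yield record
            count += 1
            if count == n:
                break

# With the streamed dataset, one could then inspect a few Rust modules:
#   for module in take_language(ds, "rust", 3):
#       print(len(module["content"]))
```

Filtering lazily like this matters because iterating a streamed dataset is sequential; pulling only the first few matches keeps experimentation cheap.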

The accompanying paper, which dives into the details of the dataset's construction, has also just been accepted to the Journal of Data-Centric Machine Learning.

We are excited to see what benefits to the LLVM ecosystem can be derived through the dataset!

Best,
@boomanaiden154-1 , @lpaehler , Konstantinos Parasyris, @tbennun , Jacob Hegna, @wsmoses , @josemonsalve2 , @mtrofin , and @jdoerfert

For a brief glimpse of the capabilities of the dataset, beginning with its statistics:

And its utility for machine-learned compiler optimizations, shown on the example task of code-size prediction compared against the established AnghaBench benchmark:

