New Large-Scale Dataset of LLVM-IR for Machine Learning and Beyond

We are excited to announce the release of ComPile, the first large-scale dataset of LLVM IR suitable for machine learning tasks, together with statistical analyses of LLVM usage across languages whose value extends beyond machine learning.

The license-filtered dataset consists of

  • 16GB of C bitcode
  • 109GB of C++ bitcode
  • 200GB of Julia bitcode
  • 656GB of Rust bitcode
  • 8GB of Swift bitcode

and is publicly hosted on HuggingFace. The utilities used to build the dataset are publicly available on GitHub and are actively being expanded.
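For readers who want to start exploring, a minimal sketch of streaming the dataset with the HuggingFace `datasets` library follows. The dataset id `llvm-ml/ComPile` and the record fields (`language`, `content`) are assumptions here; check the dataset card on HuggingFace for the actual schema. Streaming mode avoids downloading the full corpus (close to 1 TB) up front.

```python
# Hedged sketch: the dataset id and field names below are assumptions;
# consult the HuggingFace dataset card for the real schema.
#
#   from datasets import load_dataset
#   ds = load_dataset("llvm-ml/ComPile", split="train", streaming=True)

def take_language(dataset, language, n):
    """Yield up to n records whose 'language' field matches.

    Works on any iterable of dicts, including a streamed
    HuggingFace dataset, without loading it fully into memory.
    """
    count = 0
    for record in dataset:
        if record.get("language") == language:
            yield record
            count += 1
            if count == n:
                break

# With the streamed dataset, one could then inspect a few Rust modules:
#   for module in take_language(ds, "rust", 3):
#       print(len(module["content"]))
```

Filtering lazily like this matters because iterating a streamed dataset is sequential; pulling only the first few matches keeps experimentation cheap.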

The accompanying paper, which dives into the details of the dataset's construction, has also just been accepted to the Journal of Data-Centric Machine Learning.

We are excited to see what benefits to the LLVM ecosystem can be derived through the dataset!

Best,
@boomanaiden154-1 , @lpaehler , Konstantinos Parasyris, @tbennun , Jacob Hegna, @wsmoses , @josemonsalve2 , @mtrofin , and @jdoerfert

For a brief glimpse of the capabilities of the dataset, beginning with its statistics:

And its utility for machine-learned compiler optimizations, shown on the example task of code-size prediction compared against the established AnghaBench benchmark:

