RFC: Incubator project for llvm-ir-dataset-utils

As part of our recent effort to build a large dataset of LLVM-IR for applications in machine learning, compiler testing, and more ([2309.15432] ComPile: A Large IR Dataset from Production Sources), we have created tooling to build large IR datasets from package indices and to process the resulting information. We’d be interested in moving this under the LLVM umbrella as an incubator project. Note that this effort is complementary to RFC: Upstreaming elements of the MLGO tooling, as we use that tooling extensively. We believe this provides a couple of key advantages to the community:

  • It makes it easier for other parties to contribute to the effort, especially those that restrict or require approval for general open source contributions but might have an exception for contributing to LLVM.
  • It puts the tooling in an easily discoverable, centralized place where it complements the rest of LLVM and aids the efforts mentioned above.

We believe this project makes a lot of sense as an LLVM incubator project. Addressing the specific points listed in the developer policy for incubator projects:

  • Must be generally aligned with the mission of the LLVM project to advance compilers, languages, tools, runtimes, etc.
    • Our dataset tooling is tied to LLVM (it produces LLVM-IR after all) and it is closely associated with other upstream LLVM efforts like MLGO, ultimately aimed at enabling new efforts and enhancing existing ones within LLVM.
  • Must conform to the license, patent, and code of conduct policies laid out in this developer policy document.
    • llvm-ir-dataset-utils will be upstreamed under the Apache 2.0 License with LLVM Exceptions, and we generally intend to follow all upstream LLVM procedures.
  • Must have a documented charter and development plan, e.g. in the form of a README file, mission statement, and/or manifesto.
    • We currently have a basic README, but can write up a more formal mission/milestones as the community deems necessary.
  • Should conform to coding standards, incremental development process, and other expectations.
    • Currently llvm-ir-dataset-utils is lacking somewhat in this department. The tooling currently works at a large scale, but there’s still a lot of work to be done to improve the code quality and make elements more robust. Our current plan is to incrementally improve test coverage and make better use of tools like static type analysis. We believe the necessary adjustments could be made within an incubator repository.
  • Should have a sense of the community that it hopes to eventually foster, and there should be interest from members with different affiliations / organizations.
    • We’re currently focused on enabling a variety of research-oriented projects, but we have a large degree of interest from multiple parties. The team behind this project was a cross-institutional collaboration including people from Google, LLNL, ANL, and multiple universities. We’ve already received a lot of interest from other universities and companies.
  • Should have a feasible path to eventually graduate as a dedicated top-level or sub-project within the LLVM monorepo.
    • We are not sure of a concrete timeline for this, but our plan would be to upstream the project, or essential pieces of it, to somewhere in the monorepo once code quality issues have been cleaned up and the community is interested in it being upstreamed. We would either do it as a full top-level project or put it under llvm/utils.
  • Should include a notice (e.g. in the project README or web page) that the project is in ‘incubation status’ and is not included in LLVM releases (see suggested wording below).
    • Easily doable and will be done pending this RFC when the time comes to actually move things.

We’re interested in seeing what the broader LLVM community thinks about this proposal, and we look forward to future collaboration to enable more efforts towards large-scale compiler testing and machine learning in compilers.

The code is available to view as a top-level project at [Dataset] Upstream llvm-ir-dataset-utils by boomanaiden154 · Pull Request #72320 · llvm/llvm-project · GitHub, although it would end up in an incubator repository pending acceptance of this RFC.


I wonder how much of the tooling is really coupled to LLVM-IR, and how much of that coupling is fundamental?
It would be great if this could all be decoupled (when conceptually possible) so that people writing MLIR-based compilers could reuse all of this (think “OpenAI Triton” for example, or “Flang IR”).
Otherwise we’ll end up duplicating all of it at some point to be able to support MLIR users.

Some of it is fairly closely coupled to IR, like the extraction tooling (see RFC: Upstreaming elements of the MLGO tooling), but a lot of it is less coupled to IR, like the build system interfaces. The IR extraction tooling can also be easily swapped out for something else. A lot of the post-processing infrastructure, build-system interfaces, and other dataset infrastructure should be fairly extensible. We already have plans to add support for Fortran with Flang, and as long as the correct flags are in place to emit HLFIR/FIR that we can collect, it should be relatively trivial. Doing something like using Polygeist or clang with ClangIR to compile all the C/C++ applications should also be easily doable. Other sources would just need an additional builder.
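
To make the "additional builder" idea concrete, here is a rough sketch of what a pluggable builder interface could look like. Everything in it (`Builder`, `BuildResult`, `FakeMLIRBuilder`, `BUILDERS`, `build_package`) is a hypothetical name chosen for illustration and is not the actual llvm-ir-dataset-utils API. The stand-in builder writes a placeholder file instead of invoking a real compiler, so the example only illustrates the shape of the abstraction.

```python
from __future__ import annotations

import tempfile
from abc import ABC, abstractmethod
from dataclasses import dataclass
from pathlib import Path


@dataclass
class BuildResult:
    """Outcome of building one package (hypothetical structure)."""
    package: str
    ir_files: list[Path]


class Builder(ABC):
    """Hypothetical interface: one subclass per build system or IR source."""

    # File suffix of the IR that this builder's toolchain emits.
    ir_suffix = ".bc"

    @abstractmethod
    def build(self, source_dir: Path) -> None:
        """Run the build so that IR files end up under source_dir."""

    def collect(self, package: str, source_dir: Path) -> BuildResult:
        # Collection depends only on the suffix, not on LLVM IR specifics.
        files = sorted(source_dir.rglob(f"*{self.ir_suffix}"))
        return BuildResult(package, files)


class FakeMLIRBuilder(Builder):
    """Stand-in builder: writes a placeholder .mlir file, runs no compiler."""

    ir_suffix = ".mlir"

    def build(self, source_dir: Path) -> None:
        (source_dir / "module.mlir").write_text("module {}\n")


# Registry mapping a builder name to its class; new sources register here.
BUILDERS: dict[str, type[Builder]] = {"fake-mlir": FakeMLIRBuilder}


def build_package(kind: str, package: str, source_dir: Path) -> BuildResult:
    """Look up the builder by name, run it, and gather the emitted IR."""
    builder = BUILDERS[kind]()
    builder.build(source_dir)
    return builder.collect(package, source_dir)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        result = build_package("fake-mlir", "demo", Path(tmp))
        print([p.name for p in result.ir_files])
```

In a layout like this, the post-processing stages would only ever see a `BuildResult`, so supporting a new IR would mean adding one subclass and one registry entry rather than touching the rest of the pipeline.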

There are some assumptions made with regard to LLVM IR that would need to be changed to get this to work, but it should be doable, and I would think most sections should already be reusable. Given that quite a few people are interested in extending this to MLIR at this point, we’ll keep this use case in mind as we refactor and iterate on the current tooling so that things remain easily extensible.

Are there any objections to moving forward with this? We’ve received quite a bit of interest in the dataset (although most of it is not present here on the Discourse), and now that the dataset is open source we’d like to move forward with the incubator process.

I believe that if there aren’t any objections, the next step would be to have the LLVM GitHub admins (@tstellar ?) create a new repository under the LLVM organization on GitHub?
