As part of our recent effort to build a large dataset of LLVM-IR for applications in machine learning, compiler testing, and more ([2309.15432] ComPile: A Large IR Dataset from Production Sources), we have created tooling to build large IR datasets from package indices and to process the resulting information. We would like to move this tooling under the LLVM umbrella as an incubator project. Note that this effort is complementary to "RFC: Upstreaming elements of the MLGO tooling", as we use the MLGO tooling extensively. We believe this provides a couple of key advantages to the community:
- It makes it easier for other parties to contribute to the effort, especially organizations that restrict or require approval for general open source contributions but might have an exception for contributing to LLVM.
- It puts the tooling in an easily discoverable, centralized place, where it complements the rest of LLVM and aids the efforts mentioned above.
We believe this project makes a lot of sense as an LLVM incubator project. Addressing the specific points listed in the developer policy for incubator projects:
- Must be generally aligned with the mission of the LLVM project to advance compilers, languages, tools, runtimes, etc.
  - Our dataset tooling is tied to LLVM (it produces LLVM-IR, after all) and is closely associated with other upstream LLVM efforts like MLGO. It is ultimately aimed at enabling new efforts and enhancing existing ones within LLVM.
- Must conform to the license, patent, and code of conduct policies laid out in this developer policy document.
  - llvm-ir-dataset-utils will be upstreamed under the Apache 2.0 License with LLVM Exceptions, and we intend to follow all upstream LLVM procedures.
- Must have a documented charter and development plan, e.g. in the form of a README file, mission statement, and/or manifesto.
  - We currently have a basic README, but can write up a more formal mission statement and milestones as the community deems necessary.
- Should conform to coding standards, incremental development process, and other expectations.
  - llvm-ir-dataset-utils is currently somewhat lacking in this department. The tooling works at a large scale, but there is still a lot of work to be done to improve code quality and make parts of it more robust. Our current plan is to incrementally improve test coverage and make better use of tools such as static type analysis. We believe the necessary adjustments could be made within an incubator repository.
- Should have a sense of the community that it hopes to eventually foster, and there should be interest from members with different affiliations / organizations.
  - We are currently focused on enabling a variety of research-oriented projects, and multiple parties have expressed a large degree of interest. The team behind this project was a cross-institutional collaboration including people from Google, LLNL, ANL, and multiple universities. We have already received a lot of interest from other universities and companies.
- Should have a feasible path to eventually graduate as a dedicated top-level or sub-project within the LLVM monorepo.
  - We do not have a concrete timeline for this, but our plan would be to upstream the project, or essential pieces of it, into the monorepo once the code quality issues have been cleaned up and the community is interested in it being upstreamed. We would either add it as a full top-level project or put it under llvm/utils.
- Should include a notice (e.g. in the project README or web page) that the project is in ‘incubation status’ and is not included in LLVM releases (see suggested wording below).
  - This is easily done. Pending acceptance of this RFC, we will add the notice when the time comes to actually move things.
We’re interested in seeing what the broader LLVM community thinks about this proposal, and we look forward to future collaboration to enable more efforts towards large-scale compiler testing and machine learning in compilers.
The code is available to view as a top-level project in llvm/llvm-project Pull Request #72320, "[Dataset] Upstream llvm-ir-dataset-utils" by boomanaiden154, although it would end up in an incubator repository pending acceptance of this RFC.