RFC: Incubator project for llvm-ir-dataset-utils

Some of it is fairly closely coupled to IR, like the extraction tooling (see RFC: Upstreaming elements of the MLGO tooling), but a lot of it is not, like the build system interfaces. The IR extraction tooling can also be swapped out for something else easily. Much of the post-processing infrastructure, the build-system interfaces, and the other dataset infrastructure should be pretty extensible. We already have plans to add support for Fortran with Flang, and as long as the correct flags are in place to emit HLFIR/FIR that we can collect, it should be relatively trivial. Using something like Polygeist, or Clang with ClangIR, to compile all the C/C++ applications should also be easily doable. Other sources would just need an additional builder.
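To make the "additional builder" point a bit more concrete, here is a rough sketch of what a pluggable builder layer could look like. This is not the actual llvm-ir-dataset-utils code: `Builder`, `BuildResult`, `register_builder`, `run_build`, and the artifact paths are all hypothetical names made up for illustration, and the builders only return placeholder paths rather than invoking any real build system or compiler.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass, field


@dataclass
class BuildResult:
    """Hypothetical record of what a builder produced for one source tree."""
    source_dir: str
    artifacts: list[str] = field(default_factory=list)


class Builder(ABC):
    """Hypothetical interface: one subclass per build system or frontend."""

    @abstractmethod
    def build(self, source_dir: str) -> BuildResult:
        ...


# Registry mapping a builder name to its class (illustrative only).
BUILDERS: dict[str, type[Builder]] = {}


def register_builder(name: str):
    """Class decorator that adds a builder to the registry under `name`."""
    def decorator(cls: type[Builder]) -> type[Builder]:
        BUILDERS[name] = cls
        return cls
    return decorator


@register_builder("cmake")
class CMakeBuilder(Builder):
    def build(self, source_dir: str) -> BuildResult:
        # Placeholder: a real builder would configure and build here,
        # then collect the emitted LLVM IR/bitcode.
        return BuildResult(source_dir, [f"{source_dir}/build/module.bc"])


@register_builder("flang")
class FlangBuilder(Builder):
    def build(self, source_dir: str) -> BuildResult:
        # Placeholder: a real builder would pass whatever flags are needed
        # to emit HLFIR/FIR and collect those files instead.
        return BuildResult(source_dir, [f"{source_dir}/build/module.fir"])


def run_build(kind: str, source_dir: str) -> BuildResult:
    """Look up the builder by name and run it on the given source tree."""
    if kind not in BUILDERS:
        raise ValueError(f"no builder registered for {kind!r}")
    return BUILDERS[kind]().build(source_dir)


if __name__ == "__main__":
    print(run_build("flang", "my_fortran_project"))
```

The idea is that the post-processing stages only ever see a `BuildResult`, so supporting a new source (or a different IR) would mean registering one more class rather than touching the rest of the pipeline.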

The current tooling makes some assumptions about LLVM IR that would need to change to get this to work, but it should be doable, and I would expect most of the pieces to be reusable already. Given that quite a few people are now interested in extending this to MLIR, we’ll keep that use case in mind as we refactor and iterate on the current tooling, so that things stay easy to extend in that direction.
