[RFC] IR2Builder - An LLVM IR converter to C++ IRBuilder calls

IR2Builder

Hello, I have created a tool, IR2Builder, for converting LLVM IR into C++
IRBuilder API calls. The purpose of this post is to gather ideas and see
whether this would be a valuable addition to the set of tools in the LLVM
repository.

The tool itself is not fully finished (see the testing section below), but it can
already be (and has been) used. It can be found in my fork of LLVM: GitHub - mark-sed/llvm-project at ir2builder.

Motivation

The main motivation for me was all the cases where I had to generate functions
using IRBuilder. It is not always possible to use a .ll file and simply load it,
as there might be compile-time or runtime dependencies.
In almost all of these cases I first wrote the LLVM IR version and then
rewrote it using IRBuilder API calls. This tool helps with that rewriting.

Another use case is for learning LLVM. When I started learning LLVM (not that
long ago) I had trouble with code generation and with understanding all the
connections in IRBuilder. With this tool, one can write the LLVM IR and then see
how it can be constructed.

For some, it might be useful to convert their always-included .ll files
into C++ code.

Implementation and use

The current implementation is a standalone LLVM tool implemented in
one .cpp file. It uses LLVM to parse the input IR file, then traverses it and
generates C++ code in textual form.

Currently it is possible to generate just a function (probably the most useful
mode), but also a whole module or a whole program with main. The latter is mostly
for testing purposes.

Users can also specify the variable names used for the IRBuilder, LLVMContext,
and Module, as well as the llvm namespace scope, so the output fits straight
into their code.

./ir2builder code.ll -o code.cpp
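To illustrate the kind of output, here is a hand-written approximation (not the tool's verbatim output; the `ctx`, `mod`, and `builder` names stand in for the user-configurable names mentioned above) of what a trivial function might convert to:

```cpp
// Input IR (for reference):
//   define i32 @add(i32 %a, i32 %b) {
//   entry:
//     %sum = add i32 %a, %b
//     ret i32 %sum
//   }
//
// Roughly equivalent IRBuilder calls:
llvm::Type *i32 = llvm::Type::getInt32Ty(ctx);
llvm::FunctionType *fty =
    llvm::FunctionType::get(i32, {i32, i32}, /*isVarArg=*/false);
llvm::Function *f = llvm::Function::Create(
    fty, llvm::Function::ExternalLinkage, "add", mod);
llvm::BasicBlock *entry = llvm::BasicBlock::Create(ctx, "entry", f);
builder.SetInsertPoint(entry);
llvm::Value *sum = builder.CreateAdd(f->getArg(0), f->getArg(1), "sum");
builder.CreateRet(sum);
```

Even for a function this small, the IRBuilder version is several times longer than the IR, which is exactly the rewriting the tool automates.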

Testing and known issues

I created a simple bash script which ran this tool on all .ll files intended
for opt in the whole llvm/test directory:

  1. It created the fully runnable C++ code (./ir2builder test.ll --runnable > test.cpp)
  2. Then compiled it using llvm-config (g++ test.cpp $(./llvm-config ...) -I./include/ -o test)
  3. Ran the generated binary (./test > generated.ll)
  4. Compared the original with the generated IR using llvm-diff (./llvm-diff test.ll generated.ll)

A test was considered successful (pass) if there was no compilation error and
llvm-diff succeeded. Tests not intended for opt are marked as ‘skipped’.
I can share the full results (in a Google Sheet or similar), but here are the
main numbers:

| Test set | Passed | Failed | Skipped |
|---|---|---|---|
| Analysis | 1003 | 450 | 3 |
| CodeGen | 501 | 301 | 18166 |
| Transforms | 5895 | 3581 | 162 |
| Assembler | 18 | 16 | 436 |
| Bitcode | 38 | 10 | 223 |
| Examples | 5 | 2 | 4 |
| Feature | 15 | 6 | 62 |
| Instrumentation | 355 | 101 | 2 |
| LTO | 30 | 16 | 104 |
| Other | 84 | 40 | 27 |
| ThinLTO | 99 | 79 | 104 |
| Verifier | 16 | 6 | 317 |
| Σ | 8059 | 4608 | 19610 |

Pass rate: 63.6 %

Fail rate: 36.4 %

I would personally prefer a higher pass rate, but most of these failures
are tied to known issues which can be fixed (see below).

Known issues

  • Incorrect GEP return type generation - This is mostly just me not
    knowing the correct call to get the return type that needs to be provided to
    the CreateGEP call. This one should be easy to fix, and in the test set I can
    see 1030 mentions of this error (there can be multiple in one test).
  • ConstantExpr creation - Once again, I was not sure how to correctly generate
    a ConstantExpr from the IR; this error is present 17316 times in the test set,
    so adding this should help the pass rate the most.
  • BasicBlock addresses - Every function is generated in its own scope to
    avoid clashes between same-named variables across functions, but this is an
    issue for BlockAddress creation.
  • Float errors - llvm-diff sometimes fails because of differing floats, which
    is caused by rounding errors and not-always-exact printing of the value.
  • Incorrect tests - Some tests fail when being parsed and loaded into the tool.
  • Missing call bundles
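For the GEP issue specifically, a possible fix (a sketch, assuming the emitter visits the original `GetElementPtrInst` while generating code) is to read the type straight off the visited instruction; with opaque pointers, `CreateGEP` takes the *source element type* rather than the result type:

```cpp
// Hypothetical emitter helper: when printing a CreateGEP call for an
// existing GEP instruction, the type argument can be recovered from
// the instruction itself.
void emitGEP(llvm::GetElementPtrInst &gep) {
  llvm::Type *srcTy = gep.getSourceElementType(); // type to pass to CreateGEP
  llvm::Type *resTy = gep.getResultElementType(); // element type of the result
  // The emitter would then print something like:
  //   builder.CreateGEP(<srcTy as a C++ expr>, <pointer operand>, {<indices>});
  (void)srcTy; (void)resTy;
}
```

`getSourceElementType()` and `getResultElementType()` are existing accessors on `GetElementPtrInst`; whether this matches what the tool needs at each call site is of course for the implementation to confirm.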

The current output is not well formatted, but the idea is that the generated
code can easily be run through clang-format, which sorts this issue out.

Current and general solution for known and unknown issues

The tool expects that new constructs will be added to LLVM before the tool is
updated to support them, and for these (as well as the known issues) there is a
simple approach - let the user fill them in. Currently, for missing parts such
as ConstantExprs, ir2builder places a comment such as /* TODO: ConstantExpr creation */
at the point where the construct should be created, and the user can write it
there. This will also usually fail compilation, making the comment visible in
the error snippet.

Summary

This tool still needs some tweaks (which I am happy to add), but currently I
find it to be in a usable state.

It is also designed to bring some value even when it is missing newer
constructs, and it could be used for learning as well.

I personally think it would be a good addition to the LLVM tooling.

What do you think?


I have implemented something like this in non-LLVM based compilers as a workaround for the IR not being serializable, but still needing to inject canned sequences of code into existing IR. With LLVM, these canned sequences can easily be injected as bitcode (on disk or embedded in the binary). Also, with LLVM bitcode lazy loading, it allows us to build one bitcode library of expansions and only load the ones that are actually used in memory. What are the cases where reading bitcode is not possible?

This idea is neat, I just wonder if we have a stronger motivation and use case. One issue with linking bitcode is that types are fully baked in, so if you had 2 almost identical functions with just “slight” differences (in types, or some special-case handling, etc.), with bitcode the entire function is duplicated. Maybe a C++ code emitter could take an entire bitcode file (a library to be linked) and generate C++ code that can handle these slight variations in the same code (i.e., it can generate N variants from the same C++ code with some runtime checks). There might be other ways to do this as well.


Honestly I don’t have too much experience with LLVM bitcode, but I find your idea quite interesting and something that seems possible to implement.

We had this many years ago… at the time, we called it the “C++ backend”. We got rid of it due to lack of maintenance; from what I recall, it broke, and nobody even noticed until years later.

If we want to revive the idea, I’m not exactly against it, but who do we expect to use it? How do we ensure the code doesn’t bitrot?


What are these dependencies? Maybe you could load the .ll file and update it in memory instead of generating the whole thing?

I did not know that there was such backend, very interesting.

but who do we expect to use it?

I think there are people who, like me, fit one of the cases I described in the RFC, but one of the reasons for creating this RFC was to see whether there are people who could find use for it.

How do we ensure the code doesn’t bitrot?

That is a very good question, which I have been thinking about. The current approach is done in a way that the tool expects not to know all the constructs, and in such cases it generates a /* TODO: ... */ comment with a description, which the user can fill in. Of course, if there is a change in the API and arguments are shuffled or names are changed, this will be visible only once the code is compiled.
I was thinking about using some of the tests I ran this on as actual LLVM tests for this tool, which should uncover issues with a changed API.
And I personally don’t mind fixing issues or adding some of the new constructs. At the same time, I would guess (correct me if I am wrong) that LLVM’s builder and the structures it touches do not change as much now as they did years back.

What are these dependencies?

In my case these were, for instance, different targets. For ARM I needed different inline asm. Or it might have been a different version that changed some function signature, so a new argument needed to be added (while other versions had to be supported as well).

Maybe you could load the .ll file and update it in memory instead of generating the whole thing?

Well, yes, that is an option as well, but using IRBuilder and inserting the function that way is also one. In the inline asm case you would get only one version of it in the .ll file; you would then need to modify it, and the other versions would be hidden.
With this tool you could write the .ll code for one of them, generate the IRBuilder code, and then condition the parts that differ.
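To sketch what that conditioning could look like (the `isARM` flag and the asm strings are placeholders, not the actual sequences I used), the generated IRBuilder code for the inline asm part can branch at C++ level:

```cpp
// One generated function body, with the target-dependent inline asm
// chosen at compile time of the host compiler, not baked into a .ll file.
llvm::FunctionType *asmTy =
    llvm::FunctionType::get(llvm::Type::getVoidTy(ctx), /*isVarArg=*/false);
llvm::InlineAsm *ia =
    isARM ? llvm::InlineAsm::get(asmTy, "dmb ish", "", /*hasSideEffects=*/true)
          : llvm::InlineAsm::get(asmTy, "mfence", "", /*hasSideEffects=*/true);
builder.CreateCall(asmTy, ia);
```

With a single .ll file you could only ever carry one of the two asm strings; here both variants live next to each other in the generated C++.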

I’d imagine the bitcode solution would quickly become intractable as the number of loaded functions grows and they are impacted by different options/parameters.

I.e. if you have (pseudocode)

define @foo(%arg1 (, %optional_arg2)) {
  <long body>
  if (OptionA) {
    <some IR>
  } else {
    <some IR>
  }
  if (OptionB) {
    <some IR>
  } else {
    <some IR>
  }
  <long body>
}

You’d have to prepare bitcode modules for each possible combination of options (and also compile them from somewhere).


We have a use case for something like this in LLPC, our LLVM-based shader compiler, though it is a matter for debate whether an IR-to-C++-IRBuilder-code tool is really the right solution.

There are many places in LLPC where we need to lower some operation into a non-trivial piece of code – often including control flow in the resulting code, and often including statically determined options like in @danilaml’s pseudocode example.

LLVM’s “batteries included” “solutions” to this problem boil down to:

  1. Hand-written IRBuilder code.
  2. Write the code in LLVM IR, or in some language (e.g. C) that can be compiled to LLVM IR, to form a “runtime library” Module that is re-generated for each compilation and linked in, with inlining and cleanup passes applied afterwards.

The problem with the first approach is that it is very difficult to maintain due to the complexity of the inserted code – especially where it includes control flow.

The problem with the second approach is compile-time, at least for us. LLPC behaves like a JIT compiler in the sense that typical applications invoke LLPC many times (1000s of times), and most of those invocations are quite small. Most invocations only need small parts of the “runtime library” Module, but the link-and-inline approach requires creating the Module for each compiler invocation (since linking is destructive).

One could try to fix this issue by careful splitting up the runtime library into multiple Modules, but that again introduces a maintenance burden.

What we ended up doing for production is write a CrossModuleInliner helper instead.

Basically, we fix the compile-time issue of the second approach by loading the “runtime library” Module only once per LLVMContext. Instead of linking it into the target module, we selectively inline functions directly from the “runtime library” Module into the target module.
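A rough sketch of the core idea (hypothetical names, not LLPC’s actual CrossModuleInliner API; error handling and several details omitted): copy just the needed callee from the long-lived library module into the target module, then inline it at the call site.

```cpp
// Both modules share one LLVMContext, so types and constants are compatible.
llvm::Function *importForInlining(llvm::Module &target, llvm::Function &libFn) {
  llvm::Function *clone = llvm::Function::Create(
      libFn.getFunctionType(), llvm::Function::InternalLinkage,
      libFn.getName(), target);
  llvm::ValueToValueMapTy vmap;
  for (auto pair : llvm::zip(libFn.args(), clone->args()))
    vmap[&std::get<0>(pair)] = &std::get<1>(pair);
  llvm::SmallVector<llvm::ReturnInst *, 4> returns;
  llvm::CloneFunctionInto(clone, &libFn, vmap,
                          llvm::CloneFunctionChangeType::DifferentModule,
                          returns);
  // Globals and callees referenced by the body would also need to be
  // imported/remapped; that bookkeeping is elided here.
  return clone; // calls to this clone can then go through llvm::InlineFunction
}
```

The point is that the library module is parsed once and never mutated, so repeated small compiler invocations pay only for the functions they actually pull in.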

There are some weaknesses with this approach. They’re not essential to us at the moment, though long-term it would be good to fix them – and that would be fairly easy to do if we could make some changes in the core LLVM inliner that are difficult to justify without the CrossModuleInliner use case. I personally would be very much in favor of upstreaming the CrossModuleInliner – it’s just not been that high on our priorities list.


“runtime library” Module + cleanup passes approach works when you have to support many “branching” implementations (based on some compile-time flags). Unfortunately, I don’t see a way around IRBuilder when you also need to support changing function signatures (there is no preprocessor for IR).


FYI there is now a pull request for this: IR2Builder tool for converting LLVM IR files into C++ IRBuilder API by mark-sed ¡ Pull Request #117129 ¡ llvm/llvm-project ¡ GitHub.