Discussing feasibility: Generating Tablegen files for easier LLVM backend development?

Hello, LLVM Community!

I want to preface this post by telling you that I am new to LLVM, but I am very much eager to learn.

I am currently at the end of my Master’s studies. For my Master’s thesis project, my supervisor and I had the idea that a good research opportunity would be to try to speed up the LLVM backend development process. Since the Tablegen files are a big part of that process, this seems to be an area where improvements can be made.

The idea would be to have some description of the target (Registers, Instructions, Calling Conventions, …) in a more human readable format (for example in YAML, JSON or even XML) as the input, and the output of the tool would be the Tablegen files (at least a rough version of them where most of the boilerplate is generated). Starting from this, one could piece together the classes generated by Tablegen, write the MC layer and such, and in theory it could work to have a HelloWorld program running on that target, compiled with this backend, clang being the frontend.
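To make the idea a bit more concrete, here is a rough sketch of what such a generator could look like for registers only. The JSON schema (the keys "namespace", "registers", "name", "asm") and the function name are made up for illustration; the only thing assumed from LLVM is the Register<string n> base class from llvm/Target/Target.td.

```python
import json

# Hypothetical target description; this schema is an assumption made for
# the sketch, not an existing format.
DESCRIPTION = """
{
  "namespace": "Toy",
  "registers": [
    {"name": "R0", "asm": "r0"},
    {"name": "R1", "asm": "r1"}
  ]
}
"""


def emit_registers(description: dict) -> str:
    """Render the register entries of the description as TableGen text."""
    ns = description["namespace"]
    lines = [f'let Namespace = "{ns}" in {{']
    for reg in description["registers"]:
        # One TableGen record per register, derived from the Register class.
        lines.append(f'  def {reg["name"]} : Register<"{reg["asm"]}">;')
    lines.append("}")
    return "\n".join(lines)


if __name__ == "__main__":
    print(emit_registers(json.loads(DESCRIPTION)))
```

Instructions, register classes and calling conventions would need much more than this, so the sketch only shows the shape of the tool, not its real scope.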

As a test target for this project I was thinking of a toy architecture just to kick things off, and then see how it works when I move on to real MCUs.

My question to you is if this is a feasible thing to do. Maybe you know of an existing project that already does something similar?

I hope this is a good place for my question.


Have you looked at MDL:

Thank you very much! This does look quite promising. Will look into it and try it out!

Note that Tablegen is widely used in LLVM for instructions, diagnostics, and more.

I feel like instead of

YAML/JSON -- (your tool) --> TableGen Text -- (llvm-tblgen) --> Generated Source

It would be less convoluted to translate YAML/JSON directly to TableGen’s in-memory representation (e.g. llvm::Record, see files in include/llvm/TableGen):

YAML/JSON -- (your tool) --> TableGen In-memory Objects -- (existing TableGen backends) --> Generated Source

This saves the round trip through textual TableGen files, which would otherwise be written out only to be parsed back right away.

But before that, I have a more fundamental question: I don’t see why YAML/JSON/XML would be more “human readable” in this case. More specifically, how can you express instruction definitions in a concise way?
TableGen is really good at reducing the amount of boilerplate and repetition, which occur a lot in backend development. It does this by factoring out the common parts of related instruction definitions. I don’t see how you can do that with JSON / YAML / XML and achieve the same level of tidiness (which is an important factor for readability IMHO) without bloating two simple instruction definitions into 50+ lines of JSON / YAML / XML.

In fact, I think JSON / YAML / XML are better suited as machine-readable formats than human-readable ones in this case: llvm-tblgen already has a TG backend that generates a JSON file from the input TableGen code (i.e. llvm-tblgen --dump-json ...). You can try it out on real instruction definition TG files and see how much less human readable the resulting JSON files are.
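As a rough illustration of the point about factoring, the Python sketch below imitates what a TableGen class does: the shared fields are stated once, each instruction supplies only what differs, and the flat JSON form repeats every shared field per instruction. All field names here are hypothetical, and the output is not the actual --dump-json schema.

```python
import json

# Shared fields, stated once; this plays the role of a TableGen class.
# All field names are hypothetical.
ALU_COMMON = {
    "format": "R",
    "operands": ["rd", "rs1", "rs2"],
    "asm": "{mnemonic} $rd, $rs1, $rs2",
}

# Per-instruction differences; this plays the role of the individual defs.
ALU_DEFS = [("ADD", 0b000), ("SUB", 0b001)]


def expand(common: dict, defs: list) -> list:
    """Flatten every def into a standalone record with all fields repeated."""
    records = []
    for name, opcode in defs:
        record = dict(common)  # copy the shared fields into each record
        record["name"] = name
        record["opcode"] = opcode
        record["asm"] = common["asm"].format(mnemonic=name.lower())
        records.append(record)
    return records


if __name__ == "__main__":
    flat = json.dumps(expand(ALU_COMMON, ALU_DEFS), indent=2)
    print(flat)
    print("factored entries:", len(ALU_DEFS))
    print("flat JSON lines:", len(flat.splitlines()))
```

Adding a third instruction costs one tuple in the factored form but a whole additional record in the flat form, and that gap grows with every shared field.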

Creating a “better” language than TableGen is an often-discussed idea, so I think it’s great that you’re thinking about this. I do agree with the points that mshockwave made regarding YAML/JSON/etc, and the template-like features of TableGen that make it easier to define classes of things, particularly instructions. Why don’t you consider creating a first-class declarative language for defining instructions, operands, registers, calling conventions, etc, if the goal is a more human-understandable language?

That said, using a “toy architecture” will probably not expose enough of the crufty-ness of TableGen to really demonstrate the value of a different language. Maybe it would be better to start out with a real, simple architecture, like RISCV, and see if your description is indeed easier to write and understand.

And do take a look at the MDL stuff (my repo). I do in fact have a language for defining registers, register classes, operands, and instructions. Those definitions aren’t intended to replace TableGen definitions; they only communicate some of the information about those definitions to the MDL compiler. Maybe that would be a good place to start.

At any rate, I’m happy to answer questions about it (MDL).

Thank you very much for your reply!

After thinking a bit more about it, I think you’re right; writing the description in YAML/JSON etc isn’t really much better than writing things directly in Tablegen. The problem I’m trying to solve is two-fold, and maybe I should’ve explained it a bit better in the original question.

  1. I’m trying to make it easier for a developer who has little to no experience in LLVM to create a (minimal) backend. That would be useful for instance in rapid prototyping, when you’re working on developing your platform.
  2. I’m trying to make it easier to update to new LLVM versions. Porting a backend to newer versions of LLVM can be a real pain, as the APIs change somewhat frequently and break compatibility downstream. I’m not entirely sure how this can be solved, I’m just trying to think of a solution for that.

In any case, thanks a lot for taking the time to respond! And also thank you for contributing with those talks you have on YouTube! :slight_smile:

I believe there is more value in writing a tutorial/book on how to write a backend than replacing Tablegen by XXX.

MDL is interesting because it supports different ways to specify scheduling models.

Thank you for taking the time to reply!

Why don’t you consider creating a first-class declarative language

I admit, I did not consider writing a new language for this purpose, and truth be told, it’s probably out of my league :smiley:.

That said, using a “toy architecture” will probably not expose enough of the crufty-ness of TableGen to really demonstrate the value of a different language. Maybe it would be better to start out with a real, simple architecture, like RISCV, and see if your description is indeed easier to write and understand.

True that. Also RISCV is well established already in LLVM, and I would have a very good baseline for comparison.

And do take a look at the MDL stuff (my repo). […] I’m happy to answer questions about it (MDL).

I also took a short look at MDL; it looks really promising and could indeed be a good place for me to start. If I have questions about MDL at some point, should I message you directly, or should I post them in your thread about it so they’re visible to everyone?

LLVM as a project makes absolutely no guarantee of API and ABI stability (except for minor release versions, which are API and ABI compatible with their major release version), especially for C++ APIs, and this is an intentional design goal. So unfortunately TableGen directives and other backend infrastructure are unlikely to have stable APIs either.

Agreed. I feel like better tutorials and documentation can definitely improve this situation.

Yes, please post it in the thread!

I appreciate the enthusiasm for trying to make LLVM’s internal tooling better. We need more of that! :slight_smile:

That said, I would caution that TableGen is the way it is because it supports a problem space with a lot of inherent complexity, and it’s easy to fall into an “80:20 trap” where you think you can do better with a cleaner solution, but then you only get 80% of the way to what TableGen does today. And as you try to close the remaining 20% gap, you bring back a set of problems comparable to the ones TableGen has, with the added downside that you’ve now spent years with two incompatible ways of doing the same thing.

So from a perspective of arguing in favor of solid software engineering, I would be far more comfortable with a proposal that involves refactoring the TableGen we have, or perhaps replacing one TableGen backend, etc.

Then again, your perspective is one of research, and so you may be fine with stopping before you encounter the “80:20” problem – and whatever research you end up with may help guide some future effort that follows a more solid engineering approach with the aim of actually being included upstream. So, you know, don’t let me stop you :wink: