TableGen Formatter

Description

The TableGen infrastructure plays a central role in LLVM and its sub-projects. Most notably, in LLVM backends, much target-specific information, such as instruction definitions and CPU features, is written in TableGen. MLIR also relies heavily on TableGen at its core: it is used to define dialects, custom operators, and even lowering rules, to name a few. Both subsystems contain a substantial amount of TableGen code: over 300 KLOC and 44 KLOC, respectively.

Despite the scale of TableGen code in LLVM, there is currently no good code formatter for the language. Checking the formatting manually during code review is virtually the only way to keep our TableGen codebase tidy. For MLIR, this simply doesn’t scale with its fast-growing project size and its countless downstream applications. For LLVM targets with smaller communities, like M68k, a tidy and readable TableGen codebase is essential to foster a friendly environment for more developers to join.

I’m proposing a GSoC project to create a formatter for TableGen, similar to clang-format. The tool is expected to format TableGen source code according to some predefined styles. A similar topic has been listed on MLIR’s Open Project list and was also discussed during GSoC 2020’s staging phase. I’m aggregating those comments and putting the details into the Expected results section below.

Personally I think this is a beginner-friendly project for any GSoC participant. It doesn’t require deep knowledge of LLVM or advanced compiler engineering, and the size is also digestible for a 3-month project. Not to mention there is already clang-format that can be taken as a reference.

Expected results

  • A prototype source code formatter for TableGen language.
    • Be able to format TableGen source code according to a primitive / basic formatting style.
    • It is encouraged to use the existing TableGen lexer and/or parser libraries. It’s also encouraged to wrap the core logic into libraries so that it can be reused in other tools (e.g. IDE extensions). The tool needs to be fast and lightweight, however, so performing full-blown / advanced semantic analysis is not encouraged.
    • It is encouraged to explore the possibilities of reusing components from existing formatting tools in LLVM, like clang-format.
    • Can be upstreamed eventually (the upstreaming process doesn’t need to start or finish during GSoC, though).
  • Discussing TableGen formatting style with the community (for example, what the “base” style should be), if time allows.
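
To give a feel for what a "primitive / basic formatting style" could mean, here is a minimal sketch in Python (for brevity only; the actual tool would presumably be a C++ library). It re-indents lines by brace nesting depth. The function name `reindent` and the two-space default are assumptions made for illustration, and the sketch deliberately ignores block comments, string literals containing braces or `//`, and multi-line `[{ ... }]` code blocks, which a real formatter built on the TableGen lexer would handle.

```python
def reindent(source: str, indent: str = "  ") -> str:
    """Re-indent every line by its brace nesting depth (hypothetical basic style)."""
    out = []
    depth = 0
    for raw in source.splitlines():
        line = raw.strip()
        if not line:
            out.append("")  # keep blank lines, drop trailing whitespace
            continue
        # Only count braces that appear before a line comment.
        code = line.split("//", 1)[0]
        # A line that starts with '}' closes its block before it is printed.
        leading_close = 1 if code.startswith("}") else 0
        out.append(indent * max(depth - leading_close, 0) + line)
        # Braces that open and close on the same line (e.g. bit lists) cancel out.
        depth = max(depth + code.count("{") - code.count("}"), 0)
    return "\n".join(out) + "\n"
```

Because braces are only counted per line, constructs such as `bits<4> Enc = {0, 0, 0, 1};` leave the depth unchanged, which is the behavior one would want from a first prototype.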

Desirable skills

C++ programming skills. General experience with coding tools (formatters, IDEs, even parsers) is preferable.
Optional / Nice-to-have: TableGen programming experience.

Project type

Small ~ Medium

Mentors

Min (@mshockwave)

Further readings

// end proposal
As mentioned in the proposal, I’m not the only one interested in this topic: @ftynse @River707 @antiagainst you are very welcome to chime in, and I appreciate your comments! We can also mentor this project together.

I think this is the wrong approach.

TableGen is already basically a subset of Clang itself, and having yet another formatter is just a waste of effort.

I think TableGen should be replaced entirely.

Add extensions to Clang to support TableGen’s syntax so that Clang can bootstrap itself; we already do this when compiling Clang with Clang.

There really is no reason to have a separate DSL when we already have a perfectly good parser at our disposal.

This would be a really nice thing to have indeed. I can help connect with the MLIR part of the community if necessary; our uses of TableGen may be quite creative compared to what it was originally intended for.

I would drop the TableGen part from here. Anybody who has enough experience with it is likely a contributor already and GSoC is mostly targeted at new contributors. General experience with coding tools (formatters, IDEs, even parsers) is probably a good thing.

How so? It has a completely different input language and does not depend on Clang in any way (the actual tablegen library, not Clang uses of it).

This is fantastic to see! TableGen would greatly benefit from having an easy-to-use formatter tool. I’m happy to help in any way, whether that be on the MLIR side, with code review, or with anything else you’d need.

clang-format technically has a TableGen language kind (though I don’t think it’s extensive). If clang-format is not being extended, we should (at least aspirationally) hope to remove that support.

I would encourage writing the tool as a library (as with almost everything else in LLVM). We could quite easily hook this into a TableGen VSCode extension to support formatting TableGen code in-editor (among other things).

Similar to what @ftynse mentioned above, but I would mark this more as encouraged or nice-to-have. TableGen isn’t too expansive, and I don’t think the important constructs are terribly hard to grasp with some time and help (many are similar to other languages, and we have quite a few examples).

– River

Fair enough, I’ll update this bullet point.

Fantastic! Thank you

Thank you!

I also noticed that, but unfortunately it seems to be a placeholder. IIRC clang-format only performs trivial formatting on TableGen (something to do with whitespace in a certain scenario). So I agree we can simply drop that language kind once this tool has landed.

Good point! I’ve integrated this into the description.

My opinion is similar to @ftynse’s: there are many cases where LLVM developers build LLVM without building Clang to save build resources (time, space, memory, etc.) during development. Integrating the whole of TableGen into Clang would force a dependency on Clang libraries and cancel out those benefits.

Also, I think your concern goes beyond the scope of GSoC; the LLVM Project category might be a better place to discuss it.

It also just makes no sense; Clang is about compiling imperative high-level languages to an imperative low-level IR, whereas TableGen is about compiling a declarative DSL to a bunch of tables that happen to be written in an imperative high-level language, and some minimal wrappers around them.

Disclaimer: one of the main clang-format contributors and reviewers here.

  • Building this tool on top of clang-format is not encouraged, because then we need to pull in clang libraries as dependencies.

I’m not convinced by this argument. When using clang-format, you don’t need to build it all the time; nobody does that. The fact that you work on and format TableGen code doesn’t mean that you need to build clang-format at the same time.
Also, the infrastructure is already there; it only needs some more love. There’s only very basic TableGen handling now. Handling all the possible syntactic structures of the TableGen DSL should be pretty straightforward (I have very limited knowledge of TableGen though, just talking from what I’ve seen in some .td files). Many existing parts of lib/Format could be reused as well.
Another point is that setting up a new tool would partially duplicate existing effort. IMO, that time would be better spent on the actual formatting implementation. Users would need to learn the new tool as well, whereas clang-format is already well integrated into IDEs, build scripts, etc. This might be a major drawback of a new tool.

Finally, I’d be happy to help review patches improving TableGen support in clang-format.

Ideally, we would have the generic formatting functionality in a common library and leave language-specific parts to frontends. I imagine one day flang may want flang-format, for example.

Given that TableGen is in effect several distinct DSLs as expressed through the different TableGen backends – stylistically, at least – isn’t it possible that there will emerge preferred formatting styles on a per-backend basis?

I can’t come up with any concrete examples, which might suggest it’s not as big an issue as I’m fearing, but my gut tells me we might want to format a DAG differently depending on whether it’s a list of register-class members in RegisterClass or an instruction-selection pattern in Pat, to take an example pulled out of thin air.

A complication arises there though in that these backends often operate on identical source files: see -gen-instr-info, -gen-asm-writer, -gen-asm-matcher and co all seeing a different “view” of XXXInstrInfo.td.

So I think the formatter would either have to be agnostic to all backend-specific records or have to provide specific formatting styles for certain named records on a file-by-file basis.

Maybe this is just a rehash of conversations had when people started using clang-format on C++ and the benefits of a blunt automated tool outweigh the negatives. I just wouldn’t be surprised if people fight to keep formatting how they like it in certain specific .td files but not in others. TableGen does seem particularly peculiar in being a language that attaches semantics to specific named records based on the currently-acting backend.

Should any of this be considered in the scope of the project at all? At least to clarify the outcome.

That’s a good point. Though I still don’t think we should piggy-back too many languages on the same tool, what might be a better solution is abstracting the common formatting infrastructure from clang-format to maximize code reuse among these formatting tools, including potential ones like flang-format mentioned by @ftynse .
Nevertheless, I’m inclined to remove the statement (i.e. Building this tool on top of clang-format is not encouraged…) from the proposal, or reword it in more general terms to leave more room for the participant to explore.

The issue you mention is definitely possible.
What I originally had in mind is that TableGen code in different parts of LLVM, for instance Clang’s Options.td versus an LLVM target’s XXXInstrInfo.td, has different styles (different processor vendors may even have their own formatting rules in their XXXInstrInfo.td). In that case, a per-folder customization (a “.tblgen-format” file or something similar) should be sufficient. And I feel like some of the issues you mentioned can be solved by this kind of localized configuration, too.
For example, since Pat is only used by LLVM targets (and probably MLIR), we can put a rule in llvm/lib/Target (and some of MLIR’s folders) to give Pat some special handling.
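
A rough sketch of how such per-folder lookup could work, written in Python purely for illustration: starting from the directory of the file being formatted, walk upwards and pick the nearest configuration file. The file name “.tblgen-format” is just the placeholder name used above, and `find_style_file` is a hypothetical helper, not an existing API.

```python
from pathlib import Path
from typing import Optional

# Hypothetical file name, taken from the placeholder suggested in this post.
CONFIG_NAME = ".tblgen-format"


def find_style_file(source: Path, config_name: str = CONFIG_NAME) -> Optional[Path]:
    """Return the nearest config file at or above the directory of `source`."""
    start = source.resolve().parent
    # Check the file's own directory first, then each ancestor up to the root.
    for directory in (start, *start.parents):
        candidate = directory / config_name
        if candidate.is_file():
            return candidate
    return None  # no config found: the tool would fall back to a base style
```

With this scheme, a config placed in llvm/lib/Target would take precedence over one at the repository root for every target’s .td files, while Clang’s Options.td would still pick up the root-level style.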

(Anyway, if GSoC participants don’t want to settle on a concrete resolution for this, they can always focus on some simpler rules first, like wrapping long lines. Again, they don’t need to create a full-blown formatter in 3 months.)

Regarding different stylistic rules for different subsystems, I would expect the TableGen formatter to get support for style configurations, similarly to clang-format. We can use different styles per project or per subsystem. This is probably beyond the scope of a GSoC project. Even getting reflowing for code and string literals would be a good outcome.
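
As an illustration of what a first reflowing rule might look like, here is a small Python sketch that greedily breaks an over-long line after commas and indents the continuation lines. The helper name `wrap_at_commas`, the 80-column default and the four-space continuation indent are all assumptions; the sketch also splits on every ", " it sees, so it would mis-handle commas inside string literals, which a lexer-based implementation would avoid.

```python
from typing import List


def wrap_at_commas(line: str, width: int = 80, continuation: str = "    ") -> List[str]:
    """Greedily break `line` after commas so that each piece fits in `width` columns."""
    if len(line) <= width:
        return [line]
    # Split on ", " and keep each comma attached to the piece on its left.
    parts = line.split(", ")
    pieces = [p + "," for p in parts[:-1]] + [parts[-1]]
    lines = []
    current = pieces[0]
    for piece in pieces[1:]:
        if len(current) + 1 + len(piece) <= width:
            current += " " + piece  # still fits on the current line
        else:
            lines.append(current)
            current = continuation + piece  # start an indented continuation line
    lines.append(current)
    return lines
```

A single piece longer than `width` is left as-is rather than broken further, which seems like a reasonable behavior for a prototype.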

Are there any updates on this? Thanks.

AFAIK none of the GSoC participants is interested in this proposal, so I don’t think it will be one of the GSoC projects this year.
But generally speaking, this is still a tool I would like to create in the near future.