Clang-format or some other auto-format for .mlir files

I’m directly editing .mlir files, and I’d like to have some automatic formatting, ideally using some sort of canonical format.

For example, let’s say I was writing a matmul file, I’d like if the below (poorly formatted) code could be automatically indented, perhaps have new lines inserted at appropriate places to increase clarity, etc.

// C += A * B.
func.func @matmul(%A
                  : memref<2048x2048xf64>, %B
                  : memref<2048x2048xf64>, %C
                  : memref<2048x2048xf64>) {
affine.for %arg3 = 0 to 2048 {
affine.for %arg4 = 0 to 2048 {
affine.for %arg5 = 0 to 2048 {
  %a = affine.load %A[%arg3, %arg5] : memref<2048x2048xf64> %b =
            affine.load %B[%arg5, %arg4] : memref<2048x2048xf64> %ci =
                affine.load %C[%arg3, %arg4] : memref<2048x2048xf64> %p =
                    mulf %a,
    %b : f64 %co = addf %ci, %p : f64 affine.store %co,
    %C[% arg3, % arg4] : memref<2048x2048xf64>
}
}
}
return
}

clang-format seems like the right tool for this job, and is integrated into most IDEs. However I’ve not been able to find a .clang-format configuration for MLIR that gives me what I want. There is the .clang-format in the root of mlir, but I think this is just for the C++ code of MLIR. Running clang-format using this config does not improve the file, in fact it just breaks syntax because it inserts spaces between % and the SSA variable name, e.g., %A -> % A.

We have some MLIR files in the test directory, e.g., here, which look nicely formatted. However I am unsure if this has been done automatically, or through the toil of contributors.

Is there a .clang-format for MLIR files, or some other canonical way to automatically format them?

If I run just mlir-opt [my_mlir_file] with no optimizations, it provides the type of formatting I’m looking for. However it also applies some small yet significant changes (e.g., changing my SSA names). Nevertheless, it shows that auto-formatting is available somewhere. Is there a way I can get this automatically in my editor?

2 Likes

Mlir-opt is the closest yes. Naming is interesting one (well and comments). It would be quite interesting to explore more!

(which is saying, no dedicated tool I’m aware of but would be a nice contribution with some reusable components already).

Considering clang-format and MLIR are both in the same monorepo, could it make sense to at least not treat .mlir files as C++ by default? Perhaps by adding a case to https://github.com/llvm/llvm-project/blob/e1f911e40ce6ad4a7f393ea1b33e65b24940eb84/clang/tools/clang-format/ClangFormat.cpp#L78-L96 ?

Someone on my team accidently ran clang-format on a large lit test file and ended up losing quite a bit of time undoing the changes it made (yes, editor history or version control would have helped… hindsight is 20/20).

I think the best we can do clang-format side is to tell it about .mlir files and then just let it not do anything. It looks like it needs to identify a file type (filetype none being commented to “do not use”), and there doesn’t seem to be a good other option for MLIR files there (not sure which of json or tablegen or … would be least destructive :slight_smile: ).

Most of the nicer formatting mlir-opt does is due to custom ASM syntax which is registered with MLIR. If we wanted that in clang-format, we’d need to have clang-format depend on MLIR which seems undesirable at first glance.

True… and custom syntax definitions are regularly placed out-of-tree, so we could end up with a mlir-clang-format, foo-mlir-clang-format, bar-mlir-clang-format (similar to how mlir-lsp-server needs downstream versions to pull in custom dialects).

The vast majority of MLIR files are likely to be generated by mlir-opt, rather than by hand.

Therefore, it probably makes sense to use mlir-opt in some way to do the formatting for user generated cases too. I can imagine something like mlir-opt --format sample.mlir.

As highlighted above, mlir-opt with no flags gets us most of the way already, but has a couple of undesirable features:

  • it changes variable names
  • it removes comments
  • it wraps everything in a module

I can imagine an extension to mlir-opt that allows these behaviors to be disabled, which I think gives us everything we need. It has the advantage of:

  1. supporting new dialects (since anything that works with mlir-opt should also work here
  2. following any changes made to the formatting in mlir-opt by default.

I’m taking a look at this with an out-of-tree codebase, based on standalone-opt. You can find my in-dev repo here. Right now I’ve just moved MlirOptMain.cpp into the tree, and am looking at trying to achieve the following features:

  • keep the original variable names
  • keep the comments (and later, format them)
  • not wrap the MLIR code in a module

Arguably these features are handy for other use-cases too (as optional config flags).

I imagine comment stripping would be done at the lexing stage, so I’d need to find a way to store those. I guess that the name changes are done during parsing.

Any comments or suggestions appreciated.

For “not wrap the MLIR code in a module”, we already have the flag --no-implicit-module. However, for some inputs passing this causes an error, e.g.,

error: source must contain a single top-level operation, found: 3

Therefore, for this feature, we would need to allow the single top-level operation requirement to be relaxed (optionally, of course).

Regarding comments, as you might expect, they are stripped out by Lexer.cpp:L257.

Right now the design I’m thinking is tightly coupled to the existing mlir-opt codebase. I’m imagining adding a new token, COMMENT, so that it can be optionally preserved when we reformat the code.

One issue this could cause is with syntax checking - I may need to insert logic there to skip comments, otherwise the behaviour may change due to the presence of this new token. Since most MLIR programs would not include comments, I wonder if is an acceptable overhead to include this if check on the critical path, or if I should try and keep it as separate as possible

Textual parsing isn’t considered a “performance critical” use-case (Bytecode should be used instead).

I would be mostly worried about the added complexity to the codebase, but it’s hard to judge without knowing the actual proposed changes.

It’s also not clear why you’re coupling this to mlir-opt? Wouldn’t you want a mlir-format standalone binary, with tighter coupling to the lexer / parser instead?

Good suggestion regarding a dedicated mlir-format binary. I’ve got a clone of mlir-opt now, and will slice off things not required (e.g., all the opt flags) later.

I’m now working in tree now, with this branch. I think that comment handling should not increase codebase complexity much, since it’s just one extra token with very little logic required. We can diff with main once I have a PoC to see if it’s acceptable.

Regarding keeping the original variable names, I could see that adding some complexity if it is built directly into the main lexer/parser infrastructure. I don’t want to mess with the renaming etc that it’s already doing, just restore for formatting reasons. Maybe storing the original names in a hashmap or something, that can be looked up in the final stage of formatting?

Right: but the complexity I’m thinking of isn’t in the Lexer, instead you need to be able to hook the parser to be able to trace every method there so that after parsing you get the IR (which you can discard) but keep a parse-tree built in parallel on which you’ll be able to do the formatting :slight_smile:
(I don’t know if you had something else in mind here?)

Right now, mlir-opt can already generate and print formatted code, which is done on MlirOptMain.cpp:L403. I’m looking through the logic here to understand what it’s doing and how. But with my partial perspective right now, this suggests that most the logic I need for formatting is already implemented here, perhaps in the AsmState class, which states its purpose as “This class provides management for the lifetime of the state used when printing the IR”.

This is printing the IR, which you say I can discard, and just keep the parse tree. I think the approach I’ve been considering is that formatted IR printing works right now, so why not use that, and apply a couple of extensions and/or recovery steps to ensure it follows our requirements of preserving more original information?

  • Comments now (optionally) preserved in the IR
  • Original SSA variable names are restored (we store a dictionary of them at the point that e.g., %A -> %alloc, %B -> %alloc_0, etc).
  • We relax the requirements around module wrapping, or just remove any surrounding module declarations that weren’t there already.

From a design perspective, just using the parse tree is more elegant, since going from IR is extra steps that arguably aren’t needed. But from a codebase complexity and implementation, I believe that it might be simpler to go from IR?

I’m not convinced actually: you will need to keep all the information from the parseTree, but in a side-data structure that has to connect to the IR.

Then you’ll need to some hook all of that in the printer infrastructure.
You may be able to make it so that you can preserve comments and variable names, but you won’t control anything else: does this fit what you want to achieve? We’ll be far from clang-format flexibility.

That’s the hard part, when dialects have custom parsers/printers.

tpp-opt doesn’t quite know what the canonical form is, it just parses and prints back out “in canonical form” because of the parser/printer.

Exactly! But again, having a completely separate clang-format for MLIR would require us to teach syntax and semantics of MLIR to it, which changes wildly (more so than spec-based languages like C++ and Python), so we’d end up with an arms race of sorts.

We do have a similar problem locally (when writing test variations, for example) and what I do is just write some IR, pass it through mlir-opt -canonicalize and work with the textual result in an editor. That loses a lot of stuff, unfortunately.

Trying to hook that up in an IDE would require some external text parsing, which could be more tractable than re-writing the entire MLIR semantics (including custom dialects) in clang-format.

While I agree with both that if we wanted full clang-format then that seems rather tricky (well not tricky, just keeping a lot of information). But “good enough” one could get something perhaps without too much pain (the most complicated is comments in arbitrary spots as then you have to do something with the printer to have it insert line breaks - if we are willing to give up on that and say comment before or next to operation, then easier). I think this also may be what Renato was saying in the last paragraph.

E.g., if we wanted to do comments only on own lines, then convert to custom operation "mlirformat.comment"{str="foo"} : () -> () as pure textual preprocessing (basically the “parser” there only understands comment or not and prints out operations), pass all through the regular parser and printer (where dialects have to be registered same as opt tool to get custom syntax), and then post process final string with unwrapping those back to // foo. Elegant? No. But simple :slight_smile: [could expand to have attribute on_same_line_as_next_op]. The parser can already populate an AsmState while parsing so you could get back original names in printing I believe. Module wrapping could also be a textual post-processing (given the module inserted would be trivial and effectively you have line at top, line at end and extra indent).

Honestly I’d be happy with even just comments, and was considering if one could just abuse the location information during parsing :slight_smile:

I have a commit here that begins to implement this idea. With COMMENT added as a token, we can then parse the token and store it as an op. For now, it is an unregistered dialect, so with that I can turn this:

func.func @add_one(%a: f64) -> f64 {
    // this function adds 1 to the input
    %cf1 = arith.constant 1.00000e+00 : f64 // this comment here
    %co = arith.addf %a, %cf1 : f64
    return %co : f64
}

into this:

module {
  func.func @add_one(%arg0: f64) -> f64 {
    "// this function adds 1 to the input\0A"() : () -> ()
    %cst = arith.constant 1.000000e+00 : f64
    "// this comment here\0A"() : () -> ()
    %0 = arith.addf %arg0, %cst : f64
    return %0 : f64
  }
}

Adding an mlir-format dialect which prints it as expected shouldn’t be too challenging. Adding logic to handle comments which are appended to an existing line (e.g., // this comment here would be needed).

I don’t think that MLIR supports block comments right now.

clang-format flexibility would be nice, e.g., one could customise the output. However, to make some assumptions: unlike a user-facing programming language (such as C++), MLIR users are probably not going to be writing huge amounts of MLIR — they will write test cases, or try and figure out what is going on in some intermediate stage of generation. I find it hard to imagine someone with a full codebase/project with just MLIR files.

Given this, I believe the benefit of a fully-featured code formatter might be marginal, but getting the basics down seems worthwhile.

I also agree that custom printers is also a big concern.

I’ve got an initial version of mlir-format working, which implements comment processing and module wrapping removal. See later in this post for my thoughts on variable name reteniton. I’ve squashed all my changes into commit d3504b on my personal mlir-format branch.

From a high-level design perspective, here’s how the current version of mlir-format works:

  1. mlir-format binary which is essentially just a stripped down clone of mlir-opt, take a buffer as input
  2. Comments (//) now generate a token, which is then processed in the AsmParser. I made sure that lexing // is non-default behaviour, by adding an opt-in option in MLIRContext
  3. The parser generates a mlirformat.comment, which has the comment string as an attribute
    • The mlir-format dialect is about as lightweight as a dialect can be right now, if this project was upstreamed it might make sense to merge it with an existing dialect
  4. The rest of IR generation occurs as normal
  5. The processed IR is returned back to the main mlir-format code (and the inelegant part begins!)
  6. We read the generated IR line-by-line (as a string!), and if the line starts with "mlirformat.comment", then we reformat it. This is done by re-parsing this line, and extracting the comment string attribute.
    • Compared to using a regex, reparsing requires more code, and probably more runtime execution. However I think it’s a bit more robust and extensible.
  7. Finally, we remove the wrapped module (if it wasn’t there already), as well as the extra level of indentation generated. This step could be fused with step 6., and save a full traversal of our data. But I’m leaving this optimisation out during development for clarity.

Some additional features which I think would be useful:

  • inline. right now, inline comments are placed on the next line. this could be an attribute
  • line splitting: if a comment is too long (>80 characters? including indent?), try splitting it into multiple lines, assuming we have whitespace we can split on
  • Make -allow-unregistered-dialect default behavior.

I’m working on some tests right now, but am sharing this update to solicit feedback.
I can make a draft GitHub PR if more specific feedback on the code is needed. I may need some shepherding as this is my most substantial contribution to the project so far.

Variable name retention

As discussed earlier, an important feature of this tool is to retain the original variable names in the file.

Speaking with another MLIR user, I think that variable name preservation could/should be added as an mlir-opt pass, as it could be useful for general debugging too. Given this, I will work on that separately, and make another forum thread to discuss design ideas.

Identifier names for values may be used in an MLIR text file but are not persisted as part of the IR - the printer will give them anonymous names like %42.
Source: docs/LangRef.md

Not persisting variable names is a pretty core part of the current design, so I’m going to have to be wary that I don’t overcomplicate the critical path with this solution.

1 Like

I suspect that MLIRContext is not the place to add this kind of options: it’s not related to the MLIRContext but specific to the parser and should likely be threaded through as such (or just always enabled: textual parsing/printing isn’t a performance-sensitive component, we have bytecode for high-performance serialization/deserialization).

This all likely requires some careful consideration, and overal; I am still not totally convinced by the approach at the moment. It looks like we’re aiming for some sort of very “local optima” to have a quick solution for “preserving comments” without a clear path for long term improvement.
It also creates some coupling between the parser/printer and the IR for the specific handling of the comments (storing these in the IR) you have here, in an unprecedented way.

By adding a pre-processing stage in mlir-format for // lines, I could generate the mlirformat.comments without having to touch the core lexer or parser at all. This could keep the comment handling entirely contained to the mlir-format tool, and sidestep some of these issues? That pre-processing could still be leveraged by other systems that wanted to keep comments for whatever reason.

Regarding long term improvement, what sort of things do you have in mind? E.g., a path to more fine-grained formatting control, such as in clang-format?