[RFC] Interleaving of auto-generated/hand-written MLIR documentation

Most dialects use gen-dialect-doc to fully auto-generate their documentation. However, it can be beneficial to complement the generated output with hand-written documentation. Dialects that do so do not use gen-dialect-doc. Instead, each part is generated via the appropriate tblgen option (gen-op-doc, gen-typedef-doc, …), and this auto-generated documentation is then included in the hand-written one. This requires additional effort, and the resulting dialect docs don’t have a uniform structure (the docs for some dialects lack a table of contents, others have two, …).

I recently discussed this with @River707 on Discord, and one alternative would be to inline the hand-written documentation into the auto-generated one. The result generated via gen-dialect-doc could be structured as follows:

# '${dialect.getName()}' Dialect

${dialect.getSummary()}
${dialect.getDescription()}

[TOC]
[include "Dialects/${dialect.getName()}.md"]

...

Here, the manually written documentation is inlined directly after the table of contents. The only change in the structure is the additional include. This can be generated with mlir-tblgen by default. The include itself is processed by copy_docs.sh and the newly added include just has no effect if the file to be included isn’t present.
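
To make the intended behaviour concrete, here is a minimal C++ sketch of the inlining step. It is not the actual copy_docs.sh logic (that is a shell script); the `FileMap` type and the function names are hypothetical, and files are modelled as an in-memory map instead of being read from disk. It only illustrates the two cases described above: a directive whose file exists is replaced by the file contents, and a directive whose file is missing simply disappears.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <string>

// Hypothetical stand-in for the file system: maps a path to file contents.
using FileMap = std::map<std::string, std::string>;

// Replaces every `[include "path"]` directive in `doc` with the contents of
// the referenced file. A directive whose file is missing is dropped, so an
// include emitted by default has no effect when no hand-written file exists.
std::string inlineIncludes(const std::string &doc, const FileMap &files) {
  const std::string open = "[include \"";
  const std::string close = "\"]";
  std::string result;
  std::size_t pos = 0;
  while (true) {
    std::size_t start = doc.find(open, pos);
    if (start == std::string::npos)
      break;
    std::size_t pathBegin = start + open.size();
    std::size_t end = doc.find(close, pathBegin);
    if (end == std::string::npos)
      break; // Unterminated directive: keep the remaining text verbatim.
    result.append(doc, pos, start - pos);
    auto it = files.find(doc.substr(pathBegin, end - pathBegin));
    if (it != files.end())
      result += it->second;
    pos = end + close.size();
  }
  result.append(doc, pos, std::string::npos);
  return result;
}

// Small usage example covering both cases.
void inlineIncludesExample() {
  FileMap files{{"Dialects/foo.md", "Hand-written text.\n"}};
  // File present: the directive is replaced by the file contents.
  assert(inlineIncludes("[TOC]\n[include \"Dialects/foo.md\"]\n", files) ==
         "[TOC]\nHand-written text.\n\n");
  // File missing: the directive is removed and nothing else changes.
  assert(inlineIncludes("[TOC]\n[include \"Dialects/bar.md\"]\n", files) ==
         "[TOC]\n\n");
}
```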

Looking at the 'llvm' Dialect - MLIR page, one could also think of the following structure:

# '${dialect.getName()}' Dialect

${dialect.getSummary()}
${dialect.getDescription()}

[include "Dialects/${dialect.getName()}-top.md"]
[TOC]
[include "Dialects/${dialect.getName()}-mid.md"]

...

Here, the hand-written documentation is split into two parts, which allows it to complement the description provided in the dialect’s tblgen definition. One could also insert explicit include statements after every auto-generated section: ${dialect.getName()}-ops.md, ${dialect.getName()}-types.md, and so on.
However, most dialects do not require this complexity. For example, for the GPU dialect it seems reasonable to move the first paragraph from the markdown file llvm-project/GPU.md at main · llvm/llvm-project · GitHub to the GPU dialect’s tblgen definition. A single hand-written doc file would be sufficient for the remaining parts.

In total, only minimal changes to mlir-tblgen (only affecting gen-dialect-doc) are required, and I have already refactored copy_docs.sh. This doesn’t break the ability to include auto-generated docs in hand-written ones, so we can still do both. The main changes of course apply to the docs, but the tblgen files defining the dialects could also be touched.

Looking forward to your comments :slight_smile:

Indeed, so gen-dialect-doc currently does

  os << "# '" << dialect.getName() << "' Dialect\n\n";
  emitIfNotEmpty(dialect.getSummary(), os);
  emitIfNotEmpty(dialect.getDescription(), os);

  os << "[TOC]\n\n";

So you are proposing to add an additional include statement below (or to inline the file explicitly during generation)? What is the reason to have it in a separate file rather than in the description? Just so that it comes after the TOC? We could also move the TOC up, or require that the TOC be part of the dialect’s description so that the formatting can be whatever the dialect wants: in the default case, where the dialect has no description, only the TOC is emitted (done on the TableGen side); otherwise there would be a convenience “method” to add the TOC (which is not standard markdown).

I would propose to add an additional include statement below explicitly via mlir-tblgen for example by adding

  os << "[include \"Dialects/" << dialect.getName() << ".md\"]\n\n";
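
Put together with the existing code quoted above, the header emission could look roughly like the following self-contained C++ sketch. It uses std::ostream instead of LLVM's raw_ostream so that it compiles on its own; `emitDialectHeader` and `renderDialectHeader` are hypothetical names, the dialect is reduced to three strings, and `emitIfNotEmpty` is a simplified stand-in rather than the real helper in mlir-tblgen.

```cpp
#include <cassert>
#include <ostream>
#include <sstream>
#include <string>

// Simplified stand-in for the helper of the same name: writes the text
// followed by a blank line, or nothing at all if the text is empty.
void emitIfNotEmpty(const std::string &text, std::ostream &os) {
  if (!text.empty())
    os << text << "\n\n";
}

// Hypothetical function: emits title, summary, description and TOC as
// before, then the proposed include directive for the hand-written file.
void emitDialectHeader(const std::string &name, const std::string &summary,
                       const std::string &description, std::ostream &os) {
  os << "# '" << name << "' Dialect\n\n";
  emitIfNotEmpty(summary, os);
  emitIfNotEmpty(description, os);
  os << "[TOC]\n\n";
  // The only new line compared to the existing structure.
  os << "[include \"Dialects/" << name << ".md\"]\n\n";
}

// Convenience wrapper (also hypothetical) returning the header as a string.
std::string renderDialectHeader(const std::string &name,
                                const std::string &summary,
                                const std::string &description) {
  std::ostringstream os;
  emitDialectHeader(name, summary, description, os);
  return os.str();
}

// Small usage example: a dialect with a description but no summary.
void emitDialectHeaderExample() {
  assert(renderDialectHeader("foo", "", "A toy dialect.") ==
         "# 'foo' Dialect\n\nA toy dialect.\n\n[TOC]\n\n"
         "[include \"Dialects/foo.md\"]\n\n");
}
```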

Inlining during generation/processing via copy_docs.sh (that script does the inlining) is implicit rather than explicit, from my point of view.
I think the main point is that you may want the ability to have a large portion of the docs auto-generated, e.g. by

add_mlir_doc(StandaloneOps Standalone Standalone/ -gen-dialect-doc)

instead of

add_mlir_doc(StandaloneDialect StandaloneDialect Standalone/ -gen-dialect-doc)
add_mlir_doc(StandaloneOps StandaloneOps Standalone/ -gen-op-doc)

Regarding the question of why not move everything into the dialect description, see TOSA.md or Vector.md. Not everything can or should be moved.

I mean you could literally have

[include Dialects/foo.md]

in the dialect description today :slight_smile: (the TOSA example seems fine to have in the description; the vector one is larger, although that would still work, and with editors that actually understand multiple languages in the same file you don’t lose anything. The biggest “hurdle” today would be that one has to know the # depth that one should use, which could be handled inside doc gen). Include is not standard markdown, and neither is TOC; we are sort of skirting that and saying we only care about what will be generated on the “main” website side. That is why I was asking: if we are going to make this a construct, should it actually result in valid markdown, or do we entrench this more? And the second question was: why have two ways of adding a description of a dialect?

Thanks @marbre, this is a nice proposal!

What I like with the include form is that long documentation benefits from being in a markdown file because of IDE and auto formatting options. The description in the .td file suffers from the same kind of issue as inline C++ in TableGen: it displays as a string and isn’t as friendly to edit/navigate.

You can, but it has no effect :wink: With the script copy_docs.sh you can only include auto-generated docs in hand-written docs. At the moment your [include Dialects/foo.md] statement is just deleted from the generated markdown file. I have sent this pull request to enable this, which is a first step forward. But it’s true, it is not a must to change mlir-tblgen. I think this strongly depends on the community’s needs.

Good point. I am open to moving more of what is currently in the scope of copy_docs.sh to the main repo and getting closer to generating correct markdown there.

Regarding your second question, I wouldn’t say it is another way to add a description. I would rather say that it is a convenient way of extending a description without having to add all of it to the tblgen file.
Furthermore, as exemplified in my initial post, we can also think about adding multiple includes. One per section generated by emitDialectDoc():

  • Attribute definition
  • Type constraint definition
  • Operation definition
  • Type definition

As mentioned before, I am open to letting mlir-tblgen take over more of the processing. Instead of letting copy_docs.sh handle the inlining, it is possible to move this to mlir-tblgen, or at least somewhere into the monorepo. This is probably a required step if we really want to generate valid markdown from the beginning.

I think it’s worth discussing the several options we have :slight_smile: At least to me, the current situation (only being able to include auto-generated docs in hand-written docs) feels insufficient.

With Change processing of included docs #83 merged, we can actually include hand-written docs in auto-generated ones, e.g. by placing an include directive in a tblgen description. However, I still think that the structure of the docs could be improved as outlined in my earlier posts. I still think we could benefit from letting mlir-tblgen insert some further include directives, or maybe even from letting mlir-tblgen take over the inlining. Are there any further opinions on this?

D’oh missed that and thanks for fixing it!

I don’t have a preconceived answer here, so open to either too. What is useful about not having it in tblgen is that one could combine it with other steps. Pretty much saying “these are two atoms we expect to be supported in some way when you generate from these” doesn’t seem bad. It’s just good to be aware of it. Then again, you could handle this by changing build file order too. TOC is a bit of a pain/it is built in some places so it would be another reason to just enable relying on what shows the files.

What I was thinking of here before also consisted of “sections”: say I want to group all elementwise ops together (as is common), how would I represent that? I’m thinking: what do we need to hardcode vs. expose? E.g., what if we had a section construct (with heading and description, which can now reference an include) that could be used to create a document outline? The default would be the sections as you mentioned, so if you don’t override anything you get just that. But now you could decide how ops are displayed (group them into categories, skip displaying some ops that are internal only), where attributes are placed, add text for them, add multiple includes per section, etc. The structure is then built up from ~4 constructs coded into the tblgen backend, and the rest is under the dialect owner’s control.

I’m not sure if that is going too far :slightly_smiling_face:

More concretely I was thinking, we could have something like:

let description = [{
  My dialect ...
  
  [TOC]

  # Attributes
  [Attributes]
  
  # Ops
  ## Elementwise
  [Ops{Add, Mul}]
  ## Silly
  [include sillydocs.td]
  [Ops{Send, Recv}]
  ## Misc
  [Ops{*}]
}];

So there could be some control as to grouping things (and some conveniences could be added). So default case could be represented in this format and it becomes a little opt-in. E.g., if you want default you get it, if you want to customize you can. Now I’m not yet sure exactly how to make it too opt-in :slight_smile:
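
As a rough illustration of what the doc generator would have to do with such a description, here is a hedged C++ sketch that recognises an `[Ops{...}]` line and expands it against the dialect's op list. None of this exists in mlir-tblgen; the function names are hypothetical, and the reading of `[Ops{*}]` as "every op not already listed in an earlier group" is an assumption based on the example above.

```cpp
#include <cassert>
#include <cstddef>
#include <optional>
#include <set>
#include <string>
#include <vector>

// Strips leading and trailing spaces and tabs.
std::string trim(const std::string &s) {
  std::size_t begin = s.find_first_not_of(" \t");
  if (begin == std::string::npos)
    return "";
  std::size_t end = s.find_last_not_of(" \t");
  return s.substr(begin, end - begin + 1);
}

// Hypothetical: parses a line such as `[Ops{Add, Mul}]` into {"Add", "Mul"}.
// Returns std::nullopt if the line is not an Ops directive.
std::optional<std::vector<std::string>>
parseOpsDirective(const std::string &line) {
  const std::string prefix = "[Ops{";
  const std::string suffix = "}]";
  std::string text = trim(line);
  if (text.size() < prefix.size() + suffix.size() ||
      text.compare(0, prefix.size(), prefix) != 0 ||
      text.compare(text.size() - suffix.size(), suffix.size(), suffix) != 0)
    return std::nullopt;
  std::string body = text.substr(
      prefix.size(), text.size() - prefix.size() - suffix.size());
  std::vector<std::string> names;
  std::size_t pos = 0;
  while (pos <= body.size()) {
    std::size_t comma = body.find(',', pos);
    if (comma == std::string::npos)
      comma = body.size();
    std::string name = trim(body.substr(pos, comma - pos));
    if (!name.empty())
      names.push_back(name);
    pos = comma + 1;
  }
  return names;
}

// Hypothetical: expands one group. `*` is assumed to mean every op that no
// earlier group has claimed; `claimed` carries that state between groups.
std::vector<std::string> expandGroup(const std::vector<std::string> &names,
                                     const std::vector<std::string> &allOps,
                                     std::set<std::string> &claimed) {
  std::vector<std::string> result;
  for (const std::string &name : names) {
    if (name == "*") {
      for (const std::string &op : allOps)
        if (claimed.insert(op).second)
          result.push_back(op);
    } else if (claimed.insert(name).second) {
      result.push_back(name);
    }
  }
  return result;
}

// Small usage example mirroring the Elementwise and Misc groups.
void opsDirectiveExample() {
  std::vector<std::string> allOps{"Add", "Mul", "Send", "Recv", "Nop"};
  std::set<std::string> claimed;
  auto elementwise = parseOpsDirective("  [Ops{Add, Mul}]");
  assert(elementwise.has_value());
  assert((expandGroup(*elementwise, allOps, claimed) ==
          std::vector<std::string>{"Add", "Mul"}));
  auto misc = parseOpsDirective("  [Ops{*}]");
  assert(misc.has_value());
  assert((expandGroup(*misc, allOps, claimed) ==
          std::vector<std::string>{"Send", "Recv", "Nop"}));
}
```

Tracking the claimed ops in a set is one way to keep the wildcard group from repeating ops that were already documented in a named group.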

One option: if you add [Attributes], then that section is not generated automatically, but everything after it still would be. That would require knowing what the ordering is, though, so a simpler alternative would be to just skip all autogen if [Ops…] or [Attributes] are specified. Or perhaps simpler still: the default layout becomes a tblgen def, so if one wants the default, one specifies:

let description = DefaultLayout<[{
   My dialect ...
}]>;

and then it is very explicit, but without a lot of overhead for folks wanting the default, and with no magic to keep in mind.

Do we need to have this kind of grouping in the docs themselves? One thought that I had at some point was to have the grouping (or tags) attached directly to the operations themselves, and have the docs use this grouping during generation. If we standardized this across all of the ODS components, that would also make grouping consistent for things like Attributes/Passes/Types/etc (with hopefully less effort on our part). WDYT?

– River
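
The tag-based alternative can be sketched in a few lines of C++. This is only an illustration under stated assumptions: `OpDoc`, `groupByTag` and the "Untagged" bucket are hypothetical, and in a real implementation the tags would come from the ODS records rather than from a hand-built vector.

```cpp
#include <cassert>
#include <map>
#include <string>
#include <vector>

// Hypothetical record: an op name plus the tags attached to its definition.
struct OpDoc {
  std::string name;
  std::vector<std::string> tags;
};

// Groups op names by tag. Ops without any tag land in an "Untagged" bucket
// (an assumption of this sketch), and an op with several tags is listed in
// each of its groups.
std::map<std::string, std::vector<std::string>>
groupByTag(const std::vector<OpDoc> &ops) {
  std::map<std::string, std::vector<std::string>> groups;
  for (const OpDoc &op : ops) {
    if (op.tags.empty()) {
      groups["Untagged"].push_back(op.name);
      continue;
    }
    for (const std::string &tag : op.tags)
      groups[tag].push_back(op.name);
  }
  return groups;
}

// Small usage example with two tagged groups and one untagged op.
void groupByTagExample() {
  std::vector<OpDoc> ops{{"Add", {"Elementwise"}},
                         {"Mul", {"Elementwise"}},
                         {"Send", {"Communication"}},
                         {"Nop", {}}};
  auto groups = groupByTag(ops);
  assert(groups.size() == 3);
  assert((groups["Elementwise"] == std::vector<std::string>{"Add", "Mul"}));
  assert((groups["Untagged"] == std::vector<std::string>{"Nop"}));
}
```

The doc generator would then emit one subsection per map entry, so the grouping stays attached to the definitions rather than to the documentation layout.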

Very good question. I previously thought about using defset so that the grouping is outside the ops, and/or in this case one could have something like a filter on elementwise or a filter on has trait X (to reuse a tag on the op/attr/type). The question becomes: is there a universal taxonomy that would be worthwhile? If so, then adding it to ops or attributes makes sense. If not, then it shouldn’t be on the op and should be some grouping outside of it. And if the grouping only makes sense for the documentation, then it should be specified in the documentation, as that is easier to find [and we could use the tags there to introduce some filtering mechanism or enable specifying defset to reuse whatever tagging is done for more semantic purposes]. Now of course, there is a downside to allowing customization in that folks will use it :wink: And I definitely don’t want to reinvent Jinja or some such.

A path could be to keep sections fixed and only allow some changes within a section (e.g.,

def FooDialect_Attrs : AttributeDocs<Foo_Dialect> {
  let description = [{ ... }];
  let attrs = [...];
}

and then the dialect description is only the description but if this construct is found, then Attribute section is overridden).

I think something like this can make sense. I’d like it if we can constrain how dialects write their docs, so that we don’t have a large impedance mismatch when reading the documentation for different dialects. Keeping the sections uniform makes sense to me in that regard, and having constrained ways of specialization seems like the right way to proceed.

(As someone who extensively uses inja for a side project, I would very much like to avoid going that direction as well)

– River