[RFC] Exposing MLIR's Tokenizer, Lexer and Parser

The core tokenizer, lexer and parser in MLIR are currently not reusable.
As a consequence, multiple clients have to write their own. At the moment I can see:

  • parts of the Toy tutorial,
  • the declarative format introduced here,
  • the named ops generator under review,
  • internal use cases for xPU.

Is there a strong motivation for not wanting to expose these core components?
Note that in D77067 reusing the core MLIR parser is now a blocking requirement.

What’s people’s take on this topic?

I’ve been mulling over what it would take to have a parser generator dialect that would take something like an ANTLR grammar and generate the lexer and parser bits for a dialect. This is probably a lot more than what you’re thinking about, though. :slight_smile:

For the record, my default is to shy away from copypasta, but if we can articulate and record in this forum a reason why this is a good solution, then that’s fine with me as well. I just want to make sure that the discussion takes place and is codified.

I don’t think that splitting this out and pretending it is reusable is a good idea: too much of it is specific to decisions in the MLIR syntax. I would much rather see a little parser generator framework for defining grammars and generating a lexer/parser from a declarative specification.

That said, the existing MLIR syntax is very hackable if you are flexible about the details. Defining a dialect with custom parser rules can already do much of what you want. I have toyed with the idea of having a flag that changes the default dialect from “std” to something else, which would make this even more interesting.


That’s what I initially suggested, but it’s not easy because this runs as part of the build :confused:

It’s only difficult if IR/ depends on the generated thing, which isn’t the case here or for most things. As an example, the upcoming dynamic pattern rewriter work will depend on several different dialects, which is a much larger dependency than IR/.

+1. The Parser has so many MLIR-specific decisions and assumptions that it doesn’t really make sense to expose it. The custom assembly format parser wouldn’t benefit at all from exposing it. What it might benefit from is exposing parts of the lexer, but mostly just the base, since the allowed tokens and error handling differ.
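
To make the "share only the lexer base" idea concrete, here is a rough sketch of what that split could look like. It is not MLIR's actual API: `LexerBase`, `Token`, both rule tables and both error handlers are hypothetical names invented for illustration, the token kinds are only loosely modeled on the IR syntax and the declarative format, and Python is used purely to keep the sketch short. The scanning loop is shared, while each client supplies its own allowed tokens and its own error handling.

```python
import re
from dataclasses import dataclass
from typing import Callable, Iterator, List, Tuple


@dataclass(frozen=True)
class Token:
    kind: str
    text: str
    pos: int


class LexerBase:
    """Shared scanning loop; token rules and error handling come from the client."""

    def __init__(self, rules: List[Tuple[str, str]],
                 on_error: Callable[[str, int], None]):
        # Each rule is (kind, regex). Kinds must be valid identifiers because
        # they become named groups; earlier rules take priority.
        self._pattern = re.compile(
            "|".join(f"(?P<{kind}>{rx})" for kind, rx in rules))
        self._on_error = on_error

    def tokens(self, text: str) -> Iterator[Token]:
        pos = 0
        while pos < len(text):
            if text[pos].isspace():
                pos += 1
                continue
            m = self._pattern.match(text, pos)
            if m is None or m.end() == pos:
                # Client decides whether this is fatal or recoverable.
                self._on_error(text, pos)
                pos += 1
                continue
            yield Token(m.lastgroup, m.group(), pos)
            pos = m.end()


def raise_error(text: str, pos: int) -> None:
    raise SyntaxError(f"unexpected character {text[pos]!r} at offset {pos}")


# Hypothetical client 1: IR-like tokens, errors are fatal.
ir_lexer = LexerBase(
    [("percent_id", r"%[A-Za-z0-9_]+"),
     ("bare_id", r"[A-Za-z_][A-Za-z0-9_.]*"),
     ("integer", r"[0-9]+"),
     ("punct", r"[=,():]")],
    raise_error,
)

# Hypothetical client 2: format-spec-like tokens, errors are collected.
format_errors: List[int] = []
format_lexer = LexerBase(
    [("variable", r"\$[A-Za-z_][A-Za-z0-9_]*"),
     ("literal", r"`[^`]*`"),
     ("keyword", r"[a-z][a-z\-]*")],
    lambda text, pos: format_errors.append(pos),
)

ir_tokens = list(ir_lexer.tokens("%0 = addi %a, %b"))
format_tokens = list(format_lexer.tokens("$lhs `,` $rhs attr-dict ?"))
```

The two clients share nothing but the loop in `tokens`, which matches the point above: the reusable part is small, and the parts that differ (token set, error policy) stay with each client.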