RFC: Introduce ml_program dialect and top-level ops (proposal v2)

stellaraccident · April 14, 2022, 4:44am

Ok, folks, I’ve landed the first patch (sorry for the delay – vacation and sickness got in the way).

I’m starting to get echoes back in various frontend forums of the form “when we have ml_program, we can drop XXX” or “we only need this workaround until we have ml_program”, which is an indication to me that we should keep moving.

I would like to proceed to work on the globals support as presented at the top of this thread, modulo a couple of comments made throughout, but I wanted to check in first: are we close or far away from convergence on the initial proposal? If close, we can just continue discussing here or lead with a patch. If far, I might checkpoint into a new thread and/or an ODM. Feedback welcome.

Mogball · April 14, 2022, 5:28am

@mehdi_amini and @jpienaar and I spent some time discussing a mechanism to support externally owned attributes in MLIR. Our main concern is keeping MLIR files standalone, so that things like reproducers work well. Proposal is WIP, but some current users are getting around the problem by having an attribute that refers externally-stored data, e.g. #foo.constant<"bar_table", 1> to refer to the first constant in a table called “bar_table”.

On the syntax of GlobalOp:

This is giving me Java vibes. This is also the opposite of functions which are explicitly declared private.

There’s something about the inline initializer and bounding vs storage type syntax that bothers me and isn’t intuitive. How about:

ml_program.global @foobar(dense<4> : tensor<4xi32>) : tensor<?xi32>
ml_program.global mutable @foobar(extern : tensor<4xi32>) : tensor<?xi32>

Which, now that I think about it, would bring it in line with llvm.global.

stellaraccident · April 14, 2022, 5:40am

OMG, now that you say that, I can’t unsee it (public static void main). You are totally right. The intent was to be close to LLVM in function – aligning the syntax as you propose sgtm (just verifying – I think your critique is just syntax, right?).

It would be really nice if whatever we did had a unified “namespace” for extern globals that are “linked” symbolically by symbol name and whatever goes inside of #foo.constant<...>. i.e. the latter would be a way of “inlining” an extern symbol reference. There is probably some parallel to things that exist, but I haven’t thought through it deeply. This would let us share some of the tooling/file-formats/etc for handling these.

I know it’s just an example, but I don’t love the opaque ordinals and would prefer assuming something that we can bind by a real symbol name of some kind. It makes a lot of things easier to assume that when things compose later in the pipelines vs just having ordinal based.

mehdi_amini · April 14, 2022, 5:49am

I believe there are two different kind of “external constants”:

the “performance” aspect: avoid duplicating/leaking them in the MLIRContext, and possibly uniquing as well. In this case the compiler (analysis/passes/transformations) can query the content of the constants, potentially changing it or creating new ones. For reproducibility purpose we likely need some round-trip / snapshot mechanism, this is what we’re working on that @Mogball referred to.
the truly externals (or “late binding”): the compiler never have access to them, no passes can get their content. We don’t see them in the IR and there is no question of round-trip issue: just having an id/name to refer to them seems fine?

stellaraccident · April 14, 2022, 6:21am

I think there is a bit more to it than that, and actual existing systems that I am aware of do the latter (i.e. Glow or even TF if you stare closely enough) and treat the “compiler” as more of a traditional compiler+linker or LTO thing for cases where the contents are required in order to optimize the program in some way.

The “performance” aspect of them being uniqued in the MLIRContext, imo, is just one part and a bit of an accident of history: the facilities for this should really not have been scaled as they were to general, arbitrarily large constants (there’s nothing wrong with letting people do it, I guess – it is just a questionable design choice for an industrial strength user of the facility). Just wasting a lot of memory by uniquing is bad, but any industrial system also needs to optimize the movement and mapping of this semi-constant data between artifacts that are fed in and out of the tools too – and ultimately there are cases where you need to be pretty conservative and not touch such contents much if at all.

IREE plans to take Glow’s approach and it has dedicated machinery around such evaluations that is designed to be conservative with respect to mutation and movement (i.e. spend extra time analyzing, evaluate, and update a minimal number of times if profitable). These passes are specialized and need to be engineered to fairly high tolerances in industrial systems where we can be munging many gigabytes of semi-constant data, often-times with deadlines and budgets.

I’m not sure we actually need to converge on the downstream system design constraints, and the facilities are orthogonal to each other and can both exist. My hunch, though, is that if we have both ways of binding to external data, if we do it right, they can compose together (and let the implementations duke it out based on their tolerances and use cases) – and that is what I would want to see. Compilers always grow LTO of some form, and planning for that so the two can interoperate will strictly help…

It’s too bad we didn’t have this discussion a day earlier. Would have been a fun thing to hash out f2f in the ODM time slot tomorrow, but it is too late now. I’m happy to keep discussing here, but it would be useful to get a more holistic view of what you/Jacques/Jeff are cooking to see how the two might compose.

Also, we could proceed with the work on globals without letting them be extern to start with and add that later. That would allow us to parallel track a bit more.

mehdi_amini · April 14, 2022, 8:00am

You mean the former instead of the latter? Or I am misunderstanding what you’re saying?

Uniquing is actually saving memory, do you mean “copying” instead of “uniquing”? (There is some level of orthogonality and a system can provide attribute uniquing without copying).

I think what you’re doing is fine, I just see the current “extern” globals that you have as something that MLIR does not provide access to. Of course we can’t / won’t prevent a system like IREE to couple their compiler to their runtime in order to have the compiler passes access and interact with these globals, but getting anything in this direction upstream will require very careful work in order to preserve “hermetic passes” and things like “crash reproducers”.

Mogball · April 14, 2022, 4:13pm

It would’ve been funnier if you’d initially wrote public mutable extern @foobar. But, yes, the critique is just the syntax.

+1, upstream constants should be “pure” extern until a proper mechanism is hashed out.

and +1 here too.

stellaraccident · April 22, 2022, 3:43pm

Ok, I’m going to prepare patches today for:

ml_program.global op and an ml_program.global_load_const op which makes it minimally useful (deferring the side effecting load/store ops to a followup, since that is where discussion remains)
A library and tool defining the external data files

I expect that the first may have spelling level comments that can be handled on the review thread. The second may have design comments but those are probably best had in view of a concrete attempt.

_sean_silva · April 25, 2022, 3:59pm

@stellaraccident - do you already have passes for ml_program.func and func.func interconversion?

stellaraccident · April 25, 2022, 4:02pm

No - if you need those now, I won’t get to it in the next few days and would welcome a patch.

_sean_silva · April 26, 2022, 8:40am

We will get to it eventually as part of our ml_program integration effort, so if it’s not blocking you no need to do it.

stellaraccident · May 23, 2022, 6:44pm

Hi folks, here is the next patch in the series: ⚙ D126230 [mlir] Add global_load and global_store ops to ml_program.

Note that we did not converge on the design of this in the original thread: there were different opinions with respect to graph vs imperative handling. I am proposing this by way of concrete patch as it represents my view of the next incremental step here – but by all means, let us continue the discussion if others feel differently.

mehdi_amini · May 23, 2022, 10:15pm

My understanding was that the safer way was to make sure was always have equal support for graph and imperative region?

stellaraccident · May 24, 2022, 12:17am

I recall that being one of the viewpoints, but I heard at least as strong support for the alternative: load/store ops that accept/produce an access token are different ops and we don’t need them to be unified. Within this alternative, there was also a side discussion that could potentially blend the worlds using some kind of wrapper (being imperative inside and graph-based outside):

%token0 = ml_program.create_token
%0, %token1 = ml_program.execute %token0 {
  %1 = ml_program.global_load @foobar
  yield %1
}

%token2 = ml_program.execute %token1 {
  ml_program.global_store @foobar = %0
  yield
}

mehdi_amini · May 24, 2022, 8:35am

I don’t get this: how are they different? (Other than having a token?)
As far as I understand a version of this op that returns a token and accepts an optional list of tokens would fit both use cases perfectly. The token can just be ignored/unused in a non-graph region.

I am quite strongly against this: while this a « hack » that works in making the representation « correct » semantically speaking (we did this in TF before Graph region existed), it is not practical to use.

stellaraccident · May 24, 2022, 11:16am

That is what I’m trying to get to the bottom of. In this thread and the ensuing discussion, I can tell you that your perspective on this does not seem to be in the majority. In fact, every other person has pushed a position that the worlds should be separate on this point or that such interop things as I presented above are practical solutions to the problem. Now, you may very well be right, and your experience with TFG does give you an interesting perspective on all of this that I want to make sure is accounted for (but also comes with history that can introduce other biases that I want to make sure we understand).

To restart the discussion, pulling a comment from up-thread:

Others made similar comments without rationale in this thread and the ensuing discussion. Perhaps we should elaborate on some of the reasons for those opinions?

_sean_silva · May 24, 2022, 11:30am

From my perspective, the set of analyses that you want to do on these ops is very different whether they are in a graph or imperative region. You would be doing a pure use-def chain thing in the graph region, whereas in the imperative region you need to be dealing with program order with a dense analysis. I guess you could augment the dense analysis with pruning based on the chains (using the semantics “if it has a chain, then it is assumed that the side effects are correctly ordered by looking only at the chains”), but somehow that doesn’t seem like something that I would actually want to implement.

mehdi_amini · May 24, 2022, 12:29pm

Sure the region kind may lead to different set of algorithms, if only because dataflow analysis in a graph region don’t propagate state based on the order in which operations appear in a block.

But that’s a property of the analysis framework with respect to the region kind, and it seems to me that it is orthogonal to the op semantics itself.

We could also define two variants of the op: one that returns a token and one that does not, but we should do this consistently then.

_sean_silva · May 24, 2022, 1:28pm

+1

stellaraccident · May 24, 2022, 11:07pm

So, if we were to go with this approach, I take it that we would roughly:

Define a new TokenType
Define a new create_token : !ml_program.token op.
Define op variants for side-effecting ops that take and return a variadic of TokenType.

Questions:

How would we name the ops? global_load / global_load_df?
Do the dataflow variants implement the same memory effects traits as the imperative ops?
@clattner and @_sean_silva seem convinced that these should be different ops. Could you elaborate more on the reasons?

Topic		Replies	Views
RFC: Introduce ml_program dialect (deprecated v1 proposal) MLIR	33	2274	March 12, 2022
Open MLIR Meeting 2/17/2022: IREE’s input dialect Announcements	10	1031	February 20, 2022
Development of high-level Tensor Compute Primitives dialect(s) and transformations Tensor Compiler	79	11457	September 8, 2021
[RFC] Splitting the Standard dialect MLIR	13	3090	December 5, 2020
Open MLIR Meeting 2/24/2022: Continuing discussion re ml_program Announcements	7	1386	March 16, 2022

RFC: Introduce ml_program dialect and top-level ops (proposal v2)

Related topics