[RFC] Allowing Dialects to relax the SSA-dominance condition

Hi all,

Today MLIR has a relatively rich syntactic framework based on SSA form. Fundamentally, this means that every value is defined by a single operation (single-def), and every value is defined before it is used (dominance). When combined with some other basic primitives (e.g. Ops, Attrs, SymbolTables, Blocks, and Regions) and Transformations, we have a syntactic foundation to build and manipulate many dialects.
These might include the traditional notion of CDFGs (e.g. LLVM dialect), CDFGs restricted to structured control flow (Loop/SCF dialect), and higher level abstractions (Affine+Tensorflow dialects).

However, not all operations or all dialects require all of the richness of the structural foundation. For instance, operations can opt out of some of the complexity of Blocks using the SingleBlockImplicitTerminator trait. It's also possible to constrain the use of hierarchy. Some of this is opt-in (e.g., an op declares that it has regions; the default is no regions) and some is opt-out (e.g., the HasParent<> trait restricts the potential containers of an operation).
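
For readers less familiar with the trait mechanism, here is a minimal C++ sketch of how such opt-in/opt-out traits compose on an op declaration; the op, terminator, and container names (MyLoopOp, MyYieldOp, MyContainerOp) are made up for illustration and assumed defined elsewhere:

#include "mlir/IR/OpDefinition.h"

// Illustration only: each trait opts the op into (or out of) a structural
// rule enforced by the verifier.
class MyLoopOp
    : public mlir::Op<MyLoopOp,
                      mlir::OpTrait::OneRegion,           // opt-in: one region
                      mlir::OpTrait::SingleBlockImplicitTerminator<
                          MyYieldOp>::Impl,               // opt-out of explicit terminators
                      mlir::OpTrait::HasParent<MyContainerOp>::Impl> { // restrict containers
public:
  using Op::Op;
  static llvm::StringRef getOperationName() { return "mydialect.loop"; }
};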

I’d like to propose that we offer dialects an additional configuration mechanism to enable or disable the SSA-dominance property. This property is important for CDFG analysis and semantics, but is not necessary for some other semantic models. This would allow dialects to express that this kind of IR is legal:

%1 = op(%2)
%2 = op(%1)

Currently, the feedback loop has to be broken in some way using a terminator that preserves the dominance property, for instance something like:

 func @simple(%arg : i32) {
  br ^bb1(%arg : i32)
^bb1 (%a: i32):
  %1 = "testop"(%a) : (i32) -> (i32)
  %2 = "testop"(%1) : (i32) -> (i32)
  br ^bb1(%2 : i32)
}

In particular, we're interested in various hardware-oriented semantic models where feedback and concurrency are common.
One such kind of model is dataflow/Kahn Process Network models. We have a prototype of such a dialect implemented in MLIR today. In the future, we expect to build RTL-style models used for composing IP in a chip design. While not completely a showstopper, the syntactic constraints of SSA-dominance make implementing these models significantly more awkward. Our existing prototype relies on disabling the verifier check relating to dominance, but most other mechanisms seem to work as expected. In particular, printing and parsing work fine. We currently transform into this dialect, but do not perform optimizations on this dialect. I expect that the existing algorithms should work fine or require minimal modification. In particular, I'd expect that we can build a system using declarative rewrite rules where matching the two connected-op possibilities %1 → %2 and %2 → %1 is equivalent, which is not possible in the existing system (which would require having some rewrite rule match around a back edge).

So basically, I have two questions:

  1. What are people's reactions to this? Are there fundamental objections? In particular, I'm sensitive that some may feel that this is completely out of scope for what MLIR is trying to do, to the extent that it should be actively discouraged. In such a case, I can see us needing to build our own MLIR-like syntactic framework (or, more likely, fork MLIR to allow this).
  2. In lieu of fundamental objections: What form should this take? Opt-in, requiring the addition of a Dominance trait to (most) existing operations with regions? This trait could presumably be responsible for verifying the condition itself, taking ownership of code which is currently in the global verifier. Opt-out, disabling the existing verifier code for some operations with regions? Or some other form? (A rough sketch of the opt-out variant follows.)
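
To make the opt-out variant concrete, here is a minimal C++ sketch, assuming a hypothetical trait name (nothing like this exists in MLIR today); the global verifier would skip its dominance walk for ops carrying the trait:

#include "mlir/IR/OpDefinition.h"

// Hypothetical marker trait: an op with regions attaches this to say that
// values in its regions need not obey SSA dominance.
template <typename ConcreteType>
class NoSSADominance
    : public mlir::OpTrait::TraitBase<ConcreteType, NoSSADominance> {
public:
  static mlir::LogicalResult verifyTrait(mlir::Operation *op) {
    // Nothing extra to check; the trait's *presence* is what matters.
    // The global verifier would do something like:
    //   if (!op->hasTrait<NoSSADominance>())
    //     runDominanceCheckOnRegions(op);  // hypothetical helper
    return mlir::success();
  }
};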

Hi Stephen,

I would also really love to see this. In my opinion, MLIR is great at representing CFG-like representations, great at representing nested hierarchical relationships, and great at DAG structures. However, it is "not great" at representing general graphs: even TensorFlow graphs require things like tf_executor.NextIteration to be split into two nodes. Dominance as a concept is neither helpful nor problematic for DAGs; it really only makes sense when MLIR is being used to model imperative execution flow, and it actively hurts general graphs.

To me, the big question is “what is the best way to represent this in MLIR”. I can imagine a few different options with different tradeoffs:

  1. Status quo: Force everything to be split like NextIteration. This breaks the value of use/def chains (which, in a graph context, provide forward and backward edge iteration) and makes things irritating to work with. This undermines a lot of MLIR's value.

  2. Add a bit (opt-in or opt-out with different tradeoffs) that each region (or operation that has a region) can have to enable or disable the dominance checks and related properties.

  3. Introduce a new region-like thing that is not a region. For example, you could have:

foo.graph @simple(%arg : i32) {{
  %1 = op(%2)
  %2 = op(%1)
}}

Where the {{x}} syntax turns into a new mlir::Graph node that is like a region without dominance. Regions and graphs would be similar, but a graph wouldn’t need to have a Block within it, for example.

What do others think about this? MLIR is getting used for a wider set of domains and the ability to represent general graphs is a very useful capability (e.g. for netlists).

-Chris

I generally like the idea of adding this functionality. Agreed that things like netlists and TF graphs become unwieldy without it.

This is kind of a rehash of the “non-cfg region” discussion that we tabled a while back of having different kinds of regions with different guarantees. I forget if we came out of that with any firm conclusions.

I feel like the approach we have been taking so far is to focus on minimal mechanisms vs building new IR concepts. E.g. for a module body, we have an implicit terminator, single block, and an outside-the-IR SymbolTable helper class, instead of introducing a mlir::DeclarativeNameScope type of special region type that would encapsulate all that (region.isa<DeclarativeNameScope>()?). Adding a bit that enables/disables dominance checks seems most consistent with that.

I proposed this last year actually, here is the thread (this was the first ever public MLIR RFC by the way!): https://groups.google.com/a/tensorflow.org/d/msg/mlir/gPQFIy9XpVw/hfxmBGF8AQAJ

Ultimately after more internal live discussions, @clattner and @albertcohen were strongly opposed to this, and I went on with another solution for modeling TensorFlow.

If we want to revisit this, it'd be interesting to position it with respect to the arguments made at the time in that thread.

Note that the proposal at the time was more general than just "dominance"; it was making other properties of the CFG optional in order to address "convergence" as well.


Having the ability to temporarily break dominance would be useful for C frontends as well. We could model scopes as use-defs, emit goto and switch naturally as control flow that jumps into those scopes, and then diagnose violations of the language semantics by examining scope dominance.

We might also be able to handle conditional destructors more elegantly.

Indeed it was discussed before… And I even commented on it at the time! :slight_smile:
I’ll try to summarize the previous discussion in a few points I see. Please feel free to add anything I’ve missed:

  1. (Unnecessary complexity) This increases the complexity of MLIR and dialects. There are more things that dialects need to be aware of and more things that the MLIR core needs to take into account.
  2. (One abstraction) If a dialect can be made to fit in the SSA world, then maybe it should, since there's a bunch of things that come with it (e.g. DCE) and everyone understands it. Or, if some dialects need this kind of thing, can we push the complexity into a small number of dialects instead of having core abstractions which potentially impact all dialects?
  3. (Premature complexity) Even if we need this, then do we need it now before we understand how MLIR will really be used?
  4. (Unknown implications) If we allow it, are we committing to making the 'generic parts' of MLIR (e.g. DCE) work with non-SSA abstractions?
  5. (Well-foundedness/semantics) People know what an SSA CFG means. We can ‘easily’ extend that to lots of dialects which are simple extensions. If we start relaxing this, then we might need to discuss more of semantics. In particular, what does it mean (specifically in terms of dominance analysis, but also more generally) when SSA regions and non-SSA regions interact.
  6. (Pragmatics) How is this represented? Is it something which has 'configuration bits' in the Region class? Is it delegated to ops (maybe through traits)?

I don't know if I can completely address all of these concerns. My opinion is that this is necessary complexity to represent some problem domains, and I think we've reached the time where we should have this discussion (perhaps comparing the pros and cons of TensorFlow's dataflow representation with our dataflow representation). I think that the presence of operation traits and interfaces provides a natural way to express these properties. I think the semantic questions are still open, and should perhaps point to additional places where the existing interfaces are insufficient to support our analysis and transformation passes.

Steve


[Sorry for the very late reaction and thank you @mehdi_amini for notifying me.]

First of all, this is a great discussion and a very well-motivated proposal, with pragmatic rationale and more fundamental pro and con arguments. You're doing all the work @stephenneuendorffer :slight_smile:

I would add the risks of spreading too thin, and of code duplication for “superficially similar abstractions and graphs” to the list of concerns.

Now the specific equational dataflow representations you have in mind, including Kahn networks, dataflow synchronous languages for reactive systems, and HDLs, are all dear to my heart as you know. As @clattner highlights, it would be a pity to make the life of dataflow people more difficult than necessary due to control flow idioms they don't need. As MLIR grows, some fundamental PL concepts (proven clean and effective in their specific domains) are natural candidates for being first-class citizens.

I was initially (strongly) against relaxing dominance, for the reasons listed above, and particularly in the case of TF graphs as Mehdi explained. But we are in a different situation here, and my understanding has also matured thanks to discussions with Ulysse Beaugnon and (much earlier) Dumitru Potop Butucaru. Beyond the 3 options Chris proposed, I would draft a 4th one that acknowledges that dominance is only a control-flow-specific view on a more fundamental dataflow and PL concept of causality. There are unfortunately many theories and implementations of the latter; in synchronous languages alone it is a real mess, and as HDL people here probably know it is even worse in hardware design. But some formalisms of causality are very simple and, I believe, fit incrementally into the design of MLIR rather than breaking it or introducing a special case. This should be an ideal situation, since we are talking about something that's awkward to encode rather than plain impossible, favoring a smoother/incremental change if possible. Without going into any theory or related work (I'll have to provide some later though), here is the proposal:

  1. Introduce causality-loop-breaking operations.

Branches are exactly that: control flow can be seen in the dataflow world as a layer of causality-breaking operations, where dataflow equations within blocks can be sorted topologically (i.e., use-def chains only; sequencing within a block is purely syntactic) and imperative code generation is a structured-or-not layer of control flow driving a flat layer of use-def/dependence-preserving scheduling of operations. This view assumes individual operations have no side effects, of course. Side-effecting operations should have additional artificial use-def relations capturing the side effects conservatively (sequencing is one form of implicit artificial use-def relation from one op to the next).
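
As an illustration of the "sequencing within a block is purely syntactic" point, here is a rough C++ sketch (not in-tree code) that re-derives an intra-block order from use-def chains alone, assuming the block's use-def graph is acyclic and its ops are free of side effects:

#include <functional>
#include "mlir/IR/Block.h"
#include "mlir/IR/Operation.h"
#include "llvm/ADT/DenseSet.h"
#include "llvm/ADT/SmallVector.h"

// Re-sequence the ops of `block` so every in-block producer precedes its
// consumers, via a post-order DFS over operands.
static void sortBlockTopologically(mlir::Block &block) {
  llvm::SmallVector<mlir::Operation *, 16> ops;
  for (mlir::Operation &op : block)
    ops.push_back(&op); // snapshot: don't iterate a list we're reordering

  llvm::DenseSet<mlir::Operation *> placed;
  std::function<void(mlir::Operation *)> place = [&](mlir::Operation *op) {
    if (!placed.insert(op).second)
      return;
    // Place in-block producers first; values from outside the block are fine.
    for (mlir::Value operand : op->getOperands())
      if (mlir::Operation *def = operand.getDefiningOp())
        if (def->getBlock() == &block)
          place(def);
    op->moveBefore(&block, block.end()); // append after its producers
  };
  for (mlir::Operation *op : ops)
    place(op);
}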

The idea is simply to add dataflow-friendly causality-loop-breaking ops, or delay operations. NextIteration in TF executor graphs is exactly that. It is called "fby" (pronounced "followed by") or "pre" in dataflow synchronous languages (see Lustre, the father of dataflow synchronous languages, and earlier languages like Lucid from which it takes some inspiration, the more modern Heptagon, and the commercial version Scade 6 from ANSYS/Esterel Technologies that helps planes and trains do what they are supposed to do when certification agencies also do what they are supposed to do; sorry, I'm digressing into an unrelated political rant).

Stephen's example may simply be written using custom syntax for fby (think of a loop-phi function with an implicit end-of-block looping branch) and pre (the same without initialization):

func @simple(%arg : i32) {
^bb1 (%a: i32):
  %1 = "testop"(%a fby %2) : (i32) -> (i32)
  %2 = "testop"(%1) : (i32) -> (i32)
}

or without initialization (assuming it makes sense, depends on the ops):

func @simple() {
^bb1 ():
  %1 = "testop"(pre %2) : (i32) -> (i32)
  %2 = "testop"(%1) : (i32) -> (i32)
}

If this breaks some people's sense of what a linear IR should be, it's possible to take the fby/pre outside the expressions, of course:

%t = "fbyop"(%a, %2)
%1 = "testop"(%t)

Other languages in that spectrum introduced "next" or "last" keywords for similar causality-loop-breaking effects, but they may not suit a linear, SSA-style representation so well (to be investigated in a more structured/region-based setting, however). Haskell uses… list consing for the same effect, relying on lazy evaluation (resulting in lazy streams), but this is not acceptable in an IR meant for generating stand-alone imperative code.

Note that relaxing the implicit sequential ordering in blocks will break what we have today. The implicit futures solution of the TF executor and TFRT dialects should remain applicable, and does not have the syntactic awkwardness of using control flow in dataflow languages. TBD.

Note also that lowering delay operations like fby/pre to control flow is easy and systematic.

This is all I have for now, pending more in-depth discussion and references for those interested. If this resonates with the needs at Xilinx and dataflow folks, and if it also makes sense for TF folks, I’ll be glad to work on a deeper RFC with those interested.


Hi Albert,

Thank you for raising these interesting ideas. One quick response: TFRT BEF programs are simple DAGs, so they don’t benefit from any extension in this area AFAIK. TF executor only has the NextIteration problem which is bounded, so I agree that no change would be warranted if that were the only use-case.

The problem here is that there is not any notion of causality that makes sense. In the hardware world, things are implicitly parallel and the ordering within the IR is meaningless. Specifically, these two operations should have the same semantics:

  %x = foo()
  bar(%x)

and

  bar(%x)
  %x = foo()

It took me a while to get used to this, because I've worked with order-dependent IRs for so long, but it really is better this way. The issue is that you get important cases where cycles and other issues are critical to the design, and they aren't bounded to specific operations like NextIteration; they can happen on literally all operators.

This behavior isn't specific to the hardware world either; this is a general property of domains that want to model graphs. In adjacent domains there is a well-known body of work that boils down to things like "how to map a graph into a protobuf/json/xml file", and none of those solutions (e.g. using UUIDs) have been great for archival, much less analysis and transformation…

This is my long way of saying that I think we should just accept that a clean generalization for graphs in MLIR is the right way to go. With the existing use/def edges as the basis of graph edge iterators, the question in my mind is "what is the least engineering impact to achieve this?"

For example, it would be obviously unacceptable to duplicate Operation into a GraphOperation class to make everything above it be different. I think it would also be difficult/yucky to make the existing Operation class be able to live in a Block or a Graph. It would be easy to add a bit to Region that says that it doesn’t respect dominance. It would fit very naturally into our design to have an “IsolatedFromAbove”-like bit that indicates that the region contains non-execution semantic data, which would imply that dominance doesn’t hold.

The question is - among the ways we could model this, what is the best way?

-Chris

To follow up with my actual opinion, I think we should introduce a new trait bit that can be applied to operations with regions, or perhaps a bit on regions, that indicates that the regions don't have execution-order semantics. This would disable the dominance checks, and would be something that transformations could key off as well. We'd have to nail down the specific semantics and decide whether this is a per-region bit or a per-operation bit.

Once we have that, we'd have to decide how to spell it in the generic syntax. I think we want the existing generic region syntax to default to enabling dominance checks, so we should have some way of spelling a generic region without dominance checks and make that an explicit thing.

-Chris

Yep, I think we are just using different words and ordering of concepts.

Let me rephrase and develop. We have in fact three separate problems to deal with: (1) relaxing sequential ordering in blocks, (2) breaking causality loops, and (3) making SSA-dominance optional.

(1) Sequential ordering in blocks.

Regarding TFRT BEF, I agree this is not directly related; I was shifting focus to asynchrony being modeled through implicit futures. Just like the tf_executor dialect, implicit futures allow keeping sequentially ordered ops in blocks while modeling unordered computations spawned by these ops. This is well known from pure functional languages with futures, where binding a future practically offers a means to bound a set of possible evaluation orders between eager and lazy extremes. The fact that this set is not reduced to a singleton is the very reason it exposes/explicits concurrency in the first place. This example shows there are ways to layer unordered computations underneath sequentially ordered ops in a block. It is not great, however, as the former are only implicit; they do not exist as first-class objects represented in the IR. Only their originating spawning ops are, not the asynchronous spawned computations. This indicates that whatever worked for tf_executor and TFRT would not necessarily apply to a wider context where sequential order should be relaxed in blocks. The foo/bar example of Chris is unordered, i.e. pure dataflow semantics without a causality loop. There is absolutely a causal dependence between the execution of foo and the execution of bar, taking the form of electric signal propagation in a hardware implementation, or sequencing of instructions in software. But this causal dependence is not associated with any specific ordering of the instructions in the IR/dataflow code.

(2) Breaking causality loops.

On the contrary, Stephen's example has a causality loop. It does not have any meaning unless some properties are known about the ops defining %1 and %2, such as analog signal semantics to define an SR flip-flop (one of the fancy causality semantics I was referring to). We don't want this in MLIR as of now. I'm assuming HDL folks do not want to model the internal semantics of an SR flip-flop. At least this is what I have always seen in behavioral synthesis, ESL, and working with Esterel and Bluespec folks: there is always a syntactic way to break such cyclic definitions in the language. In imperative compiler IRs, backward branches are used to break causality loops (see also LLVM's optional loop-closed SSA for a stricter enforcement of loop-phi vs. close-phi separation in the IR; I would refer to Sebastian Pop, Kenny Zadeck, and SSA taxonomy experts for further exploration of loop-closed SSA vs. our discussion here). Delay operators like fby/pre are other means of breaking causality loops that do not rely on a CFG and branches. In other words, backward branches are only one means of breaking causality loops.

(3) About dominance

The good news is that dominance exists for any DAG. It is not specific to CFGs. It can be defined without control flow, on a plain dataflow graph, as long as it is acyclic. Assuming delay operators are in use, i.e. causality is defined syntactically rather than relying on fancy semantics of special auto-stabilizing compute ops for low-level HW circuits, this is all good news. We can keep dominance, probably compute it the same way as we do today, and definitely use it in any analysis and transformation the same way as we do today. Of course, the concrete engineering of things will be more complex, as every pass and analysis today assumes a CFG world with terminators and dominance defined in this restrictive context.

How to go ahead from there.

My recommendation is pretty much aligned with Chris's conclusions, despite the seemingly different story above. A trait mechanism for ops introducing regions with unordered block semantics seems like a natural thing to do. It was a no-go for TF graphs as implicit futures were good enough, but I think we cannot avoid it if we jump into a wider dataflow/circuit/reactive-system world. Having delay ops in addition to terminators as a means of breaking causality loops seems like the natural other thing to do. Otherwise we'll be limited to DAGs of ops in a single block within such an unordered execution region. I don't think these delay ops can co-exist with conventional block terminators in the same region, however. We should only allow them with another trait (or maybe the same one) that also disallows branches (only the degenerate implicit terminator can be present; the need for one is a strong invariant of the IR at the moment).

Does this make sense or am I making too many simplifying assumptions about the reality of HDL IRs and what you have in mind @stephenneuendorffer @clattner?

Note for TF folks: I actually think such a design would make the static scheduling of TF graphs much nicer, as more bits and pieces of XLA migrate to an MLIR world. Same thing for tracing compilation of ML graphs in general, where concurrency may be gradually refined into sequentially executed islands depending on the size of the target platform (from transient-powered devices to data center).

I'm glad to see that everyone seems in agreement that this is something we should allow in some form. It seems like the discussion has focused on how we allow this and which use models are valuable here.

I think Albert has really focused in on some of the interesting questions, in particular the difference between 'structured' breaking of causality loops and 'unstructured' breaking of them. My original example (the %1 = op(%2) / %2 = op(%1) snippet at the top of the thread) is specifically posed in this way to suggest an unstructured approach, where the basic syntactic mechanisms do not understand the semantics of the operations and hence cannot distinguish how the causality loop is broken.
I think we should also allow structured approaches, but these can largely be described using the current style of structured ops. For instance, Albert's pretty-printed fby example above could just as easily be shorthand for:

sequential.op @simple() {
^bb1 (%2_last: i32):
  %1 = "testop"(%2_last) : (i32) -> (i32)
  %2 = "testop"(%1) : (i32) -> (i32)
  sequential.yield(%2)
}

Where the semantics of sequential.op take care of the feedback state.

My assumption is that at some point a lowering to code which is unstructured may be useful.
I see this as analogous to the connection between structured (sequential) control flow ops representing loops and if conditions being lowered into unstructured control flow expressed as branches and basic blocks. I see that we need the same thing for concurrent constructs, where we often prefer to work in a structured way, but also require a lowering into unstructured forms. Expanding on Albert's example, consider this alternative representation of the same example:

%t = "fbyop"(%a, %2)
%1 = "testop"(%t)
%2 = "testop"(%1)

This is no longer a question of something that a dialect can allow by overriding fbyop, or the containing op: it is syntactically illegal. I propose that we allow dialects a mechanism to make such structures syntactically legal.

I see several main proposals for allowing this:

  1. Allow any op to declare that contained regions are graphs. The verifier would change its behavior based on this. Presumably this would happen through a Trait on the op which would opt out of some of the current verifier checks. The default is that SSA-dominance is required. (However, it could easily be built as opt-in as well.)
  2. Delegate verification to ops. Most existing ops would opt in to SSA-dominance checks, which would exist as a library. Ops are responsible for the additional syntactic properties of their contained regions. (A rough sketch follows this list.)
  3. Define a new independent region-like concept for Graphs. Every op can potentially contain a list of graphs in addition to a list of regions.
  4. Define a new concept for Graphs which subsumes the concept of Regions. Every op contains a list of Graphs. 'Region' becomes a special case of 'Graph' with additional properties; every region is also a graph.
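
A minimal sketch of what option 2 could look like, assuming a hypothetical library entry point for the dominance check (neither the op class nor the helper exists today):

// Each op verifies its own contained regions. An SSA-style op would call a
// reusable library check; a graph-style op deliberately omits it.
mlir::LogicalResult MyGraphOp::verify() {
  // No call to the (hypothetical) library dominance check here, e.g.
  //   if (failed(verifyDominanceOfContainedRegions(getOperation())))
  //     return mlir::failure();
  // Instead, check only graph-specific structural properties.
  return mlir::success();
}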

I think 1) is by far the least invasive and seems to be what most people are leaning toward.
I’ll plan to sketch this out and make it work with our out of tree dataflow dialect.

1 and 3 here are my least favorite options: in my opinion this will lead to an awkward second-class-citizen "system inside the system" situation, where most of the MLIR code will be written with the implicit assumption of SSA-dominance and it'll be too easy to write code that is incorrect/unsafe.
Overall this will make the system more fragile: I would prefer that we take the time to make all this more aligned with the rest of the infrastructure.

2 and 4 can be reasonable. With 2, I would go back to the proposal we had last year and take this opportunity to make regions more than just binary: SSA dominance is one aspect; another one can be "convergence" for GPU-like programming models.
Option 4 can also be interesting to explore, but requires more thinking and investment (it is also a more intrusive change): looking at our Operation nesting could lead us to revisit at the same time the single-block regions that are so common, so that an operation could directly enclose a single block without the indirection.

Yep, I agree. This is a great property and very important for MLIR to be able to do (as it already can), just not directly related to the other topics in this thread.

I don't understand this point: MLIR as an infra doesn't know the semantics of any of the operations; that is up to the dialect to define. The passes in MLIR make certain assumptions, and many of the existing ones assume sequential ordering, dominance, etc., but that doesn't mean that all MLIR passes that could exist should have the same assumptions and limitations.

Also, fwiw yes, directly modeling an SR flip-flop is interesting and important! :slight_smile:

Fun fact: the LLVM dominance algorithms are implemented as templates that are defined over any graph. I'm not sure what you mean about DAGs and requiring acyclic graphs though; dominance is correctly defined on loops and other cyclic regions in a CFG. Also, to your other point, I'm aware of loop-closed SSA and implemented all the LLVM support for it many years ago :slight_smile:

I think you’re coming at this from “how do we avoid a change here”, and I’m coming at this from a “graphs with cycles are inherently part of the problem space that a flexible IR like MLIR has to handle; how do we do this in the best way”. I don’t think that introducing “fby/pre” like operators or “breaking loops” is acceptable.

BTW, there is no such thing as causality in a graph. In the circuit world, the whole point here is that everything executes in parallel - there is no ordering. I don’t think there is any notion of causality as you’re referencing.

I agree completely with this; let's do something "right" without being pressured to do something "fast".

Can you explain a bit more about what you mean here? We already have the situation where graph-like structures exist - the tf executor dialect. We do split the NextIteration nodes, so it can conform to the limitations of the current infra, but none of the passes that expect SSA-dominance or flow execution will work correctly on code in this form.

I agree that taking one more step increases the risk of incompatible passes being broken by uses of this, but that is already pervasively the case in MLIR as it is today.

I’m just trying to understand the point you’re coming from here, I’m not implying that any particular solution is better or worse here.

-Chris

Well, all code in MLIR today has the implicit assumption of SSA-dominance, so we certainly need to understand which existing code can operate on graphs. We also have many places where correctness is highly dependent on users understanding how to use the framework properly. MLIR is already (necessarily?) fragile in the sense that using it improperly, without understanding the implications of certain properties, leads to unintended results. I don't think adding a Graph concept changes that fundamentally.

I agree that we need to avoid a 'system within a system', and I think this is going to require some in-depth experience to understand which parts of the existing framework can be easily extended and verified to operate on graphs. Primarily, I'm thinking of (and interested in) the DRR framework as a key piece.

Do they actually miscompile, or safely do nothing since they don’t understand such programs (with no need for any specific extra check)? I believe it is the latter. In the case of a missing “does not obey dominance” check in a pass we would be talking about miscompiles, compiler crashes, etc. (which is what I’m most worried about)

I don’t actually know because I’ve forgotten the details of tf.executor. However, there is nothing that maintains a strong link between the two halves of the nodes, so there is an inherently fragile link involved. I think what you’re implying is correct: you can mark these things as side effecting and otherwise scary to touch in general, but I think my general point stands that there are invariants in the IR that are captured in the definition of the nodes themselves, not expressed in the use-def relation that the edges in the value graph imply.

I don't think this is true. Most code (e.g. all the pattern rewriter infra) should work fine out of the box as far as I know. Perhaps there are minor tweaks, but it would not be an invasive redesign to make this work.

I think it is important to make a distinction here between the existing passes which generally assume a flow, and the infrastructure which does not. It doesn’t make sense to run “LICM” on a general graph IR.

-Chris

It depends what you mean by "work correctly" here. Of course the loop aspect of this can't be recognized by an SSA dataflow analysis; however, this is similar to the fact that any regular CFG pass does not "work correctly" on structured control flow (LICM written for a flat CFG wouldn't be effective, for example, but won't mess up anything either).
However, the tf_executor dialect is, I believe, semantically "correct" with respect to the SSA/dominance constraints (no MLIR pass would break or miscompile it, hopefully).

I think you’re remembering the “control dialect” which was indeed not sound. The tf_executor variant is more robust from this point of view: the two halves are strongly linked via a token and the sink consumes the token, the verifier of each operation ensures that this is correctly linking a pair:

%output, %token, %control = "tf_executor.NextIteration.Source"() : () -> (tensor<*xf32>, !tf_executor.token, !tf_executor.control)
...
"tf_executor.NextIteration.Sink"(%token, %output) : (!tf_executor.token, tensor<*xf32>) -> ()

Yes, I am very supportive of taking this step. I am only looking into having the best possible handling of the risk involved here, and thinking about how we’ll evolve the common infra so that MLIR is as great for these new “cyclic dialects” as it is for the dialects that fits the current constraints.

So I see having a bit that changes the dominance rules as risking that many MLIR passes would just crash when operating on such regions, because they don't bother checking the bit. Of course you could say that this is to be covered in code review, and that is a large part of it, but maybe not all of it.

That is what I mean by "second-class citizen": while relaxing the dominance verification based on a flag on the region alone would enable your out-of-tree dialect to work and every pass you write to operate correctly, it would not do anything to make every other pass upstream "safe" in the face of such inputs. Without more investment it is similar to "just disabling the verifier" for your purpose (hey, since you're out of tree, this is likely a one-line patch to maintain in your fork :wink:)

Also we have an important use case to address around convergence, and if we introduce different kinds of regions, it may be worth taking this as an opportunity to pay some infra cost and not limit us to a binary choice (dominance checking on/off) but have more "kinds" of region allowed.

Concretely, what I proposed last year is that "cfg" would be one "kind" of region (corresponding to LLVM modeling, basically), while "graph" could be another one, and "convergent" a third kind. Just like op registration, passes and analyses should be conservative when facing a region of an unknown kind. This allows us to not define all the possible kinds ahead of time.
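
A minimal C++ sketch of that shape, purely illustrative (no such interface exists in-tree today): a per-region "kind" reported by the op, with the verifier only running the dominance walk for the CFG kind.

// Hypothetical region kinds; dialects could grow more over time.
enum class RegionKind {
  CFG,        // today's semantics: terminators, dominance, execution order
  Graph,      // unordered ops, cycles in use-def allowed
  Convergent, // GPU-style convergence constraints
};

// A hypothetical op interface hook (name illustrative):
//   RegionKind getRegionKind(unsigned regionIndex);
// The global verifier would check dominance only for RegionKind::CFG, and
// passes would treat unknown kinds conservatively, just like unregistered
// ops today.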

You mention here the motivation to have a cyclic graph in a single block to cover hardware circuit use cases, but the "non-CFG" modeling could allow us to be creative with the relationship between the blocks themselves as well (a block without a predecessor is not necessarily unreachable, the first block isn't implicitly the entry block, etc.).
Last year someone was interested in a dialect to model finite state machine for example, and there may be other cases where having this kind of relaxation can be useful.

Instead of modeling this as a relaxation of the dominance relation, I think the problem with dominance can go away if we don't model the things being defined as SSA values (OpResults). You can use a different symbol (a ':=' or a reverse arrow '<-') to specify the creation of something declaratively, if the goal is to model hardware description languages / models. Instead of an SSA value, the thing being created can be modeled as a specific kind of attribute, and other ops could be allowed to reference it. If you have a block / function full of these, you'll have zero operands and results overall, and everything becomes declarative. Ideally, we should have a trait for an operation that is not meant to execute; the current FunctionLike trait is an approximation for that.

op(#struct_2) ->  #struct_1
op(#struct_1) ->  #struct_2

Lots of stuff here. Let me try to address all comments.

First of all, Stephen's sequential dialect proposal looks amazing. [Note for CFG folks that sequential refers to sequential circuits here, and to theories of sequentiality in logic, which are more general but compatible with the imperative notion.] This is much more in line with the current MLIR design and practice, including structured loop control flow, but also compatible with unstructured/flat branching ops, while capturing exactly the same causality-loop-breaking information as the dedicated fby/pre ops I had in mind. It does have one major downside, however: the backward data flow is no longer explicit in the name:

%2_last is implicitly taking the value of %2 but this is hidden in the semantics of sequential.yield.
I think this is a serious problem for value-based analyses, likely to induce some engineering pain, but I'd prefer to be contradicted here!
Now, without the _last suffix, you would get the following:

sequential.op @simple(%2 : i32) {
^bb1 (%2: i32):
  %1 = "testop"(%2) : (i32) -> (i32)
  %2 = "testop"(%1) : (i32) -> (i32)
  sequential.yield(%2)
}

which is the same code the whole thread started from (without the initial argument-renaming branch), except for one very important difference: the usage of sequential.op and sequential.yield instead of branches. Maybe "bb1" should be renamed into "node1" or something more graph-oriented…

Following on Chris’s point about MLIR not imposing any semantics on ops, this is exactly what we could leverage here: the data flow is exactly the same, but instead of control flow semantics induced by branches, we use sequential semantics from sequential circuits.

Back to Stephen's 4 options, I would recommend a more optimistic and minimal-change variant of (2). Ops may already introduce restrictions on their enclosed regions; we can definitely enforce that sequential.yield may only occur within sequential.op, with no branches or other control flow. The optimistic part of the variant is that dominance does not have to be relaxed at all. I mistakenly referred to DAGs only for the dominance property, thanks for catching this Chris. Dominance can be defined on any directed graph, as can the (disjoint) classification of arcs into forward, backward, and cross kinds. What needs to be relaxed, independently of the whole causality-loop-breaking discussion here, is the sequential ordering of ops in blocks. This is the only change to the current MLIR semantics I can see that is needed to model the dataflow and circuits world.

So yes, as Chris points out, I'm coming at this from a "how do we avoid a change here" angle, and the only change I'm supporting so far is relaxing sequential ordering in blocks. The sequential dialect things proposed by Stephen do not seem to require any change, and this sounds like amazing news.
Thanks to this, we don't need to introduce ad-hoc causality-loop-breaking ops like fby/pre anymore, and I think we are in agreement here that this is a big step towards "doing this in the best way".

Because we don't need to relax dominance, much of the follow-up discussion where people including Stephen, Mehdi, Sean, and Uday expressed concerns seems to fold into "nice, we don't have to worry anymore about breaking everything". As Mehdi recalled, just like the "new" tf_executor dialect design allows LICM and any other dataflow-based pass to do no harm on TF graphs (to be verified in practice… but in theory; and of course to do no good either), I believe things will work well with the sequential dialect proposal. The relaxation of implicit op sequencing in blocks is a different story, but I assume it is much more incremental and manageable, and also slightly decoupled from a use case perspective.

Regarding Uday's proposal of using a dedicated family of attributes to represent arbitrary graphs, this can certainly be done. But it would miss out on a huge potential opportunity of consolidating the common dataflow core of SSA, circuits, concurrent dataflow graphs, and state machines under a single IR principle. I think we are very close to a solution here that avoids losing a common semantics where we can have one.

Side note regarding causality in circuits: there are many models, unfortunately largely incompatible and overly complex. A good intro set of slides from the Esterel world: Stephen Edwards's Esterel class. I'm unfortunately not familiar with the Verilog/VHDL/Chisel/Bluespec/… models. But I'm still assuming that the internal modeling of an SR flip-flop is not our priority in this thread. I'm sure it is an important problem, and maybe even one where MLIR has a role to play, but I'd simply rather keep it out of this thread :slight_smile: