[RFC] Allowing Dialects to relax the SSA-dominance condition

Yep, agreed. I think we need to define the expectations. The base expectation of an MLIR pass is that it doesn’t touch things it doesn’t know about - it can be non-functional, but it should be inert in the face of weird things. I agree with you that this is mostly the case for the current tf executor world (but see below).

Ok, but that isn’t safe with the current MLIR infra in general: your dialect is enforcing structural constraints on the SSA graph that are not value-flow dependencies. Splitting into blocks with bb arguments breaks this. I can hear you about to say that this is a single-block region - but single-block regions are also a structural constraint on the IR. :slight_smile:

What this means is that there are classes of passes that will break invariants expected by the dialect - for example, an outlining pass cannot assume that operations can be arbitrarily hoisted out of a function (tf executor is not alone here - dynamic allocas and __builtin_frame_address in C have the same property, etc.). Similarly, a de-predication pass that turns a select into CFG has to know whether the select is in a single-block region (like an affine.if body) and make sure not to touch it, while knowing that it can change a std.func body, etc.
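
To make the de-predication example concrete, here is a rough sketch (illustrative, not taken from an existing pass) of the kind of conservative bail-out such a pass needs; checking the parent op by name is just one way to whitelist regions that are known to tolerate being split into multiple blocks:

#include "mlir/IR/Operation.h"

// Only rewrite a select-like op into control flow when its enclosing region
// is known to tolerate being split into multiple blocks (here: a plain func
// body). Anything else - affine.if bodies, graph regions, ... - is left alone.
static bool canTurnIntoCFG(mlir::Operation *selectLikeOp) {
  mlir::Operation *parent = selectLikeOp->getParentOp();
  return parent && parent->getName().getStringRef() == "func";
}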

While it is theoretically possible to describe every possible invariant using op interfaces, this isn’t really practical, and would just lead to bugs where someone isn’t checking the dozens or hundreds of invariants involved. I think it is far more reasonable to say that certain passes should not be used on certain kinds of code, or to give those passes granular interfaces that allow them to be disabled on code that is scary.

Yes, I agree, this is the crux of the issue. The question is how we model this in a way that has low impact on compiler complexity (no, we should not duplicate Operation! :-) but still makes it easy for passes to be disabled/updated when and if they are found to be buggy on a cyclic dialect. I personally don’t think that code review or “perfect foresight” is possible here. I think it is fine for people to just write lots of code at scale, and people who are assembling compilers will have to deal with the integration problems when they try to do something new and unexpected (like running an LICM pass on a netlist, boggle) :slight_smile:

One thing that I think we may be misunderstanding each other on, though - you seem to think that many passes will crash when given cyclic use-def edges. Why do you think this? I can’t think of an example in LLVM that walks up or down use-def edges in unbounded ways. Passes end up getting limited for compile time reasons (such walks end up being exponential), including to amazingly silly constants like “6”. :slight_smile:
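
For what it’s worth, such a bounded walk might look roughly like this against MLIR’s C++ API (names and the constant are purely illustrative):

#include "mlir/IR/Operation.h"
#include "mlir/IR/Value.h"

// Walk up the def chain looking for some property, giving up after a small
// fixed budget so that a deep or cyclic SSA graph cannot blow up compile time.
static bool matchesUpward(mlir::Value value, unsigned depth = 6) {
  if (depth == 0)
    return false;                     // budget exhausted: give up
  mlir::Operation *def = value.getDefiningOp();
  if (!def)
    return false;                     // block argument: nothing to walk
  // ... inspect `def` here, then keep walking its operands.
  for (mlir::Value operand : def->getOperands())
    if (matchesUpward(operand, depth - 1))
      return true;
  return false;
}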

-Chris

Sure. There is another workaround where instead of:

%x = op(%y)
%y = op(%x)

You change everything to be uses, so you get things like:

%x = newThing
%y = newThing
op(%x, %y)
op(%y, %x)

This is a fully general workaround, but is really gross for several reasons, including IR bloat / memory usage.

The key thing that use-def chains provide in a graph-like IR is the forward-backward edge iteration capabilities, which are really really nice!
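
As a concrete illustration (a sketch against the existing C++ API, not code from any particular pass), those forward and backward edges are exactly what the use-def accessors already give you on each operation:

#include "mlir/IR/Operation.h"

static void visitNeighbors(mlir::Operation *op) {
  // Backward edges: the producer of each operand (null for block arguments).
  for (mlir::Value operand : op->getOperands())
    if (mlir::Operation *producer = operand.getDefiningOp())
      (void)producer;  // ... visit the producer
  // Forward edges: every operation that consumes one of our results.
  for (mlir::Operation *consumer : op->getUsers())
    (void)consumer;    // ... visit the consumer
}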

-Chris

@albertcohen I think you might have misunderstood my proposal. I think this is a useful structured way to describe synchronous circuits:

However it also makes sense to lower this into the unstructured representation:

netlist.module @simple() {
^bb1:
  %1 = "testop"(%3) : (i32) -> (i32)
  %2 = "testop"(%1) : (i32) -> (i32)
  %3 = "netlist.reg"(%2)
  netlist.output(%3)
}

There are some circuit transformations that are more easily done at this level, and it is not uncommon for synthesis tools to start from such a description (i.e. a circuit netlist) and extract a more structured form, which is more convenient for algebraic optimization across pipeline stages.

One thing I haven’t yet convinced myself of is how much bloat it actually represents. It certainly makes matching rules more complex, given the current DRR format, but I don’t have a good feel for how much of a bad thing it is from other perspectives.

As one data point here, when I wrote the generic DCE pass that can detect recursively dead cycles (now folded into the canonicalizer), I did make an assumption during the elimination phase that I could delete ops in reverse topological order from each block and then trim block operands. This avoids hitting the “deleting an op that still has uses” assertion in the compiler.

This is the code in RegionUtils.cpp.

So running the canonicalizer today on something with a recursively dead cycle within a single block would invoke that code and trigger that assertion. It’s actually not hard at all to fix that (just dropAllReferences() before erasing).
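
In other words, erasing a possibly-cyclic set of dead ops becomes a two-phase operation. A minimal sketch of the shape of that fix (not the actual RegionUtils.cpp code):

#include "mlir/IR/Operation.h"
#include "llvm/ADT/SmallVector.h"

static void eraseRecursivelyDeadOps(
    const llvm::SmallVectorImpl<mlir::Operation *> &deadOps) {
  // First break all use-def edges among the dead ops, including any cycles...
  for (mlir::Operation *op : deadOps)
    op->dropAllReferences();
  // ...so that erasing them no longer trips the "op still has uses" assert.
  for (mlir::Operation *op : deadOps)
    op->erase();
}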

This actually goes back to the discussion in “Triggering pass assertions from mlir-opt: allowed?” (#8, by @_sean_silva), which concluded that passes are not allowed to assert or miscompile on any verifier-valid IR they are provided.

Before we take a solid step into the world of “passes can randomly assert/crash/miscompile if given the wrong IR” I would like to understand better what the expectations are, and if we can formalize this more (as a strawman, I suggested associating each pass with a legalization target which runs in addition to the verifier before running that pass).
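
As a sketch of what that strawman could look like (purely illustrative - there is no such hook in the pass manager, and I’m assuming a ConversionTarget::isLegal query of this shape):

#include "mlir/IR/Operation.h"
#include "mlir/Transforms/DialectConversion.h"

// Run before a pass: every op in the payload must be "legal" for the pass's
// declared target, in addition to passing the normal verifier.
static mlir::LogicalResult checkPassPreconditions(mlir::Operation *root,
                                                  mlir::ConversionTarget &target) {
  auto result = root->walk([&](mlir::Operation *op) {
    return target.isLegal(op) ? mlir::WalkResult::advance()
                              : mlir::WalkResult::interrupt();
  });
  return mlir::failure(result.wasInterrupted());
}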

Yes, indeed, I did not fully understand your proposal - neither the first nor the second form, in fact!
I understand now that the “_last” suffix in the structured form encoded the delay information I thought was captured by the “yield” op… And regarding the unstructured lowering step, I now get that the “netlist.reg” op is another syntax for the “pre” op (i.e., a delay op to break cycles with some sort of default/bottom/undefined initialization). I’m concerned with the _last suffix approach in its current form, with no explicit binding of %2 to %2_last. And I’m also concerned that the netlist dialect requires deeper changes than simply relaxing op ordering in blocks, such as recognizing netlist.reg as a causal-loop-breaking op. But none of the forms require relaxing dominance itself, which is good.

To address these concerns, and if I got it right this time, here is an alternate representation of the structured form, at the sequential circuit level, using a “next” terminator op symmetric to and replacing the implicit “_last” suffix:

sequential.circuit @simple(%arg: i32) {
  ^node1(%2: i32):
    %1 = "testop"(%2) : (i32) -> (i32)
    %2 = "testop"(%1) : (i32) -> (i32)
    sequential.yield(%2 : i32)
    sequential.next ^node1(%2 : i32)
}

It may look odd to have both a “yield” op and a “next” terminator op, but now that block ordering is unspecified, it actually makes perfect sense: don’t think of yield as returning in the control flow sense, but only exposing values outside the region in the dataflow sense.

For the unstructured, netlist level, I’m fine with your proposal. It remains to be seen what all the implications are: semantically, on the concrete use-def chain iterators, on dominance computation and the algorithms that depend on it, and on the updated verification logic needed to support such extensions.

Regarding @_sean_silva’s very valid concerns, I believe these changes do not introduce risks of miscompilation, and generally I don’t think we should take steps into a world of wild assert/crash/miscompile. Whenever such behavior occurs, we should step back and make sure we have the proper verification and legalization logic in place. Whenever verification or legalization is impossible, that rings a bell that we have a broken design, I think. Rather than being motivated by a “let’s minimize change here” policy, I’m striving to preserve the strong principles that let MLIR distinguish itself from the many possible JSONs of IR designs.

Hello everybody,

I’d love to see such a syntax in MLIR:

I am trying to embed a synchronous language into MLIR, and this construct seems to contain what is needed to get started: acknowledging a cyclic execution model (SSA-like), along with an identification of the places where I move to the next cycle (sequential.next) and an identification of the loop-carried dependencies (sequential.yield).

It is also a reasonable choice, not going all the way to full-fledged KPNs (where MLIR’s semantics would completely vanish). Instead, it remains at the well-understood level of SSA and synchronous languages, which have simple, deterministic semantics.

However, while working on my small language, I ran into some issues that were less discussed here:

  • Hierarchy - People who do dataflow modeling (e.g. of the digital netlists mentioned above) use it, and dataflow hierarchical composition (called instantiation in synchronous languages) is quite different from a sequential function call, for instance because state is not shared between instances (as opposed to calling stateful functions in C). The modularity conventions used in sequential code generation for synchronous languages provide a common ground, but they restrict the semantics.
  • Consistency/interface with existing dialects - Given that hierarchy is different, how can we interface between code with relaxed semantics and classical code? Calling traditional code from relaxed code (using function call semantics) does not pose a problem. Would the converse be possible? If yes, I’d guess it should involve some lowering of the relaxed constructs to traditional MLIR.

In my work (which is not very advanced), I assumed that the relaxed semantics are specific to my dialect, and that the magic is then done by the lowering step, which must lower the relaxed-semantics hierarchy and schedule the operations into an SSA form that respects the dominance property.

In any case, I am willing to help!

Best regards,
Dumitru Potop

Hi Dumitru,

Glad to see you here and thanks for your feedback. Looking forward to your contributions to this proposal, of course. I mentioned your work/interest earlier in the thread, and I’m glad it resonates with the needs of HDL people.

Note that there was a typo in the quoted code: I forgot to remove the _last suffix from %2. I edited my post, but after you posted yours.

I agree. It seems like @clattner also has an interest in modeling non-sequential circuits such as an SR flip-flop, which would not fit the proposal we are discussing so far, but may build on and extend it, I guess (further relaxation).

I see where you’re coming from. It would be nice to show an example, but in general there should be a way to embed state into a region or function, to implement something like a local static variable in C. How is this best captured in MLIR? Some mangling convention and pushing state to enclosing regions may be a way, but this does not look great.

This is how I see it as well. Calling sequential dataflow (so-called relaxed) code from CFG code would involve lowering. I don’t see a direct path to calling a netlist function from CFG code, however; that is too low level and should go through some wrapper that inputs/outputs data to some sort of IP, more like the host/device interaction in the gpu dialect.

A very simple example: assume your design includes two independent modulo-8 counters (each advancing by 1 modulo 8 at every cycle where its input is different from 0). There are several ways to represent this. The most natural one from a syntactic perspective is incorrect, because the state of the counter is hidden (either as a static variable inside counter8, or as a global variable shared between the two instances):

%c1 = call @counter8(%i1)
%c2 = call @counter8(%i2)

The compilation (lowering) of synchronous languages into imperative code takes a second path, by explicitly creating memrefs that hold the states of the two counters (%state_c1 and %state_c2).

%c1 = call @counter8(%i1,%state_c1)
%c2 = call @counter8(%i2,%state_c2)
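
To make this concrete in imperative terms (my own illustration, not from the original sources), the two forms correspond to:

// Natural but wrong: the state is a hidden static, so the two call sites
// would share a single counter instead of each owning its own.
int counter8_shared(int in) {
  static int state = 0;
  if (in != 0)
    state = (state + 1) % 8;
  return state;
}

// Lowered form: the state is threaded through explicitly, so each instance
// (%state_c1, %state_c2) keeps its own copy.
int counter8_explicit(int in, int &state) {
  if (in != 0)
    state = (state + 1) % 8;
  return state;
}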

Note that if the node that instantiates counter8 two times is itself instantiated multiple times at higher hierarchy levels, the state memrefs must themselves be re-instantiated. This process is tedious, which explains why explicit state instantiation is never written directly in dataflow programming, but is rather exposed during compilation (as in this paper, which proposes an IR for dataflow synchronous programming).

Of course, if MLIR is just a rather low-level IR, then we can assume someone has exposed and instantiated all the states of all nodes in the front-end. But is this the objective? If yes, section 2.2.4 of the linked PDF proposes some properties ensuring that a spec can be lowered towards pthreads-like code. Generating this type of code is my current objective.


@albertcohen Actually, I wasn’t considering _last to be semantically meaningful; I was assuming that all register values (from sequential.yield) were also implicitly outputs. I agree it may make sense to separate the concepts.

@dpotop You bring up some excellent points. Hierarchy and composition between graphs and Regions are important questions. My hope would be that we can relax the framework to the point where we can create dialects that make particular choices about how to address these questions.
Personally, I’m interested in modeling a wide range of concepts. Xilinx devices are highly heterogeneous and are targeted from many languages, which means:

  1. We often need to compose different abstractions to specify a complete system.
  2. We often need to leverage different abstractions at different points in the design flow in order to implement a design specified by a user.

The architectures implemented in programmable logic often must trade off between resources and throughput. Sometimes deeply pipelined and heavily unrolled circuits make sense; at other times hardware sharing is critical. Sometimes extensive static analysis is possible (and necessary!), whereas other times dynamic structures are used. We see MLIR as a framework that can address these problems.


There are a bunch of bad things about this, but the biggest is that almost everything already matches SSA define-before-use directly (e.g. everything in combinatorial logic). Using a representation like the above doesn’t get the benefit of DRR and many other bits of infrastructure.

Moreover, beyond this one use-case, I care about MLIR’s general applicability to a wide range of problem domains. Graphs are not particularly unusual data structures :slight_smile: The TensorFlow graphs already need something like this (but worked around it with a hack), and John McCall mentioned upthread that having something like the proposed capability is useful for frontends. I’m also particularly interested in applying MLIR to more general declarative data representation (cf. JSON and XML applications to non-compiler problems).

I don’t see any drawback to generalizing the framework here.

-Chris

You cannot generally assume non-cyclic structures - at least in LLVM IR, and probably in MLIR as well. Consider this pseudo LLVM IR:

bb1:
  ..
  %cond = cmp lt 0, 0
  condbr %cond, loopblock, other

loopblock:
  %x = phi [someVal, %bb1], [%x2, %loopblock]
  stuff
  %x2 = add %x, 1
  condbr %othercond, loopblock, out

out:

Constant folding will go ahead and constant fold the comparison to false, and other transformations can drop the edge from bb1 to loopblock. When this happens, it deletes the predecessor of loopblock, which drops the inputs to the PHI node:

bb1:
  ..
  ;; deleted: %cond = cmp lt 0, 0
  br other

loopblock:
  %x = phi [%x2, %loopblock]
  stuff
  %x2 = add %x, 1
  condbr %othercond, loopblock, out

out:

but wait, that’s not what things like llvm::BasicBlock::removePredecessor does by default - it actually simplifies away single entry phi nodes. This leads to this IR:

loopblock:
  stuff
  %x2 = add %x2, 1
  condbr %othercond, loopblock, out

Note that the SSA graph for %x2 is a cycle - it is using itself. As it turns out, this is something that I spent a lot of time trying to fight over the years. If you look at the implementation of llvm::BasicBlock::removePredecessor you’ll see a nice comment from 2002 explaining exactly this issue, and some hacky code that attempts to work around it. The problem is that this code isn’t enough: there are lots of reasons why you can have second order cycles in the SSA graph. This all comes down to how dominance is defined for unreachable blocks.

The final conclusion of this mess is that LLVM IR passes are required to handle SSA cycles in code that could be unreachable. I’m not sure if MLIR has encountered the same thing yet, but I see no reason that block arguments change the fundamental issue here. This is inherent to the nature of SSA IRs in a CFG representation.

-Chris

Let me be specific about this. This is completely valid LLVM IR, it passes the verifier and everything:

define i32 @void(i32 %a) {
  %x = add i32 %a, %a
  ret i32 %x

unreachable:
  %y = add i32 %y, %x
  ret i32 %y
}

It looks like MLIR is incorrectly rejecting this:

func @foo(%a: i32) -> i32 {
  %x = addi %a, %a: i32
  return %x: i32

^unreachable:
  %y = addi %y, %x: i32
  return %y: i32
}

Did you mean br ^unreachable instead of return in the second block? In the snippet above, there is no path from the def of %y to its use in the addi. However, in the following, the addi does dominate the use (the only path to the use of %y in the addi goes through the addi itself).

func @foo(%a: i32) -> i32 {
  %x = addi %a, %a: i32
  return %x: i32

^unreachable:
  %y = addi %y, %x: i32
  br ^unreachable
}

But MLIR rejects this as well, and this is a bug (in properlyDominates and, separately, in the Verifier - both require one-line fixes). D78012 fixes properlyDominates.

Why doesn’t it also delete the unreachable block instead of having to break the anti-symmetric property of dominance (which happens with SCCs in unreachable blocks)?

Sure, a self-loop poses the same constraint: the point I’m trying to make here is that dominance is completely undefined in unreachable regions.

The issue here is very subtle and I’m not aware of any simple answers - I struggled with it for years. :grimacing:

The problem is that you want to keep individual optimizations simple and local. For example, you want to maintain properties like “x = phi(bb1: y, bb2: y) can always be replaced with x->replaceAllUsesWith(y)” and “a phi with a single predecessor can always be replaced with its input” and “conditional branches can always be constant folded to unconditional branches”.

None of these simple and local optimizations should have to be aware of global dominance issues, for two reasons: 1) the notion of “unreachable” isn’t a local property - you can have unreachable subgraphs that are self-cyclic, etc.; 2) maintaining and updating something like DominatorSet (or something simpler that just tracks reachability) would be a significant compile-time and complexity burden for passes that transform the CFG.

Also, while the cases above are simple and seemingly obvious, this is a general problem that occurs in much more complex cases. For example, the %x2 = add %x, 1 case above can be a chain of arbitrary computation, which becomes cyclic when the CFG transformations happen.

-Chris

Incidentally, if you’re interested, this whole issue is why you see Visited sets in recursive helper functions like this in LLVM.
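
A minimal sketch of that pattern (illustrative, not lifted from any particular LLVM helper):

#include "llvm/ADT/SmallPtrSet.h"
#include "llvm/IR/Instruction.h"
#include "llvm/Support/Casting.h"

// Any recursion over use-def edges carries a set of already-visited
// instructions, so that a cycle in (possibly unreachable) code terminates
// instead of recursing forever.
static void walkDefs(llvm::Value *v,
                     llvm::SmallPtrSet<llvm::Instruction *, 8> &visited) {
  auto *inst = llvm::dyn_cast<llvm::Instruction>(v);
  if (!inst || !visited.insert(inst).second)
    return;                      // constant/argument, or already seen: stop
  // ... analyze `inst`, then keep walking its operands.
  for (llvm::Value *operand : inst->operands())
    walkDefs(operand, visited);
}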

The dialect conversion framework does not operate well with unreachable code, I believe: it removes unreachable blocks (or ignores them, maybe - I don’t remember). I think that one of the current invariants of dialect conversion is that operands are processed before an operation.