Currently, it seems to me that concurrency is only present in MLIR at the level of basic blocks, and at this level it’s very constrained - the description involves no conditional execution.
Has anyone here considered allowing predication at the level of basic blocks? It’s something that has already been explored in SSA literature. It remains well within the bounds of SSA philosophy (static single assignment) but allows the natural expression of synchronous concurrency (as in synchronous languages, well-formed Simulink, embedded computing, etc.). It may be also interesting in the perspective of HW generation.
The idea is to associate a predicate to every operation of a basic block. By default these predicates are “true” (tautology), which corresponds to the current execution model. Execution follows dependencies, but an operation whose predicate is currently “false” will not be executed, nor assign a value to its outputs.
The difficulty is that initialization and single assignment analysis (inside a basic block, at the very least) would require a Boolean analysis, which is more complicated than the current dominance argument.
The advantage is that this would allow the expression of a significant concurrent model (synchronous) while retaining the semantic bases of MLIR SSA and its execution model.
Together with my PhD student, we can write a note argumenting in this direction, but I wanted to first see if we did not overlook some previous work or some fundamental MLIR choices that can not be changed.
And there must be a chapter in the yet-to-be-published SSA book. François de Ferrière is one obvious reference on these topics.
Generally I would not let go with dominance, this is my usual rant by now, but consider predicates as an additional condition for a value to be available (used to be called a Predicate Query System among EPIC/Trimaran/IMPACT/HP/Intel folks). This is very much aligned with a clock tree in a synchronous language, as you mention. Merge nodes and “predicate synthesis” being typically more general than what you would need for Lustre or well-formed Simulink. I believe there is definitely room for a nice experiment, in a dialect offering dedicated attributes and ops for dealing with predicates in predicate-aware MLIR passes. I don’t think this can be made transparent to all MLIR passes however. I definitely recommend you talk more about this with @Ulysse, among others.
Thanks @albertcohen for the positive feedback. We are really in dire need of such feedback right now, for (as you point out) the problem is not simple. We will appreciate discussing with @Ulysse, here on the forum and outside (visioconference?).
We will look into the references you provided (which are different from the ones we found).
I fully agree with you that dominance should be mostly preserved and extended (e.g. with something similar to the “clock calculi” of synchronous languages or the “predicate query system” you mention). The question is possibly how to do it with minimal disturbance to existing data structures. Various approaches seem to be possible, each one with its own advantages and shortcomings. Again - looking forward to discussing them!
As @albertcohen mentioned, I am looking in that direction too. I would be happy to hear more about what you are proposal. Are you proposing an in-depth modification of MLIR that would add predicates to every operation? Or are you proposing to work on a dialect to define and use predicates?
It seems to me that requiring every dialect to support predicates might be too constraining. Ideally, we can build a modular system that different dialects can opt in. For example, a system (a dialect?) for defining and querying relations between predicate might be interesting on its own, outside of the synchronous context.
One addition to MLIR that would help implementing predicates is to allow values to point to other values they depend on. For example, an array might point to to the size of its dimensions and a predicated value would point to its predicate. You can already do this if you have access to the defining operation of the value, but that is not always the case.
I would be happy to chat on viso with you if you want to share more.
FWIW, it might be worth looking that the LLVM MachineInstr representation of predication. It is also SSA based (before register allocation), and has a simple structure. In addition to the predicate operand, you just have the “predicate false result” be passed in as an operand.
For example, to represent a predicated add instruction, you have something like this:
The nice thing about this is that it fits into the general structure of the IR without requiring predication-specific extensions, and supports all the weird stuff, like operations that require multiple predicates etc.
In MLIR, you could also consider a region based approach to this, e.g.:
Thanks @clattner for pointing this, I had not seen it. A similar construct exists in synchronous programming, where it’s called CONDACT. The problem with this approach is that you are forced to provide these extra values (%ifdisabled in your first fragment), and that you must use operations such as std.select to represent the Phi-like places where a value is written on multiple exclusive conditions (these are actually called Psy to distinguish them, like in the references provided by @albertcohen). I don’t know if efficient code can be directly extracted, in any case it’s not straightforward. Furthermore, there are problems that are not solved by this approach, such as the handling of side effects.
This is a bit disconnected from the predication question, but I am not sure I understood what you mean with “at the level of basic blocks”?
We explored concurrency last year in the context of TensorFlow data flow graphs, the way we modeled it is with “asynchronous” operations. You still have dominance but you can see every new SSA value implicitly as a “future”.
This is also close to how TFRT is expressing concurrency using MLIR by the way.
@mehdi_amini: Thanks for your input. Can you give us some more pointers and possibly explanations? For now I have found the TFRT link, but it’s not connected to MLIR, and MLIR does not seem to define the AsyncValue class (my pull is from 2-3 weeks ago). Maybe it’s another branch?
Furthermore, I did not find (yet) a document on the execution model of tensorflow/runtime. The C++ std::future uses asynchronous operations which run in parallel with the thread. Also, it seems that with AsyncValues you can have more than one object waiting on the future value. Does that mean you can represent general dataflow execution? Somehow, the “main thread” is an elaboration code and execution is handled by a runtime?
Thanks for the reference and hint. Experience with Psi-SSA and the like in synchronous languages shows that it does come with significant burden and adaptation of data-flow analyses and traditional SSA-based optimizations. It may be profitable in backends, typically after a first register allocation pass (like spilling only rather than the full thing, which itself could be SSA-based), but not necessarily higher up. In MLIR it would make sense as a dialect with minimal changes to the dominance computation and other core infra, but probably nothing more intrusive. We’ll have a chat with @dpotop and make a proposal.
Everything is asynchronous and the order of the operation is only guaranteed by the data flow. To sequence operations with side effects, “chains” are used. In the example above, none of the “print” are synchronized, so they can execute concurrently and we don’t know in which order they will print.
This does not break MLIR dominance concept, one way to interpret the code is that each of these operation is “implicitly” just enqueuing computation into to a global thread pool.
I am wondering if there is an effort to continue with the ideas this thread here?!
I am looking for a way to add a predicate on intrinsic style instructions when going from SCF to standard - perhaps it would be possible to generate a control dependence graph with only self-loops as suggested in the literature? Thoughts?