I’m Julian from TU Darmstadt, and my past and potentially future research interest is in modulo schedulers. To that end, I was wondering if a generic interface for static operation schedulers for HLS-like compile flows would fit into CIRCT.
A rough outline of what I mean is below. No code is written yet, but I wanted to ask about this early to find out if anyone else a) would be interested in this, or b) is already working on something similar.
Operations with a graph region implement a SchedulableOpInterface
getOperatorType(Operation*) returns one (or more, if the scheduler is meant to choose among them) suitable operator types (see below) for each of the nested operations.
getSchedulingDependences() returns a list of precedence constraints relevant for scheduling
An OperatorType corresponds to an item in your HLS component library, and specifies:
latency: number of clock cycles until the result is available. Example: 0 = combinatorial, > 1 = multi-cycle
blocking time: number of clock cycles until next input is accepted. Example: 1 = fully pipelined, equal to latency = not pipelined
incoming/outgoing delay: physical propagation time in first / last cycle
available units: an optional limit, if at most a fixed number of units of this operator type should be instantiated
resources: LUTs, DSPs, …, die area per operation instance
A Dependence comprises a source and destination Operation* and an integer distance which is required for pipeline scheduling (e.g. dst depends on the result from src from 3 iterations / samples / etc. ago).
The SSA graph already encodes most of the dependences (e.g. an operation’s operands need to be computed before its result can be used), but there can be additional constraints, such as enforcing an ordering of memory accesses.
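(For reference, in a modulo-scheduling formulation with candidate initiation interval II, such a dependence typically induces the constraint `start(dst) >= start(src) + latency(src) - II * distance`.)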
A concrete scheduler (such as the ModuloSDC algorithm) would be implemented as an analysis that can be queried for the nested operations’ start times, and optionally (if a particular algorithm computes that info) the number of required operator instances, the binding of operations to operator instances, and the initiation interval. This information would be used by a conversion from an untimed dialect into a timed dialect like HIR.
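To make this a bit more concrete, here is a rough C++ sketch of the two POD types and the interface surface I have in mind. All names and signatures are placeholders, not a finalized API:

```cpp
// Rough sketch only -- all names/signatures are placeholders, not a
// finalized CIRCT API.
#include "mlir/IR/Operation.h"
#include "llvm/ADT/SmallVector.h"
#include <optional>

namespace circt {
namespace sched {

/// An item in the HLS component library.
struct OperatorType {
  unsigned latency;      // cycles until the result is available (0 = comb.)
  unsigned blockingTime; // cycles until the next input is accepted
                         // (1 = fully pipelined, == latency = not pipelined)
  float incomingDelay;   // physical propagation time in the first cycle
  float outgoingDelay;   // physical propagation time in the last cycle
  std::optional<unsigned> availableUnits; // optional limit on instances
  // resources (LUTs, DSPs, die area, ...) would go here as well
};

/// A precedence constraint relevant for scheduling.
struct Dependence {
  mlir::Operation *src;
  mlir::Operation *dst;
  unsigned distance = 0; // iterations/samples; > 0 only matters for
                         // pipeline (modulo) scheduling
};

// The op interface would then roughly expose:
//   SmallVector<OperatorType> getOperatorType(mlir::Operation *op);
//   SmallVector<Dependence>  getSchedulingDependences();

} // namespace sched
} // namespace circt
```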
Glad to see you are interested in this. I’m still thinking through the outline you proposed, but in general I am very interested in having these facilities in CIRCT, and I think an OpInterface is a great way to set them up. Your proposal is more specific, but I think very much in line with the high-level goals I proposed in this area back in early January (slides, talk).
If you look at the next steps from the last slide there, I have been slowly chipping away at the bullet about getting a matmul working with the Handshake dialect and dynamic scheduling. As of today, we’re a couple patches away from having hardware that at least simulates for memories and loops… A more complex example still fails (we have been discussing here if you’re curious), but it’s close.
Once the Handshake flow works, and we can empirically measure the resource utilization and performance with dynamic scheduling, I plan to focus on the static scheduling world. I would like to help out with the design and implementation of these features!
I think you are doing exactly the right thing by proposing this on the forum early. While I think more about it, here are a couple quick thoughts to get some discussion going:
Regarding Dependence, have you looked into what we can re-use here from core MLIR tools? I’m asking because I haven’t, but I assume there is something there already, potentially in the Affine dialect, or in other analyses.
I see you mentioned lowering into a timed dialect like HIR, which makes sense to me, but where were you imagining that would lower from? I.e. what operations would define this interface?
We’d definitely love to have your contributions in this area! I have a few questions:
It seems that you’re assuming that an imperative program has already been lowered to a graph region? How are you assuming this happens? Today in a graph region, much of the high-level structure that probably needs to be used for scheduling has already been lost. For instance, at the Affine loop level it is relatively easy to identify memory dependencies between load and store operations that are not visible in SSA form. However, after the loop has been lowered to a graph region, this analysis is not so obvious. Are you assuming that dependencies would be carried through to a graph region (probably a Handshake dialect model?) and be queried with getSchedulingDependences()? My assumption is that the regions that would be scheduled would still be imperative regions (e.g. an affine.for loop).
One of the things I’d like to ‘get right’ this time around is to properly handle multi-input, multi-output operations (e.g. a multiply-accumulate). In this case the latency for each operand and result needs to be specified independently, rather than just representing the ‘latency’ of an operator. Do you think this can be incorporated into your work?
A simple place to start might be to schedule operations within a StaticPipeline operation. This would enable bootstrapping the machine model and basic scheduling infrastructure without being concerned with modulo dependencies.
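For instance, without modulo dependencies, even a strawman ASAP scheduler over the POD types sketched above would be enough to exercise the interface. Everything below (including `scheduleASAP` itself) is hypothetical, just to illustrate the shape:

```cpp
// Strawman ASAP schedule: every op starts as early as its incoming
// dependences allow. Assumes `ops` is in topological order and all
// dependence distances are 0 (no modulo/pipeline edges).
#include "mlir/IR/Operation.h"
#include "llvm/ADT/ArrayRef.h"
#include "llvm/ADT/DenseMap.h"
#include "llvm/ADT/STLExtras.h" // llvm::function_ref
#include <algorithm>

llvm::DenseMap<mlir::Operation *, unsigned>
scheduleASAP(llvm::ArrayRef<mlir::Operation *> ops,
             llvm::ArrayRef<circt::sched::Dependence> deps,
             llvm::function_ref<unsigned(mlir::Operation *)> latency) {
  // Group dependences by destination so each op sees its incoming edges.
  llvm::DenseMap<mlir::Operation *,
                 llvm::SmallVector<circt::sched::Dependence, 4>>
      incoming;
  for (const auto &dep : deps)
    incoming[dep.dst].push_back(dep);

  llvm::DenseMap<mlir::Operation *, unsigned> startTime;
  for (mlir::Operation *op : ops) {
    unsigned t = 0;
    // Start no earlier than the finish time of every predecessor.
    for (const auto &dep : incoming[op])
      t = std::max(t, startTime[dep.src] + latency(dep.src));
    startTime[op] = t;
  }
  return startTime;
}
```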
Yes, this interface would fit perfectly into the flow you proposed – anywhere you need a static schedule, you could plug in scheduling algorithms with different objectives, runtimes and guarantees of optimality.
I had not, but this is interesting. The affine dialect provides the required info (mlir::DependenceComponent – the “dependence distance” there is exactly what would go into the distance field in my Dependence POD type). Maybe I don’t need an explicit representation of these scheduling edges… A scheduling algorithm could instead query an operation’s SSA operands and affine dependences.
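For example, something along these lines could translate an affine memory dependence into my POD type. The API names are from MLIR’s affine dependence analysis (mlir/Analysis/AffineAnalysis.h) as I currently understand it; please treat the exact signatures as an assumption to double-check against your MLIR revision:

```cpp
// Sketch: query MLIR's affine dependence analysis and turn a constant
// dependence distance into the proposed Dependence POD.
#include "mlir/Analysis/AffineAnalysis.h"
#include "mlir/Analysis/AffineStructures.h"

void collectAffineDep(mlir::Operation *src, mlir::Operation *dst,
                      unsigned loopDepth,
                      llvm::SmallVectorImpl<circt::sched::Dependence> &out) {
  mlir::MemRefAccess srcAccess(src), dstAccess(dst);
  mlir::FlatAffineConstraints constraints;
  llvm::SmallVector<mlir::DependenceComponent, 2> components;
  mlir::DependenceResult result = mlir::checkMemrefAccessDependence(
      srcAccess, dstAccess, loopDepth, &constraints, &components);
  if (!mlir::hasDependence(result))
    return;
  // The component at the pipelined loop's depth carries the distance; use
  // it if it is a known, non-negative constant (lb == ub >= 0).
  const auto &comp = components[loopDepth - 1];
  if (comp.lb && comp.ub && *comp.lb == *comp.ub && *comp.lb >= 0)
    out.push_back({src, dst, static_cast<unsigned>(*comp.lb)});
}
```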
I come from the oldschool C-based HLS world, so I usually think in terms of control-dataflow graphs that are constructed from a CFG-like representation (undergoing if-conversion), and encode memory dependencies (surfaced by alias and/or dependence analysis on the source IR) as additional graph edges. I believe such a CDFG-like abstraction does not exist yet in CIRCT, and I don’t know whether it should. An affine.for appears to carry the required high-level information and could be scheduled directly during the conversion to a timed dialect.
TBH, as of now, the interface is just about what a broad class of static scheduling algorithms needs/expects, but I am highly interested in making it usable across different synthesis flows in CIRCT.
Yes! A possible modelling would be:
Number the operations’ inputs and outputs
Associate an offset with each input/output (wlog there should be at least one input and output with offset=0)
Extend the Dependence edges to go from (src, index) to (dst, index’)
A MAC component (a*b+c) that completes after 10 cycles and requires the operand c for the addition after 6 cycles could then be represented as an operator type with latency=10, operand_offset[0]=0, operand_offset[1]=0, operand_offset[2]=6, …
This would keep the clean separation of concerns between the dependence graph and the operator info, which I personally like, e.g. to try out different component libraries on the same graph. (The simpler alternative would be to annotate the per-port offsets directly on the edges.) A rough sketch is below.
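Concretely, the OperatorType POD from my first post could grow per-port offsets (again, hypothetical names):

```cpp
// Hypothetical extension of the OperatorType sketch with per-port offsets;
// indices follow the operation's operand/result numbering.
struct OperatorType {
  unsigned latency;                              // e.g. 10 for the MAC
  llvm::SmallVector<unsigned, 4> operandOffsets; // cycle each operand is read
  llvm::SmallVector<unsigned, 4> resultOffsets;  // cycle each result is ready
};

// The MAC example from above: a and b consumed in cycle 0, c in cycle 6,
// result available after 10 cycles.
// OperatorType mac{/*latency=*/10,
//                  /*operandOffsets=*/{0, 0, 6},
//                  /*resultOffsets=*/{10}};
```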
Thanks, I will look into this.
Awesome! I’ll try to have some demonstrator test cases ready by then.
Makes sense, but I do think it will help to push this forward if there are specific use-cases. A couple angles I’m interested in personally:
Coming in from the machine learning frameworks that use MLIR. These are all working towards the linalg dialect, and from there we can use upstream tools to massage the IR into a form that is suitable for this kind of analysis/transformation.
Coming in from clang. Specifically, I’m investigating Polygeist’s mlir-clang tool, and so far it seems quite good. I know others in the CIRCT community are involved in Polygeist, and it seems like a promising way to get to the affine level for those interested in the oldschool C-based route.
I think either way we already have existing tools that can produce actual IR at the level you care about, so I’m happy to be a source of use-cases as the need arises.
This is absolutely something I’ve been thinking about and would like to see! I like the idea of it being an operator interface. There’s a distinct possibility that I’ll need this information for an internal project I’m working on.
FYI: we’ve been thinking about having one or more dialects with ops to model device primitives (DSPs, M20Ks, etc.). An op interface like this could be applied to all those “device” ops to great benefit.
Coming a little late to the party here, but some of us have (very) briefly discussed possible higher-level representations for delay. The objective would be to bring an explicit representation of pipelining and multi-buffering to the tensor level with scf.for/yield, and mix that with (tiled) named linalg ops on tensors. The rationale is that we have been quite happy with structured transformations that take advantage of SSA use-def chains + (very) late bufferization.
No concrete thoughts are formed on our end yet but I am wondering if you see opportunities for reusable infrastructure all the way up to the tensor level?
Thanks for weighing in @nicolasvasilache. I am definitely interested to see if we can share infrastructure at the tensor level. Right now, the only way to enter CIRCT is through the standard dialect’s CFG, but I am personally very interested in building up higher levels of abstraction that ultimately connect directly with all the great work being done on tensors.
As you say, this isn’t really a concrete thought, but for a loose example: I would love to use something like the linalg codegen strategy to implement something along the lines of AutoSA at the tensor level, and then drop down to generate a hardware description with CIRCT.