This is a thought I’ve been toying with and mentioned before, so I want to start a discussion thread. It’s not really an RFC yet, but I am happy to make a full proposal if the gist of what I’m talking about seems generally useful.
There are situations where we may want to lower the arithmetic and binary operations in the Standard dialect into corresponding operations in combinational logic. For example, std.addi gets lowered to firrtl.add in HandshakeToFIRRTL, as do several other operations. Another example is the SCFToCalyx lowering work, which needs to convert Standard operations into something Calyx understands; one proposal there is to create a library of Calyx primitives (which is what the existing Rust compiler does).
My thought is we could approach this how Linalg does it. Specifically, I’m talking about one of the properties of structured linalg ops: “Property 5: May Map To an External Library Call”. A structured Linalg op can be lowered into a well-defined loop nest and standard operations, or it can be lowered into a library call. The analogy in CIRCT would be to lower Standard operations into Comb logic, or a set of primitives.
To roughly sketch how this could work, I’m imagining a library with a bunch of RewritePatterns for converting Standard ops we care about into Comb ops. It would also have a mechanism to register primitives to lower into (which could also be accomplished with RewritePatterns, for example). We could expose a function to run the conversion, which would either use the primitives or fall back to generating Comb logic. This could be put into a standalone -lower-standard pass, or used directly by other things that need it.
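To make that concrete, here’s a minimal sketch (hypothetical names, not actual CIRCT code) of what one such RewritePattern and its registration hook might look like:

```cpp
#include "circt/Dialect/Comb/CombOps.h"
#include "mlir/Dialect/StandardOps/IR/Ops.h"
#include "mlir/IR/PatternMatch.h"

using namespace mlir;

namespace {
/// Default lowering: std.addi -> comb.add. Both operate on signless
/// integers of the same width, so the rewrite is 1:1.
struct AddILowering : public OpRewritePattern<AddIOp> {
  using OpRewritePattern<AddIOp>::OpRewritePattern;

  LogicalResult matchAndRewrite(AddIOp op,
                                PatternRewriter &rewriter) const override {
    // comb.add takes the same two integer operands.
    rewriter.replaceOpWithNewOp<circt::comb::AddOp>(op, op->getOperand(0),
                                                    op->getOperand(1));
    return success();
  }
};
} // namespace

/// Hypothetical registration hook. A flow with its own primitives would
/// add its patterns first (or with a higher benefit) and keep these as
/// the fallback.
void populateStdToCombPatterns(RewritePatternSet &patterns) {
  patterns.add<AddILowering /*, MulILowering, AndILowering, ... */>(
      patterns.getContext());
}
```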
Besides just lowering, such a library could provide other important information about what the operations lower into. For example, implementations (either in Comb or primitives) could have an associated latency, which the Scheduling infrastructure could query.
Any thoughts or feedback welcome. I’m hoping this could be useful for several things in tree: HandshakeToFIRRTL, SCFToCalyx, and upcoming Scheduling work.
I think such an infrastructure could be a great utility for any HLS-related flow in CIRCT, as a way of “applying” a specific target operator library to a behavioural description in the std dialect. I particularly like the concept of having the default comb mapping with the option to refine it to target-specific primitives.
Is your primary interest here the simple std → comb cases, or more complicated cases like what linalg does? If it is the simple things, then such an approach could be significant overkill.
In any case, I’d recommend switching away from generating firrtl.
I guess the idea is that different flows accept Standard and want to turn some of the common operations into corresponding Comb operations. I was hoping to put something together that makes this easy in CIRCT.
I think there are other situations where those Standard ops might be converted into operations from a library of primitives, which could be Calyx, Xilinx device primitives, or something else. The idea I was taking from Linalg is to have both paths supported easily: either a default lowering to Comb, or the ability to hook into a library.
Either way, I definitely wouldn’t want to target FIRRTL, this would be for flows that want to be “CIRCT native” and could target Comb or other dialects.
Disclaimer: I’m not following the Calyx work, so I may be missing major things.
I agree that it is highly valuable to have different dialects mixed and matched together in a module. This is a strong goal of the “std” stuff on the SW side, and the “hw” dialect on the HW side.
How do you see the tradeoffs between having a single monolithic “lower to HW” concept vs. dialect-specific lowerings? For example, the mapping from Xilinx device primitives to instances seems like it could be a specific, separable thing. Also, there is a related but different jump from the imperative SW domain to the implicitly parallel HW domain that seems separable.
Do you think these are the same problem, or are they separable?
I do think those are separable. I’m not advocating a monolithic lowering to “hw” from “sw”, but I’m observing different flows take the arithmetic and binary operations from std and convert them in a straightforward way to corresponding ops in a “hw” dialect. I’m not really talking about the branching and control flow operations in std, just the arithmetic and binary operations that show up frequently.
For example, one flow we’re looking at is to take a mixture of SCF and std, and start lowering that “to hardware”. What to do with control flow, pipelining, etc. is a whole other proposal, but at some point before emitting Verilog we’ll need to convert things like std.muli into a hardware dialect. I was imagining we could have a set of patterns for default conversions, like std.muli->comb.mul. Or, if you have a dialect with your own set of primitive multipliers, those could be plugged in instead.
Does this scope make sense? For starters I was imagining writing some RewritePatterns to do the simple conversion to Comb that could be plugged into whatever pass wants them. Down the road I was thinking that could be extended to other dialects, incorporate scheduling information, etc.
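As a rough illustration of the “plugged into whatever pass wants them” part, a consuming pass might drive these patterns with a partial dialect conversion (populateStdToCombPatterns is the hypothetical hook sketched earlier):

```cpp
#include "mlir/Transforms/DialectConversion.h"

/// Hedged sketch: lower only the arithmetic/binary std ops to comb,
/// leaving control flow (out of scope for this proposal) untouched.
LogicalResult runStdToCombConversion(ModuleOp module) {
  MLIRContext *ctx = module.getContext();

  ConversionTarget target(*ctx);
  target.addLegalDialect<circt::comb::CombDialect>();
  target.addIllegalOp<AddIOp, MulIOp, AndOp, OrOp, XOrOp>();

  RewritePatternSet patterns(ctx);
  populateStdToCombPatterns(patterns);

  return applyPartialConversion(module, target, std::move(patterns));
}
```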
Does this imply that the multiply is always combinational? While this conversion may seem 1:1, I’m not sure it is. I imagine if this is part of the Combinational dialect, then its latency should be 0 cycles? However, I don’t think that’s what we actually want (always) here. Let me know if I’m missing something obvious here.
We ran into some similar (?) problems when designing the Calyx standard library. For example, an exp operator could be implemented to optimize for speed, LUT usage, space, etc. What exactly is a standard exp operator? What is its latency? We could take the C library approach and say “this is the standard, feel free to not use it.” However, I’m not sure that’s what we want.
These are all similar to the questions you note in #1800 as well, and I think they could apply to even simpler operations.
This is a good question, and certainly not all multiplies are combinational. I’m looking at it like this: everything in the comb dialect is combinational by definition, and if you choose to use the comb dialect in your primitive library, you are selecting one point in the design space.
I’m really just trying to provide the information needed by Julian’s scheduling infrastructure. If you want to schedule based on purely combinational primitives, so be it. If you want to use the Calyx standard library, great. If you come up with a vendor-specific dialect of primitives, that should work too.
I’m becoming less interested in having “default” lowerings to CIRCT core dialects, and “library-call” lowerings layered on top for specific things. In my mind, the core dialects and any other dialects should be equally first class in this “primitive library” infrastructure. It’s really about providing a generic way to answer the questions needed to schedule a computation onto a set of limited primitive resources.
I’m also thinking about how to capture the different design tradeoffs (throughput, latency, area, power, etc.). Multiple primitives may be functionally correct for a given operation, but some may be more beneficial than others given what you are optimizing for. I think we can express this in terms of cost models and benefits, just like how multiple RewritePatterns may match the same IR but have different benefits.
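One way to express that, sticking with the RewritePattern analogy, is to register several functionally-equivalent lowerings for the same op with different benefits. The pattern names below are hypothetical, assumed to be defined like AddILowering in the earlier sketch:

```cpp
void populateMulLoweringPatterns(RewritePatternSet &patterns) {
  MLIRContext *ctx = patterns.getContext();
  // Target-specific primitive, preferred whenever it matches.
  patterns.add<MulIToVendorMacro>(ctx, /*benefit=*/2);
  // Generic comb.mul fallback.
  patterns.add<MulIToCombMul>(ctx, /*benefit=*/1);
}
```

The pattern driver tries higher-benefit patterns first, so the vendor primitive wins whenever it applies. A real cost model would of course need to be richer than a single scalar benefit.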
Well said.
This makes sense. As far as providing the “correct” information for a scheduling infrastructure, wouldn’t it be a matter of defining certain attributes such as latency? It seems anything that is going to be used for scheduling should be required to provide certain information (well, if it is statically scheduled).
Yes, in order to perform static scheduling, we need to know a couple of things: latency and resource limits. These apply to the primitives we will lower into, but we need to be able to answer the question at the high-level IR. For example, given a std.muli, what is the latency of the primitive it will lower into? It’s not enough to say that a calyx.std_mult_pipe takes 3 cycles; we also need a way to indicate that std.muli maps into calyx.std_mult_pipe. In other words, we need to associate high-level ops with primitive ops, and attach latency and limit information to the association.
One way to handle this could use an OpInterface and the relatively new mechanism for externally attaching an interface to an operation. Say we define a SchedulableOpInterface with methods getLatency and getLimit. A dialect that wants its primitives to be available for scheduling could define a “primitive library” that uses attachInterface to implement the SchedulableOpInterface for the high-level operations that can map into its primitives. A pass that wants to statically schedule some high-level IR would need to “load” a primitive library so the interface is attached to the appropriate ops, and then query the interface to construct the scheduling problem.
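A minimal sketch of that first option, assuming a SchedulableOpInterface has been declared (via ODS, say) with getLatency and getLimit; the names and concrete numbers are placeholders:

```cpp
namespace {
/// One "primitive library" entry: states that std.muli maps to a
/// pipelined Calyx multiplier with a latency of 3 cycles and a single
/// shared instance.
struct MulISchedulingModel
    : public SchedulableOpInterface::ExternalModel<MulISchedulingModel,
                                                   MulIOp> {
  unsigned getLatency(Operation *op) const { return 3; }
  unsigned getLimit(Operation *op) const { return 1; }
};
} // namespace

/// "Loading" the library attaches the interface to the high-level ops.
void loadCalyxPrimitiveLibrary(MLIRContext &ctx) {
  MulIOp::attachInterface<MulISchedulingModel>(ctx);
}
```

After loading, the scheduling pass can simply dyn_cast<SchedulableOpInterface>(op) on the high-level IR and query it.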
Another approach might be to just capture all this information in a data structure that can be queried. In this case, a dialect could populate a “primitive library” with (high-level Operation class, latency, limit) tuples, and provide methods to query for a given Operation: getLatency(Operation *), getLimit(Operation *).
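For the second option, a sketch of what the registry could look like (all names hypothetical):

```cpp
#include "llvm/ADT/DenseMap.h"
#include "mlir/IR/Operation.h"
#include <optional>

struct PrimitiveInfo {
  unsigned latency; // cycles through the primitive
  unsigned limit;   // maximum concurrent instances (0 = unlimited)
};

class PrimitiveLibrary {
public:
  /// Called by a dialect to register the primitive a high-level op maps to.
  void add(mlir::OperationName name, PrimitiveInfo info) {
    entries[name] = info;
  }

  std::optional<unsigned> getLatency(mlir::Operation *op) const {
    auto it = entries.find(op->getName());
    if (it == entries.end())
      return std::nullopt;
    return it->second.latency;
  }
  // getLimit(Operation *) would look identical.

private:
  llvm::DenseMap<mlir::OperationName, PrimitiveInfo> entries;
};
```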
The second option seems like the simplest way to capture the necessary information for scheduling, but I’m intrigued by the attachInterface functionality. Any other ideas?
I’m intrigued by the options here too… I think this is worth spending the time to try out. I would suggest focusing on the intrinsic characteristics of the operators (like latency and initiation interval (II)) and avoiding properties that are likely to come from design constraints (like instantiation limits). Note that one of the trickiest things about modeling an operation library comes from the fact that it is inherently bitwidth-parameterized. Despite @clattner’s objection, it will become important to model operations with different bitwidths here, which means a large characterization space is somewhat inherent.
That makes sense to me, and I’ll stick to things intrinsic to the operators for now. I was mentioning limit because it’s something the scheduling infrastructure can use, but I agree that this will probably come from somewhere else, based on the target.
This is great, I totally agree with the goals of, and ideas for, this infrastructure.
Could we maybe design the infrastructure in a way that optionally allows to defer this tradeoff to the scheduling phase? As you know, my pet peeve is combining the HLS steps into a single optimisation problem, and I have previously modelled these aspects (component selection, number of operator instances) as part of scheduling problems (which I also plan to bring to CIRCT). Solving these combined problems is hard, but it’s on my research agenda to find good heuristics, and having realistic benchmark instances from CIRCT-based flows would be amazing!
That sounds great to me. I’m glad you have experience with this and want to bring a solid approach to CIRCT. Happy to help push this forward.
Do you have any thoughts on the ideas presented above? I’m personally pretty fond of the idea of having the Scheduling infrastructure define interfaces. This could allow other dialects in-tree or out to implement the interface, or for passes/tools/etc. to use attachInterface if they want to implement it for a dialect they might not even own.
Yes, that’s a really elegant solution. I see the ability to have arbitrary code in these interface methods as a tangible advantage over the data structure-based approach; I’m thinking of hierarchical designs here where the interface implementation for call (or hw.instance) operations looks up a previously scheduled operator’s latency.
One important aspect that I think is currently missing in your prototypes is the association of operations to what I called OperatorType in scheduling::Problem: For resource-constrained problems, we need the ability to determine the subsets of operations that shall share the limited number of operator instances.
Currently, OperatorType is just a type alias for mlir::Identifier, and its properties are stored within the specific scheduling problem classes. We could remodel the operator types as an actual class hierarchy, and let the SchedulableOpInterface have only one method to return references to suitable operator types (managed by specific passes/tools/etc.).
Maybe (haven’t fully thought this through) it would even make sense to have a second interface OperatorTypeOpInterface (not a good name), which we would then attach to comb ops or hw.module to provide the properties.
Agreed. I think I will focus on the interface-based approach.
That’s a great point, and I had a shower thought along these lines. I can rework the interface prototype accordingly. If it makes sense to evolve OperatorType into an interface as well, we can definitely do that. But even just reworking the SchedulableOpInterface to have a method like OperatorType getOperatorType(Operation *) should help it fit more cleanly with the Scheduling infrastructure as it is today.
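For reference, a hedged sketch of what the scheduler-side query could then look like, assuming the reworked (still hypothetical) SchedulableOpInterface and the Scheduling infrastructure’s existing Problem API:

```cpp
// Inside whatever pass constructs the scheduling problem:
if (auto schedulable = dyn_cast<SchedulableOpInterface>(op)) {
  // OperatorType is currently just an alias for mlir::Identifier.
  Problem::OperatorType opr = schedulable.getOperatorType();
  prob.insertOperation(op);
  prob.setLinkedOperatorType(op, opr);
  // Intrinsic properties would come from the loaded primitive library;
  // the value here is a placeholder.
  prob.setLatency(opr, 3);
}
```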