[RFC] Split Pipeline Dialect and Add Representation for Sequential Loop Scheduling

Separate Pipeline While and Pipeline Pipeline into different dialects

  • The current Pipeline dialect contains two different levels of abstraction
    • pipeline.pipeline
      • Represents a scheduled linear pipeline
      • RTL-like level of abstraction
    • pipeline.while
      • Represents a scheduled loop pipeline
      • HLS-like level of abstraction
  • Essentially no code is shared between the two levels of abstraction
  • We could potentially lower from pipeline.while to pipeline.pipeline; sharing one dialect between two levels of abstraction that have a clear lowering direction does not match the normal structure of MLIR dialects
  • The dialect should be split in two: one that focuses on the RTL-like retimeable pipeline and one that represents HLS-like (but not HLS-only) loop scheduling
  • A simplified Pipeline dialect opens up the ability to introduce Filament-like (paper, documentation) type checking for RTL pipelines

Combine Pipeline While with a new representation for unpipelined loops to produce the LoopSchedule Dialect

  • Enables the representation of pipelined loops nested inside unpipelined loops (common in many machine learning workloads); see the sketch after this list
  • Enables the representation of a sequence of pipelined or unpipelined loops
  • Also enables loops to be mixed with other basic op types (add, mul) and function calls
  • Only two more ops are needed to achieve this: SeqWhile (whose iterations execute sequentially) and Step
    • Together with PipelineWhile and Stage, this makes a total of four main ops in the dialect
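
To make the nesting concrete, here is a rough sketch of a pipelined loop inside an unpipelined loop. The ls.pipeline_while and ls.pipeline_stage names are hypothetical (final op names and syntax are still open), and types and loop conditions are elided:

ls.seq_while iter_args(...) {
  // outer, unpipelined loop: only one iteration in flight at a time
  ...
} do {
  ls.step {
    ls.pipeline_while II = 1 iter_args(...) {
      // inner, pipelined loop condition
      ...
    } do {
      ls.pipeline_stage { ... }  // modulo-scheduled stages
      ls.pipeline_stage { ... }
      ls.terminator ...
    }
    ls.register
  }
  ls.terminator ...
}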

SeqWhile and Step Rationale

Both of these operations are heavily inspired by the Pipeline dialect, but tailored to their application in unpipelined scheduling. SeqWhile represents an unpipelined loop, meaning that the II of the loop is equal to the loop body latency (i.e. we only start a new iteration once the entire loop body has completed). Step represents a control step in the schedule. The operations in a control step run in parallel, and control steps at the same nesting level of the IR run sequentially.

To illustrate this idea further, we will look at an example scheduled vadd design:

func.func @vadd(%arg0: memref<8xi32>, 
    %arg1: memref<8xi32>, %arg2: memref<8xi32>) {
    ls.step {
      %c0 = arith.constant 0 : index
      %c8 = arith.constant 8 : index
      %c1 = arith.constant 1 : index
      ls.seq_while iter_args(%arg3 = %c0) : (index) -> () {
        %0 = arith.cmpi slt, %arg3, %c8 : index
        ls.register %0 : i1
      } do {
        %0:3 = ls.step {
          %2 = arith.addi %arg3, %c1 : index
          %3 = memref.load %arg0[%arg3] : memref<8xi32>
          %4 = memref.load %arg1[%arg3] : memref<8xi32>
          ls.register %2, %3, %4 : index, i32, i32
        } : index, i32, i32
        %1 = ls.step {
          %2 = arith.addi %0#1, %0#2 : i32
          ls.register %2 : i32
        } : i32
        ls.step {
          memref.store %1, %arg2[%arg3] : memref<8xi32>
          ls.register
        }
        ls.terminator iter_args(%0#0), results() : (index) -> ()
      }
      ls.register
    }
    return
}

Step operations can be nested inside a function or a SeqWhile op. All of the ops in a given Step start at the same time, and the Step finishes when all control flow contained in it has finished. Although not shown in this example, multi-cycle ops start in their given step, but we do not wait for their results to be completed before the step finishes. Instead, their values must only be produced before any step that uses them runs. This allows other steps to run simultaneously with the multi-cycle op.
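
For example, here is a minimal sketch (with hypothetical values %a, %b, %c, %i and memrefs %mem0, %mem1) in which a multiply, assumed to map to a multi-cycle multiplier, starts in the first step while the second step runs concurrently; only the third step, which uses the result, must wait for it:

%0 = ls.step {
  %m = arith.muli %a, %b : i32  // assume this maps to a multi-cycle multiplier
  ls.register %m : i32
} : i32
ls.step {
  // independent work that overlaps with the in-flight multiply
  memref.store %c, %mem0[%i] : memref<8xi32>
  ls.register
}
ls.step {
  // first use of %0: the multiply result must be ready before this step runs
  memref.store %0, %mem1[%i] : memref<8xi32>
  ls.register
}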

The difference in semantics between Step and Stage operations is very important because, in the general case, we cannot know the runtime of a loop (loops may be unbounded): a Stage has a fixed, statically known latency, while a Step may contain nested loops whose runtime is unknown. We would not nest another loop inside of a pipeline, but we absolutely would nest pipelines and other sequential loops inside of a sequential loop.

This small set of additional operations allows us to express a wide range of scheduled programs. Step ops can also support conditional execution of the form ls.step when %0.
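
For instance, a guarded step might look like the following small sketch, which reuses %arg3 and %c0 from the vadd example and assumes a hypothetical %init value:

%is_first = arith.cmpi eq, %arg3, %c0 : index
ls.step when %is_first {
  // only executes on iterations where %is_first is true
  memref.store %init, %arg2[%arg3] : memref<8xi32>
  ls.register
}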

Lowering LoopSchedule to Calyx

Although LoopSchedule could be lowered to a number of other dialects, our initial goal is lowering to Calyx. Calyx does a good job of letting us describe the kinds of operations we want to represent in LoopSchedule, which makes lowering much easier. PipelineWhile ops can already be lowered to Calyx through existing passes, and we will extend these passes to allow joint lowering of pipelined and unpipelined loops to Calyx. To lower SeqWhile and Step ops, each core operation (add, mul, etc.) is translated into a Calyx group, each Step op is translated into a par block in Calyx, and each SeqWhile is translated into a while loop. There are a number of edge cases that need to be handled, but the general concept is straightforward.
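
As a sketch of this mapping, the loop body of the vadd example above might lower to Calyx control roughly as follows (group definitions elided, group names hypothetical):

control {
  while lt.out with cond_group {
    seq {
      par { incr_group; load0_group; load1_group; } // first ls.step
      par { add_group; }                            // second ls.step
      par { store_group; }                          // third ls.step
    }
  }
}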

Why a LoopSchedule Dialect instead of Separate PipelinedLoop and UnpipelinedLoop Dialects

  • Can share several ops, such as register and terminator
  • Simplifies the mixing of unpipelined and pipelined loops
  • Lowering passes are more consistent: pipelined and unpipelined loops in the same file cannot be lowered independently, so lowering must happen as one combined pass

We would greatly appreciate feedback on this proposal and will be discussing it in more detail at the CIRCT ODM meeting on March 22nd.


I haven’t read the filament work in depth yet, but I agree with the general goals to better represent and schedule pipelines for loops, and I’m excited to hear more about this proposal in the ODM.

Regarding the desire to separate this into a new dialect: I could personally go either way on this. It’s not a huge deal right now since this is all pretty experimental, and dialects are easy to create/rearrange, but some food for thought…

First, some backstory… the current Pipeline dialect has evolved over time. It was originally called the StaticLogic dialect, with a single pipeline operation. That operation was eventually removed, the pipeline.while operation was added, and the newest pipeline.pipeline operation came next. This has all evolved as people have had time and energy to work on it, so further improvement is most welcome.

You mentioned there is no code sharing, and this is true, but does it always have to be this way? Perhaps as part of this effort, we can better unify the representations. You pointed to two different passes: one that takes pipeline.pipeline to the HW dialect, and one that takes pipeline.while to Calyx. There is no reason the other directions can’t exist as well; we could lower pipeline.pipeline to Calyx, or pipeline.while to HW.

I’m sort of playing devil’s advocate here. I can also see how carving out the parts related to loop scheduling into their own dialect, which can be targeted from an HLS flow and lowered to Calyx, keeps things well scoped.

I guess I’m curious how the proposed LoopSchedule dialect would interact with the “simplified pipeline dialect”, if at all? Would they interact, or was the mention of filament just to point out that it doesn’t make sense on the parts of the current pipeline dialect related to loop scheduling?

Separately, I’m interested in the proposal for the sequential while and step ops. On the surface they look so much like the pipelined while and stage ops that I might suggest adding attributes to indicate the differences in behavior. But you mention important semantic differences, so I think it makes sense to define separate ops for this. I’m curious to hear more about how they differ, and what they share. Perhaps there are some OpInterfaces, or at least helper functions, that can help share implementation, for example.

Anyway, just some initial thoughts from me. I’m very excited to see where this goes.


Thanks for the feedback!

We could (and probably should) support passes from pipeline.pipeline to Calyx and pipeline.while to HW, although I’m not sure how much code we would actually be able to share here, given that pipeline.pipeline uses comb dialect ops and has explicit clock/reset values. I think having the shared code in something like CalyxLoweringUtils would probably be good enough from a code-sharing perspective. In theory, lowering pipeline.pipeline should be quite a bit easier than pipeline.while since there are no back edges (represented as iter_args in pipeline.while), so I’m not sure that sharing code between these two lowering passes would actually reduce code size that much. All of this is to say that I don’t think sharing lowering passes for these ops is the worst idea; I just don’t think it gives enough benefit to outweigh the clarity improvement from splitting this into two dialects.

The main point of mentioning Filament was to say that restricting the Pipeline dialect to linear pipelines could open up opportunities to ensure the correctness of those pipelines. The main goal of Filament is to give types to linear pipelines in such a way that, if a design type checks, it is guaranteed to be safely pipelined. This kind of type checking doesn’t make much sense at the pipeline.while level because those ops do not have explicit loop control logic yet. So, restricting the Pipeline dialect to linear, RTL-like pipelines would make adding something like Filament type checking much easier.
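
To give a flavor of what this could look like (illustrative pseudocode only, not actual Filament syntax): each port is annotated with the interval of cycles, relative to an abstract start event G, during which its value is valid, and a design only type checks if every consumer reads each value inside its validity interval.

// Illustrative pseudocode; see the Filament paper for the real syntax.
// An adder whose inputs must be valid during cycle [G, G+1) and whose
// registered output becomes valid one cycle later.
comp Add<G>(left: [G, G+1] 32, right: [G, G+1] 32) -> (out: [G+1, G+2] 32)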

We could also envision an eventual lowering pass from PipelineWhile to pipeline.pipeline that instantiates explicit loop control logic, essentially transforming the PipelineWhile into a linear, RTL-like pipeline.

Yeah, I’ve thought a lot about this too, and I think these new ops have different enough semantics that it is justified for them to be separate ops. If there were a reasonable way to share the ops, I would not be opposed to it, but I think it would just create more confusion than it solves. At least with separate ops it is immediately clear which loops are pipelined and which are not. I do think there could be some OpInterfaces or helper functions to share implementation details, but I would have to think a bit more about this while implementing.

Thanks again for the feedback, just wanted to give my quick thoughts here.


By ‘scheduled’, you probably mean ‘scheduleable’. pipeline.pipeline supports both – if scheduling doesn’t occur, you end up with a combinational block, but you can (read: should) schedule it into pipeline stages. There are passes to schedule it.

Can you better define ‘pipelined’ vs ‘unpipelined’? By ‘unpipelined’ do you mean combinational? Or yet-to-be scheduled/pipelined? Non-pipelinable? By ‘pipelined’, do you mean automatically scheduled? Or does manual pipelining count? Pipelinable?

All RTL is re-timeable, which means the registers can change position within the pipeline but the number of pipeline stages cannot. Is this truly what you mean? If not (i.e. the number of stages can be modified), I think pipeline.pipeline is what you’re talking about.

Sorry to be so nit-picky on language. I tend to interpret things literally and differently than how people mean them, so using precise, well-defined language is important for me.


Yes, pipeline.pipeline can represent both a scheduleable and a scheduled pipeline. I should have been more clear here.

By unpipelined I mean that the II of the loop is equal to the latency of the loop body, which is equivalent to saying that there is only one in-flight loop iteration at a time. I am not sure whether there is better terminology for this than unpipelined, but the term comes from traditional HLS, where such a loop has not had a “pipeline pragma” applied. These kinds of unpipelined loops can be scheduled using plain resource-constrained scheduling rather than the modulo scheduling required by pipelined loops.

You are right, I used the wrong wording here. I am not sure what the correct term is (maybe hyperpipelining, although I think that is an Intel-specific term). Functionally, I mean splitting the dialect such that the current dialect contains pipeline.pipeline and the new dialect contains what was pipeline.while.

No worries, I appreciate the feedback and hope that clears things up some. Please let me know if I am still not being clear.