Problematic behavior of the --convert-affine-to-loopschedule pass

Hi everyone! I’m testing the loopschedule dialect and ran into a problem. First, I create this function with affine ops:

func.func @elementwise_add(%a: memref<4xi8>, %b: memref<4xi8>, %c: memref<4xi8>) {
  affine.for %i = 0 to 4 {
    %a_elem = affine.load %a[%i] : memref<4xi8>
    %b_elem = affine.load %b[%i] : memref<4xi8>
    %sum = arith.addi %a_elem, %b_elem : i8
    affine.store %sum, %c[%i] : memref<4xi8>
  }
  return
}
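
For reference, I invoke the pass roughly like this (a sketch assuming the function above is saved as vec-add.mlir and circt-opt is on the path):

circt-opt --convert-affine-to-loopschedule vec-add.mlir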

After running the pass, I get this:

func.func @elementwise_add(%arg0: memref<4xi8>, %arg1: memref<4xi8>, %arg2: memref<4xi8>) {
  %c0 = arith.constant 0 : index
  %c4 = arith.constant 4 : index
  %c1 = arith.constant 1 : index
  loopschedule.pipeline II =  1 trip_count =  4 iter_args(%arg3 = %c0) : (index) -> () {
    %0 = arith.cmpi ult, %arg3, %c4 : index
    loopschedule.register %0 : i1
  } do {
    %0:3 = loopschedule.pipeline.stage start = 0 {
      %2 = memref.load %arg0[%arg3] : memref<4xi8>
      %3 = memref.load %arg1[%arg3] : memref<4xi8>
      %4 = arith.addi %arg3, %c1 : index
      loopschedule.register %2, %3, %4 : i8, i8, index
    } : i8, i8, index
    %1 = loopschedule.pipeline.stage start = 1 {
      %2 = arith.addi %0#0, %0#1 : i8
      memref.store %2, %arg2[%arg3] : memref<4xi8>
      loopschedule.register %2 : i8
    } : i8
    loopschedule.terminator iter_args(%0#2), results() : (index) -> ()
  }
  return
}

However, I think this output is actually incorrect: in the last stage, the memref.store uses the addition result produced in that same stage. The subsequent --lower-loopschedule-to-calyx pass then fails because of this. If I manually rewrite the generated code to this:

func.func @elementwise_add(%arg0: memref<4xi8>, %arg1: memref<4xi8>, %arg2: memref<4xi8>) {
  %c0 = arith.constant 0 : index
  %c4 = arith.constant 4 : index
  %c1 = arith.constant 1 : index
  loopschedule.pipeline II =  1 trip_count =  4 iter_args(%arg3 = %c0) : (index) -> () {
    %0 = arith.cmpi ult, %arg3, %c4 : index
    loopschedule.register %0 : i1
  } do {
    %0:3 = loopschedule.pipeline.stage start = 0 {
      %2 = memref.load %arg0[%arg3] : memref<4xi8>
      %3 = memref.load %arg1[%arg3] : memref<4xi8>
      %4 = arith.addi %arg3, %c1 : index
      loopschedule.register %2, %3, %4 : i8, i8, index
    } : i8, i8, index
    %1 = loopschedule.pipeline.stage start = 1 {
      %2 = arith.addi %0#0, %0#1 : i8
      // memref.store %2, %arg2[%arg3] : memref<4xi8>
      loopschedule.register %2 : i8
    } : i8
    loopschedule.pipeline.stage start = 2 {
      memref.store %1, %arg2[%arg3] : memref<4xi8>
      loopschedule.register
    }
    loopschedule.terminator iter_args(%0#2), results() : (index) -> ()
  }
  return
}

then the lowering to Calyx works. Could you take a look at this problem?
Besides this, I still have the following problems (rough invocations are sketched after the list):

  1. The -lower-calyx-to-fsm pass doesn’t work because there is a ParOp.
  2. The -lower-calyx-to-hw pass doesn’t work because there is a GroupOp, and the -calyx-remove-groups pass doesn’t work either.
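
Roughly, the invocations I am trying look like this (a sketch assuming the corrected loopschedule IR is saved as vec-add.mlir; the exact pass pipelines may differ):

circt-opt --lower-loopschedule-to-calyx --lower-calyx-to-fsm vec-add.mlir
circt-opt --lower-loopschedule-to-calyx --calyx-remove-groups --lower-calyx-to-hw vec-add.mlir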

I think the lowering is intended; AddOp is currently considered combinational by the AffineToLoopschedule pass. But I can reproduce the error in -lower-loopschedule-to-calyx. @andrewb1999 @matth2k, is this lowering expected, and if so, is some assumption in LoopScheduleToCalyx broken?

There are still some known issues that need to be resolved to support this end to end. For example, support for ParOp is an open issue that no one has worked on yet: [CalyxToFSM] Add `par` FSM lowering · Issue #4606 · llvm/circt · GitHub.

Thanks for the reply. If we put the AddOp and the StoreOp in the same stage, I think that cannot be done in one clock cycle, so maybe it violates the II = 1 assumption?

As for this, well, I think so, because I ran into the same problem last year lol. Btw, the link in that issue is a 404 now.

Another question: could I build the same thing with the pipeline dialect instead, so that I can avoid these issues entirely?

If the AddOp lowers to a combinational circuit, I don’t see why the result of the addition couldn’t be stored every cycle. That seems to be the intent of the current scheduling, but I’ll let the others speak up about what is intended here.

That is certainly an option if you want to avoid the Calyx path completely. But there is currently no transformation from the higher-level dialects like Affine to the Pipeline dialect, so you would have to handle the scheduling and implement that lowering yourself.

Makes sense, I’ll definitely try that.

Thanks for looking into this @HahaLan97! The way we’ve been using the LoopSchedule path right now is to generate native Calyx from the program and compile it using the Calyx compiler.

If you’re interested in trying that out, please take a look at the CIRCT documentation for native Calyx.
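
Very roughly, that flow looks something like the following; this is a sketch from memory, so please check the Calyx and CIRCT docs for the exact flags and backends:

# emit native Calyx from the CIRCT Calyx dialect
circt-opt --lower-loopschedule-to-calyx vec-add.mlir | circt-translate --export-calyx -o vec-add.futil
# compile the native Calyx program with the Calyx compiler
calyx vec-add.futil -b verilog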

Mike is right that we need to support the par (and, in fact, the newly added static par) operation in Calyx to make the flow work purely within CIRCT, but that requires more engineering hours than we have access to. If you’re interested in fleshing some of this out, I’d be happy to help discuss the implementation details.

I’ll try out the pipeline dialect first. And yeah, definitely, I would love to help, but I need to get more familiar with Calyx first :wink: I only have some experience from playing around with it last year.

Sounds good! FWIW, we’re working on a new HLS tool that uses those abstractions (calyx, pipeline, loopschedule), so we’re planning to support them long term as well.

That would be great!

@HahaLan97 quick update on this: I’ve added a new --calyx-native pass to circt-opt that calls out to the native Calyx compiler automatically and lowers things like ParOps for you. The native Calyx compiler can be installed with a single command, cargo install calyx, after which the --calyx-native pass will start working.
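
Concretely, something along these lines should be all that’s needed (assuming the corrected loopschedule IR from earlier is in vec-add.mlir; the exact pass list may vary):

cargo install calyx
circt-opt --lower-loopschedule-to-calyx --calyx-native vec-add.mlir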

I’m going to attempt to compile your example program through the pipeline and report back!

I’ve verified that once this PR is merged, the following command can lower the corrected loopschedule code all the way to Verilog:

./build/bin/circt-opt --lower-loopschedule-to-calyx --calyx-native --lower-calyx-to-hw --lower-hw-to-sv --lower-seq-to-sv --export-verilog vec-add.mlir

It does require the native compiler, but that is just a simple cargo install calyx away, and you’re good to go after that.


Hi! Thanks for all the effort! I’ve been busy working on my own project these days, but I’ll definitely look at this later!