Execution time for custom canonicalization optimizations in MLIR

Hi There,
We have made a custom dialect with some ops, and we also perform some canonicalization optimizations on these ops. We want to measure the execution time with and without these canonicalizations – with no other LLVM or Clang optimizations applied.

Our pipeline is: Our Dialect → Affine → MLIR LLVM dialect → LLVM IR, and then we use clang-17 to compile the code. Our steps can be summarized as below:

  1. Get the optimized LLVM IR: `./OurBin1 file1.mlir -emit=llvm --canonOpt`

  2. Get the executable: `clang-17 -O0 file.ll -o fileexe -lm`

  3. Measure execution time: `time ./fileexe`, or Python's `time.time()` with `subprocess.run()`
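The three steps can be driven from one Python script, roughly like this (a sketch only; `OurBin1`, `file1.mlir`, and the clang invocation are from the steps above, and the executable names are illustrative placeholders):

```python
# Rough sketch of steps 1-3 above. ./OurBin1 and its flags are our
# own tool; the executable names are illustrative placeholders.
import subprocess
import time

def build(canon: bool) -> str:
    exe = "fileexe_opt" if canon else "fileexe_noopt"
    flags = ["--canonOpt"] if canon else []
    # Step 1: lower our dialect to LLVM IR (writes file.ll).
    subprocess.run(["./OurBin1", "file1.mlir", "-emit=llvm", *flags],
                   check=True)
    # Step 2: compile with no further optimization.
    subprocess.run(["clang-17", "-O0", "file.ll", "-o", exe, "-lm"],
                   check=True)
    return exe

def run_timed(cmd) -> float:
    # Step 3: wall-clock time of one run (includes process startup).
    t0 = time.perf_counter()
    subprocess.run(cmd, check=True)
    return time.perf_counter() - t0
```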

But the two runs, with and without the optimizations, give very similar execution times. How do I capture and measure the difference?

For quick reference, here is the difference between the loops.

Without optimization:

    affine.for %arg0 = 0 to 2 {
      %cst_18 = arith.constant 0.000000e+00 : f64
      affine.store %cst_18, %alloc_0[%arg0] : memref<10xf64>
    }
    affine.for %arg0 = 0 to 8 {
      %1 = affine.load %alloc_3[%arg0] : memref<10xf64>
      %2 = affine.apply affine_map<(d0) -> (d0 + 2)>(%arg0)
      affine.store %1, %alloc_0[%2] : memref<10xf64>
    }
    affine.for %arg0 = 0 to 4 {
      %cst_18 = arith.constant 0.000000e+00 : f64
      affine.store %cst_18, %alloc[%arg0] : memref<10xf64>
    }
    affine.for %arg0 = 0 to 6 {
      %1 = affine.load %alloc_0[%arg0] : memref<10xf64>
      %2 = affine.apply affine_map<(d0) -> (d0 + 4)>(%arg0)
      affine.store %1, %alloc[%2] : memref<10xf64>
    }

With Optimization:

    affine.for %arg0 = 0 to 6 {
      affine.store %cst_2, %alloc[%arg0] : memref<10xf64>
    }
    affine.for %arg0 = 0 to 4 {
      %1 = affine.load %alloc_4[%arg0] : memref<10xf64>
      affine.store %1, %alloc[%arg0 + 6] : memref<10xf64>
    }

How can I actually measure the execution-time difference caused by the optimization? Any suggestions would be great.

If that’s the entirety of your code, it runs very fast because it doesn’t do much. If you want to measure something with this code, you probably need to wrap it in another loop that executes a ~million times and look at cumulative time differences. Also beware of machine effects such as caching and CPU frequency scaling.
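Concretely, something like the following (a sketch; the binary names are placeholders) tends to be more reliable than a single `time` reading: warm up first, repeat many times, and compare the minimum or median:

```python
# Repeated-measurement harness: warm up, then time many runs and
# report robust statistics instead of a single sample.
import statistics
import subprocess
import time

def bench(cmd, repeats=30, warmup=3):
    for _ in range(warmup):  # prime caches, page tables, etc.
        subprocess.run(cmd, check=True, capture_output=True)
    samples = []
    for _ in range(repeats):
        t0 = time.perf_counter()
        subprocess.run(cmd, check=True, capture_output=True)
        samples.append(time.perf_counter() - t0)
    return min(samples), statistics.median(samples)

# e.g. compare bench(["./fileexe_noopt"]) against bench(["./fileexe_opt"])
```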

Hi @ftynse ,
Thanks a lot for the quick reply.
I want to capture the execution time for even this small kind of difference. To do so, I am increasing the loop bounds, which I assume should produce at least some measurable difference, since the optimized version has fewer loop iterations as well as fewer load/store instructions. Basically, I am trying with the following IR.
More time-consuming version:

    affine.for %arg0 = 0 to 2 {
      %cst_18 = arith.constant 0.000000e+00 : f64
      affine.store %cst_18, %alloc_0[%arg0] : memref<100000xf64>
    }
    affine.for %arg0 = 0 to 99998 {
      %1 = affine.load %alloc_3[%arg0] : memref<100000xf64>
      %2 = affine.apply affine_map<(d0) -> (d0 + 2)>(%arg0)
      affine.store %1, %alloc_0[%2] : memref<100000xf64>
    }
    affine.for %arg0 = 0 to 4 {
      %cst_18 = arith.constant 0.000000e+00 : f64
      affine.store %cst_18, %alloc[%arg0] : memref<100000xf64>
    }
    affine.for %arg0 = 0 to 99996 {
      %1 = affine.load %alloc_0[%arg0] : memref<100000xf64>
      %2 = affine.apply affine_map<(d0) -> (d0 + 4)>(%arg0)
      affine.store %1, %alloc[%2] : memref<100000xf64>
    }

Same logic with fewer operations:

    affine.for %arg0 = 0 to 6 {
      affine.store %cst_2, %alloc[%arg0] : memref<100000xf64>
    }
    affine.for %arg0 = 0 to 99994 {
      %1 = affine.load %alloc_4[%arg0] : memref<100000xf64>
      affine.store %1, %alloc[%arg0 + 6] : memref<100000xf64>
    }

Still, I am not able to see any measurable time difference.
Here are my questions:

  1. Do you have any suggestions on how I can measure this small a difference (maybe using a single core, disabling caching, etc.)? I am using a Python script for time measurement. To drop caches I am doing this:
    subprocess.run("sudo sh -c 'sync; echo 3 > /proc/sys/vm/drop_caches'", shell=True, check=True)
    and to run on a single core:
    command2 = "taskset -c 0 ./fileexe"

  2. Any other suggestions for measuring the time difference from a loop-fusion-style transformation, i.e., one run with and one without loop fusion?

  3. I am also using the createLoopFusionPass() and createAffineScalarReplacementPass() passes. Is there any example where I can actually see the impact of these passes on execution time? I am okay with using mlir-cpu-runner as long as I can see the time difference for the different loop/affine-level optimizations.
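As an aside on my own setup: I realize `drop_caches` clears the kernel page cache rather than the CPU caches, and that each `./fileexe` run includes process startup, which can easily dwarf a few hundred thousand loads/stores. So besides looping the kernel inside the program, I am experimenting with a small helper (a sketch; the samples would come from repeated pinned runs as in point 1) to decide whether two timing distributions really differ:

```python
# Decide whether two sets of run times actually differ: compare the
# medians against the combined run-to-run spread (noise).
import statistics

def median_gap(a, b):
    """Return (median difference, True if it exceeds the noise)."""
    diff = statistics.median(a) - statistics.median(b)
    noise = statistics.stdev(a) + statistics.stdev(b)
    return diff, abs(diff) > noise

# Made-up samples in seconds: a ~1.04 s gap vs ~0.06 s of spread,
# so the difference would be considered real:
# median_gap([2.00, 2.05, 2.10], [1.00, 1.01, 1.02])
```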

Thanks a lot for the help.