Execution time for custom canonicalization optimizations in MLIR

Hi There,
We have made a custom dialect with some ops, and we also perform some canonicalization optimizations on these ops. We want to measure the execution time with and without these canonicalizations – with no other LLVM or Clang optimizations applied.

Our pipeline is: Our Dialect → Affine → MLIR LLVM dialect → LLVM IR, and then we use clang-17 to compile the code. Our steps can be summarized as below:

  1. Get the optimized LLVM IR: `./OurBin1 file1.mlir -emit=llvm --canonOpt`

  2. Get the executable: `clang-17 -O0 file.ll -o fileexe -lm`

  3. Measure execution time: `time ./fileexe`, or Python's `time.time()` with `subprocess.run()`
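The three steps can be driven from one Python script, roughly like this (a sketch only; `OurBin1`, `file1.mlir`, and the clang invocation are from the steps above, and the executable names are illustrative placeholders):

```python
# Rough sketch of steps 1-3 above. ./OurBin1 and its flags are our
# own tool; the executable names are illustrative placeholders.
import subprocess
import time

def build(canon: bool) -> str:
    exe = "fileexe_opt" if canon else "fileexe_noopt"
    flags = ["--canonOpt"] if canon else []
    # Step 1: lower our dialect to LLVM IR (writes file.ll).
    subprocess.run(["./OurBin1", "file1.mlir", "-emit=llvm", *flags],
                   check=True)
    # Step 2: compile with no further optimization.
    subprocess.run(["clang-17", "-O0", "file.ll", "-o", exe, "-lm"],
                   check=True)
    return exe

def run_timed(cmd) -> float:
    # Step 3: wall-clock time of one run (includes process startup).
    t0 = time.perf_counter()
    subprocess.run(cmd, check=True)
    return time.perf_counter() - t0
```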

But the two runs, with and without the optimizations, give very similar execution times. How do I capture and measure the difference?

For quick reference, here is the difference between the loops.

Without optimization:

    affine.for %arg0 = 0 to 2 {
      %cst_18 = arith.constant 0.000000e+00 : f64
      affine.store %cst_18, %alloc_0[%arg0] : memref<10xf64>
    }
    affine.for %arg0 = 0 to 8 {
      %1 = affine.load %alloc_3[%arg0] : memref<10xf64>
      %2 = affine.apply affine_map<(d0) -> (d0 + 2)>(%arg0)
      affine.store %1, %alloc_0[%2] : memref<10xf64>
    }
    affine.for %arg0 = 0 to 4 {
      %cst_18 = arith.constant 0.000000e+00 : f64
      affine.store %cst_18, %alloc[%arg0] : memref<10xf64>
    }
    affine.for %arg0 = 0 to 6 {
      %1 = affine.load %alloc_0[%arg0] : memref<10xf64>
      %2 = affine.apply affine_map<(d0) -> (d0 + 4)>(%arg0)
      affine.store %1, %alloc[%2] : memref<10xf64>
    }

With Optimization:

    affine.for %arg0 = 0 to 6 {
      affine.store %cst_2, %alloc[%arg0] : memref<10xf64>
    }
    affine.for %arg0 = 0 to 4 {
      %1 = affine.load %alloc_4[%arg0] : memref<10xf64>
      affine.store %1, %alloc[%arg0 + 6] : memref<10xf64>
    }

How can I actually measure the execution-time difference caused by the optimization? Any suggestions would be great.

If that’s the entirety of your code, it runs very fast because it doesn’t do much. If you want to measure something with this code, you probably need to wrap it in another loop that executes a ~million times and look at cumulative time differences. Also beware of machine effects such as caching and CPU frequency scaling.
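Concretely, something like the following (a sketch; the binary names are placeholders) tends to be more reliable than a single `time` reading: warm up first, repeat many times, and compare the minimum or median:

```python
# Repeated-measurement harness: warm up, then time many runs and
# report robust statistics instead of a single sample.
import statistics
import subprocess
import time

def bench(cmd, repeats=30, warmup=3):
    for _ in range(warmup):  # prime caches, page tables, etc.
        subprocess.run(cmd, check=True, capture_output=True)
    samples = []
    for _ in range(repeats):
        t0 = time.perf_counter()
        subprocess.run(cmd, check=True, capture_output=True)
        samples.append(time.perf_counter() - t0)
    return min(samples), statistics.median(samples)

# e.g. compare bench(["./fileexe_noopt"]) against bench(["./fileexe_opt"])
```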

Hi @ftynse ,
Thanks a lot for the quick reply.
I want to capture the execution time for even this small kind of difference. To do so, I am increasing the loop bounds, which I assume should produce at least some measurable difference, since the optimized version has fewer loop iterations as well as fewer load/store instructions. Basically, I am trying with the following IR.
More time-consuming version:

    affine.for %arg0 = 0 to 2 {
      %cst_18 = arith.constant 0.000000e+00 : f64
      affine.store %cst_18, %alloc_0[%arg0] : memref<100000xf64>
    }
    affine.for %arg0 = 0 to 99998 {
      %1 = affine.load %alloc_3[%arg0] : memref<100000xf64>
      %2 = affine.apply affine_map<(d0) -> (d0 + 2)>(%arg0)
      affine.store %1, %alloc_0[%2] : memref<100000xf64>
    }
    affine.for %arg0 = 0 to 4 {
      %cst_18 = arith.constant 0.000000e+00 : f64
      affine.store %cst_18, %alloc[%arg0] : memref<100000xf64>
    }
    affine.for %arg0 = 0 to 99996 {
      %1 = affine.load %alloc_0[%arg0] : memref<100000xf64>
      %2 = affine.apply affine_map<(d0) -> (d0 + 4)>(%arg0)
      affine.store %1, %alloc[%2] : memref<100000xf64>
    }

Same logic with fewer operations:

    affine.for %arg0 = 0 to 6 {
      affine.store %cst_2, %alloc[%arg0] : memref<100000xf64>
    }
    affine.for %arg0 = 0 to 99994 {
      %1 = affine.load %alloc_4[%arg0] : memref<100000xf64>
      affine.store %1, %alloc[%arg0 + 6] : memref<100000xf64>
    }

Still, I am not able to see any measurable time difference.
Here are my questions:

  1. Do you have any suggestions on how I can measure this small a difference (maybe using a single core, disabling caching, etc.)? I am using a Python script for time measurement. To drop caches I am doing this:
    subprocess.run("sudo sh -c 'sync; echo 3 > /proc/sys/vm/drop_caches'", shell=True, check=True)
    and to run on a single core:
    command2 = "taskset -c 0 ./fileexe"

  2. Any other suggestions for measuring the time difference from a loop-fusion-style transformation, i.e., one run with and one without loop fusion?

  3. I am also using the createLoopFusionPass() and createAffineScalarReplacementPass() passes. Is there any example where I can actually see the impact of these passes on execution time? I am okay with using mlir-cpu-runner as long as I can see the time difference for the different loop/affine-level optimizations.
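As an aside on my own setup: I realize `drop_caches` clears the kernel page cache rather than the CPU caches, and that each `./fileexe` run includes process startup, which can easily dwarf a few hundred thousand loads/stores. So besides looping the kernel inside the program, I am experimenting with a small helper (a sketch; the samples would come from repeated pinned runs as in point 1) to decide whether two timing distributions really differ:

```python
# Decide whether two sets of run times actually differ: compare the
# medians against the combined run-to-run spread (noise).
import statistics

def median_gap(a, b):
    """Return (median difference, True if it exceeds the noise)."""
    diff = statistics.median(a) - statistics.median(b)
    noise = statistics.stdev(a) + statistics.stdev(b)
    return diff, abs(diff) > noise

# Made-up samples in seconds: a ~1.04 s gap vs ~0.06 s of spread,
# so the difference would be considered real:
# median_gap([2.00, 2.05, 2.10], [1.00, 1.01, 1.02])
```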

Thanks a lot for the help.