We applied loop tiling by hand to a matrix multiplication C program and saw a clear improvement in execution time. So we wrote a loop tiling pass, and it runs successfully. However, we do not see the same improvement with the pass. In the pass order, it runs after the “Print Module IR” pass at the -O3 optimization level. Can anyone help us understand why the pass does not give the same speedup?
It’d be hard to help without the source code. But you can do a few things to guide your team:
- Take a small example, untiled and tiled, run both through upstream LLVM `opt`, and print the final IR.
- Take the same examples, run them through `opt` with your pass, and compare the resulting IR with the tiled version.
If the result of your pass on the same untiled loop does not produce the same (optimal) IR, look for the deltas as clues to what happened.
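As a starting point for that comparison, here is a minimal sketch of what the two inputs could look like. The function names, the flat row-major layout, and the tile size parameter are assumptions for illustration, not taken from your code. Compiling each function to IR (for example with `clang -O3 -S -emit-llvm`) gives the reference output to diff against what your pass produces from the untiled version.

```c
#include <assert.h>
#include <stddef.h>

/* Baseline: plain triple loop computing C = A * B.
 * All matrices are n x n, stored row-major in flat arrays. */
void matmul_untiled(size_t n, const double *A, const double *B, double *C) {
    for (size_t i = 0; i < n; i++) {
        for (size_t j = 0; j < n; j++) {
            double sum = 0.0;
            for (size_t k = 0; k < n; k++)
                sum += A[i * n + k] * B[k * n + j];
            C[i * n + j] = sum;
        }
    }
}

/* Hand-tiled version: the i, k and j loops are each split into an outer
 * loop over tiles and an inner loop within a tile. The `tile` parameter is
 * a hypothetical knob; n does not need to be a multiple of it. */
void matmul_tiled(size_t n, size_t tile, const double *A, const double *B,
                  double *C) {
    assert(tile > 0);

    /* C is accumulated into, so it has to start at zero. */
    for (size_t x = 0; x < n * n; x++)
        C[x] = 0.0;

    for (size_t ii = 0; ii < n; ii += tile) {
        size_t i_end = (ii + tile < n) ? ii + tile : n;
        for (size_t kk = 0; kk < n; kk += tile) {
            size_t k_end = (kk + tile < n) ? kk + tile : n;
            for (size_t jj = 0; jj < n; jj += tile) {
                size_t j_end = (jj + tile < n) ? jj + tile : n;
                for (size_t i = ii; i < i_end; i++) {
                    for (size_t k = kk; k < k_end; k++) {
                        double a = A[i * n + k];
                        for (size_t j = jj; j < j_end; j++)
                            C[i * n + j] += a * B[k * n + j];
                    }
                }
            }
        }
    }
}
```

Keeping both versions in one file makes it easy to check first that they compute the same result, and only then to compare the loop nests in the generated IR.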
A number of things can keep your pass from having the effect you expect:
- The loop control structures aren’t built correctly and your pass is not actually doing what you want
- Alias analysis can’t prove the accesses are independent
- The transform you do introduces inefficiencies elsewhere (loop structure)
- Some canonicalization introduces bloat inside the inner loop
- The original loop wasn’t deleted and the code somehow still goes through it
But without looking at the code and resulting IR it’s hard to be more specific.
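To illustrate the alias analysis point, here is a small sketch with hypothetical function names, not based on your code. In the first function the compiler has to assume that a store to `out` may change `in` or `*factor`, which can force reloads or runtime overlap checks. In the second, `restrict` promises that the pointers do not overlap, and clang lowers that promise to `noalias` parameter attributes in the IR. Checking whether the pointers in your tiled loop carry comparable information is one way to test this hypothesis.

```c
#include <stddef.h>

/* The pointers may alias: a store to out[i] could modify in[] or *factor,
 * so the compiler cannot freely hoist the load of *factor out of the loop. */
void scale_may_alias(size_t n, const double *in, double *out,
                     const double *factor) {
    for (size_t i = 0; i < n; i++)
        out[i] = in[i] * *factor;
}

/* Same computation, but `restrict` tells the compiler that the three
 * pointers refer to non-overlapping memory for the duration of the call. */
void scale_no_alias(size_t n, const double *restrict in,
                    double *restrict out, const double *restrict factor) {
    for (size_t i = 0; i < n; i++)
        out[i] = in[i] * *factor;
}
```

Both functions return the same values when the arguments really are distinct objects; only the information available to the optimizer differs.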