Is it possible that O3 optimization makes a program slower because it reduces instruction-level parallelism?

I usually measure a program in two configurations: optimized with O3, and optimized with only the single pass mem2reg, using commands like the ones below:

opt -S -passes=mem2reg input.ll -o mem2reg.ll
opt -S -passes='default<O3>' input.ll -o O3.ll

and then compiling them to executables.
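
For example, something along these lines (the exact clang flags aren't important; -O0 is just to avoid further optimization of the already-optimized IR):

clang -O0 mem2reg.ll -o mem2reg.exe
clang -O0 O3.ll -o O3.exe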

Then I used the Linux perf tool to measure the two executables, each run ten thousand times. I found that the executable optimized with O3 performs worse than the one optimized with only mem2reg.
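
The measurements were collected with perf stat, roughly like this (the explicit event list is just for illustration):

perf stat -r 10000 -e task-clock,cycles,instructions ./O3.exe
perf stat -r 10000 -e task-clock,cycles,instructions ./mem2reg.exe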

I compared statistics such as instruction count and cycle count. The O3 executable executes fewer instructions than the mem2reg executable, but it needs more cycles. Its instructions per cycle (IPC) is also lower than the mem2reg executable's, which is apparently why it takes more cycles to run.

The report:

Performance counter stats for './O3.exe' (10000 runs):

          0.718537      task-clock (msec)         #    0.814 CPUs utilized            ( +-  0.50% )
           2285477      cycles                    #    3.181 GHz                      ( +-  0.02% )
           2349142      instructions              #    1.03  insn per cycle           ( +-  0.01% )

       0.000883099 seconds time elapsed                                          ( +-  0.49% )

Performance counter stats for './mem2reg.exe' (10000 runs):

          0.597211      task-clock (msec)         #    0.801 CPUs utilized            ( +-  0.16% )
           2115689      cycles                    #    3.543 GHz                      ( +-  0.02% )
           3248302      instructions              #    1.54  insn per cycle           ( +-  0.00% )

       0.000745130 seconds time elapsed                                          ( +-  0.16% )

I kept only the cycles, instructions, and run time in the report above to illustrate the point.
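
As a rough sanity check, the cycle counts are consistent with the instruction counts and IPC values above (this is just arithmetic on the numbers in the report, not extra measurements):

cycles ≈ instructions / IPC
O3:      2349142 / 1.03 ≈ 2.28 M cycles
mem2reg: 3248302 / 1.54 ≈ 2.11 M cycles

So O3 executes roughly 28% fewer instructions, but its IPC is roughly 33% lower, which more than cancels the saving. The O3 run also reports a lower effective clock (3.181 GHz vs 3.543 GHz), which widens the gap in elapsed time further.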

Therefore, my tentative conclusion is: O3 can reduce a program's instruction count, but it can also decrease the program's instruction-level parallelism at the same time. If the benefit of the reduced instruction count does not outweigh the cost of the lost parallelism, the program ends up slower when optimized with O3.

To test this conclusion, I ran two experiments, described below.

  1. Measure both executables on a single CPU core (see the sketch after this list). In this setup, the O3 executable performs better than the mem2reg executable.
  2. Make the program more complex. Compared with the mem2reg executable, the O3 executable's instruction count drops noticeably and its parallelism drops as well, but the O3 executable still performs better.
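
For experiment 1, the single-core measurement can be done by pinning the process to one core, for example (core 0 is an arbitrary choice):

taskset -c 0 perf stat -r 10000 ./O3.exe
taskset -c 0 perf stat -r 10000 ./mem2reg.exe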

Maybe these two experiments support my conclusion, but I don't think the issue is that simple.

I'd like to hear some discussion about this. Thanks!

It's not terribly surprising to me if there are programs which are slower with -O3 than without. Optimization levels are largely heuristics tuned over large benchmark suites which try to be representative of most programs.

In any case, it's hard to say what is going on without access to the program. :) Even with a very large program, and even if the generated code grew under O3, it's possible that the execution time is tied up in a hot loop or one or two functions; the parts that grew may not be executing at all.
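
If you want to see where the time actually goes, a quick profile would tell you, e.g. something like (assuming the binaries were built with enough symbol information to get useful output):

perf record -g ./O3.exe
perf report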

I'm puzzled: it seems like O3 just gets faster here? How did you get to "O3 is worse"?