Macro-op fusion experiment

Most x86 implementations use register renaming these days, so micro-operations are non-destructive, but they don't refer to architectural registers. They refer to a larger number of real registers.

Register copies are mostly free to execute except they increase code size and consume decoder resources. To my knowledge, they are not fused in the way you describe.

Intel's optimization reference manual describes which instructions can be fused. The Sandy Bridge processors fuse more pairs than previous generations, but the second instruction is always a conditional branch.

There is no need to define pseudo-instructions to support this. If you want to experiment, you could add a late pass that tries to form fusable pairs by pushing instructions down to the conditional branch. This should happen after register allocation where code is often inserted before a branch.

I would be interested to see the performance impact of such a pass.

/jakob

            8B C3 mov eax, ebx
            03 C1 add eax, ecx

becomes
8B C3 03 C1 add eax, ebx, ecx

In my understanding, twoaddr pass tends to emit such a sequence.

Though I don't have sandybridge, I have not measured.
Prior processors(intel and amd) might spend 1 ALU to execute "mov",
then mov - add must have dependency.

In contrast, the sequence below might be executed in parallel;
mov %ebx, %eax
add %ecx, %ebx
(I understand it might not be applicable in all cases)
Thoughts?

...Takumi

                8B C3 mov eax, ebx
                03 C1 add eax, ecx
becomes
                8B C3 03 C1 add eax, ebx, ecx

In my understanding, twoaddr pass tends to emit such a sequence.

Yes, it always does, and the coalescer tries very hard to eliminate the copy.

Though I don't have sandybridge, I have not measured.
Prior processors(intel and amd) might spend 1 ALU to execute "mov",
then mov - add must have dependency.

I think you will find it is more complicated than that. A 'mov' usually doesn't need an ALU resource.

You should read about the 'reservation station' style register renaming.

http://en.wikipedia.org/wiki/Register_renaming
http://www.intel.com/Assets/PDF/manual/248966.pdf

/jakob

Hi Jacob,

As far as I know, an x86 'mov' instruction always

Hi Jacob,

As far as I know, an x86 ‘mov’ instruction always uses an ALU resource. According to Agner Fog’s documents (http://www.agner.org/optimize/), it can execute on port 0, 1 or 5 on recent architectures though. So it’s not that likely to be resource limited. But it still occupies an instruction slot throughout the entire pipeline, costing power and potentially limiting other actual arithmetic instructions from scheduling optimally. Also, it has a latency of 1 cycle, while non-destructive instructions would shorten the latency of dependent instructions.

My immediate concern is getting a reasonable estimate for how often this macro-op fusion could be performed. This could then be used to evaluate whether it’s worth the added decoder complexity.

Cheers,
Nicolas

In that case, just look at the generated code. I don't think any pass is inserting instructions between 'mov' and two-address arithmetic instructions.

/jakob

Hi Jacob,

I've considered that, but it would require either decoding the binary code, or detecting these instruction pairs during assembly output. I'm not sure which would work best and how to get started. Any and all ideas are greatly appreciated.

Thanks!
Nicolas