Hi all,
I noticed that the x86 backend tends to emit unnecessary vector insert
instructions immediately after SSE scalar fp instructions like
addss/mulss.
For example:
/////////////////////////////////
__m128 foo(__m128 A, __m128 B) {
  return _mm_add_ss(A, B);
}
/////////////////////////////////
produces the sequence:
addss %xmm0, %xmm1
movss %xmm1, %xmm0
which could be easily optimized into
addss %xmm1, %xmm0
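To see why the fold is legal, here is a small lane-level model of the two
instructions in plain C++ (no intrinsics). It assumes A arrives in %xmm0 and
B in %xmm1, as in the snippet above:
/////////////////////////////////
#include <array>
using V4 = std::array<float, 4>;

// addss %src, %dst : lane 0 of dst is updated, lanes 1-3 of dst are kept.
V4 addss(V4 dst, V4 src) { dst[0] += src[0]; return dst; }

// movss %src, %dst (register form): lane 0 is copied from src,
// lanes 1-3 of dst are kept.
V4 movss(V4 dst, V4 src) { dst[0] = src[0]; return dst; }

int main() {
  V4 A = {1, 2, 3, 4}, B = {10, 20, 30, 40};
  // Current codegen:
  V4 xmm1 = addss(B, A);     // addss %xmm0, %xmm1
  V4 cur  = movss(A, xmm1);  // movss %xmm1, %xmm0
  // Proposed codegen:
  V4 opt  = addss(A, B);     // addss %xmm1, %xmm0
  // Both produce {A0+B0, A1, A2, A3} = {11, 2, 3, 4}.
  return cur == opt ? 0 : 1;
}
/////////////////////////////////
In both cases lane 0 holds A0+B0 and the upper lanes come from A, so the
movss adds nothing once the addss writes its result directly into %xmm0.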
The first step is to understand why the compiler produces the
redundant instructions.
For the example above, the DAG sequence looks like this:
a0: f32 = extract_vector_elt (A, 0)
b0: f32 = extract_vector_elt (B, 0)
r0: f32 = fadd a0, b0
result: v4f32 = insert_vector_elt (A, r0, 0)
(with A and B of type v4f32).
The 'insert_vector_elt' is custom lowered into either X86ISD::MOVSS or
X86ISD::INSERTPS depending on the target's SSE feature level.
To start, I checked whether this bug was simply due to missing tablegen
patterns that match the full sequence described above into a single ADDSS
instruction.
However, X86InstrSSE.td already defines X86::ADDSSrr as a commutative SSE
scalar fp add instruction (operating on F32 ValueTypes). X86::ADDSSrr is
used to select 'fadd' nodes and is printed as 'addss' in assembly.
At this stage the MOVSS/INSERTPS is still required: the ADDSS alone only
computes the scalar f32 result, and a separate instruction is needed to
insert that result back into element 0 of vector A.
I then started investigating the possibility of adding a pass that runs at
the 'postRegAlloc' stage.
Before RegAlloc it may not be safe to remove the redundant MOVSSrr because
of the TwoAddressInstruction pass, which may decide to commute the operands
of the ADDSS/MULSS.
It is possible to write a pass that scans through each basic block in
a function looking for opportunities to fold instructions based on the
following patterns:
//////
B<def, tied1> = #NAME#SSrr B<kill, tied0>, A
A<def, tied1> = MOVSSrr A<kill, tied0>, B<kill>
==>
A<def, tied1> = #NAME#SSrr A<kill, tied0>, B<kill>
/////
/////
B<def, tied1> = #NAME#PSrr B<kill, tied0>, A
A<def, tied1> = MOVSSrr A<kill, tied0>, B<kill>
==>
A<def, tied1> = #NAME#SSrr A<kill, tied0>, B<kill>
/////
With 'NAME' in {ADD, SUB, MUL, DIV}.
I managed to create a very simple prototype pass which simplifies the
code according to pattern #1 (see the attached patch).
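To give a rough idea of the shape of such a pass (this is not the attached
patch, just an illustration), the folding loop for the ADDSS case of pattern
#1 might look something like the sketch below; the pass name is made up, and
the kill/liveness bookkeeping discussed next is omitted:
/////////////////////////////////
// Hypothetical, heavily simplified sketch of a post-RA peephole that folds
// pattern #1 for ADDSS only.  Kill/undef flag updates, subregisters and the
// other opcodes (SUB/MUL/DIV, the PS variants) are intentionally left out.
#include "X86.h"
#include "X86InstrInfo.h"
#include "llvm/CodeGen/MachineFunctionPass.h"
#include "llvm/CodeGen/MachineInstrBuilder.h"

using namespace llvm;

namespace {
struct FoldRedundantMovss : public MachineFunctionPass {
  static char ID;
  FoldRedundantMovss() : MachineFunctionPass(ID) {}

  bool runOnMachineFunction(MachineFunction &MF) override {
    // On newer LLVMs this would be MF.getSubtarget().getInstrInfo().
    const TargetInstrInfo *TII = MF.getTarget().getInstrInfo();
    bool Changed = false;

    for (MachineBasicBlock &MBB : MF) {
      MachineInstr *Arith = nullptr; // candidate "B<def,tied1> = ADDSSrr B, A"
      for (MachineBasicBlock::iterator I = MBB.begin(), E = MBB.end();
           I != E;) {
        MachineInstr &MI = *I;
        MachineBasicBlock::iterator InsertPt = I;
        ++I;

        if (MI.getOpcode() == X86::ADDSSrr) {
          Arith = &MI;
          continue;
        }

        // Look for "A<def,tied1> = MOVSSrr A<kill,tied0>, B<kill>" placed
        // immediately after the ADDSSrr, with the registers arranged as in
        // pattern #1.
        if (Arith && MI.getOpcode() == X86::MOVSSrr) {
          unsigned RegA = MI.getOperand(0).getReg();
          unsigned RegB = Arith->getOperand(0).getReg();
          if (MI.getOperand(2).getReg() == RegB &&
              Arith->getOperand(2).getReg() == RegA) {
            // Rewrite into "A<def,tied1> = ADDSSrr A<kill,tied0>, B<kill>".
            // This relies on ADDSS being commutative.
            BuildMI(MBB, InsertPt, MI.getDebugLoc(), TII->get(X86::ADDSSrr),
                    RegA)
                .addReg(RegA)
                .addReg(RegB, RegState::Kill);
            MI.eraseFromParent();
            Arith->eraseFromParent();
            Changed = true;
          }
        }
        Arith = nullptr;
      }
    }
    return Changed;
  }
};
char FoldRedundantMovss::ID = 0;
} // end anonymous namespace
/////////////////////////////////
A real pass would of course also have to verify that B is really killed by
the MOVSSrr, fix up the kill flags, and be hooked into the X86 pipeline from
the target's addPostRegAlloc() hook.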
However, identifying patterns at the postRegAlloc stage is definitely more
complicated than matching patterns on DAGs: instructions are no longer in
SSA form; the pass must keep track of defs and kills and leave the liveness
info in a consistent state; some instructions may have side effects; and so
on.
Last but not least, the register assignment chosen by the register
allocator and the way instructions are scheduled may strongly affect the
ability of a pass like this to match specific sequences of instructions.
In conclusion, my questions are:
Would a patch that introduces a new MachineFunctionPass running at the
'postRegAlloc' stage be acceptable?
If not, is there a better way to fix this bug?
Thanks,
Andrea Di Biagio
SN Systems - Sony Computer Entertainment Group
patch.diff (7.36 KB)