X86 - Help with fixing a poor code generation bug

Hi all,

I noticed that the x86 backend tends to emit unnecessary vector insert
instructions immediately after SSE scalar FP instructions like
addss/mulss.

For example:
/////////////////////////////////
#include <xmmintrin.h>

__m128 foo(__m128 A, __m128 B) {
  return _mm_add_ss(A, B);
}
/////////////////////////////////

produces the sequence:
  addss %xmm0, %xmm1
  movss %xmm1, %xmm0

which could be easily optimized into
  addss %xmm1, %xmm0

The first step is to understand why the compiler produces the
redundant instructions.

For the example above, the DAG sequence looks like this:

  a0 : f32 = extract_vector_elt (A, 0)
  b0 : f32 = extract_vector_elt (B, 0)
  r0 : f32 = fadd a0, b0
  result : v4f32 = insert_vector_elt (A, r0, 0)

(with A and B of type v4f32).

The 'insert_vector_elt' is custom lowered into either X86ISD::MOVSS or
X86ISD::INSERTPS depending on the target's SSE feature level.
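
For example, on a target without SSE4.1, the lowered DAG for the
'insert_vector_elt' looks roughly like this (a sketch in the same
notation as above; the exact nodes may differ):

  t0 : v4f32 = scalar_to_vector r0
  result : v4f32 = X86ISD::MOVSS (A, t0)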

To start, I checked whether this bug was caused simply by the lack of
specific TableGen patterns to match the complex sequence described
above into a single ADDSS instruction.

However, X86InstrSSE.td already defines the instruction X86::ADDSSrr as
a commutative SSE scalar FP add (operating on f32 value types).
Instruction X86::ADDSSrr is used to select scalar 'fadd' nodes and is
printed as 'addss' in assembly.

At this stage, the MOVSS/INSERTPS is still required: the ADDSS selected
from the scalar 'fadd' produces only an f32 value, so a separate
instruction must still merge that result into the low lane of A while
preserving the upper lanes, which is exactly what the hardware 'movss'
does.
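
To make the lane semantics explicit, here is a small illustrative C++
model of the two register-register instructions involved (my own
sketch, not code from the backend):

  struct v4f32 { float e[4]; };

  // addss dst, src: only the low lane of dst is updated; the three
  // upper lanes of dst pass through unchanged.
  static v4f32 addss(v4f32 dst, v4f32 src) {
    dst.e[0] += src.e[0];
    return dst;
  }

  // movss dst, src (register form): copy the low lane of src into
  // dst; the upper lanes of dst are preserved.
  static v4f32 movss(v4f32 dst, v4f32 src) {
    dst.e[0] = src.e[0];
    return dst;
  }

So with A in %xmm0 and B in %xmm1, a single 'addss %xmm1, %xmm0'
already computes exactly the value of _mm_add_ss(A, B); the extra
movss only appears because the compiler models the add as a pure f32
operation.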

I then started investigating the possibility of adding a pass that
runs at the 'postRegAlloc' stage.

Before RegAlloc it may not be safe to remove the redundant MOVSSrr
because of the TwoAddressInstruction pass, which may decide to commute
the operands of the ADDSS/MULSS.
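
For example (in MachineInstr notation), the three-address add

  %v2<def> = ADDSSrr %v0, %v1

is rewritten by that pass into

  %v2<def> = COPY %v0
  %v2<def, tied1> = ADDSSrr %v2<kill, tied0>, %v1

or, if commuting looks profitable, into

  %v2<def> = COPY %v1
  %v2<def, tied1> = ADDSSrr %v2<kill, tied0>, %v0

so before register allocation we cannot know which operand will end up
tied to the destination.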

It is possible to write a pass that scans through each basic block in
a function looking for opportunities to fold instructions based on the
following patterns (a rough C++ sketch of the matching loop follows
the patterns):

  /////
  B<def, tied1> = #NAME#SSrr B<kill, tied0>, A
  A<def, tied1> = MOVSSrr A<kill, tied0>, B<kill>
    ==>
  A<def, tied1> = #NAME#SSrr A<kill, tied0>, B<kill>
  /////

  /////
  B<def, tied1> = #NAME#PSrr B<kill, tied0>, A
  A<def, tied1> = MOVSSrr A<kill, tied0>, B<kill>
    ==>
  A<def, tied1> = #NAME#SSrr A<kill, tied0>, B<kill>
  /////

With 'NAME' in {ADD, SUB, MUL, DIV}.
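
As an illustration of the idea, the core of such a pass could look
like the C++ sketch below. All names are hypothetical, only pattern #1
with ADDSS is shown, and a real implementation (like the attached
patch) must also fix up kill flags and liveness and verify that B is
dead after the pair:

  #include "X86.h"
  #include "X86InstrInfo.h"
  #include "llvm/CodeGen/MachineFunctionPass.h"
  #include "llvm/CodeGen/MachineInstrBuilder.h"

  using namespace llvm;

  namespace {
  struct FoldRedundantMOVSS : public MachineFunctionPass {
    static char ID;
    FoldRedundantMOVSS() : MachineFunctionPass(ID) {}

    bool runOnMachineFunction(MachineFunction &MF) override {
      const TargetInstrInfo *TII = MF.getSubtarget().getInstrInfo();
      bool Changed = false;
      for (MachineBasicBlock &MBB : MF) {
        MachineBasicBlock::iterator I = MBB.begin();
        while (I != MBB.end()) {
          MachineBasicBlock::iterator Mov = I++;
          // Look for: A = MOVSSrr A, B<kill> ...
          if (Mov->getOpcode() != X86::MOVSSrr || Mov == MBB.begin())
            continue;
          MachineBasicBlock::iterator Add = std::prev(Mov);
          // ... immediately preceded by: B = ADDSSrr B, A (pattern #1).
          if (Add->getOpcode() != X86::ADDSSrr ||
              Add->getOperand(0).getReg() != Mov->getOperand(2).getReg() ||
              Add->getOperand(2).getReg() != Mov->getOperand(1).getReg() ||
              !Mov->getOperand(2).isKill())
            continue;
          unsigned RegA = Mov->getOperand(1).getReg();
          unsigned RegB = Mov->getOperand(2).getReg();
          // Rewrite the pair as: A = ADDSSrr A, B<kill>.
          BuildMI(MBB, Mov, Mov->getDebugLoc(), TII->get(X86::ADDSSrr), RegA)
              .addReg(RegA)
              .addReg(RegB, RegState::Kill);
          Add->eraseFromParent();
          Mov->eraseFromParent();
          Changed = true;
        }
      }
      return Changed;
    }
  };
  char FoldRedundantMOVSS::ID = 0;
  } // end anonymous namespace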

I managed to create a very simple prototype pass which simplifies the
code according to pattern #1 (see the patch in attachment).
However, identifying patterns at the postRegAlloc stage is definitely
more complicated than matching patterns on DAGs: instructions are no
longer in SSA form; the pass must keep track of defs and kills and
leave the liveness info in a consistent state; and some instructions
may have side effects.
Last but not least, the register mapping chosen by the register
allocator and the way instructions are scheduled may strongly affect
the ability of a pass like this to match specific sequences of
instructions.

In conclusion, my questions are:
Would a patch that introduces a new MachineFunctionPass running at the
'postRegAlloc' stage be acceptable?
If not, is there a better way to fix this bug?

Thanks,
Andrea Di Biagio
SN Systems - Sony Computer Entertainment Group

patch.diff (7.36 KB)

Hey Andrea,

With my compiler, which has a proprietary frontend and its own builtin
lowerings, we do produce the expected assembly:

# BB#0: # %file test.c, line 3, bb1
       addss %xmm1, %xmm0 # test.c:5
       ret # test.c:5

The IR produced is:

%r = load <4 x float>* %A, align 16, !dbg !11
%r3 = load <4 x float>* %B, align 16, !dbg !11
%r4 = call <4 x float> @llvm.x86.sse.add.ss(<4 x float> %r, <4 x float> %r3), !dbg !11
ret <4 x float> %r4, !dbg !11

Since the int_x86_sse_add_ss intrinsic operates on vectors, not
scalars, there's no need for the inserts and extracts.
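
For reference, the intrinsic is declared with vector operands
throughout:

  declare <4 x float> @llvm.x86.sse.add.ss(<4 x float>, <4 x float>)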

Of course, with my compiler we miss out on any optimization
opportunities that may arise from using the IRBuilder, so this is not
ideal either. But we do avoid partial register updates... so we've
got that going for us. ;-)

I do realize that this does not solve your problem. Just thought that
it might help design an ideal solution.

HTH,
Cameron

Hi Andrea,

Thanks for working on this. I can see two approaches to solving this problem. The first one (that you suggested) is to catch this pattern after register allocation. The second approach is to eliminate this redundancy during instruction selection. Can you please look into catching this pattern during ISel? The idea is that ADDSS performs an ADD plus a BLEND operation, and you can easily catch them. You can add a new target-specific DAGCombine or a TableGen pattern. You should also handle mul/sub/div.

define <4 x float> @foo(<4 x float> %A, <4 x float> %B) nounwind readnone ssp uwtable {
  %1 = extractelement <4 x float> %B, i32 0
  %2 = extractelement <4 x float> %A, i32 0
  %3 = fadd float %2, %1 ; Both the fadd and the insertelement below should be matched into
  %4 = insertelement <4 x float> %A, float %3, i32 0 ; an ADDSS, which does an ADD and a BLEND in one instruction.
  ret <4 x float> %4
}
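
For what it's worth, here is a hypothetical C++ sketch of the kind of
target DAGCombine Nadav describes (names and placement invented; a
real fix would also need an ISel pattern for the resulting fadd +
X86ISD::MOVSS pair, and doing the fadd on all four lanes can raise FP
exceptions in the upper lanes, which a production combine would have
to consider):

  // Would live in X86ISelLowering.cpp, called for INSERT_VECTOR_ELT nodes.
  // Matches: insert_vector_elt(A, fadd(extractelt(A, 0), extractelt(B, 0)), 0)
  static SDValue combineScalarFPInsert(SDNode *N, SelectionDAG &DAG) {
    SDValue Vec = N->getOperand(0);  // the vector A
    SDValue Scl = N->getOperand(1);  // the scalar fadd
    SDValue Idx = N->getOperand(2);
    if (N->getValueType(0) != MVT::v4f32 || !isNullConstant(Idx) ||
        Scl.getOpcode() != ISD::FADD)
      return SDValue();
    SDValue L = Scl.getOperand(0), R = Scl.getOperand(1);
    // Require lane 0 of A on the LHS (a real combine would also try the
    // commuted form) and lane 0 of some other vector B on the RHS.
    if (L.getOpcode() != ISD::EXTRACT_VECTOR_ELT || L.getOperand(0) != Vec ||
        !isNullConstant(L.getOperand(1)))
      return SDValue();
    if (R.getOpcode() != ISD::EXTRACT_VECTOR_ELT ||
        !isNullConstant(R.getOperand(1)))
      return SDValue();
    // Rebuild as "ADD + BLEND": a full vector fadd, then an X86ISD::MOVSS
    // that takes lane 0 from the sum and the upper lanes from A.
    SDLoc DL(N);
    SDValue Sum = DAG.getNode(ISD::FADD, DL, MVT::v4f32, Vec, R.getOperand(0));
    return DAG.getNode(X86ISD::MOVSS, DL, MVT::v4f32, Vec, Sum);
  }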

Thanks,
Nadav

Hi Nadav and Cameron,