instcombine-code-sinking increases values' live ranges

Hi,

By default, the InstCombine pass will try to sink an instruction into a
successor basic block when possible (so that the instruction isn't
executed on a path where its result isn't needed). But doing so can
also increase a value's live range. For example:

entry:
  ..
  %6 = load float, ..
  %s.0 = load float, ..
  %mul22 = fmul float %6, %s.0
  %add23 = fadd float %mul22, 0.000000e+00

  %7 = load float, ..
  %s.1 = load float, ..
  %mul26 = fmul float %7, %s.1
  %add27 = fadd float %add23, %mul26

  ..
  br i1 %cmp, label %cleanup, label %if.end1

if.end1:
  %15 = load float, ..
  %add67 = fadd float %add27, %15
  store float %add67, ..
  br label %cleanup

cleanup:
  return

In the original input, only %add27 has a long live range, but after
InstCombine with instcombine-code-sinking=true (the default), %6, %s.0,
%7, and %s.1 end up with longer live ranges:

entry:
  ..
  %6 = load float, ..
  %s.0 = load float, ..

  %7 = load float, ..
  %s.1 = load float, ..

  ..
  br i1 %cmp, label %cleanup, label %if.end1

if.end1:
  %mul22 = fmul float %6, %s.0
  %add23 = fadd float %mul22, 0.000000e+00

  %mul26 = fmul float %7, %s.1
  %add27 = fadd float %add23, %mul26

  %15 = load float, ..
  %add67 = fadd float %add27, %15
  store float %add67, ..
  br label %cleanup

cleanup:
  return

We see an issue where our customized register allocator ends up keeping
values like %6, %s.0, %7, and %s.1 in registers for a long period.
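
For anyone who wants to reproduce or rule this out: assuming a
reasonably recent opt with the new pass manager, the sinking can be
switched off when running the pass standalone, e.g.
opt -passes=instcombine -instcombine-code-sinking=false input.ll -S.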

My questions are:

Does LLVM expect the backend's instruction scheduler and register
allocator to handle this properly?

Can this be solved by LLVM's GlobalISel?

Thank you!
CY

Answering my own question :stuck_out_tongue:

The original input pattern looks like this:

local memory
for (...) {
  a function (with side effects) that copies data from global to local memory
  access the data in local memory and do the computation
}

if (...)
  return;
store the computed result back.
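
Roughly, as a C-like sketch (this is just an illustration; copy_to_local
and all of the names and sizes here are made up):

void copy_to_local(float *dst, const float *src);   /* opaque copy helper with side effects */

void kernel(const float *a, const float *b, float *out, int early_exit) {
  float la[2], lb[2];                 /* "local memory" */
  float acc = 0.0f;
  for (int i = 0; i < 2; ++i) {       /* fully unrolled in the real input */
    copy_to_local(&la[i], &a[i]);     /* copy from global to local memory */
    copy_to_local(&lb[i], &b[i]);
    acc += la[i] * lb[i];             /* access local memory and compute */
  }
  if (early_exit)
    return;
  *out = acc;                         /* store the computed result back */
}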

If the for loop is fully unrolled and the computing part is sunk into
the basic block that stores the computed result back, then the backend
needs to find somewhere (registers or memory) to keep the copied data
alive across the branch.

I've tested with aarch64 and amdgcn; with this test pattern, both
targets spill the data to memory.

If, inside the for loop, we copy directly instead of going through a
copy function, both targets can generate better basic block layouts
(aarch64: the "Machine code sinking (machine-sink)" pass; amdgcn: the
"Code sinking (sink)" pass), as sketched below.
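
In terms of the sketch above (again hypothetical), the change is just
replacing the opaque call with a copy the compiler can see through:

  /* before: copy_to_local(&la[i], &a[i]); */
  la[i] = a[i];   /* direct copy; per the experiments above, machine-sink (aarch64) / sink (amdgcn) then give the better layout */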

Can this be solved by LLVM's GlobalISel?

GlobalISel’s function-scope optimization doesn’t really help in these cases unless the target can somehow fold expressions into simpler instructions. If that’s not possible, the generated code should be fairly similar to that of SelectionDAG.

Thanks Amara :slight_smile:
Yep, that matches my experimental results on aarch64 and amdgcn!