[DragonEgg] [Polly] Should we expect DragonEgg to produce identical LLVM IR for identical GIMPLE?

Dear all,

In our compiler we use a modified version of LLVM Polly, which is very sensitive to proper code generation. Among its limitations, the loop region (enclosed by a phi node on the induction variable and a branch) is required to be free of additional memory-dependent branches. In other words, there must be no conditional “br” instructions below the phi nodes. The problem we are facing is that, from identical GIMPLE for a 3D loop used in different contexts, DragonEgg may generate LLVM IR that either conforms to the described limitation or violates it.

Let’s consider two examples. In the first one, a simple 3D loop is used directly in the main program. In the second, the same 3D loop is extracted into a separate subroutine called from main. The attached source code and listings for GIMPLE and LLVM IR show that, although the GIMPLE codes are similar, for some reason in the first case branching goes below the phi nodes, making Polly fail to parallelize the region. In the second case everything is fine.

I have not looked into DragonEgg internals enough yet, but before I do, the question is: should we expect DragonEgg to produce identical LLVM IR for identical GIMPLE? The case shown here is really important. The success of parallelization utilities, which are currently in quite good shape (thanks to their developers!), nowadays heavily relies on code quality: if the IR is too rough, we can’t do much about it :(

Many thanks and Happy New Year!

  • Dima.
  1. Bad LLVM IR:

Forgot to attach original sources (Fortran).

2012/12/31 Dmitry Mikushin <dmitry@kernelgen.org>

demo_f.f90 (1.49 KB)

demo_f_m.f90 (1.67 KB)

Hi Dmitry,

> In our compiler we use a modified version of LLVM Polly, which is very
> sensitive to proper code generation. Among its limitations, the loop region
> (enclosed by a phi node on the induction variable and a branch) is required
> to be free of additional memory-dependent branches. In other words, there
> must be no conditional "br" instructions below the phi nodes. The problem we
> are facing is that from *identical* GIMPLE for a 3D loop used in different
> contexts DragonEgg may generate LLVM IR either conforming to the described
> limitation, or violating it.

the gimple isn't the same at all (see below). The differences are directly
reflected in the unoptimized LLVM IR, turning up as additional memory loads
in the "bad" version. In addition, the Fortran isn't really the same either:
Fortran semantics allows the compiler to assume that the parameters of your
new function "compute" (which are all passed by reference, i.e. as pointers)
do not alias each other or anything else in sight (i.e. they get the "restrict"
qualifier in the gimple, and "noalias" in the LLVM IR). Thus by factoring the
loop out into "compute" you are actually giving the compiler more information.

Summary:
   (1) as far as I can see the unoptimized LLVM IR is a direct reflection of
the gimple: the differences for the loop part come directly from differences
in the gimple;
   (2) the optimizers do a better job when you have "compute", partly because you
provided them with additional aliasing information; this better-optimized
version then gets inlined into MAIN__;
   (3) this leaves the question of whether in the bad version it is logically
possible for the optimizers to deduce the same aliasing information as is
handed to them for free in the good version. To work this out it would be
nice to have a smaller testcase.

Ciao, Duncan.

PS: Gimple differences for the loop part. I replaced all variable indices and
such with X otherwise they get in the way of the diffing.

  <bb X>:
    # k_X = PHI <k_X(X), k_X(X)>
- D.X_X = ny;
+ D.X_X = *ny_X(D);
    j_X = X;
    if (j_X <= D.X_X)
      goto <bb X>;
@@ -16,7 +74,7 @@

  <bb X>:
    # j_X = PHI <j_X(X), j_X(X)>
- D.X_X = nx;
+ D.X_X = *nx_X(D);
    i_X = X;
    if (i_X <= D.X_X)
      goto <bb X>;
@@ -25,48 +83,36 @@

  <bb X>:
    # i_X = PHI <i_X(X), i_X(X)>
- D.X_X = xy.data;
    D.X_X = (integer(kind=X)) i_X;
    D.X_X = (integer(kind=X)) k_X;
- D.X_X = xy.dim.stride;
- D.X_X = D.X_X * D.X_X;
+ D.X_X = D.X_X * stride.X_X;
    D.X_X = (integer(kind=X)) j_X;
- D.X_X = xy.dim.stride;
- D.X_X = D.X_X * D.X_X;
+ D.X_X = D.X_X * stride.X_X;
    D.X_X = D.X_X + D.X_X;
- D.X_X = xy.offset;
- D.X_X = D.X_X + D.X_X;
- D.X_X = x.data;
+ D.X_X = D.X_X + offset.X_X;
    D.X_X = (integer(kind=X)) i_X;
    D.X_X = (integer(kind=X)) k_X;
- D.X_X = x.dim.stride;
- D.X_X = D.X_X * D.X_X;
+ D.X_X = D.X_X * stride.X_X;
    D.X_X = (integer(kind=X)) j_X;
- D.X_X = x.dim.stride;
- D.X_X = D.X_X * D.X_X;
+ D.X_X = D.X_X * stride.X_X;
    D.X_X = D.X_X + D.X_X;
- D.X_X = x.offset;
- D.X_X = D.X_X + D.X_X;
- D.X_X = MEM[(real(kind=X)[X:] *)D.X_X][D.X_X];
+ D.X_X = D.X_X + offset.X_X;
+ D.X_X = *x_X(D)[D.X_X];
    D.X_X = sinf (D.X_X);
    D.X_X = asinf (D.X_X);
- D.X_X = y.data;
    D.X_X = (integer(kind=X)) i_X;
- D.X_X = y.dim.stride;
- D.X_X = D.X_X + D.X_X;
+ D.X_X = stride.X_X + stride.X_X;
    D.X_X = (integer(kind=X)) k_X;
    D.X_X = D.X_X * D.X_X;
    D.X_X = D.X_X + D.X_X;
- D.X_X = y.offset;
- D.X_X = D.X_X + D.X_X;
- D.X_X = MEM[(real(kind=X)[X:] *)D.X_X][D.X_X];
+ D.X_X = D.X_X + offset.X_X;
+ D.X_X = *y_X(D)[D.X_X];
    D.X_X = cosf (D.X_X);
    D.X_X = acosf (D.X_X);
    D.X_X = D.X_X + D.X_X;
- MEM[(real(kind=X)[X:] *)D.X_X][D.X_X] = D.X_X;
+ *xy_X(D)[D.X_X] = D.X_X;
    D.X_X = i_X == D.X_X;
    i_X = i_X + X;
    if (D.X_X != X)

I would also be interested in a minimal test case. If, e.g., only the aliasing information is missing, we could introduce run-time alias checks such that Polly would be able to optimize both versions. It is probably not that simple, but a reduced test case would make it easier to figure out the exact problems.

Thanks
Tobi

Hi Duncan & Tobi,

Thanks a lot for your interest, and for pointing out differences in GIMPLE I missed.

Attached is a simplified test case. Is it good enough?

Tobi, regarding runtime alias analysis: in KernelGen we already do it, along with runtime value substitution. For example:

<------------------ kernelgen_main_loop_17: compile started --------------------->
Integer args substituted:
offset = 32, ptrValue = 47248855040
offset = 40, ptrValue = 47246749696
offset = 48, ptrValue = 47247802368
offset = 16, value = 64
offset = 20, value = 64
offset = 24, value = 64
MemoryAccess to pointer: float* inttoptr (i64 47246749696 to float*)
{ Stmt__12_cloned
[i0, i1, i2] → MemRef_nttoptr (i64 47246749696 to float*)[4096i0 + 64i1 + i2] }
allocSize: 4 storeSize: 4
replacedBy: { Stmt__12_cloned
[i0, i1, i2] → NULL[o0] : o0 >= 47246749696 + 16384i0 + 256i1 + 4i2 and o0 <= 47246749699 + 16384i0 + 256i1 + 4i2 }
MemoryAccess to pointer: float* inttoptr (i64 47247802368 to float*)
{ Stmt__12_cloned_[i0, i1, i2] → MemRef_nttoptr (i64 47247802368 to float*)[4096i0 + 64i1 + i2] }
allocSize: 4 storeSize: 4
replacedBy: { Stmt__12_cloned_[i0, i1, i2] → NULL[o0] : o0 >= 47247802368 + 16384i0 + 256i1 + 4i2 and o0 <= 47247802371 + 16384i0 + 256i1 + 4i2 }
MemoryAccess to pointer: float* inttoptr (i64 47248855040 to float*)
{ Stmt__12_cloned_[i0, i1, i2] → MemRef_nttoptr (i64 47248855040 to float*)[4096i0 + 64i1 + i2] }
allocSize: 4 storeSize: 4
replacedBy: { Stmt__12_cloned_[i0, i1, i2] → NULL[o0] : o0 >= 47248855040 + 16384i0 + 256i1 + 4i2 and o0 <= 47248855043 + 16384i0 + 256i1 + 4i2 }

Number of good nested parallel loops: 3
Average size of loops: 64 64 64

<------------------------------ Scop: end ----------------------------------->

<------------------------------ Scop: start --------------------------------->
<------------------- Cloog AST of Scop ------------------->
for (c2=0;c2<=63;c2++) {
for (c4=0;c4<=63;c4++) {
for (c6=0;c6<=63;c6++) {
Stmt__12_cloned_(c2,c4,c6);
}
}
}
<--------------------------------------------------------->
Context:
{ : }
Statements {
Stmt__12_cloned_
Domain :=
{ Stmt__12_cloned_[i0, i1, i2] : i0 >= 0 and i0 <= 63 and i1 >= 0 and i1 <= 63 and i2 >= 0 and i2 <= 63 };
Scattering :=
{ Stmt__12_cloned_[i0, i1, i2] → scattering[0, i0, 0, i1, 0, i2, 0] };
ReadAccess :=
{ Stmt__12_cloned_[i0, i1, i2] → NULL[o0] : o0 >= 47246749696 + 16384i0 + 256i1 + 4i2 and o0 <= 47246749699 + 16384i0 + 256i1 + 4i2 };
ReadAccess :=
{ Stmt__12_cloned_[i0, i1, i2] → NULL[o0] : o0 >= 47247802368 + 16384i0 + 256i1 + 4i2 and o0 <= 47247802371 + 16384i0 + 256i1 + 4i2 };
WriteAccess :=
{ Stmt__12_cloned_[i0, i1, i2] → NULL[o0] : o0 >= 47248855040 + 16384i0 + 256i1 + 4i2 and o0 <= 47248855043 + 16384i0 + 256i1 + 4i2 };
}
<------------------------------ Scop: end ----------------------------------->
<------------------------------ Scop: dependences --------------------------->
Write after read dependences:
{ }
Read after write dependences:
{ }
Write after write dependences:
{ }
loop is parallel
loop is parallel
loop is parallel
<------------------------------ Scop: dependences end ----------------------->
1 polly-detect - Number of regions that a valid part of Scop
<------------------ __kernelgen_main_loop_17: compile completed ------------------->

It works pretty well in many situations, but in this particular case it does not help. Those problematic “Fortran scalar values referred by pointers” (FSVRPs) can only be substituted (replaced by their actual values) after proper memory analysis. According to the current design, memory analysis operates on SCoPs, but Polly is already unable to detect a SCoP for the whole group of nested loops due to the presence of those FSVRPs. So it’s a chicken-and-egg problem.

  • D.

2013/1/2 Tobias Grosser <tobias@grosser.es>

aliasing_f.f90 (3.25 KB)

Hi,

So, from the LLVM IR point of view, Polly is confused by loads of the loop dimensions right in the corresponding loop headers:

"162":                               ; preds = %"168", %"162.preheader"
  %391 = phi i32 [ %437, %"168" ], [ 1, %"162.preheader" ]
  %392 = load i32* %ny, align 4

"163":                               ; preds = %"166", %"163.preheader"
  %400 = phi i32 [ %435, %"166" ], [ 1, %"163.preheader" ]
  %401 = load i32* %nx, align 4

In aliasing_f.f90 (the 2D loop case) the problematic nx load could be taken out of the loop PHI-node block using the following sequence of LLVM optimizations:

{
  // Legacy pass pipeline (LLVM ~3.2 era); m is the Module*.
  PassManager manager;
  manager.add(new TargetData(m));               // target data layout
  manager.add(createBasicAliasAnalysisPass());  // alias info for LICM
  manager.add(createInstructionCombiningPass());
  manager.add(createCFGSimplificationPass());
  manager.add(createScalarReplAggregatesPass());
  manager.add(createSimplifyLibCallsPass());
  manager.add(createInstructionCombiningPass());
  manager.add(createCFGSimplificationPass());
  manager.add(createLoopRotatePass());          // rotate so LICM can hoist
  manager.add(createLICMPass());                // hoists the invariant load
  manager.run(*m);
}

Unfortunately, this helps only for 2D loops. Even LLVM’s -O3 is unable to pull the loads out of a 3D loop (the case of the original demo_f.f90 test). The only solution seems to be:

dragonegg -O1 -fplugin-arg-dragonegg-enable-gcc-optzns demo_f.f90

I.e. only adding some of GCC’s own optimizations helps.

  • D.

2013/1/2 Dmitry Mikushin <dmitry@kernelgen.org>