[Polly] Summary of some expensive compiler passes, especially PollyDependence

Hi all,

I have summarized the top 10 compiler passes for Polly when compiling the LLVM test-suite. The results can be viewed at:
https://gist.github.com/tanstar/581bcea1e4e03498f935/raw/f6a4ec4e8565f7a7bbdb924cd59fcf145caac039/Polly-top10
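For reference: per-pass timings like the ones in that gist can be collected with clang's standard -ftime-report switch. Assuming the Polly setup discussed later in this thread, the invocation would look something like

    clang -O3 -load LLVMPolly.so -mllvm -polly -ftime-report -c Benchmark.c

where "-mllvm -polly" is the usual way to enable Polly once the plugin is loaded; the exact loading flags depend on your build.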

Based on the comparison between "clang -O3" and "polly -O3" listed at:
http://188.40.87.11:8000/db_default/v4/nts/18?compare_to=14&baseline=14

we can see that Polly's compile-time overhead mainly results from some expensive Polly passes such as PollyDependence, PollyOptimization and PollyCodegen. In particular, I notice that PollyDependence can lead to significant extra compile-time overhead. Its compile-time percentage for some expensive benchmarks can be summarized as:

nestedloop: 41.4% (Polly - Calculate dependence)
salsa20: 98.5% (Polly - Calculate dependence)
seidel-2d: 72.1% (Polly - Calculate dependence)
multiplies: 54.3% (Polly - Calculate dependence)
Puzzle: 22.8% (Polly - Calculate dependence)

As a result, it is critical to improve the PollyDependence pass. I have previously committed a patch to split the start value from SCEVAddRecExpr (r187728), which improves the PollyDependence pass for some benchmarks, as shown at:
http://188.40.87.11:8000/db_default/v4/nts/25?baseline=18&compare_to=18
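For readers who do not remember r187728: the rough idea, sketched here from memory with placeholder names (AccessFunction, SE) rather than quoted from the committed patch, is to peel the start value off an AddRec access function and only translate the zero-based recurrence into the polyhedral description:

    // Sketch only: rewrite {Start,+,Step}<L> as Start + {0,+,Step}<L>.
    const SCEVAddRecExpr *AR = cast<SCEVAddRecExpr>(AccessFunction);
    const SCEV *Start = AR->getStart();
    const SCEV *ZeroBased = SE.getAddRecExpr(
        SE.getConstant(AR->getType(), 0), AR->getStepRecurrence(SE),
        AR->getLoop(), AR->getNoWrapFlags());
    // Only ZeroBased is turned into the isl affine access function;
    // the (possibly parametric) Start is folded into the base address,
    // which keeps the resulting isl sets simpler.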
Obviously, it is not enough.

I have investigated the salsa20 benchmark, in which the PollyDependence pass consumes 98.5% of the total compile time. The key code of the salsa20 benchmark consists of a series of array operations:

   #define R(a,b) (((a) << (b)) | ((a) >> (32 - (b))))

   for (i = 20;i > 0;i -= 2) {
     x[ 4] ^= R(x[ 0]+x[12], 7); x[ 8] ^= R(x[ 4]+x[ 0], 9);
     x[12] ^= R(x[ 8]+x[ 4],13); x[ 0] ^= R(x[12]+x[ 8],18);
     ....
     x[14] ^= R(x[13]+x[12],13); x[15] ^= R(x[14]+x[13],18);
   }

which would be translated into a series of affine access functions:

   ReadAccess := { Stmt_for_body5[i0] -> MemRef_x[12] };
   ReadAccess := { Stmt_for_body5[i0] -> MemRef_x[4] };
   MustWriteAccess := { Stmt_for_body5[i0] -> MemRef_x[4] };
   ReadAccess := { Stmt_for_body5[i0] -> MemRef_x[0] };
   ...
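(These access functions are computed by Polly's ScopInfo pass; assuming a standard Polly build, a dump like the one above can be reproduced with something along the lines of "opt -load LLVMPolly.so -basicaa -polly-scops -analyze salsa20.ll"; the exact flags depend on your setup.)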


Consequently, the PollyDependence pass would produce very complex Read/Write/MayWrite sets, as follows:

Read: {
  Stmt_for_body[i0] -> MemRef_in[i0] : i0 >= 0 and i0 <= 15;
  Stmt_for_body5[i0] -> MemRef_x[15] : (i0 >= 0 and i0 <= 9) or (i0 >= 0 and i0 <= 9) or (i0 >= 0 and i0 <= 9) or (i0 >= 0 and i0 <= 9) or (i0 >= 0 and i0 <= 9) or (i0 >= 0 and i0 <= 9);
  Stmt_for_body5[i0] -> MemRef_x[9] : (i0 >= 0 and i0 <= 9) or (i0 >= 0 and i0 <= 9) or (i0 >= 0 and i0 <= 9) or (i0 >= 0 and i0 <= 9);
  Stmt_for_body5[i0] -> MemRef_x[4] : (i0 >= 0 and i0 <= 9) or (i0 >= 0 and i0 <= 9) or (i0 >= 0 and i0 <= 9) or (i0 >= 0 and i0 <= 9);
  Stmt_for_body5[i0] -> MemRef_x[11] : (i0 >= 0 and i0 <= 9) or (i0 >= 0 and i0 <= 9) or (i0 >= 0 and i0 <= 9) or (i0 >= 0 and i0 <= 9);
  Stmt_for_body5[i0] -> MemRef_x[2] : (i0 >= 0 and i0 <= 9) or (i0 >= 0 and i0 <= 9) or (i0 >= 0 and i0 <= 9) or (i0 >= 0 and i0 <= 9);
  Stmt_for_body5[i0] -> MemRef_x[6] : (i0 >= 0 and i0 <= 9) or (i0 >= 0 and i0 <= 9) or (i0 >= 0 and i0 <= 9) or (i0 >= 0 and i0 <= 9);
  Stmt_for_body5[i0] -> MemRef_x[14] : (i0 >= 0 and i0 <= 9) or (i0 >= 0 and i0 <= 9) or (i0 >= 0 and i0 <= 9) or (i0 >= 0 and i0 <= 9);
  Stmt_for_body5[i0] -> MemRef_x[8] : (i0 >= 0 and i0 <= 9) or (i0 >= 0 and i0 <= 9) or (i0 >= 0 and i0 <= 9) or (i0 >= 0 and i0 <= 9);
  Stmt_for_body5[i0] -> MemRef_x[10] : (i0 >= 0 and i0 <= 9) or (i0 >= 0 and i0 <= 9) or (i0 >= 0 and i0 <= 9) or (i0 >= 0 and i0 <= 9) or (i0 >= 0 and i0 <= 9) or (i0 >= 0 and i0 <= 9);
  Stmt_for_body5[i0] -> MemRef_x[0] : (i0 >= 0 and i0 <= 9) or (i0 >= 0 and i0 <= 9) or (i0 >= 0 and i0 <= 9) or (i0 >= 0 and i0 <= 9) or (i0 >= 0 and i0 <= 9) or (i0 >= 0 and i0 <= 9);
  Stmt_for_body5[i0] -> MemRef_x[13] : (i0 >= 0 and i0 <= 9) or (i0 >= 0 and i0 <= 9) or (i0 >= 0 and i0 <= 9) or (i0 >= 0 and i0 <= 9);
  Stmt_for_body5[i0] -> MemRef_x[1] : (i0 >= 0 and i0 <= 9) or (i0 >= 0 and i0 <= 9) or (i0 >= 0 and i0 <= 9) or (i0 >= 0 and i0 <= 9);
  Stmt_for_body5[i0] -> MemRef_x[3] : (i0 >= 0 and i0 <= 9) or (i0 >= 0 and i0 <= 9) or (i0 >= 0 and i0 <= 9) or (i0 >= 0 and i0 <= 9);
  Stmt_for_body5[i0] -> MemRef_x[7] : (i0 >= 0 and i0 <= 9) or (i0 >= 0 and i0 <= 9) or (i0 >= 0 and i0 <= 9) or (i0 >= 0 and i0 <= 9);
  Stmt_for_body5[i0] -> MemRef_x[12] : (i0 >= 0 and i0 <= 9) or (i0 >= 0 and i0 <= 9) or (i0 >= 0 and i0 <= 9) or (i0 >= 0 and i0 <= 9);
  Stmt_for_body5[i0] -> MemRef_x[5] : (i0 >= 0 and i0 <= 9) or (i0 >= 0 and i0 <= 9) or (i0 >= 0 and i0 <= 9) or (i0 >= 0 and i0 <= 9) or (i0 >= 0 and i0 <= 9) or (i0 >= 0 and i0 <= 9)
}
Write: ...

As we can see, the reason why PollyDependence leads to such expensive compile times lies in the very complex Read/Write/MayWrite sets. In fact, we can run isl_union_map_coalesce() on Read/MayWrite/MustWrite before feeding them into the ISL dependence analysis:

diff --git a/lib/Analysis/Dependences.cpp b/lib/Analysis/Dependences.cpp
index 9f918f3..39c3fb6 100644
--- a/lib/Analysis/Dependences.cpp
+++ b/lib/Analysis/Dependences.cpp
@@ -95,6 +95,10 @@ void Dependences::calculateDependences(Scop &S) {
   collectInfo(S, &Read, &Write, &MayWrite, &Schedule);
+  Read = isl_union_map_coalesce(Read);
+  Write = isl_union_map_coalesce(Write);
+  MayWrite = isl_union_map_coalesce(MayWrite);
+
   DEBUG(dbgs() << "Read: " << Read << "\n"
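To see the effect of coalescing in isolation, here is a minimal standalone sketch against the public isl API (my illustration, not part of the patch); it pushes one of the redundant access relations from above through isl_union_map_coalesce():

    #include <isl/ctx.h>
    #include <isl/union_map.h>

    int main(void) {
      isl_ctx *Ctx = isl_ctx_alloc();
      /* One access relation from the salsa20 SCoP, with the redundant
         disjuncts that make the dependence computation expensive. */
      isl_union_map *Read = isl_union_map_read_from_str(
          Ctx, "{ Stmt_for_body5[i0] -> MemRef_x[4] : "
               "(i0 >= 0 and i0 <= 9) or (i0 >= 0 and i0 <= 9) or "
               "(i0 >= 0 and i0 <= 9) or (i0 >= 0 and i0 <= 9) }");
      /* Coalescing merges the four identical disjuncts into a single
         one, so the dependence analysis sees a much smaller input. */
      Read = isl_union_map_coalesce(Read);
      isl_union_map_dump(Read); /* { ... MemRef_x[4] : 0 <= i0 <= 9 } */
      isl_union_map_free(Read);
      isl_ctx_free(Ctx);
      return 0;
    }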

With this patch, we can reduce the compile-time percentage of PollyDependence from 98.5% to 15.3%. Unfortunately, the compile-time percentage of PollyDependence for some benchmarks, such as "nestedloop", is still very high.

My plan is to continue investigating why the PollyDependence pass leads to such high compile-time overhead. Could anyone give me some comments or suggestions?

Thanks,
Star Tan

Hi all,

I have summarized the top 10 compiler passes for Polly when compiling the LLVM test-suite. The results can be viewed at:
     https://gist.github.com/tanstar/581bcea1e4e03498f935/raw/f6a4ec4e8565f7a7bbdb924cd59fcf145caac039/Polly-top10

Based on the comparison between "clang -O3" and "polly -O3" listed at:
     http://188.40.87.11:8000/db_default/v4/nts/18?compare_to=14&baseline=14

Please compare against clang -O3 -load LLVMPolly.so; otherwise the compile time, especially of the small binaries, includes the overhead of loading the Polly shared object file.

we can see that Polly's compile-time overhead mainly results from some expensive Polly passes such as PollyDependence, PollyOptimization and PollyCodegen. In particular, I notice that PollyDependence can lead to significant extra compile-time overhead. Its compile-time percentage for some expensive benchmarks can be summarized as:
     nestedloop: 41.4% (Polly - Calculate dependence)
     salsa20: 98.5% (Polly - Calculate dependence)
     seidel-2d: 72.1% (Polly - Calculate dependence)
     multiplies: 54.3% (Polly - Calculate dependence)
     Puzzle: 22.8% (Polly - Calculate dependence)

As a result, it is critical to improve the PollyDependence pass. I have previously committed a patch to split the start value from SCEVAddRecExpr (r187728), which improves the PollyDependence pass for some benchmarks, as shown at:
      http://188.40.87.11:8000/db_default/v4/nts/25?baseline=18&compare_to=18

Those are very nice results. Removing 80% of the compile time for the lu benchmark is a very nice outcome.

Obviously, it is not enough.

Sure, there is still plenty left to be optimised. However, when Polly detects a code region that it can optimise, it seems fair that we spend some time analysing and optimising it. Especially if our transformation improves the run-time performance.

I have investigated the salsa20 benchmark, in which the PollyDependence pass consumes 98.5% of the total compile time. The key code of the salsa20 benchmark consists of a series of array operations:
   #define R(a,b) (((a) << (b)) | ((a) >> (32 - (b))))
   for (i = 20;i > 0;i -= 2) {
     x[ 4] ^= R(x[ 0]+x[12], 7); x[ 8] ^= R(x[ 4]+x[ 0], 9);
     x[12] ^= R(x[ 8]+x[ 4],13); x[ 0] ^= R(x[12]+x[ 8],18);
     ....
     x[14] ^= R(x[13]+x[12],13); x[15] ^= R(x[14]+x[13],18);
   }
which would be translated into a series of affine access functions:
       ReadAccess := { Stmt_for_body5[i0] -> MemRef_x[12] };
       ReadAccess := { Stmt_for_body5[i0] -> MemRef_x[4] };
       MustWriteAccess := { Stmt_for_body5[i0] -> MemRef_x[4] };
       ReadAccess := { Stmt_for_body5[i0] -> MemRef_x[0] };
       ...
Consequently, the PollyDependence pass would produce very complex Read/Write/MayWrite sets, as follows:

Read: {
  Stmt_for_body[i0] -> MemRef_in[i0] : i0 >= 0 and i0 <= 15;
  Stmt_for_body5[i0] -> MemRef_x[15] : (i0 >= 0 and i0 <= 9) or (i0 >= 0 and i0 <= 9) or (i0 >= 0 and i0 <= 9) or (i0 >= 0 and i0 <= 9) or (i0 >= 0 and i0 <= 9) or (i0 >= 0 and i0 <= 9);
  Stmt_for_body5[i0] -> MemRef_x[9] : (i0 >= 0 and i0 <= 9) or (i0 >= 0 and i0 <= 9) or (i0 >= 0 and i0 <= 9) or (i0 >= 0 and i0 <= 9);
  Stmt_for_body5[i0] -> MemRef_x[4] : (i0 >= 0 and i0 <= 9) or (i0 >= 0 and i0 <= 9) or (i0 >= 0 and i0 <= 9) or (i0 >= 0 and i0 <= 9);
  Stmt_for_body5[i0] -> MemRef_x[11] : (i0 >= 0 and i0 <= 9) or (i0 >= 0 and i0 <= 9) or (i0 >= 0 and i0 <= 9) or (i0 >= 0 and i0 <= 9);
  Stmt_for_body5[i0] -> MemRef_x[2] : (i0 >= 0 and i0 <= 9) or (i0 >= 0 and i0 <= 9) or (i0 >= 0 and i0 <= 9) or (i0 >= 0 and i0 <= 9);
  Stmt_for_body5[i0] -> MemRef_x[6] : (i0 >= 0 and i0 <= 9) or (i0 >= 0 and i0 <= 9) or (i0 >= 0 and i0 <= 9) or (i0 >= 0 and i0 <= 9);
  Stmt_for_body5[i0] -> MemRef_x[14] : (i0 >= 0 and i0 <= 9) or (i0 >= 0 and i0 <= 9) or (i0 >= 0 and i0 <= 9) or (i0 >= 0 and i0 <= 9);
  Stmt_for_body5[i0] -> MemRef_x[8] : (i0 >= 0 and i0 <= 9) or (i0 >= 0 and i0 <= 9) or (i0 >= 0 and i0 <= 9) or (i0 >= 0 and i0 <= 9);
  Stmt_for_body5[i0] -> MemRef_x[10] : (i0 >= 0 and i0 <= 9) or (i0 >= 0 and i0 <= 9) or (i0 >= 0 and i0 <= 9) or (i0 >= 0 and i0 <= 9) or (i0 >= 0 and i0 <= 9) or (i0 >= 0 and i0 <= 9);
  Stmt_for_body5[i0] -> MemRef_x[0] : (i0 >= 0 and i0 <= 9) or (i0 >= 0 and i0 <= 9) or (i0 >= 0 and i0 <= 9) or (i0 >= 0 and i0 <= 9) or (i0 >= 0 and i0 <= 9) or (i0 >= 0 and i0 <= 9);
  Stmt_for_body5[i0] -> MemRef_x[13] : (i0 >= 0 and i0 <= 9) or (i0 >= 0 and i0 <= 9) or (i0 >= 0 and i0 <= 9) or (i0 >= 0 and i0 <= 9);
  Stmt_for_body5[i0] -> MemRef_x[1] : (i0 >= 0 and i0 <= 9) or (i0 >= 0 and i0 <= 9) or (i0 >= 0 and i0 <= 9) or (i0 >= 0 and i0 <= 9);
  Stmt_for_body5[i0] -> MemRef_x[3] : (i0 >= 0 and i0 <= 9) or (i0 >= 0 and i0 <= 9) or (i0 >= 0 and i0 <= 9) or (i0 >= 0 and i0 <= 9);
  Stmt_for_body5[i0] -> MemRef_x[7] : (i0 >= 0 and i0 <= 9) or (i0 >= 0 and i0 <= 9) or (i0 >= 0 and i0 <= 9) or (i0 >= 0 and i0 <= 9);
  Stmt_for_body5[i0] -> MemRef_x[12] : (i0 >= 0 and i0 <= 9) or (i0 >= 0 and i0 <= 9) or (i0 >= 0 and i0 <= 9) or (i0 >= 0 and i0 <= 9);
  Stmt_for_body5[i0] -> MemRef_x[5] : (i0 >= 0 and i0 <= 9) or (i0 >= 0 and i0 <= 9) or (i0 >= 0 and i0 <= 9) or (i0 >= 0 and i0 <= 9) or (i0 >= 0 and i0 <= 9) or (i0 >= 0 and i0 <= 9)
}

       Write: ...

As we can see, the reason why PollyDependence leads to such expensive compile times lies in the very complex Read/Write/MayWrite sets. In fact, we can run isl_union_map_coalesce() on Read/MayWrite/MustWrite before feeding them into the ISL dependence analysis:

Very nice.

diff --git a/lib/Analysis/Dependences.cpp b/lib/Analysis/Dependences.cpp
index 9f918f3..39c3fb6 100644
--- a/lib/Analysis/Dependences.cpp
+++ b/lib/Analysis/Dependences.cpp
@@ -95,6 +95,10 @@ void Dependences::calculateDependences(Scop &S) {
   collectInfo(S, &Read, &Write, &MayWrite, &Schedule);
+  Read = isl_union_map_coalesce(Read);
+  Write = isl_union_map_coalesce(Write);
+  MayWrite = isl_union_map_coalesce(MayWrite);
+
   DEBUG(dbgs() << "Read: " << Read << "\n"

This patch is unreadable in the mail. However, the one you submitted looked good and was committed.

With this patch, we can reduce the compile-time percentage of PollyDependence from 98.5% to 15.3%. Unfortunately, the compile-time percentage of PollyDependence for some benchmarks, such as "nestedloop", is still very high.

It would be good to get an up-to-date comparison with the latest patch having gone into Polly.

I did not yet look at the nestedloop benchmark, but it sounds basically like a benchmark consisting only of loop nests that we can optimise. This is definitely interesting to look into, both with respect to how we can analyse it faster and whether we are able to improve the performance of the generated code. Especially if we improve the execution performance, some additional compile time is justified.

My plan is to continue investigating why the PollyDependence pass leads to such high compile-time overhead. Could anyone give me some comments or suggestions?

You already made nice progress. I propose to just continue by measuring the remaining overhead, taking one of the top cases and analysing it. We can then see on a case-by-case basis what can be done.

Cheers,
Tobias

Hi all,

I have summarized the top 10 compiler passes for Polly when compiling the LLVM test-suite. The results can be viewed at:
https://gist.github.com/tanstar/581bcea1e4e03498f935/raw/f6a4ec4e8565f7a7bbdb924cd59fcf145caac039/Polly-top10

Based on the comparison between "clang -O3" and "polly -O3" listed at:
http://188.40.87.11:8000/db_default/v4/nts/18?compare_to=14&baseline=14

Please compare against clang -O3 -load LLVMPolly.so; otherwise the
compile time, especially of the small binaries, includes the overhead
of loading the Polly shared object file.

In fact, the compile-time overhead of loading the Polly shared object file is very small (usually less than 1%). It can be viewed at:
http://188.40.87.11:8000/db_default/v4/nts/15?compare_to=14&baseline=14

we can see that Polly's compile-time overhead mainly results from some expensive Polly passes such as PollyDependence, PollyOptimization and PollyCodegen. In particular, I notice that PollyDependence can lead to significant extra compile-time overhead. Its compile-time percentage for some expensive benchmarks can be summarized as:
nestedloop: 41.4% (Polly - Calculate dependence)
salsa20: 98.5% (Polly - Calculate dependence)
seidel-2d: 72.1% (Polly - Calculate dependence)
multiplies: 54.3% (Polly - Calculate dependence)
Puzzle: 22.8% (Polly - Calculate dependence)

As a result, it is critical to improve the PollyDependence pass. I have previously committed a patch to split the start value from SCEVAddRecExpr (r187728), which improves the PollyDependence pass for some benchmarks, as shown at:
http://188.40.87.11:8000/db_default/v4/nts/25?baseline=18&compare_to=18

Those are very nice results. Removing 80% of the compile time for the lu
benchmark is a very nice outcome.

Obviously, it is not enough.

Sure, there is still plenty left to be optimised. However, when Polly
detects a code region that it can optimise, it seems fair that we spend
some time analysing and optimising it. Especially if our transformation
improves the run-time performance.

I have investigated the salsa20 benchmark, in which the PollyDependence pass consumes 98.5% of the total compile time. The key code of the salsa20 benchmark consists of a series of array operations:
   #define R(a,b) (((a) << (b)) | ((a) >> (32 - (b))))
   for (i = 20;i > 0;i -= 2) {
     x[ 4] ^= R(x[ 0]+x[12], 7); x[ 8] ^= R(x[ 4]+x[ 0], 9);
     x[12] ^= R(x[ 8]+x[ 4],13); x[ 0] ^= R(x[12]+x[ 8],18);
     ....
     x[14] ^= R(x[13]+x[12],13); x[15] ^= R(x[14]+x[13],18);
   }
which would be translated into a series of affine access functions:
       ReadAccess := { Stmt_for_body5[i0] -> MemRef_x[12] };
       ReadAccess := { Stmt_for_body5[i0] -> MemRef_x[4] };
       MustWriteAccess := { Stmt_for_body5[i0] -> MemRef_x[4] };
       ReadAccess := { Stmt_for_body5[i0] -> MemRef_x[0] };
       ...

Consequently, the PollyDependence pass would produce very complex Read/Write/MayWrite sets, as follows:

Read: {
  Stmt_for_body[i0] -> MemRef_in[i0] : i0 >= 0 and i0 <= 15;
  Stmt_for_body5[i0] -> MemRef_x[15] : (i0 >= 0 and i0 <= 9) or (i0 >= 0 and i0 <= 9) or (i0 >= 0 and i0 <= 9) or (i0 >= 0 and i0 <= 9) or (i0 >= 0 and i0 <= 9) or (i0 >= 0 and i0 <= 9);
  Stmt_for_body5[i0] -> MemRef_x[9] : (i0 >= 0 and i0 <= 9) or (i0 >= 0 and i0 <= 9) or (i0 >= 0 and i0 <= 9) or (i0 >= 0 and i0 <= 9);
  Stmt_for_body5[i0] -> MemRef_x[4] : (i0 >= 0 and i0 <= 9) or (i0 >= 0 and i0 <= 9) or (i0 >= 0 and i0 <= 9) or (i0 >= 0 and i0 <= 9);
  Stmt_for_body5[i0] -> MemRef_x[11] : (i0 >= 0 and i0 <= 9) or (i0 >= 0 and i0 <= 9) or (i0 >= 0 and i0 <= 9) or (i0 >= 0 and i0 <= 9);
  Stmt_for_body5[i0] -> MemRef_x[2] : (i0 >= 0 and i0 <= 9) or (i0 >= 0 and i0 <= 9) or (i0 >= 0 and i0 <= 9) or (i0 >= 0 and i0 <= 9);
  Stmt_for_body5[i0] -> MemRef_x[6] : (i0 >= 0 and i0 <= 9) or (i0 >= 0 and i0 <= 9) or (i0 >= 0 and i0 <= 9) or (i0 >= 0 and i0 <= 9);
  Stmt_for_body5[i0] -> MemRef_x[14] : (i0 >= 0 and i0 <= 9) or (i0 >= 0 and i0 <= 9) or (i0 >= 0 and i0 <= 9) or (i0 >= 0 and i0 <= 9);
  Stmt_for_body5[i0] -> MemRef_x[8] : (i0 >= 0 and i0 <= 9) or (i0 >= 0 and i0 <= 9) or (i0 >= 0 and i0 <= 9) or (i0 >= 0 and i0 <= 9);
  Stmt_for_body5[i0] -> MemRef_x[10] : (i0 >= 0 and i0 <= 9) or (i0 >= 0 and i0 <= 9) or (i0 >= 0 and i0 <= 9) or (i0 >= 0 and i0 <= 9) or (i0 >= 0 and i0 <= 9) or (i0 >= 0 and i0 <= 9);
  Stmt_for_body5[i0] -> MemRef_x[0] : (i0 >= 0 and i0 <= 9) or (i0 >= 0 and i0 <= 9) or (i0 >= 0 and i0 <= 9) or (i0 >= 0 and i0 <= 9) or (i0 >= 0 and i0 <= 9) or (i0 >= 0 and i0 <= 9);
  Stmt_for_body5[i0] -> MemRef_x[13] : (i0 >= 0 and i0 <= 9) or (i0 >= 0 and i0 <= 9) or (i0 >= 0 and i0 <= 9) or (i0 >= 0 and i0 <= 9);
  Stmt_for_body5[i0] -> MemRef_x[1] : (i0 >= 0 and i0 <= 9) or (i0 >= 0 and i0 <= 9) or (i0 >= 0 and i0 <= 9) or (i0 >= 0 and i0 <= 9);
  Stmt_for_body5[i0] -> MemRef_x[3] : (i0 >= 0 and i0 <= 9) or (i0 >= 0 and i0 <= 9) or (i0 >= 0 and i0 <= 9) or (i0 >= 0 and i0 <= 9);
  Stmt_for_body5[i0] -> MemRef_x[7] : (i0 >= 0 and i0 <= 9) or (i0 >= 0 and i0 <= 9) or (i0 >= 0 and i0 <= 9) or (i0 >= 0 and i0 <= 9);
  Stmt_for_body5[i0] -> MemRef_x[12] : (i0 >= 0 and i0 <= 9) or (i0 >= 0 and i0 <= 9) or (i0 >= 0 and i0 <= 9) or (i0 >= 0 and i0 <= 9);
  Stmt_for_body5[i0] -> MemRef_x[5] : (i0 >= 0 and i0 <= 9) or (i0 >= 0 and i0 <= 9) or (i0 >= 0 and i0 <= 9) or (i0 >= 0 and i0 <= 9) or (i0 >= 0 and i0 <= 9) or (i0 >= 0 and i0 <= 9)
}
Write: ...

As we can see, the reason why PollyDependence leads to such expensive compile times lies in the very complex Read/Write/MayWrite sets. In fact, we can run isl_union_map_coalesce() on Read/MayWrite/MustWrite before feeding them into the ISL dependence analysis:

Very nice.

diff --git a/lib/Analysis/Dependences.cpp b/lib/Analysis/Dependences.cpp
index 9f918f3..39c3fb6 100644
--- a/lib/Analysis/Dependences.cpp
+++ b/lib/Analysis/Dependences.cpp
@@ -95,6 +95,10 @@ void Dependences::calculateDependences(Scop &S) {
   collectInfo(S, &Read, &Write, &MayWrite, &Schedule);
+  Read = isl_union_map_coalesce(Read);
+  Write = isl_union_map_coalesce(Write);
+  MayWrite = isl_union_map_coalesce(MayWrite);
+
   DEBUG(dbgs() << "Read: " << Read << "\n"

This patch is unreadable in the mail. However, the one you submitted
looked good and was committed.

With this patch, we can reduce the compile-time percentage of PollyDependence from 98.5% to 15.3%. Unfortunately, the compile-time percentage of PollyDependence for some benchmarks, such as "nestedloop", is still very high.

It would be good to get an up-to-date comparison with the latest patch
having gone into Polly.

Yes, you can view the comparison at:
http://188.40.87.11:8000/db_default/v4/nts/26?compare_to=25&baseline=25

Results show that this patch is very effective for several benchmarks, such as salsa20 (reduced by 97.72%), Obsequi (54.62%), seidel-2d (48.64%) and telecomm-gsm (33.71%).

I did not yet look at the nestedloop benchmark, but it sounds basically
like a benchmark consisting only of loop nests that we can optimise.
This is definitely interesting to look into, both with respect to how
we can analyse it faster and whether we are able to improve the
performance of the generated code. Especially if we improve the
execution performance, some additional compile time is justified.

Yes. nestedloop.c is a very simple benchmark that contains a single loop nest, as follows:
     int n = ((argc == 2) ? atoi(argv[1]) : LENGTH);
     int a, b, c, d, e, f, x=0;
     for (a=0; a<n; a++)
         for (b=0; b<n; b++)
             for (c=0; c<n; c++)
                 for (d=0; d<n; d++)
                     for (e=0; e<n; e++)
                         for (f=0; f<n; f++)
                             x++;
Polly significantly increases the compile time from 0.0320s to 2.3320s (about 70x), but it also reduces the execution time from 0.048s to 0.004s (12x). Maybe it is worth it, but I think it would be better if we could reduce the compile time without hurting the execution performance.
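As an aside (my own illustration, not a claim about what Polly actually generates): the nest does nothing but count its own iterations, which is why almost all of its run time can be optimised away. For n >= 0 it is equivalent, up to integer overflow, to:

     /* Closed form of the six loops: one x++ per iteration. */
     long long trips = (long long)n * n * n * n * n * n; /* n^6 */
     x += (int)trips;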

My plan is to continue investigating why the PollyDependence pass leads to such high compile-time overhead. Could anyone give me some comments or suggestions?

You already made nice progress. I propose to just continue by measuring
the remaining overhead, taking one of the top cases and analysing it. We
can then see on a case-by-case basis what can be done.

Cheers,
Tobias

Thanks for your suggestion.

Cheers,
Star Tan

Hi all,

I have summarized the top 10 compiler passes for Polly when compiling the LLVM test-suite. The results can be viewed at:
      https://gist.github.com/tanstar/581bcea1e4e03498f935/raw/f6a4ec4e8565f7a7bbdb924cd59fcf145caac039/Polly-top10

Based on the comparison between "clang -O3" and "polly -O3" listed at:
      http://188.40.87.11:8000/db_default/v4/nts/18?compare_to=14&baseline=14

Please compare against clang -O3 -load LLVMPolly.so; otherwise the
compile time, especially of the small binaries, includes the overhead
of loading the Polly shared object file.

In fact, the compile-time overhead of loading the Polly shared object file is very small (usually less than 1%). It can be viewed at:
     http://188.40.87.11:8000/db_default/v4/nts/15?compare_to=14&baseline=14

I see. Still, I think we should eliminate it, to remove another source of noise.

diff --git a/lib/Analysis/Dependences.cpp b/lib/Analysis/Dependences.cpp
index 9f918f3..39c3fb6 100644
--- a/lib/Analysis/Dependences.cpp
+++ b/lib/Analysis/Dependences.cpp
@@ -95,6 +95,10 @@ void Dependences::calculateDependences(Scop &S) {
   collectInfo(S, &Read, &Write, &MayWrite, &Schedule);
+  Read = isl_union_map_coalesce(Read);
+  Write = isl_union_map_coalesce(Write);
+  MayWrite = isl_union_map_coalesce(MayWrite);
+
   DEBUG(dbgs() << "Read: " << Read << "\n"

This patch is unreadable in the mail. However, the one you submitted
looked good and was committed.

With this patch, we can reduce the compile-time percentage of PollyDependence from 98.5% to 15.3%. Unfortunately, the compile-time percentage of PollyDependence for some benchmarks, such as "nestedloop", is still very high.

It would be good to get an up-to-date comparison with the latest patch
having gone into Polly.

Yes, you can view the comparison at:
     http://188.40.87.11:8000/db_default/v4/nts/26?compare_to=25&baseline=25

I am slightly confused. What are we comparing here? It seems we compare a PollyScopInfo run with a PollyDependences run (same source code?) and the Dependences run shows compile time speedup over ScopInfo? This is surprising.

The numbers rather suggest that those are two Dependences runs, the first
without the patch applied and the second with it. If this is the case, why
does the page say "*** WARNING ***: comparison is against a different
machine (pollyScopInfo__clang_DEV__x86_64,11)"?

Results show that this patch is very effective for several benchmarks, such as salsa20 (reduced by 97.72%), Obsequi (54.62%), seidel-2d (48.64%) and telecomm-gsm (33.71%).

I did not yet look at the nestedloop benchmark, but it sounds basically
like a benchmark consisting only of loop nests that we can optimise.
This is definitely interesting to look into, both with respect to how
we can analyse it faster and whether we are able to improve the
performance of the generated code. Especially if we improve the
execution performance, some additional compile time is justified.

Yes. nestedloop.c is a very simple benchmark that contains a single loop nest, as follows:
     int n = ((argc == 2) ? atoi(argv[1]) : LENGTH);
     int a, b, c, d, e, f, x=0;
     for (a=0; a<n; a++)
         for (b=0; b<n; b++)
             for (c=0; c<n; c++)
                 for (d=0; d<n; d++)
                     for (e=0; e<n; e++)
                         for (f=0; f<n; f++)
                             x++;
Polly significantly increases the compile time from 0.0320s to 2.3320s (about 70x), but it also reduces the execution time from 0.048s to 0.004s (12x). Maybe it is worth it, but I think it would be better if we could reduce the compile time without hurting the execution performance.

Sure. Any ideas what is going on here? It seems you have a couple of open problems. It probably makes sense to open a bug report for each of them and attach source code as well as your findings there.

Cheers,
Tobias

Hi all,

I have summarized the top 10 compiler passes for Polly when compiling the LLVM test-suite. The results can be viewed at:
https://gist.github.com/tanstar/581bcea1e4e03498f935/raw/f6a4ec4e8565f7a7bbdb924cd59fcf145caac039/Polly-top10

Based on the comparison between "clang -O3" and "polly -O3" listed at:
http://188.40.87.11:8000/db_default/v4/nts/18?compare_to=14&baseline=14

Please compare against clang -O3 -load LLVMPolly.so; otherwise the
compile time, especially of the small binaries, includes the overhead
of loading the Polly shared object file.

In fact, the compile-time overhead of loading the Polly shared object file is very small (usually less than 1%). It can be viewed at:
http://188.40.87.11:8000/db_default/v4/nts/15?compare_to=14&baseline=14

I see. Still, I think we should eliminate it, to remove another source of noise.

Got it. The comparison between Polly and "clang -O3 -load LLVMPolly.so" can be viewed at:
http://188.40.87.11:8000/db_default/v4/nts/18?compare_to=15&baseline=15

diff --git a/lib/Analysis/Dependences.cpp b/lib/Analysis/Dependences.cpp
index 9f918f3..39c3fb6 100644
--- a/lib/Analysis/Dependences.cpp
+++ b/lib/Analysis/Dependences.cpp
@@ -95,6 +95,10 @@ void Dependences::calculateDependences(Scop &S) {
   collectInfo(S, &Read, &Write, &MayWrite, &Schedule);
+  Read = isl_union_map_coalesce(Read);
+  Write = isl_union_map_coalesce(Write);
+  MayWrite = isl_union_map_coalesce(MayWrite);
+
   DEBUG(dbgs() << "Read: " << Read << "\n"

This patch is unreadable in the mail. However, the one you submitted
looked good and was committed.

With this patch, we can reduce the compile-time percentage of PollyDependence from 98.5% to 15.3%. Unfortunately, the compile-time percentage of PollyDependence for some benchmarks, such as "nestedloop", is still very high.

It would be good to get an up-to-date comparison with the latest patch
having gone into Polly.

Yes, you can view the comparison at:
http://188.40.87.11:8000/db_default/v4/nts/26?compare_to=25&baseline=25

I am slightly confused. What are we comparing here? It seems we compare
a PollyScopInfo run with a PollyDependences run (same source code?) and
the Dependences run shows compile time speedup over ScopInfo? This is
surprising.

In fact, there are three runs:

pollyOpt (run id = 18): the baseline of Polly
pollyScopInfo (run id = 25): Polly with the ScopInfo patch r187728, which splits the start value from SCEVAddRecExpr
pollyDependence (run id = 26): Polly with both the ScopInfo patch r187728 and the PollyDependence patch r187981, which simplifies the Read/Write/MayWrite sets before feeding them into ISL.

As a result, I compare pollyDependence (run id = 26) with pollyScopInfo (run id = 25) to see the impact of our PollyDependence patch r187981. The only difference between the two runs is that patch.

Of course, you can view the total performance improvement from both the ScopInfo patch and the PollyDependence patch at:
http://188.40.87.11:8000/db_default/v4/nts/26?compare_to=18&baseline=18

The numbers rather suggest that those are two Dependences runs, the
first without the patch applied and the second with it. If this is the
case, why does the page say "*** WARNING ***: comparison is against a
different machine (pollyScopInfo__clang_DEV__x86_64,11)"?

LNT always reports a warning if you compare two runs with different tester names.

Results show that this patch is very effective for several benchmarks, such as salsa20 (reduced by 97.72%), Obsequi (54.62%), seidel-2d (48.64%) and telecomm-gsm (33.71%).

I did not yet look at the nestedloop benchmark, but it sounds basically
like a benchmark consisting only of loop nests that we can optimise.
This is definitely interesting to look into, both with respect to how
we can analyse it faster and whether we are able to improve the
performance of the generated code. Especially if we improve the
execution performance, some additional compile time is justified.

Yes. nestedloop.c is a very simple benchmark that contains a single loop nest, as follows:
     int n = ((argc == 2) ? atoi(argv[1]) : LENGTH);
     int a, b, c, d, e, f, x=0;
     for (a=0; a<n; a++)
         for (b=0; b<n; b++)
             for (c=0; c<n; c++)
                 for (d=0; d<n; d++)
                     for (e=0; e<n; e++)
                         for (f=0; f<n; f++)
                             x++;
Polly significantly increases the compile time from 0.0320s to 2.3320s (about 70x), but it also reduces the execution time from 0.048s to 0.004s (12x). Maybe it is worth it, but I think it would be better if we could reduce the compile time without hurting the execution performance.

Sure. Any ideas what is going on here? It seems you have a couple of
open problems. It probably makes sense to open a bug report for each of
them and attach source code as well as your findings there.

I see. I will file some bug reports soon.

Cheers,
Tobias

Thanks,
Star Tan

  Hi all,

  I have summarized the top 10 compiler passes for Polly when compiling the LLVM test-suite. The results can be viewed at:
        https://gist.github.com/tanstar/581bcea1e4e03498f935/raw/f6a4ec4e8565f7a7bbdb924cd59fcf145caac039/Polly-top10

  Based on the comparison between "clang -O3" and "polly -O3" listed at:
        http://188.40.87.11:8000/db_default/v4/nts/18?compare_to=14&baseline=14

  Please compare against clang -O3 -load LLVMPolly.so; otherwise the
  compile time, especially of the small binaries, includes the overhead
  of loading the Polly shared object file.

  In fact, the compile-time overhead of loading the Polly shared object file is very small (usually less than 1%). It can be viewed at:
       http://188.40.87.11:8000/db_default/v4/nts/15?compare_to=14&baseline=14

I see. Still, I think we should eliminate it, to remove another source of noise.

Got it. The comparison between Polly and "clang -O3 -load LLVMPolly.so" can be viewed at:
    http://188.40.87.11:8000/db_default/v4/nts/18?compare_to=15&baseline=15

I see.

  diff --git a/lib/Analysis/Dependences.cpp b/lib/Analysis/Dependences.cpp
  index 9f918f3..39c3fb6 100644
  --- a/lib/Analysis/Dependences.cpp
  +++ b/lib/Analysis/Dependences.cpp
  @@ -95,6 +95,10 @@ void Dependences::calculateDependences(Scop &S) {
     collectInfo(S, &Read, &Write, &MayWrite, &Schedule);
  +  Read = isl_union_map_coalesce(Read);
  +  Write = isl_union_map_coalesce(Write);
  +  MayWrite = isl_union_map_coalesce(MayWrite);
  +
     DEBUG(dbgs() << "Read: " << Read << "\n"

  This patch is unreadable in the mail. However, the one you submitted
  looked good and was committed.

  With this patch, we can reduce the compile-time percentage of PollyDependence from 98.5% to 15.3%. Unfortunately, the compile-time percentage of PollyDependence for some benchmarks, such as "nestedloop", is still very high.

  It would be good to get an up-to-date comparison with the latest patch
  having gone into Polly.

  Yes, you can view the comparison at:
       http://188.40.87.11:8000/db_default/v4/nts/26?compare_to=25&baseline=25

I am slightly confused. What are we comparing here? It seems we compare
a PollyScopInfo run with a PollyDependences run (same source code?) and
the Dependences run shows compile time speedup over ScopInfo? This is
surprising.

In fact, there are three runs:

pollyOpt (run id = 18): the baseline of Polly
pollyScopInfo (run id = 25): Polly with the ScopInfo patch r187728, which splits the start value from SCEVAddRecExpr
pollyDependence (run id = 26): Polly with both the ScopInfo patch r187728 and the PollyDependence patch r187981, which simplifies the Read/Write/MayWrite sets before feeding them into ISL.

As a result, I compare pollyDependence (run id = 26) with pollyScopInfo (run id = 25) to see the impact of our PollyDependence patch r187981. The only difference between the two runs is that patch.

I see. I did not realize that all those runs have been performed with '-O3 -polly'. Now it makes sense.

If you happen to not use the tester machine right now, it would be interesting to schedule runs that give us the current state of Polly performance, meaning pollyBasic, pollyNoGen, pollyNoOpt and pollyOpt for the latest version of Polly.

Cheers,
Tobi

Certainly, I will start the testing tonight. It usually takes about three days to complete a full test run.

Cheers,
Star Tan