Hi all,

I have summarized the top 10 compiler passes for Polly when compiling LLVM test-ssuite. Results can be viewed on:

https://gist.github.com/tanstar/581bcea1e4e03498f935/raw/f6a4ec4e8565f7a7bbdb924cd59fcf145caac039/Polly-top10

Based on the comparison between “clang -O3” and “polly -O3” listed on:

http://188.40.87.11:8000/db_default/v4/nts/18?compare_to=14&baseline=14

we can see Polly’s compile-time overhead is mainly resulted by some expensive Polly passes such as PollyDependence, PollyOptimization and PollyCodegen. Especially, I notice that the PollyDependence can lead to significant extra compile-time overhead. Its compile-time percentage for some expensive benchmarks can be summarized as:

nestedloop: 41.4% (Polly - Calculate dependence)

salsa20: 98.5% (Polly - Calculate dependence)

seidel-2d: 72.1% (Polly - Calculate dependence)

multiplies: 54.3% (Poly - Calculate dependence)

Puzzle: 22.8% (Poly - Calculate dependence)

As a result, it is critical to improve the PollyDependence pass. I have previously committed a patch file to split start value from SCEVAddRecExpr (r187728), which can improve the PollyDependence pass for some benchmarks as follows:

http://188.40.87.11:8000/db_default/v4/nts/25?baseline=18&compare_to=18

Obviously, it is not enough.

I have investigated the salsa20 benchmark, in which the PollyDependence pass consumes 98.5% of total compile-time. The key code of salsa20 benchmark consists of a serial of array operations:

#define R(a,b) (((a) &! lt;< (b)) | ((a) >> (32 - (b))))

for (i = 20;i > 0;i -= 2) {

x[ 4] ^= R(x[ 0]+x[12], 7); x[ 8] ^= R(x[ 4]+x[ 0], 9);

x[12] ^= R(x[ 8]+x[ 4],13); x[ 0] ^= R(x[12]+x[ 8],18);

…

x[14] ^= R(x[13]+x[12],13); x[15] ^= R(x[14]+x[13],18);

}

which would be translated into a serial of affine functions:

ReadAccess := { Stmt_for_body5[i0] → MemRef_x[12] };

ReadAccess := { Stmt_for_body5[i0] → MemRef_x[4] };

MustWriteAccess := { Stmt_for_body5[i0] → MemRef_x[4] };

! ReadAccess := { Stmt_for_body5[i0] → MemRef_x[0] };

…

Consequently, the PollyDependence pass would produce very complex Read/Write/MayWrite as follows:

Read: { Stmt_for_body[i0] → MemRef_in[i0] : i0 >= 0 and i0 <= 15; Stmt_for_body5[i0] → MemRef_x[15] : (i0 >= 0 and i0 <= 9) or (i0 >= 0 and i0 <= 9) or (i0 >= 0 and i0 <= 9) or (i0 >= 0 and i0 <= 9) or (i0 >= 0 and i0 <= 9) or (i0 >= 0 and i0 <= 9); Stmt_for_body5[i0] → MemRef_x[9] : (i0 >= 0 and i0 <= 9) or (i0 >= 0 and i0 <= 9) or (i0 >= 0 and i0 <= 9) or (i0 >= 0 and i0 <= 9); Stmt_for_body5[i0] → MemRef_x[4] : (i0 >= 0 and i0 <= 9) or (i0 >= 0 and i0 <= 9) or (i0 >= 0 and i0 <= 9) or (i0 >= 0 and i0 <= 9); Stmt_for_body5[i0] → MemRe! f_x[11] : (i0 >= 0 and i0 <= 9) or (i0 >= 0 and i0 <= 9) or (i0 >= 0 and i0 <= 9) or (i0 >= 0 and i0 <= 9); Stmt_for_body5[i0] → MemRef_x[2] : (i0 >= 0 and i0 <= 9) or (i0 >= 0 and i0 <= 9) or (i0 >= 0 and i0 <= 9) or (i0 >= 0 and i0 <= 9); Stmt_for_body5[i0] → MemRef_x[6] : (i0 >= 0 and i0 <= 9) or (i0 >= 0 and i0 <= 9) or (i0 >= 0 and i0 <= 9) or (i0 >= 0 and i0 <= 9); Stmt_for_body5[i0] → MemRef_x[14] : (i0 >= 0 and i0 <= 9) or (i0 >= 0 and i0 <= 9) or (i0 >= 0 and i0 <= 9) or (i0 >= 0 and i0 <= 9); Stmt_for_body5[i0] → MemRef_x[8] : (i0 >= 0 and i0 <= 9) or (i0 >= 0 and i0 <= 9) or (i0 >= 0 and i0 <= 9) or (i0 >= 0 and i0 <= 9); Stmt_for_body5[i0] → MemRef_x[10] : (i0 >= 0 and i0 <= 9) or (i0 >= 0 and i0 <= 9) or (i0 >= 0 and i0 <= 9) or (i0 >= 0 and i0 <= 9) or (i0 >= 0 and i0 <= 9) or (i0 >= 0 and ! i0 <= 9); Stmt_for_body5[i0] → MemRef_x[0] : (i0 >= 0 and i0 <= 9) or (i0 >= 0 and i0 <= 9) or (i0 >= 0 and i0 <= 9) or (i0 >= 0 and i0 <= 9) or (i0 >= 0 and i0 <= 9) or (i0 >= 0 and i0 <= 9); Stmt_for_body5[i0] → MemRef_x[13] : (i0 >= 0 and i0 <= 9) or (i0 >= 0 and i0 <= 9) or (i0 >= 0 and i0 <= 9) or (i0 >= 0 and i0 <= 9); Stmt_for_body5[i0] → MemRef_x[1] : (i0 >= 0 and i0 <= 9) or (i0 >= 0 and i0 <= 9) or (i0 >= 0 and i0 <= 9) or (i0 >= 0 and i0 <= 9); Stmt_for_body5[i0] → MemRef_x[3] : (i0 >= 0 and i0 <= 9) or (i0 >= 0 and i0 <= 9) or (i0 >= 0 and i0 <= 9) or (i0 >= 0 and i0 <= 9); Stmt_for_body5[i0] → MemRef_x[7] : (i0 >= 0 and i0 <= 9) or (i0 >= 0 and i0 <= 9) or (i0 >= 0 and i0 <= 9) or (i0 >= 0 and i0 <= 9); Stmt_for_body5[i0] → MemRef_x[12] : (i0 >= 0 and i0 <= 9) or (i0 >= 0 and i0 <= 9) or (i0 >= 0 and i0 <= 9) or (i0 >= 0 and i0 <= 9); Stmt_for_body5[i! 0] → MemRef_x[5] : (i0 >= 0 and i0 <= 9) or (i0 >= 0 and i0 <= 9) or (i0 >= 0 and i0 <= 9) or (i0 >= 0 and i0 <= 9) or (i0 >= 0 and i0 <= 9) or (i0 >= 0 and i0 <= 9) }

Write: …

As we can see, the reason why PollyDependence leads to expensive compile-time lies in the very complex Read/Write/MayWrite structure. In fact, we can run isl_union_map_coalesce() on the Read/MayWrite/MustWrite before feeding them into ISL analysis:

diff --git a/lib/Analysis/Dependences.cpp b/lib/Analysis/Dependences.cpp index 9f918f3…39c3fb6 100644 — a/lib/Analysis/Dependences.cpp +++ b/lib/Analysis/Dependences.cpp @@ -95,6 +95,10 @@ void Dependences::calculateDependences(Scop &S) { collectInfo(S, &Read, &Write, &MayWrite, &Schedule); + Read = isl_union_map_coalesce(Read); + Write = isl_union_map_coalesce(Write); + MayWrite = isl_union_map_coalesce(MayWrite); + DEBUG(dbgs() << "Read: " << Read << “\n”

With this patch file, we can reduce the compile-time percentage of PollyDependence from 98.5% to 15.3%. Unfortunately, the compile-time percentage of PollyDependence for benchmarks, such as “nestedloop”, is still very high.

My plan is to continue investigating why PollyDependence pass leads to such high compile-time overhead. Could anyone give me some comments or suggestions?

Thanks,

Star Tan