Hello everybody,
I would like to share a problem I am currently struggling with.
I want to [fuse two loops] and then [combine loop vectorization with loop unrolling, where the unrolled part contains an external function call], starting from the affine dialect. Ultimately, I want to lower the result to LLVM and produce an optimized binary with clang -O3.
For example, the input code looks like this (I use a C++-like style for readability):
for (int i = 0; i < 1024; i++) {   // constant trip count, divisible by 8
    a[i] = function_call();        // should be unrolled
}
// the two loops should be fused
for (int i = 0; i < 1024; i++) {   // same loop bounds
    d[i] = a[i] * b[i] + c[i];     // should be vectorized
}
The output I would like to get from this is:
for (int i = 0; i < 1024; i += 8) {
    a[i] = function_call();
    a[i+1] = function_call();
    a[i+2] = function_call();
    ...
    a[i+7] = function_call();      // unrolled
    // loops fused!
    d[i to i+7] = a[i to i+7] * b[i to i+7] + c[i to i+7];   // vectorized
}
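To make the target shape concrete, here is a minimal hand-written sketch in plain C++, not the output of any MLIR pass. It strip-mines both loops by 8 and fuses the strips, so the calls stay in their original order and the arithmetic sits in a call-free inner loop. The names `call_counter`, `kWidth`, `two_loops` and `fused_strips` are made up for illustration, and `function_call` here is only a stand-in for the real external function. Whether the compiler actually unrolls the first inner loop and vectorizes the second still depends on the target and on what it can prove about aliasing.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for the external call: returns 1, 2, 3, ...
static int call_counter = 0;
float function_call() { return static_cast<float>(++call_counter); }

// Assumed unroll factor and vector width (8, as in the post).
constexpr std::size_t kWidth = 8;

// Reference version: the two separate loops from the input code.
void two_loops(std::vector<float>& a, const std::vector<float>& b,
               const std::vector<float>& c, std::vector<float>& d) {
    for (std::size_t i = 0; i < a.size(); ++i) a[i] = function_call();
    for (std::size_t i = 0; i < a.size(); ++i) d[i] = a[i] * b[i] + c[i];
}

// Strip-mined and fused version: one outer loop stepping by kWidth.
void fused_strips(std::vector<float>& a, const std::vector<float>& b,
                  const std::vector<float>& c, std::vector<float>& d) {
    const std::size_t n = a.size();
    assert(n % kWidth == 0);  // trip count must be divisible by the strip size
    for (std::size_t i = 0; i < n; i += kWidth) {
        // Scalar calls, kept in the original order; candidate for unrolling.
        for (std::size_t j = 0; j < kWidth; ++j) a[i + j] = function_call();
        // Call-free strip of exactly kWidth elements; candidate for vectorization.
        for (std::size_t j = 0; j < kWidth; ++j)
            d[i + j] = a[i + j] * b[i + j] + c[i + j];
    }
}
```

Both versions call `function_call` once per element in increasing index order, so they fill `a` and `d` identically as long as the call does not depend on `b`, `c` or `d`.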
To do that, I tried the following.
1. affine-loop-fusion, and then clang++ with the -force-vector-width=8 option in the llvm to binary phase. However, it did not vectorize the arithmetic and printed this remark: remark: <unknown>:0:0: loop not vectorized: call instruction cannot be vectorized [-Rpass-analysis. It seems that the (fused) loop cannot be vectorized if it contains any call instructions.
2. Similar to (1), but with affine-super-vectorize and -virtual-vector-width=8 applied instead. It also did not vectorize the loops that contain function calls. Furthermore, I had trouble lowering to llvm. It caused error: failed to legalize operation 'builtin.unrealized_conversion_cast' that was explicitly marked illegal at the reconcile-unrealized-casts pass stage. However, above the erroneous line I cannot see any difference between [the MLIR input to reconcile-unrealized-casts that went through affine-super-vectorize] and [the one without vectorization]…
3. affine-loop-unroll with the -unroll-factor=8 option first, but the subsequent affine-loop-fusion did not work properly: it could not even merge two simple unrolled arithmetic loops. It seems that loop fusion should be applied before any other passes, but that makes it hard to both vectorize and unroll.
Now I am stuck…
I would appreciate it if you could share any knowledge or relevant experience. Thank you!