Fusing affine loop vectorization with unrolled function calls?

Hello everybody,

I would like to share a problem I have been struggling with.

Starting from the affine dialect, I am trying to [fuse two loops] and to [combine loop vectorization with unrolling of a loop that contains an external function call]. Ultimately, I want to lower the result to LLVM and produce an optimized binary with clang -O3.

For example, the input code looks like this (I use C++-like syntax for readability):

for (int i = 0; i < 1024; i++) { // constant trip count, divisible by 8
  a[i] = function_call();    // hope to be unrolled
}
// hope to be fused
for (int i = 0; i < 1024; i++) { // same loop bounds
  d[i] = a[i] * b[i] + c[i]; // hope to be vectorized
}

The output I would like to get from the code above is the following:

for (int i = 0; i < 1024; i += 8) {
  a[i] = function_call();
  a[i+1] = function_call();
  a[i+2] = function_call();
  // ... (a[i+3] through a[i+6] are assigned the same way)
  a[i+7] = function_call(); // unrolled
  // loops fused!
  d[i to i+7] = a[i to i+7] * b[i to i+7] + c[i to i+7]; // vectorized
}
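
To make the target shape concrete, here is a minimal runnable C++ sketch of the same idea. It is only an illustration under assumptions: function_call here is a hypothetical deterministic stand-in for my real external call, and the names two_loops, fused_strips, Buf, N and W are made up for this post. The fused version steps by 8, does the eight calls first, and then runs a call-free inner loop over the same eight elements. I have not confirmed that clang actually vectorizes that inner loop; the sketch only shows that the two forms compute the same values.

```cpp
#include <array>
#include <cstddef>

constexpr std::size_t N = 1024;  // trip count, divisible by 8
constexpr std::size_t W = 8;     // unroll factor and intended vector width

using Buf = std::array<float, N>;

// Hypothetical stand-in for the external call. A global counter keeps it
// deterministic so that both loop forms can be compared.
static int call_counter = 0;
float function_call() { return static_cast<float>(call_counter++ % 17); }

// Original form: two separate loops over the same range.
void two_loops(Buf& a, const Buf& b, const Buf& c, Buf& d) {
  for (std::size_t i = 0; i < N; i++) {
    a[i] = function_call();
  }
  for (std::size_t i = 0; i < N; i++) {
    d[i] = a[i] * b[i] + c[i];
  }
}

// Target form: one outer loop stepping by W. The calls happen in the same
// order as before; the arithmetic sits in its own inner loop without calls.
void fused_strips(Buf& a, const Buf& b, const Buf& c, Buf& d) {
  for (std::size_t i = 0; i < N; i += W) {
    for (std::size_t j = 0; j < W; j++) {
      a[i + j] = function_call();  // candidate for full unrolling
    }
    for (std::size_t j = 0; j < W; j++) {
      d[i + j] = a[i + j] * b[i + j] + c[i + j];  // call-free, W elements
    }
  }
}
```

Resetting call_counter to 0 before each variant and running both on the same b and c should give identical a and d, since the calls are issued in the same order in both forms.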

To do that, I tried the following.

  1. affine-loop-fusion, followed by clang++ with the -force-vector-width=8 option in the LLVM-to-binary phase. However, the arithmetic was not vectorized, and clang printed this remark: remark: <unknown>:0:0: loop not vectorized: call instruction cannot be vectorized [-Rpass-analysis. It seems that the (fused) loop cannot be vectorized if it contains any call instruction.
  2. The same as (1), but with affine-super-vectorize and -virtual-vector-width=8 applied instead of the clang option. The result was similar: loops containing function calls were not vectorized.
    Furthermore, I had trouble lowering the result to LLVM. The reconcile-unrealized-casts pass stage failed with error: failed to legalize operation 'builtin.unrealized_conversion_cast' that was explicitly marked illegal. However, above the offending line I cannot see any difference between [the MLIR input to reconcile-unrealized-casts that went through affine-super-vectorize] and [the one without vectorization]…
  3. affine-loop-unroll with the -unroll-factor=8 option applied first. However, the subsequent affine-loop-fusion did not work properly: it could not even fuse two simple unrolled arithmetic loops. It seems that loop fusion has to be applied before any other pass, but that makes it hard to both vectorize and unroll.

Now I am stuck…
I would appreciate it if you could share any knowledge or related experience. Thank you!