Matrix Multiplication not Vectorized using double pointers

Hello,
My matrix multiplication code has variables allocated via double pointers on heap. The code is not getting vectorized. Following is the code.

int **buffer_A = (int **)malloc(vecsize * sizeof(int *));
int **buffer_B = (int **)malloc(vecsize * sizeof(int *));
for(p = 0; p < vecsize; p++)
{
buffer_A[p] = (int *)malloc(vecsize * sizeof(int));
}
for(p = 0; p < vecsize; p++)
{
buffer_B[p] = (int *)malloc(vecsize * sizeof(int));
}
long i, j, k;
int **result = (int **)malloc(vecsize * sizeof(int *));
for(p = 0; p < vecsize; p++)
{
result[p] = (int *)malloc(vecsize * sizeof(int));
}
for(i = 0; i < vecsize; i++)
{
for(j = 0; j < vecsize; j++)
{
result[i][j]=0;
for(k = 0; k < vecsize; k++)
{
result[i][j] += buffer_A[i][k] * buffer_B[k][j];
}
}
}

what is the issue? how to resolve it?

Thank You

Regards

Compile with optimization remarks on ("-Rpass-analysis=.*", or "-Rpass-missed=.*") to get some optimization remarks for why the optimizer is or is not handling certain constructs. (Warning, this is a lot, and it will often be hard to decode if you don't already know how the relevant optimization passes work.)

There are several problems here, but zooming out, there are just way too many pointers here that could all potentially alias. Proving statically that none of these pointers alias is easier said than done, especially when variable-sized arrays are involved. Even if you could handle that within a function (not trivial), the next problem after that is crossing function boundaries (if there's code between allocation and use), and that gets messy really quick.

If you have 2D arrays, then if at all possible, allocating them contiguously and use a single base pointer to access all elements of a given array. That's a form the loop analysis passes know how to deal with and it's going to go better, although you will probably still need some annotations to promise the compiler there's no aliasing. (Easier said than done; you can flag the result pointer as __restrict, but let's just say that this path is fraught with peril.)

After writing the whole thing using single base pointers, manual 2D array access and, I can get Clang 9.0.0 to vectorize the loop (for x86 at -march=skylake which has cheap enough gathers), but it's still a very direct translation of the original code, which is not what you actually want.

Efficient vectorization of matrix multiplies is a well-studied problem, but that doesn't make it easy. Turning the basic three nested loops into a form that achieves high performance requires _many_ different transform steps. (Among other things, almost certainly turning the three nested loops into at least five, to process the data in small blocks, getting rid of the gathers and turning them into in-register transpositions.)

Getting state-of-the-art loop optimizations into LLVM has been an ongoing project for many years (-> "Polly", though that seems to have gotten quiet recently?). I believe LLVM+Polly would do a lot better on this particular example, but it's not enabled (or compiled in) for normal compiler releases.

-Fabian

Compile with optimization remarks on ("-Rpass-analysis=.*", or

"-Rpass-missed=.*") to get some optimization remarks for why the optimizer is or is not handling certain constructs. (Warning, this is a lot, and it will often be hard to decode if you don't already know how the relevant optimization passes work.)
>
> There are several problems here, but zooming out, there are just way too many pointers here that could all potentially alias. Proving statically that none of these pointers alias is easier said than done, especially when variable-sized arrays are involved. Even if you could handle that within a function (not trivial), the next problem after that is crossing function boundaries (if there's code between allocation and use), and that gets messy really quick.
>
> If you have 2D arrays, then if at all possible, allocating them contiguously and use a single base pointer to access all elements of a given array. That's a form the loop analysis passes know how to deal with and it's going to go better, although you will probably still need some annotations to promise the compiler there's no aliasing. (Easier said than done; you can flag the result pointer as __restrict, but let's just say that this path is fraught with peril.)
>
> After writing the whole thing using single base pointers, manual 2D array access and, I can get Clang 9.0.0 to vectorize the loop (for x86 at -march=skylake which has cheap enough gathers), but it's still a very direct translation of the original code, which is not what you actually want.
>
> Efficient vectorization of matrix multiplies is a well-studied problem, but that doesn't make it easy. Turning the basic three nested loops into a form that achieves high performance requires _many_ different transform steps. (Among other things, almost certainly turning the three nested loops into at least five, to process the data in small blocks, getting rid of the gathers and turning them into in-register transpositions.)
>
> Getting state-of-the-art loop optimizations into LLVM has been an ongoing project for many years (-> "Polly", though that seems to have gotten quiet recently?). I believe LLVM+Polly would do a lot better on this particular example, but it's not enabled (or compiled in) for normal compiler releases.

FWIW, Polly performs pattern matching for matrix multiplications (only)
and replaces them with pretty good code. Without that the tiling Polly
applies is usually the most beneficial factor for code like this. That
said, the code below, with the indirection and missing local restrict
keywords* in the 3D loop, will not be optimized by Polly.

* Support to actually use local restrict keywords is still under
development in LLVM.

Cheers,
Johannes

> -Fabian
>
>> Hello,
>> My matrix multiplication code has variables allocated via double pointers on heap. The code is not getting vectorized. Following is the code.