What I am trying to do here is different than a simple matrix vecteur multiplication. In fact, we start by multiplying the first row with the vector %Y and we directely update %Y[0] then we use the new updated %Y vector for the multiplication with the second row and so on. In lower level code, that should generate something like that

for i = 0 to N {
yi = 0
for j = 0 to N {
yi += A[i,j]*Y[j]
}
Y[i] = yi
}

Another question related to that: we can see that the outer loop can not be parallelized especially in the case of a fully dense matrix, but if I would like to consider the case of a very sparse matrix and propably use sparse tensor: we can do some analysis on the sparsity to detect the groups of indices i that can be parallelized. So do you think that linalg.generic can allow that? or if it can be extended ?

linalg.generic in its current form does not support accessing rows. Instead, the body always works on elements. Another limitation is that we only support parallel and reduction dimensions. I think you code example would need a sequential and a reduction dimension?

Leveraging sparsity would probably require some inspector-executor model? I think that is an interesting research direction but quite a change from what we currently have.

I think you code example would need a sequential and a reduction dimension?

Yes indeed, it will need a sequential and a reduction (if we are not considering the sparsity of the matrix that can allow to do some smart parallelisation)

Leveraging sparsity would probably require some inspector-executor model? I think that is an interesting research direction but quite a change from what we currently have

I am not so familiar with inspector-executor model, but I think that what is needed to find independent groups of indices āiā corresponding to the first loop. And I am wondering how much that can be complicated to do with the current status of linalg ??

I am not so familiar with inspector-executor model, but I think that what is needed to find independent groups of indices āiā corresponding to the first loop. And I am wondering how much that can be complicated to do with the current status of linalg ??

linalg.generic can only do the dense part of the computation. A generic always iterates all elements of its input / output tensors and the loops are either parallel or reduction loops. That means in your case you probably need to use an scf.for to iterate over all groups and then you may be able to use a linalg.generic for the inner reduction. If you want to abstract the algorithm that searches the groups some how then you need custom abstractions / IR.