Hello, everyone.

I’m working on HPC on CPUs. The following pseudocode illustrates the microkernel of my program, where i and j are the base row and column indices of the tile, and m and n are the sizes of the computation.

```
for k = 1 to K do
  C[i:i+m-1, j:j+n-1] += A[i:i+m-1, k] * B[k, j:j+n-1]
endfor
```

Since the memory operations in the microkernel are costly, I want the computation itself to be performed only on registers. The idea is that all required data is copied from memory into registers before it is used in the computation, as the following pseudocode shows:

```
vector vc = load <mxn> vector from C[i,j]
for k = 1 to K do
  vector va = load <mx1> column vector from A[i,k]
  vector vb = load <1xn> row vector from B[k,j]
  vc += va * vb
endfor
store vc back to C[i,j]
```
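
To make the intent concrete, here is a minimal C sketch of the same idea (it is not my actual code). It assumes row-major storage, 0-based indexing, and a fixed 8x8 tile; the names `microkernel`, `MR`, `NR`, `lda`, `ldb`, and `ldc` are only illustrative. Whether `vc` really stays in registers is up to the compiler.

```c
#include <stddef.h>

enum { MR = 8, NR = 8 };  /* illustrative tile sizes (m and n above) */

/* Computes C[i:i+MR, j:j+NR] += A[i:i+MR, 0:K] * B[0:K, j:j+NR].
 * All matrices are row-major; lda, ldb, ldc are row strides in elements. */
void microkernel(size_t K, const float *A, size_t lda,
                 const float *B, size_t ldb,
                 float *C, size_t ldc, size_t i, size_t j)
{
    float vc[MR][NR];  /* accumulator tile, intended to live in registers */

    /* Load the C tile once, before the loop. */
    for (size_t r = 0; r < MR; ++r)
        for (size_t c = 0; c < NR; ++c)
            vc[r][c] = C[(i + r) * ldc + (j + c)];

    for (size_t k = 0; k < K; ++k) {
        float va[MR];  /* column k of the A panel */
        float vb[NR];  /* row k of the B panel */

        for (size_t r = 0; r < MR; ++r)
            va[r] = A[(i + r) * lda + k];
        for (size_t c = 0; c < NR; ++c)
            vb[c] = B[k * ldb + (j + c)];

        /* Rank-1 update: vc += va * vb (outer product). */
        for (size_t r = 0; r < MR; ++r)
            for (size_t c = 0; c < NR; ++c)
                vc[r][c] += va[r] * vb[c];
    }

    /* Store the accumulated tile back once, after the loop. */
    for (size_t r = 0; r < MR; ++r)
        for (size_t c = 0; c < NR; ++c)
            C[(i + r) * ldc + (j + c)] = vc[r][c];
}
```

The point of the sketch is that `vc` is read from memory once and written back once, while the loop body only updates the local accumulator.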

I used the ‘vector’ dialect to implement this microkernel, but I cannot find an appropriate ‘vector’ op to implement the “+=” operation.

```
%vc = vector.load %C[%i, %j] : memref<?x?xf32>, vector<8x8xf32>
scf.for %p = %c0 to %K step %c1 {
  // %A_tran is the transpose of A, so row %p of %A_tran holds column %p of A
  %va = vector.load %A_tran[%p, %i] : memref<?x?xf32>, vector<8xf32>
  %vb = vector.load %B[%p, %j] : memref<?x?xf32>, vector<8xf32>
  %result = vector.outerproduct %va, %vb, %vc : vector<8xf32>, vector<8xf32>
  // how to move %result to %vc?
}
```

I think the reason is that the ‘vector’ dialect doesn’t provide an op with move or assignment semantics on (virtual vector) registers. The ‘x86vector’ dialect doesn’t provide such an op either, although the AVX-512 intrinsics include floating-point move operations. Without such an op, the intermediate result computed inside the loop cannot be passed to the accumulator variable defined outside the loop.

The documentation for the ‘vector’ dialect says that hardware vector ops for CPU are in flight. I’m wondering how I could implement the aforementioned microkernel, and whether a generic CPU dialect is in development.

Any feedback is highly appreciated.