Moving values between vectors in the 'vector' dialect

Hello, everyone.

I’m working on HPC on CPUs. The following pseudo code illustrates the microkernel of my program, where i and j are base indices, and m and n are the sizes of the computation.

for k = 1 to K do
	C[i:i+m-1, j:j+n-1] += A[i:i+m-1,k] * B[k,j:j+n-1]

Because the memory operations in the microkernel are costly, I want the computation itself to operate only on registers. The idea is therefore to copy the required data from memory into registers before computing on it, as in the following pseudo code:

vector vc = load <mxn> vector from C[i,j]
for k = 1 to K do
	vector va = load <mx1> column vector from A[i,k]
	vector vb = load <1xn> row vector from B[k,j]
	vc += va * vb;
store vc back to C[i,j]

I used the ‘vector’ dialect to implement this microkernel, but I cannot find an appropriate ‘vector’ op to implement the “+=” operation.

%vc = vector.load %C[%i, %j] : memref<?x?xf32>, vector<8x8xf32>
scf.for %p = %c1 to %K step %c1 {
	// %A_tran is the transpose of A
	%va = vector.load %A_tran[%p, %c0] : memref<?x?xf32>, vector<8xf32>
	%vb = vector.load %B[%p, %c0] : memref<?x?xf32>, vector<8xf32>
	%result = vector.outerproduct %va, %vb, %vc : vector<8xf32>, vector<8xf32>
	// how to move %result to %vc
}

I think the reason is that the ‘vector’ dialect doesn’t provide an op with move or assignment semantics on (virtual vector) registers. The ‘x86vector’ dialect doesn’t provide such an op either, even though the AVX-512 intrinsics include floating-point move operations. Without such an op, the intermediate results cannot be passed to the accumulator variable defined outside the loop.

The documentation for the ‘vector’ dialect says that hardware vector ops for CPU are in flight. I’m wondering how I could implement the aforementioned microkernel, and whether a generic CPU dialect is in development.

Any feedback is highly appreciated.

You can simply use arith.addf. Then you can use scf.yield to handle the loop-carried dependency. The IR you want is something like this:

%vc = vector.load %C[%i, %j] : memref<?x?xf32>, vector<8xf32>
%loop_carried = scf.for %p = %c1 to %K step %c1 iter_args(%arg = %vc) -> (vector<8xf32>) {
	// %A_tran is the transpose of A
	%va = vector.load %A_tran[%p, %c0] : memref<?x?xf32>, vector<8xf32>
	%vb = vector.load %B[%p, %c0] : memref<?x?xf32>, vector<8xf32>
	%prod = arith.mulf %va, %vb : vector<8xf32>
	%result = arith.addf %prod, %arg : vector<8xf32>
	// %result becomes %arg in the next iteration
	scf.yield %result : vector<8xf32>
}
// After the loop, %loop_carried holds the final accumulated value.
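For the 8x8 outer-product microkernel from the question, the same loop-carried pattern could look roughly like the sketch below. It is untested and makes several assumptions that are not in the original posts: the function name @microkernel and its argument list are hypothetical, the loads index with %i and %j rather than %c0, the loop starts at 0 because scf.for bounds are half-open and memref indices are 0-based (unlike the 1-based pseudo code), and the 2-D vector.load and vector.store on %C are assumed to be supported by your MLIR version and lowering pipeline.

```mlir
// Hypothetical wrapper: computes C[i:i+8, j:j+8] += sum over p of A[i:i+8, p] * B[p, j:j+8].
// %A_tran is assumed to hold the transpose of A, as in the question.
func.func @microkernel(%A_tran: memref<?x?xf32>, %B: memref<?x?xf32>,
                       %C: memref<?x?xf32>, %i: index, %j: index, %K: index) {
  %c0 = arith.constant 0 : index
  %c1 = arith.constant 1 : index
  // Load the 8x8 accumulator tile once, before the loop.
  %vc = vector.load %C[%i, %j] : memref<?x?xf32>, vector<8x8xf32>
  // %arg carries the accumulator from one iteration to the next.
  %acc = scf.for %p = %c0 to %K step %c1
      iter_args(%arg = %vc) -> (vector<8x8xf32>) {
    // Row p of A_tran is column p of A.
    %va = vector.load %A_tran[%p, %i] : memref<?x?xf32>, vector<8xf32>
    %vb = vector.load %B[%p, %j] : memref<?x?xf32>, vector<8xf32>
    // The third operand is the accumulator: %next = %va (outer) %vb + %arg.
    %next = vector.outerproduct %va, %vb, %arg : vector<8xf32>, vector<8xf32>
    scf.yield %next : vector<8x8xf32>
  }
  // Store the final tile back once, after the loop.
  vector.store %acc, %C[%i, %j] : memref<?x?xf32>, vector<8x8xf32>
  return
}
```

No move or assignment op is needed here: in SSA form the "+=" is expressed by yielding a new value each iteration and binding it to the region argument %arg, and later lowering passes decide which values stay in registers.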

Did you see the “Case Study Docs on Vector Dialect CPU Codegen” post? The docs presented in that post contain a lot of vector dialect examples together with the generated SIMD code, such as AVX512.

Thanks for the advice, @ThomasRaoux. Honestly, I didn’t realize that scf.for can return the final values after the loop terminates to handle the loop-carried dependency. Your advice helps me a lot.

Thanks for the reply, @aartbik. I hadn’t found this post before. The post and the docs presented in it are related to my work. I will read them, as you suggested.