I have just started experimenting with vector distribution. Based on code review comments, this needs some discussion, so I’m starting this thread so that we can discuss the design of this kind of transformation.
Disclaimer: this is highly experimental and many things are likely to change. The idea is to start experimenting with basic transformations so that we can find problems early and iterate until we reach something we are happy with.
As @nicolasvasilache explained in the ODM on vectors, there are benefits to representing the program with large vectors (much larger than what the target supports). This lets us express dependencies as SSA values, which makes the analysis simpler and allows us to decide later what should be demoted to memory and what should stay in registers.
One of the challenges with this is that we need to break up those large vectors incrementally during codegen so that they eventually map to the native size the hardware supports. The distribution could be done in many ways: across different threads, serialized in a loop, or unrolled.
To break up the vector, the direction I’m currently experimenting with is to use transient operations called extract_map/insert_map and to propagate them through the SSA chain until we reach memory access operations. The names are still open to improvement, and the ops will most likely have to use affine_map to generalize the transformation to N-D vectors.
For instance, if we have:
%a = vector.transfer_read %in1[%c0], %cf0: memref<?xf32>, vector<256xf32>
%b = vector.transfer_read %in2[%c0], %cf0: memref<?xf32>, vector<256xf32>
%acc = addf %a, %b: vector<256xf32>
vector.transfer_write %acc, %out[%c0]: vector<256xf32>, memref<?xf32>
and we want to distribute it to a serial loop, the result should be:
scf.for %arg5 = %c0 to %c32 step %c1 {
  %idx = affine.apply #map0(%arg5)
  %a = vector.transfer_read %in1[%idx], %cf0: memref<?xf32>, vector<8xf32>
  %b = vector.transfer_read %in2[%idx], %cf0: memref<?xf32>, vector<8xf32>
  %acc = addf %a, %b: vector<8xf32>
  vector.transfer_write %acc, %out[%idx]: vector<8xf32>, memref<?xf32>
}
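To make the intended equivalence concrete, here is a small Python model (not MLIR) of the two forms above. It is only a sketch: plain lists stand in for memrefs and vectors, the function names are made up for illustration, and it assumes that #map0 maps the induction variable i to the offset 8 * i, which is what 256 elements split over 32 iterations suggests.

```python
# Toy model: plain Python lists stand in for memrefs and vectors.
VECTOR_SIZE = 256                    # size of the large vector in the example
NUM_CHUNKS = 32                      # trip count of the scf.for loop
CHUNK = VECTOR_SIZE // NUM_CHUNKS    # 8 elements handled per iteration


def add_whole(in1, in2):
    """Single 256-wide read/add/write, as in the first snippet."""
    out = [0.0] * VECTOR_SIZE
    a = in1[0:VECTOR_SIZE]
    b = in2[0:VECTOR_SIZE]
    out[0:VECTOR_SIZE] = [x + y for x, y in zip(a, b)]
    return out


def add_serial(in1, in2):
    """Same computation distributed over a serial loop of 8-wide chunks."""
    out = [0.0] * VECTOR_SIZE
    for i in range(NUM_CHUNKS):
        idx = i * CHUNK              # assumed meaning of #map0
        a = in1[idx:idx + CHUNK]
        b = in2[idx:idx + CHUNK]
        out[idx:idx + CHUNK] = [x + y for x, y in zip(a, b)]
    return out


in1 = [float(i) for i in range(VECTOR_SIZE)]
in2 = [float(2 * i) for i in range(VECTOR_SIZE)]

# Both forms must produce the same output buffer.
assert add_whole(in1, in2) == add_serial(in1, in2)
```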
To avoid having to do the rewrite all at once, we use vector.extract_map/insert_map to do the conversion incrementally:
scf.for %arg5 = %c0 to %c32 step %c1 {
  %a = vector.transfer_read %in1[%c0], %cf0: memref<?xf32>, vector<256xf32>
  %b = vector.transfer_read %in2[%c0], %cf0: memref<?xf32>, vector<256xf32>
  %acc = addf %a, %b: vector<256xf32>
  %ext = vector.extract_map %acc[%arg5 : 32] : vector<256xf32> to vector<8xf32>
  %ins = vector.insert_map %ext, %arg5, 32 : vector<8xf32> to vector<256xf32>
  vector.transfer_write %ins, %out[%c0]: vector<256xf32>, memref<?xf32>
}
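As a rough way to think about the two transient ops, the following Python sketch models them on plain lists. It rests on two assumptions that are not spelled out above: that extract_map selects a contiguous chunk of size N / multiplicity starting at id * chunk size (consistent with the 8-wide reads in the target loop), and that insert_map can be modelled with an explicit destination vector. The op in the snippet has no such operand; `dest` exists here only to make each iteration's contribution visible.

```python
def extract_map(vec, idx, multiplicity):
    """Return the idx-th of `multiplicity` equal contiguous chunks of vec.

    Contiguous chunking is an assumption of this model.
    """
    assert len(vec) % multiplicity == 0
    chunk = len(vec) // multiplicity
    return vec[idx * chunk:(idx + 1) * chunk]


def insert_map(chunk_vec, idx, multiplicity, dest):
    """Return a copy of dest with its idx-th chunk replaced by chunk_vec.

    The explicit `dest` argument is a modelling assumption; the op in the
    snippet only takes the chunk, the index and the multiplicity.
    """
    assert len(dest) == len(chunk_vec) * multiplicity
    chunk = len(chunk_vec)
    result = list(dest)
    result[idx * chunk:(idx + 1) * chunk] = chunk_vec
    return result


acc = [float(i) for i in range(256)]
rebuilt = [None] * 256
for i in range(32):
    ext = extract_map(acc, i, 32)          # vector<256xf32> -> vector<8xf32>
    assert len(ext) == 8
    rebuilt = insert_map(ext, i, 32, rebuilt)

# Once every index has contributed its chunk, the original vector is recovered.
assert rebuilt == acc
```

In this model the extract_map/insert_map pair is an identity once all 32 indices have run, which is why inserting the pair in front of the write does not change what the loop computes.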
Then we propagate: we merge the vector.insert_map with the vector.transfer_write, we match elementwise operations with vector.extract_map so that the extract moves above them, and eventually we merge the vector.extract_map with the vector.transfer_read.
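The propagation steps rely on a few simple identities, which can be checked on a toy Python model. This is a sketch under assumptions, not the actual patterns: lists stand in for memrefs and vectors, the helper names mirror the op names purely for readability, and extract_map is assumed to select a contiguous 8-element chunk at offset 8 * id.

```python
def transfer_read(mem, offset, size):
    """Read `size` contiguous elements from mem starting at offset."""
    return mem[offset:offset + size]


def transfer_write(vec, mem, offset):
    """Write vec into mem starting at offset."""
    mem[offset:offset + len(vec)] = vec


def extract_map(vec, idx, multiplicity):
    """Assumed semantics: the idx-th contiguous chunk of vec."""
    chunk = len(vec) // multiplicity
    return vec[idx * chunk:(idx + 1) * chunk]


def addf(a, b):
    """Elementwise addition of two equally sized vectors."""
    return [x + y for x, y in zip(a, b)]


in1 = [float(i) for i in range(256)]
in2 = [float(3 * i) for i in range(256)]

for i in range(32):
    a = transfer_read(in1, 0, 256)
    b = transfer_read(in2, 0, 256)
    # extract_map commutes with an elementwise operation.
    assert extract_map(addf(a, b), i, 32) == addf(
        extract_map(a, i, 32), extract_map(b, i, 32))
    # extract_map of a wide read equals a narrow read at a shifted offset.
    assert extract_map(a, i, 32) == transfer_read(in1, i * 8, 8)

# After all merges, each iteration only touches its own 8-element slice.
out = [0.0] * 256
for i in range(32):
    chunk = addf(transfer_read(in1, i * 8, 8), transfer_read(in2, i * 8, 8))
    transfer_write(chunk, out, i * 8)

assert out == addf(in1, in2)
```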
Obviously, we need some analysis to make sure the overall transformation is correct, and this will have to be done at some point. Currently I’m working on creating the basic patterns that can be combined to get the transformations we want.
I hope this helps explain the rationale behind those first patterns. Feel free to join the discussion if you have ideas on alternative solutions.
FYI @mehdi_amini