Hi all,

In Linalg there are two ways (as far as I know) to deal with vectorization at the boundaries: padding and peeling.

- Padding “extends” the region of memory so that its size aligns with the vector length
- Peeling extracts a main loop whose upper bound is rounded down to a multiple of the vector length, and then creates a remainder loop to handle the leftover iterations
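For concreteness, this is roughly the loop structure peeling would produce for a 33-element loop tiled by 4 (a hand-written sketch, not actual compiler output; constant names are mine):

```
// Main loop: bound rounded down to a multiple of the vector length,
// so every tile is a full tensor<4xf32> and vectorizes cleanly.
scf.for %i = %c0 to %c32 step %c4 {
  // ... full 4-wide tiles ...
}
// Remainder loop: the last 33 - 32 = 1 iteration, handled separately
// (typically left scalar or given a partial tile).
scf.for %i = %c32 to %c33 step %c1 {
  // ... leftover element ...
}
```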

However, there is a third option, which I have not been able to trigger from Linalg: **masking**. Consider a simple saxpy:

```
#map0 = affine_map<(d0) -> ()>
#map1 = affine_map<(d0) -> (d0)>
module {
  func @saxpy_linalg(%arg0: f32, %arg1: tensor<33xf32>, %arg2: tensor<33xf32>) -> tensor<33xf32> {
    %0 = linalg.generic {indexing_maps = [#map0, #map1, #map1], iterator_types = ["parallel"]} ins(%arg0, %arg1 : f32, tensor<33xf32>) outs(%arg2 : tensor<33xf32>) {
    ^bb0(%arg5: f32, %arg6: f32, %arg7: f32):
      %3 = arith.mulf %arg5, %arg6 : f32
      %4 = arith.addf %3, %arg7 : f32
      linalg.yield %4 : f32
    } -> tensor<33xf32>
    return %0 : tensor<33xf32>
  }
}
```

When I tile by 4, this is what I get:

```
#map0 = affine_map<(d0) -> (-d0 + 33, 4)>
#map1 = affine_map<(d0) -> ()>
#map2 = affine_map<(d0) -> (d0)>
module {
  func @saxpy_linalg(%arg0: f32, %arg1: tensor<33xf32>, %arg2: tensor<33xf32>) -> tensor<33xf32> {
    %c4 = arith.constant 4 : index
    %c33 = arith.constant 33 : index
    %c0 = arith.constant 0 : index
    %0 = scf.for %arg3 = %c0 to %c33 step %c4 iter_args(%arg4 = %arg2) -> (tensor<33xf32>) {
      %1 = affine.min #map0(%arg3)
      %2 = tensor.extract_slice %arg1[%arg3] [%1] [1] : tensor<33xf32> to tensor<?xf32>
      %3 = tensor.extract_slice %arg4[%arg3] [%1] [1] : tensor<33xf32> to tensor<?xf32>
      %4 = linalg.generic {indexing_maps = [#map1, #map2, #map2], iterator_types = ["parallel"]} ins(%arg0, %2 : f32, tensor<?xf32>) outs(%3 : tensor<?xf32>) attrs = {iree_linalg_transform.matched} {
      ^bb0(%arg5: f32, %arg6: f32, %arg7: f32):
        %6 = arith.mulf %arg5, %arg6 : f32
        %7 = arith.addf %6, %arg7 : f32
        linalg.yield %7 : f32
      } -> tensor<?xf32>
      %5 = tensor.insert_slice %4 into %arg4[%arg3] [%1] [1] : tensor<?xf32> into tensor<33xf32>
      scf.yield %5 : tensor<33xf32>
    }
    return %0 : tensor<33xf32>
  }
}
```

I would like to lower this to emit `transfer_read` operations with masks, and subsequently `masked_load` instructions. Is there a way to do so directly from Linalg?
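To make the target concrete, the loop body I have in mind would look roughly like the following (a hand-written sketch of masked vector IR, not actual compiler output; value names and the exact op forms are my own):

```
// %1 is the affine.min result, i.e. the dynamic tile size (4 or 1 here).
// Build a mask that enables only the first %1 lanes.
%mask = vector.create_mask %1 : vector<4xi1>
%pad  = arith.constant 0.0 : f32

// Masked reads instead of dynamically-shaped extract_slice + linalg.generic.
%x  = vector.transfer_read %arg1[%arg3], %pad, %mask : tensor<33xf32>, vector<4xf32>
%y  = vector.transfer_read %arg4[%arg3], %pad, %mask : tensor<33xf32>, vector<4xf32>

// saxpy on full 4-wide vectors; inactive lanes are harmless.
%a  = vector.broadcast %arg0 : f32 to vector<4xf32>
%m  = arith.mulf %a, %x : vector<4xf32>
%s  = arith.addf %m, %y : vector<4xf32>

// Masked write-back; only the first %1 lanes are stored.
%w  = vector.transfer_write %s, %arg4[%arg3], %mask : vector<4xf32>, tensor<33xf32>
```

The masked `transfer_read`/`transfer_write` pairs would then, on targets that support it, lower to `llvm.masked.load`/`llvm.masked.store` intrinsics.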

I see that the vectorization does not even kick in, because of the dynamic shapes introduced by tiling. Before I throw out some ideas: is there any way I can mask the loop? Has someone already done this?

Thanks,

Giuseppe