Bug in `PadOpTransformationPattern`

Hi there,

I think I found a bug in the current implementation of `PadOpTransformationPattern`. The lowering strategy it implements is invalid when there is no padding in the lower dimensions, because the implicit iteration domain of the `linalg.generic` used for copying forces the input and output shapes to match. Consider the following input:

```mlir
module {
    func.func @test(%in : tensor<1x128x23x23xf32>) -> tensor<1x128x24x24xf32> {
        %cst0 = arith.constant 0.0 : f32
        %out = tensor.pad %in low [0, 0, 0, 0] high [0, 0, 1, 1] {
            ^bb0(%n : index, %c : index, %y : index, %x : index):
                tensor.yield %cst0 : f32
        } : tensor<1x128x23x23xf32> to tensor<1x128x24x24xf32>
        return %out : tensor<1x128x24x24xf32>
    }
}
```

On 598c5ddba (a recent head), you will receive the following error:

```
test.mlir:15:17: error: 'linalg.generic' op inferred input/output operand #1 has shape's dimension #2 to be 23, but found 24
        %copy = linalg.generic {
                ^
test.mlir:15:17: note: see current operation: %3 = "linalg.generic"(%arg0, %2) ({
^bb0(%arg1: f32, %arg2: f32):
  "linalg.yield"(%arg1) : (f32) -> ()
}) {indexing_maps = [affine_map<(d0, d1, d2, d3) -> (d0, d1, d2, d3)>, affine_map<(d0, d1, d2, d3) -> (d0, d1, d2, d3)>], iterator_types = ["parallel", "parallel", "parallel", "parallel"], operand_segment_sizes = dense<1> : vector<2xi32>} : (tensor<1x128x23x23xf32>, tensor<1x128x24x24xf32>) -> tensor<1x128x24x24xf32>
```

Indeed, the generated lowering, written out standalone, fails verification:

```mlir
func.func @test2(%in : tensor<1x128x23x23xf32>) -> tensor<1x128x24x24xf32> {
    %cst0 = arith.constant 0.0 : f32
    %init = linalg.init_tensor [1, 128, 24, 24] : tensor<1x128x24x24xf32>
    %fill = linalg.fill
        ins(%cst0 : f32)
        outs(%init : tensor<1x128x24x24xf32>)
        -> tensor<1x128x24x24xf32>
    %copy = linalg.generic {
        indexing_maps = [
            affine_map<(n,c,y,x) -> (n,c,y,x)>,
            affine_map<(n,c,y,x) -> (n,c,y,x)>
        ],
        iterator_types = ["parallel", "parallel", "parallel", "parallel"]
    }
        ins(%in : tensor<1x128x23x23xf32>)
        outs(%fill : tensor<1x128x24x24xf32>) {
        ^bb0(%arg0 : f32, %arg1 : f32):
            linalg.yield %arg0 : f32
    } -> tensor<1x128x24x24xf32>
    return %copy : tensor<1x128x24x24xf32>
}
```

I looked at the code, and the alternative proposed there, using `tensor.insert_slice`, seems to work fine:

```mlir
func.func @test3(%in : tensor<1x128x23x23xf32>) -> tensor<1x128x24x24xf32> {
    %cst0 = arith.constant 0.0 : f32
    %init = linalg.init_tensor [1, 128, 24, 24] : tensor<1x128x24x24xf32>
    %fill = linalg.fill
        ins(%cst0 : f32)
        outs(%init : tensor<1x128x24x24xf32>)
        -> tensor<1x128x24x24xf32>
    %copy = tensor.insert_slice %in into %fill
        [0, 0, 0, 0] [1, 128, 23, 23] [1, 1, 1, 1] :
        tensor<1x128x23x23xf32> into tensor<1x128x24x24xf32>
    return %copy : tensor<1x128x24x24xf32>
}
```

I have never used Phabricator before and I don’t have the time to steward this fix now, but I wanted to let people know. Maybe @agostini01 can keep an eye on this one if they decide to do another pass on this implementation.

I just noticed that there is already a `GeneralizePadOpPattern` that implements the desired behavior. I'm now wondering what the difference between the two is meant to be: both only handle constant padding values, except that the lowering to `tensor.insert_slice` actually works.

I also noted that the `linalg.init_tensor` and `linalg.fill` pair could now probably be replaced with a single `tensor.splat`.
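
For reference, here is a sketch of that simplification (untested; it assumes `tensor.splat` is available for this fully static result type):

```mlir
func.func @test4(%in : tensor<1x128x23x23xf32>) -> tensor<1x128x24x24xf32> {
    %cst0 = arith.constant 0.0 : f32
    // tensor.splat materializes the padding value directly,
    // replacing the init_tensor + fill pair.
    %fill = tensor.splat %cst0 : tensor<1x128x24x24xf32>
    %copy = tensor.insert_slice %in into %fill
        [0, 0, 0, 0] [1, 128, 23, 23] [1, 1, 1, 1] :
        tensor<1x128x23x23xf32> into tensor<1x128x24x24xf32>
    return %copy : tensor<1x128x24x24xf32>
}
```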