Can linalg represent a loop over a reduced iteration domain?

Hi,

I’m trying to add an op named linalg.padding to the LinalgNamedStructuredOps.yaml file, but I’m running into errors I don’t understand.

First of all, I expect linalg.padding to behave like padding() in the following C code:

#include <stdio.h>

#define IN 1
#define IH 256
#define IW 384
#define IC 128

#define ON 1
#define OH 260
#define OW 388
#define OC 128

#define LN 0
#define LH 2
#define LW 2
#define LC 0

#define HN 0
#define HH 2
#define HW 2
#define HC 0

float input_data[IN][IH][IW][IC];
float output_data[ON][OH][OW][OC];
float padding_value = 2.0;

void initialize() {
    for (int i = 0; i < IN; i++) {
        for (int j = 0; j < IH; j++) {
            for (int k = 0; k < IW; k++) {
                for (int t = 0; t < IC; t++) {
                    input_data[i][j][k][t] = 1.0;
                }
            }
        }
    }

    for (int i = 0; i < ON; i++) {
        for (int j = 0; j < OH; j++) {
            for (int k = 0; k < OW; k++) {
                for (int t = 0; t < OC; t++) {
                    output_data[i][j][k][t] = 0.0;
                }
            }
        }
    }
}

void padding() {
    #pragma scop
    for (int i = 0; i < ON; i++) {
        for (int j = 0; j < OH; j++) {
            for (int k = 0; k < OW; k++) {
                for (int t = 0; t < OC; t++) {
                    if (i >= LN && i < IN + LN && j >= LH && j < IH + LH &&
                        k >= LW && k < IW + LW && t >= LC && t < IC + LC) {
                        output_data[i][j][k][t] = input_data[i - LN][j - LH][k - LW][t - LC];
                    } else {
                        output_data[i][j][k][t] = padding_value;
                    }
                }
            }
        }
    }
    #pragma endscop
}

void print_output() {
    for (int i = 0; i < ON; i++) {
        for (int j = 0; j < OH; j++) {
            for (int k = 0; k < OW; k++) {
                printf("[i=%d,j=%d,k=%d]: ", i, j, k);
                for (int t = 0; t < OC; t++) {
                    printf("%f ", output_data[i][j][k][t]);
                }
                printf("\n");
            }
        }
    }
}

int main() {
    initialize();
    padding();
    print_output();
    return 0;
}

Here’s what I’ve added:

--- !LinalgOpConfig
metadata: !LinalgOpMetadata
  name: padding
  cpp_class_name: PaddingOp
  doc: |-
    Pads the input tensor with a given value.
  implements:
  - LinalgPaddingOpInterface
  defines:
  - hasCanonicalizer
structured_op: !LinalgStructuredOpConfig
  args:
  - !LinalgOperandDefConfig
    name: I
    kind: input_tensor
    type_var: T1
    shape_map: affine_map<()[s0, s1, s2, s3, s4, s5, s6, s7] -> ()>
  - !LinalgOperandDefConfig
    name: value
    kind: scalar
    type_var: T2
  - !LinalgOperandDefConfig
    name: O
    kind: output_tensor
    type_var: U
    shape_map: affine_map<()[s0, s1, s2, s3, s4, s5, s6, s7] -> ()>
  - !LinalgOperandDefConfig
    name: low
    kind: index_attr
    index_attr_map: affine_map<()[s0, s1, s2, s3, s4, s5, s6, s7] -> (s0, s2, s4, s6)>
    default_indices:
    - 0
    - 1
    - 1
    - 0
  - !LinalgOperandDefConfig
    name: high
    kind: index_attr
    index_attr_map: affine_map<()[s0, s1, s2, s3, s4, s5, s6, s7] -> (s1, s3, s5, s7)>
    default_indices:
    - 0
    - 1
    - 1
    - 0
  indexing_maps: !LinalgIndexingMapsConfig
    static_indexing_maps:
    - affine_map<(d0, d1, d2, d3)[s0, s1, s2, s3, s4, s5, s6, s7] -> (d0 - s0, d1 - s2, d2 - s4, d3 - s6)>
    - affine_map<(d0, d1, d2, d3)[s0, s1, s2, s3, s4, s5, s6, s7] -> ()>
    - affine_map<(d0, d1, d2, d3)[s0, s1, s2, s3, s4, s5, s6, s7] -> (d0, d1, d2, d3)>
  iterator_types:
    - parallel
    - parallel
    - parallel
    - parallel
  assignments:
  - !ScalarAssign
    arg: O
    value: !ScalarExpression
      scalar_fn:
          kind: type
          fn_name: cast_signed
          type_var: U
          operands:
          - !ScalarExpression
            scalar_arg: I

The code compiles without any problems, but reports the following error when executed:

linalg_padding_test.mlir:34:17: error: 'linalg.padding' op unexpected result less than 0 at expression #1 in (d0, d1, d2, d3) -> (d0, d1 - 2, d2 - 2, d3)
    %padd_res = linalg.padding {low = dense<[0, 2, 2, 0]> : tensor<4xi64>, high = dense<[0, 2, 2, 0]> : tensor<4xi64>}
                ^
linalg_padding_test.mlir:34:17: note: see current operation: 
%2 = "linalg.padding"(%arg0, %1, %0) <{high = dense<[0, 2, 2, 0]> : tensor<4xi64>, low = dense<[0, 2, 2, 0]> : tensor<4xi64>, operandSegmentSizes = array<i32: 2, 1>}> ({
^bb0(%arg2: f32, %arg3: f32, %arg4: f32):
  "linalg.yield"(%arg2) : (f32) -> ()
}) {linalg.memoized_indexing_maps = [affine_map<(d0, d1, d2, d3) -> (d0, d1 - 2, d2 - 2, d3)>, affine_map<(d0, d1, d2, d3) -> ()>, affine_map<(d0, d1, d2, d3) -> (d0, d1, d2, d3)>]} : (tensor<1x256x384x128xf32>, f32, tensor<1x260x388x128xf32>) -> tensor<1x260x388x128xf32>

The ops already in this YAML file all seem to have a full iteration space, so is it possible that linalg doesn’t support a reduced iteration domain? If not, how can I implement what I need (e.g., how do I represent the if condition inside the loop nest)?

Is there a solution?

Thanks,

sheen

Can you use tensor.pad or tensor.pack instead?

I tried to use tensor.pad before, but I need to tile the padding and fuse it with its upstream and downstream operators, and transform.structured.fuse_into_containing_op cannot support tensor.pad. I may also need to do some special work on the padding operation, so I will try to implement this linalg.padding.

cannot or does not? What would a linalg.padding add that tensor.pad can’t?

I’m just making sure we have clear reasons why we need such a new operation in linalg that cannot be done by using / extending existing ops / transforms.

No, in the sense you are trying to express. The iteration space is inferred from the indexing maps and the sizes of the input tensors in such a way that each iteration accesses an element of the inputs without reading out of bounds. In your case, you’d be reading out of bounds exactly where padding is necessary.
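
To make this concrete, here is a minimal hand-written loop sketch (the function and value names are mine, not compiler output) of what your indexing maps would compute; at %j = 0 the input index %j - 2 is -2, an out-of-bounds read into the 256-sized dimension:

func.func @padding_as_loops(%input: memref<1x256x384x128xf32>,
                            %output: memref<1x260x388x128xf32>) {
  %c0 = arith.constant 0 : index
  %c1 = arith.constant 1 : index
  %c2 = arith.constant 2 : index
  %c260 = arith.constant 260 : index
  %c388 = arith.constant 388 : index
  %c128 = arith.constant 128 : index
  // The iteration space is the output domain, 1x260x388x128.
  scf.for %j = %c0 to %c260 step %c1 {
    scf.for %k = %c0 to %c388 step %c1 {
      scf.for %l = %c0 to %c128 step %c1 {
        %jj = arith.subi %j, %c2 : index  // -2 when %j == 0
        %kk = arith.subi %k, %c2 : index
        %v = memref.load %input[%c0, %jj, %kk, %l] : memref<1x256x384x128xf32>
        memref.store %v, %output[%c0, %j, %k, %l] : memref<1x260x388x128xf32>
      }
    }
  }
  return
}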

Before trying to add such a named op:

  1. Make sure you can produce a linalg.generic form of your computation; named ops are merely aliases for that generic (see the sketch after this list).
  2. Take a look at llvm-project/mlir/python/mlir/dialects/linalg/opdsl/ops/core_named_ops.py on GitHub (e.g., at commit 76a3be7c766bd55221c3d0d0a74c42f82c5d76ed), which is the source of truth for op definitions. The YAML only exists because we didn’t want to run Python during compilation; nobody should be writing it manually.
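
For reference, here is what such a linalg.generic might look like written by hand for the padding above (a sketch; %input, %pad, and %init are placeholder values, and the verifier rejects this for the same out-of-bounds reason as the named op, which makes it a quick way to sanity-check a definition before writing any YAML):

#in  = affine_map<(d0, d1, d2, d3) -> (d0, d1 - 2, d2 - 2, d3)>
#val = affine_map<(d0, d1, d2, d3) -> ()>
#out = affine_map<(d0, d1, d2, d3) -> (d0, d1, d2, d3)>

%res = linalg.generic
    {indexing_maps = [#in, #val, #out],
     iterator_types = ["parallel", "parallel", "parallel", "parallel"]}
    ins(%input, %pad : tensor<1x256x384x128xf32>, f32)
    outs(%init : tensor<1x260x388x128xf32>) {
  ^bb0(%in: f32, %p: f32, %out: f32):
    linalg.yield %in : f32
} -> tensor<1x260x388x128xf32>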

@nicolasvasilache and I did some work on padding in the GPU context at some point. IIRC, the issue is that tensor.pad is not in destination-passing style, which makes it annoying to handle. It can be rewritten as a linalg.fill of the padding value at the desired size plus a tensor.insert_slice that inserts the original tensor at the right offset. This construct should be fusable because linalg.fill is a regular op and we know how to handle slices.
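
A minimal sketch of that rewrite for the shapes in this thread (%pad is the padding value; the names are mine):

// Fill the whole destination with the padding value ...
%empty = tensor.empty() : tensor<1x260x388x128xf32>
%filled = linalg.fill ins(%pad : f32)
    outs(%empty : tensor<1x260x388x128xf32>) -> tensor<1x260x388x128xf32>
// ... then insert the original tensor at offset [0, 2, 2, 0].
%padded = tensor.insert_slice %input into
    %filled[0, 2, 2, 0] [1, 256, 384, 128] [1, 1, 1, 1]
    : tensor<1x256x384x128xf32> into tensor<1x260x388x128xf32>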

Thank you for your reply, I will try it.

In general, padding cannot be represented as a linalg op. But as Alex pointed out, the issue is that tensor.pad is not in destination-passing style. I think it would be useful to take a look at tensor.pad and rework it (maybe as a new, simpler op) into destination-passing style.
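
For comparison, the existing tensor.pad form looks like this (a sketch; note the result is a fresh tensor not tied to any operand, which is what makes the op non-destination-passing style):

%padded = tensor.pad %input low[0, 2, 2, 0] high[0, 2, 2, 0] {
  ^bb0(%i: index, %j: index, %k: index, %l: index):
    tensor.yield %pad : f32
} : tensor<1x256x384x128xf32> to tensor<1x260x388x128xf32>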

Thanks for the suggestion, I’ll try it.

Recently I read the relevant code and design documents of the linalg dialect, and I see that linalg cannot express padding. But I still have a question: how should linalg represent multiple nested loops, in other words, a combination of multiple linalg.generic ops? Judging from the current code, only Softmax in linalg is in this situation, and there seems to be no formal solution yet, or is it still under discussion? My practical need is to treat such a combination as a single operator and to tile and fuse it as a whole along the producer-consumer chain. For the specific requirements, see: How to extend linalg ops without modifying MLIR code?
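
For what it’s worth, the usual spelling of such a combination today is a chain of linalg ops connected through SSA values, which tiling and fusion then treat as producer and consumer. A sketch with my own shapes and names:

#id = affine_map<(d0, d1) -> (d0, d1)>

// Producer: elementwise exp.
%exp = linalg.generic
    {indexing_maps = [#id, #id],
     iterator_types = ["parallel", "parallel"]}
    ins(%x : tensor<128x128xf32>) outs(%init0 : tensor<128x128xf32>) {
  ^bb0(%in: f32, %out: f32):
    %e = math.exp %in : f32
    linalg.yield %e : f32
} -> tensor<128x128xf32>

// Consumer: elementwise doubling; fusing this pair is what the tiling
// and fusion transforms operate on.
%scaled = linalg.generic
    {indexing_maps = [#id, #id],
     iterator_types = ["parallel", "parallel"]}
    ins(%exp : tensor<128x128xf32>) outs(%init1 : tensor<128x128xf32>) {
  ^bb0(%in: f32, %out: f32):
    %t = arith.addf %in, %in : f32
    linalg.yield %t : f32
} -> tensor<128x128xf32>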