The tosa.tile op only supports 1D-4D tensors:

```tablegen
def Tosa_TileOp: Tosa_Op<"tile", [
      DeclareOpInterfaceMethods<InferShapedTypeOpInterface,
                              ["inferReturnTypeComponents"]>,
      NoSideEffect]> {
  let summary = "Tile operator";

  let description = [{
    Replicates input 0 multiplies times along each dimension.
  }];

  let arguments = (ins
    Tosa_Tensor1Dto4D:$input1,
    I64ArrayAttr:$multiples);

  let results = (outs
    Tosa_Tensor1Dto4D:$output
  );

  let hasFolder = 1;
}
```
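For reference, the op's semantics can be illustrated with a small sketch (the shapes and value names here are invented for illustration, not taken from the original post):

```mlir
// Sketch: tiling a 2x3 tensor 3x along dim 0 and 2x along dim 1.
// Each output dim is the input dim times the matching entry of $multiples.
%out = "tosa.tile"(%in) {multiples = [3, 2]}
       : (tensor<2x3xf32>) -> tensor<6x6xf32>
```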

The tosa.tile op supports 1D to 4D tensors, but when I convert a YOLOv3 model from the TensorFlow dialect to the TOSA dialect, the tf.Tile op needs a 5D tensor, causing the conversion to fail.
So why is the tosa.tile op defined this way?

The definition in the dialect reflects the TOSA spec.

Thanks for your feedback! We’ve lowered YoloV3 but did not encounter a 5D tensor tiling operation on this network (or any other).

It is unusual to encounter a 5D tensor, and our own YoloV2/V3 forms do not have it. How is this generated, since it sounds like another op emits a 5D tensor that could not then be tiled? tosa.tile doesn't add dimensions (TOSA 0.23.0 specification).

Thanks @jpienaar for the pointer - a new and easily readable option for the TOSA spec is the TOSA 0.23.0 specification.

Thank you for your reply.
I used the YOLOv3 model from yolov3, and I found the 5D tile op in the TensorFlow dialect IR:
```mlir
%578 = "tf.Reshape"(%577, %cst_0) {device = ""} : (tensor<16x52x52x255xf32>, tensor<5xi32>) -> tensor<16x52x52x3x85xf32>
%579 = "tf.StridedSlice"(%578, %cst_8, %cst_11, %cst_7) {begin_mask = 15 : i64, device = "", ellipsis_mask = 0 : i64, end_mask = 15 : i64, new_axis_mask = 0 : i64, shrink_axis_mask = 0 : i64} : (tensor<16x52x52x3x85xf32>, tensor<5xi32>, tensor<5xi32>, tensor<5xi32>) -> tensor<16x52x52x3x2xf32>
%580 = "tf.Sigmoid"(%579) {device = ""} : (tensor<16x52x52x3x2xf32>) -> tensor<16x52x52x3x2xf32>
%581 = "tf.StridedSlice"(%578, %cst_11, %cst_10, %cst_7) {begin_mask = 15 : i64, device = "", ellipsis_mask = 0 : i64, end_mask = 15 : i64, new_axis_mask = 0 : i64, shrink_axis_mask = 0 : i64} : (tensor<16x52x52x3x85xf32>, tensor<5xi32>, tensor<5xi32>, tensor<5xi32>) -> tensor<16x52x52x3x2xf32>
%582 = "tf.Exp"(%581) {device = ""} : (tensor<16x52x52x3x2xf32>) -> tensor<16x52x52x3x2xf32>
%583 = "tf.Mul"(%582, %cst_13) {device = ""} : (tensor<16x52x52x3x2xf32>, tensor<3x2xf32>) -> tensor<16x52x52x3x2xf32>
%584 = "tf.Mul"(%583, %cst_12) {device = ""} : (tensor<16x52x52x3x2xf32>, tensor) -> tensor<16x52x52x3x2xf32>
%585 = "tf.StridedSlice"(%578, %cst_10, %cst_9, %cst_7) {begin_mask = 15 : i64, device = "", ellipsis_mask = 0 : i64, end_mask = 15 : i64, new_axis_mask = 0 : i64, shrink_axis_mask = 0 : i64} : (tensor<16x52x52x3x85xf32>, tensor<5xi32>, tensor<5xi32>, tensor<5xi32>) -> tensor<16x52x52x3x1xf32>
%586 = "tf.Sigmoid"(%585) {device = ""} : (tensor<16x52x52x3x1xf32>) -> tensor<16x52x52x3x1xf32>
%587 = "tf.StridedSlice"(%578, %cst_9, %cst_8, %cst_7) {begin_mask = 15 : i64, device = "", ellipsis_mask = 0 : i64, end_mask = 31 : i64, new_axis_mask = 0 : i64, shrink_axis_mask = 0 : i64} : (tensor<16x52x52x3x85xf32>, tensor<5xi32>, tensor<5xi32>, tensor<5xi32>) -> tensor<16x52x52x3x80xf32>
%588 = "tf.Sigmoid"(%587) {device = ""} : (tensor<16x52x52x3x80xf32>) -> tensor<16x52x52x3x80xf32>
%589 = "tf.Tile"(%cst, %cst_1) {device = ""} : (tensor<1x52x52x1x2xi32>, tensor<5xi32>) -> tensor<16x52x52x3x2xi32>
%590 = "tf.Cast"(%589) {Truncate = false, device = ""} : (tensor<16x52x52x3x2xi32>) -> tensor<16x52x52x3x2xf32>
%591 = "tf.AddV2"(%580, %590) : (tensor<16x52x52x3x2xf32>, tensor<16x52x52x3x2xf32>) -> tensor<16x52x52x3x2xf32>
%592 = "tf.Mul"(%591, %cst_12) {device = ""} : (tensor<16x52x52x3x2xf32>, tensor) -> tensor<16x52x52x3x2xf32>
%593 = "tf.ConcatV2"(%592, %584, %cst_6) {device = ""} : (tensor<16x52x52x3x2xf32>, tensor<16x52x52x3x2xf32>, tensor) -> tensor<16x52x52x3x4xf32>
%594 = "tf.ConcatV2"(%593, %586, %588, %cst_6) {device = ""} : (tensor<16x52x52x3x4xf32>, tensor<16x52x52x3x1xf32>, tensor<16x52x52x3x80xf32>, tensor) -> tensor<16x52x52x3x85xf32>
return %594, %561, %528 : tensor<16x52x52x3x85xf32>, tensor<16x26x26x3x85xf32>, tensor<16x13x13x3x85xf32>
```
Is this model different from yours?

Yes, ours are internal conditioned models. They have no 5D tensors. They do have tile, reshape, and other ops, just not constructed in the manner above. Is another source of the model an option? Our TOSA legalization regressions cover a few dozen real-world networks, but none of them deal with >4D tensors in this manner.

The TOSA limits on tensor rank attempt to judge what a broad set of hardware implementations would be able to support and optimize for. We landed on 4D as the max tensor rank for TILE. It would be a bit inconvenient, but you could insert RESHAPE operators to squash the rank to 4D, do the TILE, and then RESHAPE back out to the original rank. In this case, you might have to do it twice, for the two dimensions you're tiling in.
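For the tf.Tile above (a tensor<1x52x52x1x2xi32> tiled to 16x52x52x3x2, i.e. multiples [16, 1, 1, 3, 1]), that two-step workaround could look roughly like this. This is a hand-written sketch in the TOSA dialect's generic syntax of the time; the value names are invented for illustration:

```mlir
// Step 1: tile dim 0 by 16. The innermost dims (1x2) are collapsed
// into one dim of size 2 so the tensor fits in rank 4.
%r0 = "tosa.reshape"(%in) {new_shape = [1, 52, 52, 2]}
      : (tensor<1x52x52x1x2xi32>) -> tensor<1x52x52x2xi32>
%t0 = "tosa.tile"(%r0) {multiples = [16, 1, 1, 1]}
      : (tensor<1x52x52x2xi32>) -> tensor<16x52x52x2xi32>
// Step 2: tile what was dim 3 by 3. The leading dims (16x52) are
// collapsed into 832, again keeping the rank at 4.
%r1 = "tosa.reshape"(%t0) {new_shape = [832, 52, 1, 2]}
      : (tensor<16x52x52x2xi32>) -> tensor<832x52x1x2xi32>
%t1 = "tosa.tile"(%r1) {multiples = [1, 1, 3, 1]}
      : (tensor<832x52x1x2xi32>) -> tensor<832x52x3x2xi32>
// Restore the original 5D rank.
%out = "tosa.reshape"(%t1) {new_shape = [16, 52, 52, 3, 2]}
       : (tensor<832x52x3x2xi32>) -> tensor<16x52x52x3x2xi32>
```

Both reshapes only collapse or split contiguous dims that are not being tiled in that step, so element order is preserved throughout.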

We’re open to feedback on the TOSA specification. In addition to the spec linked above, there’s a better landing spot here: Developer Resources - ML Platform

It would be great if TOSA could grow to relax this limitation, and have a “broad set of hardware implementations would be able to support and optimize for” profile which only allows 4D tensors – we could develop lowering passes that attempt to convert general programs into that form, while not hindering the use of TOSA as a more general “mid-level tensor IR” higher in the stack.

Thanks for that feedback @_sean_silva . It drove some thinking and conversation about this topic since we want TOSA to sit at the compiler/hardware interface in a manner that benefits both sides.

Obviously for hardware targeting, the current TOSA spec defines rank limitations. However, a purely compiler IR form need not have that limitation in place. The way we were thinking was that perhaps we can expand upon the technique we used to express n-D matmul through 3-D matmul in TorchToTosa. In that construct, we use a sequence of reshape/transposes to pack and unpack invariant dims (separately managing broadcasting and non-broadcasting ones).
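As a rough illustration of that technique (the shapes and value names here are invented for the example, not taken from the TorchToTosa source): an n-D matmul with invariant batch dims can be expressed through TOSA's 3-D matmul by packing the batch dims into one, and unpacking afterwards.

```mlir
// Sketch: (2x3x4x5) @ (2x3x5x6) expressed via rank-3 tosa.matmul.
// The invariant batch dims 2x3 are packed into a single dim of 6.
%a = "tosa.reshape"(%lhs) {new_shape = [6, 4, 5]}
     : (tensor<2x3x4x5xf32>) -> tensor<6x4x5xf32>
%b = "tosa.reshape"(%rhs) {new_shape = [6, 5, 6]}
     : (tensor<2x3x5x6xf32>) -> tensor<6x5x6xf32>
%m = "tosa.matmul"(%a, %b)
     : (tensor<6x4x5xf32>, tensor<6x5x6xf32>) -> tensor<6x4x6xf32>
// Unpack the batch dims back out.
%c = "tosa.reshape"(%m) {new_shape = [2, 3, 4, 6]}
     : (tensor<6x4x6xf32>) -> tensor<2x3x4x6xf32>
```

Broadcast batch dims need separate handling (e.g. tiling one operand first), which is the part the reshape/transpose sequence in TorchToTosa manages.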

Turning to this one, the potential option is to relax the TOSA op definitions in the dialect but have a dim-packing pass that aligns the forms to the hardware constraints defined by the spec. Thus there’d be the ‘relaxed form’ and the ‘HW constrained form’.

In this case, it would permit n-D tile, but that would subsequently be packed to 4D. The real issue here is that it takes detailed analysis to work out that the generalized form can always be converted to the HW constrained form. If that can be effectively resolved, then yes, the dialect itself could express the relaxed form, but it requires a verifiable means to convert to the spec-defined limitations through a pass.

Why is it required to lower to the HW constrained form? Some people might be going through TOSA just to lower to Linalg, where such limitations don’t exist. Only people who absolutely need to lower to the HW constrained form should pay the cost of limited program expressibility.

I think the question is whether TOSA wants to position itself as a solid mid-level tensor compute IR that also has a well-defined hardware-constrained profile (for those who need that), or whether the hw constrained profile is really the “source of truth” and the rest of the system must somehow conform (indirectly) to those limitations. As we add further necessary features to TOSA, like dynamic shapes and all the associated modeling, I think it will be necessary to view the original statically-shaped, rank-constrained TOSA as a lowering target rather than the source of truth.

There is a big opportunity here to become a very central, de facto standard mid-level tensor compute IR, rather than a leaf lowering target for hardware interfacing. At least we in Torch-MLIR (and I can speak somewhat indirectly for the JAX/MHLO ecosystem) really need TOSA to be the former, rather than the latter – we need stability, versioning, well-speccedness, etc. first and foremost. We are actually using it for that today (and crossing our fingers that it evolves to fully meet our needs), but hitting clear walls in terms of today’s TOSA’s ability to fill that role. We will not be able to truly subsume our existing straight-to-linalg path with TOSA without this.

TOSA is more than a mid-level tensor compute IR. One of the goals is that a given TOSA model should be able to effectively target heterogeneous compute - CPU/GPU codegen as well as custom/semi-custom accelerators implemented to TOSA spec that implement the rank limitations defined in the spec. There are potentially multiple such hardware efforts in flight.

In compiler form, the suggested relaxed form does not need to be converted to the HW constrained form for all cases. E.g. TosaToLinAlg for CPU/GPU can continue without having to codegen the constrained form if the underlying hardware doesn’t require it.

However, the guarantee of expressibility in terms of the spec constraints must remain, because otherwise there may be TOSA legalized forms that cannot be re-expressed in spec constrained form and thus cannot run on custom hardware that was designed to be aligned to a TOSA version.

> I think the question is whether TOSA wants to position itself as a solid mid-level tensor compute IR that also has a well-defined hardware-constrained profile (for those who need that), or whether the hw constrained profile is really the “source of truth” and the rest of the system must somehow conform (indirectly) to those limitations.

Without spending a lot of time navel-gazing this, intuitively the first description looks like the right expression and the second one is too rigid. The HW constrained picture is specific to the requirement of effectively intersecting with long-gestation hardware efforts; it shouldn’t gate software-side expression. If it does, that’s something we want to address, and I think this conversation is part of that.

The real concern is that the conversion between the forms must be semantically expressible, and ideally it should carry across most/all of the compiler support capabilities like dynamic shapes.

> As we add further necessary features to TOSA, like dynamic shapes and all the associated modeling, I think it will be necessary to view the original statically-shaped, rank-constrained TOSA as a lowering target rather than the source of truth.

This domain is where we want to understand more about how well compiler-side expressions like dynamic shapes work with the notional relaxed form vs. the HW constrained form. My view is that these are much easier with the former than the latter. However, we’d like to understand how this impacts code generation and runtime design specifically for, let’s say, a custom accelerator. Would the dynamic shape support be degraded, or otherwise a subset of the general capabilities? Can the differential capabilities be defined early / ahead of time?