[RFC] Linalg type casting

While looking to implement something like a linalg.element_type_cast, I realised there is one already: linalg.copy.

I want the cast op to be an “annotation” on the operand, not something that necessarily has materialization semantics. Copy strongly suggests I want a new buffer at some point.

Question:

  1. Do we clarify the semantics of linalg.copy saying materialization is not guaranteed and that it can be used as an element cast operation, or;
  2. Do we create a very similar op that has the same generalization even, but has a different and clear semantics, or;
  3. Do we rename linalg.copy to something more suitable and explain that it can be used as a copy if the types are the same? (linalg.identity isn’t quite right.)

I’d prefer (3), but I don’t have a good name.

Neither the tensor nor the memref dialect has a good alternative (for plain element type casting), and the quant dialect is too confusing and vague; I’m not even sure what it does.

@nicolasvasilache @ftynse @stellaraccident @mehdi_amini

We’re going to have to clarify semantics no matter what, and there aren’t good names. The compiler is always free to elide materialization, so I think we should only have one op. I’m fine with leaving it called copy with more description (and have used it this way without too much cognitive dissonance).

If you only want element type casting, an optimisation can be added that may or may not kick in.

If you want real casting with shape changes and guaranteed no-copy then you want a cast op (there are already a few, I am unclear whether your use case is already supported or not).
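For example, the casts that already exist only change type metadata and never copy (%t and %m are placeholder values here):

// Shape-information casts: metadata only, no data movement, no element type change.
%0 = tensor.cast %t : tensor<32x64xf32> to tensor<?x?xf32>
%1 = memref.cast %m : memref<32x64xf32> to memref<?x?xf32>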

Casts of all kinds (shape only, physical layout, element type) are already supported. I just want the semantics to be clear, to avoid unwanted materialization.

The semantics we agreed on at the round table were that compute named ops don’t have implicit casting and that cast ops are used on their operands instead. This allows you to match a def-use chain/DAG and lower it with special semantics, instead of relying on affine maps/iterator types/op chains inside multiple generics.

However, we need to avoid early materialization / modification that “may or may not kick in” before lowering (as you said).

Probably the key place to add this semantics is bufferization?

A few patterns come to mind:

  1. Avoid memref.alloc for linalg.copy, but then, if we don’t fuse the casts, they can’t materialize out of thin air, so the cast must lower with an alloc anyway (the horrible case).
  2. Don’t change the bufferization pass, but run a clean-up of any stray allocations after lowering. LLVM may do that for CPUs, but not all code lowers to a CPU.
  3. Create a special type of allocation (ex. memref.may_alloc) that only materializes if it has users, otherwise it’s a NOP.

None of the ideas above is great, but I haven’t really thought much about them yet.
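A rough sketch of what (3) could look like; memref.may_alloc is purely hypothetical:

  // Hypothetical op: behaves like memref.alloc only if the buffer ends up with users.
  %buf = memref.may_alloc() : memref<32x64xf16>
  linalg.copy ins(%arg0 : memref<32x64xf32>) outs(%buf : memref<32x64xf16>)
  // If a later rewrite folds the element cast into a fused op, %buf becomes dead
  // and the may_alloc folds away instead of lowering to a real allocation.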

Hoping @matthias-springer has a better idea. :smiley:

Note that there’s a difference between “copy” and “allocation”. linalg.copy turns into a memcpy. It may also turn into an allocation if necessary (but we try not to allocate).

E.g., an alloc would be needed in cases such as:

func.func @foo(%arg0: tensor<5xf32>, %arg1: tensor<5xf32>, %idx: index) -> tensor<5xf32> {
  // Cannot copy into buffer(%arg1) because the original value of %arg1 is still needed later. A new buffer will be allocated for %0.
  %0 = linalg.copy ins(%arg0 : tensor<5xf32>) outs(%arg1 : tensor<5xf32>) -> tensor<5xf32>
  %1 = tensor.extract %arg1[%idx] : tensor<5xf32>
  vector.print %1 : f32
  return %0 : tensor<5xf32>
}

If you want a guaranteed allocation, you can use bufferization.alloc_tensor copy(%arg0). (However, that op does not implement the LinalgOp interface and the copy may not be lowered as efficiently.)
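For example:

%alloc = bufferization.alloc_tensor() copy(%arg0) : tensor<5xf32>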

There is also bufferization.materialize_in_destination %t in %a, which guarantees that the contents of the tensor %t end up in %a after bufferization. %a can be a tensor or a memref. In case %a is a tensor, the semantics are that the contents of the tensor %t should materialize in buffer(%a). This is similar to linalg.copy but with stronger guarantees. In the example above, a new buffer is allocated for %0, so the contents do not actually end up in buffer(%arg1). When replacing the linalg.copy with bufferization.materialize_in_destination, the bufferization pass would fail (because the user specified constraints that are impossible to satisfy).
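In the example above, that would be something like the following, and it would make bufferization fail rather than silently allocate:

// Requires the contents of %arg0 to end up in buffer(%arg1). In @foo above this
// constraint is unsatisfiable because %arg1 is still read later.
%0 = bufferization.materialize_in_destination %arg0 in %arg1
    : (tensor<5xf32>, tensor<5xf32>) -> tensor<5xf32>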

Side note: There is no such thing as “copying a tensor”, so the name linalg.copy is confusing. “Copying” makes sense for memrefs but not for tensors. Ideally, we should not have such terms in tensor-based dialects (apart from the bufferization dialect, which operates at the boundary of memref and tensor). Ideally, I’d like to remove linalg.copy (or maybe turn it into a “cast” op) and use the above mentioned bufferization ops instead. But these cannot implement the LinalgOp interface because of layering issues.

Also, there is currently no such thing as “guaranteed no-alloc” (same for no-copy) apart from bufferization.materialize_in_destination. E.g., replace the linalg.copy with a hypothetical linalg.cast in the above example; it will alloc a new buffer for the cast result. (But we could make any op behave like bufferization.materialize_in_destination, such that bufferization fails if it would be forced to bufferize out-of-place.)

Right, this ties into the tensor vs. memref discussion, which is why I mentioned bufferization; your example shows exactly that interplay.

I think of copy here more as an adjective than a verb. That’s fine with value semantics.

  %copy = linalg.copy ins(%a) outs(%b)
  • Yes: %copy “is a copy of” %a
  • No: copy %a “into” %b

That’s my option (3) above. I’m happy with either keeping (1) or renaming (3), but not having multiple of them (2).

The bufferization interplay isn’t just about bufferization itself, but about what happens when we “fuse” the cast into a lower-level op.

Silly example (non practical):

  // Down cast %arg0 to f16
  %0 = tensor.empty() : tensor<32x64xf16>
  %arg0_cast = linalg.cast ins(%arg0) outs(%0) : tensor<32x64xf32> to tensor<32x64xf16>
  // arg1 is a constant
  %arg1 = arith.constant dense<1.23456e1> : tensor<32x64xf16>
  // Perform the add
  %1 = tensor.empty() : tensor<32x64xf16>
  %add = linalg.add ins(%arg1, %arg0_cast : tensor<32x64xf16>, tensor<32x64xf16>) outs(%1 : tensor<32x64xf16>) -> tensor<32x64xf16>
  // But accumulate on f32
  %2 = tensor.empty() : tensor<32x64xf32>
  %add_cast = linalg.cast ins(%add) outs(%2) : tensor<32x64xf16> to tensor<32x64xf32>
  return %add_cast

Problem #1: There may be graph patterns that need to be lowered in a particular convoluted way (as above) that may be destroyed by intermediate passes, like bufferization.

Solution #1: We can always revert back to generics with precise semantics (ex. accumulation type), but then we’re also back to matching too many things. I’d like to have a design that is robust to both cases without reverting to generics.

Note that the comment says: “accumulate on f32” but the implementation accumulates on f16, then casts to f32. Breaking the pattern may mean I can’t match to a fast op down the road.

That code could bufferize to:

  // Down cast %arg0 to f16
  %0 = memref.alloc() : memref<32x64xf16>
  linalg.cast ins(%arg0) outs(%0) : memref<32x64xf32> to memref<32x64xf16>
  // arg1 is a constant (in real IR this would bufferize to a memref.get_global)
  %arg1 = arith.constant 1.23456e1 : memref<32x64xf16>
  // Perform the add
  %1 = memref.alloc() : memref<32x64xf16>
  linalg.add ins(%arg1, %0 : memref<32x64xf16>, memref<32x64xf16>) outs(%1 : memref<32x64xf16>)
  // But accumulate on f32
  %2 = memref.alloc() : memref<32x64xf32>
  linalg.cast ins(%1) outs(%2) : memref<32x64xf16> to memref<32x64xf32>
  return %2

Problem #2: Bufferization can try to be smart about this and reuse some of those buffers, but then it would create variations in the number of allocations within a pattern.

Solution #2: What we do today is to completely ignore the allocations in the middle: if we match, we clean up the remaining ones; if not, we just lower pessimistically.

What I really wanted is:

  // Some lowering with wider accumulation type
  %arg1 = arith.constant 1.23456e1 : memref<32x64xf16>
  %out = memref.alloc() : memref<32x64xf32>
  foobar.add_from_f16_and_f32_into_f32 ins(%arg1, %arg0 : memref<32x64xf16>, memref<32x64xf32>) outs(%out : memref<32x64xf32>)
  return %out

Note how all the intermediate allocs disappear “as if they weren’t necessary after all”.

The four main ways of doing this are:

  1. Straight from graph to hardware and not having a compiler.
  2. Generic implementation and lack of complex structures (ie perfectly nested).
  3. Named op for everything and combinatorial explosion.
  4. Named op chain complexity and interplay with other passes.

We have enough anti-patterns from 1-3, but we should be careful with 4 and avoid the obvious pitfalls.

Our current approach is to mix them all, but that also comes with a bit of all their complexities.

I see the problem. There is indeed no guarantee that a tensor.empty() materializes as a memref.alloc() and the bufferization may find a way to avoid the allocation in some cases.

Two ideas that come to mind:

(1) Use memref.alloc() instead of tensor.empty(). memref.alloc has a side effect and is guaranteed to survive bufferization. E.g.:

// Down cast %arg0 to f16
%0 = memref.alloc() : memref<32x64xf16>
%0_tensor = bufferization.to_tensor %0 restrict writable : memref<32x64xf16>
%arg0_cast = linalg.cast ins(%arg0) outs(%0_tensor) : tensor<32x64xf32> to tensor<32x64xf16>

This will always be bufferized to the IR from your second listing.

(2) Design a non-DPS cast op. If the op has no destination, then the bufferization cannot be “smart” about optimizing it. E.g.:

// No outs!
%arg0_casted = foobar.cast %arg0 : (tensor<32x64xf32>) -> (tensor<32x64xf16>)

Probably linalg is the wrong dialect for this op; that’s why I called it foobar.cast. Maybe it would fit in the tensor dialect; we already have tensor.cast and tensor.bitcast there. (Or you could have it in your own foobar dialect.)
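For comparison, the existing non-DPS casts in the tensor dialect don’t change the element bit width, which is the gap the new (still hypothetical) op would fill:

// Shape information only:
%shape_cast = tensor.cast %t : tensor<32x64xf16> to tensor<?x?xf16>
// Same-bit-width reinterpretation only:
%bit_cast = tensor.bitcast %u : tensor<32x64xui16> to tensor<32x64xi16>
// Width-changing element cast: the hypothetical op from above.
%elt_cast = foobar.cast %t : (tensor<32x64xf16>) -> (tensor<32x64xf32>)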

The above IR could be bufferized to:

%arg0_casted = foobar.cast %arg0 : (memref<32x64xf32>) -> (memref<32x64xf16>)

This memref-based foobar.cast can then be matched by your pattern that lowers to foobar.add_from_f16_and_f32_into_f32.

We’re trying to avoid allocations, not force them. :smiley:

This is how we designed the TPP dialect initially: value (SSA) semantics on tensors, DPS on memrefs. But then you need two syntaxes for each op.

Let’s say we introduce a tensor.typecast, which can easily carry the semantics without being materialized. If I don’t match it at tensor level (to a low-level HW intrinsic or micro-kernel call, for which the tensor level is too high here), I have to bufferize it to something.

We could create an equivalent memref.typecast, which would have to be DPS, since we are now dealing with buffers. And since a type cast can yield a different allocation size (bigger/smaller types), the destination needs to be a separate buffer, which needs to be allocated. This is exactly what linalg.copy does.
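I.e., something like this, where memref.typecast is still hypothetical and only illustrates the DPS shape:

  // The destination must be a separately allocated buffer because f16 -> f32
  // doubles the storage size; this is exactly the shape of linalg.copy.
  %dst = memref.alloc() : memref<32x64xf32>
  memref.typecast ins(%src : memref<32x64xf16>) outs(%dst : memref<32x64xf32>)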

I’m ok with having to work my way around buffers at memref level. We do that already on the XSMM dialect and it’s not too hard. I think linalg.copy will work fine for the time being, even if bufferization leaves some crumbs around.

At the very least it will allow us to prototype the patterns and see more clearly the problems we need to fix with a potential new operation.

What I meant is if you force an allocation, a rewrite pattern can then safely match for linalg.cast ins(...) outs(%alloc)+linalg.add combinations. (That’s what you did in your example going from the second listing to the third listing.) Now that I look at it again, it probably doesn’t even matter whether %alloc is an allocation or not: a cleanup pattern that removes unused allocations may be sufficient. I guess that’s what you hinted at with the memref.may_alloc op.

Is there any problem with this rewrite pattern-based approach? I.e., matching linalg.cast(linalg.add(linalg.cast, linalg.cast))? Do you foresee situations (maybe caused by the bufferization etc.) where such a pattern may fail to match or leave behind unnecessary IR that is hard to remove?

Nit maybe, but why can’t bufferization be smart?
The “destination” (it is more of an “init” semantically!!) shouldn’t be anything more than a possible hint for bufferization.

I find this confusing: on memref I would think that a “cast” would not be in DPS form?

Maybe I misunderstood, but it sounded like the above mentioned transformation/pattern that creates the “foobar.add_and_cast” op is looking for very specific IR. And the more predictable the output of the bufferization, the easier the matching will be. I.e., in this case, not being smart for the sake of “I know exactly what kind of IR will be produced”.

In memref you need to tell what the buffer is, right? Does any memref op create buffers (other than alloc/alloca)?

  // Explicit allocation
  %alloc = memref.alloc() : memref<123x456xf16>

  // Use the buffer
  %a = linalg.some_op ins(%arg1) outs (%alloc) : memref<123x456xf16>

  // This is twice the size, does typecast create a new buffer?
  %b = memref.typecast %a : memref<123x456xf32>

Ok I misunderstood: I thought you wanted to avoid copying (and so only casting at the same size)

That already exists, the bitcast operation.

I want to avoid creating temporary buffers for an operation that may only use the accumulator type in registers, for example.

Perhaps it’d make sense to do a rewrite after bufferization that eliminates temporary buffers for things that can be done in registers?
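Roughly, as a sketch of the idea (%acc is a placeholder for the intermediate result):

// Before: a temporary buffer exists only to hold the widened accumulator.
%tmp = memref.alloc() : memref<32x64xf32>
linalg.copy ins(%acc : memref<32x64xf16>) outs(%tmp : memref<32x64xf32>)
// After the rewrite: the consumer widens in registers, %tmp has no users left,
// and the memref.alloc can be erased.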

Yeah, we thought of that, and half-implemented it at some point. But the advice was that this should be unnecessary if the bufferization does the right thing.

The truth is, we abandoned it because @matthias-springer really made bufferization better with the current semantics. But as we start to add more ops to op-patterns (not just generics), the current tricks won’t be enough.

Perhaps this rewrite could be specific to linalg? If we document the semantics and implement it by the book, then such a rewrite would be a valid transformation on a dialect that has a clear semantics.

Now, getting that semantics right before knowing how it can go wrong is the hard part. :slight_smile:

I’m confused about what’s wrong with linalg.generic here? You’re trying to do a fusion of several elementwise functions, and that’s what generic is for?

Like, the silly example you gave should really be

#map = affine_map<(d0, d1) -> (d0, d1)>
%2 = tensor.empty() : tensor<32x64xf32>
%add = linalg.generic {indexing_maps = [#map, #map, #map],
                       iterator_types = ["parallel", "parallel"]}
    ins(%arg0, %arg1 : tensor<32x64xf32>, tensor<32x64xf16>)
    outs(%2 : tensor<32x64xf32>) {
^bb0(%in0: f32, %in1: f16, %out: f32):
  %in0_cast = arith.truncf %in0 : f32 to f16
  %sum = arith.addf %in0_cast, %in1 : f16
  %sum_cast = arith.extf %sum : f16 to f32
  linalg.yield %sum_cast : f32
} -> tensor<32x64xf32>
return %add

and then the block inside the generic has an obvious canonical form that you can replace with something like %sum_cast = foobar.extending_add %in0, %in1 : f32, f16 to f32, which is a pattern-match you might want to do program-wide anyway?

Or are there a lot of other examples like this running around?

Sorry, there are just too many threads on this subject, I get lost.

TL;DR: generics can do anything a perfectly nested loop nest can, including special casts and op fusion. But two cases fall apart: non-perfectly nested computations, and long chains of generics where the affine maps are not always quite the same. (Some context here and here.)

The silly example is misleading. The more important fact is that the auto-cast in some linalg named ops casts the operands before the computation (pre-cast), so this is broken:

  // This
  linalg.binary_op ins(f32, f32) outs(f16)

  // Becomes
  linalg.generic ins(f32, f32) outs(f16) {
  ^bb0(%a : f32, %b : f32, %out : f16):
    %a_cast = arith.truncf %a : f32 to f16
    %b_cast = arith.truncf %b : f32 to f16
    %res = binary_op %a_cast, %b_cast : f16
    linalg.yield %res : f16
  }

and there is no way to ask it to accumulate in f32 before casting it down, other than changing the generic return type to f32 and downcasting later, at which point it’s the same as a pair of named ops:

  %0 = linalg.binary_op ins(f32, f32) outs(f32)
  %ret = linalg.copy ins(f32) outs(f16)

which is really easy to match against and has the perfect semantics.

On fusion, linalg.generic can only fuse if the ops have compatible affine maps throughout the entire chain, but understanding how they compose under tiling and fusion isn’t trivial (ex. softmax). If all generics have the same affine maps, then it becomes trivial, but that’s just another way of writing a named op with explicit casts.

To be clear, this discussion is about adding strict semantics to some named ops so that we can short-cut the complexity of handling linalg.generic ops for large graphs. It is not about replacing or deprecating generics, which are still very important for linalg transforms. And I still very much want to continue using generics for all the cases where they’re still the best tool around.

I’ll leave the non-perfectly nested issue for another thread, because that’s a complex issue and not really in the scope here.