Is the TCP "matmul" op marked NoSideEffect?

I agree with the Pandora’s box assessment here, but there is also the necessity of supporting this. So I am looking at this asking myself how small we can make the hole we drill into the box.

I was thinking that we would only use 0-strided memrefs when reading from a memref but not for writing, which would limit the impact. Also, if we model it such that we can statically differentiate between the 0-stride and 0-stride-free variants, we would not impact the optimizations for the cases without internal aliasing.

In some pseudo-IR this would give

%0 = broadcast(%strides0) : memref<?x?x?xf32>
%1 = broadcast(%strides1) : memref<?x?x?xf32>
%2 = op(%0, %1) : (memref<?x?x?xf32>, memref<?x?x?xf32>) -> memref<?x?x?xf32>
%3 = other_op(%2) : (memref<?x?x?xf32>) -> memref<?x?x?xf32>

The broadcast, instead of materializing the result, could be lowered into a zero-strided memref, in a similar fashion to what std.subview does. However, we would get a new type out.

%0 = std.subview(...) : aliasing-memref<?x?x?xf32>
%1 = std.subview(...) : aliasing-memref<?x?x?xf32>
%2 = op(%0, %1) : (aliasing-memref<?x?x?xf32>, aliasing-memref<?x?x?xf32>) -> memref<?x?x?xf32>
%3 = other_op(%2) : (memref<?x?x?xf32>) -> memref<?x?x?xf32>

When lowering the op to, say, linalg, the types would reflect the fact that there is inner aliasing, but it would still be confined to that area.

Consider a simple matmul lowering. In such a lowering, it’s likely that the contracting dimension’s extent will be taken from an arbitrary operand (such as the LHS), so if the RHS has the wrong size, we could be reading random out of bounds memory and segfaulting, for example.

I am not sure we can reduce UB to just “any side-effects”; otherwise we couldn’t re-order operations that may have UB. Since UB is an invalid execution, it is hard to reason about it in terms of side-effects, I think.

But you can move the guard_b above op_a and have a nice sequence for a-b-c.

I explicitly assume here that guard_b cannot be moved. It may, for example, depend on the result of op_a or there might be some other dependency.

My main point here is that not being able to move any kind of operation over any kind of guard is very coarse-grained, as it restricts the scheduling options of otherwise side-effect-free operations.

I address this case further down in my post.

So if your guards are strong enough to ensure that the operation never executes outside its well-behaved boundaries (and the guards should be!), then UB and its side-effects are not an issue.

By “any side effect” I mean “any side effect of the compiler’s choosing”. This is why we say (for instance) that if a program has UB then any behavior is valid behavior.

I don’t think this breaks reordering though, as long as we don’t introduce UB to a program that doesn’t have UB.

Just so we’re on the same page, even if we use guards with witness values, matmul still could not be tagged as side-effect free.

For instance, a “malicious” optimizer could translate:

%w = unrelated_witness

to

%w = unrelated_witness
matmul(%a, %b, %w) // segfaulting matmul, %w has nothing to do with %a and %b

if matmul was stated as side effect free.

How do we model this in LLVM? Like dividing by zero with the udiv instruction. Where in LLVM is the code/predicate/property that would allow us to insert an add (with fully defined wrapping) in an arbitrary place but not a udiv? Is it just llvm::isSafeToSpeculativelyExecute and the burden is on pass authors to make sure they remember to call it?

Btw, it seems to me that we currently model llvm.udiv in the LLVM dialect incorrectly. It is currently marked NoSideEffect:

LLVMOps.td

class LLVM_ArithmeticOp<string mnemonic, string builderFunc,
                        list<OpTrait> traits = []> :
    // !!!!! Incorrectly marked NoSideEffect !!!!!
    LLVM_OneResultOp<mnemonic,
           !listconcat([NoSideEffect, SameOperandsAndResultType], traits)>,
    ... {
...
}
...
def LLVM_UDivOp : LLVM_ArithmeticOp<"udiv", "CreateUDiv">;

I believe so.

Only if you assume that UB is a side-effect; I am not convinced by this, though. If we were looking at LLVM, for example, we could also have a “safe to speculate” trait/interface independent of side-effects.

I think we’re subtly talking about two different things. It will take a blog post (or “one pager” :slight_smile: ) to elaborate, but roughly:

  1. I’m not saying “UB is a side effect”. That would indeed be a problem.
  2. Programs can have multiple legal behaviors and the compiler can arbitrarily “choose” a particular behavior out of this set. In LLVM the canonical example is undef: print((i8)undef) has the following set of behaviors {"prints 0", "prints 1", ..., "prints 0xff"}. This choice is usually called “refinement”. There is some more discussion about refinement here.
  3. UB can be modeled as an extreme version of such an operation with multiple possible behaviors: it can have any possible behavior. So *(int*)0 has the infinite set of behaviors {"prints 0", "prints 1", ..., "formats hard drive", "invokes nasal demons"}. The compiler is free to choose any such behavior to “implement” the UB, and since this set contains all possible behaviors, the compiler cannot choose incorrectly.

NoSideEffect means “it is guaranteed to have no side effect”. If there is the possibility of UB, then that statement is not true, since UB can result in a side effect. To break this down into unambiguous logical implications:

(UB is possible when executing FooOp && UB can trigger a side effect)
==> the statement “executing FooOp is guaranteed to have no side effect” is false
==> FooOp cannot be marked NoSideEffect

This does not involve claiming that “UB is a side effect”.

As I understand it, UB is just not a valid execution of the program. So I have a hard time following your reasoning when it starts with “UB is possible when executing FooOp”, at least in the usual sense of UB as understood coming from C/C++.

If I make a parallel with C again, my claim is that any call to this function: int foo(int i) { return i + 1; } does not involve a side-effect. There are calls to this function that are UB (e.g. foo(INT_MAX), since signed overflow is UB), but they just aren’t valid program executions; I don’t consider them as “having a side effect”.

This is why LLVM has the llvm::isSafeToSpeculativelyExecute check I believe.

Okay, it sounded like we were using a definition of NoSideEffect that implied isSafeToSpeculativelyExecute (at least, @sanjoy_das_google’s examples of inserting erroneous operations seemed to imply that). Is this the behavior we want?

It would seem then that the canBeHoisted function in LoopInvariantCodeMotion is buggy under “NoSideEffect does not necessarily imply isSafeToSpeculativelyExecute”.

@River707 does NoSideEffect imply isSafeToSpeculativelyExecute?

I actually think this problem goes deeper. Consider

func @f(%arg0: tensor<?x1xf32>, %arg1: tensor<1x?xf32>) -> tensor<?x?xf32> {
  %0 = "tcp.matmul"(%arg0, %arg1) : (tensor<?x1xf32>, tensor<1x?xf32>) -> tensor<?x?xf32>
  return %0 : tensor<?x?xf32>
}

It’s trivially easy for a user to pass in dynamic arguments of size 1M, resulting in a 1M*1M == 1T sized output, which is likely to OOM any end system. So when expanding this tcp.matmul op to “alloc_result_memref; matmul_on_memrefs”, the “alloc_result_memref” could fail. The end-user API is unlikely to allow UB in this case. The implication is that tcp.matmul still needs to encapsulate some sort of defined error behavior. The compilers I’ve seen that handle the simpler fully static shape case don’t have to deal with this, because they compute a static memory plan ahead of time and errors arise deterministically at the initial “allocate all the memory for the model” step (possibly spuriously, e.g. triggering an error even when the only offending op is guarded by if dynamically_false(): do_useless_matmul_that_results_in_ridiculously_large_output()).

Sadly, my background is perhaps too heavily in the C/C++ side of the world, which leans heavily on UB for these cases; most of our frontends dealing with tensor operations don’t have UB to lean on.

Would love to hear from folks with more experience with safe languages (@sanjoy_das_google, @rjmccall ?). E.g. how does Java/Swift/etc. model the notion of reporting “required errors” for even “basic” ops? Surely some amount of reordering is allowed. E.g. does an overflow check prevent hoisting an integer operation past a print? Does a potential OOM prevent hoisting any kind of allocation past a print?

Out-of-bounds and null-pointer exceptions are “strict” and cannot be reordered. But the JVM spec also defines “virtual machine errors” that “may be thrown at any time during the operation of the Java Virtual Machine”. Still, that isn’t UB.

Why is this a problem? Won’t we have to have error semantics for mismatched shapes anyway?

You generally shouldn’t hoist allocations unless they’re “provably” not going to fail, at least within some reasonable model.

The mismatched shape checks can be reified in IR and separated. But having a “tcp.matmul” op that takes tensors and returns a tensor intrinsically hides an allocation of the result (which might fail). I can’t think of any predicate we can reasonably reify and separate from the tcp.matmul op that would guarantee the allocation of its result cannot fail.

I see, that’s a good catch! For now I have an intellectually unsatisfying solution: let’s not provide any hard guarantees around not introducing OOMs; speculating a matmul that causes an OOM that would not have otherwise happened is a QoI issue, not a correctness issue. This is probably fine since we almost never want to (say) launch a non-trivial CUDA kernel that would not have otherwise been launched.

Providing such a guarantee on the maximum memory usage would be really hard: any change in the schedule of these operations could break it by changing the live ranges of the tensors.

If this becomes a real problem we might have to go TF’s way: don’t force a total order on all operations, and make alloc an async operation that can return at some future time when more memory is available.