[RFC] Define precise arith semantics

Hi folks,

Summary

We would like to start a discussion on the semantics of the Arithmetic dialect. Currently, there are many ops that we expect to be undefined for some inputs, yet this is not covered by the dialect’s documentation. This makes it difficult to lower to arith without relying on knowledge of the subsequent lowering to the next dialect (e.g., LLVM, SPIR-V), and to reason about the correctness of transforms within the arith dialect. We propose to gradually start filling in the gaps in the semantics.

Background

Let’s consider two examples that illustrate the difficulties we run into:

  1. The arith.muli op does not specify its overflow semantics. It is unclear whether one should expect wrapping semantics (result modulo 2^N) or an undefined result. When emitting arith.muli, one has to either insert runtime checks to avoid triggering potential UB, or check the lowering paths to LLVM/SPIR-V/etc. and confirm that arith.muli will be converted to an op with defined overflow behavior.

  2. The arith.shrui op does not specify what happens when the shift amount is greater than or equal to the bitwidth of the type. LLVM defines the result of such shifts as poison, which allows the wide integer emulation pass to emit selects that stop poison propagation (see the sketch after this list). If we had a (hypothetical) lowering from arith.shrui to C’s shift operator >>, these selects would not be sufficient, as an oversized shift in C is immediate UB.
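
To make the second example concrete, here is a minimal sketch of such a guard (value names are ours); it is sound under LLVM-style poison semantics, but not under C-style immediate UB:

  %c32 = arith.constant 32 : i32
  %c0 = arith.constant 0 : i32
  // Poison (rather than immediate UB) when %amt >= 32.
  %shifted = arith.shrui %x, %amt : i32
  %in_bounds = arith.cmpi ult, %amt, %c32 : i32
  // The select stops the poison from propagating any further.
  %res = arith.select %in_bounds, %shifted, %c0 : i32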

In the current state of things, lowering to arith, lowering from arith, and transforming within arith all require confirming the exact semantics with *all* of the lowering targets: LLVM, SPIR-V, and potentially more. We argue it would be better for arith to define exact semantics itself, so that there is only a single place to check in all of these scenarios.

Solution Space

We propose to make arith more specific with respect to the potential undefined behavior of each op. We could execute this in a number of ways.

Source IR → arith → Destination IR

One high-level consideration is how defined we want the ops to be:

  • If we make an op under-defined, it becomes difficult to emit: when the source IR is more defined, emitting the op may require runtime checks on the source side.
  • If we make an op over-defined, it becomes difficult to lower: when the target IR is less defined, lowering the op may require runtime checks on the destination side.

Because the two main existing destination IRs, LLVM and SPIR-V, have similar semantics, it seems reasonable to use the LLVM semantics as the starting point.

Another thing to consider is how to implement the changes to the semantics:

  1. Tie arith semantics to the LLVM Language Reference Manual. This probably requires the fewest changes, but does not allow the two IRs to evolve independently. For example, it seems unreasonable to require LLVM contributors to take the impact on arith and the rest of MLIR into consideration for every LLVM IR change they propose. Also, arith is not simply a subset of LLVM: it supports more types (e.g., n-D vectors and tensors), which may come with more semantic options.

  2. ‘Copy’ the current LLVM semantics in one sweep, but allow for future divergence if it makes sense for particular ops.

  3. Consider the semantics on an op-by-op basis, defaulting to the LLVM semantics unless there is a reason not to. This follows the spirit of incremental development and is expected to make code reviews easier; it also lets us audit existing lowerings step by step to make sure they are correct with respect to the newly defined semantics.

Proposal

We propose to move forward with implementation strategy 3. This allows us not to disrupt the status quo, by following what seems to be the existing undocumented expectation, and to fill in the gaps through a series of incremental changes developed and reviewed independently.

-Jakub (@kuhar), with input from @jpienaar, @ftynse, and @antiagainst


I found prior art:

Thanks for the links, @tschuett. One key difference between these and this RFC is that I do not propose any new ops / attributes / behaviors; I merely want to document the behavior of the existing arith ops.

There are/will be customers that are interested in fast, strict, … math. LLVM models fast math with flags. There is also support for strict math intrinsics.

This approach makes sense to me. I think this also argues that we import the notion of poison and undef from LLVM as well.

A related but different issue is that the arith semantics for values of index type are completely broken for 32-bit targets, because we do constant folding etc. at 64-bit precision. If there is interest in fixing this, we have a replacement for this part of arith that could be useful to mlir main. (@Mogball)

-Chris

I’d be interested in seeing the approach/fix. There are a couple of broken things here and it has not been clear (to me) how to fix them.

If we decide to follow the LLVM semantics, then essentially yes, because we would need some notion of deferred UB. I wanted to defer this discussion until we agree on the overall strategy and hand-wave things as simply ‘undefined’ for the time being, but I now see it may be difficult to decouple this.

For context, this is what the SPIR-V spec says about OpUndef:

Each consumption of Result <id> yields an arbitrary, possibly different bit pattern or abstract value resulting in possibly different concrete, abstract, or opaque values.

Which I think matches the state of LLVM before poison/freeze.

That’s a very execution-focused definition, which doesn’t seem to define the /semantics/ of the operation. The latter determines which optimizations and assumptions can be applied. @nlopes is an expert on this. Nuno, what is your perspective?

We solved this in our index dialect by giving all operations target-width-specific semantics. This means that you can only fold an operation when you know it will produce the same value on both 32- and 64-bit targets. For constant folding, we do all math in 64 bits, but only fold if trunc(f(a, b)) = f(trunc(a), trunc(b)) holds (where trunc is to 32 bits). Thus you can fold “4 - 12” and even “2 << 31”, but “(2 << 31) / 4” will leave the divide unfolded, since the result differs between 32-bit and 64-bit systems. Jeff can share the code; it isn’t complicated or tricky.
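
For illustration, a sketch of the second example in IR form (the index dialect is not public yet, so the op names here are assumptions):

  %c2 = index.constant 2
  %c31 = index.constant 31
  %c4 = index.constant 4
  // Folds: the 64-bit result 2^32 truncates to 0, which matches the
  // 32-bit result of the same shift, so both targets agree.
  %shl = index.shl %c2, %c31
  // Not folded: 64-bit gives 2^30, while 32-bit gives 0 / 4 = 0.
  %div = index.divu %shl, %c4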

The arith dialect unfortunately has a number of subtle problems like this, and is very expansive (given that it accepts tensors, with unclear semantics). I think it is great to improve its specification, and I think cleaving parts of it off is worth pursuing; e.g., index operations are materially different from the rest of the supported types.

First I just want to say I’m all for improving the specification of arithmetic semantics.

As you mentioned, there are two (families of) impedance-mismatch problems: src-to-arith and arith-to-dst; plus the subsequent (family of) coherence problem: src-to-dst. The impedance-mismatch problems are unavoidable, but it’s the coherence problem that’s the crucial thing to get right, since these non-local interactions are liable to be the main source of bugs.

For the cases where we can’t come up with a suitable one-size-fits-all semantics, one option to consider is adding attributes to the ops for indicating when some particular semantic property is allowed to be relaxed in lowering. (N.B., I only mean allowing the semantics to become less specified, not allowing them to “change” into something different.)

For example, let’s say we decide that arith.muli should have some particular overflow semantics S. If a particular source dialect doesn’t like that semantics, they might emit guards to check for overflow beforehand; but then it doesn’t matter what S actually is, since the guards ensure that overflow will never happen. Then we might lower arith to some other dialect where implementing S takes a lot of extra work. So if the source dialect could give the arith.muli an attribute saying “I don’t care what happens on overflow (because I’ve ensured it’ll never happen)”, then the lowering out of arith would be free to use some other semantics that’s easier to implement.
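
As a sketch, such an annotation might look like this (the attribute name and value are hypothetical, not a concrete syntax proposal):

  // The source dialect guarded against overflow, so it opts out of S
  // and any lowering of the op may pick whatever is cheapest.
  %0 = arith.muli %a, %b {overflow = "any"} : i32
  // Unannotated: the documented semantics S apply.
  %1 = arith.muli %c, %d : i32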

To be clear, I’m not suggesting that’s an optimal approach, just that it might be worth considering for those cases where we need some extra wiggle room to avoid coherence problems. The main benefit of this sort of thing is that it saves lower-level analyses from needing to rediscover that the implementation of the S semantics is dead code for that particular op instance.

Ah, I appear to have slacked too long on writing that RFC. If there’s marked interest in a dialect like that, I’ll get to it ASAP. But in summary, our index dialect is most of the arith ops sliced out into a dialect that only deals with index types and has no dependencies. With a separate dialect and ops with clear semantics, we can be careful about handling index types.


That makes sense, and we’ve done something similar in an implementation. What if we were to phase in such an index dialect and phase out support for arbitrary arithmetic on index in the arith dialect?

With respect to the tensor baggage specifically, I think we should consider it something to be deprecated and removed. Because the semantics are so unclear and incomplete in those cases, I’m not aware of any practical uses (i.e., I’ve seen folks try it and then realize that the very limited case actually modeled is of unusably little utility).

We raced, with me asking if we could do this 🙂

One other thing that I’ve found tends to compose well is an op like index.sizeof. In some situations where the size of the index type can vary within partitioned parts of the program, such an op can be used to materialize things like dynamic allocations/copies/etc. of arrays of index. It would be fixated to the actual target index size late in a lowering pipeline, but you also have the option to fixate it early if you know it.
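
For instance, here is a sketch of materializing the byte size of a dynamically sized buffer of index values (all op names besides index.sizeof are assumptions):

  // Width of index on the target (e.g., 32 or 64), fixated late in lowering.
  %bits = index.sizeof
  %c8 = index.constant 8
  %bytes_per_elem = index.divu %bits, %c8
  %bytes = index.mul %n, %bytes_per_elem
  // Raw backing buffer for %n index values.
  %buf = memref.alloc(%bytes) : memref<?xi8>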

+1 that’s a really solid idea

I’d love to see both things!

Since we’re well on our way to a wishlist: anything else? This stuff only gets harder to fix… Might as well plan to clean it up soon and get it to a long-term healthy place.

Our experience with hand-waving hasn’t been great. Different people will interpret what’s written in different ways, and then you have miscompilations.
I think it’s better to have semantics that are clearly specified but not perfect than to aim for optimal semantics that are not fully specified. You can always evolve the semantics later (though it’s not trivial).

Following LLVM’s semantics has several benefits:

  • Lowering to LLVM is trivial.
  • LLVM supports several architectures, and its current design already reflects the tradeoffs of supporting all such ISAs. In general, we want the lowest common denominator, to avoid forcing certain architectures to use runtime checks to provide more precise semantics that may not be needed.
  • It exists.

My suggestion is to introduce poison and freeze in MLIR as standard concepts. And do not introduce undef (see below).

Yep, that matches the semantics of LLVM’s undef.
However, in the past few months we’ve been replacing all uses of undef with poison whenever appropriate. This is part of an effort to remove undef completely from LLVM, as the mere existence of undef (even if unused) breaks certain optimizations.

Nevertheless, you can always lower SPIR-V’s OpUndef into freeze poison. That’s correct, and equivalent if freeze ends up with a single use only.
(this is not exploited by InstCombine yet, but it’s on the todo list)
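
In MLIR terms, such a lowering could look like the following sketch (assuming LLVM-dialect counterparts for poison and freeze, spelled here as llvm.mlir.poison and llvm.freeze):

  // freeze picks one arbitrary but fixed bit pattern, so every consumer
  // of %u sees the same value; this refines OpUndef's "possibly
  // different per consumption" behavior, hence it is correct.
  %p = llvm.mlir.poison : i32
  %u = llvm.freeze %p : i32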


Would poison and freeze go in arith or builtin? I can’t see the proposal for the index dialect yet, so can’t determine if it would be used there (seems like it would). Trying to parse how “standard” this is.

In current pipelines, arith lowers to SPIR-V, which seems the opposite of what is described here?

I don’t know much about MLIR, but probably builtin. You need poison for a lot of things, including as a placeholder for phi nodes and for vector creation instructions (insertelement, shufflevector, etc. in LLVM).
freeze is needed for optimizations that hoist things out of loops and/or branches, and for bit-fields. So it may not be useful for everyone (or maybe for no one right now).

Ah, that’s ok as well, as poison can be lowered into undef.


I would somewhat strongly push back against this going into builtin. It isn’t clear that these are fundamental constructs for building IR, as opposed to constructs useful for modeling the UB of specific domains.