MemRef type and data layout

It would be undefined behavior, just like it would be with scalar out-of-bounds memory accesses. I’m not sure why we should worry specifically about defining the semantics of these. I don’t see why we would try to define it as (or guarantee) getting seven valid values and one garbage one in the vector. If the op is asking for a vector of 8 from a memref of size 7, that’s undefined behavior.
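
To make that concrete (just a sketch, using a hypothetical std.load that returns a vector, as proposed further down in this thread; %c0 is an index constant):

%v = std.load %m[%c0] : memref<7xf32>, vector<8xf32>   // asks for 8 elements from a memref of 7: UB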

We could go with having the 1-d vector coming from only one dimension (the fastest varying one in the logical space). Hence, this would be undefined behavior as well.
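
In other words (a hypothetical sketch, same std.load-returning-vector syntax, SSA names made up): the 1-d vector would be read along the fastest-varying logical dimension only, so a load that would have to spill into the next row is UB rather than wrapping around:

%ok = std.load %m[%i, %c0] : memref<8x4xf32>, vector<4xf32>   // covers one full row: fine
%ub = std.load %m[%i, %c2] : memref<8x4xf32>, vector<4xf32>   // would run past the end of the row: UB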

You are just getting a vector of four values from the start in the first/only dimension of the logical space. Just like for regular loads on dynamically shaped memrefs, you won’t know if it’s undefined behavior unless you know the sizes. This would just work fine if the ? is an SSA value that is at least 4. It doesn’t matter what memref layout you use (so long as the layout map is one-to-one). In fact, a tile-size-32 layout map is fine here. However, had it been (d0 floordiv 2, d0 mod 2), lowering such loads or normalizing such memrefs would not be possible without memref shape/reinterpret casts and loads from those cast memrefs.
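
As an illustration of those cases (hypothetical syntax and SSA names again):

%v0 = std.load %m0[%c0] : memref<?xf32>, vector<4xf32>                                                      // fine as long as the dynamic size is >= 4
%v1 = std.load %m1[%c0] : memref<?xf32, affine_map<(d0) -> (d0 floordiv 32, d0 mod 32)>>, vector<4xf32>     // tile-size-32 layout: still lowerable
%v2 = std.load %m2[%c0] : memref<?xf32, affine_map<(d0) -> (d0 floordiv 2, d0 mod 2)>>, vector<4xf32>       // tile-size-2 layout: lowering would need memref reinterpret/shape casts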

If the semantics are not defined at all, they may be interpreted differently than “undefined behavior, don’t do this”, and somebody will spend time debugging their understanding against what was implied.

Tiling + vectorization without the need for extra control flow for handling partial tiles?

If we go for logical fastest-varying, this is something that should be very clearly specified in the description. Otherwise, somebody may assume the load is always an equivalent of a machine vector load, but in fact it can be a bunch of scalar loads that pack a vector, e.g., in the case of transposed or strided layouts.
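
For example (hypothetical), with a transposed layout the logical fastest-varying dimension is not contiguous in memory, so the load below would have to be materialized as four scalar loads packed into a vector rather than a single machine vector load:

%v = std.load %m[%i, %c0] : memref<16x4xf32, affine_map<(d0, d1) -> (d1, d0)>>, vector<4xf32>   // the logical last dim has stride 16 in memory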

I was thinking above of relying on the fastest-varying dimension to ensure that a “vector load” is emitted, but the logical fastest-varying dimension does not necessarily have stride 1, does it?

Exactly. That would allow modeling vector loads/stores with arbitrary alignment, regardless of the alignment constraints of the target.

It sounds like a good starting point to me.

I think out-of-bounds behavior should be defined as target-specific (is that compatible with the definition of UB?) and it should be allowed representation-wise. I actually have use cases for this. For reference, some advanced vectorization techniques may require speculative execution that can result in partially out-of-bounds vector loads. This might be “legal” (safe), depending on the target. For x86, it’s safe to read out of bounds as long as you don’t cross page boundaries, and the compiler may take advantage of that.

I thought the same thing. How could we even generate vector loads/stores today through the Vector dialect if that wasn’t the case? I think we should limit this to only contiguous vector loads/stores, at least for now. We may want to represent strided or gather/scatter accesses differently.

I really think it’s not great to overload the semantics of load to say “provide me eight elements if it’s all within the boundary, else provide me up until the boundary”. You’d perhaps want to use another op to encode such conditional semantics. If the op is asking for a vector of that size (encoded in the type as vector<8xf32>), I think it should get that or be UB. As you know, such a conditional isn’t static when the shape is dynamic.

It does not. Neither does the physical fastest-varying dimension, which can also have a non-unit stride. Worse, the stride may not even be known at compile time, e.g. memref<?xf32, offset: 0, strides: [?]>.

What do we do if we cannot statically prove that the load is contiguous?

In C++ terminology, I would call this implementation-defined behavior. UB means the compiler can assume it never happens in a valid program and base optimizations on this assumption. For example, loading a vector of 8 from a dynamically shaped memref at index %i lets the compiler assume the memref has at least %i + 8 elements, because the opposite would result in UB; it can then use this knowledge in, e.g., address computation. What you describe reads like the compiler should assume such a load is valid, and may potentially get some information about how it is implemented. Practically, some lowerings could refuse to handle such loads, for example, but it becomes awkward when there isn’t enough static information to decide.

The point of my message was to say that we need to be very careful when defining the semantics of a vector load op. I am not arguing one way or another.


I’m missing something… the fastest varying dimension of a memref with a default stride would be contiguous in memory, isn’t that the case? E.g., memref<200xf32> and even memref<?xf32>. Otherwise, a specific stride would be provided, right? We could limit the proposal to only unit-stride cases for now.

Generalizing this approach to non-unit stride cases could also be something to consider before moving forward, even if we leave these cases out for now. One option could be allowing the following:

%interleaved = std.load %memref0[%c0] : memref<100xf32, offset: 0, strides: [2]>, vector<8xf32>

%gather = std.load %memref1[%c0] : memref<100xf32, offset: 0, strides: [?]>, vector<8xf32>

In this context, the vector type is modeling that we are packing 8 scalar elements into an SSA vector value, but how those elements are loaded would be up to the backend. For %interleaved (stride 2), the backend could, for instance, generate two contiguous vector loads starting at %memref0[%c0] and then shuffle the scalar elements to produce the expected result. For %gather (unknown stride), the backend could generate a generic gather instruction or populate a vector register out of scalar loads. Specifically for LLVM, both cases could be lowered to a gather intrinsic (see the LLVM Language Reference Manual) and it would be up to the target how to implement it.
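
Just to sketch the first option (purely illustrative; assume %base is a contiguous memref<200xf32> view of the same underlying buffer and %c0/%c8 are index constants): the stride-2 case could become two contiguous vector loads followed by a shuffle that keeps the even lanes:

%lo = std.load %base[%c0] : memref<200xf32>, vector<8xf32>    // physical elements 0..7
%hi = std.load %base[%c8] : memref<200xf32>, vector<8xf32>    // physical elements 8..15
%interleaved = vector.shuffle %lo, %hi [0, 2, 4, 6, 8, 10, 12, 14] : vector<8xf32>, vector<8xf32>   // even physical elements = logical elements 0..7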

That would be a relatively high-level representation, which I’m not sure is what we want at this level of abstraction. I think I would lean towards having something more explicit for each particular case to facilitate code generation later: probably one op to model unit-stride loads and at least one to model gathers (we actually have a vector.gather op already). Maybe it’s worth introducing a std.vector_load for the specific unit-stride cases instead of overloading std.load. Since we are splitting the Standard dialect and given the vector.gather precedent, maybe we should add these new op(s) to the Vector dialect? Vector transfer reads/writes (higher level) could eventually be lowered to the new lower-level counterparts through progressive lowering. WDYT?
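
Something like the following, perhaps (the name and syntax are just a strawman): an op whose verifier/lowering requires a contiguous, unit-stride innermost dimension and therefore always maps to an actual vector load:

%v = vector.load %memref2[%i] : memref<?xf32>, vector<8xf32>    // unit-stride, contiguous; non-contiguous layouts would be rejected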

We would generate a gather for those cases. That would be legal even if the load ends up being unit-stride. It’s up to the target how to implement it.

Exactly, that was my concern. Even simpler: I would be worried if UB led to target-independent canonicalization patterns that shrink the vector shape to only the in-bounds sizes.

This sounds good to me. We make similar target-specific decisions using TTI in LLVM. Note that the compiler might be the one generating some of these out-of-bounds accesses, for example, through vectorization. For these cases, the compiler could check if the target “supports” out-of-bounds accesses and generate code accordingly. We follow a similar approach for vector masking support in LLVM’s loop vectorizer. I don’t see out-of-bounds accesses being any different.

It seems valuable to have an op that requires unit strides or otherwise guarantees an actual vector load instruction is ultimately generated, instead of a gather. Like you pointed out, we already have a gather. Where this op should live is a different question. I don’t think I have strong arguments for any specific choice. If we think we will need to pattern-match for vector loads specifically, having a separate op may turn out beneficial. Putting it in the vector dialect seems logical, but @nicolasvasilache is right that there is currently a lot of awkwardness involved in further conversions from the vector dialect.

In MLIR infra, this would map to specifically saying that out-of-bounds accesses are permitted by the op (so that nobody adds an in-bounds verifier / assertion generator), but may not be supported by all targets. Then a specific conversion can mark such ops illegal.