Why do sub-byte loads on AArch64 not require masking?

The AArch64 backend lowers IR like this:

define i32 @f(ptr %0) {
  %2 = load i7, ptr %0, align 1
  %.not = icmp eq i7 %2, 0
  %3 = zext i1 %.not to i32
  ret i32 %3
}

to this assembly:

	ldrb	w8, [x0]
	cmp	w8, #0
	cset	w0, eq

This translation, which widens a 7-bit load to an 8-bit load without applying a mask, can only be correct if that 8th bit is guaranteed to be zero. I’m trying to understand where this guarantee comes from and what other guarantees go along with it; basically, we need to fully understand the contract that exists at this level of the compiler before we can go about formalizing it properly. Thanks!
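To make the contract concrete, here is a minimal C model of it (the names `store_i7` and `is_zero_i7` are made up for illustration; nothing in LLVM is literally structured this way). If every store of an i7 zero-fills the padding bit, then the widened load needs no mask, which is exactly why the `ldrb` above can feed `cmp` directly:

```c
#include <stdint.h>

/* Model of the store side of the contract: the one padding bit
 * (bit 7) of the containing byte is always written as zero. */
static void store_i7(uint8_t *mem, uint8_t v) {
    *mem = v & 0x7F;                 /* zero-extend i7 into the byte */
}

/* Model of the widened load: read the whole byte with no mask,
 * mirroring the ldrb + cmp sequence. This is only correct because
 * store_i7 guaranteed that bit 7 is zero. */
static int is_zero_i7(const uint8_t *mem) {
    return *mem == 0;                /* 8-bit compare stands in for the 7-bit one */
}
```

If a store ever left bit 7 set, `is_zero_i7` would report a nonzero result for a value whose low 7 bits are all zero.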

We’ve had inconsistent handling of this in the past: some places assume masking is needed, others don’t. In practice, the mirrored store is lowered with zeroed high bits.

Thanks Matt!

I wonder if you (and others who work on this backend) might be willing to help me write a brief English-language description of the guarantees and obligations here? For example, so far I think we might be able to say something like:

When an LLVM-side value whose size in bits is not a multiple of 8 is stored to RAM, any extra bits up to the next multiple of 8, on the ARM side, must contain zeroes.


Do we want to define this zero-fill behavior, and should it really be limited to ARM? I’d kind of expect universal consistency here.

I think we should also try to nail down the behavior for vectors with non-byte-sized elements. That’s always been a mess.


For ARM: if we indeed zero the extra storage in practice, and rely on that, then I think we absolutely should document it, and fix any exceptions that we find.

Regarding consistency across targets, I don’t know – might there be targets that don’t want to mandate this particular behavior?

It seems to me that this sort of guarantee falls into the same category as something we discussed earlier: whether a signext i1 gets zero-extended first or sign-extended first. In other words, this is part of the LLVM-specific ABI for a particular target.
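The signext i1 question matters because the two extension orders give different answers. A small C sketch of the two orders (function names are illustrative, not from LLVM):

```c
#include <stdint.h>

/* Order A: zero-extend i1 to i8 first, then sign-extend i8 to i32.
 * true becomes 0x01, whose sign bit is clear, so the result is 1. */
static int32_t zext_then_sext(unsigned bit) {
    uint8_t widened = (uint8_t)(bit & 1u);   /* i1 -> i8 by zero extension */
    return (int32_t)(int8_t)widened;         /* i8 -> i32 by sign extension */
}

/* Order B: sign-extend the i1 directly to i32.
 * true is the single bit 1, which as a signed 1-bit value is -1. */
static int32_t sext_directly(unsigned bit) {
    return (bit & 1u) ? -1 : 0;              /* all-ones for true */
}
```

For a true i1, order A yields 1 while order B yields -1, so an ABI has to pick one; that is the same flavor of target-specific contract as the padding-bit rule.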


One piece of target-independent code that I think relies on a non-byte-sized store filling the padding with zeros is SelectionDAGLegalize::LegalizeLoadOps. At least, the code that follows this comment does:

    // The extra bits are guaranteed to be zero, since we stored them that
    // way.  A zext load from NVT thus automatically gives zext from SrcVT.

IIRC we had some problems related to that downstream, when we did not ensure that the padding was filled with zeroes. What happened was that we had a “bug” in isBytewiseValue that didn’t take padding into consideration. It resulted in a memset of a struct using a non-zero byte pattern (such as 0xAA), but then, when reading the sub-byte member from the struct, we ended up reading non-zero bits (while LegalizeLoadOps assumed that those bits would be zero in memory).
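That bug class can be sketched in a few lines of C (the struct and function names here are invented for illustration; the real bug involved isBytewiseValue and legalization, not source-level code like this):

```c
#include <string.h>
#include <stdint.h>

/* Stand-in for a struct with a 7-bit member occupying one byte:
 * the low 7 bits are the value, bit 7 is padding. */
typedef struct { uint8_t member; } S;

/* Stand-in for the legalized load: reads the whole byte with no
 * mask, *assuming* the padding bit is zero in memory. */
static uint8_t load_assuming_zero_padding(const S *s) {
    return s->member;
}

/* The correct recovery when padding may be garbage: mask on load. */
static uint8_t load_with_mask(const S *s) {
    return s->member & 0x7F;
}
```

After a `memset` with a non-zero pattern such as 0xAA, the padding bit is set, so the unmasked load sees 0xAA while the masked load recovers the 7-bit value 0x2A: the mismatch between those two is exactly what LegalizeLoadOps’s assumption turns into a miscompile.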

At the time we found this, I also had a hard time figuring out whether this was a documented rule.

There is an old discussion about non-byte-sized load/stores with some answers here:
Semantics for non-byte-sized stores? (or whenever "store size in bits" is different than "size in bits") - #2 by efriedma-quic

One answer from that thread was:

But in practice, SelectionDAG legalization always zero-extends stores, and loads assume the value is zero-extended.

It seems to me that LLVM has taken the path of creating a reliable compiler platform;
doing things like ensuring that a typed register container cannot contain a value that
the same typed memory container cannot contain.

Thus, it seems to me that if someone creates a struct such as::

struct { uint64_t a:3, e:28; } henry;

And in the code we find::

 d = a*b+c;

That one should expect something like::

      LDD      Rcontainer,[henry]
      EXT      Ra,Rcontainer,<3:0>
      EXT      Rb,Rcontainer,<5:3>
      EXT      Rc,Rcontainer,<9:8>
      MUL      Rtemp,Ra,Rb
      ADD      Rd,Rtemp,Rc
      INS      Rcontainer,Rcontainer,Rd,<17:17>
      STD      Rcontainer,[henry]

Where EXT and INS may expand as the ISA in question supports bit-fields.

Furthermore: if the value in Rd is used, that value may have to be stripped of significance
beyond its 17 bits (although this example was constructed such that Rd contains no
significance beyond bit 16) – but I’m pretty sure value tracking is not up to the task for bit-sized values.

Thus, it seems to me that since LLVM has chosen to preserve container-size value spaces
on a per-container basis, bit fields should be no different, in principle.

I take no position on bit-field sizes outside of structures; you may expand those to any
suitably sized memory-referenceable container.

Aha, thanks-- I hadn’t seen that! It seems to be saying the same things we’re saying in this thread.

So I think I have what I needed here: a particular behavior that I can formalize and check in the translation validation work for the AArch64 backend that my group is doing.

But it also seems very reasonable that (at least on ARM, if not other targets) we should document this behavior, which is (I think) described more clearly in that previous thread than I managed to describe it here.


Hi Mitch, I would prefer to leave bitfields out of this discussion; they are a source-level construct and aren’t (last time I looked, at least) translated by clang into the kinds of non-multiple-of-8-bit integer types at the LLVM level whose semantics we’re discussing.

The “store” section of langref currently says this:

When writing a value of a type like i20 with a size that is not an integral number of bytes, it is unspecified what happens to the extra bits that do not belong to the type, but they will typically be overwritten.

Would people go along with this changed version?

When writing a value of a type like i20 with a size that is not an integral number of bytes, the backend must make a consistent, documented choice about what to do with the extra bits that do not belong to the type. One possibility is to fill these bits with zeroes; another is to leave the contents of these bits unspecified but then to mask them off when the value is loaded.

Or something stronger? I don’t work on backends so don’t want to try to tell people what to do.
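The two choices in the proposed wording can be sketched side by side. This is an illustrative model only (an i20 would really live in 3 bytes; a uint32_t container is used here to keep the sketch simple):

```c
#include <stdint.h>

#define I20_MASK 0xFFFFFu   /* the 20 bits that belong to the type */

/* Choice 1: zero-fill the extra bits at store time.
 * The load can then read the container with no mask. */
static void store_zfill(uint32_t *mem, uint32_t v) { *mem = v & I20_MASK; }
static uint32_t load_nomask(const uint32_t *mem)   { return *mem; }

/* Choice 2: leave the extra bits unspecified at store time.
 * The load must mask them off. */
static void store_raw(uint32_t *mem, uint32_t v)   { *mem = v; }
static uint32_t load_mask(const uint32_t *mem)     { return *mem & I20_MASK; }
```

Both round-trip an i20 correctly; what the langref text has to pin down is which side of the pair carries the obligation, because mixing `store_raw` with `load_nomask` is exactly the miscompile pattern discussed above.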

Then, on the load side the current text is:

When loading a value of a type like i20 with a size that is not an integral number of bytes, the result is undefined if the value was not originally written using a store of the same type.

This reads as a bit vague to me: does it mean immediate UB, or that a poison value is returned? It sounds like the latter, but we should be explicit about points like this. Of course, if those extra bits were going to be reliably zeroed, there may not be any reason to leave this as UB at all.


Based on the choices LLVM made with {bytes, halves, and words}, and on language interoperability, I would suggest::

      When storing a value of a type {not an integral number of 2^{3,4,5,6} bits and/or not on an integral 2^{3,4,5,6} boundary}, the backend must leave the contents of those other bits unmodified.

      When loading a value of a type {not an integral number of 2^{3,4,5,6} bits and/or not on an integral 2^{3,4,5,6} boundary}, the result is defined as the value contained only within the specified bits.

When written correctly, these rules also cover sizes that are integral 2^{3,4,5,6} bits.
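The store-leaves-other-bits-unmodified rule implies a read-modify-write on every sub-container store. A minimal C sketch of that pair of rules (helper names are invented; `width` is assumed to be between 1 and 31 here):

```c
#include <stdint.h>

/* Store `width` bits of v at bit offset `pos`, leaving every other
 * bit of the container unmodified (read-modify-write). */
static void store_bits(uint32_t *container, unsigned pos, unsigned width,
                       uint32_t v) {
    uint32_t mask = ((1u << width) - 1u) << pos;
    *container = (*container & ~mask) | ((v << pos) & mask);
}

/* Load `width` bits at offset `pos`: the result is defined as the
 * value contained only within the specified bits. */
static uint32_t load_bits(uint32_t container, unsigned pos, unsigned width) {
    return (container >> pos) & ((1u << width) - 1u);
}
```

Note the contrast with the zero-fill proposal above: under these rules the neighboring bits survive a store, which is what bitfield and Ada-style semantics need, but every sub-container store becomes a load/merge/store sequence on machines without native bit-insert instructions.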

While these do not fully cover the range requirements of Ada, they at least follow the spirit of Ada’s type model.

Consider _BitInt from C2x, as that is translated to a non-multiple-of-8-bit integer type; e.g., Compiler Explorer