[RFC] Loading Bitfields with Smallest Needed Types

We're running into an interesting issue with the Linux kernel, and
wanted advice on how to proceed.

Example of what we're talking about: https://godbolt.org/z/ABGySq

The issue is that, when working with a bitfield, a load may happen
soon after a store. For instance:

struct napi_gro_cb {
    void *frag;
    unsigned int frag_len;
    u16 flush;
    u16 flush_id;
    u16 count;
    u16 gro_remcsum_start;
    unsigned long age;
    u16 proto;
    u8 same_flow : 1;
    u8 encap_mark : 1;
    u8 csum_valid : 1;
    u8 csum_cnt : 3;
    u8 free : 2;
    u8 is_ipv6 : 1;
    u8 is_fou : 1;
    u8 is_atomic : 1;
    u8 recursive_counter : 4;
    __wsum csum;
    struct sk_buff *last;
};

void dev_gro_receive(struct sk_buff *skb)
{
...
        same_flow = NAPI_GRO_CB(skb)->same_flow;
...
}

Right before the "same_flow = ... ->same_flow;" statement is executed,
a store is made to the bitfield at the end of a called function:

    NAPI_GRO_CB(skb)->same_flow = 1;

The store is a byte:

    orb $0x1,0x4a(%rbx)

while the read is a word:

    movzwl 0x4a(%r12),%r15d

The problem is that between the store and the load the value hasn't
been retired / placed in the cache. One would expect store-to-load
forwarding to kick in, but on x86 that doesn't happen because x86
requires the store to be of equal or greater size than the load. So
instead the load takes the slow path, causing unacceptable slowdowns.

GCC gets around this by using the smallest load it can for a bitfield; it
seems to use a byte for everything, at least in our examples. According to
the comments (which are never wrong), this is intentional: C++0x doesn't
allow one to touch bits outside of the bitfield. (I'm not a language
lawyer, but I take this to mean that GCC is trying to minimize which bits
it accesses by using byte stores and loads whenever possible.)

The question I have is what should we do to fix this issue? Once we
get to LLVM IR, the information saying that we're accessing a bitfield
is gone. We have a few options:

* We can glean this information from how the loaded value is used and
fix this during DAG combine, but it seems a bit brittle.

* We could correct the size of the load during the front-end's code
generation. This has the benefit that all of LLVM's passes then run on the
corrected code.

* We could perform the transformation somewhere else, possibly in MIR or MC.

What do people think?

-bw

By default, clang emits all bitfield load/store operations using the width of the entire sequence of bitfield members. If you look at the LLVM IR for your testcase, all the bitfield operations are i16. (For thread safety, the C/C++ standards treat a sequence of bitfield members as a single "field".)

If you look at the assembly, though, an "andb $-2, (%rdi)" slips in. This is specific to the x86 backend: it's narrowing the store to save a couple bytes in the encoding, and a potential decoding stall due to a 2-byte immediate. Maybe we shouldn't do that, or we should guard it with a better heuristic.

-Eli

We also have the opposite problem to the store shrinking: we'll also try to widen a 4-byte-aligned i16/i8 extload to i32. I haven't worked out the alignment in this particular struct layout.

When writing my initial email, I forgot another option which Eli
pointed out: don't shrink the store's size. That would be acceptable
for our purposes. If it's something that needs further consideration,
perhaps we could disable it via a flag (not an official "-m..." flag,
but "-mllvm -disable-store-shortening" or whatever)?

-bw

Clang used to generate narrower loads and stores for bit-fields, but a
long time ago it was intentionally changed to generate wider loads
and stores, IIRC by Chandler. There are some cases where I think the
“new” code goes overboard, but in this case I don’t particularly have
an issue with the wider loads and stores. I guess we could make a
best-effort attempt to stick to the storage-unit size when the
bit-fields break evenly on a boundary. But mostly I think the frontend’s
responsibility ends with it generating same-size accesses in both
places, and if inconsistent access sizes trigger poor performance,
the backend should be more careful about intentionally changing access
sizes.

John.

When I spent some time looking at this back in March (when Bill mentioned it on IRC), I think I saw a write to one bit in one of the 8-bit pieces and then a read of that bit plus a bit from the adjacent byte. So we used a narrow store and then a wider load due to the 2 bits.

I think you're referring to same_flow and free in the structure above.
Those both have stores, as does most of the rest of the bitfield (it's
an initialization, which it seems could be done with a few bitwise
operations on the whole bitfield, but I digress). But yeah, in the case
where we have consecutive accesses to bitfields in adjacent bytes, a
wider read and store are better.

~Craig

> But mostly I think the frontend’s responsibility ends with it generating
> same-size accesses in both places, and if inconsistent access sizes trigger
> poor performance, the backend should be more careful about intentionally
> changing access sizes.

Fair enough.

-bw

FWIW, when I was at Green Hills, I recall the rule being “Always use the declared type of the bitfield to govern the size of the read or write.” (There was a similar rule for the meaning of volatile. I hope I’m not just getting confused between the two. Actually, since only MSVC of the compilers on Godbolt follows this rule, I’m probably wrong.) That is, if the bitfield is declared int16_t, then use 16-bit loads and stores for it; if it’s declared int32_t, then use 32-bit loads and stores. This gives the programmer a reason to prefer one declared type over another. For example, in

template<class T>
struct A {
    T w : 5;
    T x : 3;
    T y : 4;
    T z : 4;
};

the only differences between A<int8_t> and A<int16_t> are

  • whether the struct’s alignment is 1 or 2, and
  • whether you use 8-bit or 16-bit accesses to modify its fields.
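
To make the comparison concrete, here is a small sketch assuming the
instantiations are A<int8_t> and A<int16_t> (an assumption on my part),
checked against a typical x86-64 Itanium-ABI layout:

#include <cstdint>

template <class T>
struct A {
    T w : 5;
    T x : 3;
    T y : 4;
    T z : 4;
};

// Same 16-bit layout; only the alignment (and, under the proposed rule,
// the width of the loads and stores used to touch the fields) differs.
static_assert(sizeof(A<int8_t>) == 2 && alignof(A<int8_t>) == 1, "");
static_assert(sizeof(A<int16_t>) == 2 && alignof(A<int16_t>) == 2, "");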

“The backend should be more careful about intentionally changing access sizes” sounds like absolutely the correct diagnosis, to me.

my $.02,
Arthur

I’ve always liked MSVC’s bit-field rules as a coherent whole, but they are
quite different from the standard Unix rules. On Windows, T x : 3
literally allocates an entire T in the structure, and successive
bit-fields get packed into that T only if their base type is of the
same size (and they haven’t exhausted the original T). So of course
all accesses to that bit-field are basically of the full size of the T;
there’s no overlap to be concerned with. On Unix, bit-fields will typically
get packed together regardless of the base type; the base type does have
some influence, but it’s target-specific and somewhat odd.
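
As a hedged illustration of that difference (struct name invented; sizes as
seen on typical x86-64 targets):

// MSVC allocates a full storage unit per declared base type and only packs
// successive bit-fields that share the base type's size, so 'a' and 'b'
// below live in separate units. The Itanium/SysV layout packs them into
// the same unit regardless of base type.
struct Mixed {
    unsigned char a : 3;
    unsigned int  b : 3;
};

#ifdef _MSC_VER
static_assert(sizeof(Mixed) == 8, "char unit, padding, then a fresh int unit");
#else
static_assert(sizeof(Mixed) == 4, "a and b packed into one int-aligned unit");
#endif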

I’d prefer if we degraded to a Windows-like access behavior as much
as we can, but it’s not always possible because of that packing.

John.

At least in this test-case, the “bitfield” part of this seems to be a distraction. As Eli notes, Clang has lowered the function to LLVM IR containing consistent i16 operations. Despite that being a different choice from GCC, it should still be correct and consistent.

Of course that insight does mean it’s quite easy to create a test-case with the exact same problematic store->load mismatch which doesn’t use bit-fields at all. For example:
void g();

short f2(short *bfs) {
    *bfs &= ~0x1;
    g();
    return *bfs;
}

creates the same bad sequence:
movq %rdi, %rbx
andb $-2, (%rdi)
callq g()
movzwl (%rbx), %eax

Similar issue with aligned extload widening. The load and store are both i16 in IR.

struct s {
    int x; // force alignment
    short y;
};

void g();

short f2(s *a, short b) {
    a->y = b;
    g();
    return a->y - b;
}

produces this

movq %rdi, %rbx
movw %bp, 4(%rdi) // 2 byte store
callq g()
movl 4(%rbx), %eax // 4 byte load
subl %ebp, %eax

There was a related discussion about a load-widening (byte read → 64-bit read) transformation 9 years ago: http://lists.llvm.org/pipermail/llvm-dev/2011-December/046322.html

David

At least in this test-case, the "bitfield" part of this seems to be a distraction. As Eli notes, Clang has lowered the function to LLVM IR containing consistent i16 operations. Despite that being a different choice from GCC, it should still be correct and consistent.

I suspect that this is more prevalent with bitfields as they're more
likely to have the load / bitwise op / store operations done on them,
resulting in an access type that can be shortened. But yes, it's not
specific to just bitfields.

I'm more interested in consistency, to be honest. If the loads and
stores for the bitfields (or other such shorten-able objects) were the
same, then we wouldn't run into the store-to-load forwarding issue on
x86 (I don't know about other platforms, but suspect that consistency
wouldn't hurt). I liked Arthur's idea of accessing the object using
the type size the bitfield was defined with (i8, i16, i256). It would
help with improving the heuristic. The downside is that it could lead
to suboptimal code, but that's the situation we have now, so...

-bw

Okay, but what concretely are you suggesting here? Clang IRGen is
emitting accesses with consistent sizes, and LLVM is making them
inconsistent. Are you just asking Clang to emit smaller accesses
in the hope that LLVM won’t mess them up?

John.

I don’t think this has anything to do with bit-fields or Clang’s lowering. This seems to “just” be an optimizer issue (one that happens to show up for bit-field access patterns, but also affects other cases). Much-reduced testcase:

unsigned short n;

void set() { n |= 1; }

For this testcase, -O2 generates a 1-byte ‘or’ instruction, which will often be a pessimization when there are also full-width accesses. I don’t think the frontend can or should be working around this.
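
For illustration, a hedged companion to that testcase (read_all() is a
hypothetical extra reader, not part of the reduced case): once a full-width
access exists nearby, the narrowed store and the 16-bit load form exactly
the pairing from the original report.

unsigned short n;

void set() { n |= 1; }                   // narrowed to a byte-sized 'or' at -O2
unsigned short read_all() { return n; }  // full 16-bit load of the same location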

We're a bit (heh) constrained in what we can do. LLVM's IR doesn't convey
that we're working with a bitfield, but as many pointed out it's not
bitfield-specific (I didn't mean to imply with my initial post that *only*
bitfields are affected by this; that's just where it showed up). The
front-end generates code that's perfectly valid for the various accesses,
and changing what it generates doesn't seem like it would be worth the
hassle (e.g. using i1, i3, etc. for the bitfields; note I'm *not*
suggesting we do this, I'm just using it as an example).

So what we're left with is trying to ensure that we don't generate
sub-optimal code via the DAG combiner and other passes. Some ideas off
the top of my head:

1. Avoid generating byte-sized bitwise operations, because as Richard
pointed out they can lead to pessimizations.

2. Come up with a better heuristic to determine if we're going to
generate inconsistent load/store accesses to the same object.

3. Transform the machine instructions before register allocation to
ensure that all accesses to the same address are the same size, or at
least that the stores are of equal or greater size than the loads.

Option (1) may be the simplest and could potentially solve our immediate issue.

Option (2) is difficult because the DAG combiner operates on a single
BB at a time and doesn't have visibility into past DAGs it modified.

Option (3) allows us to retain the current behavior and gives us
visibility into the whole function, not just an individual BB.

I kind of like option 3, but I'm open to other suggestions.
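
For concreteness, here is a very rough sketch of what a pass along the lines
of option (3) might look like. The pass name and structure are invented, it
only detects the forwarding-hostile pattern rather than rewriting anything,
and it is written against the older MachineMemOperand API where getSize()
returns a plain integer:

#include "llvm/ADT/DenseMap.h"
#include "llvm/CodeGen/MachineBasicBlock.h"
#include "llvm/CodeGen/MachineFunction.h"
#include "llvm/CodeGen/MachineFunctionPass.h"
#include "llvm/CodeGen/MachineInstr.h"
#include "llvm/CodeGen/MachineMemOperand.h"
#include "llvm/IR/Value.h"
#include <algorithm>

using namespace llvm;

namespace {
struct AccessSizeConsistency : public MachineFunctionPass {
  static char ID;
  AccessSizeConsistency() : MachineFunctionPass(ID) {}

  bool runOnMachineFunction(MachineFunction &MF) override {
    // Largest store width seen so far for each known IR-level base value.
    DenseMap<const Value *, uint64_t> LargestStore;
    for (MachineBasicBlock &MBB : MF) {
      for (MachineInstr &MI : MBB) {
        if (!MI.hasOneMemOperand())
          continue;
        const MachineMemOperand *MMO = *MI.memoperands_begin();
        const Value *Base = MMO->getValue();
        if (!Base)
          continue;
        if (MI.mayStore()) {
          uint64_t &Sz = LargestStore[Base];
          Sz = std::max<uint64_t>(Sz, MMO->getSize());
        } else if (MI.mayLoad()) {
          auto It = LargestStore.find(Base);
          if (It != LargestStore.end() && MMO->getSize() > It->second) {
            // Load is wider than every store we've seen to this base:
            // the store-to-load-forwarding-hostile pattern. A real pass
            // would widen the store or narrow the load here, which needs
            // alias information and target hooks.
          }
        }
      }
    }
    return false; // analysis only in this sketch
  }
};
} // end anonymous namespace

char AccessSizeConsistency::ID = 0;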

-bw

I thought the IR you showed in March had the store in one function and the load in another function. Option 3 would not handle that.

~Craig

I thought the IR you showed in March had the store in one function and the load in another function. Option 3 would not handle that.

Yeah. That's an entirely different issue, though, as it'll need IPA. I'm
a bit skeptical about the initial report, as I suspect that a store at
the end of a function would have retired, or at least made it to the
cache, by the time the function returns, the stack's readjusted,
registers are restored, etc. So I think some of the details in the
initial report may have been a bit off.

-bw

> I don't think this has anything to do with bit-fields or Clang's lowering.
> This seems to "just" be an optimizer issue (one that happens to show up for
> bit-field access patterns, but also affects other cases).

To be clear, I agree with you that this is not a frontend problem.
I asked because the question seemed to lead towards changing the width
that Clang uses for accesses.

> 3. Transform the machine instructions before register allocation to
> ensure that all accesses to the same address are the same size, or at
> least that the stores are of equal or greater size than the loads.
>
> I kind of like option 3, but I'm open to other suggestions.

What’s the concrete suggestion for (3), that we analyze the uses
of a single address and try to re-widen the access to make it use
a consistent width? Widening accesses is difficult in general,
unless the information is still present that it was originally
wider before DAG combine.

At any rate, seems like a backend problem. :)

John.

> What’s the concrete suggestion for (3), that we analyze the uses
> of a single address and try to re-widen the access to make it use
> a consistent width? Widening accesses is difficult in general,
> unless the information is still present that it was originally
> wider before DAG combine.

Alternatively, we could move the shortening logic out of DAG combine
and into a separate pass. I'll come up with a more concrete proposal.
In the meantime, some flags were added which may get us past the
initial issue. (Crossing fingers that it doesn't pessimise other
code.)

At any rate, seems like a backend problem. :)

Yes. :)

-bw