About CodeGen quality

Hi All,

Is there a known issue where LLVM generates bad code for some language constructs, say C bitfields?
Our custom backend generates inefficient code for bitfield accesses, so I am wondering where
I should look first.

Thanks.

Regards,
chenwj

Would probably help if you explained which backend you are working on (assuming it’s a publicly available one). An example, with source that can be compiled by “anyone”, along with the generated “bad code” and what you expect to see as “good code” would also help a lot.

From the things I’ve seen, it’s not noticeably worse (or better) than other compilers. But it’s not an area that I’ve spent a LOT of time on, and the combination of generic LLVM operations and the target implementation will determine the outcome - there are lots of clever tricks one can do at the machine-code level that LLVM can’t “know” about in generic ways, since they depend on specific instructions. Most of my experience comes from x86 and ARM, both of which are fairly well established architectures with a good number of people supporting the code-gen part. If you are using a different target, there may be missing target optimisations that the compiler could do.

I probably can’t really help, just trying to help you make the question as clear as possible, so that those who may be able to help have enough information to work on.

Hi Mats,

It’s a private backend. I will try describing what I am dealing with.

    struct S {
      unsigned int a : 8;
      unsigned int b : 8;
      unsigned int c : 8;
      unsigned int d : 8;

      unsigned int e;
    };

We want to read S->b, for example. The size of struct S is 64 bits, and LLVM seems to treat it as an i64.
Below is the IR corresponding to reading S->b, IIRC.

    %0 = load i64, i64* %ptr, align 4
    %1 = lshr i64 %0, 8
    %2 = and i64 %1, 255

Our target doesn’t support load i64, so we have the following code in XXXISelLowering.cpp:

    setOperationAction(ISD::LOAD, MVT::i64, Custom);

and transform load i64 into load v2i32 during type legalization. During op legalization, the load v2i32
is found to be unaligned (4 vs. 8), so stack load/store instructions are generated. This is one problem.
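Roughly, the custom hook does something like this (a simplified sketch rather than our real code; the XXX names are placeholders):

    // Reached during type legalization for the illegal i64 load
    // (simplified; in practice this runs via ReplaceNodeResults).
    SDValue XXXTargetLowering::LowerLOAD(SDValue Op, SelectionDAG &DAG) const {
      LoadSDNode *LD = cast<LoadSDNode>(Op.getNode());
      SDLoc DL(Op);

      // Re-issue the i64 load as v2i32 with the original chain, pointer
      // and alignment, then hand the i64 users a bitcast of the vector.
      SDValue NewLoad =
          DAG.getLoad(MVT::v2i32, DL, LD->getChain(), LD->getBasePtr(),
                      LD->getPointerInfo(), LD->getAlignment());
      SDValue Res = DAG.getNode(ISD::BITCAST, DL, MVT::i64, NewLoad);

      // Return the value and the new chain together.
      return DAG.getMergeValues({Res, NewLoad.getValue(1)}, DL);
    }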

Besides that, our target has bitset/bitextract instructions, and we want to use them for bitfield
accesses too, but we don’t know how to do that.

Thanks.

Regards,
chenwj

I understand the problem. Can’t offer any useful help - most likely, you need to add some code to help the instruction selection or some such… but it’s not an area that I’m familiar with…

I may be out to lunch here but this sounds like something that SROA converts into an i64 load. I wonder if disabling it produces IR that is easier for your target to handle.
Of course, this isn’t to say that simply disabling SROA is a viable solution, but it may give you some ideas as to where to go in terms of looking for a solution.

You also may be able to combine such patterns in the SDAG (before legalization) into loads that your target can handle.

This is all kind of speculative but hopefully it sheds some light on what might be going on.

This looks fine. If misaligned load v2i32 isn’t legal, don’t generate it. If it is legal, you might need to mess with your implementation of allowsMisalignedMemoryAccesses.

As for using your bitset/bitextract instructions: this is generally implemented by pattern-matching the shift and mask operations. ARM has instructions like this if you’re looking for inspiration; look for UBFX, SBFX and BFI.
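For example, the override might look something like this (just a sketch; what is actually fast or legal is up to your target):

    bool XXXTargetLowering::allowsMisalignedMemoryAccesses(
        EVT VT, unsigned AddrSpace, unsigned Align, bool *Fast) const {
      // Hypothetical rules: wide (64-bit) accesses must be naturally
      // aligned, so returning false here makes legalization expand them
      // rather than emit a misaligned wide load.
      if (VT == MVT::i64 || VT == MVT::v2i32)
        return false;
      if (Fast)
        *Fast = true;
      return true;
    }

-Eli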

Our target doesn't support load i64, so we have the following code
in XXXISelLowering.cpp:

    setOperationAction(ISD::LOAD, MVT::i64, Custom);

and transform load i64 into load v2i32 during type legalization.

If misaligned load v2i32 isn't legal, don't generate it. If it is legal,
you might need to mess with your implementation of
allowsMisalignedMemoryAccesses.

Will check that. Just a little more explanation about the misaligned
part: we declare i64 as 8-byte aligned in the DataLayout, and in "%0 =
load i64, i64* %ptr, align 4" the alignment is 4. In the op legalization
stage, it goes through

    SelectionDAGLegalize::LegalizeLoadOps ->
        TargetLowering::expandUnalignedLoad

We don't expect a load i64 to be 4-aligned, so how do I know beforehand
that I will generate a misaligned load v2i32? Another question: what do we
usually do to handle a load i64 that is not natively supported? Is
transforming load i64 into load v2i32 correct? An existing backend example
would be great.

Besides that, our target has bitset/bitextract instructions, and we want to
use them for bitfield accesses too, but we don't know how to do that.

This is generally implemented by pattern-matching the shift and mask
operations. ARM has instructions like this if you're looking for
inspiration; look for UBFX, SBFX and BFI.

Thanks. Having an example is good. :)

Regards,
chenwj

You can get the alignment by casting the SDNode to LoadSDNode, then calling getAlignment().

I think all in-tree backends which don’t have 64-bit integer registers use the default expansion for an i64 load, which splits the load into two i32 loads.
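In code, that looks roughly like this (sketch):

    // Inside e.g. LowerOperation or a DAG combine:
    if (auto *LD = dyn_cast<LoadSDNode>(Op.getNode())) {
      unsigned Alignment = LD->getAlignment(); // 4 for the IR in this thread
      if (Alignment < 8) {
        // choose a lowering that tolerates 4-byte alignment
      }
    }

Note that if i64 is simply not a legal type for your target (no 64-bit registers), you don’t need a Custom hook at all; the type legalizer performs the two-i32 split by itself.

-Eli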

Forgot to reply to all

Hi Eli,

    struct S {
      unsigned int a : 8;
      unsigned int b : 8;
      unsigned int c : 8;
      unsigned int d : 8;

      unsigned int e;
    };

We want to read S->b, for example. The size of struct S is 64 bits, and
LLVM seems to treat it as an i64.
Below is the IR corresponding to reading S->b, IIRC.

    %0 = load i64, i64* %ptr, align 4
    %1 = lshr i64 %0, 8
    %2 = and i64 %1, 255

This looks fine.

Why can't we expect InstCombine to simplify this to an 8-bit load, assuming
each of %0 and %1 has only one use?

Thanks
Ehsan

Here is the complete IR I am dealing with, if that helps the discussion.

We don't aggressively narrow loads and stores in IR because it tends to block other optimizations. See https://reviews.llvm.org/D30416.

-Eli

I guess it tends to block cross-block optimization opportunities? Or does it just happen
that https://reviews.llvm.org/D30416 is one example?

Regards,
chenwj

For this specific case, modifying the source code (or the frontend) to use plain struct fields instead of
bitfields seems to be an easy way, since all the bitfields here are 8 bits wide. But you cannot?
For general cases, you may want to enhance the backend to emit bitextract/bitset, as sketched below.
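For example, a target DAG combine along these lines could fold the shift-and-mask pattern into a bit-extract node (a rough sketch; XXXISD::BEXT is a made-up opcode standing in for whatever your target calls it):

    // Match (and (srl X, C1), C2) where C2 is a low-bit mask, and turn it
    // into BEXT X, lsb, width == (X >> lsb) & ((1 << width) - 1).
    static SDValue combineAndToBitExtract(SDNode *N, SelectionDAG &DAG) {
      SDValue Srl = N->getOperand(0);
      auto *Mask = dyn_cast<ConstantSDNode>(N->getOperand(1));
      if (!Mask || Srl.getOpcode() != ISD::SRL)
        return SDValue();
      auto *Shift = dyn_cast<ConstantSDNode>(Srl.getOperand(1));
      if (!Shift || !isMask_64(Mask->getZExtValue()))
        return SDValue();

      SDLoc DL(N);
      unsigned Lsb = Shift->getZExtValue();
      unsigned Width = countTrailingOnes(Mask->getZExtValue());
      return DAG.getNode(XXXISD::BEXT, DL, N->getValueType(0),
                         Srl.getOperand(0),
                         DAG.getConstant(Lsb, DL, MVT::i32),
                         DAG.getConstant(Width, DL, MVT::i32));
    }

You would call this from PerformDAGCombine on ISD::AND nodes; ARM's UBFX selection does essentially the same matching.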

The original test case was modified into the form above; the bitfields are not always
8 bits.
And sure, emitting bitextract/bitset would improve the code quality.

Regards,
chenwj

I guess my previous reply was a bit too terse. We don’t want to narrow loads early in the optimization pipeline because it tends to hide relationships between bitfields. In particular, overlapping loads/stores are more difficult to optimize. We want EarlyCSE/GVN/DSE/etc. to see the full-width loads and stores to make them more effective.

Currently, we do narrowing in DAGCombine. D30416 is a proposal to move that slightly earlier, to the late stages of the IR optimization pipeline, so we can avoid the limitations of SelectionDAG. If you read the whole discussion on that patch, it goes into more detail about this.
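As a small example of the kind of relationship this preserves (struct T here is hypothetical, not the one from this thread):

    struct T {
      unsigned a : 12; // straddles bytes 0-1
      unsigned b : 12; // straddles bytes 1-2
      unsigned c : 8;
    };

    void set_ab(struct T *t) {
      // At full width each assignment is a load/mask/or/store of the same
      // i32 word, so EarlyCSE/GVN/DSE can forward the load and kill the
      // first store, leaving one load and one store.  Narrowed early, the
      // two accesses overlap at byte 1 and that relationship is hidden.
      t->a = 1;
      t->b = 2;
    }

-Eli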

I guess my previous reply was a bit too terse.

We don't want to narrow loads early in the optimization pipeline because
it tends to hide relationships between bitfields. In particular,
overlapping loads/stores are more difficult to optimize. We want
EarlyCSE/GVN/DSE/etc. to see the full-width loads and stores to make them
more effective.

Currently, we do narrowing in DAGCombine. D30416 is a proposal to move
that slightly earlier, to the late stages of the IR optimization pipeline,
so we can avoid the limitations of SelectionDAG.

If you read the whole discussion on that patch, it goes into more detail
about this.

Thanks. I was only reading the summary of the patch. :)

Regards,
chenwj