How to change Clang struct alignment behaviour?

My target has 2-byte (16-bit) ints.
For this target I would like Clang to always use 2-byte aligned padding for structs, matching the size of an int.

The current situation is that for this struct:

struct AA
{
  char n;
  char m;
  char j;
};

I get it aligned by 1:

%a = alloca %struct.AA, align 1

I would want it to implicitly use 4 bytes instead of 3 (i.e. be aligned to 2 bytes instead of 1).

If I replace the above struct by this

struct AA
{
  int n;
  char m;
};

I correctly get it aligned by 2:

%a = alloca %struct.AA, align 2

(note that my architecture is 16 bit, so ints are 16 bits)

In summary, I noticed that Clang computes struct alignment as the maximum of the alignments of the individual members.

How do I change that behaviour to get structs always (at least) 2-byte aligned?

John Lluch

Hi Joan,

How do I change that behaviour to get structs always (at least) 2-byte aligned?

I don't think there's a feature you can toggle for this (except,
maybe, making the alignment of every basic type 2 bytes; but that
would obviously affect arrays and even normal variables too). So you
probably have to modify lib/AST/RecordLayoutBuilder.cpp directly.
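
For what it's worth, if this is only about particular structs rather than
the compiler-wide default, a source-level attribute already gives the
layout you're after (illustration only, it doesn't change the default):

struct AA
{
  char n;
  char m;
  char j;
} __attribute__((aligned(2)));
// With the attribute the struct's alignment becomes 2 and its size is
// rounded up from 3 to 4 on a typical target; without it they stay 1 and 3.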

Might be worth asking this on the cfe-dev mailing list though. That's
where most of the Clang experts live.

Cheers.

Tim.

Hi Tim,

Thanks for your reply. That’s what I was afraid of. I will try the cfe-dev list as you suggest, though.

The reason I want structs to be aligned/padded to 2 bytes is that my architecture only has 16-bit operations. I can read (sign- or zero-extended) and write (truncated) 8-bit data from/to memory, but all intermediate operations in registers are performed on 16-bit registers. This causes LLVM to generate odd tricks such as shifts and byte swaps when trying to replace struct ‘memcpy’s with word-sized load/store instructions. For 16-bit aligned structs the ‘memcpy’ replacement code is much cleaner. That’s why I would want structs to always be 16-bit aligned/padded.

Joan Lluch

Hi Joan,

The reason I want structs to be aligned/padded to 2 bytes is that my architecture only has 16-bit operations. I can read (sign- or zero-extended) and write (truncated) 8-bit data from/to memory, but all intermediate operations in registers are performed on 16-bit registers.

This is very normal. Mostly it's at 32-bits rather than 16, but it
applies to basically every RISC architecture so LLVM should handle it
well without adjusting the alignment requirements of types.

This causes LLVM to generate odd tricks such as shifts and byte swaps when trying to replace struct ‘memcpy’s with word-sized load/store instructions.

That sounds odd, as if you've not taught the backend to use those
8-bit loads and stores so it's trying to emulate them with word-sized
ones (early Alpha chips genuinely didn't have byte access so had to do
that kind of thing). You can (and should) probably fix that.

Also, there are a few customization points where you can control how
memcpy is implemented. The function "findOptimalMemOpLowering" lets
you control the type used for the loads and stores, and
MaxStoresPerMemcpy controls when LLVM will call the real memcpy. If
you want even more control you can implement EmitTargetCodeForMemcpy
to do the whole thing.
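
For reference, the memcpy thresholds are just protected members you set in
your TargetLowering constructor; a rough sketch for a hypothetical MyTarget
backend (the target name is made up, the members are real TargetLoweringBase
fields):

MyTargetTargetLowering::MyTargetTargetLowering(const TargetMachine &TM)
    : TargetLowering(TM) {
  // Expand memcpy inline with up to 4 stores (2 when optimising for size);
  // beyond that LLVM emits a call to the real memcpy.
  MaxStoresPerMemcpy = 4;
  MaxStoresPerMemcpyOptSize = 2;
}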

Cheers.

Tim.

I had already adjusted MaxStoresPerMemcpy to my preferred value, and this works great, but for the cases where load/stores are used on structs whose size is not a multiple of the alignment the odd behaviour still happens. For my 3-char, 3-byte struct test, the memcpy replacement consists of a single-byte load and store of the last char (this is correct), followed by a 16-bit move of the first two chars. This is also correct, but the odd thing is that the 16-bit move is performed by picking the bytes separately from the source struct, combining them into a 16-bit value by means of shifts, a swap and an or, and then moving that as a word to the destination. I don’t know why simple 16-bit loads and stores are not used instead. I will try to trace findOptimalMemOpLowering with the debugger to see if it sheds some light. So thanks for pointing that out to me.

John

Hi Tim,

After looking at it a bit further, I think this is a Clang thing. Clang issues “align 2” if the struct has at least one int (2 bytes), but also if the entire struct size is a multiple of 2, for example a struct with 4 char members. In these cases the LLVM backend correctly creates word-sized (2-byte) load/stores.

However, Clang issues “align 1” for the remaining cases, for example for 3-byte structs with 3 char members. The LLVM backend just follows what’s dictated by Clang regarding alignment, and thus creates 2-byte or 1-byte load/store instructions accordingly. I have not found a way to override this in LLVM. Any suggestions are appreciated.

I have also posted this problem on the cfe-dev list.

John Lluch

Hi Joan,

After looking at it a bit further, I think this is a Clang thing. Clang issues “align 2” if the struct has at least one int (2 bytes), but also if the entire struct size is a multiple of 2, for example a struct with 4 char members. In these cases the LLVM backend correctly creates word-sized (2-byte) load/stores.

I'm slightly surprised that it happens based purely on size, but
either way LLVM should be able to cope.

The LLVM backend just follows what’s dictated by Clang regarding alignment, and thus creates 2-byte or 1-byte load/store instructions accordingly. I have not found a way to override this in LLVM. Any suggestions are appreciated.

That sounds right, but I don't think it explains the shifts you
described before. It should work out a lot better than what you're
seeing. Specifically, a 3-byte struct (for example) ought to lower to
either:

    load i16, load i8 + stores if your target can do misaligned i16 operations.

or

    load i8, load i8, load i8 + stores if not.

Neither of those involves shifting operations. I'd suggest breaking
just after getMemcpyLoadsAndStores and using SelectionDAG::dump to see
exactly what it has created. Then try to work out where that gets
pessimized into shifts, because it's not normal.
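
If a debugger is inconvenient, a couple of temporary lines inside
SelectionDAG::getMemcpy, right after the call to getMemcpyLoadsAndStores,
do the same job (sketch only, for a +Asserts build):

// Temporary instrumentation (sketch): print the DAG right after the memcpy
// has been turned into loads/stores, before any legalization runs.
dbgs() << "DAG after memcpy expansion:\n";
dump();   // SelectionDAG::dump() prints every node in the current DAG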

Cheers.

Tim.

Also: I'm still pretty certain that "solving" this in Clang is the
wrong approach.

Cheers.

Tim.

Hi Tim,

I agree this should be “solved” in LLVM.

Just as a matter of information, this is the test source code and the generated assembly for my target architecture (more comments below):

struct AA
{
  char n;
  char m;
  char j;
};

struct BB
{
  char n;
  char m;
  char j;
  char k;
};

extern void convertA( struct AA *a);
extern void convertB( struct BB *b);

void callConvertA()
{
  struct AA a = {3, 4};
  convertA( &a );
}

void callConvertB()
{
  struct BB b = {3, 4};
  convertB( &b );
}

callConvertA: ; @callConvertA
; %bb.0: ; %entry
sub SP, #4, SP
mov &.LcallConvertA.a, r0
ld.sb [r0, #2], r1
st.b r1, [SP, #2]
ld.sb [r0, #0], r1
zext r1, r1
ld.sb [r0, #1], r0
zext r0, r0
bswap r0, r0
or r0, r1, r0
st.w r0, [SP, #0]
mov SP, r0
call &convertA
add SP, #4, SP
ret
.Lfunc_end0:
.size callConvertA, .Lfunc_end0-callConvertA
; -- End function
.globl callConvertB ; -- Begin function callConvertB
.p2align 1
.type callConvertB,@function
callConvertB: ; @callConvertB
; %bb.0: ; %entry
sub SP, #4, SP
mov #0, r0
st.w r0, [SP, #2]
mov #1027, r0
st.w r0, [SP, #0]
mov SP, r0
call &convertB
add SP, #4, SP
ret
.Lfunc_end1:
.size callConvertB, .Lfunc_end1-callConvertB
; -- End function
.type .LcallConvertA.a,@object ; @callConvertA.a
.section .rodata,"a",@progbits
.LcallConvertA.a:
.byte 3 ; 0x3
.byte 4 ; 0x4
.byte 0 ; 0x0
.size .LcallConvertA.a, 3

Please note that for this architecture the destination operand is on the RIGHT HAND SIDE; this is important for reading the assembly correctly.

“ld.sb" are signextend 8 bit loads
“st,b” are 8 bit trunc stores
“ld.w” are 16 bit word loads
“st.w” are 16 bit word stores

The generated code is functionally correct, but:

  • For callConvertA, i8, i8, i8 loads + i8, i16 stores are generated. To combine the two i8 loads into an i16 store, the “zext” + “bswap” (equivalent to a shift) + “or” trick is performed.
  • For callConvertB, i16, i16 loads + i16, i16 stores are generated.

The desired behaviour would be i16, i8 loads + i16, i8 stores for callConvertA.

The only difference that I can spot by debugging the LLVM source code is that getMemcpyLoadsAndStores is called with align = 1 for callConvertA, but it is NOT called for callConvertB.

The Clang generated IR is this:

; ModuleID = 'add.c'
source_filename = "add.c"
target datalayout = "e-m:e-p:16:16-i32:16-i64:16-f32:16-f64:16-a:8-n8:16-S16"
target triple = "cpu74"

%struct.AA = type { i8, i8, i8 }
%struct.BB = type { i8, i8, i8, i8 }

@callConvertA.a = private unnamed_addr constant %struct.AA { i8 3, i8 4, i8 0 }, align 1

; Function Attrs: minsize nounwind optsize
define dso_local void @callConvertA() local_unnamed_addr #0 {
entry:
%a = alloca %struct.AA, align 1
%0 = getelementptr inbounds %struct.AA, %struct.AA* %a, i16 0, i32 0
call void @llvm.lifetime.start.p0i8(i64 3, i8* nonnull %0) #3
call void @llvm.memcpy.p0i8.p0i8.i16(i8* nonnull align 1 %0, i8* align 1 getelementptr inbounds (%struct.AA, %struct.AA* @callConvertA.a, i16 0, i32 0), i16 3, i1 false)
call void @convertA(%struct.AA* nonnull %a) #4
call void @llvm.lifetime.end.p0i8(i64 3, i8* nonnull %0) #3
ret void
}

; Function Attrs: argmemonly nounwind
declare void @llvm.lifetime.start.p0i8(i64, i8* nocapture) #1

; Function Attrs: argmemonly nounwind
declare void @llvm.memcpy.p0i8.p0i8.i16(i8* nocapture writeonly, i8* nocapture readonly, i16, i1) #1

; Function Attrs: minsize optsize
declare dso_local void @convertA(%struct.AA*) local_unnamed_addr #2

; Function Attrs: argmemonly nounwind
declare void @llvm.lifetime.end.p0i8(i64, i8* nocapture) #1

; Function Attrs: minsize nounwind optsize
define dso_local void @callConvertB() local_unnamed_addr #0 {
entry:
%b = alloca i32, align 2
%tmpcast = bitcast i32* %b to %struct.BB*
%0 = bitcast i32* %b to i8*
call void @llvm.lifetime.start.p0i8(i64 4, i8* nonnull %0) #3
store i32 1027, i32* %b, align 2
call void @convertB(%struct.BB* nonnull %tmpcast) #4
call void @llvm.lifetime.end.p0i8(i64 4, i8* nonnull %0) #3
ret void
}

; Function Attrs: minsize optsize
declare dso_local void @convertB(%struct.BB*) local_unnamed_addr #2

attributes #0 = { minsize nounwind optsize "correctly-rounded-divide-sqrt-fp-math"="false" "disable-tail-calls"="false" "less-precise-fpmad"="false" "no-frame-pointer-elim"="true" "no-frame-pointer-elim-non-leaf" "no-infs-fp-math"="false" "no-jump-tables"="false" "no-nans-fp-math"="false" "no-signed-zeros-fp-math"="false" "no-trapping-math"="false" "stack-protector-buffer-size"="8" "unsafe-fp-math"="false" "use-soft-float"="false" }
attributes #1 = { argmemonly nounwind }
attributes #2 = { minsize optsize "correctly-rounded-divide-sqrt-fp-math"="false" "disable-tail-calls"="false" "less-precise-fpmad"="false" "no-frame-pointer-elim"="true" "no-frame-pointer-elim-non-leaf" "no-infs-fp-math"="false" "no-nans-fp-math"="false" "no-signed-zeros-fp-math"="false" "no-trapping-math"="false" "stack-protector-buffer-size"="8" "unsafe-fp-math"="false" "use-soft-float"="false" }
attributes #3 = { nounwind }
attributes #4 = { minsize nounwind optsize }

!llvm.module.flags = !{!0}
!llvm.ident = !{!1}

!0 = !{i32 1, !"wchar_size", i32 2}
!1 = !{!"clang version 7.0.1 (tags/RELEASE_701/final)"}

John

Hi Tim,

I tracked the issue down to an LLVM omission or bug, or at least that’s what I think. The issue can easily be reproduced for targets that do not support misaligned memory accesses and do not implement allowsMisalignedMemoryAccesses. Let me try to explain what happens:

In getMemcpyLoadsAndStores, the function FindOptimalMemOpLowering is called to obtain the list of store operations that should be generated. If the memcpy destination is the non-fixed section of the stack, a value of zero is passed as the destination alignment (DstAlign) to signal that it can be “changed”.

FindOptimalMemOpLowering starts by calling the target’s getOptimalMemOpType. Most targets implement the latter by returning the largest possible EVT to be used for creating the load/store pairs, which is generally fine for targets that can handle unaligned load/stores. If this hook is not implemented, FindOptimalMemOpLowering does the equivalent itself by finding the largest legal integer type; again, this is fine if the target supports misaligned accesses. Upon return, FindOptimalMemOpLowering has found the minimal list of memory operations required to complete the memcpy.

Next, getMemcpyLoadsAndStores continues by generating the required load/store pairs, which results in loads/stores that are too wide for targets that do not support misaligned accesses.

Later on, during the legalizeDAG phase, LegalizeLoadOps is invoked, which eventually calls allowsMemoryAccess; only at this point does that call allowsMisalignedMemoryAccesses and determine that the too-wide loads must be replaced by the odd shift-and-OR aggregates of smaller loads. The latter happens inside expandUnalignedLoad.

I would have expected that, for targets not supporting misaligned accesses, the list of load/stores generated around getMemcpyLoadsAndStores would respect that target limitation and not produce misaligned load/stores to begin with. Instead, the default implementation defers the problem and ends up replacing such incorrect load/stores with the odd expanded aggregates.
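
For reference, targets that genuinely can perform such accesses avoid
expandUnalignedLoad by overriding the hook; a sketch for a hypothetical
target follows (the exact parameter list varies a little between LLVM
versions):

bool MyTargetTargetLowering::allowsMisalignedMemoryAccesses(
    EVT VT, unsigned AddrSpace, unsigned Align, bool *Fast) const {
  // Hypothetical policy: allow misaligned 16-bit accesses only, and report
  // them as slow so they are not preferred when an aligned choice exists.
  if (VT == MVT::i16) {
    if (Fast)
      *Fast = false;
    return true;
  }
  return false;
}

My target genuinely cannot do this, which is why the fix has to happen
before the wide load/stores are created in the first place.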

The workaround that I found was to implement getOptimalMemOpType so that, rather than picking the widest possible EVT to minimise the number of load/stores, it returns one that never gets wider than SrcAlign. So this is what I have now:

EVT CPU74TargetLowering::getOptimalMemOpType(uint64_t Size,
    unsigned DstAlign, unsigned SrcAlign, bool IsMemset, bool ZeroMemset,
    bool MemcpyStrSrc, MachineFunction &MF) const
{
  if ( DstAlign == 0 && !IsMemset )
  {
    // The destination alignment is changeable: pick a type no wider than
    // the source alignment so that no misaligned accesses are created.
    DstAlign = SrcAlign;
    EVT VT = MVT::i64;
    while (DstAlign && DstAlign < VT.getSizeInBits() / 8)
      VT = (MVT::SimpleValueType)(VT.getSimpleVT().SimpleTy - 1);

    return VT;
  }

  return MVT::Other;
}

Maybe this is what we are expected to do after all, but it still looks somewhat weird to me that the default LLVM implementation does not handle targets without misaligned access support in a more appropriate way. Solving this issue is just a matter of including the code above in FindOptimalMemOpLowering for such targets. The current implementation does just the opposite, and will even use an MVT::i64 destination type when the source alignment only allows MVT::i8, which I regard as totally wrong.

This problem is also shared by the MSP430 target, and it’s very easy to reproduce by just compiling the code that I posted before.

Thanks

John Lluch

Hi John,

Hi Tim,

I am not sure what you mean by “upstreaming” the function. I attach a text file with my proposed changes in case you can do something with it. It seems to me that the programmer who wrote this function was already considering this, or maybe somebody broke it later, because it turns out that the problem is just a missing statement in the function, which I have now added. I have tested it as best I could for several targets, but I have not applied any systematic testing procedure.

I also found that not only the MSP430 target is affected, but the MIPS16 target too.

For MIPS16 you also need to remove getOptimalMemOpType or make it return MVT::Other, because the current implementation, by returning a concrete MVT, prevents the proposed change from taking effect. The change benefits MIPS16 (llc -march=mips -mips-os16) because it is a subtarget that does not support unaligned accesses.

Thanks for your help. The function is attached; it just has one added statement.

John Lluch

FindOptimalMemOpLowering.txt (4.22 KB)

Hello Tim,

After a second look at the topic, I have a better understanding of what’s going on.

In the original implementation, FindOptimalMemOpLowering and getOptimalMemOpType are called to find the minimal set of load/store pairs to replace a memcpy operation. So, for example, a 5-byte memcpy on a 32-bit architecture gets replaced by a 32-bit load/store and an 8-bit extload/truncstore. Similarly, on a 16-bit architecture it gets replaced by two 16-bit load/stores and one 8-bit extload/truncstore. That’s the optimal way to deal with it; however, for targets not implementing the TargetLowering::allowsMisalignedMemoryAccesses function, this causes a call to expandUnalignedLoad at a later stage, which inserts the undesired shift/or code sequences.

In my previous fix I attempted to solve the issue by getting FindOptimalMemOpLowering and getOptimalMemOpType to use the smallest possible alignment for both the source and the destination. This avoids the shift/or code because the generated load/stores are always of the smaller type. It kind of works, but that approach generates a larger number of load/store pairs than the original implementation: for example, a 5-byte memcpy generates 5 single-byte load/store pairs, which may be more or less optimal depending on the target and on whether you aim to optimise for speed or size.

The ultimate reason for the original problem is that the ‘load’ instructions are conservatively created with the source operand’s alignment, but the loads never actually access misaligned offsets. Indeed, all the replacement load/stores (regardless of size) are created at offsets that are aligned with respect to the start of the memcpy operands, regardless of whether the target supports misaligned accesses. This means that there is no need to expand the load/stores into shifts/ors even for targets not supporting misaligned accesses, PROVIDED THAT we know the initial operand addresses of the memcpy were aligned already.

So this leads to a problem with a more difficult solution: we need to infer more information from the memcpy operands. The current getMemcpyLoadsAndStores implementation already checks whether the destination is the stack frame, and uses this to give the stores a bigger alignment. For the source operand the function calls SelectionDAG::InferPtrAlignment. Unfortunately, the latter always returns 0 rather than the proper alignment for the source node, which is the seed of the problem. Instead, if we already know that pointers to a Global Address base are aligned to 2 for a particular architecture, we should use that information to set SrcAlign, and therefore produce an optimal sequence of load/stores for the memcpy replacement.

We need to find the nature of the source and set SrcAlign to the correct value (greater than 1) at the beginning of getMemcpyLoadsAndStores. That would be the perfect fix.
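
To make that concrete, the kind of check I have in mind at the top of
getMemcpyLoadsAndStores is roughly this (sketch only):

// Sketch: if the memcpy source is a global address, let SrcAlign reflect
// the alignment recorded on the global's definition instead of 1.
if (auto *GA = dyn_cast<GlobalAddressSDNode>(Src))
  if (auto *GV = dyn_cast<GlobalVariable>(GA->getGlobal()))
    SrcAlign = std::max(SrcAlign, GV->getAlignment());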

Alternatively, we could pass the correct Alignment argument to the getMemcpy function in the first place.

So my question now goes back to my original one, how do I do that?

By the way, if I manually set the alignment to 2 in the Clang output file, for example on the line below (which corresponds to a global variable declaration):

@ga = dso_local local_unnamed_addr global %struct.AA { i8 3, i8 4, i8 5, i8 6 }, align 1

replaced by

@ga = dso_local local_unnamed_addr global %struct.AA { i8 3, i8 4, i8 5, i8 6 }, align 2

then everything turns out perfect inside LLVM.

So again, I need a hook either in Clang or in LLVM to set the desired alignment on global variables and stack allocas.

Any ideas?

John Lluch

In terms of increasing the alignment of a variable, there’s an API, getOrEnforceKnownAlignment, which is used by LLVM optimizations to increase the alignment of global variables and stack allocations when it appears profitable. There currently isn’t any transform that calls it for memcpy operations, but it might be worth implementing one for cases where the size of the memcpy is a small constant.
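
For illustration, a rough sketch of what such a transform could look like
(the function name and thresholds are made up; getOrEnforceKnownAlignment
is declared in llvm/Transforms/Utils/Local.h):

#include "llvm/IR/Constants.h"
#include "llvm/IR/DataLayout.h"
#include "llvm/IR/IntrinsicInst.h"
#include "llvm/Transforms/Utils/Local.h"

using namespace llvm;

// Sketch: for a small constant-size memcpy, try to raise both operands to
// the target's natural word alignment so the expansion can use word ops.
static void enforceWordAlignmentForMemcpy(MemCpyInst *MCI,
                                          const DataLayout &DL) {
  auto *Len = dyn_cast<ConstantInt>(MCI->getLength());
  if (!Len || Len->getZExtValue() > 16)
    return;                              // only small, constant-size copies
  const unsigned WordAlign = 2;          // assumes a 16-bit target
  getOrEnforceKnownAlignment(MCI->getDest(), WordAlign, DL, MCI);
  getOrEnforceKnownAlignment(MCI->getSource(), WordAlign, DL, MCI);
}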

-Eli