movaps being generated despite alignment 1 being specified

Hello LLVMers,

High order bit:

The presence of a called function is causing a store to an unrelated vector to be emitted as an aligned store rather than an unaligned one, even though the associated StoreInst specifies an unaligned store.

Details:

I pulled down the latest source, so this is something I’m finding with the current LLVM. I’m hoping you’ll have an idea what’s going on or at least know if it’s a new issue I should log. It’s related to the stack alignment issue that I know is being worked on, but seems sufficiently different to ask about it here. I checked the bug database for “align” and “movaps” and didn’t see this issue raised.

Ok, the first bit of code here seems to generate correct assembly for me. Basically, it takes the float4 stored at globalV and copies it into the address pointed to by dependentV. Along the way, it creates a <4 x float> and copies globalV into a temporary. I’m working on bridging the gap between the outside of our system and the LLVM-generated code, so there is a little extra copying from and to parameters at the boundaries of this function. Since this is just a repro example, there is very little besides the boundaries here. :) I fully admit the constructions below may not be optimal.

; ModuleID = 'hydra'
target datalayout = "E-p:32:32:32-i1:8:8:8-i8:8:8:8-i32:32:32:32-f32:32:32:32"

define void @evaluateDependents(float* %dependentV, float* %globalV) {
Entry_evaluateDependents:
  %Promoted_dependentV_Ptr = alloca <4 x float>, align 16 ; <<4 x float>*> [#uses=2]
  %Promoted_globalV_Ptr = alloca <4 x float>, align 16 ; <<4 x float>*> [#uses=2]
  %externalVectorPtrCast = bitcast float* %globalV to <4 x float>* ; <<4 x float>*> [#uses=1]
  %externalVectorLoaded = load <4 x float>* %externalVectorPtrCast, align 1 ; <<4 x float>> [#uses=1]
  store <4 x float> %externalVectorLoaded, <4 x float>* %Promoted_globalV_Ptr, align 1
  %globalV1 = load <4 x float>* %Promoted_globalV_Ptr, align 1 ; <<4 x float>> [#uses=1]
  br label %Body_evaluateDependents

Body_evaluateDependents: ; preds = %Entry_evaluateDependents
  store <4 x float> %globalV1, <4 x float>* %Promoted_dependentV_Ptr, align 1
  br label %Exit_evaluateDependents

Exit_evaluateDependents: ; preds = %Body_evaluateDependents
  %vectorToDemote = load <4 x float>* %Promoted_dependentV_Ptr, align 1 ; <<4 x float>> [#uses=1]
  %externalVectorPtrCast2 = bitcast float* %dependentV to <4 x float>* ; <<4 x float>*> [#uses=1]
  store <4 x float> %vectorToDemote, <4 x float>* %externalVectorPtrCast2, align 1
  ret void
}

This produces the following instructions, which obey all the align 1 directives on the LoadInsts and StoreInsts:

15D10010 sub esp,2Ch
15D10013 mov eax,dword ptr [esp+34h]
15D10017 movups xmm0,xmmword ptr [eax]
15D1001A movups xmmword ptr [esp],xmm0
15D1001E mov eax,dword ptr [esp+30h]
15D10022 movups xmmword ptr [esp+10h],xmm0
15D10027 movups xmm0,xmmword ptr [esp+10h]
15D1002C movups xmmword ptr [eax],xmm0
15D1002F add esp,2Ch
15D10032 ret

Here’s where it gets weird and confusing to me. Let’s make our evaluateDependents function do something else. In addition to copying globalV into dependentV, it’s also going to set a singleton float pointed to by dependentF. We’ll call a function foo to get the value. (I tried setting dependentF directly, and that did NOT cause the problem with the generated code.) Here’s the LLVM code:

; ModuleID = 'hydra'
target datalayout = "E-p:32:32:32-i1:8:8:8-i8:8:8:8-i32:32:32:32-f32:32:32:32"

define float @foo(float %Y) {
Entry_foo:
  %_ReturnValuePtr = alloca float ; <float*> [#uses=2]
  br label %Body_foo

Body_foo: ; preds = %Entry_foo
  store float %Y, float* %_ReturnValuePtr, align 1
  br label %Exit_foo

Exit_foo: ; preds = %Body_foo
  %finalValue = load float* %_ReturnValuePtr, align 1 ; <float> [#uses=1]
  ret float %finalValue
}

define void @evaluateDependents(float* %dependentF, float* %dependentV, float* %globalV) {
Entry_evaluateDependents:
  %Promoted_dependentV_Ptr = alloca <4 x float>, align 16 ; <<4 x float>*> [#uses=2]
  %Promoted_globalV_Ptr = alloca <4 x float>, align 16 ; <<4 x float>*> [#uses=2]
  %externalVectorPtrCast = bitcast float* %globalV to <4 x float>* ; <<4 x float>*> [#uses=1]
  %externalVectorLoaded = load <4 x float>* %externalVectorPtrCast, align 1 ; <<4 x float>> [#uses=1]
  store <4 x float> %externalVectorLoaded, <4 x float>* %Promoted_globalV_Ptr, align 1
  %globalV1 = load <4 x float>* %Promoted_globalV_Ptr, align 1 ; <<4 x float>> [#uses=1]
  br label %Body_evaluateDependents

Body_evaluateDependents: ; preds = %Entry_evaluateDependents
  %fooResult = call float @foo( float 2.000000e+000 ) ; <float> [#uses=1]
  store float %fooResult, float* %dependentF, align 1
  store <4 x float> %globalV1, <4 x float>* %Promoted_dependentV_Ptr, align 1
  br label %Exit_evaluateDependents

Exit_evaluateDependents: ; preds = %Body_evaluateDependents
  %vectorToDemote = load <4 x float>* %Promoted_dependentV_Ptr, align 1 ; <<4 x float>> [#uses=1]
  %externalVectorPtrCast2 = bitcast float* %dependentV to <4 x float>* ; <<4 x float>*> [#uses=1]
  store <4 x float> %vectorToDemote, <4 x float>* %externalVectorPtrCast2, align 1
  ret void
}

Here are the instructions for evaluateDependents. The JITter hasn’t compiled foo yet. What confuses me is why my movups suddenly became a movaps. All the stores and loads have align 1 on them.

15D10012 sub esp,4Ch
15D10015 mov eax,dword ptr [esp+60h]
15D10019 movups xmm0,xmmword ptr [eax]
15D1001C movaps xmmword ptr [esp+8],xmm0    <-- why did this become a movaps?
15D10021 movups xmmword ptr [esp+28h],xmm0
15D10026 mov esi,dword ptr [esp+58h]
15D1002A mov edi,dword ptr [esp+5Ch]
15D1002E mov dword ptr [esp],40000000h
15D10035 call X86CompilationCallback (1335030h)
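
A side note on the listing above: the immediate 40000000h stored to [esp] appears to be the outgoing argument for the call to foo, since 0x40000000 is the IEEE-754 single-precision bit pattern for 2.0, the constant passed in the IR. A minimal Python sketch to check that reading (the helper name bits_to_float is made up for illustration and is not part of LLVM):

```python
import struct

def bits_to_float(bits: int) -> float:
    """Reinterpret a 32-bit integer bit pattern as an IEEE-754 single-precision float."""
    # Pack as an unsigned 32-bit little-endian integer, then unpack the same bytes as a float.
    return struct.unpack("<f", struct.pack("<I", bits))[0]

# The immediate from `mov dword ptr [esp],40000000h`.
arg = bits_to_float(0x40000000)
print(arg)  # 2.0
```

So the store to [esp] is most likely just argument setup and is unrelated to the vector stores in question.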

Thanks for the help!

Chuck.

This probably means the compiler believes the stack pointer is 16-byte aligned in non-leaf functions.

This would be correct if (a) the SP was aligned coming in and (b) the size of the stack decrement (including return address, etc.) is a multiple of 16. I haven’t been following the Linux problems closely, but I think “the stack issue being worked on” is that (a) is not always correct?
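
To make that arithmetic concrete, here is a rough Python sketch that checks the stack slots from the two listings earlier in the thread. It assumes a 32-bit cdecl-style frame in which the return address sits 4 bytes below the first stack argument (so at esp+2Ch in the first listing and esp+54h in the second), and that esp was 16-byte aligned just before the call instruction. Neither assumption is stated in the thread, and the function name slot_residues and its parameters are hypothetical.

```python
def slot_residues(ret_addr_offset, slot_offsets, call_site_residue=0,
                  alignment=16, ptr_size=4):
    """Return {slot offset: (esp + offset) mod alignment} after the prologue.

    call_site_residue is esp mod alignment just before the `call` executes
    (0 means the caller kept the stack aligned; this is an assumption).
    ret_addr_offset is where the return address sits relative to the
    post-prologue esp (inferred from the argument offsets in the listings).
    """
    # `call` pushes the return address, moving esp down by ptr_size.
    ret_addr_residue = (call_site_residue - ptr_size) % alignment
    # The return address lives at esp + ret_addr_offset, so solve for esp.
    esp_residue = (ret_addr_residue - ret_addr_offset) % alignment
    return {off: (esp_residue + off) % alignment for off in slot_offsets}

# First listing: vector slots at [esp] and [esp+10h].
print(slot_residues(0x2C, [0x00, 0x10]))
# Second listing: vector slots at [esp+8] (the movaps) and [esp+28h].
print(slot_residues(0x54, [0x08, 0x28]))
# Same frame, but the caller left esp off by 4 bytes at the call site.
print(slot_residues(0x54, [0x08, 0x28], call_site_residue=4))
```

Under these assumptions every slot in both listings has residue 0, so an aligned store to [esp+8] would be safe. If the caller leaves esp off by 4, the same slot has residue 4 and movaps would fault, which is the situation in case (a).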

Hi Chuck,

I believe this is a bug but am unable to reproduce it with the test case you’ve provided. I should be able to see the same problem using llc since the code generator is going through all the same passes. The only difference should be the relocation model.

Please file a bug and provide us with a test case. You should be able to set a breakpoint somewhere in ExecutionEngine.cpp / JIT.cpp and just dump out the bitcode with Module->dump() / print().

Evan

Fixed. See PR1776 and http://lists.cs.uiuc.edu/pipermail/llvm-commits/Week-of-Mon-20071105/055148.html

Evan