SIMD instructions and memory alignment on X86

Hello all,

I'm currently debugging a crash in our program. In LLVM 3.2 and 3.3 it appears that JIT-generated code is attempting to access unaligned memory with an SSE2 instruction. However, this only happens under certain conditions that seem (but may not be) related to the stack's state when the function is called.

Our program acts as a front-end, using the LLVM C++ API to build a function that is then JIT-compiled. This function is primarily mathematical, so we use the Vector types to take advantage of SIMD instructions (as well as a few SSE2 intrinsics).

This worked in LLVM 2.8 but started failing in 3.2 and has continued to fail in 3.3. It fails with no optimizations applied to the LLVM Function/Module. It crashes with what is reported as a memory access error (accessing 0xffffffff); however, it has been suggested to me that this is simply how a fault raised by an SSE instruction gets reported.

The generated instruction varies, but it seems to often be similar to (I don't have it in front of me, sorry):
movapd xmm0, xmm[ecx+0x???]
Where the xmm register changes, and the second parameter is a memory access.
ECX is always set to 0x7fffffff; however, I don't know whether this is part of the SSE error reporting process or part of the situation causing the error.

I haven't worked out exactly what code path etc is causing this crash. I'm hoping that someone can tell me if there were any changed requirements for working with SIMD in LLVM 3.2 (or earlier, we haven't tried 3.0 or 3.1). I currently suspect the use of GlobalVariable (we first discovered the crash when using a feature that uses them), however I have attempted using setAlignment on the GlobalVariables without any change.
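For anyone following along: movapd requires its memory operand to be 16-byte aligned, so a quick sanity check on any buffer handed to the JIT'd function is to test the address directly. The following is a minimal sketch in plain C++ and does not use the LLVM API; is_aligned and Vec2d are hypothetical names invented for this illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Returns true when `p` sits on an `alignment`-byte boundary.
// `alignment` must be a power of two (16 for a movapd memory operand).
inline bool is_aligned(const void* p, std::size_t alignment) {
    assert(alignment != 0 && (alignment & (alignment - 1)) == 0);
    return (reinterpret_cast<std::uintptr_t>(p) & (alignment - 1)) == 0;
}

// Hypothetical helper type: two doubles forced onto a 16-byte boundary,
// i.e. the layout an aligned 128-bit load expects.
struct alignas(16) Vec2d {
    double lanes[2];
};
```

Checking a pointer this way just before the call would at least show whether the address is misaligned, or whether the base register itself holds garbage.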

As someone off list just told me, perhaps my new bug is the same issue:

  http://llvm.org/bugs/show_bug.cgi?id=16640

Do you happen to be using FastISel?

Solomon

Unfortunately, this doesn't appear to be the bug I'm hitting. I applied the fix to my source and it didn't make a difference.

Further testing also reproduces the same behavior with other SIMD instructions. The common factor in each case is that ECX is set to 0x7fffffff and the operation uses an xmm ptr ecx+offset memory operand.

Additionally, turning the optimization level passed to createJIT down appears to avoid it, so I'm now leaning towards a bug in one of the optimization passes.

I'm going to dig through the passes controlled by that parameter and see if I can narrow down which optimization is causing it.

Peter N

Are you able to send any IR for others to reproduce this issue?

I've attached the module->dump() that our code is producing. Unfortunately this is the smallest test case I have available.

This is before any optimization passes are applied. There are two separate modules in existence at the time, and there are no guarantees about the order the surrounding code calls those functions, so there may be some interaction between them? There shouldn't be, they don't refer to any common memory etc. There is no multi-threading occurring.

The function in module-dump.ll (called crashfunc in this file) is called with
- func_params 0x0018f3b0 double [3]
         [0x0] -11.339976634695301 double
         [0x1] -9.7504239056205506 double
         [0x2] -5.2900856817382804 double
at the time of the exception.

This is compiled on a "i686-pc-win32" triple. All of the non-intrinsic functions referred to in these modules are the standard equivalents from the MSVC library (e.g. @asin is the standard C lib double asin( double ) ).

Hopefully this is reproducible for you.

module-dump.ll (20.2 KB)

module-dump-2.zip (22.5 KB)

After stepping through the produced assembly, I believe I have a culprit.

One of the calls to @frep.x86.sse2.sqrt.pd is modifying the value of ECX - while the produced code is expecting it to still contain its previous value.

Peter N

What is “frep.x86.sse2.sqrt.pd”? I’m only familiar with things prefixed with “llvm.x86”.

Sorry, that should have been llvm.x86.sse2.sqrt.pd

That should map directly to sqrtpd which can’t modify ecx.

In the disassembly, I'm seeing three cases of
call 76719BA1

I am assuming this is the sqrt function as this is the only function called in the LLVM IR.

The code at 76719BA1 is:

76719BA1 push ebp
76719BA2 mov ebp,esp
76719BA4 sub esp,20h
76719BA7 and esp,0FFFFFFF0h
76719BAA fld st(0)
76719BAC fst dword ptr [esp+18h]
76719BB0 fistp qword ptr [esp+10h]
76719BB4 fild qword ptr [esp+10h]
76719BB8 mov edx,dword ptr [esp+18h]
76719BBC mov eax,dword ptr [esp+10h]
76719BC0 test eax,eax
76719BC2 je 76719DCF
76719BC8 fsubp st(1),st
76719BCA test edx,edx
76719BCC js 7671F9DB
76719BD2 fstp dword ptr [esp]
76719BD5 mov ecx,dword ptr [esp]
76719BD8 add ecx,7FFFFFFFh
76719BDE sbb eax,0
76719BE1 mov edx,dword ptr [esp+14h]
76719BE5 sbb edx,0
76719BE8 leave
76719BE9 ret

As you can see at 76719BD5, it modifies ECX.
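One reading of the tail of that listing, which nobody in this thread has confirmed: ECX is loaded with the 32-bit value that fstp just stored at [esp], and then 7FFFFFFFh is added to it, with the carry consumed by the following sbb instructions. If the stored bits are all zero, ECX ends up holding exactly 0x7fffffff, which would match the value seen at the crash. The sketch below models only that single add in plain C++; AddResult and add_7fffffff are hypothetical names.

```cpp
#include <cstdint>

// Result of a 32-bit add: the register value and the carry flag.
struct AddResult {
    std::uint32_t value;  // register contents after the add
    bool carry;           // CF after the add
};

// Models `add ecx, 7FFFFFFFh` on a 32-bit register.
inline AddResult add_7fffffff(std::uint32_t ecx) {
    // Widen to 64 bits so the carry out of bit 31 is visible.
    const std::uint64_t wide = static_cast<std::uint64_t>(ecx) + 0x7FFFFFFFu;
    return { static_cast<std::uint32_t>(wide), (wide >> 32) != 0 };
}
```

In this model an input of 0 gives 0x7FFFFFFF with no carry, and only inputs above 0x80000000 produce a carry for the sbb to subtract.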

I don't know that this is the sqrtpd function (for example, I'm not seeing any SSE instructions here?) but whatever it is, it's being called from the IR I attached earlier, and is modifying ECX under some circumstances.

Hmm, maybe SSE isn’t being enabled, so it’s falling back to emulating sqrt?

Is there something specifically required to enable SSE? If it's not detected as available (based on the target triple?) then I don't think we enable it explicitly.

Also it seems that it should handle converting to/from the vector types, although I can see it getting confused about needing to do that if it thinks SSE isn't available at all.

Hmm, I’m not able to get those .ll files to compile if I disable SSE, and I end up with SSE instructions (including sqrtpd) if I don’t disable it.

(Changing subject line as diagnosis has changed)

I'm attaching the compiled code that I've been getting, both with CodeGenOpt::Default and CodeGenOpt::None. The crash isn't occurring with CodeGenOpt::None, but that seems to be because ECX isn't being used; it still gets set to 0x7fffffff by one of the calls to 76719BA1.

I notice that X86::SQRTPD[m|r] appear in X86InstrInfo::isHighLatencyDef. I was thinking an optimization might be removing it, but I don't get the sqrtpd instruction even if the createJIT optimization level is turned off.

I am trying this with the Release 3.3 code - I'll try it with trunk and see if I get a different result there. Maybe there was a recent commit for this.

function-asm-createJIT-Default.txt (10.7 KB)

function-asm-createJIT-None.txt (20.8 KB)

The calls represent the MSVC _ftol2 function I think.

Oh, excellent point, I agree. My bad. Now that I'm not assuming those are the sqrt, I see the sqrtpd's in the output. Also there are three fptoui's and there are 3 call instances.

(Changing subject line again.)

Now it looks like it's bug #13862

Try adding ECX to the Defs of this part of lib/Target/X86/X86InstrCompiler.td like I’ve done below. I don’t have a Windows machine to test myself.

let Defs = [EAX, EDX, ECX, EFLAGS], FPForm = SpecialFP in {
  def WIN_FTOL_32 : I<0, Pseudo, (outs), (ins RFP32:$src),
                      "# win32 fptoui",
                      [(X86WinFTOL RFP32:$src)]>,
                    Requires<[In32BitMode]>;

  def WIN_FTOL_64 : I<0, Pseudo, (outs), (ins RFP64:$src),
                      "# win32 fptoui",
                      [(X86WinFTOL RFP64:$src)]>,
                    Requires<[In32BitMode]>;
}
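To illustrate why the Defs list matters, here is a toy model in plain C++. It is not LLVM's register allocator, and PseudoInstr, assumed_preserved, ftol_before and ftol_after are hypothetical names. The idea it sketches: a value kept in a register across an instruction is assumed to survive unless that register appears in the instruction's declared Defs, so a helper that really overwrites ECX but does not declare it can silently corrupt a live value.

```cpp
#include <set>
#include <string>

// Toy model of an instruction and the registers it declares as clobbered.
struct PseudoInstr {
    std::string name;
    std::set<std::string> defs;  // registers the instruction claims to overwrite
};

// A value held in `reg` is treated as surviving `instr` only when `reg`
// is absent from the declared Defs.
inline bool assumed_preserved(const PseudoInstr& instr, const std::string& reg) {
    return instr.defs.count(reg) == 0;
}

// Defs as implied before the suggested change (no ECX).
inline PseudoInstr ftol_before() {
    return {"WIN_FTOL_64", {"EAX", "EDX", "EFLAGS"}};
}

// Defs after the suggested change (ECX added).
inline PseudoInstr ftol_after() {
    return {"WIN_FTOL_64", {"EAX", "EDX", "ECX", "EFLAGS"}};
}
```

With the "before" list the model reports ECX as preserved across the conversion, which is the assumption the generated code appears to make; with the "after" list it does not.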

Thank you, I'm trying this now.

I don’t think that’s going to work.

That does appear to have worked. All my tests are passing now.

I'll hand this out to our other devs & testers and make sure it's working for them as well (not just on my machine).

Thank you, again.