unnecessary reload of 8-byte struct on i386

Hello folks,

I’ve recently been looking at the generated code for a few functions in Chromium while investigating crashes, and I came across a curious pattern. A smallish repro case is available at https://godbolt.org/z/Dsu1WI. In that case, the function Assembler::emit_arith receives a struct (Operand) by value and passes it by value to another function. That struct is 8 bytes long, so the code generated at -O3 uses movsd to copy it up the stack. However, we end up with some loads that aren’t needed, as in the following chunk:

    movsd xmm0, qword ptr [ecx]        # xmm0 = mem[0],zero
    mov   dword ptr [esp + 24], edx
    movsd qword ptr [esp + 40], xmm0
    movsd xmm0, qword ptr [esp + 40]   # xmm0 = mem[0],zero
    movsd qword ptr [esp + 8], xmm0

As far as I can tell, the fourth instruction (the reload from esp+40) has no effect, because xmm0 already holds the value that was just stored there. On its own, that seems like a small missed opportunity for optimization. However, this sequence of instructions also appears to trigger a hardware bug on a small fraction of devices, which sometimes end up storing zero at esp+8. A more in-depth discussion of that issue can be found here: https://bugs.chromium.org/p/v8/issues/detail?id=9774
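For readers who prefer not to open the godbolt link, the shape of the code described above can be sketched roughly as follows. This is only an illustration: the field names, the body of the callee, and the name emit_operand are assumptions, not the actual V8 source, and whether a given compiler version emits the redundant reload for this exact snippet has not been checked here.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical stand-in for the 8-byte Operand struct (the real V8 type differs).
struct Operand {
  std::uint32_t lo;
  std::uint32_t hi;
};
static_assert(sizeof(Operand) == 8, "the sketch assumes an 8-byte struct");

// Hypothetical callee taking the struct by value. It is kept out of line so
// that the by-value argument really has to be materialized for the call.
__attribute__((noinline)) std::uint64_t emit_operand(std::uint32_t reg, Operand op) {
  assert(reg < 8);  // ia32 register codes fit in three bits
  return (static_cast<std::uint64_t>(op.hi) << 32) + op.lo + reg;
}

// Receives the struct by value and forwards it by value, which is the
// pattern the post describes for Assembler::emit_arith.
std::uint64_t emit_arith(std::uint32_t sel, Operand op) {
  return emit_operand(sel & 7u, op);
}
```

Building something like this for a 32-bit x86 target at -O3 and reading the assembly for emit_arith is one way to look for the store-then-reload pair.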

I’m hoping that getting rid of the second load in the sequence above would appease these misbehaving machines (though of course I don’t know that it would), as well as making the code a little smaller for everybody else. Does that sound like a reasonable idea? Would LLVM be interested in a patch related to eliminating reloads like this? Does anybody have advice about where I should start looking, or any reasons it would be very hard to achieve the result I’m hoping for?

Thanks,

Seth

It’s not unheard of for the compiler to work around CPU bugs… but generally, we try to do it in a more disciplined way: with a code generation pass that actually detects the bad sequence in question. I’m not really happy with trying to “get lucky” here to avoid a bug.
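To make "detects the bad sequence" a bit more concrete, here is a toy model of such a check. Everything in it is an assumption made for illustration: the Inst record and find_redundant_reloads are hypothetical and are not LLVM's MachineInstr or pass API, and the pattern it matches (a store immediately followed by a load of the same register from the same address) is only a guess at the trigger, which this thread has not pinned down.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical, heavily simplified instruction record.
enum class Kind { Load, Store, Other };
struct Inst {
  Kind kind;
  std::string reg;   // register operand, e.g. "xmm0"
  std::string addr;  // memory operand, e.g. "esp+40"
};

// Returns the indices of loads that immediately re-read, into the same
// register, the value the previous instruction stored to the same address.
std::vector<std::size_t> find_redundant_reloads(const std::vector<Inst>& code) {
  std::vector<std::size_t> hits;
  for (std::size_t i = 1; i < code.size(); ++i) {
    const Inst& prev = code[i - 1];
    const Inst& cur = code[i];
    if (prev.kind == Kind::Store && cur.kind == Kind::Load &&
        prev.reg == cur.reg && prev.addr == cur.addr) {
      hits.push_back(i);
    }
  }
  return hits;
}

// The five-instruction chunk quoted earlier in the thread, in this toy form.
std::vector<Inst> sequence_from_post() {
  return {
      {Kind::Load, "xmm0", "ecx"},
      {Kind::Store, "edx", "esp+24"},
      {Kind::Store, "xmm0", "esp+40"},
      {Kind::Load, "xmm0", "esp+40"},
      {Kind::Store, "xmm0", "esp+8"},
  };
}

// Only the fourth instruction (index 3) should be flagged.
void check_post_sequence() {
  assert(find_redundant_reloads(sequence_from_post()) ==
         std::vector<std::size_t>{3});
}
```

A real pass would run late in code generation on the target's machine instructions and would have to decide what to rewrite the flagged pair into, which is the open question here.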

This particular missed optimization is a known issue with the LLVM IR representation of “byval”; there’s an implied copy that can’t be easily optimized away at the IR level due to calling convention rules. For ARM targets, clang works around this issue by changing the IR it generates; see ARMABIInfo::classifyArgumentType in clang/lib/CodeGen/TargetInfo.cpp .
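A source-level view of why the copy is implied: a callee that takes a struct by value is allowed to overwrite its parameter, so the caller's object and the callee's argument have to be distinct storage unless the optimizer can prove otherwise. The sketch below only demonstrates that language rule; the struct layout and function names are made up for illustration and say nothing about how clang lowers the call.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical 8-byte struct standing in for Operand.
struct Operand {
  std::uint32_t lo;
  std::uint32_t hi;
};

// The callee is free to scribble on its by-value parameter.
std::uint32_t consume(Operand op) {
  op.lo += op.hi;
  op.hi = 0;
  return op.lo;
}

// The caller's own object must be unaffected by whatever the callee did,
// which is what forces a separate copy to exist for the call.
std::uint32_t forward(Operand op) {
  const Operand before = op;
  const std::uint32_t result = consume(op);
  assert(op.lo == before.lo && op.hi == before.hi);
  return result;
}
```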

-Eli

This just looks like a temporary stack variable wasn’t properly eliminated because the compiler modeled `Operand` internally as an “i64” which i386 doesn’t natively support. A further reduction, with notes about changes that sidestep the bug:

https://godbolt.org/z/zcCguv

Hi Eli,

Thanks for the thoughtful response! To check my understanding, is the following statement true? Given infinite time/resources and a clearer understanding of the exact trigger for this CPU bug, we would like two separate changes: one change in Clang to emit different IR for x86 like we do for ARM, and a second change to search for this instruction sequence and transform it into something harmless.

However, I don’t really know what “something harmless” would be, and this issue affects only a tiny minority of machines. It also feels a little silly to keep stuffing more nops in between those instructions until we stop getting these crash reports, so I think the reasonable next step is to leave it alone. Please let me know if you think otherwise, and thanks again!

-Seth

Very interesting, thanks for the analysis and further reduction! Over-aligning the struct seems like a simple workaround that would have little impact elsewhere.
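For concreteness, over-aligning could look something like the sketch below. The 16-byte alignment and all of the names are assumptions made for illustration (the thread does not say which alignment the reduction used), and the sketch does not show that the reload actually disappears. One cost worth noting is that the struct's size grows from 8 to 16 bytes.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical over-aligned variant of the 8-byte struct. alignas(16) is an
// assumed value chosen for illustration.
struct alignas(16) AlignedOperand {
  std::uint32_t lo;
  std::uint32_t hi;
};
static_assert(alignof(AlignedOperand) == 16, "over-aligned to 16 bytes");
static_assert(sizeof(AlignedOperand) == 16, "size rounds up to the alignment");

// True if p is a multiple of the given alignment.
bool is_aligned(const void* p, std::size_t alignment) {
  return reinterpret_cast<std::uintptr_t>(p) % alignment == 0;
}

// By-value pass-through, mirroring the pattern discussed in the thread.
std::uint32_t forward_sum(AlignedOperand op) {
  return op.lo + op.hi;
}

// A local of the over-aligned type sits on a 16-byte boundary.
void check_local_alignment() {
  AlignedOperand local{1u, 2u};
  assert(is_aligned(&local, 16));
}
```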

Yes, that’s essentially what I’m thinking.

I’d be okay with ignoring the crash issue, if someone comes up with a patch for the missed optimization. (But we’d still need to evaluate the optimization on its own merits.)

-Eli