I’m seeing a difference in how LLVM 3.3 and 3.2 emit unaligned vector loads on AVX.
3.3 is splitting up an unaligned vector load but in 3.2, it was emitted as a single instruction (details below).
In a matrix-matrix inner-kernel, I see a ~25% decrease in performance, which seems to be due to this.
Any ideas why this changed? Thanks!
Zach
LLVM Code:
define <4 x double> @vstore(<4 x double>*) {
entry:
  %1 = load <4 x double>* %0, align 8
  ret <4 x double> %1
}
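The difference in the emitted code looks roughly like the following (illustrative output only; the exact instruction selection and register allocation depend on the llc options used, e.g. -mattr=+avx):

# LLVM 3.2: a single unaligned 256-bit load
vmovupd (%rdi), %ymm0
retq

# LLVM 3.3: the load is split into two 128-bit halves plus a subvector insert
vmovupd (%rdi), %xmm0
vinsertf128 $1, 16(%rdi), %ymm0, %ymm0
retq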
Hi Zach,
I ran into a similar problem with the R600 backend, and I was able to fix it
by implementing TargetLowering::allowsUnalignedMemoryAccesses().
Take a look at r184822.
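In case it helps, here is a minimal sketch of what such an override looks like, assuming a hypothetical MyTargetLowering subclass (the exact signature of this hook differs slightly between LLVM releases):

// Sketch only: report that unaligned loads/stores of wide vector types are
// legal (and fast), so SelectionDAG legalization does not split them.
bool MyTargetLowering::allowsUnalignedMemoryAccesses(EVT VT,
                                                     bool *Fast) const {
  if (VT == MVT::v4f64 || VT == MVT::v8f32) {
    if (Fast)
      *Fast = true; // claim the unaligned access is also fast
    return true;
  }
  return false;
}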
Yes. On Sandy Bridge, 256-bit loads/stores are double-pumped, meaning they issue one after the other over two cycles. On Haswell the memory ports are wide enough to allow a 256-bit memory operation in one cycle. So, on Sandy Bridge we split unaligned memory operations into two 128-bit parts to allow them to execute in two separate ports. This is also what GCC and ICC do.
It is very possible that the decision to split the wide vectors causes a regression. If the memory ports are busy, it is better to double-pump them and save the cost of the insert/extract subvector. Unfortunately, during ISel we don’t have a good way to estimate port pressure. In any case, it is a good idea to revisit the heuristics that I put in and to see whether they match the Sandy Bridge optimization guide. If I remember correctly the optimization guide does not have much information on this, but Elena looked over it and said that it made sense.
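To make the trade-off concrete, the split form is conceptually equivalent to something like the IR below (a sketch only; the split actually happens during legalization/ISel, not in IR, and the final shuffle becomes a vinsertf128):

define <4 x double> @split_load(<4 x double>* %p) {
entry:
  ; two 128-bit unaligned loads of the low and high halves
  %lo.ptr = bitcast <4 x double>* %p to <2 x double>*
  %hi.ptr = getelementptr <2 x double>* %lo.ptr, i64 1
  %lo = load <2 x double>* %lo.ptr, align 8
  %hi = load <2 x double>* %hi.ptr, align 8
  ; merge the halves back into a 256-bit value
  %v = shufflevector <2 x double> %lo, <2 x double> %hi, <4 x i32> <i32 0, i32 1, i32 2, i32 3>
  ret <4 x double> %v
}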
What is the code and architecture? In most loops, splitting makes the code faster when run on Ivy Bridge; you could dig through the Intel optimization manual for that recommendation. Perhaps this code is a special case.
Thanks for all the info! I’m still in the process of narrowing down the performance difference in my code. I’m no longer convinced it’s related to the unaligned loads/stores alone, since extracting this part of the kernel makes the performance difference disappear. I will try to narrow down what is going on, and if it seems related to LLVM, I will post an example. Thanks again,
If you look at kernel33.s, it has a register spill/reload in the inner loop. This doesn’t appear in the LLVM 3.2 version, and it disappears from the 3.3 version if you remove the "align 8" annotations from kernel.ll that mark the loads as unaligned. Do the two-instruction unaligned loads increase register pressure? Or is something else going on?
We see multiple regressions after r172868 in the ISPC compiler (which is based on the LLVM optimizer). The regressions are due to spills/reloads, which are caused by increased register pressure. This matches Zach’s analysis. We’ve filed bug 17285 for this problem.
Is there any possibility of avoiding the split when multiple loads go together?