AVX broadcast vs. vector constant pool load

Hey guys,

I’m currently investigating broadcasts from the constant pool on Sandy Bridge. I see this comment in llvm/lib/Target/X86/X86ISelLowering.cpp:

// Handle the broadcasting a single constant scalar from the constant pool
// into a vector. On Sandybridge it is still better to load a constant vector
// from the constant pool and not to broadcast it from a scalar.

Would anyone be able to explain why it is better to load a vector from the constant pool rather than broadcast a scalar?
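
To make the two alternatives concrete, here is a minimal sketch in C with AVX intrinsics. The helper names load_full_vector and broadcast_scalar are hypothetical, and the per-function target("avx") attribute assumes GCC or Clang on x86-64. The instruction noted in each comment is what these intrinsics typically map to; the compiler remains free to select something else, which is exactly the lowering decision in question.

```c
#include <assert.h>
#include <immintrin.h>
#include <stdint.h>

/* Hypothetical helper: read a pre-splatted 4 x double vector, as it would
   sit in the constant pool (32 bytes). Typically a vmovaps ymm, m256. */
__attribute__((target("avx")))
void load_full_vector(const double *vec4, double *out4)
{
    /* _mm256_load_pd requires a 32-byte aligned address. */
    assert((uintptr_t)vec4 % 32 == 0);
    __m256d v = _mm256_load_pd(vec4);
    _mm256_storeu_pd(out4, v);
}

/* Hypothetical helper: read a single double (8 bytes) and replicate it
   into all four lanes. Typically a vbroadcastsd ymm, m64. */
__attribute__((target("avx")))
void broadcast_scalar(const double *scalar, double *out4)
{
    __m256d v = _mm256_broadcast_sd(scalar);
    _mm256_storeu_pd(out4, v);
}
```

Both helpers produce the same register contents. The difference is the memory footprint of the constant (32 bytes vs. 8 bytes) and which load instruction is issued.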

I checked Agner Fog’s instruction tables, but the answer wasn’t obvious to me from them:

vmovaps y, m256:
  Uops: 1
  Lat: 4
  Throughput: 1

vbroadcastsd y, m64:
  Uops: 2
  Lat: [Not or cannot be measured]
  Throughput: 1

Thanks in advance,
Cameron

I don’t remember exactly why I did this. I vaguely remember looking at this with one of the Sandy Bridge architects and following his suggestion.

Looking at it now, broadcasting the scalar looks like it should be faster, because the 256-bit load on Sandy Bridge is double-pumped (executed as two 128-bit halves).

I am CC-ing Elena, who should be able to tell.