Vectorizing structure reads, writes, etc on X86-64 AVX

I am a first-time poster, so I apologize if this is an obvious
question or out of scope for LLVM. I am an LLVM user; I don't really
know anything about hacking on LLVM, but I do know a bit about
compilation generally.

I am on x86-64, and I am interested in having structure reads, writes,
and constants optimized to use vector registers when the alignments
and sizes are right. I have created a gist of a small example:

The assembly is produced with

llc -O3 -march=x86-64 -mcpu=corei7-avx

The key idea is that we have a structure like this:

%athing = type { float, float, float, float, float, float, i16, i16,
i8, i8, i8, i8 }

That works out to be 32 bytes, so it can fit in YMM registers.

If I have two pointers to arrays of these things:

@one = external global %athing
@two = external global %athing

and then I do a copy from one to the other

  %a = load %athing* @two
  store %athing %a, %athing* @one

Then the code that is generated uses the XMM registers for the floats,
but does 12 loads and then 12 stores.

In contrast, if I manually cast to a properly sized float vector I get
the desired single load and single store:

  %two_vector = bitcast %athing* @two to <8 x float>*
  %b = load <8 x float>* %two_vector
  %one_vector = bitcast %athing* @one to <8 x float>*
  store <8 x float> %b, <8 x float>* %one_vector

The rest of the file demonstrates that the code for modifying these
vectors is pretty good, but it also has examples of bad ways to
initialize the structure alongside a good way. If I try to store a
constant struct, I get 13 stores. If I try to assemble a vector by
casting <2 x i16> to float, then <4 x i8> to float, and installing
them into a single <8 x float>, I do get the desired single store, but
I get very complicated constants that are loaded from memory. In
contrast, if I bitcast the <8 x float> to <16 x i16> and <32 x i8> as
I go, then I get the desired initialization with no loads and just
modifications of the single YMM register. (Even this last one,
however, doesn't have the best assembly, because the words and bytes
are not inserted into the vector simultaneously, but one at a time.)
I am kind of surprised that the obvious code didn't get optimized the
way I expected, and that even the tedious version of the
initialization isn't optimal. I would like to know whether a
transformation from one form to the other is feasible in LLVM (I know
anything is possible, but what is feasible in this situation?), or
whether I should implement a transformation like this in my front-end
and settle for the initialization that comes out.

Thank you for your time,


Hi Jay -

I’m surprised by the codegen for your examples too, but LLVM has an expectation that a front-end and IR optimizer will use llvm.memcpy liberally:

"Any ld-ld-st-st sequence over this should have been converted to llvm.memcpy by the frontend."
"The optimizer should really avoid this case by converting large object/array copies to llvm.memcpy"

So for example with clang:

$ cat copy.c
struct bagobytes {
int i0;
int i1;
};

void foo(struct bagobytes* a, struct bagobytes* b) {
*b = *a;
}
$ clang -O2 copy.c -S -emit-llvm -Xclang -disable-llvm-optzns -o -
define void @foo(%struct.bagobytes* %a, %struct.bagobytes* %b) #0 {

call void @llvm.memcpy.p0i8.p0i8.i64(i8* %2, i8* %3, i64 8, i32 4, i1 false), !tbaa.struct !6
ret void

It may still be worth filing a bug (or seeing if one is already open) for one of your simple examples.

Thank you for your reply. FWIW, I wrote the .ll by hand after taking
the C program, using clang to emit the LLVM IR, and seeing the memcpy.
The memcpy version that clang generates gets compiled into assembly
that uses the large sequence of movs and does not use the vector
hardware at all. When I started debugging, I took that clang-produced
.ll and started writing it in different ways, trying to get different
results.


If the memcpy version isn’t getting optimized into larger memory operations, that definitely sounds like a bug worth filing.

Lowering of memcpy is affected by the size of the copy, alignments of the source and dest, and CPU target. You may be able to narrow down the problem by changing those parameters.

From: "Sanjay Patel via llvm-dev" <>
To: "Jay McCarthy" <>
Cc: "llvm-dev" <>
Sent: Tuesday, November 3, 2015 12:30:51 PM
Subject: Re: [llvm-dev] Vectorizing structure reads, writes, etc on X86-64 AVX


The relevant target-specific logic is in X86TargetLowering::getOptimalMemOpType, looking at that might help in understanding what's going on.


Thanks, Hal.

That code is very readable. Basically, the following has to be true
- not a memset or memzero [check]
- no implicit floats [check]
- size greater than 16 [check, it's 32]
- ! isUnalignedMem16Slow [check?]
- int256, fp256, or sse2, or sse1 is around [check]

That last condition is:
- src & dst alignment is 0 or greater than 16

I think this is true, because I'm reading from a giant array of these
things, so the memory should be aligned to the object size. In case
that assumption is wrong, I added an explicit alignment attribute.

I think part of the problem is that the memcpy that gets generated
isn't for the structure, but for the struct pointers bitcast to i8*:

  %17 = bitcast %struct.sprite* %9 to i8*
  %18 = bitcast %struct.sprite* %16 to i8*
  call void @llvm.memcpy.p0i8.p0i8.i64(i8* %17, i8* %18, i64 32, i32 4, i1 false)

So even though the original struct pointers were aligned at 32, the
byte arrays that are created lose that alignment information.

If this is correct, would you recommend filing this as a bug, with a
small test case?

BTW, Here's a tiny C program that demonstrates the "problem":

typedef struct {
  float dx; float dy;
  float mx; float my;
  float theta; float a;
  short spr; short pal;
  char layer;
  char r; char g; char b;
} sprite;

sprite *spr_static;  // or an array of [1024] // or add __attribute__((align_value(32)))
sprite *spr_dynamic; // or an array of [1024] // or add __attribute__((align_value(32)))

void copy(int i, int j) {
  spr_dynamic[i] = spr_static[j];
}


Hi Jay -

I see the slow, small accesses using an older clang [Apple LLVM version 7.0.0 (clang-700.1.76)], but this looks fixed on trunk. I made a change that comes into play if you don’t specify a particular CPU:

$ ./clang -O1 -mavx copy.c -S -o -

movslq %edi, %rax
movq _spr_dynamic@GOTPCREL(%rip), %rcx
movq (%rcx), %rcx
shlq $5, %rax
movslq %esi, %rdx
movq _spr_static@GOTPCREL(%rip), %rsi
movq (%rsi), %rsi
shlq $5, %rdx
vmovups (%rsi,%rdx), %ymm0 <— 32-byte load
vmovups %ymm0, (%rcx,%rax) <— 32-byte store
popq %rbp

Oh that's great. I'll just update and go from there. Thanks so much
and sorry for the noise.


No problem. Please do file bugs if you see anything that looks suspicious.

The x86 memcpy lowering still has that FIXME comment that I haven’t gotten back around to, and we have at least one other potential improvement: