Disable vectorization for unaligned data

What is the proper solution to disable auto-vectorization for unaligned data?

I have an out-of-tree target and I added this:

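bool OpusTargetLowering::allowsUnalignedMemoryAccesses(EVT VT, bool *Fast) const {
  // Report unaligned vector accesses as unsupported.
  if (VT.isVector())
    return false;

  // (handling of non-vector types is not shown in the original post)
}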

After that, I could see that vectorization is still done on unaligned data, except that LLVM now copies the data back and forth between the source and the top of the stack and works from there. This is very costly; I would rather get scalar operations.

Then I tried to add:

unsigned getMemoryOpCost(unsigned Opcode, Type *Src,
                         unsigned Alignment,
                         unsigned AddressSpace) const {
  if (Src->isVectorTy() && Alignment != 16)
    return 10000; // <== high number to try to avoid unaligned load/store.

  return TargetTransformInfo::getMemoryOpCost(Opcode, Src, Alignment, AddressSpace);
}

Except that this doesn’t work, because Alignment will always be 4, even for data like:

int data[16][16] __attribute__ ((aligned (16)));

because individual elements are still only 4-byte aligned.
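To make the point about element alignment concrete, here is a minimal standalone sketch (plain C++, not LLVM code). The helper `isAligned16` is a hypothetical name introduced only for illustration, and the sketch assumes a 4-byte `int`. It shows that aligning the array to 16 bytes only guarantees 16-byte alignment for every fourth element, so an arbitrary scalar access to `data[i][j]` can only be assumed 4-byte aligned.

```cpp
#include <cassert>
#include <cstdint>

static_assert(sizeof(int) == 4, "sketch assumes a 4-byte int");

// 16-byte aligned array, equivalent to the attribute in the declaration above.
alignas(16) static int data[16][16];

// Hypothetical helper: true if the address is a multiple of 16.
static bool isAligned16(const void *p) {
  assert(p != nullptr);
  return reinterpret_cast<std::uintptr_t>(p) % 16 == 0;
}
```

Calling `isAligned16(&data[0][j])` returns true only when `j` is a multiple of 4, which is why the alignment seen on each individual scalar access is 4.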

I am not sure what the right way to do this is.
Thanks.

> What is the proper solution to disable auto-vectorization for unaligned
> data?

Why are you trying to do this? If auto-vectorization is making a
given loop slower on your target, that means the cost metrics are off,
and we should fix them. If code size is an issue, you should tell the
optimizer that you want to optimize for size.

-Eli

Because unaligned loads/stores are illegal on my target, and ExpandUnalignedStore expands them into too many loads/stores.

It seems that ExpandUnalignedStore is called after the vectorization cost analysis is done, so its cost is not taken into account.

We will have to hook up some logic in the loop vectorizer that computes the alignment of the vectorized version of the memory access so that we can pass it to “getMemoryOpCost”. Currently, as you have observed, we just pass the scalar loop’s memory access alignment, which is pessimistic.

Instcombine will later replace the alignment with a stronger one for the vectorized code, but that is obviously too late for the cost model in the vectorizer.

OK, is there any quick workaround to limit vectorization to 16-byte aligned 128-bit data, then?

All the memory copying done by ExpandUnalignedStore/ExpandUnalignedLoad is just too expensive.

No, I am afraid not without computing alignment based on the scalar code.

In order to limit vectorization to 16-byte aligned data, we need to know that the data is 16-byte aligned. With the way we vectorize, we won’t know that until after we have vectorized. As you have observed, we will pass “4” to getMemoryOpCost in the loop vectorizer (as that is the only thing that can be inferred from a consecutive scalar access like “aligned_ptr += 32bit”).

scalar code -> estimate cost based on scalar instructions -> vectorize -> vectorized code -> ... -> instcombine (calls ComputeMaskedBits) which computes better alignment for pointer accesses like “aligned_ptr += 128bit”.

I will have to work on this soon as ARM also has pretty inefficient unaligned vector loads.
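The alignment reasoning above can be illustrated with a small standalone sketch. This is not LLVM code: `lowestSetBit`, `provableAlignment`, and `toyMemoryOpCost` are hypothetical names, and the toy cost function only mimics the shape of the hook posted earlier in the thread. The assumption is that an access sequence `base + i * stride` can only be proven aligned to the smaller of the base alignment and the largest power of two dividing the stride.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>

// Largest power of two that divides x (x must be non-zero).
static std::uint64_t lowestSetBit(std::uint64_t x) {
  assert(x > 0);
  return x & (~x + 1);
}

// Hypothetical helper: alignment that holds for every address
// base + i * strideBytes, given that base is aligned to baseAlign.
static std::uint64_t provableAlignment(std::uint64_t baseAlign,
                                       std::uint64_t strideBytes) {
  assert(baseAlign > 0 && (baseAlign & (baseAlign - 1)) == 0);
  return std::min(baseAlign, lowestSetBit(strideBytes));
}

// Toy stand-in for the cost hook from the thread: penalize vector
// accesses that are not known to be 16-byte aligned.
static unsigned toyMemoryOpCost(bool isVector, std::uint64_t alignment) {
  if (isVector && alignment != 16)
    return 10000;
  return 1;
}
```

With a 16-byte aligned base, `provableAlignment(16, 4)` models the scalar loop (4-byte steps) and `provableAlignment(16, 16)` models the vectorized loop (128-bit steps). Feeding the first value into the toy cost function gives the penalty; feeding the second does not, which is the mismatch described above.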

If I understand you correctly, this is the classic case for loop peeling. I thought LLVM’s vectorizer already had something like that.

No, we don’t have loop peeling.

The problem is even more fundamental than this. In the vectorizer we pass the alignment of the scalar loop access, which is of course lower than what is required. We need to compute the alignment based on the first access only and the vector access size, but we don’t do this at the moment.
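For reference, the loop peeling mentioned above can be sketched by hand as follows. This is only an illustration of the general technique in plain C++, not something the vectorizer does; `sumWithPeeling` is a hypothetical function, and the sketch assumes a 4-byte `int` and a 16-byte vector width. A scalar prologue runs until the pointer reaches a 16-byte boundary, the main loop then processes four elements per iteration from aligned addresses, and a scalar epilogue handles the remainder.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical example: sum n ints, peeling iterations so that the
// main loop always starts each group at a 16-byte aligned address.
static long long sumWithPeeling(const int *p, std::size_t n) {
  static_assert(sizeof(int) == 4, "sketch assumes a 4-byte int");
  long long total = 0;
  std::size_t i = 0;

  // Prologue: scalar iterations until p + i is 16-byte aligned.
  while (i < n && reinterpret_cast<std::uintptr_t>(p + i) % 16 != 0) {
    total += p[i];
    ++i;
  }

  // Main loop: groups of four ints, the shape an aligned 128-bit
  // vector load could cover.
  for (; i + 4 <= n; i += 4) {
    assert(reinterpret_cast<std::uintptr_t>(p + i) % 16 == 0);
    total += static_cast<long long>(p[i]) + p[i + 1] + p[i + 2] + p[i + 3];
  }

  // Epilogue: leftover elements that do not fill a whole group.
  for (; i < n; ++i)
    total += p[i];

  return total;
}
```

The result is the same whether or not the input pointer is 16-byte aligned; only the split between prologue, main loop, and epilogue changes.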

Yes, but they can be very slow depending on the alignment (more micro-ops).