Handling misaligned array accesses

Hi,

I have tried a couple of C test cases with LLVM to see whether we handle misaligned accesses, but it seems we do not have a transformation to align loop accesses. Misaligned accesses can hurt performance depending on the underlying target (e.g., the cost of crossing cache-line boundaries).

One example:
//unaligned load and store

int foo(short *a, int m) {
  int i;
  for (i = 1; i < m; i++)
    a[i] *= 2;
  return a[3];
}

The IR was generated through clang -O3 -mllvm -disable-llvm-optzns. Passing this through opt -O3, the loop vectorizer produces vector code for the loop, but the GEP access starts at offset 1, so the wide loads and stores are only 2-byte aligned:

vector.body: ; preds = %vector.body, %vector.body.preheader.new
%index = phi i64 [ 0, %vector.body.preheader.new ], [ %index.next.3, %vector.body ]
%niter = phi i64 [ %unroll_iter, %vector.body.preheader.new ], [ %niter.nsub.3, %vector.body ]
%offset.idx = or i64 %index, 1
%9 = getelementptr inbounds i16, i16* %a, i64 %offset.idx
%10 = bitcast i16* %9 to <8 x i16>*
%wide.load = load <8 x i16>, <8 x i16>* %10, align 2, !tbaa !2
%11 = shl <8 x i16> %wide.load, <i16 1, i16 1, i16 1, i16 1, i16 1, i16 1, i16 1, i16 1>
%12 = bitcast i16* %9 to <8 x i16>*
store <8 x i16> %11, <8 x i16>* %12, align 2, !tbaa !2
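For context, alignment-based peeling would, in source-level terms, run a few scalar iterations up front until the access address reaches a vector-friendly boundary, then let the vectorizer use aligned wide accesses for the remainder. A hand-written sketch of what such a transformation would achieve for foo (foo_peeled and the 16-byte target alignment are my assumptions, not LLVM output):

```c
#include <stdint.h>

// Sketch of alignment-based loop peeling for foo(): peel scalar
// iterations until &a[i] is 16-byte aligned (8 x i16 = 16 bytes per
// vector), so a vectorizer could use aligned <8 x i16> accesses in
// the main loop.
int foo_peeled(short *a, int m) {
  int i = 1;
  // Peel: scalar iterations until a[i] sits on a 16-byte boundary.
  // If the base pointer is oddly misaligned this never triggers, and
  // the i < m bound makes the whole loop run scalar.
  while (i < m && ((uintptr_t)&a[i] % 16) != 0) {
    a[i] *= 2;
    i++;
  }
  // Main loop: when any iterations remain, a[i] is now 16-byte
  // aligned, so wide aligned accesses would be legal here.
  for (; i < m; i++)
    a[i] *= 2;
  return a[3];
}
```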

Is there a reason we don’t support loop peeling for alignment handling?

Thanks,

Anna

From: "Anna Thomas" <anna@azul.com>
To: llvm-dev@lists.llvm.org
Cc: hfinkel@anl.gov, anemet@apple.com
Sent: Thursday, May 12, 2016 4:20:24 PM
Subject: Handling misaligned array accesses

Is there a reason we don’t support loop peeling for alignment
handling?

No. AFAIK, just no one has done the work to implement it yet. I'd certainly be quite interested in it, however. Several targets I care about would benefit from alignment-based peeling.
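For anyone picking this up, the peel-count computation itself is simple arithmetic; a sketch in C (the function name and its interface are mine, not an LLVM API):

```c
#include <stdint.h>
#include <stddef.h>

// Number of scalar iterations to peel so that an access stream
// starting at 'addr' with elements of 'elem_size' bytes reaches an
// 'align'-byte boundary. Returns 0 if already aligned, and
// (size_t)-1 if the misalignment is not a whole number of elements
// (peeling cannot fix it in that case).
size_t peel_count(uintptr_t addr, size_t elem_size, size_t align) {
  uintptr_t misalign = addr % align;
  if (misalign == 0)
    return 0;
  if (misalign % elem_size != 0)
    return (size_t)-1; // unfixable by peeling whole elements
  return (align - misalign) / elem_size;
}
```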

-Hal