Bitwise AND blocks re-association

Hi All,

I’m writing to ask for suggestions.

I’m working on the AMDGPU backend. While recently analyzing its performance and looking for optimization opportunities, I found that the LoadStoreVectorizer is unable to handle a simple case.

A deeper look revealed that the cause is an i64-to-i32 truncation.

The OpenCL code looks like this:

__kernel void read_linear(__global float *input, __global float *output)
{
    float val = 0.0f;
    uint gid = get_global_id(0);

    val = val + input[gid + 0];
    val = val + input[gid + 1];
    val = val + input[gid + 2];
    val = val + input[gid + 3];
    val = val + input[gid + 4];
    /* ... and so on ... */

    output[gid] = val;
}

I’d expect the loads to be combined, since they all share the same base and their offsets are consecutive constants.

That is exactly what happens if I change "uint" to "ulong" in the assignment (uint gid = get_global_id(0) => ulong gid = get_global_id(0)): the loads are combined as expected.

The reason is simple:

- OpenCL's get_global_id() returns size_t, which is i64 on this target.

- The result is assigned to a 32-bit uint.

- Clang emits a trunc from i64 to i32, and the zext back to i64 for the GEP index is folded together with it into an AND with 0xFFFFFFFF:

  %call = tail call i64 @_Z13get_global_idj(i32 0) #2
  %idxprom = and i64 %call, 4294967295
  %arrayidx = getelementptr inbounds float, float addrspace(1)* %input, i64 %idxprom
  %0 = load float, float addrspace(1)* %arrayidx, align 4, !tbaa !7
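(As a quick sanity check that the AND really is just the trunc/zext pair: a minimal standalone C++ illustration, with an arbitrary made-up input value.)

  #include <cassert>
  #include <cstdint>

  int main() {
    uint64_t call = 0x123456789ABCDEF0ull; // arbitrary i64 value
    // trunc i64 -> i32, then zext i32 -> i64 for the GEP index...
    uint64_t via_trunc_zext = (uint64_t)(uint32_t)call;
    // ...is exactly the AND with 0xFFFFFFFF shown in the IR above.
    uint64_t via_and = call & 0xFFFFFFFFull;
    assert(via_trunc_zext == via_and);
    return 0;
  }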

So the LoadStoreVectorizer cannot combine this load with the next one:

  %add2 = add i64 %call, 1
  %idxprom3 = and i64 %add2, 4294967295
  %arrayidx4 = getelementptr inbounds float, float addrspace(1)* %input, i64 %idxprom3
  %1 = load float, float addrspace(1)* %arrayidx4, align 4, !tbaa !7

Basically we have the case C1 & (A + C2) => (C1 & A) + (C1 & C2), which is not always legal.

In our case it is really trunc(A + C) => trunc(A) + trunc(C), which is legal.
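For what it's worth, that narrow identity holds unconditionally because i32 addition is itself modulo 2^32; a small C++ check (the value is chosen to force a carry out of the low 32 bits):

  #include <cassert>
  #include <cstdint>

  int main() {
    // The high bits of A never feed back into the low 32 bits of a sum,
    // so truncation distributes over addition.
    uint64_t A = 0x00000001FFFFFFFFull; // low 32 bits all ones
    uint64_t C = 1;
    uint32_t lhs = (uint32_t)(A + C);         // trunc(A + C) == 0
    uint32_t rhs = (uint32_t)A + (uint32_t)C; // wraps to 0 as well
    assert(lhs == rhs);
    return 0;
  }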

NaryReassociate cannot handle this because it only considers ADD, MUL, and GEP.

Reassociate cannot handle this either, because it only looks for specific patterns such as "X & ~X == 0", "X | ~X == -1", and "Y ^ X ^ X => Y".

We definitely should be able to optimize 0xFFFFFFFF & (A + 1) to (0xFFFFFFFF & A) + 1, given that bitwise AND with 0xFFFFFFFF is a special case.
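If such a fold were added, the side condition it needs is "no carry out of the masked bits". A brute-force comparison at a deliberately narrow width (an 8-bit mask standing in for 0xFFFFFFFF, purely so the loop can be exhaustive) shows exactly when the two forms diverge:

  #include <cstdint>
  #include <cstdio>

  int main() {
    const uint32_t M = 0xFF; // narrow stand-in for 0xFFFFFFFF
    const uint32_t c = 1;    // the +1 offset
    unsigned mismatches = 0;
    for (uint32_t a = 0; a <= 0xFFFF; ++a) {
      uint32_t original  = (a + c) & M; // M & (A + C2)
      uint32_t rewritten = (a & M) + c; // (M & A) + C2
      if (original != rewritten)
        ++mismatches; // happens exactly when (a & M) + c carries past M
    }
    // With c == 1, only the 256 inputs whose low byte is 0xFF disagree.
    printf("mismatches: %u of 65536\n", mismatches);
    return 0;
  }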

We may also want to enhance bitwise-operation processing to handle more general cases.

I’d like to know your opinion on this topic.

Thanks in advance.

Alexander.

I’m not sure what you’re expecting reassociation to do here. If you can prove “gid + 1” doesn’t wrap in your example (because of some property of the get_global_id() function), the AND is a no-op, so you can just kill it. And if you can’t prove “gid + 1” doesn’t wrap, your proposed transformation is illegal.

-Eli
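(For concreteness, the wrap case being described, back at full width; the input value is made up so that its low 32 bits are all ones.)

  #include <cstdint>
  #include <cstdio>

  int main() {
    const uint64_t M = 0xFFFFFFFFull;
    uint64_t call = 0x00000000FFFFFFFFull; // low 32 bits all ones
    uint64_t original  = (call + 1) & M;   // 0
    uint64_t rewritten = (call & M) + 1;   // 0x100000000
    printf("original=0x%llx rewritten=0x%llx\n",
           (unsigned long long)original, (unsigned long long)rewritten);
    // The two GEP indices differ by 2^32 elements, so the rewrite is only
    // sound if the add provably cannot carry out of the low 32 bits.
    return 0;
  }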