Improve the gathering of elements so that unwanted EXT operations can be avoided

Hi,

I am working on a problem in which a kernel is SLP-vectorized, which leads to the generation of insertelements followed by a zext. After lowering, the assembly looks like this:

    vmovd   %r9d, %xmm0
    vpinsrb $1, (%rdi,%rsi), %xmm0, %xmm0
    vpinsrb $2, (%rdi,%rsi,2), %xmm0, %xmm0
    vpinsrb $3, (%rdi,%rcx), %xmm0, %xmm0
    vpinsrb $4, (%rdi,%rsi,4), %xmm0, %xmm0
    vpinsrb $5, (%rdi,%rax), %xmm0, %xmm0
    vpinsrb $6, (%rdi,%rcx,2), %xmm0, %xmm0
    vpinsrb $7, (%rdi,%r8), %xmm0, %xmm0
    vpmovzxbw       %xmm0, %xmm0            # xmm0 = xmm0[0],zero,xmm0[1],zero,xmm0[2],zero,xmm0[3],zero,xmm0[4],zero,xmm0[5],zero,xmm0[6],zero,xmm0[7],zero
    vpmaddwd        (%rdx), %xmm0, %xmm0

After all the vpinsrb instructions, xmm0 looks like:
xmm0 = xmm0[0],xmm0[1],xmm0[2],xmm0[3],xmm0[4],xmm0[5],xmm0[6],xmm0[7],zero,zero,zero,zero,zero,zero,zero,zero
Here vpmovzxbw performs the extension from i8 to i16. It is an expensive operation, and I want to remove it.

Optimization
Place each value at its correct final location while inserting, so that the zext can be avoided.
During lowering, we can write a custom LowerOperation for the ZERO_EXTEND_VECTOR_INREG opcode and override the current default handling in the legalization step.
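To see why placing each byte at an even offset makes the vpmovzxbw redundant, here is a small Python sketch (with made-up input bytes) comparing the two register layouts on a little-endian target:

```python
import struct

# Hypothetical stand-ins for the 8 loaded i8 values.
src = [1, 2, 3, 250, 5, 6, 7, 255]

# Current lowering: bytes inserted contiguously, then zero-extended
# i8 -> i16 by vpmovzxbw; each element becomes a little-endian u16.
zext_bytes = struct.pack("<8H", *src)

# Proposed lowering: insert each byte directly at offset 2*i of a
# zeroed 16-byte register (vpinsrb $2*i semantics).
reg = bytearray(16)
for i, b in enumerate(src):
    reg[2 * i] = b

# Identical register contents, so no vpmovzxbw is needed.
assert bytes(reg) == zext_bytes
```

The equality only holds because the register starts zeroed, which is why the high bytes of each 16-bit lane come out as zero.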

The proposed changes are shown below:

    vpinsrb $2, (%rdi,%rsi), %xmm0, %xmm0
    vpinsrb $4, (%rdi,%rsi,2), %xmm0, %xmm0
    vpinsrb $6, (%rdi,%rcx), %xmm0, %xmm0
    vpinsrb $8, (%rdi,%rsi,4), %xmm0, %xmm0
    vpinsrb $10, (%rdi,%rax), %xmm0, %xmm0
    vpinsrb $12, (%rdi,%rcx,2), %xmm0, %xmm0
    vpinsrb $14, (%rdi,%r8), %xmm0, %xmm0 # xmm0 = xmm0[0],zero,xmm0[1],zero,xmm0[2],zero,xmm0[3],zero,xmm0[4],zero,xmm0[5],zero,xmm0[6],zero,xmm0[7],zero
    vpmaddwd        (%rdx), %xmm0, %xmm0

Is my approach sound?
What is the convention or practice our community follows for handling such a custom operation?
In which part of the SelectionDAG framework should this go?

Thanks in advance.

Thanks,
Rohit Aggarwal

Gentle Reminder!

CC @RKSimon

@rohitaggarwal007 can you paste the IR here? Using interleaved vpinsrb looks good to me, but we need to clear xmm0 at the beginning.

Hi @phoebe

I am attaching the file for the full IR.

This file was generated with -S -emit-llvm: sample_gather_llvm.txt (3.4 KB)

The IR before DAG is present in IR_before_DAG.txt (2.9 KB)

Snippet from IR_before_DAG.txt

%7 = load i8, ptr %arrayidx.7, align 1, !tbaa !5
%8 = insertelement <8 x i8> poison, i8 %0, i64 0
%9 = insertelement <8 x i8> %8, i8 %1, i64 1
%10 = insertelement <8 x i8> %9, i8 %2, i64 2
%11 = insertelement <8 x i8> %10, i8 %3, i64 3
%12 = insertelement <8 x i8> %11, i8 %4, i64 4
%13 = insertelement <8 x i8> %12, i8 %5, i64 5
%14 = insertelement <8 x i8> %13, i8 %6, i64 6
%15 = insertelement <8 x i8> %14, i8 %7, i64 7
%16 = zext <8 x i8> %15 to <8 x i32>
%17 = load <8 x i16>, ptr %b, align 2, !tbaa !8
%18 = sext <8 x i16> %17 to <8 x i32>
%19 = mul <8 x i32> %18, %16
%20 = shufflevector <8 x i32> %19, <8 x i32> %19, <4 x i32> <i32 0, i32 2, i32 4, i32 6>
%21 = shufflevector <8 x i32> %19, <8 x i32> %19, <4 x i32> <i32 1, i32 3, i32 5, i32 7>
%22 = add <4 x i32> %20, %21
%23 = shufflevector <4 x i32> %22, <4 x i32> zeroinitializer, <8 x i32> <i32 0, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7>
%rdx.shuf = shufflevector <8 x i32> %23, <8 x i32> poison, <8 x i32> <i32 4, i32 5, i32 6, i32 7, i32 poison, i32 poison, i32 poison, i32 poison>
%bin.rdx = add <8 x i32> %23, %rdx.shuf
%rdx.shuf12 = shufflevector <8 x i32> %bin.rdx, <8 x i32> poison, <8 x i32> <i32 2, i32 3, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison>
%bin.rdx13 = add <8 x i32> %bin.rdx, %rdx.shuf12
%rdx.shuf14 = shufflevector <8 x i32> %bin.rdx13, <8 x i32> poison, <8 x i32> <i32 1, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison>
%bin.rdx15 = add <8 x i32> %bin.rdx13, %rdx.shuf14
%24 = extractelement <8 x i32> %bin.rdx15, i32 0
ret i32 %24

Folding this:

    t102: v8i8 = BUILD_VECTOR t9, t11, t16, t20, t25, t29, t33, t37
    t100: v8i16 = zero_extend t102

to:

    t102: v16i8 = BUILD_VECTOR t9, 0, t11, 0, t16, 0, t20, 0, t25, 0, t29, 0, t33, 0, t37, 0
    t100: v8i16 = bitcast t102

Seems like a good combine to investigate - this can probably be attempted generically in DAGCombiner::visitZERO_EXTEND with some suitable legality checks.
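As a sanity check of the fold, here is a small Python sketch (the `widen_build_vector` helper is hypothetical, not LLVM code) that interleaves the BUILD_VECTOR operands with zeros and verifies that a bitcast of the widened vector reads back as the zero-extended values on a little-endian target:

```python
import struct

def widen_build_vector(elts, src_bytes, dst_bytes):
    """Interleave build_vector operands with zeros so that bitcasting the
    widened vector matches zero_extend of the original (little-endian)."""
    stride = dst_bytes // src_bytes
    out = []
    for e in elts:
        out.append(e)
        out.extend([0] * (stride - 1))  # zero padding after each element
    return out

# v8i8 zero_extend -> v8i16, rewritten as v16i8 BUILD_VECTOR + bitcast.
# Numeric stand-ins for the t9..t37 operands above.
elts = [9, 11, 16, 20, 25, 29, 33, 37]
widened = widen_build_vector(elts, src_bytes=1, dst_bytes=2)

# Reinterpreting the 16 interleaved bytes as 8 little-endian u16 lanes
# yields exactly the zero-extended elements.
assert struct.unpack("<8H", bytes(widened)) == tuple(elts)
```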


Thank you, let me try it.

Should this hold true for these cases also?

  1. t102: v4i8 = BUILD_VECTOR t9, t11, t16, t20
    t100: v4i32 = zero_extend t102
    to:
    t102: v16i8 = BUILD_VECTOR t9, 0, 0, 0, t11, 0, 0, 0, t16, 0, 0, 0, t20, 0, 0, 0
    t100: v4i32 = bitcast t102

  2. t102: v2i8 = BUILD_VECTOR t9, t11
    t100: v2i64 = zero_extend t102
    to:
    t102: v16i8 = BUILD_VECTOR t9, 0, 0, 0, 0, 0, 0, 0, t11, 0, 0, 0, 0, 0, 0, 0
    t100: v2i64 = bitcast t102

Thanks,
Rohit Aggarwal

SGTM (these are all for little endian of course)
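A quick Python check of the wider strides above, which also shows why the little-endian caveat matters: the same interleaved bytes reinterpreted as big-endian no longer match, so the combine needs an endianness guard.

```python
import struct

# Case 2: v2i8 zero_extend -> v2i64, element byte followed by 7 zeros.
elts = [9, 11]
widened = bytes(b for e in elts for b in [e] + [0] * 7)

# Little-endian bitcast reads back the zero-extended values...
assert struct.unpack("<2Q", widened) == (9, 11)
# ...but the big-endian interpretation of the same bytes does not.
assert struct.unpack(">2Q", widened) != (9, 11)

# Case 1 works the same way: v4i8 -> v4i32, stride 4.
w4 = bytes(b for e in [9, 11, 16, 20] for b in [e, 0, 0, 0])
assert struct.unpack("<4I", w4) == (9, 11, 16, 20)
```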