Improve the gathering of elements so that unwanted EXT operations can be avoided

Hi,

I am working on a problem in which a kernel is SLP-vectorized, which leads to the generation of insertelements followed by a zext. After lowering, the assembly looks like this:

    vmovd   %r9d, %xmm0
    vpinsrb $1, (%rdi,%rsi), %xmm0, %xmm0
    vpinsrb $2, (%rdi,%rsi,2), %xmm0, %xmm0
    vpinsrb $3, (%rdi,%rcx), %xmm0, %xmm0
    vpinsrb $4, (%rdi,%rsi,4), %xmm0, %xmm0
    vpinsrb $5, (%rdi,%rax), %xmm0, %xmm0
    vpinsrb $6, (%rdi,%rcx,2), %xmm0, %xmm0
    vpinsrb $7, (%rdi,%r8), %xmm0, %xmm0
    vpmovzxbw       %xmm0, %xmm0            # xmm0 = xmm0[0],zero,xmm0[1],zero,xmm0[2],zero,xmm0[3],zero,xmm0[4],zero,xmm0[5],zero,xmm0[6],zero,xmm0[7],zero
    vpmaddwd        (%rdx), %xmm0, %xmm0

After all the vpinsrb instructions, xmm0 looks like:
xmm0 = xmm0[0],xmm0[1],xmm0[2],xmm0[3],xmm0[4],xmm0[5],xmm0[6],xmm0[7],zero,zero,zero,zero,zero,zero,zero,zero
Here vpmovzxbw performs the extension from i8 to i16. It is an expensive operation, and I want to remove it.

Optimization
Place each value at its correct final location while inserting, so that the zext can be avoided.
During lowering, we can write a custom LowerOperation for the ZERO_EXTEND_VECTOR_INREG opcode and override the current default handling in the legalization step.
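To see why placing each byte at an even offset makes the vpmovzxbw redundant, here is a small Python sketch (with made-up input bytes) comparing the two register layouts on a little-endian target:

```python
import struct

# Hypothetical stand-ins for the 8 loaded i8 values.
src = [1, 2, 3, 250, 5, 6, 7, 255]

# Current lowering: bytes inserted contiguously, then zero-extended
# i8 -> i16 by vpmovzxbw; each element becomes a little-endian u16.
zext_bytes = struct.pack("<8H", *src)

# Proposed lowering: insert each byte directly at offset 2*i of a
# zeroed 16-byte register (vpinsrb $2*i semantics).
reg = bytearray(16)
for i, b in enumerate(src):
    reg[2 * i] = b

# Identical register contents, so no vpmovzxbw is needed.
assert bytes(reg) == zext_bytes
```

The equality only holds because the register starts zeroed, which is why the high bytes of each 16-bit lane come out as zero.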

The proposed changes are shown below:

    vpinsrb $2, (%rdi,%rsi), %xmm0, %xmm0
    vpinsrb $4, (%rdi,%rsi,2), %xmm0, %xmm0
    vpinsrb $6, (%rdi,%rcx), %xmm0, %xmm0
    vpinsrb $8, (%rdi,%rsi,4), %xmm0, %xmm0
    vpinsrb $10, (%rdi,%rax), %xmm0, %xmm0
    vpinsrb $12, (%rdi,%rcx,2), %xmm0, %xmm0
    vpinsrb $14, (%rdi,%r8), %xmm0, %xmm0 # xmm0 = xmm0[0],zero,xmm0[1],zero,xmm0[2],zero,xmm0[3],zero,xmm0[4],zero,xmm0[5],zero,xmm0[6],zero,xmm0[7],zero
    vpmaddwd        (%rdx), %xmm0, %xmm0

Is my approach sound?
What is the convention or practice our community follows for handling such a custom operation?
In which part of the SelectionDAG framework should this go?

Thanks in advance.

Thanks,
Rohit Aggarwal

Gentle Reminder!

CC @RKSimon

@rohitaggarwal007 can you paste the IR here? Using interleaved vpinsrb looks good to me, but we need to clear xmm0 at the beginning.

Hi @phoebe

I am attaching the file for the full IR.

This file was generated with -S -emit-llvm: sample_gather_llvm.txt (3.4 KB)

The IR before DAG is present in IR_before_DAG.txt (2.9 KB)

Snippet from IR_before_DAG.txt

%7 = load i8, ptr %arrayidx.7, align 1, !tbaa !5
%8 = insertelement <8 x i8> poison, i8 %0, i64 0
%9 = insertelement <8 x i8> %8, i8 %1, i64 1
%10 = insertelement <8 x i8> %9, i8 %2, i64 2
%11 = insertelement <8 x i8> %10, i8 %3, i64 3
%12 = insertelement <8 x i8> %11, i8 %4, i64 4
%13 = insertelement <8 x i8> %12, i8 %5, i64 5
%14 = insertelement <8 x i8> %13, i8 %6, i64 6
%15 = insertelement <8 x i8> %14, i8 %7, i64 7
%16 = zext <8 x i8> %15 to <8 x i32>
%17 = load <8 x i16>, ptr %b, align 2, !tbaa !8
%18 = sext <8 x i16> %17 to <8 x i32>
%19 = mul <8 x i32> %18, %16
%20 = shufflevector <8 x i32> %19, <8 x i32> %19, <4 x i32> <i32 0, i32 2, i32 4, i32 6>
%21 = shufflevector <8 x i32> %19, <8 x i32> %19, <4 x i32> <i32 1, i32 3, i32 5, i32 7>
%22 = add <4 x i32> %20, %21
%23 = shufflevector <4 x i32> %22, <4 x i32> zeroinitializer, <8 x i32> <i32 0, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7>
%rdx.shuf = shufflevector <8 x i32> %23, <8 x i32> poison, <8 x i32> <i32 4, i32 5, i32 6, i32 7, i32 poison, i32 poison, i32 poison, i32 poison>
%bin.rdx = add <8 x i32> %23, %rdx.shuf
%rdx.shuf12 = shufflevector <8 x i32> %bin.rdx, <8 x i32> poison, <8 x i32> <i32 2, i32 3, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison>
%bin.rdx13 = add <8 x i32> %bin.rdx, %rdx.shuf12
%rdx.shuf14 = shufflevector <8 x i32> %bin.rdx13, <8 x i32> poison, <8 x i32> <i32 1, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison>
%bin.rdx15 = add <8 x i32> %bin.rdx13, %rdx.shuf14
%24 = extractelement <8 x i32> %bin.rdx15, i32 0
ret i32 %24

Folding this:

    t102: v8i8 = BUILD_VECTOR t9, t11, t16, t20, t25, t29, t33, t37
    t100: v8i16 = zero_extend t102

to:

    t102: v16i8 = BUILD_VECTOR t9, 0, t11, 0, t16, 0, t20, 0, t25, 0, t29, 0, t33, 0, t37, 0
    t100: v8i16 = bitcast t102

Seems like a good combine to investigate - this can probably be attempted generically in DAGCombiner::visitZERO_EXTEND with some suitable legality checks.
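As a sanity check of the fold, here is a small Python sketch (the `widen_build_vector` helper is hypothetical, not LLVM code) that interleaves the BUILD_VECTOR operands with zeros and verifies that a bitcast of the widened vector reads back as the zero-extended values on a little-endian target:

```python
import struct

def widen_build_vector(elts, src_bytes, dst_bytes):
    """Interleave build_vector operands with zeros so that bitcasting the
    widened vector matches zero_extend of the original (little-endian)."""
    stride = dst_bytes // src_bytes
    out = []
    for e in elts:
        out.append(e)
        out.extend([0] * (stride - 1))  # zero padding after each element
    return out

# v8i8 zero_extend -> v8i16, rewritten as v16i8 BUILD_VECTOR + bitcast.
# Numeric stand-ins for the t9..t37 operands above.
elts = [9, 11, 16, 20, 25, 29, 33, 37]
widened = widen_build_vector(elts, src_bytes=1, dst_bytes=2)

# Reinterpreting the 16 interleaved bytes as 8 little-endian u16 lanes
# yields exactly the zero-extended elements.
assert struct.unpack("<8H", bytes(widened)) == tuple(elts)
```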


Thank you, let me try it.

Should this hold true for these cases also?

  1. t102: v4i8 = BUILD_VECTOR t9, t11, t16, t20
    t100: v4i32 = zero_extend t102
    to:
    t102: v16i8 = BUILD_VECTOR t9, 0, 0, 0, t11, 0, 0, 0, t16, 0, 0, 0, t20, 0, 0, 0
    t100: v4i32 = bitcast t102

  2. t102: v2i8 = BUILD_VECTOR t9, t11
    t100: v2i64 = zero_extend t102
    to:
    t102: v16i8 = BUILD_VECTOR t9, 0, 0, 0, 0, 0, 0, 0, t11, 0, 0, 0, 0, 0, 0, 0
    t100: v2i64 = bitcast t102

Thanks,
Rohit Aggarwal

SGTM (these are all for little endian of course)
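A quick Python check of the wider strides above, which also shows why the little-endian caveat matters: the same interleaved bytes reinterpreted as big-endian no longer match, so the combine needs an endianness guard.

```python
import struct

# Case 2: v2i8 zero_extend -> v2i64, element byte followed by 7 zeros.
elts = [9, 11]
widened = bytes(b for e in elts for b in [e] + [0] * 7)

# Little-endian bitcast reads back the zero-extended values...
assert struct.unpack("<2Q", widened) == (9, 11)
# ...but the big-endian interpretation of the same bytes does not.
assert struct.unpack(">2Q", widened) != (9, 11)

# Case 1 works the same way: v4i8 -> v4i32, stride 4.
w4 = bytes(b for e in [9, 11, 16, 20] for b in [e, 0, 0, 0])
assert struct.unpack("<4I", w4) == (9, 11, 16, 20)
```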