scalarrepl fails to promote array of vector

Hi all,

I want to use the scalarrepl pass to eliminate the allocation of mat_alloc, which has type [4 x <4 x float>], in the following program.

$ cat test.ll

; ModuleID = 'test.ll'

define void @main(<4 x float>* %inArg, <4 x float>* %outArg, [4 x <4 x float>]* %constants) nounwind {
entry:
  %inArg1 = load <4 x float>* %inArg
  %mat_alloc = alloca [4 x <4 x float>]
  %matVal = load [4 x <4 x float>]* %constants
  store [4 x <4 x float>] %matVal, [4 x <4 x float>]* %mat_alloc
  %0 = getelementptr inbounds [4 x <4 x float>]* %mat_alloc, i32 0, i32 0
  %1 = load <4 x float>* %0
  %2 = fmul <4 x float> %1, %inArg1
  %3 = getelementptr inbounds [4 x <4 x float>]* %mat_alloc, i32 0, i32 1
  %4 = load <4 x float>* %3
  %5 = fmul <4 x float> %4, %inArg1
  %6 = fadd <4 x float> %2, %5
  %7 = getelementptr inbounds [4 x <4 x float>]* %mat_alloc, i32 0, i32 2
  %8 = load <4 x float>* %7
  %9 = fmul <4 x float> %8, %inArg1
  %10 = fadd <4 x float> %6, %9
  %11 = getelementptr inbounds [4 x <4 x float>]* %mat_alloc, i32 0, i32 3
  %12 = load <4 x float>* %11
  %13 = fadd <4 x float> %10, %12
  %14 = getelementptr <4 x float>* %outArg, i32 1
  store <4 x float> %13, <4 x float>* %14
  ret void
}

$ opt -S -stats -scalarrepl test.ll

No transformation is performed. I've examined the source code of scalarrepl, and it seems this pass does not handle array allocations. Is there another transformation pass I can use to eliminate this allocation?

Thanks,
David

Hi David,

ScalarRepl gets shy about loads and stores of the entire aggregate:

  %matVal = load [4 x <4 x float>]* %constants
  store [4 x <4 x float>] %matVal, [4 x <4 x float>]* %mat_alloc

It is possible to generalize scalarrepl to handle these, similarly to the way it handles memcpy, but no one has done that yet. Also, it's not generally recommended to do stuff like this, because you'll get inefficient code from many parts of the optimizer and code generator.

-Chris
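
As an illustration of the advice above to avoid loading and storing the entire aggregate, a front end could copy the matrix one vector at a time. The following is only a sketch in the same IR syntax as the thread: the function name @sum_rows, the value names, and the reduced two-row array are hypothetical, and nothing in this thread confirms how a particular LLVM version treats this exact input.

```llvm
; Hypothetical example, not taken from the thread.
define void @sum_rows(<4 x float>* %outArg, [2 x <4 x float>]* %constants) nounwind {
entry:
  %mat_alloc = alloca [2 x <4 x float>]
  ; Copy each row with a vector-typed load and store instead of a single
  ; load and store of the whole [2 x <4 x float>] aggregate.
  %src0 = getelementptr inbounds [2 x <4 x float>]* %constants, i32 0, i32 0
  %row0 = load <4 x float>* %src0
  %dst0 = getelementptr inbounds [2 x <4 x float>]* %mat_alloc, i32 0, i32 0
  store <4 x float> %row0, <4 x float>* %dst0
  %src1 = getelementptr inbounds [2 x <4 x float>]* %constants, i32 0, i32 1
  %row1 = load <4 x float>* %src1
  %dst1 = getelementptr inbounds [2 x <4 x float>]* %mat_alloc, i32 0, i32 1
  store <4 x float> %row1, <4 x float>* %dst1
  ; Every access to the alloca now goes through a constant-index GEP.
  %a = load <4 x float>* %dst0
  %b = load <4 x float>* %dst1
  %sum = fadd <4 x float> %a, %b
  store <4 x float> %sum, <4 x float>* %outArg
  ret void
}
```

With this shape the only values of array type are pointers; every load and store has type <4 x float>.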

Hi Chris,

Thanks for your reply.

You said that scalarrepl gets shy about loads and stores of the entire aggregate, so I tried this test case:

; ModuleID = 'test1.ll'
define i32 @fun(i32* nocapture %X, i32 %i) nounwind uwtable readonly {
  %stackArray = alloca <4 x i32>
  %XC = bitcast i32* %X to <4 x i32>*
  %arrayVal = load <4 x i32>* %XC
  store <4 x i32> %arrayVal, <4 x i32>* %stackArray
  %arrayVal1 = load <4 x i32>* %stackArray
  %1 = extractelement <4 x i32> %arrayVal1, i32 1
  ret i32 %1
}

$ opt -S -stats -scalarrepl test1.ll
; ModuleID = 'test1.ll'

define i32 @fun(i32* nocapture %X, i32 %i) nounwind uwtable readonly {
  %XC = bitcast i32* %X to <4 x i32>*
  %arrayVal = load <4 x i32>* %XC
  %1 = extractelement <4 x i32> %arrayVal, i32 1
  ret i32 %1
}

===-------------------------------------------------------------------------===
                           ... Statistics Collected ...
===-------------------------------------------------------------------------===

1 mem2reg - Number of alloca's promoted with a single store
1 scalarrepl - Number of allocas promoted

You can see that the stackArray is eliminated, although there are loads and [...]

Hi Fan,

I think you may be confusing arrays and vectors: there is no stack array in your example, only the vector <4 x i32>. As a general rule, hardly any optimization is done for loads and stores of arrays because front ends don't produce them much. Much more effort is made for vectors because they can be important for getting good performance.

Ciao, Duncan.
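
To make the array-versus-vector distinction concrete, here is a sketch of what an array-typed variant of the test1.ll case could look like. It is written for illustration only: the function name @fun_array and the value name %elt are hypothetical, and the thread does not report what opt does with this input.

```llvm
; Hypothetical array-typed counterpart of @fun from test1.ll.
define i32 @fun_array(i32* nocapture %X) nounwind readonly {
  ; [4 x i32] is an array (an aggregate type), not the vector <4 x i32>.
  %stackArray = alloca [4 x i32]
  %XC = bitcast i32* %X to [4 x i32]*
  %arrayVal = load [4 x i32]* %XC
  store [4 x i32] %arrayVal, [4 x i32]* %stackArray
  %arrayVal1 = load [4 x i32]* %stackArray
  ; Aggregates are indexed with extractvalue; extractelement is for vectors.
  %elt = extractvalue [4 x i32] %arrayVal1, 1
  ret i32 %elt
}
```

The two types have the same element count and element type, but they are different kinds of type in the IR, which is why they use different instructions and can be treated differently by the optimizer.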

Thanks Duncan and Chris!

I solved this problem by adding the target data layout definition at the beginning of the .ll source file. It seems that the optimization passes rely on this information during transformation; I'll figure out the details. All the allocations, including the array of vectors in the previous examples, are now eliminated.

Now my compiler can generate pretty neat and efficient code. Thanks!

Cheers!
David
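
For readers who have not seen one, a target data layout definition is a module-level line placed before the function definitions. The strings below are illustrative only: they are of the kind clang emitted for x86-64 Linux around that time, and the thread does not show which data layout or triple was actually added.

```llvm
; Illustrative values only; the actual strings used in the thread are not shown.
target datalayout = "e-p:64:64:64-i1:8:8-i8:8:8-i16:16:16-i32:32:32-i64:64:64-f32:32:32-f64:64:64-v64:64:64-v128:128:128-a0:0:64-s0:64:64-f80:128:128-n8:16:32:64"
target triple = "x86_64-unknown-linux-gnu"
```

The data layout string describes pointer size, type alignments, and endianness for the target, which is the size and alignment information a pass needs when it reasons about how an alloca is laid out.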
