Optimization generate super long function definition

Hi all,

sorry to have sent the same question around. I am quite desperately looking for a solution to this problem and I figured the mailing list is the best bet.

In my code, I generate the following function:

define i32 @gl.qi([500 x i32] %x, i32 %i) {
entry:
  %x. = alloca [500 x i32]
  %i. = alloca i32
  %0 = alloca [500 x i32]
  store [500 x i32] %x, [500 x i32]* %x.
  store i32 %i, i32* %i.
  %x.1 = load [500 x i32], [500 x i32]* %x.
  %i.2 = load i32, i32* %i.
  store [500 x i32] %x.1, [500 x i32]* %0
  %1 = icmp slt i32 %i.2, 500
  br i1 %1, label %in-bound, label %out-of-bound

out-of-bound:                                     ; preds = %entry
  call void @gen.panic(i8* getelementptr inbounds ([22 x i8], [22 x i8]* @pool.str.2, i32 0, i32 0))
  unreachable

in-bound:                                         ; preds = %entry
  %2 = getelementptr inbounds [500 x i32], [500 x i32]* %0, i32 0, i32 %i.2
  %idx = load i32, i32* %2
  ret i32 %idx
}

the high level functionality is to use %i to index %x, and if %i is out of bound, a panic function is called instead.

consider the store line:

  store [500 x i32] %x, [500 x i32]* %x.

once I pass this function to opt -O1 -S --verify --verify-each, it generates code like this:

define i32 @gl.qi([500 x i32] %x, i32 %i) local_unnamed_addr {
entry:
  %0 = alloca [500 x i32], align 4
  %x.fca.0.extract = extractvalue [500 x i32] %x, 0
  %x.fca.1.extract = extractvalue [500 x i32] %x, 1
  %x.fca.2.extract = extractvalue [500 x i32] %x, 2
  %x.fca.3.extract = extractvalue [500 x i32] %x, 3
  %x.fca.4.extract = extractvalue [500 x i32] %x, 4
  %x.fca.5.extract = extractvalue [500 x i32] %x, 5

until 500. I put the number to 50000 and it won’t stop.

This is puzzling. I am not sure why must a store command be expanded to a sequence of etractvalues then stores? Is there a way to turn off this particular optimization without turning off the whole optimization?

Or am I looking at the wrong way to do this simple task?

Hey Jason Hu,

So I think this is the SROA pass breaking up the load/store of the aggregate (the [500 x i32] array) into individual load/stores so that it can see if any are unused or can have their stores forwarded. This is a bit of LLVM that I personally find pretty dumb - but if you look at how Clang generates code for your above pattern it’ll output memcpy intrinsics instead of doing load/stores, which gets around this issue.

Note: the compiler hasn’t hung, it’s just spinning and comparing a few thousand instructions generated from SROA against each other, you might need to wait until the end of the universe but it should complete eventually :wink:

Cheers,
-Neil.

Hey Neil,

thank you for replying and the information.

I think it is dumb… I have changed my implementation to passing a byval array pointer. In this case, this optimization doesn’t trigger.

I think the expansion in this optimization is quite useless because byval array pointer achieves the same (I suppose) while the generated size remains small. This reveals unnecessarily complex execution detail.

One solution would be to teach SROA that there is nothing to propagate here so splitting the struct is useless.

That said, it would be split roughly the same way by the backend as you treat the array as a value.

Passing a (byval) pointer seems like the right option to me.

Cheers,

 Johannes