mem2reg optimization

Hi all,

While profiling LLVM I noticed that a significant amount of time is spent in the RewriteSingleStoreAlloca function. Especially for fairly long basic blocks, linearly searching for uses before a store isn’t fast. So I was wondering if there isn’t a faster way to locate these uses?

Are there any other known opportunities for optimization? I noticed that register allocation takes a big bite out of JIT compilation performance, and there’s a big reliance on STL containers (which are often not as fast as specialized custom containers).

Thanks,

Nicolas Capens

Hi all,

While profiling LLVM I noticed that a significant amount of time is spent in the RewriteSingleStoreAlloca function. Especially for fairly long basic blocks, linearly searching for uses before a store isn’t fast. So I was wondering if there isn’t a faster way to locate these uses?

This should be easy to fix, particularly if there are many “single store allocas” being accessed in the same block. The check is verifying that the store happens before any loads in the same block. Can you provide a .bc testcase please?
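For reference, the check in question amounts to something like the sketch below (simplified stand-in types and a made-up helper name, `loadBeforeStore` -- not the actual PromoteMemoryToRegister code):

```cpp
#include <string>
#include <vector>

// Simplified stand-in for llvm::Instruction; the real code walks
// BasicBlock iterators and dyn_cast<>s each instruction.
struct Instruction {
  std::string Kind;   // "load", "store", or anything else
  Instruction *Ptr;   // the alloca this load/store touches, if any
};

// True if some load of Alloca appears before OnlyStore in the block.
// In that case the single-store fast path can't promote the alloca,
// because the load would observe the value from before the store.
// Note the O(block length) scan per query -- the hotspot in question.
bool loadBeforeStore(const std::vector<Instruction *> &Block,
                     const Instruction *Alloca,
                     const Instruction *OnlyStore) {
  for (const Instruction *I : Block) {
    if (I == OnlyStore)
      return false;                          // hit the store first: safe
    if (I->Kind == "load" && I->Ptr == Alloca)
      return true;                           // a load precedes the store
  }
  return false;
}
```

With many single-store allocas in one long block, this scan is repeated per alloca, which is where the quadratic behavior comes from.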

Are there any other known opportunities for optimization? I noticed that register allocation takes a big bite out of JIT compilation performance, and there’s a big reliance on STL containers (which are often not as fast as specialized custom containers).

There are many possibilities and many people are working on reducing compile time. If you’re hitting some very slow case, it is best to file a bugzilla with a testcase.

-Chris

YES! I just finished a patch 2 hours ago to fix this. I'll get it in ASAP.
mem2reg totally dropped off the radar once I did this.

                                                 -Dave

Hi Dave,

Did that patch of yours ever make it into trunk? I can't seem to find any
related checkin for PromoteMemoryToRegister.cpp. I've been doing some extra
profiling lately and the RewriteSingleStoreAlloca function alone is taking a
whopping 63% of execution time.

Thanks!

Nicolas

I will commit it today along with some other things. I've been having a lot
of trouble building llvm-gcc but I think I've struggled enough now and will
just hope our testing on this end has been enough.

Thanks for the prod.

                                          -Dave

Hi Dave,

As an exercise I tried to fix this myself, and I think I have a working
patch (attached). My own tests are all working wonderfully, and at fantastic
performance!

I'm looking forward to your patch to see whether we used the same approach
or whether things could be improved further.

Anyway, I've re-profiled the code and found ComputeLiveInBlocks to be the
main hotspot now. Again it uses a linear search over the instructions so
there's a chance this can be optimized as well...

Kind regards,

Nicolas

PromoteMemoryToRegister.patch (2.64 KB)

I forgot to ask: I found that uses appear to be stored in reverse order
(i.e. use_begin() gives us an iterator that starts with the last use and
incrementing it brings us to the first one). Is that correct and is it
something I can always rely on for analysis? Thanks.

I don't quite understand how the logic of your patch is the same. In the
original code, we're looking for a Load that uses the alloca before it is
stored. Your code loops through the use list, first looking for the use
that is the store. It then keeps going and if it sees a match with a load,
it prevents promotion.

Are use lists guaranteed to be ordered in reverse instruction order? That
is, do instructions that use values later appear earlier in the use list?
Can we really rely on the ordering of the use list?

My patch builds a map from BasicBlock to lists of loads and stores in that
block in an initialization phase along with ordering information about the
loads and stores. RewriteSingleStoreAlloca then queries that information
to determine if a load appears before the single store.
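In outline it's something like the following sketch (simplified stand-in types and made-up names like `collectMemOps`, not the patch itself):

```cpp
#include <string>
#include <unordered_map>
#include <vector>

// Simplified stand-ins for the LLVM classes.
struct Instruction {
  std::string Kind;  // "load", "store", or anything else
  Instruction *Ptr;  // the alloca a load/store touches, if any
};
struct BasicBlock {
  std::vector<Instruction *> Insts;  // in program order
};

struct MemOp {
  Instruction *I;
  unsigned Pos;  // position of I within its block
};
using BlockMemOps = std::unordered_map<BasicBlock *, std::vector<MemOp>>;

// Initialization phase: one pass over the function records every load and
// store per block, in program order, with its position.
BlockMemOps collectMemOps(const std::vector<BasicBlock *> &Fn) {
  BlockMemOps Map;
  for (BasicBlock *BB : Fn) {
    unsigned Pos = 0;
    for (Instruction *I : BB->Insts) {
      if (I->Kind == "load" || I->Kind == "store")
        Map[BB].push_back({I, Pos});
      ++Pos;
    }
  }
  return Map;
}

// Query phase: RewriteSingleStoreAlloca scans only the (short) memop list
// instead of the whole block to see if a load precedes the single store.
bool loadBeforePos(const BlockMemOps &Map, BasicBlock *BB,
                   const Instruction *Alloca, unsigned StorePos) {
  auto It = Map.find(BB);
  if (It == Map.end())
    return false;
  for (const MemOp &M : It->second)
    if (M.Pos < StorePos && M.I->Kind == "load" && M.I->Ptr == Alloca)
      return true;
  return false;
}
```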

I like your approach of using the use lists but I'm not sure the ordering is
guaranteed. If it is, your approach is superior.

                                                    -Dave

I got my patch updated to work with TOT. Here it is. Comments welcome.
I still prefer Nicolas' solution if his use list ordering assumption is valid.

                                                       -Dave

mem2reg.patch (23.8 KB)

Isn't the ordering an artifact of the creation of the users? If you
run mem2reg after other optimizations, why would you still have any
ordering guarantees?

Andrew

>> I like your approach of using the use lists but I'm not sure the
>> ordering is guaranteed. If it is, your approach is superior.
>
> I got my patch updated to work with TOT. Here it is. Comments welcome.
> I still prefer Nicolas' solution if his use list ordering assumption is
> valid.

Isn't the ordering an artifact of the creation of the users? If you

I think so.

run mem2reg after other optimizations, why would you still have any
ordering guarantees?

Yes, that's exactly what I'm worried about and why I didn't use the use lists
in my solution.

I'll check in my patch if folks find it acceptable. It was done pretty
quickly so I may make another pass over it to clean it up a bit.

                                                     -Dave

Hi Andrew,

I can't see any reason not to run mem2reg as the first pass (except maybe running a CFG simplify first, which doesn't alter the use list as far as I know). Also, I doubt that any pass can actually make a valid change in the order of loads and stores. And finally, why would any pass want to change the order of the use list in the first place?

Only passes that create new stores and/or loads and don't insert them into the use list in the same reverse order could cause problems. But as far as I know that's only during codegen (e.g. specific instructions that need an operand to be in memory, and spills and loads added during register allocation).

Anyway, my experience with LLVM is quite limited so I might be overlooking something important. So if Dave's solution works in all cases and has comparable performance then that's obviously better. I'll try to compare performance early next week and keep you posted.

Kind regards,

Nicolas

Hi all,

I haven't had a chance yet to do some accurate profiling, but does anyone
have some more insight into the use list order? *bump*

It seems very valuable to me (from a performance standpoint) to keep the use
list in (reverse) order. But if this can't be guaranteed it might still be
interesting to have some fast variants of passes that do make some
assumptions (like my patch for mem2reg). I believe I can optimize more
passes by making such assumptions about the order of the use lists, so any
and all information about that is welcome.

Thanks a lot,

Nicolas

Hi all,

I haven't had a chance yet to do some accurate profiling, but does anyone
have some more insight into the use list order? *bump*

I thought this was answered; use lists are not currently in
any intentional order.

It seems very valuable to me (from a performance standpoint) to keep the
use list in (reverse) order. But if this can't be guaranteed it might still
be interesting to have some fast variants of passes that do make some
assumptions (like my patch for mem2reg). I believe I can optimize more
passes by making such assumptions about the order of the use lists, so any
and all information about that is welcome.

Keeping use lists sorted, in general, seems expensive. Take a pass
like LICM; when an Instruction is hoisted from a loop, its position
in any use lists it's on may need to change if they are to remain
sorted. Determining the new position would require finding out
where the other instructions in the list are.

The property of being able to arbitrarily reorder LLVM passes
is an important design goal. I don't think we want to have variants
that make assumptions about use list ordering unless there are
very significant advantages.

That said, it's an interesting idea. Would you be able to
perform the same optimizations using an analysis pass that
provides a topological ordering for all Instructions?

Dan

I agree with Dan here. It would be an interesting property, but it just isn't feasible to maintain any sane ordering here.

-Chris

My patch builds a map from BasicBlock to lists of loads and stores in that
block in an initialization phase along with ordering information about the
loads and stores. RewriteSingleStoreAlloca then queries that information
to determine if a load appears before the single store.

I like your approach of using the use lists but I'm not sure the ordering
is guaranteed. If it is, your approach is superior.

I got my patch updated to work with TOT. Here it is. Comments welcome.

Hi Dave,

Great. I'd like to get this in, but would really like to do a profile first. Could you please include a testcase? Even if it is something contrived with llvm-gcc, I would really like to understand what is going on.

I still prefer Nicolas' solution if his use list ordering assumption is valid.

It isn't.

-Chris

I like your approach of using the use lists but I'm not sure the
ordering
is guaranteed. If it is, your approach is superior.

I got my patch updated to work with TOT. Here it is. Comments
welcome.

Hi Dave,

Great. I'd like to get this in, but would really like to do a profile
first. Could you please include a testcase? Even if it is something
contrived with llvm-gcc, I would really like to understand what is
going on.

Looking more closely at the approach, I infer that the (logical in hindsight) issue is cases where mem2reg scans a BB to see the relative ordering of two instructions. For example, the single store case. I could imagine that if you saw lots of single store allocas in the middle of a large BB that mem2reg would be very very slow.

As far as the approach goes, here's a counterproposal that is a bit less
complex. I suggest adding a single DenseMap<Instruction*, unsigned> to
Mem2reg. If there is an entry in this map for a load/store instruction, it
would store the instruction's index in its basic block. This means that to
check dominance of two instrs in a block, you just need to compare their two
indices (which is fast). Really small blocks (< 10 instructions?) would
never be entered into the map, to keep the size of the map low. The first
time a query is made within a large block, it would compute the indices of
all loads and stores within the block.

The nice thing about using a single map is that it makes updating the map really trivial. Just use TheMap.erase(I) when deleting load and store instructions. The numbering can be purely local to a block and the relative ordering of memops doesn't get invalidated when a load or store is nuked.
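Concretely, the scheme could look something like this sketch (simplified stand-in types and hypothetical names like `MemOpIndex`; the real thing would use an llvm::DenseMap keyed on Instruction*, and would skip tiny blocks):

```cpp
#include <string>
#include <unordered_map>
#include <vector>

// Simplified stand-ins for the LLVM classes.
struct BasicBlock;
struct Instruction {
  std::string Kind;    // "load", "store", or anything else
  BasicBlock *Parent;  // owning block
};
struct BasicBlock {
  std::vector<Instruction *> Insts;  // in program order
};

// Lazy per-block numbering of loads and stores. A block is numbered the
// first time a query touches it; erasing a deleted load/store leaves the
// relative order of the surviving entries intact.
struct MemOpIndex {
  std::unordered_map<Instruction *, unsigned> Index;  // DenseMap in LLVM
  std::unordered_map<BasicBlock *, bool> Numbered;

  void numberBlock(BasicBlock *BB) {
    unsigned N = 0;
    for (Instruction *I : BB->Insts) {
      if (I->Kind == "load" || I->Kind == "store")
        Index[I] = N;
      ++N;  // positions count all instructions, not just memops
    }
    Numbered[BB] = true;
  }

  // Does A execute before B? Both must be loads/stores in the same block.
  // One O(n) numbering pass per block, then O(1) per query.
  bool comesBefore(Instruction *A, Instruction *B) {
    BasicBlock *BB = A->Parent;
    if (!Numbered[BB])
      numberBlock(BB);
    return Index[A] < Index[B];
  }

  // Call when deleting a load or store; the other indices stay valid.
  void onDelete(Instruction *I) { Index.erase(I); }
};
```

The "< 10 instructions" cutoff is omitted here for brevity.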

Does this seem like a reasonable approach? If so, please split the various parts of mem2reg out into helper methods as a prepatch. This will make review of the subsequent changes much easier.

-Chris

Great. I'd like to get this in, but would really like to do a profile
first. Could you please include a testcase? Even if it is something
contrived with llvm-gcc, I would really like to understand what is
going on.

Agreed. I believe I have a testcase that shows the problem.

> I still prefer Nicolas' solution if his use list ordering assumption
> is valid.

It isn't.

Ok then, I'll file an official bug with the testcase and then check this in to
resolve the bug.

                                                -Dave

Looking more closely at the approach, I infer that the (logical in
hindsight) issue is cases where mem2reg scans a BB to see the
relative ordering of two instructions. For example, the single store

Correct.

case. I could imagine that if you saw lots of single store allocas in
the middle of a large BB that mem2reg would be very very slow.

Right.

As far as the approach goes, here's a counterproposal that is a bit
less complex. I suggest adding a single DenseMap<Instruction*, unsigned>
to Mem2reg. If there is an entry in this map for a load/
store instruction, it would store the instruction's index in its basic
block. This means that to check dominance of two instrs in a block,
you just need to compare their two indices (which is fast). Really
small blocks (< 10 instructions?) would never be entered into map, to
keep the size of the map low. The first time a query is made within a
large block, it would compute the indices of all loads and stores
within the block.

This is a simpler approach and should have the same effect. I'll code it up.

The nice thing about using a single map is that it makes updating the
map really trivial. Just use TheMap.erase(I) when deleting load and
store instructions. The numbering can be purely local to a block and
the relative ordering of memops doesn't get invalidated when a load or
store is nuked.

Good points.

Does this seem like a reasonable approach? If so, please split the
various parts of mem2reg out into helper methods as a prepatch. This
will make review of the subsequent changes much easier.

Which various parts are you referring to? You mean the stuff that populates
and updates the map?

                                               -Dave

As far as the approach goes, here's a counterproposal that is a bit
less complex. I suggest adding a single DenseMap<Instruction*, unsigned>
to Mem2reg. If there is an entry in this map for a load/
store instruction, it would store the instruction's index in its basic
block. This means that to check dominance of two instrs in a block,
you just need to compare their two indices (which is fast). Really
small blocks (< 10 instructions?) would never be entered into map, to
keep the size of the map low. The first time a query is made within a
large block, it would compute the indices of all loads and stores
within the block.

This is a simpler approach and should have the same effect. I'll code it up.

Great! Thanks,

Does this seem like a reasonable approach? If so, please split the
various parts of mem2reg out into helper methods as a prepatch. This
will make review of the subsequent changes much easier.

Which various parts are you referring to? You mean the stuff that populates
and updates the map?

I'm criticizing the existing code :), which is just very poorly structured. It would be nice to do a refactoring patch first which just splits the various "scan a BB" places out into helper methods, instead of them being inlined into larger pieces of code. This will make the mem2reg algo easier to read and will make the algorithmic change easier to review.

If you want, I can change one of the places that does a scan to show what I mean.

-Chris