[Proposal] An integrated, optimized addition to IRBuilder for function-local variables

All,

Recently, while developing a language based on LLVM (https://github.com/taekwonbilly/optricks), I began to create a framework for function-local variables that may be useful to others.

A function-local variable is simply a value stored on the stack (and created by the LLVM alloca instruction). To update it, one uses the store instruction, and to read its value, one uses the load instruction.

However, when one uses a series of load and store instructions, some store instructions may have no effect, and some load instructions may unnecessarily reload values that are already available. I propose that additional functionality be added to the llvm::IRBuilder class to solve this problem without the need for an optimization pass.

While there exist passes at a later stage that take care of this problem, they spend significant time determining when this optimization applies. Additionally, performing it while the IR is generated allows further optimizations at that stage (instead of generating the instructions and then deleting or modifying them).

My current design involves a struct with roughly the following form.

struct Variable {
    Value* pointerToMemory;
    std::map<BasicBlock*, Value*> mostRecentValue;
};

The member pointerToMemory simply holds the actual pointer to the data. The map mostRecentValue stores a cache of the most recent value of the variable in each block. If the mapped value for a block is NULL, the current value is assumed to have been modified by some external process (e.g. when passed as an argument to another function) and must be reloaded the next time it is required.

In addition, five functions should be added to IRBuilder to support the use of these variables.

The first function,
Variable* createVariable(Value* pointer)
will create the variable with the provided pointer to the memory.

The second function,
Value* getValue(Variable* var)
will return a Value* representing the contents of the variable at the end of the current BasicBlock.

It will be implemented as follows:
1. If the current block is in the map (cache) and maps to a non-null value, return that value.
2. If the current block is in the map but maps to null, emit a load instruction, update the cache with it, and return the load.
3. Otherwise, create a blank phi node, map the current block to it in the cache, and return the phi node.

The phi node acts as a placeholder and will be filled in once no more IR will be added to the function.
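As a concrete illustration, the three cases above can be sketched as follows. This is only a model: Value and BasicBlock are toy stand-ins for the LLVM classes, and emitLoad/emitPhi are hypothetical placeholders for the IRBuilder's CreateLoad/CreatePHI calls.

```cpp
#include <map>
#include <string>

// Toy stand-ins for llvm::Value / llvm::BasicBlock (illustrative only).
struct Value { std::string kind; };
using BasicBlock = int; // block id

struct Variable {
    Value* pointerToMemory;
    std::map<BasicBlock, Value*> mostRecentValue;
};

static Value* emitLoad(Variable*) { return new Value{"load"}; } // models CreateLoad
static Value* emitPhi()           { return new Value{"phi"};  } // models CreatePHI

Value* getValue(Variable* var, BasicBlock current) {
    auto it = var->mostRecentValue.find(current);
    // Case 1: cached and non-null -> reuse without touching memory.
    if (it != var->mostRecentValue.end() && it->second != nullptr)
        return it->second;
    // Case 2: cached as null -> value may have changed externally; reload.
    if (it != var->mostRecentValue.end()) {
        it->second = emitLoad(var);
        return it->second;
    }
    // Case 3: block never seen -> placeholder phi, fixed up at finalization.
    Value* phi = emitPhi();
    var->mostRecentValue[current] = phi;
    return phi;
}
```

Note that case 1 is where the savings come from: repeated reads in one block cost nothing after the first.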

The third function,
void setValue(Variable* var, Value* newVal)
will update the contents of the variable var at the end of the current BasicBlock.

This is implemented by simply setting the map’s value at the current BasicBlock to be newVal.

Setting the variable to a value of NULL using this method is considered invalid.
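A minimal sketch of setValue over the same toy model (Value and BasicBlock here are stand-ins for the LLVM types, not the real classes):

```cpp
#include <cassert>
#include <map>
#include <string>

struct Value { std::string kind; };
using BasicBlock = int;

struct Variable {
    Value* pointerToMemory;
    std::map<BasicBlock, Value*> mostRecentValue;
};

// Records newVal as the variable's value in this block; no store is emitted.
// Null is rejected because it is reserved to mean "must reload from memory".
void setValue(Variable* var, Value* newVal, BasicBlock current) {
    assert(newVal != nullptr && "null is reserved; use updatePointer to invalidate");
    var->mostRecentValue[current] = newVal;
}
```

Because no store is emitted here, dead stores (overwritten before ever being flushed) never reach the IR at all.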

The fourth function,
void updatePointer(Variable* var)
will flush any pending update of var to memory so that its pointer can safely be used by an external process.

Specifically, this will store the cached value for the current BasicBlock (if it is non-null) to memory, and then set the cache entry for the current BasicBlock to null to indicate that the variable must be reloaded from memory the next time it is used.
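Under the same toy model, the flush might look like this; emitStore is a hypothetical placeholder for the IRBuilder's CreateStore, and the global vector merely records what would have been stored:

```cpp
#include <map>
#include <string>
#include <vector>

struct Value { std::string kind; };
using BasicBlock = int;

struct Variable {
    Value* pointerToMemory;
    std::map<BasicBlock, Value*> mostRecentValue;
};

static std::vector<Value*> emittedStores; // records what would be stored

static void emitStore(Value* val, Value* /*ptr*/) { // models CreateStore
    emittedStores.push_back(val);
}

void updatePointer(Variable* var, BasicBlock current) {
    auto it = var->mostRecentValue.find(current);
    if (it != var->mostRecentValue.end() && it->second != nullptr)
        emitStore(it->second, var->pointerToMemory); // flush the cached value
    var->mostRecentValue[current] = nullptr;         // force reload next time
}
```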

The final function,
void finalizeVariablesInCurrentFunction()
would clean up all of the PHINodes created for variables in this function.

For each PHINode, this method would retrieve the cached value of each BasicBlock predecessor (which represents the last value of the variable in that block) and add it as the incoming value for that predecessor.

If there is only one predecessor, or the cached values of all predecessors are the same, the method should call replaceAllUsesWith on the PHINode with that unique value.

This replacement should then be coupled with basic optimizations such as folding the addition of two constants.
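The fixup logic can be sketched as follows, again over toy stand-ins rather than the real LLVM classes. Here, returning the unique value models the replaceAllUsesWith call; a real implementation would also erase the now-dead phi.

```cpp
#include <map>
#include <set>
#include <string>
#include <vector>

struct Value { std::string kind; };
using BasicBlock = int;

struct Variable {
    Value* pointerToMemory;
    std::map<BasicBlock, Value*> mostRecentValue;
};

// Returns what the placeholder phi should resolve to: the unique incoming
// value if all predecessors agree, otherwise the phi itself (now that its
// incoming values are known).
Value* resolvePhi(Value* phi, Variable* var,
                  const std::vector<BasicBlock>& preds) {
    std::set<Value*> incoming;
    for (BasicBlock p : preds)
        incoming.insert(var->mostRecentValue[p]); // last value in that block
    if (incoming.size() == 1)
        return *incoming.begin(); // models replaceAllUsesWith(unique value)
    return phi;                   // genuine merge point: keep the phi
}
```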

Overall, I believe that this addition will be very useful to those who use LLVM for reasons of both performance and additional functionality.

The preliminary (although terribly messy) implementation of this can be found here (https://github.com/taekwonbilly/optricks/blob/075fa3c2c89c97c010783a5d91f4a57afc63dfb3/Optricks/containers/settings.hpp) and is called “LazyLocation”.

Even if this is not added to the main LLVM source, I would appreciate any comments about it.

Thanks,
Billy Moses

> All,
>
> Recently, while developing a language based on LLVM (
> https://github.com/taekwonbilly/optricks), I began to create a framework
> for function-local variables that may be useful to others.
>
> A function-local variable is simply a value stored on the stack (and
> created by the llvm alloc instruction). When one wants to update it, one
> simply can use the store instruction, and when one wants to determine its
> value, one would use the load instruction.
>
> However, when one uses a series of load and store instructions, some store
> instructions may not have any effect and some load instructions may be
> unnecessarily recalculating values. I propose that additional functionality
> be added to the llvm:IRBuilder class that can solve this problem without
> need for an optimization pass.
>
> While there exist passes at a later stage that take care of this problem,
> they spent significant time attempting to determine when this optimization
> applies.

Can you supply some test cases and measurements (please include the raw
data as well) that exhibit the performance issue as compared to the
approach you propose?

Is this purely a performance optimization?

One thing to take into consideration is that the passes that form SSA from
these alloca/load/store patterns are function passes and with the new
PassManager and other upcoming changes, these transformations can be
parallelized at function granularity without any extra work from the
frontend. The approach you suggest puts the work of forming SSA into the
path of creating the IR for functions, which would require the frontend's
IR generation to be parallelized in order to gain performance from
parallelism.

On the other hand, if the approach you suggest can be implemented in such a
way that it avoids creating so many alloca/load/store instructions in the
first place (operating on the already-in-cache function that is being
constructed), there may be a potentially large memory bandwidth win which
results in a potentially large net speedup.

-- Sean Silva