We are building a neural network compiler using LLVM, see https://github.com/Microsoft/ELL.
We want to put the neural network weights into a bunch of global float arrays because it allows us to more easily leverage
Flash RAM on small embedded devices. For example, it enables these kinds of scenarios:
We are seeing some pretty bad compile-time performance in some cases. For example, this GitHub gist contains a bitcode file for a neural network compiled by ELL, with about 30 MB of floating-point data. Putting it through llc takes 262 seconds (on an Intel(R) Xeon(R) CPU E5-1650 0 @ 3.20GHz), but if we strip out the weights, the “code” component of our neural network inference takes only 2 seconds to compile.
We’ve noticed a good improvement in LLVM 8.0 in this area, but we think there’s still a lot more that could be done. For example,
is it possible to dump big arrays of global floating-point data into a binary without incurring huge assembly writer overhead?
Perhaps the optimizer is trying to optimize away unused floats, but we would like to disable that and just
tell the compiler to dump the floats into the object file without trying to optimize them.
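
For anyone who wants to reproduce the data-only cost without our full model, here is a minimal sketch of a generator for a standalone LLVM IR file containing one large global float array. The function names, the global name `weights`, and the random values are all made up for illustration; this is not how ELL emits its weights, just an approximation of the shape of the data.

```python
import random
import struct


def float_to_ir(x: float) -> str:
    """Format x as an LLVM IR float constant.

    LLVM IR writes float constants as the hex encoding of a double, and the
    value must be exactly representable in single precision, so we round to
    float32 first and then print the 64-bit pattern of that rounded value.
    """
    as_f32 = struct.unpack("<f", struct.pack("<f", x))[0]
    bits = struct.unpack("<Q", struct.pack("<d", as_f32))[0]
    return f"float 0x{bits:016X}"


def make_weights_ir(count: int, name: str = "weights", seed: int = 0) -> str:
    """Return IR text defining one constant global array of `count` floats."""
    rng = random.Random(seed)  # fixed seed so repeated runs produce the same file
    values = ", ".join(float_to_ir(rng.uniform(-1.0, 1.0)) for _ in range(count))
    return f"@{name} = constant [{count} x float] [{values}], align 4\n"


def write_weights_ir(path: str, count: int) -> None:
    """Write the generated IR to `path` (hypothetical helper for convenience)."""
    with open(path, "w") as f:
        f.write(make_weights_ir(count))
```

Calling `write_weights_ir("weights.ll", 7_500_000)` gives roughly 30 MB of float data (7.5 million 4-byte elements). One experiment this makes possible is timing `llc -filetype=obj weights.ll -o weights.o` against `llc -filetype=asm weights.ll -o weights.s` to see how much of the cost comes from textual assembly output. We have not measured this ourselves, so treat it as something to try rather than a known fix.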