Incremental compilation and recognizing distinct bitcode

For my project, the step of using LLVM to optimize and generate machine code for a module is much slower than everything else. I would see a significant performance improvement if I could do “incremental compilation” and avoid invoking the LLVM code generator when the underlying object has not changed.

My current strategy is as follows. For each module:

  • write bitcode out to “module.bc.new”

  • if “module.bc” exists, then compare (byte-by-byte) with “module.bc.new”. If they match, then skip compilation

  • move “module.bc.new” to “module.bc” (known to be different at this point)

  • generate “module.o” (expensive step)
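The steps above can be sketched roughly as follows. This is only an illustration in Python, not the poster's actual code: `rebuild_if_changed` and the `generate_object` callback are hypothetical names, and deleting the ".new" file on a match is an assumption, since the post does not say what happens to it.

```python
import filecmp
import os
from typing import Callable


def rebuild_if_changed(
    new_bitcode: bytes,
    bc_path: str,
    generate_object: Callable[[str], None],  # hypothetical: the expensive step
) -> bool:
    """Return True if the expensive step ran, False if it was skipped."""
    new_path = bc_path + ".new"
    with open(new_path, "wb") as f:
        f.write(new_bitcode)

    # shallow=False forces a byte-by-byte comparison of the contents
    # instead of trusting file size and modification time.
    if os.path.exists(bc_path) and filecmp.cmp(bc_path, new_path, shallow=False):
        os.remove(new_path)  # assumption: the stale ".new" file is discarded
        return False

    # The files are known to differ (or there was no previous file).
    os.replace(new_path, bc_path)
    generate_object(bc_path)
    return True
```

Note that, as written, a failure inside `generate_object` leaves the new "module.bc" in place, so the next run would see matching bitcode and skip the step.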

However, I am finding that occasionally I will write out different bitcode for the same input, which causes gratuitous recompilation. If I run llvm-dis on “module.bc” and “module.bc.new” in these cases, the output is identical, as expected.

Is it expected that the actual bitcode may change from run to run, perhaps as a result of ASLR?

Is there a better way for me to check that a Module* structure just built is (not) identical to that from a previous run?

Is it expected that the actual bitcode may change from run to run, perhaps
as a result of ASLR?

If the input is exactly the same, LLVM should generate the same bitcode
from run to run. That said, we don't have very good testing infrastructure
for this, so it's possible you're tripping over a bug.

Note that llvm-dis by default doesn't print out all the information in a
.bc file; try passing "-preserve-ll-uselistorder=true".

Is there a better way for me to check that a Module* structure just built
is (not) identical to that from a previous run?

I don't think so. It's not really a common operation. (Compilers which
support incremental compilation generally use some other mechanism.)

-Eli

Is it expected that the actual bitcode may change from run to run, perhaps as a result of ASLR?

No. For instance, clang is not expected to generate different bitcode for the same input.
I assume you’re using your own frontend to generate the IR? It may not be deterministic when creating it.
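The thread does not identify the actual cause. One common way a frontend becomes nondeterministic, offered here purely as an assumption, is by walking a container whose order depends on memory addresses, which can change from run to run. The Python sketch below shows the same effect with a hypothetical `Decl` class: plain Python objects are hashed by identity, so the iteration order of a set of them is not tied to the input.

```python
class Decl:
    """Hypothetical frontend declaration; hashed by identity (its address)."""

    def __init__(self, name: str) -> None:
        self.name = name


def emission_order_unstable(decls: set) -> list:
    # Iteration order of a set of identity-hashed objects depends on where
    # the objects happen to live in memory, so it may differ between runs.
    return [d.name for d in decls]


def emission_order_stable(decls: set) -> list:
    # Sorting by a key derived from the input makes the order reproducible.
    return [d.name for d in sorted(decls, key=lambda d: d.name)]
```

If the order in which a frontend makes its IR-building calls comes from something like the first function, fixing it usually means switching to an order derived from the source, as in the second.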

Diffing the output of "llvm-bcanalyzer -dump" may help.
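A small script along these lines could automate that comparison. It is a sketch that assumes `llvm-bcanalyzer` is on the PATH; the helper names `bcanalyzer_dump` and `diff_dumps` are made up for illustration.

```python
import difflib
import subprocess


def bcanalyzer_dump(path: str) -> str:
    """Run `llvm-bcanalyzer -dump` on a bitcode file and return its output."""
    result = subprocess.run(
        ["llvm-bcanalyzer", "-dump", path],
        check=True,
        capture_output=True,
        text=True,
    )
    return result.stdout


def diff_dumps(old: str, new: str) -> list:
    """Return unified-diff lines between two dump texts (empty if equal)."""
    return list(
        difflib.unified_diff(
            old.splitlines(),
            new.splitlines(),
            fromfile="module.bc",
            tofile="module.bc.new",
            lineterm="",
        )
    )
```

Usage would be something like `diff_dumps(bcanalyzer_dump("module.bc"), bcanalyzer_dump("module.bc.new"))`, printing the returned lines to see which records differ.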

Is there a better way for me to check that a Module* structure just built is (not) identical to that from a previous run?

You may check what we do with ThinLTO (lib/LTO/ThinLTOCodeGenerator) to perform incremental LTO, i.e. hashing the module content and checking whether an entry for that hash already exists on disk. This may or may not fit into your flow better than scripting, for instance.
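The general idea of hashing the content and looking it up on disk can be sketched as below. This is not the ThinLTOCodeGenerator code and does not reproduce how it computes its key; `object_from_cache`, the cache layout, the choice of SHA-256, and the `generate_object` callback are all assumptions made for the example.

```python
import hashlib
import os
import shutil
from typing import Callable


def object_from_cache(
    bitcode: bytes,
    cache_dir: str,
    out_path: str,
    generate_object: Callable[[bytes, str], None],  # hypothetical codegen step
) -> bool:
    """Produce out_path; return True on a cache hit, False if codegen ran."""
    key = hashlib.sha256(bitcode).hexdigest()
    cached = os.path.join(cache_dir, key + ".o")

    if os.path.exists(cached):
        # Same content was compiled before: reuse the stored object file.
        shutil.copyfile(cached, out_path)
        return True

    # Cache miss: run the expensive step, then remember its result.
    generate_object(bitcode, out_path)
    os.makedirs(cache_dir, exist_ok=True)
    shutil.copyfile(out_path, cached)
    return False
```

Because the key here is computed from the serialized bytes, it only helps once the bitcode itself is written deterministically; anything else that affects the output, such as optimization options, would also need to go into the key.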