Can LLVM emit machine code faster with no optimization passes?

Hello,

Recently Jonathan Blow posted a short screencast discussing the build time of his compiler when no optimizations are run on the user’s code.

Part 1: https://www.youtube.com/watch?v=HLk4eiGUic8
Part 2: https://www.youtube.com/watch?v=mIjGYbol0O4

He discusses which parts take the longest to compile, and ultimately shows this:
http://i.imgur.com/BkbKcJK.png
…which shows that emitting LLVM IR in memory and calling LLVMTargetMachineEmitToFile are the bottleneck in his compiler toolchain.

In fact, it was significantly faster to emit C to disk and compile with MSVC than to emit LLVM in memory and call LLVMTargetMachineEmitToFile.

His conclusion is that he will not depend on LLVM when his users compile with optimizations off, instead directly emitting x86_64 machine code into an object file.

Needless to say, this is duplicate effort. Is there any way the LLVM project could speed things up when no optimizations are run?

As another compiler author (http://ziglang.org/), I want to compete with Jon’s speed without having to duplicate the effort that LLVM already solves.

Thanks for your time.
Andrew

Hi,

This is hard to answer in the abstract. There are multiple knobs that impact compile time (optimization level, using fast-isel, etc.).
Ideally we’d have an example of the C output and the IR output so that we can reproduce offline and profile what happens.

I don’t think this can be discussed at this level of generality, without knowing how llvm is invoked or what sort of optimisations msvc was running. This would be best investigated with a concrete .ll file (llvm is a nice infrastructure in that it makes it easy to share .bc/.ll files capturing intermediate states of the compiler), which you can then push through clang, opt, or llc to see whether it matches the speed of msvc. But just as food for thought: what if msvc did some minimal optimisations, found out that half the source code is unreachable, and removed it, while llvm with no optimisations just compiles everything?
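To make that hypothetical concrete, here is a minimal .ll sketch (invented for illustration, not taken from any real workload, and written in current opaque-pointer syntax): at -O0, llc still runs instruction selection over @dead_helper, while even a single cheap pass such as globaldce would delete it before codegen ever sees it.

```llvm
; sketch.ll -- hypothetical example
; An internal function that nothing calls: llc -O0 still selects
; instructions for it, but `opt -passes=globaldce` removes it first.
define internal i32 @dead_helper(i32 %x) {
entry:
  %y = mul i32 %x, %x
  ret i32 %y
}

define i32 @main() {
entry:
  ret i32 0
}
```

Multiply that pattern across half a program and the two toolchains are no longer compiling the same amount of code.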

  • Matthias

Right, that's a possibility. It also depends if the (code) workload is
prone to heavy DCE, which could be the case here, but it is not
necessarily true for all cases.

Still, invoking a separate program on Windows, which has costly process
creation, plus all the parsing and building of internal representations,
should pretty much even the odds.

What I took from this video is that LLVM's JIT is still too slow for
some usages, or that we're not good at communicating how to make the JIT
faster.

It also goes back to the discussion that O1 destroys too much of the
debug information to be useful, while O0 is too dumb to allow a smart
debugging experience.

Maybe that's why people use LLVM's JIT at O0?

But I'm certainly not a JIT expert, so these are just vague ramblings
not backed by any facts... :slight_smile:

cheers,
--renato

llvm is actually extremely slow when it has to remove lots of dead code. I experienced that in the beginning when working on our llvm backend. I had some bugs in our code generator that caused about half of the llvm IR code to be dead, and compiling that code with -O1 made llvm extremely slow.

Another thing that makes llvm incredibly slow is loading/storing large aggregates directly (I know, now, that you're not supposed to do that). I guess it's the generation of the resulting spill code that takes forever. See e.g. llvm array values - Pastebin.com
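For illustration, here is a sketch of the anti-pattern (the types, sizes, and names are made up, and the syntax is current opaque-pointer IR; the pastebin above has the real case). The first form forces the backend to expand the copy into a long sequence of scalar loads/stores, while the intrinsic form stays a single call that can be lowered cheaply:

```llvm
; Slow: a first-class load/store of a large aggregate, which the
; backend expands element by element (with the associated spills).
define void @copy_slow(ptr %dst, ptr %src) {
  %tmp = load [1024 x i32], ptr %src
  store [1024 x i32] %tmp, ptr %dst
  ret void
}

; Fast: express the same copy through the memcpy intrinsic, which
; is lowered to a library call or a short inline copy sequence.
declare void @llvm.memcpy.p0.p0.i64(ptr, ptr, i64, i1)

define void @copy_fast(ptr %dst, ptr %src) {
  call void @llvm.memcpy.p0.p0.i64(ptr %dst, ptr %src, i64 4096, i1 false)
  ret void
}
```

This is the pattern clang itself avoids by emitting memcpy for large copies, which is why frontends that build IR directly tend to hit it first.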

All that said: we will also keep our original code generators in our compiler, and keep llvm as an option for extra optimisation. In terms of speed, our code generators are much less complex and hence much faster than llvm's. We don't have instruction selection, but directly generate assembler via virtual methods of our parse tree node classes. That would be very hard to beat, even if things have gotten slower lately due to the addition of extra abstraction layers to support generating JVM bytecode and, yes, LLVM IR :slight_smile:

There are also a few other reasons, but they're not relevant to this thread. (*)

Jonas

(*) We support several platforms that LLVM no longer supports and/or will probably never support (OS/2, 16 and 32 bit MS-DOS, Gameboy Advance, Amiga, Darwin/PowerPC), and the preference of some code generator/optimisation developers to write Pascal rather than C++ (our compiler is a self-hosted Pascal compiler)

It is clear that some passes in LLVM are non-linear with respect to the size of the function, particularly the size of the generated machine code, and badly generated IR can easily trigger this with very few, relatively simple IR instructions: in particular, loads/stores of large data structures, where LLVM doesn’t “understand” that the pattern is unsuitable and should perhaps be converted to memcpy() [or a loop, or whatever]. In fact, I think the size of the basic block matters more than the size of the function, but I’ve not investigated the exact details. From memory and my relatively small investigation, the cost is in “selecting the right instruction(s)”, which becomes a “for each instruction, check all other instructions being generated” process, i.e. O(N^2) in complexity.

I suggested, but never completed, a pass to translate “large load to memcpy”, probably as a separate pass rather than part of the current “memcpy optimisation” pass, which does the opposite: it takes small calls to memcpy and translates them into the relevant load and store operations. Maybe, in one of those months full of Sundays, I’ll get around to it…

Of course, without actually knowing what the original code and/or generated IR is, it’s hard to say whether the problem Jonas and I have seen is actually the problem the original post is about. It is certainly a PLAUSIBLE scenario, but perhaps not the only one.

> But just as food for thought: what if msvc did some minimal
> optimisations, found out that half the source code is unreachable,
> and removed it, while llvm with no optimisations just compiles everything?

> llvm is actually extremely slow when it has to remove lots of dead code. I experienced that in the beginning when working on our llvm backend. I had some bugs in our code generator that caused about half of the llvm IR code to be dead, and compiling that code with -O1 made llvm extremely slow.

When you encounter such a problem, I encourage you to file a bug and give people the opportunity to analyze and, where possible, fix the underlying issue. Here I don’t know what you mean by “extremely slow” or “incredibly slow”, nor your basis of comparison.

Thanks
Gerolf

If I remember correctly, the reply on the LLVM chat when I first mentioned
this was "but don't do that; clang has code to avoid that, so you should do
the same" (in other words, LLVM is not supposed to do well in the example
given, because you are supposed to use memcpy or similar to copy larger
data structures). I think this is a bit "unfriendly" to compiler writers,
but I see the point in some way.

I'm not sure whether there are other cases where this symptom is obvious.